Friday 19 September 2008

Comparing two files, one on a hard drive, and one from a web site

In my quest to create an effective and lean search engine service for the company I work for, I used to crawl all of our websites, download every file from them, and write them to a local cache. When the cache was detected as updated, my indexing system (based on Lucene.Net) picked up the changes and reindexed those sites.

As you can well imagine, with over 800 sites this is an awfully large amount of data to write to the cache, and an even bigger job for the indexer to reindex all these files!

So, what I wanted was a mechanism whereby I could compare the local copy with a copy of the file streamed from the server into local memory, and see whether there were any changes.

To do this, I immediately thought of looking for something like a checksum check, but a colleague recommended doing an MD5 hash on them, using a tool like CipherLite by Obivex. Looking into this, I found an easier way: using a web response to download into a memory stream, opening the local file using a file stream, and performing a hash on both to see whether the contents were the same:

// Fragment: response, writer, bParse and strResponse are declared elsewhere in the crawler.
// get a memory stream to hold the data that is downloaded
MemoryStream msFile = new MemoryStream();
writer = new BinaryWriter(msFile);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
// loop to receive the response into the buffer
while((nBytes = response.socket.Receive(RecvBuffer, 0, RecvBuffer.Length, SocketFlags.None)) > 0)
{
    // increment total received bytes
    nTotalBytes += nBytes;
    // write the received bytes to the memory stream
    writer.Write(RecvBuffer, 0, nBytes);
    // if the URI type is not binary, keep the text so it can be parsed for refs
    if(bParse)
        // add received buffer to response string
        strResponse += Encoding.ASCII.GetString(RecvBuffer, 0, nBytes);
    // on a Keep-Alive connection, break out of the loop once the response is complete
    if(response.KeepAlive && nTotalBytes >= response.ContentLength && response.ContentLength > 0)
        break;
}
// bContinue == true means "write the downloaded copy to the cache"
bool bContinue = false;
FileStream fStream = null;
try
{
    // check to see if the file exists on the local file system
    if(File.Exists(PathName))
    {
        // open the file, and read in the stream
        fStream = File.Open(PathName, FileMode.Open, FileAccess.Read, FileShare.Read);

        // rewind the download, otherwise the hash is computed from the end of the stream
        msFile.Position = 0;
        // compare the two streams (see later); only rewrite the cache when they differ
        bContinue = !compareFiles(msFile, fStream);
    }
    else
    {
        // file doesn't exist, download anyway
        bContinue = true;
    }
}
catch(Exception ex)
{
    // if the comparison fails, play safe and rewrite the cached copy
    bContinue = true;
    LogError(ex.Message, "");
}
finally
{
    // release the local file before it is overwritten below
    if(fStream != null)
        fStream.Close();
    fStream = null;
}
if(bContinue)
{
    // create a stream to create the new file
    streamOut = File.Open(PathName, FileMode.Create, FileAccess.Write, FileShare.ReadWrite);
    // create the new copy (WriteTo writes the whole stream, whatever its position)
    msFile.WriteTo(streamOut);
    // close up
    streamOut.Close();
}
msFile.Close();


So, on to how the streams are compared. I adapted my solution from one I found via a Google search on hashing:

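// Returns true when both streams have the same hash, i.e. the contents match.
// Needs: using System.Security.Cryptography;
bool compareFiles(MemoryStream file1, FileStream file2)
{
    // MD5 is named explicitly here, as discussed above. The parameterless
    // HashAlgorithm.Create() picks SHA-1 on .NET Framework and is not
    // supported on newer runtimes.
    using (HashAlgorithm hashAlg = MD5.Create())
    {
        // ComputeHash reads from the current position, so rewind the
        // downloaded data before hashing it.
        file1.Position = 0;
        // Calculate the hash for the files.
        byte[] hashBytesA = hashAlg.ComputeHash(file1);
        byte[] hashBytesB = hashAlg.ComputeHash(file2);
        // Compare the hashes: true if they are the same, false if they differ.
        return BitConverter.ToString(hashBytesA) == BitConverter.ToString(hashBytesB);
    }
}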
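
For anyone who wants to try the comparison on its own, here is a minimal self-contained sketch of the same idea. The names StreamCompare and SameContent are made up for this illustration and are not part of the crawler. It assumes MD5 is good enough for change detection (not for security), and it compares the hash bytes directly instead of converting them to strings first.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class StreamCompare
{
    // Hypothetical helper (not from the crawler): returns true when both
    // streams produce the same MD5 hash.
    public static bool SameContent(Stream a, Stream b)
    {
        // ComputeHash reads from the current position to the end, so
        // rewind any seekable stream before hashing it.
        if (a.CanSeek) a.Position = 0;
        if (b.CanSeek) b.Position = 0;

        using (MD5 md5 = MD5.Create())
        {
            byte[] hashA = md5.ComputeHash(a);
            byte[] hashB = md5.ComputeHash(b);
            // Compare the two hashes byte by byte.
            return hashA.SequenceEqual(hashB);
        }
    }
}
```

Calling StreamCompare.SameContent(msFile, fStream) would then stand in for compareFiles above. Taking plain Stream parameters means the same helper should work for any pair of streams, not just a MemoryStream and a FileStream.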

Hope this helps.

2 comments:

  1. I have no idea what any of that means. Is it anything to do with good beer? :-)
    Gor-Gor

  2. Sort of ;-). After a few "good beers", sometimes it becomes clearer.
