I have been using a C# web crawler written by Hatem Mostafa, which is available on CodeProject and which I have found extremely useful.
I have altered this now to work with our database of websites, and hey-presto, I can now crawl all our websites as if I were a user in the external world.
However, the limitation is this: it is all well and good crawling thousands of pages, but downloading them is costly. So how can you overcome this? Use the HTTP entity tag (the ETag header), an identifier the server assigns to a specific version of a file, often a hash of its contents. Check the ETag against a record of the files already downloaded, and only proceed to download if the ETag has changed. Simple really, and it reduces a lot of external traffic on the server as well.
Of course, this presumes that the web server is configured to send ETag headers, and that the pages aren't so dynamic that the ETag becomes irrelevant or is omitted!
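The check described above can be sketched roughly as follows. This is an illustration in Python rather than the crawler's own C#, and it is not taken from Hatem Mostafa's code. The names `head_etag`, `needs_download`, `crawl_once` and the `seen_etags` dictionary are all hypothetical, and a real crawler would presumably keep the record in the database rather than in memory.

```python
import urllib.request


def head_etag(url, timeout=10.0):
    """Ask the server for headers only and return its ETag, or None if absent.

    Assumption: the server answers HEAD requests; urlopen raises on HTTP errors.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag")


def needs_download(url, current_etag, seen_etags):
    """Return True if the page should be fetched again.

    seen_etags maps each URL to the ETag recorded at its last download.
    """
    if current_etag is None:
        # No ETag sent: we cannot tell whether the page changed, so download.
        return True
    return seen_etags.get(url) != current_etag


def crawl_once(urls, seen_etags, fetch_etag=head_etag):
    """Return the URLs whose ETag is new, changed, or missing.

    fetch_etag is injectable so the logic can be tried without a network.
    For simplicity the record is updated here; a real crawler would update
    it only after the download itself succeeds.
    """
    to_download = []
    for url in urls:
        etag = fetch_etag(url)
        if needs_download(url, etag, seen_etags):
            to_download.append(url)
            if etag is not None:
                seen_etags[url] = etag
    return to_download
```

An alternative that saves a round trip is to send the stored ETag in an `If-None-Match` request header on a normal GET; a server that supports it replies `304 Not Modified` with no body when the page is unchanged.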