Simple picture – complications
- Web crawling isn’t feasible with one machine
  - All of the above steps must be distributed (see the partitioning sketch after this list)
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters’ stipulations (e.g., robots.txt; sketch below)
    - How “deep” should you crawl a site’s URL hierarchy?
  - Site mirrors and duplicate pages (fingerprint sketch below)
- Malicious pages
  - Spam pages
  - Spider traps – including dynamically generated ones (per-host budget sketch below)
- Politeness – don’t hit a server too often (delay sketch below)
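The slide only says the steps must be distributed. A minimal sketch of the usual scheme follows: hash each URL's host to a crawler node, so every URL of one site lands on the same machine and per-host politeness state needs no cross-machine coordination. The cluster size `NUM_CRAWLERS` and the function name `assign_crawler` are illustrative assumptions, not part of the lecture.

```python
from urllib.parse import urlparse
import hashlib

NUM_CRAWLERS = 8  # assumed cluster size

def assign_crawler(url: str, num_crawlers: int = NUM_CRAWLERS) -> int:
    """Map a URL to the crawler node responsible for its host.

    Hashing the host (not the full URL) keeps all URLs of one site on
    one node, so per-host state (politeness timers, robots.txt) stays
    local to that node.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# Both URLs share a host, so they land on the same node.
print(assign_crawler("http://example.com/a"))
print(assign_crawler("http://example.com/b/c"))
```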
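For webmasters’ stipulations, Python’s standard library already parses robots.txt. The sketch below combines that with a depth cut-off for the “how deep” question; the limit `MAX_DEPTH`, the user agent string, and the helper names are assumptions for illustration (a real crawler would also cache one parser per host rather than re-fetching robots.txt).

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MAX_DEPTH = 5  # assumed cut-off for URL hierarchy depth

def url_depth(url: str) -> int:
    """Depth = number of non-empty path segments, e.g. /a/b/c -> 3."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def allowed_to_fetch(url: str, agent: str = "ExampleCrawler") -> bool:
    """Honor robots.txt and skip URLs deeper than MAX_DEPTH."""
    if url_depth(url) > MAX_DEPTH:
        return False
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches robots.txt over the network
    return rp.can_fetch(agent, url)
```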
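For site mirrors and duplicate pages, a minimal exact-duplicate filter is a content fingerprint kept in a seen-set, as sketched below. This only catches byte-identical copies; detecting near-duplicate mirrors needs techniques such as shingling or simhash, which this sketch does not implement.

```python
import hashlib

seen_fingerprints: set[str] = set()

def is_duplicate(html: str) -> bool:
    """Exact-duplicate test via a content hash.

    Byte-identical mirror pages collapse to one fingerprint, so the
    crawler indexes the content only once.
    """
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```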
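Dynamically generated spider traps (for example, a calendar page that always links to the “next day”) can emit unbounded URLs that no blacklist anticipates. A common defense, sketched below under an assumed budget `MAX_PAGES_PER_HOST`, is to cap how many URLs the frontier accepts per host, bounding the damage even when the trap itself is never recognized.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_PAGES_PER_HOST = 10_000  # assumed per-host crawl budget
pages_per_host: Counter = Counter()

def trap_guard(url: str) -> bool:
    """Return True if the URL may be enqueued.

    A trap yields unbounded URLs on one host; a per-host budget keeps
    the crawler from spending its whole frontier there.
    """
    host = urlparse(url).netloc.lower()
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_per_host[host] += 1
    return True
```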
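Politeness is typically enforced by remembering when each host was last contacted and waiting out a minimum gap before the next request. The interval `MIN_DELAY_SECONDS` and the function name below are assumptions; production crawlers often derive the gap from the host’s observed response time or a Crawl-delay directive instead of a fixed constant.

```python
import time
from urllib.parse import urlparse

MIN_DELAY_SECONDS = 2.0  # assumed per-host politeness interval
last_fetch: dict[str, float] = {}

def wait_for_turn(url: str) -> None:
    """Sleep until MIN_DELAY_SECONDS have passed since this host's last hit."""
    host = urlparse(url).netloc.lower()
    now = time.monotonic()
    earliest = last_fetch.get(host, 0.0) + MIN_DELAY_SECONDS
    if now < earliest:
        time.sleep(earliest - now)
    last_fetch[host] = time.monotonic()
```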