The primary operations of a web link crawler are:
1. Retrieve the seed web page.
2. Extract all valid URLs (links) from that page.
3. Visit every link extracted in step 2, repeating the extraction on each page visited.
4. Stop when the maximum depth has been reached.
The following Python script is a very simple web crawler. It can be configured to use a different seed URL and a different maximum depth.
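As an illustration of the four steps listed earlier, here is a minimal sketch that uses only the Python standard library. It is an assumption about how such a crawler could be written, not a reproduction of the original script. The names LinkParser, extract_links, fetch, crawl, and the fetch_page parameter are hypothetical and chosen for this example. "Valid" links are assumed to mean absolute http or https URLs, and the crawl is assumed to be breadth-first.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Step 2: return the absolute http(s) URLs found in html, in page order."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).scheme in ("http", "https") and url not in links:
            links.append(url)
    return links


def fetch(url):
    """Step 1: download a page and return its text ('' if the request fails)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(seed_url, max_depth, fetch_page=fetch):
    """Steps 3 and 4: breadth-first crawl that stops at max_depth.

    The seed page is depth 0. Returns the visited URLs in visit order.
    """
    visited = []
    seen = {seed_url}
    frontier = [seed_url]
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            html = fetch_page(url)
            visited.append(url)
            if depth < max_depth:  # pages at the last level are not expanded
                for link in extract_links(url, html):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
        frontier = next_frontier
    return visited
```

With this sketch, crawl("http://www.cnn.com", max_depth=1) would fetch the seed page and then every distinct link found on it. The seen set keeps the crawler from visiting the same URL twice, and passing a different fetch_page function makes it possible to try the crawler without network access.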
A partial output of the above script with http://www.cnn.com as the seed URL is given below: