Sunday 4 May 2014

An Extremely Simple and Effective Web Crawler in Python

Web crawlers, also known as web spiders, retrieve web pages by following links starting from a seed (initial) page. Crawlers are widely used in building search engines, and the retrieved links and pages have numerous other applications.

The primary operations of a web crawler are:

1. Retrieve the seed web page.
2. Extract all valid URLs/links from it.
3. Visit every link extracted in step 2 and repeat steps 2-3 for each page.
4. Stop once the maximum crawl depth is reached.

The following Python script is an extremely simple and effective web crawler! It can be configured with a different seed URL and a different maximum crawl depth.
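A minimal sketch along these lines, assuming Python 3 and only standard-library modules (urllib and html.parser; the exact modules used in the original script are an assumption), might look like this:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collects the href value of every <a> tag fed to the parser.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, depth, max_depth, visited):
    # Step 4: stop when the maximum depth is reached (or on a repeat visit).
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    print("  " * depth + url)
    # Step 1: retrieve the page.
    try:
        page = urllib.request.urlopen(url, timeout=5)
        html = page.read().decode("utf-8", errors="ignore")
    except Exception:
        return  # skip pages that cannot be fetched or decoded
    # Step 2: extract all links from the page.
    parser = LinkExtractor()
    parser.feed(html)
    # Step 3: visit every extracted link, one level deeper.
    for link in parser.links:
        absolute = urljoin(url, link)  # resolve relative links
        if absolute.startswith("http"):
            crawl(absolute, depth + 1, max_depth, visited)

crawl("http://www.cnn.com", depth=0, max_depth=2, visited=set())

The crawl here is depth-first: each URL is printed with indentation proportional to its depth, and the visited set keeps the crawler from fetching the same page twice. The max_depth value of 2 is just a placeholder; any depth can be used.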

A partial output of the above script, with http://www.cnn.com as the seed URL, is given below:
