There are many situations in which we want to compare HTML templates programmatically. One of the simplest checks, useful for ruling out similarity between two HTML pages, is to count the HTML tags in each page and compare the counts.
The following Python script gets the HTML tag count of a web page, given its URL:
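A minimal sketch of such a tag counter, assuming Python 3 and only the standard library, could look like the block below. The names `TagCounter`, `count_tags` and `count_tags_at` are hypothetical and chosen for illustration; the parser counts opening and self-closing tags, which is one reasonable reading of "tag count".

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCounter(HTMLParser):
    """Counts every opening (and self-closing) tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names. Self-closing tags such as <br/>
        # also end up here, because handle_startendtag() delegates to this
        # method by default.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def count_tags_at(url):
    """Download the page at `url` and count its tags (needs network access)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_tags(html)
```

Calling `count_tags_at(url)` for two pages and comparing the returned counters (for example with `==`) gives a quick first check: if the counts differ widely, the pages are unlikely to share a template.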
A sample output of the above script is given below:
Monday, 5 May 2014
Sunday, 4 May 2014
An Extremely Simple and Effective Web Crawler in Python
Web crawlers, also known as web spiders, retrieve web links/pages by following links, starting from a seed (initial) web page. Crawlers are widely used in building search engines, and the retrieved links/pages have numerous other applications.
The primary operations of a web link crawler are:
1. Retrieve the seed web page.
2. Extract all valid URLs/links from it.
3. Visit every link extracted in step 2, repeating steps 2 and 3 for each page visited.
4. Stop when the maximum depth has been reached.
The following Python script is an extremely simple and effective web crawler! It can be configured with different seed URLs and different crawl depths.
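A minimal sketch of a crawler along these lines, assuming Python 3, the standard `urllib` modules and a deliberately simple regular expression for link extraction, is shown below. The names `HREF_RE`, `extract_links`, `fetch` and `crawl` are hypothetical, and the `fetch_page` parameter is an assumption added so the crawl logic can be exercised without network access. It does not honour robots.txt or rate limits, which a real crawler should.

```python
import re
from http.client import HTTPException
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Simplistic pattern: the quoted value of any href attribute.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) URLs found in `html`, in order, no duplicates."""
    links = []
    for href in HREF_RE.findall(html):
        # Resolve relative links and drop any #fragment part.
        url, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links


def fetch(url):
    """Return the page body as text, or '' if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, HTTPException):
        return ""


def crawl(seed_url, max_depth=2, fetch_page=fetch):
    """Return every URL found within `max_depth` link hops of the seed."""
    found = []            # URLs in the order they were discovered
    seen = {seed_url}     # guards against visiting a URL twice
    frontier = [seed_url]
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            found.append(url)
            if depth == max_depth:
                continue  # maximum depth reached: record, but do not follow
            for link in extract_links(fetch_page(url), url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return found
```

With network access, `crawl("http://www.cnn.com", max_depth=1)` would return the seed URL followed by the links found on that page. The crawl proceeds level by level, so the depth limit maps directly onto the stop condition in step 4.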
A partial output of the above script with http://www.cnn.com as the seed URL is given below:
Labels: crawler, easy, efficient, extract urls, http, https, python, regex, seed url, simple, spider, urllib, web, web crawler, web parser, web spider