Remo Talks.....: May 2014

Monday, 5 May 2014

Compare HTML Pages: HTML Tags Counter

There are many situations in which we want to compare HTML templates programmatically. One of the simplest checks, and one that can quickly rule out similarity between two HTML pages, is to count the HTML tags in each page and compare the counts.
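As a rough sketch of this idea, the counting can be done with the HTMLParser class from Python's standard library. This assumes Python 3; the names TagCounter, count_tags, count_tags_at and same_tag_profile are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCounter(HTMLParser):
    """Hypothetical helper: tallies every opening tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags such as <br/>
        # also reach this method through the default handle_startendtag.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def count_tags_at(url):
    """Download a page and count its tags (assumes the URL is reachable)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_tags(html)


def same_tag_profile(html_a, html_b):
    """True when both documents contain exactly the same tag counts."""
    return count_tags(html_a) == count_tags(html_b)
```

Differing counts are a cheap signal that two pages do not share a template. Equal counts do not prove similarity, so this works best as a first filter before a more detailed comparison.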

The following Python script returns the HTML tag count of a web page, given its URL:

A sample output of the above script is given below:

Sunday, 4 May 2014

An Extremely Simple and Effective Web Crawler in Python

Web crawlers, also known as web spiders, retrieve web links and pages by following links outward from a seed (initial) web page. Crawlers are widely used in building search engines, and the retrieved links and pages have many other applications.

The primary operations of a web link crawler are:

1. Retrieve the seed web page.

2. Extract all valid URLs/links from it.

3. Visit every link extracted in step 2 and extract links from those pages in turn.

4. Stop when the maximum depth has been reached.
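The steps above can be sketched as a breadth-first crawl using only the Python 3 standard library. This is a minimal illustration under stated assumptions. The names LinkExtractor, extract_links, fetch and crawl are hypothetical, "valid" is assumed to mean an absolute http or https URL, and the fetch_page parameter is added here so that the crawl logic can be tried without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute http(s) links from <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Step 2: return the valid links found in one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Step 1: download a page as text (assumes the URL is reachable)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(seed_url, max_depth=2, fetch_page=fetch):
    """Visit pages level by level; the seed is depth 0. Returns URLs attempted."""
    visited = set()
    frontier = [seed_url]
    for _depth in range(max_depth + 1):  # step 4: stop after max_depth levels
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            visited.add(url)
            try:
                html = fetch_page(url)
            except Exception:
                continue  # unreachable or unreadable page: skip it
            next_frontier.extend(extract_links(html, url))  # step 3
        frontier = next_frontier
    return visited
```

Calling crawl with a seed URL and a small max_depth returns the set of URLs the crawl attempted. A crawler meant for real sites would also need to respect robots.txt and rate limits, which this sketch leaves out.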

The following Python script is an extremely simple and effective web crawler! It can be configured with a different seed URL and a different maximum depth.

A partial output of the above script with http://www.cnn.com as the seed URL is given below: