Monday, 5 May 2014

Compare HTML Pages: HTML Tags Counter

There are many instances in which we would want to compare HTML templates programmatically. One of the simplest signals that can be used to rule out the similarity of two HTML pages is the number of HTML tags on each page: count the tags on both pages and compare the counts.
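As a sketch of that comparison step, once per-page tag counts are available (e.g. from the script below), the two count dictionaries can be compared with a cosine similarity. This is just one reasonable choice of metric, not part of the original script:

```python
from math import sqrt

def tag_count_similarity(a, b):
    """Cosine similarity between two tag-count dictionaries.

    Returns 1.0 for identical tag distributions and 0.0 when the
    pages share no tags at all.
    """
    tags = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in tags)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Pages with wildly different similarity scores can then be ruled out as instances of the same template without any deeper comparison.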

The following Python script prints the tag counts of an HTML web page, given its URL:

import lxml.html
import urllib2

def proc_root(root, tag_count):
    # count this element's tag, then recurse into its children
    tag_count[root.tag] = tag_count.get(root.tag, 0) + 1
    for child in root:
        proc_root(child, tag_count)
    return tag_count

def get_tag_count(url):
    # fetch the page and return a dict mapping tag name to count
    tag_count = {}
    res = urllib2.urlopen(url).read()
    root = lxml.html.fromstring(res)
    proc_root(root, tag_count)
    return tag_count

def main():
    url = 'http://www.google.com'
    tag_count = get_tag_count(url)
    for tag, count in tag_count.items():
        print '%s\t%s' % (tag, count)

if __name__ == "__main__":
    main()

A sample output of the above script is given below:

meta 1
table 1
font 1
style 2
span 8
script 5
tr 1
html 1
input 7
td 3
body 1
head 1
form 1
nobr 2
br 6
a 20
b 1
center 1
textarea 1
title 1
p 1
u 1
div 17
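The script above targets Python 2 (`urllib2`, `print` statement) and needs `lxml` installed. For reference, a rough Python 3 equivalent of the counting step can be written with only the standard library's `html.parser`; fetching the page would then be done with `urllib.request.urlopen`. This is a suggested alternative, not the original script:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts every opening tag (self-closing tags included)."""
    def __init__(self):
        super().__init__()
        self.tag_count = Counter()

    def handle_starttag(self, tag, attrs):
        self.tag_count[tag] += 1

def get_tag_count_py3(html_text):
    """Return a dict mapping tag name to occurrence count."""
    parser = TagCounter()
    parser.feed(html_text)
    return dict(parser.tag_count)
```

Unlike the lxml version, this parser never builds a tree; it simply streams through the markup and increments a counter per start tag.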

Sunday, 4 May 2014

An Extremely Simple and Effective Web Crawler in Python

Web crawlers, also known as web spiders, retrieve web links/pages by following links starting from a seed (initial) web page. Crawlers are widely used in building search engines, and the retrieved links/pages have numerous other applications.

The primary operations of a web link crawler are:

1. Retrieve the seed web page.
2. Extract all valid URLs/links from it.
3. Visit every link extracted in step 2, repeating steps 1–3 for each.
4. Stop once the maximum depth has been reached.
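The steps above can be sketched as a breadth-first loop. This is a toy illustration, separate from the full script below; the `fetch` callable is injected (any function mapping a URL to its HTML text) so the sketch runs without network access:

```python
import re

def crawl_bfs(seed, fetch, max_depth=1):
    """Breadth-first crawl up to max_depth levels from the seed."""
    visited = []
    frontier = [seed]
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            try:
                html = fetch(url)            # step 1: retrieve the page
            except Exception:
                continue                     # unreachable page: skip it
            visited.append(url)
            # step 2: extract href links from the page
            for link in re.findall(r'''href=["']([^"']+)["']''', html):
                if link not in visited and link not in next_frontier:
                    next_frontier.append(link)  # step 3: queue for next level
        frontier = next_frontier             # step 4: loop bound enforces max depth
    return visited
```

Breadth-first order visits all links at depth *n* before any at depth *n*+1, whereas the recursive script below follows links as it discovers them.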

The following Python script is an extremely simple and effective web crawler! It can be configured with a different seed URL and a different maximum depth.

import re, urllib
from urlparse import urlparse

depth_max = 1     # maximum depth the crawler goes from the seed website
urls = []         # unique URLs retrieved while crawling
url_visited = []  # URLs of web pages already visited

# clean URL: add protocol and host if absent; remove the query section
def getCleanURL(_cURL, _baseHost):
    try:
        oURL = urlparse(_cURL)
    except ValueError:
        return None
    scheme = oURL.scheme if oURL.scheme != '' else 'http'
    host = oURL.netloc if oURL.netloc != '' else _baseHost
    return scheme + '://' + host + oURL.path

def crawl(_baseURL, fh, _depth=0):
    if _depth > depth_max:         # stop once the maximum depth is exceeded
        return
    elif _baseURL in url_visited:  # web page already visited
        return
    baseHost = urlparse(_baseURL).netloc
    try:
        res = urllib.urlopen(_baseURL).read()
        url_visited.append(_baseURL)
    except IOError:                # error visiting the URL/web page
        return
    res = res.replace('\n', '')
    for url in re.findall('''href=["'](/[^"']+)["']''', res, re.I):
        url = getCleanURL(url, baseHost)
        if url is not None and url not in urls:  # check validity and uniqueness
            urls.append(url)
            fh.write(url + '\n')
    for url in urls:
        crawl(url, fh, _depth + 1)

def main():
    seed_url = 'http://www.cnn.com'
    fh_urls = open('urls.txt', 'w')
    crawl(seed_url, fh_urls)
    fh_urls.close()

if __name__ == "__main__":
    main()
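In Python 3, most of what `getCleanURL` does by hand is covered by `urllib.parse.urljoin`, which additionally resolves relative paths against the base page's URL. A possible modernized version (a suggestion, not the original code):

```python
from urllib.parse import urljoin, urlparse

def clean_url(href, base_url):
    """Resolve href against the base page's URL and drop query/fragment."""
    resolved = urljoin(base_url, href)
    parts = urlparse(resolved)
    return '{0}://{1}{2}'.format(parts.scheme, parts.netloc, parts.path)
```

`urljoin` leaves absolute URLs untouched, prefixes host-relative paths like `/video/` with the base host, and resolves directory-relative paths like `programs/` against the base page's directory.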

A partial output of the above script with http://www.cnn.com as the seed URL is given below:

http://www.cnn.com/tools/search/cnncom.xml
http://www.cnn.com/tools/search/cnncomvideo.xml
http://www.cnn.com/CNN/Programs
http://www.cnn.com/cnn/programs/
http://www.cnn.com/cnni/
http://www.cnn.com/video/
http://www.cnn.com/trends/
http://www.cnn.com/US/
http://www.cnn.com/WORLD/
http://www.cnn.com/POLITICS/
http://www.cnn.com/JUSTICE/
http://www.cnn.com/SHOWBIZ/
http://www.cnn.com/TECH/
http://www.cnn.com/HEALTH/
http://www.cnn.com/LIVING/
http://www.cnn.com/TRAVEL/
http://www.cnn.com/OPINION/
http://www.cnn.com/2014/05/04/us/circus-accident-rhode-island/index.html
http://www.cnn.com/2014/05/03/politics/washington-correspondents-dinner/index.html
http://www.cnn.com/2014/05/04/world/europe/ukraine-crisis/index.html
http://www.cnn.com/2014/05/04/us/clippers-shelly-sterling/index.html
http://www.cnn.com/2014/05/04/us/rocky-top-tennessee/index.html
http://www.cnn.com/2014/05/04/us/condoleeza-rice-rutgers-protest