How to write a very simplistic Web Crawler in Python for fun.
Recently I decided to take on a new project, a Python based web crawler
that I am dubbing Breakdown. Why? I have always been interested in web crawlers
and have written a few in the past, one previously in Python and another before
that as a class project in C++. So what makes this project different?
For starters, I want to store and expose different information about the
web pages it visits. Instead of analyzing web pages and developing a
ranking system that allows people to easily search for pages based on keywords,
I want to store the information that is used to make those decisions and allow
people to use it however they wish.
For example, I want to provide an API that lets people search for specific
web pages. If the page is found in the system, the API will return an easy to use
data structure that contains the page's
keyword histogram, list of links to other pages and more.
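As a rough illustration of what such a record could look like, here is a minimal sketch. The `PageRecord` class, its field names and the `keyword_histogram` helper are all hypothetical; the real API may end up looking quite different.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    """Hypothetical shape of the data returned for one crawled page."""
    url: str
    links: list = field(default_factory=list)
    keywords: Counter = field(default_factory=Counter)


def keyword_histogram(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


record = PageRecord(
    url="http://example.com/",
    links=["http://example.com/about"],
    keywords=keyword_histogram("Web crawlers crawl the web"),
)
print(record.keywords.most_common(1))
```

A `Counter` is convenient here because it already behaves like a histogram: missing words count as zero and `most_common()` gives the top keywords.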
Overview of Web Crawlers
What is a web crawler? We can start with the simplest definition: it is a
program that, starting from a single web page, moves from web page to web page
using only the urls found in each page, beginning with those provided in the
original page. This is how search engines
obtain the content they need for their search sites.
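To make that definition concrete, here is a minimal sketch of the traversal on its own, with no networking. The `FAKE_WEB` dictionary and the example urls are made up for illustration; a real crawler would download each page and extract its links instead.

```python
from collections import deque

# Hypothetical in-memory "web": each url maps to the urls linked from that page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def crawl(start_url, get_links):
    """Visit every url reachable from start_url, each one only once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("http://a.example/", lambda url: FAKE_WEB.get(url, []))
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when two pages link to each other.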
But a web crawler is not just about moving from site to site (even though this
can be fun to watch). Most web crawlers have a higher purpose, like (in the case
of search engines) ranking the relevance of a web page, based on the page's
content and HTML meta data, to make content on the internet easier to search.
Other web crawlers are used for more invasive purposes, like harvesting e-mail
addresses to use for marketing or spam.
So what goes into making a web crawler? A web crawler, again, is not just about
moving from place to place however it feels. Web sites can actually dictate how
web crawlers access the content on their sites and how they should move around
those sites. This information is provided in the robots.txt
file that can be found on most websites
(Wikipedia has one, for example).
A rookie mistake when building a web crawler is to ignore this file. These
robots.txt files are provided as a set of guidelines and rules that web crawlers
must adhere to for a given domain; otherwise you are liable to get your IP and/or
User Agent banned. Robots.txt files tell crawlers which pages or directories to
ignore, or even which ones they should consider.
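Python's standard library can already read these files. The sketch below parses a made-up robots.txt from a string so that it runs without network access; the rules and the example.com urls are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether the "breakdown" User Agent may fetch each url.
allowed = parser.can_fetch("breakdown", "http://example.com/index.html")
blocked = parser.can_fetch("breakdown", "http://example.com/private/data.html")
delay = parser.crawl_delay("breakdown")

print(allowed, blocked, delay)
```

Against a live site you would call `set_url()` with the location of the robots.txt file and then `read()` instead of `parse()`.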
Along with following robots.txt, please be sure to
provide a useful and unique User Agent.
This is so that sites can identify that you are a robot and not a human.
For example, if you see a User Agent of “breakdown” on your website, hi, it’s me.
Do not use known User Agents like
“Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19”;
this is, again, an easy way to get your IP address banned on many sites.
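Setting the header takes one line. This sketch uses only the standard library and builds the request without sending it; the `build_request` helper is hypothetical, and with the requests module the equivalent is passing `headers={"User-Agent": ...}` to `requests.get`.

```python
import urllib.request

# The identifying string mentioned above; many crawlers also add a version
# number and a contact url so site owners can reach the operator.
USER_AGENT = "breakdown"


def build_request(url):
    """Return a Request object that identifies the crawler honestly."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```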
Lastly, it is important to consider adding rate limiting to your crawler. It is
wonderful to be able to crawl websites and move between them very quickly (no one
likes to wait for results), but this is another surefire way of getting your IP
banned by websites. Net admins do not like bots tying up all of their network's
resources, making it difficult for actual users to use their site.
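One simple way to do this is to remember, per host, when the next request is allowed. The `HostRateLimiter` class below is a hypothetical sketch rather than code from Breakdown; the 3 second default is only an assumption, and the clock and sleep functions are parameters so the logic can be tested without actually waiting.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Allow at most one request per `delay` seconds to each host."""

    def __init__(self, delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the url's host may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        now = self.clock()
        pause = max(0.0, self.next_allowed.get(host, now) - now)
        if pause:
            self.sleep(pause)
        self.next_allowed[host] = now + pause + self.delay
        return pause
```

Call `limiter.wait(url)` right before each web request. Note that this sketch is not thread safe; with several Grabber threads the body of `wait` would need a lock around it.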
Prototype of Web Crawler
So this afternoon I decided to take an hour or so and prototype the
code to crawl from page to page, extracting links and storing them in the database.
All this code does at the moment is download the content of a url, parse out all
of the urls, find the new urls that it has not seen before, append them to a queue
for further processing and insert them into the database. This process has
2 queues and 2 different thread types for processing each link.
There are two different types of processes within this module. The first is a
Grabber, which takes a single url from a queue and downloads the text
content of that url using the requests
Python module. It then passes the content along to a queue that the Parser uses
to get new content to process. The Parser takes the content that the Grabber
process has put on that queue and simply parses out all the links
contained within the site's HTML content. It then checks MongoDB to see whether
each url has been retrieved already; if not, it appends the new url to the
queue that the Grabber uses to retrieve new content and also inserts the url
into the database.
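The Parser's job can be sketched with the standard library alone. This is not the Breakdown parsing code: the `LinkParser` class and `new_links` helper are hypothetical, and a plain Python set stands in for the MongoDB lookup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the absolute url of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own url.
                    self.links.append(urljoin(self.base_url, value))


def new_links(base_url, html, seen):
    """Return the links not in `seen` yet, and record them as seen."""
    parser = LinkParser(base_url)
    parser.feed(html)
    fresh = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


seen = {"http://example.com/"}  # stand-in for the MongoDB lookup
page = (
    '<a href="/about">About</a> '
    '<a href="http://example.com/">Home</a> '
    '<a href="/about">About again</a>'
)
fresh = new_links("http://example.com/", page, seen)
print(fresh)
```

Only the first `/about` link comes back: the home page was already known and the second `/about` is a duplicate.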
The useful thing about running multiple threads per process (X for Grabbers and Y
for Parsers), with two different queues to share information between
the two, is that the crawler is self sufficient once it gets started with a
single url. The Grabbers feed the queue that the Parsers work from, and the
Parsers feed the queue that the Grabbers work from.
For now, this is all that my prototype does: it only stores links and crawls from
site to site looking for more links. What I have left to do is expand the
Parser to parse out more information from the HTML, including things like meta
data, page title, keywords, etc., as well as to incorporate
robots.txt into the
processing (to keep from getting banned) and automated rate limiting
(right now I have a 3 second pause between each web request).
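Pulling out the page title and meta tags could look something like the sketch below. The `MetaParser` class is hypothetical and only shows one possible approach using the standard library; it is not part of the prototype.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


html = """<html><head><title>Breakdown</title>
<meta name="keywords" content="python, crawler">
<meta name="description" content="A simple web crawler">
</head><body></body></html>"""

meta_parser = MetaParser()
meta_parser.feed(html)
print(meta_parser.title, meta_parser.meta)
```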
How Did I Do It?
So I assume at this point you want to see some code? The code is not up on
GitHub just yet; I have it hosted on my own private git repo for now and will
gladly open source the code once I have a better prototype.
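Let's take a very quick look at how I am sharing data between the different
threads.

breakdown/Parser.py:

    import re
    import threading

    class Thread(threading.Thread):
        def __init__(self, content_queue, url_queue):
            super().__init__()
            self.c_queue = content_queue
            self.u_queue = url_queue

        def run(self):
            while True:
                data = self.c_queue.get()  # blocks until a Grabber delivers a page
                links = re.findall(r'href="(https?://[^"]+)"', data)  # stand-in for the real parsing
                for link in links:
                    self.u_queue.put(link)

breakdown/Grabber.py:

    import threading
    import requests

    class Thread(threading.Thread):
        def __init__(self, url_queue, content_queue):
            super().__init__()
            self.c_queue = content_queue
            self.u_queue = url_queue

        def run(self):
            while True:
                next_url = self.u_queue.get()  # blocks until a Parser finds a url
                data = requests.get(next_url)
                self.c_queue.put(data.text)

The script that starts the crawler:

    import time
    from queue import Queue  # `from Queue import Queue` on Python 2
    from breakdown import Parser, Grabber

    num_threads = 4
    max_size = 1000
    url_queue = Queue()
    content_queue = Queue(maxsize=max_size)

    parsers = [Parser.Thread(content_queue, url_queue) for i in range(num_threads)]
    grabbers = [Grabber.Thread(url_queue, content_queue) for i in range(num_threads)]
    for thread in parsers + grabbers:
        thread.daemon = True
        thread.start()

    url_queue.put('http://brett.is/')  # starting point for the crawler
    while True:
        time.sleep(1)  # keep the main thread alive; daemon threads exit with it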
Let's talk about this process quickly. The Breakdown code is provided as an
executable script to start the crawler. It creates “num_threads” threads for each
process (Grabber and Parser). It starts each thread and then appends the starting
point for the crawler, http://brett.is/. One of the Grabber threads will then pick
up the single url, make a web request to get the content of that url and append it
to “content_queue”. Then one of the Parser threads will pick up the content
data from “content_queue”, process the web page's HTML by
parsing out all of the links, and append those links onto “url_queue”. This
then gives the other Grabber threads an opportunity to make new web requests
to get more content to pass to the Parser threads. This will continue on and on
until there are no links left (hopefully never).
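The same flow can be tried out without touching the network. The sketch below is a hypothetical, self-contained simulation rather than the Breakdown code: `FAKE_WEB` stands in for real web requests, a set stands in for the database, and the crawl stops once every url has been processed, which the real crawler hopefully never does.

```python
import threading
from queue import Queue

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

url_queue = Queue()
content_queue = Queue(maxsize=1000)
seen = set()
seen_lock = threading.Lock()


def grabber():
    while True:
        url = url_queue.get()
        links = FAKE_WEB.get(url, [])  # stands in for the web request
        content_queue.put(links)


def parser():
    while True:
        links = content_queue.get()  # stands in for downloaded html
        for link in links:
            with seen_lock:
                is_new = link not in seen
                seen.add(link)
            if is_new:
                url_queue.put(link)
        url_queue.task_done()  # the page behind this url is fully processed


def crawl(start_url, num_threads=4):
    seen.add(start_url)
    for target in [grabber, parser] * num_threads:
        threading.Thread(target=target, daemon=True).start()
    url_queue.put(start_url)
    url_queue.join()  # returns once every queued url has been processed
    return sorted(seen)


result = crawl("http://a.example/")
print(result)
```

The detail that makes the shutdown work is that `task_done()` is called by the parser, after any new urls have been queued, so the count of unfinished urls can only reach zero when there is truly nothing left to crawl.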
I ran this script for a few minutes, maybe 10-15, and I ended up with over 11,000
links ranging from my domain,
and many many more. Now that I have a decent base prototype I can continue forward
and expand upon the processing and logic that goes into each web request.
Look forward to more posts about this in the future.