A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program. Content can be in the form of a web page, image, video, document, etc.
Web crawlers are not used only by search engines; they have multiple other uses, such as:
- Indexed reviews from a food aggregator app
- Information for academic research
- Market research to find the most popular trends
- Best services or locations for personal use
- Jobs or opportunities in business
Web crawler uses in the field of business intelligence include:
- Tracking changes in content
- Detecting malicious websites
- Automated price retrieval from competitor websites for pricing strategy
- Identifying potential bestsellers for an e-commerce platform by accessing data from the competition
- Ranking the popularity of leaders or film stars
- Accessing data feeds from thousands of similar brands
- Indexing most frequently shared links on social networks
- Accessing and indexing job listings based on employee reviews and salaries
- Zip code-based price benchmarking and cataloging for retailers
- Building a service review database by aggregating scattered reviews across multiple sources
- Extracting data from news agencies and social feeds for breaking news, used for generating automated content
- Accessing market and social data to build a financial recommendation engine
- Discovering terrorist-related chat rooms
In this post we will learn how to design a web crawler with the following properties:
- Web crawler will be used for search engine indexing
- Content to be crawled would be HTML pages
- It should store crawled pages for ~5 years
- Pages with duplicate content should be ignored
- The web crawler should additionally be highly scalable, flexible, and robust
Based on the above requirements, let's perform some back-of-the-envelope estimations:
- Assume 1 billion web pages are downloaded per month.
- Queries per second: 1,000,000,000 / (30 days × 24 hours × 60 minutes × 60 seconds) ≈ 400 pages per second; peak QPS ≈ 2 × normal QPS, i.e. about 800 pages per second.
- If the average page size is 500 KB, then 1 billion pages need 500 KB × 1 billion ≈ 500 TB per month.
- Data will be stored for 5 years, so the total storage required is 500 TB × 12 × 5 ≈ 30 PB.
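The same arithmetic as a quick sanity-check script (the page count and average page size are simply the assumptions stated above):

```python
# Back-of-the-envelope numbers for the crawler, mirroring the assumptions above.
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 60 * 60            # 2,592,000 seconds

qps = pages_per_month / seconds_per_month        # ~386, i.e. roughly 400 pages/second
peak_qps = 2 * qps                               # ~800 pages/second

avg_page_size_bytes = 500 * 1000                 # assumed 500 KB average page size
storage_per_month_tb = pages_per_month * avg_page_size_bytes / 1e12   # ~500 TB
total_storage_pb = storage_per_month_tb * 12 * 5 / 1000               # ~30 PB over 5 years

print(f"QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
print(f"Storage per month: {storage_per_month_tb:.0f} TB, 5-year total: {total_storage_pb:.0f} PB")
```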
Components required in a Web Crawler
- Seed URLs: Seed URLs are used by the web crawler as the starting point of the crawl process. One way to get started is to divide the entire URL space into smaller partitions; dividing URLs by geography is another option, since different countries have different popular websites.
- URL Frontier: Crawl state can be divided into two parts: URLs to be downloaded and URLs already downloaded. The URL frontier stores the URLs to be downloaded.
- HTML Downloader: URLs provided by the URL frontier are downloaded by the HTML downloader.
- DNS Resolver: To download a web page, its URL must be converted into an IP address. The HTML downloader calls the DNS resolver to get the IP address.
- Content Parser: Once a web page is downloaded by the HTML downloader, its content is parsed by the content parser. The parser also validates pages, because malformed pages cause problems and waste storage.
- Content Seen Before?: To avoid storing duplicate content, we use a "Content Seen Before?" data structure that eliminates redundant data by comparing the hash values of pages (a minimal sketch follows this list).
- Content Storage: HTML content is kept in content storage, which stores data on disk because the data set is too big to fit in memory. Only popular content is cached in memory to reduce latency.
- URL Extractor: It extracts URLs from downloaded web pages.
- URL Filter: It excludes blacklisted sites from being stored.
- URL Seen Before?: It keeps track of URLs that have been visited before or are already in the frontier. This helps avoid adding the same URL to the URL frontier multiple times, which could otherwise lead to infinite loops.
- URL Storage: It stores already visited URLs.
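To make the "Content Seen Before?" idea concrete, here is a minimal sketch that assumes SHA-256 hashes of the raw HTML kept in an in-memory set; a production crawler would more likely use a Bloom filter or a distributed key-value store:

```python
import hashlib

class ContentSeenBefore:
    """Tracks hashes of already-stored pages so duplicate content can be skipped.

    Minimal in-memory sketch; a real crawler would likely use a Bloom filter
    or a distributed key-value store instead of a Python set.
    """

    def __init__(self):
        self._seen_hashes = set()

    def is_duplicate(self, html: str) -> bool:
        # Hash the page content; identical pages produce identical digests.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._seen_hashes:
            return True
        self._seen_hashes.add(digest)
        return False

# Usage: only pages that are not duplicates are passed on to the URL extractor.
seen = ContentSeenBefore()
print(seen.is_duplicate("<html>hello</html>"))  # False (first time seen)
print(seen.is_duplicate("<html>hello</html>"))  # True  (duplicate)
```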
Let's walk through the flow across these components:
- Seed URLs are added to the URL Frontier
- The HTML downloader fetches a list of URLs from the frontier
- The DNS resolver provides the IP addresses of those URLs to the HTML downloader, which then starts downloading
- The content parser parses the HTML pages and performs validations. Once validation passes and the content is parsed, it is handed to the "Content Seen Before?" component
- "Content Seen Before?" checks whether the HTML page is already in storage. If the content is not present, it is passed to the URL extractor component
- The URL extractor extracts URLs from the received content and forwards them to the URL filter
- After the URLs are filtered, they are passed to the "URL Seen Before?" component
- "URL Seen Before?" checks whether a URL is already in URL storage; if it is not, the URL is added to the URL frontier
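A compact, single-threaded sketch of the flow above is shown below; the component interfaces (fetch, parse, is_duplicate, and so on) are hypothetical stand-ins for the boxes just described, not a reference implementation:

```python
from collections import deque

def crawl(seed_urls, downloader, parser, content_seen, url_extractor,
          url_filter, url_seen, content_storage, url_storage):
    frontier = deque(seed_urls)                      # URL Frontier (plain FIFO here)
    while frontier:
        url = frontier.popleft()
        html = downloader.fetch(url)                 # DNS resolution + download
        if html is None:
            continue                                 # failed or malformed download
        content = parser.parse(html)                 # parse and validate the page
        if content_seen.is_duplicate(content):
            continue                                 # skip pages already stored
        content_storage.save(url, content)
        for link in url_filter.keep(url_extractor.extract(content)):
            if not url_seen.contains(link):          # avoid revisits and loops
                url_storage.add(link)
                frontier.append(link)
```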
Deep Dive into Design
The web can be treated as a directed graph in which web pages are nodes and the hyperlinks between them are edges. To traverse this graph, which approach is better, BFS or DFS, and why?
BFS is commonly used in web crawlers and is implemented with a FIFO (First In, First Out) queue. However, this approach has problems: most links on a web page point back to the same host, so a naive BFS floods that host with requests and causes impoliteness. It also does not take URL priority into account, even though URLs should be prioritized based on their rank, web traffic, update frequency, and so on.
A mix of BFS and DFS is also possible: iterative deepening, or more specifically iterative deepening depth-first search (IDS or IDDFS), is a state space/graph search strategy in which a depth-limited version of depth-first search is run repeatedly with increasing depth limits until the goal is found.
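As a toy illustration of that idea only, assuming for simplicity a link graph that is already known as a dictionary of page → outgoing links (a real crawler discovers the graph incrementally as pages are downloaded):

```python
def depth_limited_dfs(graph, node, target, limit):
    # Classic depth-limited DFS: the limit, not a visited set, bounds recursion,
    # so cycles in the link graph cannot cause infinite loops (only wasted work).
    if node == target:
        return True
    if limit == 0:
        return False
    return any(depth_limited_dfs(graph, nxt, target, limit - 1)
               for nxt in graph.get(node, []))

def iddfs(graph, start, target, max_depth):
    # Run depth-limited DFS with increasing limits until the target is found.
    return any(depth_limited_dfs(graph, start, target, limit)
               for limit in range(max_depth + 1))

# Example: pages A..D with hyperlinks between them.
links = {"A": ["B", "C"], "B": ["D"], "C": ["A"], "D": []}
print(iddfs(links, "A", "D", max_depth=3))   # True
```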
The problems mentioned above can be addressed by the URL frontier, which stores the URLs to be downloaded and can be used to ensure URL prioritization, freshness, and politeness.
- Politeness: the crawler should avoid sending too many requests to the same web server within a short period of time, as this effectively amounts to a denial of service. The general way of enforcing politeness is to download one page at a time from the same host. Politeness is implemented by maintaining a mapping from website hostnames to download (worker) threads; each downloader thread has its own FIFO queue and downloads only URLs obtained from that queue.
- Priority: not all web pages have the same level of importance. A new post in a Reddit group has lower priority than a new post from the Associated Press, for example. Priority is managed just like politeness: different queues represent different priority levels, and a queue selector picks URLs from the high-priority queues more often. The two sets of queues are then placed in sequence: URLs are first routed through the priority queues, then through the politeness queues.
- Freshness: web pages are constantly updated, so the web crawler should periodically recrawl downloaded pages to keep its data fresh. Recrawling can be based on a page's update history, or pages can be recrawled based on priority.
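A minimal sketch of the politeness mapping described above, assuming a fixed pool of worker queues and using the URL's hostname to pick a queue (the priority queues would sit in front of this stage):

```python
from collections import deque
from urllib.parse import urlparse

class PolitenessRouter:
    """Routes URLs to per-host FIFO queues; each queue is consumed by exactly
    one worker, so no host is downloaded from by more than one worker at a time."""

    def __init__(self, num_workers: int):
        self.queues = [deque() for _ in range(num_workers)]

    def enqueue(self, url: str) -> None:
        host = urlparse(url).netloc
        # The same host always maps to the same queue (and therefore the same worker).
        worker_id = hash(host) % len(self.queues)
        self.queues[worker_id].append(url)

    def next_url_for(self, worker_id: int):
        queue = self.queues[worker_id]
        return queue.popleft() if queue else None

router = PolitenessRouter(num_workers=3)
router.enqueue("https://example.com/a")
router.enqueue("https://example.com/b")   # lands in the same queue as /a
print(router.next_url_for(hash("example.com") % 3))
```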
HTML Downloader
It makes a request to DNS and receives an IP address. Then another request is made to the IP address to retrieve an HTML page.
There exists a file on most websites called robots.txt (the Robots Exclusion Protocol) that websites use to communicate with crawlers; it acts as a guide to which pages a crawler is allowed to parse. An example snippet of the https://www.apple.com/robots.txt file is shown below. Before attempting to crawl a site, we should check for this file and follow its rules. The robots.txt file is also a good candidate for caching to avoid repeated downloads.
Disallow: /*retail/availability*
Disallow: /*retail/availabilitySearch*
Disallow: /*retail/pickupEligibility*
Disallow: /*shop/sign_in*
Disallow: /*shop/sign_out*
Disallow: /*/shop/1-800-MY-APPLE/*
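Python's standard library ships a parser for the Robots Exclusion Protocol that can be used for this check. Note that it implements the original protocol and may not interpret wildcard rules like the ones above exactly the way major search engines do; the user agent string "MyCrawler" is just a placeholder:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.apple.com/robots.txt")
rp.read()   # downloads and parses the file; worth caching per host

# Returns True or False depending on the rules in effect for this user agent.
print(rp.can_fetch("MyCrawler", "https://www.apple.com/shop/sign_in"))
print(rp.can_fetch("MyCrawler", "https://www.apple.com/iphone/"))
```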
There are a few more performance enhancements that can be made:
- Instead of crawling URLs one by one from the URL frontier, we can run multiple crawl jobs on multiple servers, each with multiple threads, to increase performance.
- DNS resolution can take 10 ms to 200 ms, so it is a good idea to cache responses for frequently visited URLs.
- If a URL takes too long to respond, the request should time out.
- Data can be distributed across multiple servers using consistent hashing (a sketch follows this list).
- The web crawler can support multiple content types (e.g., PNG as well as HTML) and extra functions such as a web monitor by plugging additional modules into the "Content Seen Before?" component.
- A maximum URL length can be defined to avoid spider traps.
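For the consistent-hashing point above, a minimal hash-ring sketch (the server names are hypothetical) could look like this:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps URLs to crawler servers so that adding or removing a server
    only remaps a small fraction of the keys."""

    def __init__(self, servers, virtual_nodes: int = 100):
        self._ring = []                      # sorted list of (hash, server) pairs
        for server in servers:
            for i in range(virtual_nodes):   # virtual nodes smooth out the distribution
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, url: str) -> str:
        # Walk clockwise on the ring to the first virtual node >= hash(url).
        idx = bisect.bisect(self._keys, self._hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.server_for("https://example.com/some/page"))
```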