Web crawlers are the brains of modern search engines. They archive web pages and index them in the databases of Google, Bing, and Yahoo. These search engines and their counterparts rely on highly sophisticated technology to return, almost instantly, thousands of results for every search a user makes. But what makes search engines work? Their brain: the web crawler, also known as a spider.
The web crawler (often called simply a “spider” or “crawler”) is an Internet bot that periodically scans the World Wide Web to build an index or, better still, a map of it. Search engines, and some other Internet services, use this software to keep their content and their web indexes up to date. A spider can copy the content of every page it visits and store it so that the search engine can later analyze and index it, cataloging each page by the keywords and topics it covers. This is what makes it possible to return search results quickly and accurately.
A spider begins its work with the so-called seeds. The seeds are nothing more than a list of URLs for the websites that the program must visit systematically. The content at these addresses is analyzed and saved so that it can be indexed by the cataloging software associated with the search engine. In particular, the web crawler looks for hyperlinks within each page and adds them to the list of URLs to visit later. The URLs in this list, called the crawl frontier, are visited recursively by the spider, which records any changes or updates.
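The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are invented for the example, and a small in-memory dictionary stands in for real HTTP requests so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is a caller-supplied function that maps a URL to its HTML,
    or returns None if the page cannot be retrieved.
    """
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so none is queued twice
    store = {}               # url -> saved content, in visit order

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
    "http://example.test/c": "no links here",
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
visit_order = list(pages)
```

In a real crawler, `fetch` would issue an HTTP request. Keeping it as a parameter makes the traversal logic easy to try out on its own.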
Of course, the hyperlinks found on these newly discovered pages are also added to the general list and visited later. In this way the crawler traces a real web of Internet pages connected through hyperlinks, which explains the name spider. If the crawler runs in “archiving” mode, it copies and stores the contents of every page it visits. The pages are saved as snapshots, which speeds up the process while keeping them readable and navigable.
The behavior of a spider results from a combination of policies. Four in particular have the most significant effect on the work of web crawlers: the selection policy, the revisit policy, the courtesy (or politeness) policy, and the parallelization policy.
Given the current size of the web, it is practically impossible for a spider to index every website and every page (and it must be remembered that indexable pages are only a small part of the Net). A web crawler can crawl and process between 40 and 70% of public pages, a higher share than in the past. Since a bot will visit only a fraction of the web’s pages, it is essential that the pages it downloads contain relevant information and are not just a random “sample.”
This is possible thanks to a priority scale built into the spider when it is programmed. The importance of a page depends on its intrinsic quality, on its popularity in terms of the links that point to it or the visits it receives, and, in some cases, on its URL. Developing a selection policy that works well is anything but simple, since at any point during the crawl the spider “knows” only a tiny part of the web.
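A priority scale of this kind might look like the following sketch. The scoring formula and its weights are arbitrary assumptions made for illustration, as are the names `priority`, `select`, and `url_depth`. Real crawlers use far more elaborate signals.

```python
import heapq
from urllib.parse import urlparse


def url_depth(url):
    """Number of non-empty path segments, e.g. /a/b/ -> 2."""
    return len([part for part in urlparse(url).path.split("/") if part])


def priority(url, inlinks, visits):
    """Hypothetical score: inbound links and visits raise it,
    a deeper URL path lowers it. The weights are arbitrary."""
    return (1.0 * inlinks.get(url, 0)
            + 0.01 * visits.get(url, 0)
            - 0.5 * url_depth(url))


def select(candidates, inlinks, visits, budget):
    """Return the `budget` highest-priority URLs, best first."""
    return heapq.nlargest(
        budget, candidates, key=lambda u: priority(u, inlinks, visits)
    )


candidates = [
    "http://example.test/",
    "http://example.test/news/2020/01/old-post",
    "http://example.test/products",
    "http://example.test/about",
]
# Made-up popularity figures for the example.
inlinks = {
    "http://example.test/": 40,
    "http://example.test/products": 12,
    "http://example.test/news/2020/01/old-post": 3,
}
visits = {
    "http://example.test/about": 200,
    "http://example.test/products": 900,
}

chosen = select(candidates, inlinks, visits, budget=2)
```

With a budget of two pages, the home page and the products page are chosen, because they have the most inbound links and visits and the shortest paths.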
The web is highly dynamic and, however fast and efficient it may be, a web crawler will take weeks, if not months, to cover the portion of the network assigned to it. In that time, pages already visited and indexed are likely to have changed, even substantially. It is therefore necessary to return periodically to pages already indexed so that the contents saved in the database stay up to date.
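One simple way to decide how often to return to a page is to adapt the interval to how often the page actually changes. The sketch below is an assumption for illustration, not a rule taken from any specific crawler: it halves the interval when the content has changed since the last visit and doubles it otherwise. The names `PageRecord` and `update_after_visit` and the interval limits are invented.

```python
import hashlib
from dataclasses import dataclass

MIN_INTERVAL = 1.0   # days; arbitrary lower bound
MAX_INTERVAL = 64.0  # days; arbitrary upper bound


@dataclass
class PageRecord:
    digest: str            # fingerprint of the content seen last time
    interval: float = 8.0  # days to wait before the next visit


def fingerprint(content):
    """Hash the page content so that changes are cheap to detect."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def update_after_visit(record, new_content):
    """Compare the new content with the stored fingerprint and
    adjust the revisit interval. Returns True if the page changed."""
    new_digest = fingerprint(new_content)
    changed = new_digest != record.digest
    if changed:
        # The page changes often: come back sooner.
        record.interval = max(MIN_INTERVAL, record.interval / 2)
    else:
        # The page looks stable: come back later.
        record.interval = min(MAX_INTERVAL, record.interval * 2)
    record.digest = new_digest
    return changed


record = PageRecord(digest=fingerprint("version 1"))
history = []
for content in ["version 1", "version 2", "version 3"]:
    changed = update_after_visit(record, content)
    history.append((changed, record.interval))
```

Here the first visit finds the page unchanged, so the interval grows to 16 days. The next two visits find new content, so it shrinks to 8 and then 4 days.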
While crawling, spiders can also have a substantial impact on a website’s performance. Even though their work is valuable and necessary, spiders consume a considerable amount of resources (network bandwidth, server load, etc.) for their own purposes. The so-called robots exclusion protocol is a partial solution to this problem.
A file called robots.txt can be placed in the root directory of a website. This file tells the web crawler how the site should be crawled: which parts of the site may be crawled and which may not, and the minimum time interval that must elapse between the request for one page of the site and the next.
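As an illustration, Python’s standard library includes a parser for this file in `urllib.robotparser`. The sketch below parses a made-up file from a string rather than downloading one. The bot name `ExampleBot` and the rules are invented. The `Crawl-delay` directive is a widely used extension that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# May a bot called "ExampleBot" fetch these pages?
allowed_home = parser.can_fetch(
    "ExampleBot", "http://example.test/index.html"
)
allowed_private = parser.can_fetch(
    "ExampleBot", "http://example.test/private/data.html"
)

# Minimum number of seconds to wait between two requests, if specified.
delay = parser.crawl_delay("ExampleBot")
```

A courteous crawler would skip the URLs for which `can_fetch` returns `False` and wait at least `delay` seconds between requests to the same site.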
A parallel web crawler is a spider that runs several crawling processes simultaneously. To prevent the same web page from being crawled several times in a short period, a policy is needed that controls how newly discovered URLs, whether found from the seeds or from pages discovered later, are assigned to the processes so that work is not duplicated.
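One common way to make this assignment, shown here only as a sketch under assumed names (`assign_worker`, `partition`), is to hash the host name of each URL. Every URL of a given site then always goes to the same process, so no two processes fetch the same page.

```python
import hashlib
from urllib.parse import urlparse


def assign_worker(url, num_workers):
    """Map a URL to a worker index based on a stable hash of its host."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Split URLs into one queue per worker, dropping exact duplicates."""
    queues = [[] for _ in range(num_workers)]
    seen = set()
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        queues[assign_worker(url, num_workers)].append(url)
    return queues


discovered = [
    "http://example.test/a",
    "http://example.test/b",
    "http://other.test/",
    "http://example.test/a",  # duplicate, queued only once
]
queues = partition(discovered, num_workers=3)
```

Using `hashlib` rather than Python’s built-in `hash` keeps the assignment identical across separate processes and runs, since the built-in hash of a string is randomized per process by default.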