
Google tries to index the Invisible Web

by admin

Developers from Google's Crawling and Indexing Team have reported on an important experiment that began recently. They upgraded the crawler and started testing technology that processes HTML forms intelligently. With the upgrade, the crawler should learn to retrieve hidden URLs and web pages that are generated only in response to form submissions on various sites and cannot be reached any other way.
In practice, the technology works like this: when the crawler encounters a form, the form handler makes a series of test queries. For text fields, words taken from the site hosting the form are automatically chosen as query values; the values for checkboxes and drop-down menus are taken directly from the page code. The program then tries to fetch the resulting URL, and if the page really contains content, it is sent on to the general search index.
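A minimal sketch of that flow is shown below. This is not Google's actual code: it simply illustrates the idea of filling a GET form's text inputs with candidate keywords, taking drop-down values straight from the markup, fetching the resulting URL, and keeping it only if the response carries real content. The keyword list and the content-length threshold are illustrative assumptions.

```python
# Sketch of form probing: extract GET forms, fill them with test values,
# and yield result URLs that appear to contain real content.
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class FormExtractor(HTMLParser):
    """Collects GET forms, their text inputs, and <select> option values."""

    def __init__(self):
        super().__init__()
        self.forms = []      # each: {"action": str, "text": [names], "select": {name: [values]}}
        self._form = None
        self._select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and (a.get("method") or "get").lower() == "get":
            self._form = {"action": a.get("action") or "", "text": [], "select": {}}
        elif self._form is not None and tag == "input" and (a.get("type") or "text") == "text":
            self._form["text"].append(a.get("name") or "")
        elif self._form is not None and tag == "select":
            self._select = a.get("name") or ""
            self._form["select"][self._select] = []
        elif self._select is not None and tag == "option" and a.get("value"):
            self._form["select"][self._select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._form is not None:
            self.forms.append(self._form)
            self._form = None


def probe_form(page_url, form, keywords):
    """Make one test query per keyword; yield URLs whose responses look non-empty."""
    for word in keywords:
        # Text fields get words drawn from the hosting page (passed in as `keywords`).
        params = {name: word for name in form["text"] if name}
        # Drop-down values come directly from the page code; try the first option.
        for name, values in form["select"].items():
            if values:
                params[name] = values[0]
        url = urljoin(page_url, form["action"]) + "?" + urlencode(params)
        try:
            body = urlopen(url, timeout=10).read()
        except OSError:
            continue
        if len(body) > 500:   # crude check that the page really contains content
            yield url         # candidate for the general search index
```

In a real crawler the keyword candidates would be drawn from the words already on the host page, and duplicate detection would keep near-identical result pages from flooding the index.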
Despite its apparent simplicity, HTML form processing is a very important step toward surfacing the so-called "Invisible Web" (Deep Web): the huge amounts of information hidden in large databases that are exposed to the world only through HTML form interfaces. These include legal databases, all sorts of directories (phone numbers, addresses, prices), and other data sets. By some estimates, the Invisible Web contains hundreds of billions of pages and accounts for 90% of all Internet content. Notably, this is where some of the most valuable content hides, still inaccessible through standard search engines.
Still, a huge chunk of the Invisible Web will remain beyond Google's reach, because the crawler is forbidden to enter passwords or any other personal information into form fields; that is a deliberate decision by the developers and Google executives. Many sites grant public access to their information only after a free registration, but the Google robot cannot create a fictitious identity just to register: that would be deceptive and contrary to the principles Googlebot is supposed to follow.
Incidentally, knowledgeable people have already explained where the new crawling technology comes from. It was most likely created by the developers of a small company called Transformic, which Google acquired in 2005. For the last two and a half years they have been hard at work refining their technology and helping to integrate it into the Google crawler.
