Seeing that many of our customers have a hard time getting their websites crawled and indexed properly, we went through some Google documentation on crawling, rendering, and indexing to better understand the whole process.
Some of our results were extremely surprising, while others confirmed our previous theories.
Here are 5 things I've learned that you may not know about how Googlebot works.
1. Googlebot skips some URLs
Googlebot won't visit every URL it finds on the web. The larger a website is, the greater the risk that some of its URLs won't be crawled and indexed.
Why doesn't Googlebot just visit every URL it can find on the web? There are two reasons for this:
- Google has limited resources. There is a lot of spam on the web, so Google needs to develop mechanisms to avoid visiting low-quality pages. Google prioritizes crawling the most important pages.
- Googlebot is designed to be a good citizen of the web. It limits its crawling to avoid overloading your server.
The mechanism for choosing which URLs to visit is described in the Google patent "Method and apparatus for managing a backlog of pending URL crawls":
"The pending URL crawl is rejected from the backlog if the priority of the pending URL crawl does not exceed the priority threshold."
"Various criteria are applied to the requested URL crawls, so that less important URL crawls are rejected in advance by the backlog data structure."
These quotes suggest that Google is assigning a crawl priority to each URL and may refuse to crawl some URLs that don't meet the priority criteria.
The priority assigned to URLs is determined by two factors:
- The popularity of a URL,
- The importance of crawling a given URL to keep the Google index fresh.
"The priority may be higher based on the popularity of the content or IP address / domain name e the importance of maintaining freshness rapidly changing content such as breaking news. Since scan capacity is a scarce resource, scan capacity is preserved with priority scores".
What exactly makes a URL popular? The Google patent "Minimizing visibility of stale content in web searching including revising web crawl intervals of documents" defines URL popularity as a combination of two factors: view rate and PageRank.
PageRank is also mentioned in this context in other patents, such as "Scheduler for search engine crawlers".
But there is one more thing you should know. When your server responds slowly, the priority threshold that your URLs must meet increases.
"The priority threshold is adjusted, based on an updated probability estimate of satisfying the requested URL scans. This probability estimate is based on the estimated fraction of the requested URL scans that can be satisfied. The fraction of requested URL scans that can be satisfied has as its numerator the average request interval or the difference in arrival time between URL crawl requests."
To sum it up, Googlebot may skip crawling some of your URLs if they don't meet a priority threshold based on the URL's PageRank and the number of views it gets.
This has strong implications for any large website.
If a page is not crawled, it will not be indexed and will not appear in search results.
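To make this more concrete, here is a minimal sketch of how a priority threshold like the one described above could behave. The weights, scores, and the way the threshold grows with server latency are my own assumptions for illustration; none of them come from Google's patent.

```python
# Hypothetical model of the priority-threshold idea from the patent.
# All weights, scores, and the threshold formula are invented.

def crawl_priority(pagerank: float, view_rate: float, freshness_need: float) -> float:
    """Combine popularity (PageRank + view rate) with how urgent freshness is."""
    popularity = 0.7 * pagerank + 0.3 * view_rate
    return popularity + freshness_need

def should_crawl(url_priority: float, base_threshold: float, server_latency_ms: float) -> bool:
    """A slow server means fewer requests can be satisfied, so the bar rises."""
    threshold = base_threshold * (1 + server_latency_ms / 1000)
    return url_priority > threshold

# The same URL clears the bar on a fast server and is skipped on a slow one.
priority = crawl_priority(pagerank=0.4, view_rate=0.2, freshness_need=0.1)
print(should_crawl(priority, base_threshold=0.35, server_latency_ms=100))  # True
print(should_crawl(priority, base_threshold=0.35, server_latency_ms=600))  # False
```

The point of the toy example is in the last two lines: with everything else equal, slower responses push more of your URLs below the threshold.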
To do:
- Make sure your server and website are fast.
- Check your server logs. They provide valuable information about which pages of your website Google crawls; see the sketch below for one way to dig into them.
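One way to act on the log tip above is simply to count how often Googlebot requests each URL. This minimal sketch assumes your server writes logs in the common "combined" format to a file called access.log; adjust both to your setup, and keep in mind that a rigorous analysis should also verify the hits (for example via reverse DNS), since anyone can fake the Googlebot user agent.

```python
# Count Googlebot requests per URL from a combined-format access log.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

googlebot_hits = Counter()
with open("access.log") as log:  # path is an assumption; point this at your own log file
    for line in log:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            googlebot_hits[match.group("path")] += 1

# The most- and least-crawled paths show where Googlebot spends its budget.
for path, hits in googlebot_hits.most_common(20):
    print(hits, path)
```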
2. Google divides pages into tiers for re-crawling
Google wants search results to be as fresh and up to date as possible. This is only possible when a mechanism is in place to rescan already indexed content.
In the patent "Minimizing visibility of stale content in web searching", I found information on how this mechanism is structured.
Google divides pages into tiers based on how often the algorithm decides they need to be re-crawled.
"In one embodiment, documents are partitioned in multiple levels, each level including a plurality of documents sharing similar web scan ranges."
Therefore, if your pages aren't crawled as often as you would like, they are most likely in a tier with a longer crawl interval.
However, don't despair! Your pages don't need to stay in that tier forever - they can be moved.
Each time a page is crawled it is an opportunity for you to prove that it is worth re-crawling more frequently in the future.
"After each scan, the search engine re-evaluates the web scan range of a document and determines if the document should be moved from the current layer to another layer".
It is clear that if Google sees that a page changes frequently, the page may be moved to a different tier. But it's not enough to change a few minor aesthetic elements: Google analyzes both the quality and the quantity of the changes made to your pages.
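As a thought experiment, this is how such a tier system could be modelled. The tier names, intervals, and change-scoring thresholds are all my own assumptions, not values from the patent.

```python
# Toy model of the "tier" idea: pages sharing a similar re-crawl interval.
from datetime import timedelta

TIERS = [
    ("daily",   timedelta(days=1)),
    ("weekly",  timedelta(days=7)),
    ("monthly", timedelta(days=30)),
]

def reassign_tier(current_interval: timedelta, change_score: float) -> str:
    """After a crawl, shrink or grow the interval depending on how much the page
    changed (change_score in [0, 1], weighing both quality and quantity of the
    changes), then snap it to the nearest tier."""
    if change_score > 0.5:
        new_interval = current_interval / 4   # substantial changes: crawl much sooner
    elif change_score < 0.1:
        new_interval = current_interval * 4   # nothing meaningful: crawl much later
    else:
        new_interval = current_interval
    return min(TIERS, key=lambda tier: abs(tier[1] - new_interval))[0]

print(reassign_tier(timedelta(days=30), change_score=0.8))   # monthly page moves to "weekly"
print(reassign_tier(timedelta(days=7),  change_score=0.05))  # weekly page drops to "monthly"
```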
To do:
- Use your server logs and Google Search Console to know if your pages are being crawled often enough.
- If you want to reduce the crawl interval of your pages, regularly improve the quality of your content.
3. Google doesn't re-index a page on every crawl
According to the patent "Minimizing visibility of stale content in web searching including revising web crawl intervals of documents", Google does not re-index a page after every crawl.
"If the document has changed substantially since the last scan, the scheduler sends a warning to a content indexer (not shown), which replaces the index entries for the previous version of the document with index entries for the current version of the document. Next, the scheduler calculates a new web scan interval for the document based on its old interval and additional information, such as the importance of the document (measured by a score, such as PageRank), refresh rate, and / or percentage. of clicks. If the content of the document has not changed or if the changes to the content are not critical, there is no need to re-index the document. "
I have seen this happen in the wild several times.
Also, I ran some experiments on existing pages on Onely.com. I noticed that if I only changed a small part of the content, Google wasn't re-indexing it.
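Here is a rough sketch of the two decisions the patent quote describes: re-index only when the change is substantial, and derive the next crawl interval from the old one plus importance, refresh rate, and click-through rate. The thresholds and formulas are invented for illustration and do not appear in the patent.

```python
# Toy versions of the "is this change substantial?" and "what is the new
# crawl interval?" decisions. All numbers here are assumptions.

def changed_substantially(old_text: str, new_text: str, min_ratio: float = 0.05) -> bool:
    """Treat the page as substantially changed only if enough of its words differ."""
    old_words, new_words = set(old_text.split()), set(new_text.split())
    changed = len(old_words ^ new_words) / max(len(old_words | new_words), 1)
    return changed >= min_ratio

def next_crawl_interval_days(old_interval: float, pagerank: float,
                             refresh_rate: float, ctr: float) -> float:
    """Important, frequently changing, frequently clicked pages get shorter intervals."""
    demand = pagerank + refresh_rate + ctr  # each assumed to be normalized to [0, 1]
    return max(1.0, old_interval * (1.5 - demand))

print(changed_substantially("breaking news about crawling", "breaking news about crawling"))  # False
print(next_crawl_interval_days(old_interval=14, pagerank=0.6, refresh_rate=0.5, ctr=0.2))      # ~2.8 days
```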
To do:
If you have a news website and you frequently update your posts, check whether Google re-indexes them quickly enough. If it doesn't, there is untapped potential in Google News for you.
4. Click-through rate and internal linking
In the previous quote, did you notice how click-through rate was mentioned?
"Next, the scheduler calculates a new web scan interval for the document based on its old interval and additional information, such as the importance of the document (measured by a score, such as PageRank), refresh rate and / or click-through rate."
This quote suggests that click-through rate affects a URL's crawl rate.
Let's imagine we have two URLs. One is visited by Google users 100 times a month, and the other is visited 10,000 times a month. All other things being equal, Google should revisit the one with 10,000 visits per month more frequently.
According to the patent, PageRank is also an important part of this. This is one more reason to make sure you are using internal links correctly to connect various parts of your domain.
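To see why internal links matter here, this toy example runs a simplified PageRank calculation over a hypothetical internal link graph. Real PageRank operates on the whole web graph and is far more sophisticated, so treat this purely as an intuition builder; the page names are made up.

```python
# Simplified PageRank over a made-up internal link graph.
links = {
    "/": ["/blog", "/products", "/contact"],
    "/blog": ["/", "/products"],
    "/products": ["/"],
    "/contact": ["/"],
}

def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outlinks in graph.items():
            share = rank[page] / len(outlinks) if outlinks else 0
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Pages that receive links from many (or strong) pages accumulate more score, which, per the patents above, feeds into how eagerly they are crawled.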
To do:
- Can Google and users easily access the most important sections of your website?
- Is it possible to reach all the important URLs? Having all your URLs available in the sitemap may not be enough; see the reachability sketch below.
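For that second point, a quick breadth-first search over your internal link graph shows which important URLs are actually reachable from the homepage and at what click depth. The graph below is invented for illustration; in practice you would build it by crawling your own site.

```python
# Click-depth / reachability check over a hypothetical internal link graph.
from collections import deque

internal_links = {
    "/": ["/blog", "/products"],
    "/blog": ["/", "/blog/post-1"],
    "/blog/post-1": ["/blog"],
    "/products": ["/"],
    # "/old-landing-page" is in the sitemap, but nothing links to it.
}

def click_depths(graph, start="/"):
    """Breadth-first search from the homepage: how many clicks to reach each URL?"""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = click_depths(internal_links)
for url in ["/", "/blog/post-1", "/old-landing-page"]:
    print(url, depths.get(url, "not reachable through internal links"))
```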
5. Not all links are created equal
We just explained how, according to Google's patents, PageRank heavily affects crawling.
The first implementation of the PageRank algorithm was unsophisticated, at least judging by current standards. It was relatively simple: if you received a link from an *important* page, you would rank higher than other pages.
However, the first PageRank implementation was released over 20 years ago. Google has changed a lot since then.
I found interesting patents, such as "Ranking documents based on user behavior and/or feature data", which show that Google is well aware that some links on a given page are more important than others. What's more, Google may treat these links differently.
"This reasonable surfer model reflects the fact that not all links associated with a document are equally likely to be followed. Examples of unlikely links may include 'Terms of Service' links, banner advertisements, and links unrelated to the document."
So Google analyzes links based on their various characteristics. For example, it can examine the font size and position of a link.
" For example, the model build unit may generate a rule indicating that links with anchor text greater than a given font size are more likely to be selected than links with anchor text less than the particular font size. Also, or alternatively, generating the unit model can generate a rule indicating that links positioned closer to the top of a document are more likely to be selected than links positioned towards the bottom of the document. "
It even appears that Google can create rules for evaluating links at the website level. For example, Google may learn that links under the "More Top Stories" heading are clicked more frequently, so it can give them more weight.
"(…) The model generating unit may generate a rule indicating that a link positioned under the 'More Top Stories' heading on the cnn.com website has a high probability of being selected. Additionally, or alternatively, the model generating unit may generate a rule indicating that a link associated with a target URL that contains the word 'domainpark' has a low probability of being selected. Also, or alternatively, the model generating unit may generate a rule indicating that a link associated with a source document that contains a pop-up has a low probability of being selected."
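To illustrate the idea (and not Google's actual implementation), here is a hand-written link scorer in the spirit of the "reasonable surfer" rules quoted above. The features, weights, and thresholds are invented, and the patent describes learning such rules from user behavior rather than hard-coding them.

```python
# Hand-coded "reasonable surfer"-style link scoring. Every rule and weight
# below is an assumption made for illustration.

def link_click_probability(font_size_px: int, position_from_top: float,
                           section_heading: str, target_url: str) -> float:
    """Estimate how likely a link is to be clicked, from a few page features.
    position_from_top is 0.0 at the top of the page and 1.0 at the bottom."""
    score = 0.5
    if font_size_px >= 16:
        score += 0.15        # larger anchor text is easier to notice and click
    if position_from_top < 0.3:
        score += 0.15        # links near the top of the document
    if section_heading == "More Top Stories":
        score += 0.2         # site-specific rule, echoing the patent's example
    if "domainpark" in target_url or section_heading == "Terms of Service":
        score -= 0.4         # boilerplate or parked-domain style links
    return max(0.0, min(1.0, score))

print(link_click_probability(18, 0.1, "More Top Stories", "https://example.com/article"))  # 1.0 (high)
print(link_click_probability(10, 0.9, "Terms of Service", "https://example.com/tos"))      # ~0.1 (low)
```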
As a side note, in a conversation with Barry Schwartz and Danny Sullivan in 2016, Gary Illyes confirmed that Google labels links, for example as footer links or Penguin-affected links.
"Basically, we have tons of link labels; for example, it is a footer link, in practice, which has a much lower value than a link in the content. So another label would be a realtime Penguin label".
Summarizing the key points:
- Google assigns a priority to every page it crawls.
- The faster your website, the more of it Googlebot can crawl.
- Google will not crawl and index every URL. Only URLs with a priority above the threshold will be crawled.
- Links are treated differently depending on their characteristics and positioning.
- Google doesn't re-index a page after every crawl. Whether it does depends on how substantial the changes are.
In conclusion
As you can see, crawling is anything but a simple process of following every link Googlebot can find. It's genuinely complicated, and it has a direct impact on the search visibility of any website. I hope this article has helped you understand crawling a little better and that you can use this knowledge to improve how Googlebot crawls your website, and rank better as a result. Beyond having a site with a sound structure and a good internal and external link building process, it is more essential than ever to have fast and efficient hosting and servers, so that you can manage Googlebot's crawling as well as possible and get the most out of your crawl budget.