Google Search Bar | Image credit: Solen Feyissa/Unsplash

Three uncommon ways to manage Google’s crawl behavior

Author at TechGenyz SEO

If you are new to SEO, you may be only slightly familiar with how Google crawls and indexes sites, and you may not realize how important it is to guide Google through yours. From managing your crawl budget to signaling which pages carry authority, you have to take a proactive stance to get the results you want.

Relevance is extremely important when it comes to ranking for your target keywords. The problem is that Google’s bots have a hard time determining what your site is about when too many pages and URLs send conflicting signals. It’s up to you to help Google get it right so your site can flourish as it should.

In this article, we will go over three ways that you can manage how Google crawls your site.

Robots.txt

Google, like most major search engines, checks your robots.txt file for instructions on how to crawl your site. When it encounters a disallow directive, it knows not to crawl the paths listed there. This puts you in control of where search engine crawlers go on your site.

Google would rather not waste time crawling pages like those in your wp-admin directory, and crawling them wastes your crawl budget as well. Adding a disallow directive for such paths tells the spider to skip them.

You also want to avoid duplicate content issues on your site. Duplicate content can confuse Google and lead to the wrong page ranking for a keyword, or to neither page ranking at all because Google doesn’t know which one to choose.

The file is organized into groups of directives. Each group begins with a user-agent line that names the crawler it applies to, so you can set different rules for different search engines. Below it come the rules themselves, such as the aforementioned disallow. There are also directives like crawl-delay, which some crawlers honor and which can save server bandwidth when a big crawler visits your site. Googlebot, however, ignores crawl-delay.

A problem with robots.txt is that its instructions are requests, not mandates. Some spiders will simply ignore them. And since Google has its own agenda, you will have to keep a close eye on its latest announcements about how it handles the file.
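To make the grouping concrete, here is a minimal sketch that feeds a small robots.txt to Python's standard-library parser and asks which URLs each crawler may fetch. The file contents, the example.com URLs and the ExampleBot name are assumptions made up for illustration. Python's parser is also only an approximation of how any particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for every crawler, one for "ExampleBot".
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: ExampleBot
Disallow: /
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Crawlers without a group of their own fall back to the "*" group.
print(is_allowed("Googlebot", "https://example.com/wp-admin/options.php"))
print(is_allowed("Googlebot", "https://example.com/blog/post"))

# ExampleBot has its own group, which blocks the whole site.
print(is_allowed("ExampleBot", "https://example.com/blog/post"))
print(parser.crawl_delay("ExampleBot"))
```

A check like this can be handy before deploying a new robots.txt, since one misplaced disallow rule can block far more of the site than intended.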

Robots directives

If robots.txt holds site-wide instructions about what to crawl, robots directives are page-level instructions, usually set in a robots meta tag in the page’s HTML or in an X-Robots-Tag HTTP header. They specify how an individual page should be crawled and indexed, so Google knows exactly which pages belong in its index.

The directives mainly tell Google which pages not to index and which links not to follow. In the past, people would usually just use a canonical URL to signal which page had the relevance and authority. That works well, but Google treats a canonical as a hint, whereas it treats a robots directive such as noindex as an instruction, so the signal is much stronger.

A good case for these directives is when you have several landing pages for the different keywords you are targeting. For example, if you run an eCommerce store and use PPC, you may have a separate landing page for each campaign.

People arrive at the site with different motivations, so you want a landing page that serves the needs of each visitor. One problem with having several landing pages is that their text is very similar, with just a few variations. In other words, there is not enough difference for Google to treat them as distinct pages.

Adding the robots ‘noindex’ directive to those pages tells Google to keep them out of its index. Note that Google has to be able to crawl a page to see the directive, so do not also block that page in robots.txt.

Another benefit applies when you have a big site with thousands of different products: these directives can help conserve your crawl budget. Rather than have Google crawl every URL, you can steer it toward certain pages or even just the category pages.
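As an illustration, the sketch below shows what a noindex directive looks like in a page's head and how you might check for it with Python's standard library. The landing page markup and the helper names are hypothetical. The parser only reads the robots meta tag, not an X-Robots-Tag header.

```python
from html.parser import HTMLParser

# Hypothetical PPC landing page that should stay out of the index.
LANDING_PAGE = """<html><head>
<title>Spring sale landing page</title>
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


print("noindex" in robots_directives(LANDING_PAGE))
```

Running a check like this across your campaign landing pages is one way to confirm that each near-duplicate page actually carries the directive.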

Canonical URLs

When you have very similar pages on your site, there is a real risk that Google gets confused about which page should rank for the keywords they share. Sometimes it ranks the wrong page. In the worst case, it decides not to rank any of them well, and you end up so deep in the SERPs that you are unlikely to get any traffic for your target keyword.

Adding an HTML link tag with the attribute rel="canonical" to the head of a page tells Google which URL takes precedence and helps make sure that the correct page ranks.

Canonical URLs are a great way to keep things simple and organized, which is how Google likes things, so using them is a best practice. It is ultimately good for readers too, since serving them well is what Google strives for.
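For reference, this is roughly what the tag looks like and how you could read it back out of a page with Python's standard library. The product page markup, the example.com URL and the helper names are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical variant page that points to its preferred URL.
PRODUCT_PAGE = """<html><head>
<title>Running shoes, sorted by price</title>
<link rel="canonical" href="https://example.com/shoes/">
</head><body>...</body></html>"""


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def canonical_url(html):
    """Return the canonical URL declared in html, or None if absent."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


print(canonical_url(PRODUCT_PAGE))
```

Every near-duplicate variant of a page should report the same canonical URL, and the preferred page itself can point to its own URL.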

Lots of bloggers like to syndicate content, which essentially means allowing a site on a different domain to republish something from your domain. This can confuse Google, which could end up ranking the syndicated copy on the other site over your original. To avoid this, the blogger should stipulate that the republished article carry a rel="canonical" tag pointing back to the original URL.

Another technique to make sure that Google knows what to do with your content was to set URL parameter rules in Google Search Console, although Google retired that tool in 2022. There is also some debate over whether you should use a canonical URL tag or a 301 redirect. There is no easy answer, as each has its pros and cons: a 301 sends both visitors and crawlers to the new URL, while a canonical tag leaves both pages accessible. One thing that’s for sure is that a canonical tag is much easier to set up than a 301.
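To show what the 301 side of that comparison involves, here is a minimal sketch of a redirect using Python's built-in HTTP server. The paths in the redirect map and the function names are hypothetical. A real site would normally configure redirects in its web server or CMS, not in a script like this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of retired URLs to the pages that replace them.
REDIRECTS = {
    "/old-landing-page": "/landing-page",
}


def resolve(path):
    """Return (status, headers) for a request path."""
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 means the move is permanent; Location names the new URL.
        return 301, {"Location": target}
    return 200, {"Content-Type": "text/html; charset=utf-8"}


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = resolve(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if status == 200:
            self.wfile.write(b"<html><body>Current page</body></html>")


def serve(port=8000):
    """Start a local test server; call this manually to try it out."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Unlike a canonical tag, the redirect means nobody can load the old URL any more, which is why it suits pages you have truly retired.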


These are some of the easiest ways to take control of how Google treats your site. Never take it for granted that Google will work out the best way to treat it on its own.