In Part One of our three-part series, we learned what bots are and why crawl budgets are important. Let’s take a look at how to let the search engines know what’s important and some common coding issues.
How to let search engines know what’s important
When a bot crawls your site, a number of cues direct it through your files.
Like humans, bots follow links to get a sense of the information on your site. But they also look through your code and directories for specific files, tags and elements. Let’s take a look at these elements.
Robots.txt
The first thing a bot will look for on your site is your robots.txt file, which lives at the root of your domain (for example, example.com/robots.txt).
For complex sites, a robots.txt file is essential. For smaller sites with just a handful of pages, a robots.txt file may not be necessary — without it, search engine bots will simply crawl everything on your site.
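If you do want an explicit file in place, a minimal robots.txt that permits all crawling (equivalent to having no file at all) would look like this:

User-agent: *
Disallow: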
There are two main ways you can guide bots using your robots.txt file.
1. First, you can use the “disallow” directive. This will instruct bots to ignore specific uniform resource locators (URLs), files, file extensions, or even whole sections of your site:
User-agent: Googlebot
Disallow: /example/
Although the disallow directive will stop bots from crawling particular parts of your site (thereby saving crawl budget), it will not necessarily stop those pages from being indexed and showing up in search results.
When that happens, the listing carries the cryptic and unhelpful “no information is available for this page” message in place of a description, which is not something you’ll want to see in your search listings.
That listing came about because of this disallow directive in census.gov/robots.txt:
User-agent: Googlebot
Crawl-delay: 3
Disallow: /cgi-bin/
2. Second, you can use the noindex directive. Noindexing a certain page or file will not stop it from being crawled; it will, however, stop it from being indexed (or remove it from the index). This robots.txt directive is only unofficially supported by Google and is not supported at all by Bing, so be sure to have a User-agent: * set of disallows for Bingbot and any bots other than Googlebot (a more widely supported, page-level alternative is sketched after this example):
User-agent: Googlebot
Noindex: /example/
User-agent: *
Disallow: /example/
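Because support for noindex in robots.txt is patchy, a more reliable way to keep a page out of the index, while still letting bots crawl it, is to set noindex on the page itself. As a rough sketch, you would add a meta robots tag to the page’s <head> (or send the equivalent X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">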
Obviously, since these pages are stil…
[Read the full article on Search Engine Land.]