Overview
Search engines find information about your site by crawling it. According to Google, "The web is like an ever-growing library with billions of books and no central filing system. We use software known as web crawlers to discover publicly available webpages. Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers."
But not everything is crawled. Some pages on your site might be blocked from crawling or indexing, which makes them non-indexable. A page can be non-indexable for any of the following reasons:
- Blocked due to status: If a page returns a non-2xx response to a search engine, it cannot be crawled and indexed. The response may be 3xx (redirect), 4xx (client error), or 5xx (server error).
- Blocked via Robots.txt, Robots Meta, or X-Robots Header: Google's Search Console gives site owners granular choices about how Google crawls their site: they can provide detailed instructions about how to process pages on their sites, can request a re-crawl, or can opt out of crawling altogether using a file called “robots.txt”. Google also obeys instructions not to index a page when the Robots Meta tag is set to noindex or if the X-Robots-Tag in the HTTP response header for the page is set to noindex.
- Canonical on the page is pointing to another page: This tells Google that it should index the canonical URL instead of this page.
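The reasons above can be sketched as a small check. This is only an illustrative sketch, not how seoClarity or any search engine evaluates pages: the function name non_indexable_reasons, its parameters, and the reason labels are hypothetical, the noindex test is a simple substring match, and the canonical comparison only ignores a trailing slash. The robots.txt rules are evaluated with Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser


def non_indexable_reasons(url, status_code, robots_txt_lines,
                          meta_robots="", x_robots_tag="",
                          canonical_url=None, user_agent="Googlebot"):
    """Return the reasons a page is non-indexable (empty list = indexable).

    All names and reason labels here are hypothetical, for illustration only.
    """
    reasons = []

    # 1. Blocked due to status: anything outside 2xx counts as blocked.
    if not 200 <= status_code < 300:
        reasons.append(f"status {status_code // 100}xx")

    # 2. Blocked via robots.txt: parse the rules and test this URL.
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    if not parser.can_fetch(user_agent, url):
        reasons.append("robots.txt")

    # 3. Blocked via robots meta tag or X-Robots-Tag header set to noindex.
    #    Simplification: a plain substring match on the directive value.
    if "noindex" in meta_robots.lower():
        reasons.append("robots meta")
    if "noindex" in x_robots_tag.lower():
        reasons.append("x-robots header")

    # 4. Canonical points to another page.
    #    Simplification: URLs are compared after stripping a trailing slash.
    if canonical_url and canonical_url.rstrip("/") != url.rstrip("/"):
        reasons.append("canonical")

    return reasons


# Example input (hypothetical site and rules).
robots_lines = ["User-agent: *", "Disallow: /private/"]
print(non_indexable_reasons("https://example.com/private/report", 200, robots_lines))
```

A page can match several reasons at once, which is why the sketch returns a list rather than a single value.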
Checking Indexability in seoClarity
The Indexability tab in seoClarity Site Health is designed to help you audit the indexability of your site in a single view. The summary boxes show the count of pages found to be Indexable and Non-Indexable. For non-indexable pages, the summary boxes also show the number of pages blocked for each reason: response status, Robots.txt, Robots Meta, Robots Header, or a canonical that points to another page. Each of these numbers can be selected to filter the non-indexable pages in the table below by a specific blocked reason for more granular analysis.
Indexability Summary Box
Indexable Pages: Displays a count of URLs that are indexable by search engines. Clicking on this takes you to the details tab filtered by the indexable pages found in a crawl.
Non Indexable Pages: Displays a count of URLs that are not indexable by search engines. Clicking on this filters Site Health by the non-indexable pages found in a crawl.
Error Reasons: This displays the count of error reasons found during the crawl. 3xx means Redirection, 4xx means Client error and 5xx means Server error.
Blocked Reasons: This displays the count of blocked reasons found during the crawl.
By Robots.txt: This indicates the count of pages that are disallowed by Robots.txt.
By Robots Meta Tag: This indicates the count of pages that are blocked by the Robots Meta Tag on the page.
By X-Robots Header: This indicates the count of pages that are blocked by an X-Robots Header on the page.
Canonical: This indicates the count of pages that are not indexable because of a canonical on the page pointing to another page.
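To make the relationship between these counts concrete, here is a minimal sketch of how such a summary could be tallied from per-page results. It is an assumption-laden illustration, not seoClarity's implementation: the summarize function, the page dictionaries, and the reason labels are all hypothetical.

```python
from collections import Counter


def summarize(pages):
    """Tally indexable/non-indexable pages and a count per blocked reason.

    `pages` is a hypothetical list of dicts, each with a 'url' and a
    'reasons' list (empty when the page is indexable).
    """
    summary = Counter()
    for page in pages:
        if page["reasons"]:
            summary["non_indexable"] += 1
            # A page may be blocked for more than one reason at once.
            summary.update(page["reasons"])
        else:
            summary["indexable"] += 1
    return dict(summary)


# Hypothetical crawl results.
crawl = [
    {"url": "/", "reasons": []},
    {"url": "/old", "reasons": ["3xx"]},
    {"url": "/private", "reasons": ["robots_txt", "robots_meta"]},
]
print(summarize(crawl))
```

Because one page can carry several reasons, the per-reason counts in a tally like this can add up to more than the non-indexable total.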
Indexability by Depth Summary Box
This bar chart shows the pages found to be Indexable or Blocked, grouped by the depth at which they were found. Since a search engine's crawler traverses a site by following links, the chart is a good way to see how quickly pages can be found on your site. Pages at higher depths take more link hops to reach and may be missed by the search engine crawler.
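Depth here can be read as the smallest number of link hops needed to reach a page from the starting page. The sketch below computes it with a breadth-first search over a small link graph. It is an illustration under stated assumptions only: the function names crawl_depths and pages_per_depth, the links dictionary, and the choice of "/" as the start page are hypothetical, and a real crawler's depth calculation may differ.

```python
from collections import Counter, deque


def crawl_depths(links, start):
    """Breadth-first search: map each reachable page to its minimum link depth.

    `links` is a hypothetical dict of page -> list of pages it links to.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_per_depth(depths):
    """Count pages at each depth, i.e. the data behind a depth bar chart."""
    return dict(sorted(Counter(depths.values()).items()))


# Hypothetical site structure.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/a"], "/c": ["/d"]}
print(pages_per_depth(crawl_depths(site, "/")))
```

Pages that never appear in the result are unreachable by links from the start page, which is the extreme case of being hard for a crawler to find.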
Non Indexable Pages Table
The table on the Indexability tab contains the details of all Non Indexable pages found in the crawl. It gives a URL-level view of why each page is blocked. Each column of the table is described below.
Title/URL: Displays the URL and Title of the page.
Status Code: Displays the status code found for that page on the date of the crawl.
Blocked by Robots Meta Tag: Displays Yes if the robots meta tag found on the page is set to noindex.
Blocked by Robots.txt: Displays Yes if the URL is disallowed by Robots.txt.
Blocked by X-Robots header: Displays Yes if the X-Robots header directive for the URL is noindex.
Robots Meta Tag Value: Displays the value of the robots meta tag where available.
Canonical Type: Displays the canonical URL for the page.