Allowing The seoClarity Crawler To Crawl Your Site

Overview

The seoClarity crawler can only crawl your site if you allow it to. With the volume of bad bots increasing every day, most sites are tightening their security to block unknown bots from accessing their site. For that reason, it is important that the seoClarity crawler be added to your site's allow list for crawling.
Please review the FAQs below for common questions about crawling and how to add the seoClarity crawler to your site's allow list.

Video Overview

Q: When running a crawl using Site Audits, what are the different indicators of being blocked by the site?
There are several indicators that the crawler is being blocked by the site (a diagnostic sketch follows this list).

      1. HTTP Status Codes: Check the HTTP status codes returned by the server. Common status codes that indicate blocking include:
            403 Forbidden: The server understood the request but refuses to fulfill it.
            429 Too Many Requests: The client has sent too many requests in a given amount of time ("rate limiting").
            503 Service Unavailable: The server is temporarily unable to handle the request due to maintenance or overloading.
            302 Found (Temporary Redirect): A large number of 302s in a crawl may indicate blocking. Instead of rejecting the crawler outright, the site responds with a 302 redirect loop.

      2. Empty Responses or Unexpected Content: If you receive empty responses or content that does not match what you expect, the site may be serving different content to the crawler or blocking access to specific pages or resources.
                 The simplest way to check this is to review the titles received for the pages in the crawl and compare them with the on-page titles in a browser. Some examples of titles returned by pages that are being blocked or rate limited are:
                        a. Generic Error Titles:
                                    "Access Denied"
                                    "403 Forbidden"
                                    "404 Not Found"
                                    "Service Unavailable"
                                    "Rate Limit Exceeded"
                        b. CAPTCHA Titles:
                                    "Please Complete the CAPTCHA"
                                    "CAPTCHA Verification Required"
                                    "Prove You're Not a Robot"
                                    "Security Check Required"
                        c. Custom Block Titles:
                                    "Whoops, we couldn't find that."
                                    "Access Blocked Due to Suspicious Activity"
                                    "You Have Been Blocked"
                                    "Blocked by Security Firewall"
                                    "Access Denied: Excessive Requests"
                                    "Unauthorized Access Detected"
                        d. Rate Limiting Titles:
                                    "Rate Limit Exceeded: Too Many Requests"
                                    "API Rate Limit Reached"
                                    "Rate Limit Error"
                                    "Too Many Requests: Please Try Again Later"
                                    "Rate Limit Exceeded: Access Restricted"
                        e. Empty or Null Titles:
                                    In some cases, the title tag of the blocked page may be empty or null, indicating that the server did not provide a meaningful title for the response.
                        f. Redirect Titles:
                                    "Redirected: Please Wait"
                                    "Redirect Notice"
                                    "Redirected: Access Denied"
                                    "Redirected: CAPTCHA Verification Required"
     
      3. Connection Errors or Timeouts: If the crawler experiences frequent connection errors or timeouts when trying to access the site, the site's servers may be actively blocking or throttling the requests.

      4. Changes in Response Time: Significant changes in response times or patterns compared to previous crawls may also indicate that the site is actively blocking or throttling access.

      5. IP Blocking: If our crawler's IP address is blocked by the website, you may receive explicit error messages indicating that access is denied.
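These checks can be scripted for a quick spot check. Below is a minimal diagnostic sketch in Python; the requests library, the placeholder URL, and the title keywords are illustrative assumptions, not part of the seoClarity platform.

import re
import time

import requests

# Substrings drawn from the block/CAPTCHA/rate-limit titles listed above.
BLOCK_TITLE_HINTS = [
    "access denied", "forbidden", "captcha", "rate limit",
    "too many requests", "blocked", "security check",
]
BLOCK_STATUS_CODES = {403, 429, 503}
DESKTOP_UA = ("Mozilla/5.0 (compatible; ClarityBot/9.0; "
              "+https://www.seoclarity.net/bot.html)")

def check_url(url: str) -> None:
    """Fetch a URL the way the crawler would and print any blocking signals."""
    start = time.monotonic()
    try:
        resp = requests.get(url, headers={"User-Agent": DESKTOP_UA},
                            timeout=15, allow_redirects=False)
    except requests.RequestException as exc:
        # Indicator 3: connection errors or timeouts.
        print(f"{url}: connection error or timeout ({exc})")
        return
    elapsed = time.monotonic() - start

    # Indicator 1: status codes commonly returned when a crawler is blocked.
    if resp.status_code in BLOCK_STATUS_CODES:
        print(f"{url}: HTTP {resp.status_code} -- likely blocked or rate limited")
    elif resp.status_code == 302:
        print(f"{url}: 302 redirect to {resp.headers.get('Location')} -- check for redirect loops")

    # Indicator 2: empty titles or known block-page titles.
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.I | re.S)
    title = match.group(1).strip() if match else ""
    if not title or any(hint in title.lower() for hint in BLOCK_TITLE_HINTS):
        print(f"{url}: suspicious or empty title: {title!r}")

    # Indicator 4: unusually slow responses can signal throttling.
    print(f"{url}: responded in {elapsed:.1f}s")

check_url("https://www.example.com/")  # placeholder URL

Comparing the output against the same pages fetched with a normal browser user agent will quickly show whether the site treats the crawler differently.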


Q: What are the different options for allowing the seoClarity crawler to crawl your site?

There are multiple ways to do this; a combined sketch showing all three checks follows the Crawler IPs list below.
  1. User Agent Allow Listing:

You can add seoClarity's desktop and mobile user agents to your site's allow list, OR simply allow any user agent that contains the string "ClarityBot". If you would like to add the specific user agents to your site's allow list, they are listed below:
    1. Desktop User Agent: Mozilla/5.0 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)
    2. Mobile User Agent: Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)
    3. Custom User Agent: We also allow entering a custom user agent that is known only to admins, if you prefer. It can be changed every time you start a crawl, or, if you prefer to re-use the same custom user agent, it can be entered into the settings and will auto-populate when starting a crawl.
  2. Custom Headers:

Another option preferred by some sites is adding a custom header that is sent with every crawl request to your site. The header can be used to recognize the crawl request and allow it to access the site. Simply provide the custom Key:Value pair you would like to add as a header and send it to support@seoclarity.net. This header will then be added to the crawler and used for all crawls in the platform.

Default Headers Used by the seoClarity Crawler:
      Below are the default request headers used by the seoClarity crawler. A request reproducing these defaults is sketched after this list.
"Accept": "*/*"
"Accept-Encoding": "*"
"Accept-Language": "*"

Crawler IPs

Add our crawler IPs to your site's allow list if:
  1. You want our crawler to crawl your site, and
  2. Your site does not support allowing our crawler via a user agent or custom header.

Crawler IPs:
      64.46.115.134
      173.231.185.181
      65.60.32.158
      65.60.17.146
      69.175.98.230
      173.231.185.180
      64.46.110.58
      64.46.110.82
      64.46.110.18
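How the allow list is enforced depends on your stack, but the decision logic is the same for all three options. Below is a minimal server-side sketch in Python; the header name "X-Crawl-Token" and its value are placeholders for whatever Key:Value pair you register with support@seoclarity.net.

# Hypothetical allow-list check combining the three options above.
SEOCLARITY_IPS = {
    "64.46.115.134", "173.231.185.181", "65.60.32.158",
    "65.60.17.146", "69.175.98.230", "173.231.185.180",
    "64.46.110.58", "64.46.110.82", "64.46.110.18",
}

def is_allowed_crawler(user_agent: str, headers: dict, client_ip: str) -> bool:
    """Return True if the request should bypass bot protection."""
    # Option 1: user agent allow listing. Matching on the "ClarityBot"
    # substring covers both the desktop and mobile user agents.
    if "ClarityBot" in user_agent:
        return True
    # Option 2: custom header agreed with seoClarity support
    # (placeholder name and value).
    if headers.get("X-Crawl-Token") == "your-shared-secret":
        return True
    # Option 3: reserved crawler IPs (only needed if your site cannot
    # allow list by user agent or custom header).
    return client_ip in SEOCLARITY_IPS

# Example: a request carrying the desktop user agent is allowed.
print(is_allowed_crawler(
    "Mozilla/5.0 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)",
    {}, "203.0.113.10"))  # True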

Q: What does seoClarity recommend?

Our crawler is a cloud-based crawler that by default pulls from a pool of many IPs. By adding a user agent or custom header to your site's allow list, you can make full use of all the benefits of cloud-based crawling at NO EXTRA COST.
Benefits include:
    1. Run simultaneous crawls on your site.
    2. Crawl from different regions.
    3. Run crawls at any speed from 1 to 100 pages per second. (Crawling at more than 8 pages per second automatically uses multiple IPs to crawl your site.)
    4. No need to add IPs to the allow list for JavaScript crawls and page crawls separately. Adding the user agent or custom header to your site's allow list is sufficient.
The above benefits are NOT POSSIBLE when using reserved IPs:
  1. Since the IP(s) need to be reserved, you are limited by their availability. For example, if you have only 1 reserved IP and a crawl is already running, you can run only 1 crawl at a time.
  2. Similarly, each IP is reserved in 1 region, and that region cannot be switched.


Q: What information do I need to provide to set up a reserved IP?

  1. Confirm the speed at which you would prefer to run crawls on your site. This is used to determine the number of reserved IPs you need.
  2. Confirm the number of domains you want to use dedicated IPs on. Each domain needs its own reserved IPs.
  3. Confirm the cost. Each reserved IP is charged at a rate of $50 per IP per month. For example, reserving one IP for each of two domains costs $100 per month.
Send a support ticket to support@seoclarity.net or your CSM with the confirmation.



 

Related Articles

    • Site Audit Projects
    • Why did my crawl return an error?
    • Site Audit Details
    • Site Audit Settings
    • How do I configure Cloudflare to allow the Claritybot user agent to crawl my site?