Overview
The seoClarity crawler can only crawl your site if you allow it to. With the volume of bad bots increasing every day, most sites are tightening their security to block unknown bots from accessing their site. It is therefore important that the seoClarity crawler be added to your site's allow list for crawling.
Please review the FAQs below for common questions on how to add the seoClarity crawler to your site's allow list.
Allowing The seoClarity Crawler To Crawl Your Site Video Overview
Q: When running a crawl using Site Audits, what are the different indicators of being blocked by the site?
There are several indicators that the crawler is being blocked by the site.
1. HTTP Status Codes: Check the HTTP status codes returned by the server. Common status codes indicating blocking include:
403 Forbidden: This status code is often returned when the server refuses to fulfill the request, indicating that the request is understood but not allowed.
429 Too Many Requests: Indicates that the user has sent too many requests in a given amount of time ("rate limiting").
503 Service Unavailable: Indicates that the server is temporarily unable to handle the request due to maintenance or overloading.
302 Found (temporary redirect): A large number of 302s in a crawl may indicate blocking. Instead of outright blocking the crawler, the site responds with a 302 redirect loop.
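The status-code checks above can be sketched as a small script that summarizes a crawl's responses. This is an illustrative example, not part of the seoClarity platform; the field names, codes, and the idea of flagging a high share of 302s are assumptions drawn from the list above.

```python
# Illustrative sketch: tally status codes from a crawl export to spot
# likely blocking. The codes and interpretations follow the list above.
BLOCK_SIGNALS = {
    403: "Forbidden - request understood but refused",
    429: "Too Many Requests - rate limiting",
    503: "Service Unavailable - maintenance, overload, or soft blocking",
}

def summarize_statuses(status_codes):
    """Count suspicious status codes and measure the share of 302s."""
    summary = {code: 0 for code in BLOCK_SIGNALS}
    redirects = 0
    for code in status_codes:
        if code in summary:
            summary[code] += 1
        elif code == 302:
            redirects += 1
    # A large share of 302s can indicate a redirect loop used for blocking.
    redirect_share = redirects / len(status_codes) if status_codes else 0.0
    return summary, redirect_share

codes = [200, 403, 429, 302, 302, 302, 200, 503]
summary, share = summarize_statuses(codes)
print(summary)
print(round(share, 2))
```

A crawl where 403/429/503 counts or the 302 share spike compared to earlier crawls is a strong hint that the crawler is being blocked rather than the site being broken.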
2. Empty Responses or Unexpected Content: If you're receiving empty responses or content that doesn't match what you expect, the site may be serving different content to your crawler or blocking access to specific pages or resources.
The simplest way to check this is to review the titles received for the pages in the crawl and compare them with the on-page titles in the browser. Some examples of titles returned by pages that are being blocked or rate limited are:
a. Generic Error Titles:
"Access Denied"
"403 Forbidden"
"404 Not Found"
"Service Unavailable"
"Rate Limit Exceeded"
b. CAPTCHA Titles:
"Please Complete the CAPTCHA"
"CAPTCHA Verification Required"
"Prove You're Not a Robot"
"Security Check Required"
c. Custom Block Titles:
"Whoops, we couldn't find that."
"Access Blocked Due to Suspicious Activity"
"You Have Been Blocked"
"Blocked by Security Firewall"
"Access Denied: Excessive Requests"
"Unauthorized Access Detected"
d. Rate Limiting Titles:
"Rate Limit Exceeded: Too Many Requests"
"API Rate Limit Reached"
"Rate Limit Error"
"Too Many Requests: Please Try Again Later"
"Rate Limit Exceeded: Access Restricted"
e. Empty or Null Titles:
In some cases, the title tag of the blocked page may be empty or null, indicating that the server did not provide a meaningful title for the response.
f. Redirect Titles:
"Redirected: Please Wait"
"Redirect Notice"
"Redirected: Access Denied"
"Redirected: CAPTCHA Verification Required"
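The title categories above can be turned into a quick automated check over a crawl export. This is a hedged sketch: the pattern list below is assembled from the examples in this article and is not exhaustive, so treat matches as signals to investigate rather than proof of blocking.

```python
import re

# Assumed patterns, drawn from the block/CAPTCHA/rate-limit titles above.
BLOCK_TITLE_PATTERNS = [
    r"access denied", r"403 forbidden", r"service unavailable",
    r"captcha", r"not a robot", r"security check",
    r"rate limit", r"too many requests", r"you have been blocked",
]

def looks_blocked(title):
    """Return True if a page title is empty or matches a known block pattern."""
    if not title or not title.strip():
        return True  # empty/null titles are themselves a warning sign
    lowered = title.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCK_TITLE_PATTERNS)
```

Running this over the titles in a crawl and comparing the flagged URLs against what a browser shows for the same pages quickly separates real content from interstitial block pages.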
3. Connection Errors or Timeouts: If your crawler experiences frequent connection errors or timeouts when trying to access the site, it could indicate that the site's servers are actively blocking or throttling your requests.
4. Changes in Response Time: Significant changes in response times or patterns compared to previous crawls may also indicate that the site is actively blocking or throttling access.
5. IP Blocking: If our crawler's IP address is blocked by the website, you may receive explicit error messages indicating that access is denied.
Q: What are the different options for allowing the seoClarity crawler to crawl your site?
There are multiple ways this can be done.
User Agent Allow Listing:
You can add seoClarity's desktop and mobile user agents to your site's allow list, OR simply allow any user agent that contains the string "ClarityBot." If you would like to add the specific user agents to your site's allow list, they are listed below:
- Mobile User Agent: Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)
- Custom User Agent: We also allow entering a custom user agent that is known only to admins, if you prefer. This can be changed each time you start a crawl, or, if you prefer to reuse the same custom user agent, it can be entered in the settings and will auto-populate when starting a crawl.
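The "ClarityBot" substring match described above can be sketched in a few lines. In practice this check usually lives in your WAF or web server configuration rather than application code; the Python below is only a minimal illustration of the matching rule.

```python
# The published seoClarity mobile user agent (from the list above).
CLARITY_MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
    "Chrome/74.0.3729.157 Mobile Safari/537.36 "
    "(compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)"
)

def is_allowed_crawler(user_agent):
    """Allow any request whose User-Agent contains the 'ClarityBot' token."""
    return "ClarityBot" in (user_agent or "")
```

Matching on the "ClarityBot" token covers both the desktop and mobile user agents at once, which is why it is the simpler of the two user-agent options.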
Custom Header:
Another option preferred by some sites is adding a custom header that is sent with every crawl request to your site. The header can be used to recognize the crawl request and allow it to access the site. Simply provide the custom Key:Value pair you would like to add as a header and send it to support@seoclarity.net. The header will then be added to the crawler and used for all crawls in the platform.
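On your side, the custom-header option amounts to checking each incoming request for the agreed Key:Value pair. The header name and value below ("X-Crawl-Token": "example-secret") are made up for illustration; the real pair is whatever you agree with seoClarity support.

```python
# Hypothetical header pair - replace with the Key:Value you agreed on.
EXPECTED_HEADER = ("X-Crawl-Token", "example-secret")

def has_crawl_header(headers):
    """Check an incoming request's headers for the agreed key:value pair."""
    key, value = EXPECTED_HEADER
    return headers.get(key) == value
```

As with user-agent matching, this check would typically be implemented in your WAF or CDN rules rather than in application code.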
Default Headers Used by the seoClarity Crawler:
Below are the default request headers used by the seoClarity crawler.
"Accept": "*/*"
"Accept-Encoding": "*"
"Accept-Language": "*"
Crawler IPs:
Add our crawler IPs to your site's allow list if:
- You want our crawler to crawl your site
- Your site does not support allowing our crawler via a user agent or custom header
Crawler IPs:
64.46.115.134
173.231.185.181
65.60.32.158
65.60.17.146
69.175.98.230
173.231.185.180
64.46.110.58
64.46.110.82
64.46.110.18
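An IP-based allow check can be sketched directly from the list above. The IPs come from this article; the validation logic is an illustrative sketch, since most sites would enter these addresses into firewall or CDN allow rules instead.

```python
import ipaddress

# The crawler IPs published in the list above.
CRAWLER_IPS = {
    "64.46.115.134", "173.231.185.181", "65.60.32.158",
    "65.60.17.146", "69.175.98.230", "173.231.185.180",
    "64.46.110.58", "64.46.110.82", "64.46.110.18",
}

# Parse once so comparisons are on normalized addresses, not raw strings.
ALLOWED = {ipaddress.ip_address(ip) for ip in CRAWLER_IPS}

def is_crawler_ip(remote_addr):
    """Return True if the request's source IP is a listed crawler IP."""
    try:
        return ipaddress.ip_address(remote_addr) in ALLOWED
    except ValueError:
        return False  # malformed address - not an allowed crawler
```

Note that an IP allow list must be kept in sync manually if the published IPs ever change, which is one reason the user-agent and custom-header options are generally preferred.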
Q: What does seoClarity recommend?
Our crawler is a cloud-based crawler that, by default, pulls from a pool of many IPs. By choosing to add a user agent or custom header to your site's allow list, you can make full use of all the benefits of cloud-based crawling at NO EXTRA COST.
Benefits include:
- Run simultaneous crawls on your site.
- Crawl from different regions.
- Run crawls at any speed from 1-100 pages per second. (Crawling at more than 8 pages per second automatically uses multiple IPs to crawl your site.)
- No need to separately add IPs to the allow list for JavaScript crawls and page crawls. Adding the user agent or custom header to your site's allow list covers both.
The above benefits are NOT POSSIBLE with Reserved IPs:
- Since the IP(s) must be reserved, you are limited by their availability. For example, if you have only 1 reserved IP and a crawl is already running, you can run only 1 crawl at a time.
- Similarly, each IP is reserved in 1 region, and that region cannot be switched.
If you still prefer reserved IPs, please confirm the following:
- The speed at which you would prefer to run crawls on your site. This determines the number of reserved IPs you need.
- The number of domains you want to use dedicated IPs on. Each domain needs its own reserved IPs.
- The cost. Each reserved IP is charged at $50 per IP per month.