Allowing The seoClarity Crawler To Crawl Your Site

Overview

The seoClarity crawler can only crawl your site if you allow it to. With the volume of bad bots increasing every day, most sites are tightening their security to block unknown bots from accessing their site. For that reason, it is important that the seoClarity crawler be added to your site's allow list for crawling.
Please review the FAQs below for common questions about crawling and how to add the seoClarity crawler to your site's allow list.

Video Overview

Q: When running a crawl using Site Audits, what are the different indicators of being blocked by the site?
There are several indicators that the crawler is being blocked by the site (a diagnostic sketch follows this list).

      1. HTTP Status Codes: Check the HTTP status codes returned by the server. Common status codes that indicate blocking include:
            403 Forbidden: The server understood the request but refuses to fulfill it.
            429 Too Many Requests: The client has sent too many requests in a given amount of time ("rate limiting").
            503 Service Unavailable: The server is temporarily unable to handle the request due to maintenance or overloading.
            302 Found (Temporary Redirect): A large number of 302s in a crawl may indicate blocking. Instead of rejecting the crawler outright, the site responds with a 302 redirect loop.

      2. Empty Responses or Unexpected Content: If you receive empty responses or content that does not match what you expect, the site may be serving different content to the crawler or blocking access to specific pages or resources.
                 The simplest way to check this is to review the titles received for the pages in the crawl and compare them with the on-page titles in a browser. Some examples of titles returned by pages that are being blocked or rate limited are:
                        a. Generic Error Titles:
                                    "Access Denied"
                                    "403 Forbidden"
                                    "404 Not Found"
                                    "Service Unavailable"
                                    "Rate Limit Exceeded"
                        b. CAPTCHA Titles:
                                    "Please Complete the CAPTCHA"
                                    "CAPTCHA Verification Required"
                                    "Prove You're Not a Robot"
                                    "Security Check Required"
                        c. Custom Block Titles:
                                    "Whoops, we couldn't find that."
                                    "Access Blocked Due to Suspicious Activity"
                                    "You Have Been Blocked"
                                    "Blocked by Security Firewall"
                                    "Access Denied: Excessive Requests"
                                    "Unauthorized Access Detected"
                        d. Rate Limiting Titles:
                                    "Rate Limit Exceeded: Too Many Requests"
                                    "API Rate Limit Reached"
                                    "Rate Limit Error"
                                    "Too Many Requests: Please Try Again Later"
                                    "Rate Limit Exceeded: Access Restricted"
                        e. Empty or Null Titles:
                                    In some cases, the title tag of the blocked page may be empty or null, indicating that the server did not provide a meaningful title for the response.
                        f. Redirect Titles:
                                    "Redirected: Please Wait"
                                    "Redirect Notice"
                                    "Redirected: Access Denied"
                                    "Redirected: CAPTCHA Verification Required"
     
      3. Connection Errors or Timeouts: If the crawler experiences frequent connection errors or timeouts when trying to access the site, the site's servers may be actively blocking or throttling the requests.

      4. Changes in Response Time: Significant changes in response times or patterns compared to previous crawls may also indicate that the site is actively blocking or throttling access.

      5. IP Blocking: If our crawler's IP address is blocked by the website, you may receive explicit error messages indicating that access is denied.
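These checks can be scripted for a quick spot check. Below is a minimal diagnostic sketch in Python; the requests library, the placeholder URL, and the title keywords are illustrative assumptions, not part of the seoClarity platform.

import re
import time

import requests

# Substrings drawn from the block/CAPTCHA/rate-limit titles listed above.
BLOCK_TITLE_HINTS = [
    "access denied", "forbidden", "captcha", "rate limit",
    "too many requests", "blocked", "security check",
]
BLOCK_STATUS_CODES = {403, 429, 503}
DESKTOP_UA = ("Mozilla/5.0 (compatible; ClarityBot/9.0; "
              "+https://www.seoclarity.net/bot.html)")

def check_url(url: str) -> None:
    """Fetch a URL the way the crawler would and print any blocking signals."""
    start = time.monotonic()
    try:
        resp = requests.get(url, headers={"User-Agent": DESKTOP_UA},
                            timeout=15, allow_redirects=False)
    except requests.RequestException as exc:
        # Indicator 3: connection errors or timeouts.
        print(f"{url}: connection error or timeout ({exc})")
        return
    elapsed = time.monotonic() - start

    # Indicator 1: status codes commonly returned when a crawler is blocked.
    if resp.status_code in BLOCK_STATUS_CODES:
        print(f"{url}: HTTP {resp.status_code} -- likely blocked or rate limited")
    elif resp.status_code == 302:
        print(f"{url}: 302 redirect to {resp.headers.get('Location')} -- check for redirect loops")

    # Indicator 2: empty titles or known block-page titles.
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.I | re.S)
    title = match.group(1).strip() if match else ""
    if not title or any(hint in title.lower() for hint in BLOCK_TITLE_HINTS):
        print(f"{url}: suspicious or empty title: {title!r}")

    # Indicator 4: unusually slow responses can signal throttling.
    print(f"{url}: responded in {elapsed:.1f}s")

check_url("https://www.example.com/")  # placeholder URL

Comparing the output against the same pages fetched with a normal browser user agent will quickly show whether the site treats the crawler differently.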


Q: What are the different options for allowing the seoClarity crawler to crawl your site?

There are multiple ways to do this; a combined sketch showing all three checks follows the Crawler IPs list below.
  1. User Agent Allow Listing:

You can add seoClarity's desktop and mobile user agents to your site's allow list, OR simply allow any user agent that contains the string "ClarityBot". If you would like to add the specific user agents to your site's allow list, they are listed below:
    1. Desktop User Agent: Mozilla/5.0 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)
    2. Mobile User Agent: Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)
    3. Custom User Agent: We also allow entering a custom user agent that is known only to admins, if you prefer. It can be changed every time you start a crawl, or, if you prefer to re-use the same custom user agent, it can be entered into the settings and will auto-populate when starting a crawl.
  2. Custom Headers:

Another option preferred by some sites is adding a custom header that is sent with every crawl request to your site. The header can be used to recognize the crawl request and allow it to access the site. Simply provide the custom Key:Value pair you would like to add as a header and send it to support@seoclarity.net. This header will then be added to the crawler and used for all crawls in the platform.

Default Headers Used by the seoClarity Crawler:
      Below are the default request headers used by the seoClarity crawler. A request reproducing these defaults is sketched after this list.
"Accept": "*/*"
"Accept-Encoding": "*"
"Accept-Language": "*"

Crawler IPs

Add our crawler IPs to your site's allow list if:
  1. You want our crawler to crawl your site, and
  2. Your site does not support allowing our crawler via a user agent or custom header.

Crawler IPs:
      64.46.115.134
      173.231.185.181
      65.60.32.158
      65.60.17.146
      69.175.98.230
      173.231.185.180
      64.46.110.58
      64.46.110.82
      64.46.110.18
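How the allow list is enforced depends on your stack, but the decision logic is the same for all three options. Below is a minimal server-side sketch in Python; the header name "X-Crawl-Token" and its value are placeholders for whatever Key:Value pair you register with support@seoclarity.net.

# Hypothetical allow-list check combining the three options above.
SEOCLARITY_IPS = {
    "64.46.115.134", "173.231.185.181", "65.60.32.158",
    "65.60.17.146", "69.175.98.230", "173.231.185.180",
    "64.46.110.58", "64.46.110.82", "64.46.110.18",
}

def is_allowed_crawler(user_agent: str, headers: dict, client_ip: str) -> bool:
    """Return True if the request should bypass bot protection."""
    # Option 1: user agent allow listing. Matching on the "ClarityBot"
    # substring covers both the desktop and mobile user agents.
    if "ClarityBot" in user_agent:
        return True
    # Option 2: custom header agreed with seoClarity support
    # (placeholder name and value).
    if headers.get("X-Crawl-Token") == "your-shared-secret":
        return True
    # Option 3: reserved crawler IPs (only needed if your site cannot
    # allow list by user agent or custom header).
    return client_ip in SEOCLARITY_IPS

# Example: a request carrying the desktop user agent is allowed.
print(is_allowed_crawler(
    "Mozilla/5.0 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)",
    {}, "203.0.113.10"))  # True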

Q: What does seoClarity recommend?

Our crawler is a cloud-based crawler that by default pulls from a pool of many IPs. By adding a user agent or custom header to your site's allow list, you can make full use of all the benefits of cloud-based crawling at NO EXTRA COST.
Benefits include:
    1. Run simultaneous crawls on your site.
    2. Crawl from different regions.
    3. Run crawls at any speed from 1 to 100 pages per second. (Crawling at more than 8 pages per second automatically uses multiple IPs to crawl your site.)
    4. No need to add IPs to the allow list for JavaScript crawls and page crawls separately. Adding the user agent or custom header to your site's allow list is sufficient.
The above benefits are NOT POSSIBLE when using reserved IPs:
  1. Since the IP(s) need to be reserved, you are limited by their availability. For example, if you have only 1 reserved IP and a crawl is already running, you can run only 1 crawl at a time.
  2. Similarly, each IP is reserved in 1 region, and that region cannot be switched.


Q: What information do I need to provide to set up a reserved IP?

  1. Confirm the speed at which you would prefer to run crawls on your site. This is used to determine the number of reserved IPs you need.
  2. Confirm the number of domains you want to use dedicated IPs on. Each domain needs its own reserved IPs.
  3. Confirm the cost. Each reserved IP is charged at a rate of $50 per IP per month. For example, reserving one IP for each of two domains costs $100 per month.
Send a support ticket to support@seoclarity.net or your CSM with the confirmation.



 

Related Articles

    • Site Audit Projects
    • Why did my crawl return an error?
    • Site Audit Details
    • Site Audit Settings
    • How do I configure Cloudflare to allow the Claritybot user agent to crawl my site?