Site Audit Projects

Site Audit Projects Overview

The Site Audit Projects list gives you a high-level view of the different crawls that have been set up for the domain. Watch the video below: "How to Create a Clarity Audit Project"



Background & Requirements

Some sites require a bot to be added to an allow list prior to crawling. You can choose between seoClarity's desktop or mobile user agent or Google's desktop or smartphone user agent. Other user agents can also be specified when setting up the crawl. Learn More

(NOTE: It may take up to 48 hours before pages added to Managed Pages are crawled by the Site Audit.)

Site Audit Report Use Cases

  1. Uncover the pages that deliver the highest audience engagement. Analyzing them more deeply will reveal what factors make those pages so popular and how to update other blog posts and assets to achieve a similar effect.
  2. Reveal what your best-performing topics are, and what you should focus on moving forward.
  3. Discover ideas for future content and fill in your marketing strategy. Learn more

Site Audit Projects Frequently Asked Questions

  1. How do recurring crawls work? Learn More
  2. How to cancel scheduled crawls? Learn More
  3. What type of crawl should I set up: Standard or Javascript? Learn More
  4. Can we use Site Audits on our development environment? Learn More



Project Options

Project Name: Selecting the project name will navigate to the Site Audit page for that project.

Pencil (Edit): This allows the project name to be changed.

Trash (Delete): This will remove the crawl project and all crawls within the project.

Gear (Settings): This will display the starting URL, depth, speed and exclusions for the project.

Lightning Bolt (Alerts): This will display the Site Audit alerts configuration for the project.


How To Set Up A New Site Audit Project

The New Site Audit button brings up a popup that allows a new project and crawl to be set up, or a crawl to be run within an existing project.



Basic Settings tab

This tab contains the essential information needed to initiate a new Site Audit.

Project Type: Select Existing Project to re-use a previously set up project, including its custom settings, or New Project to set up a fresh crawl with no inherited settings.

Project Name: Selecting an existing project will display that project's name; for a new project, specify a name in the text field.

Language: This field is used to tokenize and store the crawl data based on the language entered, which allows for efficient broad match searching of the title, meta description, H1, and H2. The default language is English.

Choose what to crawl: Crawls can be based on a specific URL, sitemap(s), an RSS feed, or an uploaded CSV list.

    Starting URL: Select the protocol (http or https) and input the URL where the crawl should begin. Subdomains are allowed in the Starting URL field when the Broad Match match type is selected in the Domain Settings, Ranking Configuration. A crawl can be started from any of a domain's subdomains as long as the root domain is the same.

A validation string will appear to confirm the current status code of the URL. If the validation returns an error code, it is often due to an issue with the site, such as a slow response (>25 seconds) or the user agent being blocked.

    Sitemap(s): Select the protocol (http or https) and input the URL where the sitemap is located. 

  1. For crawls covering multiple subdomains of a single site, each subdomain can be filtered separately in Site Audits, and an individual sitemap can be generated for each subdomain.
  2. If the sitemaps for each subdomain are hosted on a single domain, the sitemap index needs to be edited to point to each individual sitemap (see the example below).
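
For example, a sitemap index hosted on the main domain could point to a separate sitemap for each subdomain (the URLs below are hypothetical):

        <?xml version="1.0" encoding="UTF-8"?>
        <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <sitemap><loc>https://www.example.com/sitemap.xml</loc></sitemap>
          <sitemap><loc>https://blog.example.com/sitemap.xml</loc></sitemap>
          <sitemap><loc>https://support.example.com/sitemap.xml</loc></sitemap>
        </sitemapindex>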

    RSS: Select the protocol (http or https) and input the URL where the RSS feed is located.

    Upload CSV: If you already know which URLs you want to crawl, place them in a single column in a .csv file to upload.
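
For example, a minimal upload file (hypothetical URLs) is simply one URL per row in a single column:

        https://www.example.com/
        https://www.example.com/products/
        https://www.example.com/blog/post-1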

Crawler Type: The Standard Crawl is the most common option and functions like most crawlers. The JavaScript Enabled crawl renders JavaScript when crawling, similar to how a browser would.

      Standard Crawl: This crawls the source of the page without any rendering, similar to the vast majority of crawlers. Use this to check for maximum compatibility.
      JavaScript Crawl: An advanced version of our crawler that renders every page exactly as it would appear in a browser. Use this to check for issues that Google may encounter with its own JavaScript crawl capabilities. Crawling speed will be slightly slower since the crawler has to wait for JavaScript to finish rendering on each page.
      Block Resources: The resource URLs passed in this option are blocked from rendering when loading the JavaScript on the page. This field accepts multiple URL patterns (one per line). Learn More
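
For example (illustrative patterns only; see the Learn More link for the exact pattern syntax supported), third-party tags or widgets could be blocked with one pattern per line:

        *googletagmanager.com*
        *doubleclick.net*
        *chat-widget.js*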

Warning
      JAVASCRIPT WARNING: By enabling a Javascript crawl of a target website, you agree that:
            - You are authorized to run a crawl on the target website.
            - You are responsible for any issues created by the crawl.
            - Javascript crawls will crawl the pages on the target website and render each page as it would within a browser.
            - Javascript crawls will trigger all javascript and load all resources as a browser would.
            - If the site contains any resources or javascript from sources that charge by the number of times it is loaded/displayed/triggered, etc., you will incur and are solely liable for such costs.

If you disagree with any of the above, DO NOT ENABLE the Javascript crawl option. Starting a crawl with the Javascript option enabled indicates that you agree with all of the above.

Crawl Speed: This is the number of concurrent requests made to the site and is equivalent to the number of pages crawled per second. The time it takes to crawl a site will depend on how the site handles these requests, the size of the pages, download time, and the number of URLs. Speeds greater than 8 pages per second will establish a cluster crawl where multiple pages are crawled simultaneously.
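
As a rough, illustrative estimate only (actual duration also depends on response time, page size, and download time): a site with 100,000 URLs crawled at 8 pages per second would take about 100,000 ÷ 8 = 12,500 seconds, or roughly 3.5 hours.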

    Advanced: Limit the number of pages crawled per day

Crawl Depth: Custom sets the number of links (levels) away from the starting URL that the crawl will look for pages. Full Site Crawl will crawl all URLs found for that domain (depending on configuration, this could take a significant amount of time). Crawl only pages uploaded/found will crawl just the URLs that are specified.
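
As an illustration: pages linked directly from the starting URL are one level (link) away, pages linked from those pages are two levels away, and so on; each additional level of a Custom depth can add many more pages to the crawl.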

Description: This optional text field allows for any additional notes to be entered related to the crawl project.



Advanced Settings tab

Configure commonly used advanced crawl options in this tab.

User Agent: A custom user agent can be set here. By default, or if left blank, 'ClarityBot' will be used. Some domains may require the bot to be added to an allow list so that it is not blocked from crawling the site.

    User Agent Options:

        Google Desktop - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

        Google Mobile - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

        ClarityBot - Mozilla/5.0 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)

        Claritybot (Mobile) - Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36 (compatible; ClarityBot/9.0; +https://www.seoclarity.net/bot.html)

Obey Robots.txt: If no is selected, the crawl bypasses the settings in the robots.txt file of the site to be crawled. By default, the crawler obeys the robots protocol.

    Store Blocked Links: If yes is selected, the crawler stores the links blocked by robots.txt. By default, this is set to no.
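
For example, with the following hypothetical robots.txt on the crawled site and Obey Robots.txt set to yes, URLs under /checkout/ would be skipped; if Store Blocked Links is set to yes, those skipped links would also be recorded:

        User-agent: *
        Disallow: /checkout/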

Enable Cookies: Enabling this option tells the crawler to keep track of cookies sent by web servers, and then send them back on subsequent requests. This is typically used to crawl sites that redirect based on persisting cookies.  

Select Region: Optional setting to crawl from a location closer to where the site is hosted. 

Link Parameter Handling

    Enter URL parameter(s) to remove: Enter URL parameters to remove automatically when crawling, in a comma-separated format. Enter a * (asterisk) to remove all URL parameters from discovered URLs before attempting to crawl them. URL parameters that don't change the content of your pages can interfere with efficient site crawls, as they result in the same page content being available via multiple unique URLs.
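
For example (parameter names are illustrative; use whatever your site actually appends):

        utm_source, utm_medium, utm_campaign, sessionid

With these removed, a discovered link such as https://www.example.com/page?utm_source=news&sessionid=123 would be crawled as https://www.example.com/page.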

    Internal Links Analysis: Enabling this checkbox option will crawl the Internal Links found on the page. 

    HREFLang Crawl: By default, all hreflang annotations found while crawling are captured and displayed in the Hreflang Audit tab of Site Health. Enabling this checkbox option will crawl rel="alternate" hreflang URLs. When enabled, these URLs can also be crawled if the Validate option is enabled.

    Canonical Crawl: By default, all canonicals found while crawling are captured and displayed in the Canonical Audit tab of Site Health. Enabling this checkbox option will crawl all canonical URLs.



Crawling Rules tab

Customize what pages are crawled and how query parameters found in URLs are handled in this tab.

Domain Crawling Rules: Enter one string match pattern per line to allow or deny domains from being crawled.

    Allow domain(s): Enter a list of the domains and subdomains to allow. Any links discovered for the allowed domains will be crawled further. By default, the crawler will pick up and crawl the subdomain of the starting URL. If it is a CSV or sitemap crawl, only the subdomain of the first URL found is crawled. However, if you need to crawl all subdomains, you can do so by entering the root domain. For example, entering xyz.com here will crawl www.xyz.com, support.xyz.com, blog.xyz.com, etc.
  1. To ensure a subdomain is crawled, it must either be discovered on one of the pages already crawled or designated as an additional starting URL.
    Deny domain(s): Enter a list of the domains and subdomains to deny. Any links discovered for the denied domains will not be crawled further.
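
For example (hypothetical domains), to crawl the main site and its blog while skipping a third-party shop domain, you might enter one entry per line:

        Allow domain(s):
            www.xyz.com
            blog.xyz.com
        Deny domain(s):
            partner-shop.com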

Link Crawling Rules: 

    Follow nofollow links: If yes is selected, links with rel="nofollow" will be crawled. By default, nofollow links are not crawled.

    URL pattern(s) to allow: Enter a regex pattern here; URLs matching the pattern will be included in the list of URLs to be crawled and followed further. The URL pattern(s) to disallow take precedence over any patterns entered here.

    URL pattern(s) to disallow: Enter a regex pattern here; URLs matching the pattern will automatically be excluded from being crawled and followed further. These patterns take precedence over any URL patterns specified in the allow field above.

    URLs to crawl but not index: If URLs match the regex pattern specified here, the URLs will be crawled and new links discovered from them, but the content of the URL itself will not be indexed.

    URLs to index but not crawl: If URLs match the regex pattern specified here, the content of the URL will be indexed, but none of the links found on the page will be followed or added to the crawl list.
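
For example (illustrative regex patterns; adjust them to your own URL structure):

        URL pattern(s) to allow:        /products/.*
        URL pattern(s) to disallow:     .*\?(sort|filter)=.*
        URLs to crawl but not index:    /category/page/[0-9]+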

Link Discovery: Enter one string match pattern per line to restrict the region from where links should be found and crawled. 

    Restrict to XPath: Specify an XPath (or list of XPaths) that defines regions inside the response from which links should be extracted. If given, only the content selected by those XPaths will be scanned for links.

    Restrict to CSS: Specify a CSS selector (or list of selectors) that defines regions inside the page being crawled from which links should be extracted. This has the same behavior as Restrict to XPath.
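
For example (illustrative selectors; the actual IDs and classes depend on your site's markup), to extract links only from the main content area and ignore header and footer navigation:

        Restrict to XPath:  //div[@id="main-content"]
        Restrict to CSS:    div.main-content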




Custom tab

Configure any additional content or custom search for a page to be crawled and stored for analysis.

Content extraction: If there is additional content that should be crawled beyond the standard HTML elements (Title, Meta Description, H1, H2), input it here. More additional content can be specified via the Content Extraction button. If the same custom content element specified exists multiple times on a page, only the first instance will be retrieved.

    XPATH: Enter the specific XPath found on pages that you would like the crawler to retrieve and analyze. Make sure to specify an XPath that uniquely identifies the content you want to retrieve.

    CSS: Enter the specific CSS selector found on pages that you would like the crawler to retrieve and analyze. Make sure to specify a selector that uniquely identifies the content you want to retrieve.

    DIV_ID: Enter the specific Div ID found on pages that you would like the crawler to retrieve and analyze. Make sure to specify a Div ID that uniquely identifies the content you want to retrieve.

    DIV_CLASS: Enter the specific Div class found on pages that you would like the crawler to retrieve and analyze. Make sure to specify a Div class that uniquely identifies the content you want to retrieve.
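
For example (illustrative values for a hypothetical product page; the selectors depend on your own markup):

        XPATH:      //span[@itemprop="price"]
        CSS:        span.product-price
        DIV_ID:     author-bio
        DIV_CLASS:  breadcrumb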

Content Match: This section allows you to capture and display pages in the Custom Search Tab in Site Health based on the input entered here. There are 3 options:

    Contains: This captures the pages, and the count of occurrences per page, that match the string entered.

    Does Not Contain: This returns the pages that do not contain the string entered. 

    Regex: This returns the pages and count of occurrences based on the regex pattern entered here. 
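
For example (illustrative inputs), to find pages still carrying an old analytics ID, pages missing a tag, or images loaded over http:

        Contains:          UA-12345678
        Does Not Contain:  gtag(
        Regex:             <img[^>]*src="http://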



Start Audit tab

Choose when to start the Audit.

Frequency: Choose to run a one time crawl or a Weekly Recurring, Bi-Weekly Recurring, or Monthly Recurring crawl. Choosing a recurring crawl will allow you to select how many months you want to schedule the recurring crawls for. 

Launch Crawl: Selection for when the crawl should launch.

    Start Now: The crawl queues up and begins crawling shortly after Start Site Audit is selected.

    Start Later: Use this option to schedule a crawl to start at a later date and time.

        Start Date: Select the date and time to schedule the crawl.

Schedule Interval: Scheduling allows full control over the hours of the day and the days of the week that the crawler should run. This can be used to ensure that crawl activity takes place only during off-peak hours or during times of low server load. This takes precedence over the Launch Crawl settings. 

Crawl Alerts: You can create email alerts when setting up a new Site Audit Project. Designated recipients can be alerted when:
  1. A crawl has been initiated, completed, or has an error.
  2. The crawl detects an increase in selected issues.
  3. There is a change to any additional custom content (if any was designated).
Update Crawl Config: This option allows you to apply the current settings to all future scheduled crawls within this project.


The crawl initiation, completion, and error emails are only sent to the user setting up the crawl.  

Enable Alerts
  1. Crawl Launch: Receive an email alert only when the crawl launches and completes.
  2. Increase in Issues: Click on the pencil icon to customize in the right-side window, then select the specific issues you would like to receive an alert for.
  3. Changes to Additional Content/Custom Content: If additional content/custom content was specified, this option will be available to select.
When any of the conditions you designate occurs during an audit crawl, the email addresses you specify will receive an email alerting them to the specific change.

Send Crawl Alerts to: Enter the email addresses of other team members that need to receive the alerts under the Enable Alerts section.



Archiving Old Crawls: Crawl data is available for a period of 12 months, after which the data is archived. Summary data in Site Audit Projects and Site Audit Reports for archived crawls will still be available. Archived crawl data can be extracted by sending a request to support@seoclarity.net.

Pausing Crawls: Crawls can be temporarily paused. After 7 days, paused crawls are automatically stopped.  

How to cancel scheduled crawls?

In case you need to cancel any scheduled crawls, you can do so by going to your Site Audit projects page and selecting the "Crawl log" tab. Once you are in the "Crawl log" section, you can select the scheduled crawls and choose to remove some or all future crawls by clicking on the Action dropdown and selecting "Remove crawls".



Please note that existing Site Audit projects with a recurring frequency are not editable. However, you can cancel future scheduled crawls and set up a new site audit project with the updated settings that you require. This will allow you to make necessary changes while maintaining the integrity of your existing project.

Why is my crawl slow?

If your crawl is taking longer than expected, there are a few factors that could contribute to this. Here are some possible reasons and steps you can take to address them:

Crawl Speed: Check the crawl speed settings in your site audit project. If the crawl speed is set too low, it can significantly slow down the crawl process. Adjust the crawl speed to a higher value if necessary.

Page Size and Complexity: Large or complex pages can take longer to crawl. If your website has pages with heavy content, multimedia elements, or complex code, it can slow down the crawl. 

Crawl Depth: The depth of the crawl can also affect the duration. If you have set a deep crawl depth, it may take longer to crawl all the pages on your website. Consider adjusting the crawl depth to focus on the most important pages or sections first.

Crawl Schedule: If you have scheduled your crawl to run during peak hours or when server load is high, it can slow down the crawl. Consider rescheduling the crawl to off-peak hours or times of low server activity.

If you have tried these steps and your crawl is still taking an unusually long time, please reach out to support@seoclarity.net for further assistance.

Related Articles

    • Setting up a Site Audit
    • Site Audit Report
    • Site Audit Details
    • Site Audit Settings
    • Sitemap settings