Site Audit - Regex Operators

Site Audit Filter Operations

This KB article explains the operators and regex patterns supported by Site Audits.


Standard operators

Anchoring

Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.

Unlike most regular expression engines, the patterns used in Site Health are always anchored: the pattern provided must match the entire string. For the string "abcde":

ab.*     # match
abcd     # no match
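The anchoring rule can be tried out locally. The following is a minimal sketch using Python's re module, which is not the engine behind Site Audits; re.fullmatch is used here only because it requires the whole string to match, and the helper name matches_entire_string is hypothetical.

```python
import re

def matches_entire_string(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the whole string."""
    # re.fullmatch anchors the pattern at both ends of the string.
    return re.fullmatch(pattern, text) is not None

print(matches_entire_string(r"ab.*", "abcde"))  # True
print(matches_entire_string(r"abcd", "abcde"))  # False

# An unanchored search, by contrast, finds "abcd" inside "abcde".
print(re.search(r"abcd", "abcde") is not None)  # True
```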

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

If you enable optional features (see below) then these characters may also be reserved:

# @ & < >  ~

Any reserved character can be escaped with a backslash, for example "\*". This includes the backslash itself: "\\"

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"
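When a filter value comes from user input or a URL, escaping it by hand is error-prone. Below is a small sketch of a helper that backslash-escapes the reserved characters listed above. The function name escape_literal and its optional_operators parameter are hypothetical and not part of any Site Audit API.

```python
# Reserved characters as listed above. The second set is only
# reserved when the optional operators are enabled.
STANDARD_RESERVED = set('.?+*|{}[]()"\\')
OPTIONAL_RESERVED = set('#@&<>~')

def escape_literal(text: str, optional_operators: bool = True) -> str:
    """Prefix every reserved character in text with a backslash."""
    reserved = set(STANDARD_RESERVED)
    if optional_operators:
        reserved |= OPTIONAL_RESERVED
    return "".join("\\" + ch if ch in reserved else ch for ch in text)

print(escape_literal("john@smith.com"))                            # john\@smith\.com
print(escape_literal("john@smith.com", optional_operators=False))  # john@smith\.com
```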

Match any character

The period "." can be used to represent any character. For string "abcde":

ab...   # match
a.c.e   # match

One-or-more

The plus sign "+" can be used to repeat the preceding shortest pattern one or more times. For string "aaabbb":

a+b+        # match
aa+bb+      # match
a+.+        # match
aa+bbb+     # match

Zero-or-more

The asterisk "*" can be used to match the preceding shortest pattern zero-or-more times. For string "aaabbb":

a*b*        # match
a*b*c*      # match
.*bbb.*     # match
aaa*bbb*    # match

Zero-or-one

The question mark "?" makes the preceding shortest pattern optional. It matches zero or one times. For string "aaabbb":

aaa?bbb?    # match
aaaa?bbbb?  # match
.....?.?    # match
aa?bb?      # no match

Min-to-max

Curly brackets "{}" can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5}     # repeat exactly 5 times
{2,5}   # repeat at least twice and at most 5 times
{2,}    # repeat at least twice

For string "aaabbb":

a{3}b{3}        # match
a{2,4}b{2,4}    # match
a{2,}b{2,}      # match
.{3}.{3}        # match
a{4}b{4}        # no match
a{4,6}b{4,6}    # no match
a{4,}b{4,}      # no match
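The repetition operators ("+", "*", "?" and "{}") behave the same way in most regex engines, so a handful of the examples above can be checked locally. This is a sketch using Python's re.fullmatch as a stand-in for the always-anchored matching described earlier; it does not call Site Audits, and the CASES list simply copies patterns and expected results from this article.

```python
import re

# (pattern, expected result) pairs taken from the examples for "aaabbb".
CASES = [
    (r"a+b+", True),
    (r"a*b*c*", True),
    (r"aaaa?bbbb?", True),
    (r"aa?bb?", False),
    (r"a{2,4}b{2,4}", True),
    (r"a{4,}b{4,}", False),
]

for pattern, expected in CASES:
    matched = re.fullmatch(pattern, "aaabbb") is not None
    print(f"{pattern:<14} {'match' if matched else 'no match'}")
    # Fail loudly if the local engine disagrees with the documented result.
    assert matched == expected
```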

Grouping

Parentheses "()" can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string "ababab":

(ab)+       # match
ab(ab)+     # match
(..)+       # match
(...)+      # match
(ab)*       # match
abab(ab)?   # match
ab(ab)?     # no match
(ab){3}     # match
(ab){1,2}   # no match

Alternation

The pipe symbol "|" acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string "aabb":

aabb|bbaa   # match
aacc|bb     # no match
aa(cc|bb)   # match
a+|b+       # no match
a+b+|b+a+   # match
a+(b|c)+    # match
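The "longest pattern" rule is the most common source of surprises with alternation. The sketch below uses Python's re.fullmatch as an approximation of anchored matching (it is not the Site Audit engine, and the helper name matches is hypothetical) to show how parentheses limit the scope of "|".

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Anchored match: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

# "|" splits the whole pattern: a+|b+ means "only a's" OR "only b's".
print(matches(r"a+|b+", "aabb"))      # False

# Without a group, aacc|bb means "aacc" OR "bb".
print(matches(r"aacc|bb", "aabb"))    # False

# Parentheses restrict the alternation to one part of the pattern.
print(matches(r"aa(cc|bb)", "aabb"))  # True
```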

Character classes

Ranges of potential characters may be represented as character classes by enclosing them in square brackets "[]". A leading ^ negates the character class. The allowed forms are:

[abc]   # 'a' or 'b' or 'c'
[a-c]   # 'a' or 'b' or 'c'
[-abc]  # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc]  # any character except 'a' or 'b' or 'c'
[^a-c]  # any character except 'a' or 'b' or 'c'
[^-abc]  # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash "-" indicates a range of characters unless it is the first character in the class or is escaped with a backslash.

For string "abcd":

ab[cd]+     # match
[a-d]+      # match
[^a-d]+     # no match
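The handling of the dash is easy to get wrong, so here is a short sketch that exercises it. Python's re module treats these character class forms the same way; it is used only as a local stand-in for anchored matching, and the helper name matches is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Anchored match: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"ab[cd]+", "abcd"))     # True
print(matches(r"[^a-d]+", "abcd"))     # False

# A leading or escaped dash is a literal "-".
print(matches(r"[-abc]+", "a-b-c"))    # True
print(matches(r"[abc\-]+", "a-b-c"))   # True

# Between two characters the dash defines a range, so "-" is not in the class.
print(matches(r"[a-c]+", "a-b-c"))     # False
```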

Optional operators

These operators are available by default because the flags parameter defaults to ALL. To enable only specific operators, list the corresponding flags, concatenated with "|":

{
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "COMPLEMENT|INTERVAL"
        }
    }
}

Complement

The complement is probably the most useful option. The shortest pattern that follows a tilde "~" is negated. For instance, "ab~cd" means:

  • Starts with a
  • Followed by b
  • Followed by a string of any length that is anything but c
  • Ends with d

For the string "abcdef":

ab~df     # match
ab~cf     # match
ab~cdef   # no match
a~(cb)def # match
a~(bc)def # no match

Enabled with the COMPLEMENT or ALL flags.

Interval

The interval option enables the use of numeric ranges, enclosed by angle brackets "<>". For the string "foo80":

foo<1-100>     # match
foo<01-100>    # match
foo<001-100>   # no match

Enabled with the INTERVAL or ALL flags.
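Most regex engines have no interval operator, so the behaviour can only be approximated outside Site Audits. The sketch below emulates a literal prefix followed by an interval in Python. The helper interval_match is hypothetical, and the zero-padding rule it applies (bounds of equal width require a number of that same width) is an assumption inferred from the three examples above, not a documented guarantee.

```python
import re

def interval_match(text: str, prefix: str, low: str, high: str) -> bool:
    """Emulate prefix<low-high> for a literal prefix (illustrative only)."""
    m = re.fullmatch(re.escape(prefix) + r"([0-9]+)", text)
    if m is None:
        return False
    digits = m.group(1)
    # Assumption: equal-width bounds such as <001-100> require the
    # number to be zero-padded to that width.
    if len(low) == len(high) and len(digits) != len(low):
        return False
    return int(low) <= int(digits) <= int(high)

print(interval_match("foo80", "foo", "1", "100"))    # True
print(interval_match("foo80", "foo", "01", "100"))   # True
print(interval_match("foo80", "foo", "001", "100"))  # False
```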

Intersection

The ampersand "&" joins two patterns in a way that both of them have to match. For string "aaabbb":

aaa.+&.+bbb     # match
aaa&bbb         # no match

Using this feature usually means that you should rewrite your regular expression.

Enabled with the INTERSECTION or ALL flags.
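Conceptually, an intersection is two anchored matches that must both succeed. The following sketch expresses that idea in Python, whose re module has no "&" operator; the helper name intersection_match is hypothetical.

```python
import re

def intersection_match(left: str, right: str, text: str) -> bool:
    """Emulate left&right: both patterns must match the whole string."""
    return (re.fullmatch(left, text) is not None
            and re.fullmatch(right, text) is not None)

print(intersection_match(r"aaa.+", r".+bbb", "aaabbb"))  # True
print(intersection_match(r"aaa", r"bbb", "aaabbb"))      # False
```

The second call fails because neither "aaa" nor "bbb" covers the entire string on its own.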

Any string

The at sign "@" matches any string in its entirety. It can be combined with the intersection and complement operators above to express "everything except". For instance:

@&~(foo.+)      # anything except "foo" followed by one or more characters

Enabled with the ANYSTRING or ALL flags.
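The "everything except" idiom amounts to accepting a string only when the inner pattern fails to match it in full. Here is a minimal sketch of that logic in Python, which has neither "@" nor "~"; the helper name anything_except is hypothetical.

```python
import re

def anything_except(pattern: str, text: str) -> bool:
    """Emulate @&~(pattern): accept strings the pattern does not fully match."""
    return re.fullmatch(pattern, text) is None

print(anything_except(r"foo.+", "foobar"))  # False
print(anything_except(r"foo.+", "barfoo"))  # True
# foo.+ needs at least one character after "foo", so "foo" alone is accepted.
print(anything_except(r"foo.+", "foo"))     # True
```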


If you need help setting up a regex pattern for Site Health, please send in a ticket to support@seoclarity.net with details of what you are trying to create and we can help set it up for you. 

