Regex Basics

Regular expressions are an immensely complicated topic. Entire books have been written about them and have still only scratched the surface.

However, they can be extremely helpful for anyone who has to analyze marketing data. Used at a basic level, regular expressions do not have to be overly complicated.

Knowing even a small amount about regular expressions, often abbreviated as regex, can make data analysis faster and easier.

A simple regex

A very simple regular expression: Google

Yep, just a standard word is a regular expression. Not a very useful one, true, but a good place to start.

So let's look at the word Google in a little more detail.

Really, Google is just a text pattern that we have learned to recognize: a capital G followed by two lowercase o’s, and so forth. We would likely also recognize the same text pattern in slightly different forms: GOOGLE or google or even gOogLe, though this last version is a little difficult to parse visually and shows how unusual shapes affect our recognition of text patterns.

As a standalone regular expression, Google matches only one thing: itself.

This is why it has limited value as a regular expression: it can match very little.
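To make the point concrete, here is a minimal sketch in Python (the language and the sample strings are assumptions for illustration, not part of the original example). Python's built-in re module compares case-sensitively by default; other tools may behave differently.

```python
import re

# The pattern is the plain word from the text; it contains no metacharacters.
pattern = re.compile(r"Google")

# Hypothetical sample strings: the variants a human reader would still recognize.
samples = ["Google", "GOOGLE", "google", "gOogLe"]

# fullmatch() succeeds only when the entire string fits the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

With Python's default settings, only the exact spelling Google is reported as a match, even though a person would recognize all four variants.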

The power of regular expressions comes from the ability to create strings of text that match, and do not match, increasingly complex patterns.

This is done through the use of metacharacters.

Metacharacters

A metacharacter is simply a character that, instead of matching itself, stands for some other character or sequence of characters.

The simplest metacharacter is . (a dot).

That’s right: a dot/period is a metacharacter in regular expressions.

The dot/period, when used as a metacharacter, stands for any single character (in most regex engines, anything except a line break).

Therefore:

G..gle

would match Google or Gaagle or G12gle.

Again, this isn’t the most useful way of matching things, but it is a good example of how metacharacters work.
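The behavior of the dot can be checked with a short Python sketch. The first three sample strings come from the text above; the last two are made up here to show what the pattern rejects.

```python
import re

# Each dot stands for exactly one character, so G..gle needs
# precisely two characters between the G and the gle.
pattern = re.compile(r"G..gle")

samples = ["Google", "Gaagle", "G12gle", "Ggle", "Goooogle"]

# Keep only the strings that fit the pattern from start to end.
matches = [text for text in samples if pattern.fullmatch(text)]

print(matches)
```

The strings with too few or too many characters in the middle are left out, because every dot has to be filled by exactly one character.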

Escaping Metacharacters

When you need a metacharacter to be treated as an ordinary character, you usually need to escape it.

Escaping is simply a fancy term for telling the application that the following character should not be given any special meaning.

In regex, escaping is done with \ (a backslash).

So in a regular expression, a . by itself matches any character, while \. matches just a literal dot/period.

So let's say you had a bunch of PDF files that you were offering as free downloads on your site in exchange for an email address, and the files were numbered file_01.pdf, file_02.pdf, all the way through file_10.pdf. Instead of looking for all those file names individually, you could use a regular expression to match all of them:

file_..\.pdf

Note that this regex is not the best way to find these files: the two dots accept any two characters, so it would also match a name like file_ab.pdf. There are many ways to construct a pattern, and the more you understand, the more efficient and effective your patterns will be.
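A small Python sketch shows what the backslash changes. The first pattern is the one from the text; the second is the same pattern with the escape removed, included only for comparison. The file names file_01xpdf and file_1.pdf are hypothetical, added to show the difference.

```python
import re

# Pattern from the text: two "any character" dots, then an escaped (literal) dot.
escaped = re.compile(r"file_..\.pdf")

# Same pattern without the backslash: the third dot now matches any character.
unescaped = re.compile(r"file_...pdf")

# Two real-looking names plus two hypothetical ones for contrast.
names = ["file_01.pdf", "file_10.pdf", "file_01xpdf", "file_1.pdf"]

escaped_hits = [name for name in names if escaped.fullmatch(name)]
unescaped_hits = [name for name in names if unescaped.fullmatch(name)]

print("escaped:  ", escaped_hits)
print("unescaped:", unescaped_hits)
```

Only the escaped version insists on a real dot before pdf; the unescaped version also accepts file_01xpdf, which is probably not what was intended.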

Testing your Regex

Use any good text editor.

Almost any good text editor these days has the option to find content using regular expressions. Some, like Visual Studio Code, will even flag incorrect regex syntax, which can be helpful.

Here is a screenshot using Visual Studio Code:

[Screenshot: Testing regex in VS Code]
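If no editor is at hand, a similar syntax check can be sketched in a few lines of Python. This is only an illustration: the helper name is_valid_regex is made up here, and Python's regex syntax may differ in places from the tool you end up using.

```python
import re

def is_valid_regex(pattern: str) -> bool:
    """Hypothetical helper: report whether Python can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re.compile raises re.error when the pattern is malformed.
        return False
    return True

# The pattern from the text compiles without complaint.
print(is_valid_regex(r"file_..\.pdf"))

# An unclosed parenthesis is a syntax error in the pattern.
print(is_valid_regex(r"file_(..\.pdf"))
```

A check like this only says whether the pattern is well formed, not whether it matches what you intended, so it is still worth trying the pattern against a few real examples.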