Regular expressions are at once an immensely simple and an immensely complicated topic. Entire books have been written about regular expressions and have still only scratched the surface.
But at the same time, even very simple regular expressions can be extremely useful when working with data. And as marketers we work with a lot of data. Additionally, most analytics tools support matching with regular expressions, so learning how to construct simple regular expressions can be very helpful as well as time-saving.
This series of posts will attempt to simplify the use of regular expressions into very useful but not overly complicated usage patterns.
And knowing even just a small amount about how to use regular expressions, often abbreviated as regex, can really make the process of data analysis faster and easier.
This first post in the series will cover the most basic of regular expressions and a little of the what and why of regular expression usage.
A simple regex
A very simple regular expression:
Google
Yep, just a standard word is a regular expression. Not a very useful one, true, but a good place to start.
So let’s look at the word Google.
As a standalone regular expression, Google matches only one thing: itself.
This is what limits its value as a regular expression: its matching power is extremely literal.
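A minimal sketch of this literal matching, using Python's built-in re module. The choice of Python and the sample strings are assumptions for illustration; the same idea applies in any tool that supports regular expressions.

```python
import re

# A plain word used as a regular expression: every character is literal.
pattern = re.compile(r"Google")

# Sample strings chosen for illustration only.
samples = ["Google", "Gaagle", "google", "Goggle"]

# fullmatch() succeeds only when the entire string fits the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Only the exact word matches; a different letter or even a different capitalization is enough to fail.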
The power of regular expressions comes from the ability to create strings of text that match, and, often more importantly, do not match, increasingly complex patterns.
This more complex non-literal matching is done through the use of metacharacters.
A metacharacter is simply a character that stands for or represents a particular character or sequence of characters. A metacharacter is a non-literal character that matches an abstraction rather than a specific text symbol.
The simplest metacharacter is a
.
That’s right: a dot/period is a metacharacter in regular expressions.
Not always though. Sometimes a dot/period is just a dot/period. More on that difference later.
The dot/period, when used as a metacharacter, stands for any character. As a metacharacter the dot/period is an abstraction of the concept of any character.
For example, the regular expression G..gle would match Google, Gaagle, or G12gle.
Which again isn’t the most useful way of matching things, but it is a good example of how metacharacters work.
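Here is a short sketch of the dot at work, again assuming Python's re module and a hand-picked list of sample strings. Note that in Python the dot matches any character except a newline by default; other engines may differ.

```python
import re

# Each dot stands in for exactly one character, whatever it is.
pattern = re.compile(r"G..gle")

# Illustrative samples: the last two have too few or too many characters.
samples = ["Google", "Gaagle", "G12gle", "Gogle", "Gooogle"]

# Keep only the strings that fit the pattern from start to end.
matches = [text for text in samples if pattern.fullmatch(text)]
print(matches)
```

The two dots require exactly two characters between G and gle, so Gogle and Gooogle are rejected.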
When you need a metacharacter to not be a metacharacter, you usually need to escape the metacharacter.
Escaping is simply a fancy term for telling the application that the following character should not be given any special meaning.
In regex, escaping is usually done with a backslash (\). However, different regex engines may use different escape sequences, so it is important to read the documentation for your particular application.
But the important takeaway is not how to escape a metacharacter but rather understanding why it must be done.
So in a regular expression, a . by itself indicates any character, while a \. would indicate just a standard dot/period. Again, this depends on your application.
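The difference can be sketched as follows, assuming Python's re module, where the backslash is the escape character. The patterns a.c and a\.c and the sample strings are made up for illustration.

```python
import re

unescaped = re.compile(r"a.c")   # dot is a metacharacter: any single character
escaped = re.compile(r"a\.c")    # backslash-dot: a literal dot/period only

# Illustrative samples.
samples = ["a.c", "abc", "a-c"]

unescaped_hits = [s for s in samples if unescaped.fullmatch(s)]
escaped_hits = [s for s in samples if escaped.fullmatch(s)]

# re.escape() adds the backslashes for you when text should be fully literal.
auto_escaped = re.escape("a.c")

print(unescaped_hits)
print(escaped_hits)
print(auto_escaped)
```

The unescaped pattern accepts all three samples, while the escaped one accepts only the string that really contains a dot.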
Here’s an example:
Let’s say you had a bunch of PDF files that you were offering as free downloads on your site in exchange for an email address, and the files were numbered file_01.pdf, file_02.pdf, and so on all the way through file_10.pdf. Instead of looking for all those file names individually, you could use a regular expression to match all of them:
Note that this regex is not the best way to find the match in question. There are many different ways to construct pattern matching expressions with regular expressions and the more you understand the more efficient and effective your pattern will be.
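As a rough sketch of the file name scenario, the snippet below uses only the two ideas covered so far: the dot and the escaped dot. The pattern file_..\.pdf is an assumption made for this illustration, not necessarily the exact expression intended above, and Python's re module stands in for whatever tool you use.

```python
import re

# Hypothetical pattern: two "any character" dots for the number,
# then an escaped dot so the extension separator is matched literally.
pattern = re.compile(r"file_..\.pdf")

# file_01.pdf through file_10.pdf, as described in the example.
downloads = [f"file_{n:02d}.pdf" for n in range(1, 11)]
all_matched = all(pattern.fullmatch(name) for name in downloads)

# Names that should not count, to show where the pattern is loose.
others = ["file_1.pdf", "file_01xpdf", "file_ab.pdf"]
loose_hits = [name for name in others if pattern.fullmatch(name)]

print(all_matched)
print(loose_hits)
```

All ten download names match, but so does file_ab.pdf, because the dots accept any character and not just digits. That looseness is one reason a pattern like this is a starting point rather than the best possible match.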
Testing your Regex
The easiest way to test your regular expressions is in a text editor. Any good text editor these days has the ability to search using regular expressions.
Some quality text editors to consider are:
You can use any of these text editors to practice and test your understanding of regular expressions. Visual Studio Code will even indicate incorrect regex syntax, which can be helpful.
Here is a screenshot using Visual Studio Code: