R’s Regular Expressions are sequences of characters that define a search pattern. Typically the pattern is used by searching algorithms to find and replace specific sequences of characters within a source string or document. Check out this page for more information on Regular Expressions and various use cases.

In this post, I’ll demonstrate how to extract a specific string of characters using R’s builtin gregexpr, and how to replace a string of characters with gsub.

An excellent feature of R’s gregexpr is that by specifying perl = TRUE, it’s possible to use Python-style named capture groups, which means you can extract parts of the matching text and return only the substring(s) of interest. An example will clear up any ambiguity in description.

Suppose we have a SQL query that runs each month and retreives aggregate paid losses for a small group of insurance claims that can vary across months:

SELECT 
  POLICY_NBR,
  POL_EFF_DATE,
  CLAIM_NBR,
  SUM(PAID_CLAIMS)
FROM CLAIMS_TBL
WHERE CLAIM_NBR IN ('WEATHER-0065009', 'FIRE-011005', 'LIABILITY-013', 'WINTER-000098', 'WATER-0765801')
AND POL_EFF_DATE <= TO_DATE('20170501', 'YYYYMMDD') 
GROUP BY POLICY_NBR, POL_EFF_DATE, CLAIM_NBR;

The first task is to extract each of the claim numbers from the query and compile them into a list for further processing. Then we’ll demonstrate the use of capture groups to update the date string to reflect the current cut-off date.

Regular Expression Syntax

The following table lists the syntax and description of frequently used regular expressions:

Syntax Description
\d any digit 0-9
\D non-digits (punctuation, letters, etc…)
\w alphanumeric
\W non-alphanumeric
\s space
\S non-space
\t tab
\n newline
. any single character
^ match at beginning of string
$ match at end of string
c c occurs n times in sequence
c c occurs at least n times in sequence
c c occurs between ni and nf times
\ escape special characters, e.g. \+ matches ‘+’
c+ matches 1 or more instances of c
c* matches 0 or more instances of c

In R, backslashes have special meaning inside strings. R escapes every character that follows a backslash and then removes the backslashes. R also replaces two consecutive backslashes with a single backslash, so when using any of the regular expressions in the table above, be sure to use double-backslashes. For example, to match an email address, something@domain.com, you’d specify:

regex <- "\\S+\\@\\S+"

We’ve escaped \S+ (one or more instances of non-space characters) and also escaped the literal @ with two backslashes. When parsed by the regular expression engine, patterns in the text matching \S+@\S+ will be returned.

To utilize Python-style capture groups, surround the regular expression pattern with parens. Then, if any matches are found, gregexpr will contain the starting position and capture length of all matches as attributes of the gregexpr object. Here’s an example that extracts claim numbers from the sql above:

# our pattern matches a literal `-`, surround by 1 or more alpha
# characters to the left and one or more numeric characters to the right =>
pattern        <- "(\\w+\\-\\d+)"
matchedClaims  <- gregexpr(pattern, sql, perl=TRUE)
captureStart   <- as.vector(attr(matchedClaims[[1]], "capture.start"))
captureLength  <- as.vector(attr(matchedClaims[[1]], "capture.length"))

# initialize list to hold claim numbers =>
claimNbrs <- list()

for (i in 1:length(matchStart)) {

    iterStart  <- captureStart[[i]]
    iterLength <- captureLength[[i]]
    iterStop   <- iterStart + iterLength - 1
    iterClaim  <- substr(sql, start=iterStart, stop=iterStop)
    claimNbrs[[length(claimNbrs)+1]] <- iterClaim
}

print(claimNbrs)

[[1]]
[1] "WEATHER-0065009"

[[2]]
[1] "FIRE-011005"

[[3]]
[1] "LIABILITY-013"

[[4]]
[1] "WINTER-000098"

[[5]]
[1] "WATER-0765801"

Updating the Date

In the query above, to extract the date used as a filter POL_EFF_DATE (‘20170501’), one option is to use the sub function. Its header is:

`sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)`

sub can accept regular expressions for the pattern argument. In the following example, we replace ‘20170501’ with ‘20170516’ :

sqlOriginal <- "SELECT 
                POLICY_NBR,
                POL_EFF_DATE,
                CLAIM_NBR,
                SUM(PAID_CLAIMS)
                FROM CLAIMS_TBL
                WHERE CLAIM_NBR IN ('WEATHER-0065009', 'FIRE-011005', 'LIABILITY-013', 
                                    'WINTER-000098', 'WATER-0765801')
                AND POL_EFF_DATE <= TO_DATE('20170501', 'YYYYMMDD') 
                GROUP BY POLICY_NBR, POL_EFF_DATE, CLAIM_NBR;"


# our regular expression pattern matches strings starting with `TO_DATE(`
# followed by any 8 consecutive numbers =>
pattern    <- "TO_DATE\\('\\d{8}'"
newDate    <- "20170516"
newDateStr <- paste("TO_DATE('", newDate, "'", sep="") 
sqlUpdated <- sub(pattern=pattern, replacement=newDateStr, x=sqlOriginal)

print(sqlUpdated)

# returns ->
    "SELECT 
    POLICY_NBR,
    POL_EFF_DATE,
    CLAIM_NBR,
    SUM(PAID_CLAIMS)
    FROM CLAIMS_TBL
    WHERE CLAIM_NBR IN ('WEATHER-0065009', 'FIRE-011005', 'LIABILITY-013', 
                        'WINTER-000098', 'WATER-0765801')
    AND POL_EFF_DATE <= TO_DATE('20170516', 'YYYYMMDD') 
    GROUP BY POLICY_NBR, POL_EFF_DATE, CLAIM_NBR;"

The replacement logic can be encapsulated in a function to simplify periodic updating of the claim cut-off date:

updateDate <- function(sqlStr, newDate) {

    pattern    <- "TO_DATE\\('\\d{8}'"
    newDateStr <- paste("TO_DATE('", newDate, "'", sep="") 
    return(sub(pattern=pattern, replacement=newDateStr, x=sqlStr))
}


newSQL <- updateDate(sqlStr, '20170731')

# returns updated SQL

Conslusion

This post provided a brief introduction to regular expressions in R, and how to use Python-style capture groups to extract substrings of interest from larger matches. Regular expressions are a powerful utility, but can result in a difficult-to-maintain codebase. I’ll leave you with a quote from Jamie Zawinski, an early Netscape engineer:

Some people, when confronted with a problem, think       
“I know, I'll use regular expressions.” Now they have two problems.

Until next time, Happy coding!