Regex Engine

Regex Engine

Understanding the Regex Engine

Regex engine is the part of the Ruby language’s operating environment that deals with interpreting regular expressions as they appear in a pattern. In this blog, we explain the Regex Engine with examples. 

 

Problem

Having some idea about how a regex engine works would help you understand the functioning of certain pattern constructs more clearly.

 

The phrase “regex engine” is in wide circulation. Getting familiar with the term is likely to help your further studies on the subject of regular expressions.

 

Solution

What is a regular expression engine? 

The engine in this sense refers to the part of the Ruby language’s operating environment (which may consist of a compiler/interpreter and an execution platform).

 

In this blog, it is referred to as ROE (Ruby Operating Environment), which deals with interpreting regular expressions as they appear in a pattern (check validity), trying to find one or more match(es) as requested, and capturing the group.

 

It may be helpful to think of this part of ROE as a separate executive performing due diligence on source strings and patterns, and handing out reports. 

 

And how does the engine work? There is more than one type of regular expression engine. However, because this is not a reference book, let’s not get into too much detail.

 

Besides, for day-to-day use, you probably won’t need a very deep level of understanding the engine. So I will try to keep it simple and try to provide an overview based on my understanding.

 

The following are two basic rules that apply in general, for the engine:

  1. The first match wins
  2. Patterns are inherently greedy The engine
  3. Runs from left to right for both the source string and for the pattern to be matched
  4. Tries to match the whole pattern at each possible position (as applicable) in the whole source string, before going to the next position.

If it finds a match, it stops without checking any further (unless otherwise specified in the pattern).

 

In the context of 'position (as applicable)', note that it is contextual (as per the pattern expression). For instance, if the pattern specifies a beginning anchor (^), if the match is not found in the very first position, there is no point in going ahead.

 

The possible positions in the source strings (in general) are the zero-length positions prior to each character, including each space, and the position at the end.

So, in a “Hello World” string, the positions (shown with '|' character) are |H|e|l|l|o| |W|o|r|l|d|.

 

Plane Forward Search

This following explanation is based on a pattern that is a set of plane characters, but the principle applies to other patterns too (as applicable).

 

When the engine picks up the source string, on one hand (in a metaphorical sense), and the pattern, on the other, it starts with the first possible place (the zero-length position before the first character) in the source string.

 

It tries to match that with the first character of the pattern. If a match is found, it goes ahead of one position in the source string and tries to match that with the second character of the pattern. If that succeeds, it goes ahead one position further in source string and tries with the third character in the pattern, and so on.

 

If at one stage a match fails, it abandons the basic starting position in the source string (in this case, the first position) and moves ahead to the next position in the source string and tries to match the whole pattern starting from there.

  1. If it succeeds in matching the whole pattern in a position, it returns success (or whatever the equivalent, as per implementation) and stops.
  2. An example of a search on the string “Raca the cat, jumped in the hat” for the pattern /cat/ will help.

 

The search starts at the position before R, and tries to first match 'R' with the very first character in the pattern (which is 'c'), and fails.

 

It advances to the next position, which is after 'R' and before 'a'. In its journey, at one point it comes to before (the first) 'c' in the source string.

 

This time, it tries to match 'c' (of 'Raca') with 'c' (from the pattern) and finds a match. It advances position and tries to match 'a' with 'a' (from the pattern) and succeeds.

 

However, the next match between a space (from the source string) and 't' (from the pattern) fails. It gives up the current basic starting position, which is the place before 'c' in the word 'Raca' and makes the position after the 'c' its basic starting position for the next search attempt.

 

This goes on until it comes to just before the 'c' of the word 'cat' in the source string (assuming that the computer does not crash). At this point, it matches 'c' with 'c', goes ahead and matches 'a' with 'a', still goes ahead and matches 't' with 't'. And it concludes its journey.

 

Checking User Input

Problem

Let’s continue with the problem from Recipe, except a name list should be sorted by the last name. (Make sure that you read the previous recipe before continuing with this recipe.)

 

Since we are taking input from the user, there may be a check for proper names (for now let’s define a proper name as any name that does not have a digit in it).

 

Also, what if someone inadvertently enters the first name and then presses the Return key (without entering the last name)? We can check for that too (reject it, but show the proper message in the next prompt, so that the user notices).

 

How about enhancing the last solution with all of these checks? Note that different parts of the name should be separated in the input by one or more spaces. And the names should be printed with the last name first and the rest of the name (e.g., “John M McCain” => McCain John M).

 

Solution

Before fully addressing the solution, I would like to present a bit of regex (or regular expressions). Although regular expressions are discussed in detail later in the book, for the current exercise a bit of discussion is necessary.

 

Note that a representation like /[0-9]/, when used as a pattern (to be matched or searched for), means any character (strictly speaking, any one character) that is between 0 and 9.

 

It can be searched in a string for a match (or otherwise) with the =~ operator (which you would have already seen).

For instance, this:


"ab11ed" =~ /[0-9]/

returns this:

2

t

It is the first index position of a digit (any digit, or any character between 0 and 9, both included) in the string. However, if the string does not have any digit, it returns nil, (which, as you might remember, is interpreted as 'false' in a logical context).

 

These kinds of expressions (e.g., [0-9]) are known as regular expressions (there are many forms possible). Encased between / and /, it forms a regex pattern (/ and / denote pattern boundaries in this case).

 

Note that a pattern can be an exact string (like /ab/) also. Equipped with that knowledge, let’s proceed to the solution. It would be a good idea to solve the missing pieces first. We also already have the non-digit check in place.

 

In order to check that the name given has at least the first name and the last name, we can split it in one or more spaces (again a regular expression). The resulting array from split has at least size 2; something like this:

names = name.split(/\s+/)

names.size > 1

 

The first line of code splits a string (named name) based on the regular expression \s+ (\s indicates whitespace characters and the + indicates one or more of that). The result of the split goes into the names array. The next line checks whether the size of the array is greater than 1.

 

The next piece of the puzzle is to get the last name and the rest of the names. (Note that the name may or may not have a middle name, so the array size after the split will not always be 2).

 

We could find the array size and then access the last element of the array based on the size. The following is an example.

a = ['John','M','McCain']

i = a.size

lastName = a[i-1]

 

Note that there is a function named last in the Array API that gives the last element of an array.

irb(main):004:0> ['John','M','McCain'].last => "McCain"

But we need to do more than that. We need to rearrange the name (with last_name coming at the beginning). An approach for index lookup based on size (i.e. lastName = a[i-1]) is more suitable.

 

In terms of conceptual design, the solution has three main functional components.

  1. Reading and validating the name (and converting valid names to bring the last name first).
  2. Asking for names in a loop, until END, and building up the name array for sorting.
  3. Sorting and printing.

Those activities or functionalities may be intertwined in the program and they may not necessarily come one after another. (For instance, validation should be called as part of a looping action, as each name needs to be validated).

 

Note also that the first component (as stated earlier) could be broken down into subfunctions. For example, validating the name and transforming a valid name could be two separate functions.

 

It could be a matter of design choice (and indeed another function may be written, which could be passed a valid name for transformation), but the transformation part is rather small, so in terms of design, I have put it in the same function as the validation.

 

Now let’s come back to solving individual functions (three of them), from our conceptual functional component model.

 

The third part (sorting and printing), in essence, is the same as the last task. Hence, more or less the same code may be used.

 

For the second part, we now have three essential cases.

  1. END: To signal the end of a looping
  2. A valid (and converted) name => add to the name_array
  3. An invalid name => ask again and signal an error to the user through the same message

This means that our case statement has three cases now—one for each class of action in the while loop.

 

It would be good if our validation function supports this classification (so that the case statement can work directly on the output of the function, and need not have anything to do with the raw name).

 

The following is the classification functionality to implement.

  1. Return END when that string is specified.
  2. Return a specific string that indicates that the name is invalid (and also indicates the type of error in the same output string— meaning that we may have more than one specific error string)
  3. Return the converted name (last name first) for all other (i.e., valid) cases.

You can start implementing from the first functionality, which is this classification.

The following code should work fine.

def validate(name)

if name == "END"

"END"

elsif name =~ /[0-9]/

"NOTVALID_NUMERIC"

else

names = name.split(/\s+/)

sz = names.size

if sz < 2

"NOTVALID_FIRSTNAMEONLY"

else #return the name with last part first last_name = names.delete_at(sz-1) names.insert(0,last_name) names.join(' ')

end

end

end

puts validate('John M McCain')

puts validate('END')

puts validate('FirstNameOnly')

puts validate('1abc 2def')

puts validate('123NumericAndFirstNameOnly')

When run, it should produce the following.

McCain John M

END

NOTVALID_FIRSTNAMEONLY

NOTVALID_NUMERIC

NOTVALID_NUMERIC

 

Note that the function first checks for digits and then the first name only. Hence, a case containing both issues is reported as a numeric case. 

 

Equipped with this function, the case part can be written without much trouble. The while loop (including the case statement) now becomes this:


while name = gets.chomp

processed_name = validate(name)

case processed_name

when "END"

puts "end of user input"

break # break from asking loop

when "NOTVALID_NUMERIC"

print "Not a valid name (no digit allowed). Name [enter END to end] : "

when "NOTVALID_FIRSTNAMEONLY"

print "Please provide full name. Name [enter END to end] : " else # valid (and converted) name

#append the name to the array

name_arr << processed_name

#print the prompt again for further input print "Name [enter END to end] : "

end

end

Two different types of error messages have been provided, so there are two cases for that, but those two cases could be merged if you want to provide one general error message. Those two cases are functionally not very different.

This is the current solution.

<function definition goes here>

print "Name [enter END to end] : "

name_arr = []

<while loop goes here>

puts

puts "--------------------------- "

puts "Sorted names (by last name)"

puts "--------------------------- "

#sort the array and print the result

name_arr.sort.each {|name| puts name}

Run with the inputs demonstrated next.

Name [enter END to end] : 123

Not a valid name (no digit allowed). Name [enter END to end] : abcd

Please provide full name. Name [enter END to end] : Tori Dean

Name [enter END to end] : John M McCain

Name [enter END to end] : Daly Moore

Name [enter END to end] : Santu Bose

Name [enter END to end] : David Bower

Name [enter END to end] : END

end of user input

It produces the following.

---------------------------

Sorted names (by last name)

---------------------------

Bose Santu

Bower David

Dean Tori

McCain John M

Moore Daly

This satisfies our specification.

Note the following.

The functional components do not necessarily represent the sequential part of the code. It represents breaking up overall processing into some main basic chunk of activities.

 

Backtracking

Things may get more interesting when the pattern is more complex. 

 

At a point, when the engine has more than one option to find a suitable match for the next character (or subpattern, etc.), it remembers the other options that it can try.

 

It also notes the position where the fork happened to get back and try the next alternative from that fork in the road, should the current alternative come to a dead end prior to completing a match. This is known as backtracking (coming back to the fork in the road and trying a previously untried alternative).

 

So, as it happens, the engine (or shall we say the car) sometimes runs in reverse (metaphorically speaking) as necessary.

Coming back to our example, although the search starts from the position prior to 'T', the real position of interest (for the current explanation) is the position prior to the 't' of 'trap'.

 

The 't' the 'r' and the 'a', matches one by one from the pattern. Just before 'p' the engine has two options. Let’s say it picks 'in' to try first and saves 'ck' for future search should 'in' fail.

 

(This does not necessarily happen left to right, and follows a LIFO structure for getting the next alternative to try). Eventually, 'p' does not match 'i' and the match fails.

It backtracks to the position before 'p' and tries to match that with 'c' of 'ck'.

 

Still fails. Now since it has no other untried alternative, it abandons the basic starting position (which is the position before the 't' of 'trap' in this case) and advances to the position after that 't' (i.e. before the 'r' of 'trap') and makes that it’s basic starting position for the next attempt.

 

Finding Repeated Patterns

Problem

You want to match patterns such as 'ab0ab0ab' or 'cd1cd1cd' or 'ef1ef1ef' and so on, but not 'ab1cd1cd' or 'cd2ab2ef' and so on. How do you specify that?

 

Solution

A backreference is a type of construct in regular expressions. It provides a convenient way to identify a repeated character or substring within the source string.

 

As far as backreferences are concerned, the keyword is repetition. Suppose you are looking for a /(ab/cd/ef)/ pattern that will match either 'ab' or 'cd' or 'ef'. However, you want to see whether the same pattern is repeating twice more after another character gap each time.

 

You can use (numbered) backreferences for the purpose. The following code

print "matched" if "ab1ab2ab".match(/(ab|cd|ef).\1.\1/)

prints this:

matched

And the following code

print "matched" if "cd1cd2cd".match(/(ab|cd|ef).\1.\1/)

prints this:

matched

But the following code does not.

print "matched" if "ab1cd2cd".match(/(ab|cd|ef).\1.\1/)

 

How It Works

In the last code, the initial pattern found is 'ab' (even though it was looking for either 'ab' or 'cd' or 'ef'), and hence, the value of back reference \1 is set to 'ab' only. In the next case, after a gap of one character (signified by .), 'ab' does not occur, and hence, the entire pattern fails to match.

 

If you removed the last backreference and the preceding dot from the pattern (i.e., it was defined as /(ab|cd|ef).\1/), it would pass, however.

The following code


print "matched" if "ab1cd2cd".match(/(ab|cd|ef).\1/)

prints this:

matched

The match is not between 'ab' and 'cd' but between the last two 'cd'’s. This becomes apparent upon running the following code.

print "ab1cd2cd".match(/(ab|cd|ef).\1/)

That prints the following.

cd2cd

You can have multiple backreferences in a single pattern, referring to captured groups, specified in that order in the pattern. Hence, the following code


print "matched" if "ab1cd2ef3cdabef".match(/(ab).(cd).(ef).\2\1\3/)

prints this:

matched

 

The three groups captured (in that order) set the values 'ab' to \1, 'cd' to \2 and 'ef' to \3, respectively. Hence, \2\1\3 translates to 'cdabef'. The rest should be easily understandable.

 

Any small deviation in this part of the string (assuming that you are not adding the same string elsewhere in the source string) interferes with the match. Note that backreferences happen only with captured groups. If you were to make the second group non capturing; for example,


print "matched" if "ab1cd2ef3cdabef".match(/(ab).(?:cd).(ef).\2\1\3/)

An error will occur.

... invalid backref number/name: /(ab).(?:cd).(ef).\2\1\3/

This is because the second group now being non-capturing. \2 actually gets the value captured from the third group, and \3 is meaningless (there are only two back references generated in this case). An observation from this is that you cannot refer to an invalid back reference (one that is not generated from the pattern).

 

Octal Codes and Backreferences

In this context, it is worth mentioning octal codes. In Ruby, an octal code is distinguished by a preceding backslash. Hence, the following code


print "\121"

prints this:

Q

 

The question is how the Ruby compiler/executor environment knows whether such a number is an octal code or a back reference.

The following rules apply.

  1. The expressions \1 through \9 are always interpreted as backreferences and not as octal codes.
  2. If the first digit of a multi-digit expression starts with 8 or 9, the expression is interpreted as a literal.
  3. Expressions from \10 onward (except the 8 and 9 beginning digits) are considered backreferences if there is a back reference corresponding to that number; otherwise, they are treated as octal codes.

 

So you cannot get away with an invalid back reference if it is between \1 and \9


As per the second point, the following code

print "\89"

prints this:

89

Recommend