An Intuitive Grasp of RegEx’s in Python
Regular expressions are used to define search patterns. Python provides regular expressions via the built-in ‘re’ module, but they are hard to read, write, and understand. This talk will give you two tools to conquer regex’s: a mental model (demonstrated with props) of how they work, and a mini-language, “Simple Regex Language”, for creating readable regex’s that translate easily into Python regex’s.
Overview & Purpose
Regular expressions are used to define search patterns and are an important technique for validating, scraping, and wrangling (i.e., re-formatting) the content of strings. Additionally, they’re used to enable syntax highlighting in some applications. Python provides regular expressions via the built-in ‘re’ module, and there is a third-party ‘regex’ module with added functionality.
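To make those three uses concrete, here is a minimal sketch using only the built-in ‘re’ module. The date pattern and the sample text are made up for illustration; they are not taken from the talk materials.

```python
import re

# Illustrative pattern: a YYYY-MM-DD shaped date (shape only, no calendar checks).
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Validating: the whole string must have the date shape.
is_valid = DATE.fullmatch("2024-03-15") is not None

# Scraping: pull every date-shaped substring out of free text.
text = "Released 2023-11-02, patched 2024-03-15."
found = DATE.findall(text)

# Re-formatting: rewrite YYYY-MM-DD as DD/MM/YYYY using numbered groups.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(is_valid)
print(found)
print(reformatted)
```

Even in this small case, the re-formatting pattern is noticeably harder to read than the task it performs.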
The problem is, writing regex patterns to do what you want is hard, and even when you’ve got one, figuring out what it is or isn’t going to match can be baffling.
This talk will give you two tools to conquer regex’s: a mental model (demonstrated with props) of how they work, and a mini-language, “Simple Regex Language”, for creating readable regex’s that translate easily into Python’s regex syntax.
A Physical Model of RegEx’s
- Picture the string we are searching as a line of tiles (like those in Scrabble), where the character each tile represents has been routed into its surface.
- This lets us talk about the two categories of places where a regex can start or continue a match:
- At a character specified in the regex: (modeled by a vacuum-formed sheet of plastic whose profile can nest in the character’s incised relief).
- At a position called an anchor, specified in the regex: (represented by the insertion of a strip of plastic into the crack between tiles)
- Note: whether a given ‘crack’ matches the given anchor is determined by what is to its left and right; more specifically, by the categories those neighbors belong to, e.g. whitespace, printable, alphabetic, numeric, eol, or buffer-wall.
- This model lets us illustrate how the regex engine goes about making a match. For example, if our pattern wants to match ‘ABC’ and our string contains ‘ABD’, we slide a piece of plastic with an ‘A’ profile along from the start of the buffer until we encounter the ‘A’ tile. The plastic sinks into the ‘A’ tile, allowing us to swing down the ‘B’ overlay that is taped to the right edge of the ‘A’ overlay; it also sinks down flush, matching the ‘B’ tile. When we try to swing down the next taped-on overlay, ‘C’, it crashes into the surface of the ‘D’ tile and instead levers out the ‘B’ overlay, which levers out the ‘A’ overlay. That gets us back to sliding the ‘A’ overlay along, looking for the next place to pause and try for a match.
- At this point we introduce SRL, (below), then show how its patterns translate into Python regex’s, then we return to this model and extend it to cover all the different regex ‘atoms’ we can now write.
- This “Tile and Overlay” model provides a visual metaphor for how the regex engine works, but there are no tiles and overlay chains in the computer; there are only strings of bytes (one to four per character, if we are talking UTF-8). So we briefly introduce a model that uses height to represent characters. This lets us talk about Unicode strings, and hints at the kind of optimizations that compiling regex’s might allow Python to do.
- For the presentation, there will be a physical model to show the example covered in point three above, but to make things manageable, we’ll then switch to illustrations done in SketchUp (maybe even animations).
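The slide, sink, and lever-out sequence from the ‘ABC’ example, and the rule that a crack is judged by its two neighbors, can be sketched in a few lines of Python. This is a toy for literal patterns only, not how the ‘re’ engine is implemented; the names slide_and_match and is_word_boundary are hypothetical helpers invented for this illustration, and the only anchor modeled is the word boundary, \b.

```python
import re

WORD = re.compile(r"\w")  # the category of "word" tiles, borrowed from re itself


def slide_and_match(tiles: str, overlays: str) -> int:
    """Return the index where the whole overlay chain sits flush, or -1."""
    for start in range(len(tiles) - len(overlays) + 1):  # slide along the tiles
        for offset, overlay in enumerate(overlays):      # swing down each overlay
            if tiles[start + offset] != overlay:
                break  # this overlay crashes; the chain levers out, keep sliding
        else:
            return start  # every overlay sank in flush
    return -1


def is_word_boundary(tiles: str, crack: int) -> bool:
    """Crack `crack` is the gap just before tiles[crack].

    Cracks 0 and len(tiles) touch a buffer wall, which counts as non-word.
    """
    left = crack > 0 and WORD.fullmatch(tiles[crack - 1]) is not None
    right = crack < len(tiles) and WORD.fullmatch(tiles[crack]) is not None
    return left != right  # a boundary has a word tile on exactly one side


first = slide_and_match("XXABDABC", "ABC")
cracks = [i for i in range(len("ab cd") + 1) if is_word_boundary("ab cd", i)]
print(first)   # 5: the attempt at index 2 fails on the 'D' tile
print(cracks)  # [0, 2, 3, 5]
```

Both results agree with what re.search("ABC", ...) and re.finditer(r"\b", ...) report for the same strings, which is a useful sanity check on the metaphor.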
SRL: Simple Regex Language
- SRL is what is known as a “Little Language”, or “Domain Specific Language”: a language built to handle a small problem area. In SRL’s case, the problem is the unreadability of regex’s, and the fact that each language has a different way of writing them.
- You’d think we could skip this as we are only concerned with Python here, but it is useful to have this level of abstraction, even if you only do Python. You are likely to find that your editor uses a different flavor of regex’s.
- An overview, live demos, and documentation can be found at the project’s website, https://simple-regex.com. I won’t duplicate them here; I’ll just say that the exposition of what we need for this talk will follow this source material, and will include an SRL to Python cheat sheet that covers their translation and how they are expressed in the “Tile and Overlay” model.
- To give reviewers a feel for what the illustrations of SRL will look like, I intend to either add them to the proposal or provide a link to them on my GitHub, so you can look at them as they are created for this talk.
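As a rough idea of what one cheat sheet entry could look like, here is a sketch pairing an SRL query with a Python pattern. The SRL text in the comment is reproduced from memory of the example on the project’s website and should be checked against it. The Python pattern is my own hand translation, not the output of any SRL tool, and the name EMAIL_LIKE is made up for this illustration.

```python
import re

# SRL query (from memory of the project's site; verify there before relying on it):
#   begin with any of (digit, letter, one of "._%+-") once or more,
#   literally "@",
#   any of (digit, letter, one of ".-") once or more,
#   literally ".",
#   letter at least 2,
#   must end, case insensitive

# Hand translation into Python's syntax, one fragment per SRL clause.
EMAIL_LIKE = re.compile(
    r"^[0-9a-z._%+-]+"  # begin with any of (...) once or more
    r"@"                # literally "@"
    r"[0-9a-z.-]+"      # any of (...) once or more
    r"\."               # literally "."
    r"[a-z]{2,}"        # letter at least 2
    r"$",               # must end
    re.IGNORECASE,      # case insensitive
)

print(bool(EMAIL_LIKE.match("Sample@Example.com")))
print(bool(EMAIL_LIKE.match("not an email")))
```

Splitting the pattern into one commented fragment per clause keeps the correspondence with the SRL query visible, which is the property the cheat sheet is meant to show.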