Leatherboard1

Linux Bash tutorial 4: The power of regular expressions

Regular expressions are omnipotent. From here on, I'll shorten "regular expression" to regex.

The idea of regex is pretty simple: you use some characters to stand for a whole kind of character. It's like when you run cp folder_a/* folder_b/ to copy everything (the shell's * is a simpler wildcard, not a regex, but the idea is the same). Regex is just a much more systematic approach, which makes it much more powerful. Of course, our skepticism always warns us that this may not be the best system theoretically, but since all the tools we use rely on it, we had better study it first :). Besides, the most famous Chinese martial arts book says that to win a battle, you should always first study your rival.

  • asterisk * : placed after a character or meta-character, it means zero or more repeats of that character. (Meta-characters are characters that can represent more than one character; for example, a dot . is any character.)
    So, 1* matches the empty string, 1, 11, 111, ... Similarly, ? means zero or one occurrence of the preceding character, i.e., ba? matches b and ba. + means one or more occurrences. {n} matches the preceding character exactly n times; for example, [0-9]{3}-[0-9]{4} matches any number of the form XXX-XXXX. {n,m} matches it n to m times.
  • dot . : any single character, as I said, except a newline character.
    So, 1. matches 1 followed by exactly one other character (not a newline). To allow several characters after the 1, combine the dot with a quantifier, as in 1.+
  • [] : any one of the characters inside. [abc] matches a, b or c. [0-9] matches any digit from 0 to 9 (note that the dash means a range).
  • ^ has two uses. Right after the opening [ it means negation, i.e., [^a] matches any character but a. Outside brackets, ^ means the beginning of a line, so ^A matches a line that starts with an A.
  • $ means the end of a line. So a$ matches a line that ends with an a, and ^$ matches an empty line.
  • | : or. (Gilmoregirls|Smallville) matches Gilmoregirls or Smallville. (Which is your favorite TV show?)
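
The operators in the list above can be tried out directly with grep. What follows is only a minimal sketch, assuming a grep that supports -E (extended regex, so that ?, +, {n} and | work unescaped), -x (the whole line must match) and -q (quiet). The helper name matches and all the sample strings are made up for this illustration.

```shell
#!/usr/bin/env bash
# matches REGEX STRING: succeeds when the whole STRING matches REGEX.
# (Made-up helper for this sketch: -E extended regex, -x whole line, -q quiet.)
matches() {
    printf '%s\n' "$2" | grep -Eqx -e "$1"
}

# Quantifiers: *, ?, + and {n}
matches '1*' ''     && echo '1* matches the empty string'
matches '1*' '111'  && echo '1* matches 111'
matches 'ba?' 'b'   && echo 'ba? matches b'
matches 'ba+' 'b'   || echo 'ba+ does not match b'
matches '[0-9]{3}-[0-9]{4}' '555-1234' && echo '555-1234 has the form XXX-XXXX'

# Dot, brackets and alternation
matches '1.' '1x'   && echo '1. matches 1x'
matches '1.' '1xy'  || echo '1. does not match 1xy'
matches '[^a]' 'a'  || echo '[^a] does not match a'
matches '(Gilmoregirls|Smallville)' 'Smallville' && echo 'the alternation matches Smallville'

# Anchors are about lines, so feed grep several lines (one of them empty).
sample=$(printf '%s\n' 'Apple' 'banana' '' 'cola')
starts_with_A=$(printf '%s\n' "$sample" | grep -E '^A')
ends_with_a=$(printf '%s\n' "$sample" | grep -E 'a$')
empty_count=$(printf '%s\n' "$sample" | grep -c -E '^$')

echo "lines starting with A: $starts_with_A"
echo "lines ending with a:" $ends_with_a
echo "number of empty lines: $empty_count"
```

The -x flag is there so that each test is about the whole string. Without it, grep would report a match as soon as any part of the line matched, and 1* would "match" every line because zero repeats is always possible.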
Notes: What if you want to search for a real dot? You can't just put a dot in the regex, because it will be interpreted as any character. You need to escape it by putting \ in front of it, so \. matches a literal dot. You've seen many of these in tutorial 3; go back to refresh your memory and check that you understand all the expressions I used. And of course, you can put normal characters and strings in a regex, and they just mean whatever they conventionally mean. The escaped parentheses in tutorial 3 are just used to group things together.
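
To see the difference escaping makes, here is a small sketch, again assuming grep -E. The two sample lines and the variable names are invented for the example.

```shell
#!/usr/bin/env bash
# Two made-up sample lines: one with a real dot, one with an x in its place.
lines=$(printf '%s\n' 'a.b' 'axb')

# Unescaped dot: it stands for any character, so both lines are selected.
unescaped=$(printf '%s\n' "$lines" | grep -E 'a.b')

# Escaped dot: only the line containing a real dot is selected.
escaped=$(printf '%s\n' "$lines" | grep -E 'a\.b')

printf 'a.b selects:\n%s\n' "$unescaped"
printf 'a\\.b selects:\n%s\n' "$escaped"
```

The single quotes around the pattern matter: they keep the shell from touching the backslash before grep sees it.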

One good feature of regex is that you can group things together and refer back to them later, as I did in tutorial 3. There are also POSIX character classes, which look like this: [:digit:], [:alpha:], ... (they are used inside brackets, as in [[:digit:]]). I won't go into them here; you can always look them up, and they'll show up in the next tutorial too.
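
As a rough sketch of both features: the first command uses sed with escaped parentheses and the back-references \1 and \2, and the second uses a POSIX class with grep -E. The sample strings are made up, and the sed expression is only an illustration of the grouping idea, not necessarily the one used in tutorial 3.

```shell
#!/usr/bin/env bash
# Escaped parentheses create groups in sed's basic regex;
# \1 and \2 in the replacement refer back to the first and second group.
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"

# A POSIX class goes inside a bracket expression: [[:digit:]] is any one digit.
# ^[[:digit:]]+$ therefore selects lines made of digits only.
only_digits=$(printf '%s\n' '2024' 'abc' '42x' | grep -E '^[[:digit:]]+$')
echo "$only_digits"
```

Run as is, this prints world hello and then 2024.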

Linux Bash tutorial 5: If you can't find it with grep, it's lost. (Under Construction)
