Tiny Economics: Linux Bash tutorial 3: Massive text-morphisms, tr, sed

I decided to put off grep for a little bit. Once you've learned sed, grep is pretty easy. And once you've learned grep, you'll wish you could just grep things in real life :).

OK, now let's first do tr and sed.

Run:
wget http://www.gutenberg.org/dirs/etext95/callw10.txt

Now run: less cal tab (don't type the full name; in Linux we use the Tab key to auto-complete, remember?)

Now, run
wget -r -A ogg http://www.gutenberg.org/files/19678/ogg/

You should have just got yourself the whole collection of audio book files.

We could do something much more complicated but more precise if we run
wget http://www.gutenberg.org/files/19678/ogg/
to get the index.html. That will help us practice our stream editing (i.e., sed) skills a lot.

OK, here's how we deal with index.html.
Run cat index.html to see what the content is.
We can agree that what we actually want are the something.ogg files.

cat index.html | sed 's/.*href=\"\(.*\)\".*/\1/'
This pulls out whatever sits inside href="...", which includes the .ogg file names. (Never mind the other junk for now. If you want to get rid of it, filter through grep first, like this:
cat index.html | grep '\.ogg' | sed 's/.*href=\"\(.*\)\".*/\1/' )
Now the file names are separated out, and we can go further:
cat index.html | grep '\.ogg' | sed 's/.*href=\"\(.*\)\".*/\1/'| xargs -i echo wget http://www.gutenberg.org/files/19678/ogg/{}
Pipe that to bash and we get what we want.
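
If you'd like to try the pipeline without downloading anything, here is a small sketch. The file sample_index.html and the chapter-NN.ogg names are made up for illustration; the real index.html will look different. I've also made two changes of my own that are not in the command above: [^"]* instead of .* inside the parentheses, so the match stops at the first closing quote, and -I{} instead of -i, because the GNU xargs manual marks -i as deprecated.

```shell
# Build a tiny stand-in for index.html (the file names are made up for illustration).
cat > sample_index.html <<'EOF'
<html><body>
<a href="?C=N;O=D">Name</a>
<a href="chapter-01.ogg">chapter-01.ogg</a>
<a href="chapter-02.ogg">chapter-02.ogg</a>
</body></html>
EOF

# Keep only the .ogg lines, pull out the href value,
# and print (not run) one wget command per file.
commands=$(grep '\.ogg' sample_index.html \
  | sed 's/.*href="\([^"]*\)".*/\1/' \
  | xargs -I{} echo wget http://www.gutenberg.org/files/19678/ogg/{})
echo "$commands"
```

Nothing is downloaded here, because echo only prints the wget commands. Adding | bash at the end is what would actually run them.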

You may think the second command is long, complicated and stupid, but the truth is that it's much more flexible, and this style is much more widely used. The first command works because wget is kind of smart and figures out what to do when it's given a directory. But most people don't know the -r parameter, and I had to look it up too. The second command depends less on one tool's special features and more on standard Linux usage, so let's keep it in mind and forget the first one :).

Now, let's come back to look at the command :
cat index.html | grep '\.ogg' | sed 's/.*href=\"\(.*\)\".*/\1/'| xargs -i echo wget http://www.gutenberg.org/files/19678/ogg/{}

You know what cat index.html does, and you know what xargs -i paired with {} and echo does. grep is a filter tool. sed is one of the subjects of this section. Please ignore all the \ for now; they just tell the computer that we mean something special. For example, \. is a literal dot rather than "any character", and \1 stands for the text captured between \( and \) rather than the digit 1.
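
If you're curious about the \ anyway, here is a minimal sketch on a single made-up line (chapter-01.ogg is a hypothetical file name, not one from the real index). It only shows what \( \), \1 and \. do; the full story comes with regular expressions.

```shell
# A single made-up line, similar in shape to what index.html contains.
line='<a href="chapter-01.ogg">chapter-01.ogg</a>'

# \( and \) mark the part of the match to remember; \1 pastes it back.
# Everything else on the line is matched by .* and thrown away.
captured=$(echo "$line" | sed 's/.*href=\"\(.*\)\".*/\1/')
echo "$captured"                              # chapter-01.ogg

# A plain dot matches any character, so it also hits the x in xogg.
echo 'xogg a.ogg' | sed 's/.ogg/HIT/g'        # HIT aHIT

# \. matches only a literal dot.
echo 'xogg a.ogg' | sed 's/\.ogg/HIT/g'       # xogg aHIT
```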

sed is so flexible at text editing, and grep is so widely used for searching, because both can use regular expressions. Regular expressions are a topic worth learning in their own right, so we'll postpone that to the next section.

Let's focus on the basic structure of the tr and sed commands first. tr is a character-to-character text editing tool, while sed works on whole strings. That's the difference.

Now,
seq 1 100 > new.txt
cat new.txt | tr 1 2
You can see that wherever there was a 1 there is now a 2.
cat new.txt | tr [12] [34]
1 is changed to 3, and 2 is changed to 4. (The brackets are not needed here; tr 12 34 does the same thing.)
cat new.txt | tr 1 23
This is kind of a wrong command. It only translates 1 to 2; the extra 3 is ignored.
You can also delete things with -d:
cat new.txt | tr -d 1
All the ones are killed. 11 is gone completely, leaving an empty line. (This is something to bear in mind because, as we'll see, sed needs an extra parameter to change strings "globally".)
So that's enough for tr.
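
Here is a quick sketch of the same three tr behaviours on a short input, so the result fits on one line. The input string is just something I made up for the demo.

```shell
# Character-by-character translation: 1 becomes 3 and 2 becomes 4.
echo '12 21 112' | tr 12 34        # 34 43 334

# Extra characters in the second set are ignored: only 1 -> 2 happens.
echo '12 21 112' | tr 1 23         # 22 22 222

# -d deletes every occurrence of the listed character.
# The four output lines are: 9, 0, an empty line (that was 11), and 2.
seq 9 12 | tr -d 1
```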

I'll only explain some of sed's usage here. More will be covered after regular expressions.
The typical structure of a sed command is like this: sed 's/x/y/f'
s is for substitution, which means that we're doing string editing. x is the pattern we want to match and y is what we want to replace it with. f stands for an optional flag; we usually put g there to mean globally, i.e., when multiple x's appear in a line, they all get changed. Without g, only the first x on each line is changed.
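
A minimal sketch of what the g makes possible, using a made-up input line:

```shell
# Without a flag, only the first match on each line is replaced.
echo 'aaa bbb aaa' | sed 's/aaa/ccc/'      # ccc bbb aaa

# With g, every match on the line is replaced.
echo 'aaa bbb aaa' | sed 's/aaa/ccc/g'     # ccc bbb ccc

# Unlike tr, sed matches the whole string 11, not each character 1.
# The four output lines are: 9, 10, eleven, 12.
seq 9 12 | sed 's/11/eleven/'
```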

Now, let's do something to our favorite txt file, call tab.
cat callw10.txt | grep 'Buck' to see all the lines with Buck. (I'm not cheating using grep. I just want you to see it more clearly. No worries, grep will be covered. And this should already seem kind of clear to you.)

I don't know if you like the name Buck or not, I like Paul better. So:
cat callw10.txt | grep 'Buck' | sed 's/Buck/Paul/g'
And the fact that they both have 4 characters is just a coincidence. You can absolutely pick your favorite name.

OK, let's do this:
I wanna plagiarize my own version of The Call of the Wild. However, I'm not that good at writing. I just wanna change Buck to Paul, good to great, and softly to gently. So I put them in a file.

cat > thesaurus.txt
Buck Paul
good great
softly gently

Press Ctrl-D on an empty line to finish the file. Now run this command:
cat thesaurus.txt | sed 's/\(.*\)\ \(.*\)/s\/\1\/\2\/g/g' | tr '\n' ';' | tr -d '\n' | xargs -i echo sed \'{}\'
See what you've got (we'll pipe it to bash in a moment). Whatever pairs of words you put in the file, this generates a sed command that replaces the first word of each pair with the second word. The first sed turns each line "word1 word2" into s/word1/word2/g so it's ready to be used as a sed argument. The first tr turns the newlines into ; (after that there are no newlines left, so the tr -d '\n' is only a safety net), and the rest you should know. Again, please ignore the \; they are there for the computer. I actually write the command first and then add \ wherever something could be ambiguous to the computer.
So to apply this to our favorite txt file, The Call of the Wild, we can run:
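
To watch the stages one at a time, here is a sketch. demo_thesaurus.txt is a stand-in file name I'm using so your real thesaurus.txt isn't touched, and the last step is a variation of mine: it hands the rules straight to sed through a variable instead of echoing a command for bash.

```shell
# A small stand-in for thesaurus.txt (same three pairs, hypothetical file name).
printf 'Buck Paul\ngood great\nsoftly gently\n' > demo_thesaurus.txt

# Stage 1: each "old new" line becomes one sed substitution.
sed 's/\(.*\) \(.*\)/s\/\1\/\2\/g/' demo_thesaurus.txt
# s/Buck/Paul/g
# s/good/great/g
# s/softly/gently/g

# Stage 2: join the lines with ; so they fit in a single sed argument.
rules=$(sed 's/\(.*\) \(.*\)/s\/\1\/\2\/g/' demo_thesaurus.txt | tr '\n' ';')
echo "$rules"
# s/Buck/Paul/g;s/good/great/g;s/softly/gently/g;

# Stage 3 (my variation): give the joined rules to sed directly.
result=$(echo 'Buck was a good dog and barked softly.' | sed "$rules")
echo "$result"
# Paul was a great dog and barked gently.
```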
cat thesaurus.txt | sed 's/\(.*\)\ \(.*\)/s\/\1\/\2\/g/g' | tr '\n' ';' | tr -d '\n' | xargs -i echo cat callw10.txt \| sed \'{}\' | bash

Again, forget about the \, even though I spent most of my time on this article dealing with that quoting. That's not the point. So expand your thesaurus and plagiarize smart :) Hehe, kidding. I believe this command runs into trouble when the thesaurus has too many words; basically, there's a limit on how long a command can be. So it's probably better to do it with a script. In a few sections, we'll talk about that.
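
One possible way around the length problem, as a sketch rather than the script promised for later: write the substitutions into a file and let sed read them with -f. The demo_ file names and the two sample sentences are made up; with the real files you would use thesaurus.txt and callw10.txt.

```shell
# Stand-in inputs (hypothetical names and text, so the real files stay untouched).
printf 'Buck Paul\ngood great\nsoftly gently\n' > demo_thesaurus.txt
printf 'Buck ran softly.\nIt was a good day.\n' > demo_text.txt

# Write one substitution per line into a sed script file.
sed 's/\(.*\) \(.*\)/s\/\1\/\2\/g/' demo_thesaurus.txt > demo_rules.sed

# -f tells sed to read its commands from the file, so the command line
# stays short no matter how many word pairs the thesaurus holds.
sed -f demo_rules.sed demo_text.txt
# Paul ran gently.
# It was a great day.
```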

Linux Bash Tutorial 4: the power of regular expressions.
