A Weird Imagination

Shell script over characters

The problem#

I wanted to find a specific Unicode character I had used in some previous blog post. But since I didn't know exactly which character it was, I wasn't sure how to search for it.

The solution#

Instead, I wrote a script based on this post to simply list all of the characters appearing in any file in a given directory:

cat * | sed 's/./&\n/g' | sort -u

The output is already small, but to reduce the noise further, this version strips out the English letters, digits, and common symbols:

cat * \
    | tr -d 'a-zA-Z0-9!@#$%^&*()_+=`~,./?;:"[]{}<>|\\'"'-" \
    | sed 's/./&\n/g' | sort -u

The details#

Separating characters#

"How to Loop Over Each Character in a Bash String" recommends a few different ways to iterate over the characters in a string using a for loop in Bash. But that's a lot of typing for a shell one-liner, so I took its suggestion of sed 's/./& /g', which inserts a space after every character, and changed it to sed 's/./&\n/g', which inserts a newline after every character instead. (The replacement & in sed stands for the entire matched string.)
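
To see the difference between the two substitutions, here is a small sketch that runs both on a literal string rather than on real files. It assumes GNU sed, which treats \n in the replacement as a newline; other sed implementations may not. The variable names are just for illustration.

```shell
# Assumes GNU sed: \n in the replacement text becomes a newline.

# The original suggestion: a space after every character.
spaced=$(printf 'abc' | sed 's/./& /g')
printf '[%s]\n' "$spaced"    # prints [a b c ]

# The modified version: a newline after every character,
# so each character ends up on its own line.
lines=$(printf 'abc' | sed 's/./&\n/g')
printf '%s\n' "$lines"
```

One character per line is what lets line-oriented tools like sort work on individual characters.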

Look ma, no uniq#

Since we don't care about how many times each character appears, we can just use the -u/--unique option to sort instead of writing sort | uniq. If you do want the counts, then write sort | uniq -c instead of sort -u.
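
As a rough illustration of the two variants, the sketch below runs both on a made-up sample of one character per line. The trailing sort -rn, which puts the most frequent characters first, is an extra not used in the post.

```shell
# Hypothetical sample input: one character per line, with repeats.
sample=$(printf 'b\na\nb\nc\nb')

# Distinct characters only.
distinct=$(printf '%s\n' "$sample" | sort -u)
printf '%s\n' "$distinct"

# The same characters with occurrence counts, most frequent first.
counts=$(printf '%s\n' "$sample" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```

Note that uniq only collapses adjacent duplicate lines, which is why the counting variant still needs sort in front of it.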

What's that mess of symbols after tr?#

I just typed in all of the symbols across the top row of my keyboard, and then added a couple more that showed up in the output. One catch is that I wanted to include - in the list of symbols, but it's a special character for specifying ranges. Originally, I just put it in the list of symbols:

tr -d 'a-zA-Z0-9!@#$%^&*()-_`~"[]{}|'"'"

and didn't realize that )-_ was being interpreted as a range. This wasn't really a problem, as the symbols between ) and _ are all on the keyboard anyway, but it can be avoided by making sure - is the last character in the list.
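
The effect of where the - sits can be checked on a short test string. This is a minimal sketch with a made-up input; the sets are cut down to the three characters that matter.

```shell
# With - between ) and _, tr reads )-_ as a range covering every
# character from ) through _ in ASCII order, which includes the
# digits and the uppercase letters.
range_result=$(printf 'A-b_9' | tr -d ')-_')
printf '%s\n' "$range_result"      # prints b

# With - as the last character, it is a literal hyphen, so only
# ), _ and - themselves are deleted.
literal_result=$(printf 'A-b_9' | tr -d ')_-')
printf '%s\n' "$literal_result"    # prints Ab9
```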

Unicode support#

Although there are warnings online that these tools may not support multi-byte characters, on my system they appear to handle them just fine: the output includes multi-byte characters without any mangling.
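
Whether sed's . matches a whole character or a single byte depends on the locale, which may explain why results differ between systems. The sketch below is one way to check. It assumes GNU sed, and that a UTF-8 locale named C.UTF-8 is installed (locale -a lists what is available). count_pieces is a hypothetical helper, not part of the original script.

```shell
# Count how many pieces sed splits a single "é" into under the locale
# given as $1. The bytes \303\251 are the UTF-8 encoding of é.
count_pieces() {
    printf '\303\251' | LC_ALL="$1" sed 's/./&\n/g' | wc -l
}

# In the C locale each byte is matched separately.
count_pieces C          # prints 2

# Assumption: C.UTF-8 exists on this system. If it does, the two bytes
# are matched as one character and this prints 1.
count_pieces C.UTF-8
```

The tr step is less of a concern here because it only deletes ASCII characters, and in UTF-8 the bytes of a multi-byte character never fall in the ASCII range.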

