The problem#
I wanted to find a specific Unicode character I had used somewhere in a previous blog post. But since I didn't know exactly which character it was, I wasn't sure how to search for it.
The solution#
Instead, I wrote a script based on this post to simply list all of the characters appearing in any file in a given directory:
cat * | sed 's/./&\n/g' | sort -u
The output is already small, but to reduce the noise further, this version strips out English letters, digits, and common symbols:
cat * \
| tr -d 'a-zA-Z0-9!@#$%^&*()_+=`~,./?;:"[]{}<>|\\'"'-" \
| sed 's/./&\n/g' | sort -u
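As a rough sketch, the same pipeline can be wrapped in a shell function that takes the directory as an argument. The function name list_unusual_chars and the default of the current directory are made up for illustration, and the sed expression assumes GNU sed, where \n in the replacement means a newline.

```shell
# Hypothetical helper: list each distinct character found in the files
# directly inside the given directory (default: current directory),
# after stripping English letters, digits, and common symbols.
list_unusual_chars() {
    # Errors from cat (e.g. for subdirectories) are discarded.
    cat -- "${1:-.}"/* 2>/dev/null \
    | tr -d 'a-zA-Z0-9!@#$%^&*()_+=`~,./?;:"[]{}<>|\\'"'-" \
    | sed 's/./&\n/g' | sort -u
}

# Usage (the path is only an example):
#   list_unusual_chars path/to/posts
```

Spaces are not in the tr list, so a line containing a single space will usually show up in the output, along with one empty line.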
The details#
Separating characters#
"How to Loop Over Each Character in a Bash String"
recommends a few different ways to iterate over the characters in a string using a for loop in Bash. But that's a lot of typing for a shell one-liner, so I took the suggested sed 's/./& /g', which inserts a space after every character, and changed it to sed 's/./&\n/g' to insert a newline after every character instead. (The replacement & in sed is the entire matched string.)
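To see the difference between the two sed expressions, here is a small sketch run on a three-letter string. The helper name split_chars is hypothetical, and the \n in the replacement assumes GNU sed (other sed implementations may treat it differently).

```shell
# Hypothetical helper wrapping the newline-inserting expression.
# Assumes GNU sed, which treats \n in the replacement as a newline.
split_chars() {
    sed 's/./&\n/g'
}

# The original suggestion: a space after each character.
printf 'abc\n' | sed 's/./& /g'

# The modified version: one character per line.
# The last character is also followed by a newline, so each input
# line contributes one extra empty line to the output.
printf 'abc\n' | split_chars
```

The extra empty lines are harmless here because sort -u collapses them into a single blank line.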
Look ma, no uniq#
Since we don't care about how many times each character appears, we can just use the -u/--unique option to sort instead of writing sort | uniq. If you do want the counts, then write sort | uniq -c instead of sort -u.
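For the counting variant, a minimal sketch on a sample string might look like this. The trailing sort -rn is an addition for illustration, not part of the original pipeline; it orders the characters from most to least frequent. The sed expression again assumes GNU sed.

```shell
# Split a sample string into one character per line (GNU sed syntax),
# then count how often each character appears, most frequent first.
# The empty line produced at the end of each input line is counted too.
printf 'banana\n' | sed 's/./&\n/g' | sort | uniq -c | sort -rn
```

Each output line is a count followed by the character, so the first line for this input reports 3 occurrences of a.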
What's that mess of symbols after tr?#
I just typed in all of the symbols across the top row of my keyboard, and then added a couple more that showed up in the output. One catch is that I wanted to include - in the list of symbols, but it's a special character for specifying ranges. Originally, I just put it in the list of symbols:
tr -d 'a-zA-Z0-9!@#$%^&*()-_`~"[]{}|'"'"
and didn't realize it was being interpreted as a range. This wasn't really a problem, as the characters between ) and _ in ASCII order are all on the keyboard anyway, but it can be avoided by making sure - is the last character in the list.
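The effect of the accidental range is easy to reproduce on a small input. This is only an illustration with a made-up test string; it relies on tr treating )-_ as the range of characters from ) to _ in ASCII order.

```shell
# With '-' between ')' and '_', tr deletes the whole range from ')' to '_'.
# That range includes '+', '-', the digits, and the uppercase letters.
printf 'A+b-c_1\n' | tr -d ')-_'    # prints "bc"

# With '-' as the last character, it is a literal hyphen:
# only ')', '_' and '-' are deleted.
printf 'A+b-c_1\n' | tr -d ')_-'    # prints "A+bc1"
```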
Unicode support#
Although there are warnings online that these tools may not support multi-byte characters, on my system they actually appear to handle them just fine. The output includes multi-byte characters without any mangling.
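Whether . in sed matches a whole multi-byte character depends on the active locale, at least for GNU sed. A quick way to check another system, assuming GNU sed and UTF-8 encoded input, is to count the lines produced for a single two-byte character:

```shell
# '\303\251' is the two-byte UTF-8 encoding of é.
# In the C locale, GNU sed treats every byte as its own character,
# so é is split across two lines (plus the usual trailing empty line).
printf '\303\251\n' | LC_ALL=C sed 's/./&\n/g' | wc -l

# The same command under the current locale. One line fewer than the
# C-locale result means é was kept intact as a single character.
printf '\303\251\n' | sed 's/./&\n/g' | wc -l

# Show the character encoding of the current locale.
locale charmap
```

The tr step is less of a concern for UTF-8 input because it only deletes ASCII characters, and ASCII byte values never occur inside a UTF-8 multi-byte sequence.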