For creating audio-books I use a text-to-speech engine. One problem is that the application dies on non-ASCII Unicode characters. The documents that I encode are too long to correct manually, so I want it automated. The correction isn’t as simple as removing every non-ASCII character though, because where possible I don’t want to lose the meaning of a character that is easily converted to ASCII.
For example, here are some transliterations that ought to occur:
- ¢ → cents
- © → copyright
- ™ → trademark
- ∀ → for all
- ♥ → heart
- ∂ → derivative
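As a minimal sketch of what such a conversion could look like, here is a plain lookup table in Python using only the six examples listed above. The names `REPLACEMENTS` and `to_ascii` are made up for illustration, and dropping unknown characters is an assumption about what is acceptable for the speech engine. Note that it collapses all whitespace, including newlines.

```python
# -*- coding: utf-8 -*-

# Hypothetical table: only the six examples from the list above.
REPLACEMENTS = {
    u"¢": u"cents",
    u"©": u"copyright",
    u"™": u"trademark",
    u"∀": u"for all",
    u"♥": u"heart",
    u"∂": u"derivative",
}


def to_ascii(text, table=REPLACEMENTS):
    """Spell out known characters and drop every other non-ASCII character."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        word = table.get(ch, u"")
        if word:
            # Pad with spaces so the word does not fuse with its neighbours.
            out.append(u" " + word + u" ")
    # Collapse the whitespace runs introduced by the padding.
    return u" ".join(u"".join(out).split())


print(to_ascii(u"I ♥ Unicode © 2016"))  # I heart Unicode copyright 2016
print(to_ascii(u"12¢"))                 # 12 cents
```

A hand-written table like this keeps the engine from crashing but only covers what you remember to put in it, which is why a library with broad coverage is attractive.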
I’m more concerned with not breaking the text-to-speech engine, but having a large breadth of transliterations would be nice. With that in mind I started looking for solutions and whittling them down to one:
- Revision 1 columns
  - PKG/URL: package name and GitHub URL
  - Lang: programming language
  - Str: number of stars
  - Notes
- Revision 2 columns. I want something well supported and easy to run.
  - #C: number of committers
  - C: most recent commit, in Hours, Days, Months, or Years
PKG/URL | Lang | Str | Notes | #C | C |
---|---|---|---|---|---|
iki/unidecode | Python | 75 | Clone of. +. +. | 8 | Y |
Text-Unidecode | Perl | 1 | The original. | 1 | Y |
rainycape/unidecode | Go | 12 | NA. | | |
xuender/unidecode | Java | 35 | NA. | | |
node-unidecode | JS | 70 | Curious. | | |
UnidecodeR | R | 58 | Good to know! | | |
sindikat/unidecode | Elisp | 2 | NA. | | |
silverstripe-unidecode | PHP | 8 | NA. | | |
The Python port looks like the most actively maintained, and Python is always a good choice. The author’s discussion of his port is interesting for programmers. In theory we design systems that use Unicode even though we know that they’ll have to interoperate with ASCII-only systems. In practice it is usually an afterthought that results in well-hidden bugs and exploits. It makes you wonder whether we would be better off building only ASCII-only systems today.
Here is how to get it set up with `virtualenv` on OS X and `brew`:
Review this and this and verify that you have a Python build with the Unicode support for “wide” characters. For transliterating Blackboard bold, you need this.
This code should print 1114111 (not 65535):
import sys
print(sys.maxunicode)  # parenthesized form works under both Python 2 and 3
65535
This is the wrong Python build. It needs to be `ucs4` instead of `ucs2`. A fair number of people seem to use `ucs4` (here, here, here, here).
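To make the narrow versus wide distinction concrete, here is a small sketch using only the standard library. The helper name `is_wide_build` is hypothetical. U+1D538 is picked as one example of a Blackboard bold letter that lives outside the Basic Multilingual Plane. On Python 3.3 and later every build behaves like a wide one, so this check only matters for Python 2.

```python
import sys
import unicodedata


def is_wide_build():
    """True when one character can hold any Unicode code point."""
    # Wide (ucs4) builds report 0x10FFFF (1114111); narrow (ucs2) report 0xFFFF.
    return sys.maxunicode == 0x10FFFF


# One Blackboard bold letter that does not fit in 16 bits.
DOUBLE_STRUCK_A = u"\U0001D538"

print(is_wide_build())
# Length is 1 on a wide build and 2 (a surrogate pair) on a narrow build.
print(len(DOUBLE_STRUCK_A))
if is_wide_build():
    # Looking up a single character by name needs the wide representation.
    print(unicodedata.name(DOUBLE_STRUCK_A))
```

On a narrow build the character is stored as two surrogate code units, which is why per-character table lookups for such letters fail there.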
This explains common CFFI errors from systems with both `ucs2` and `ucs4` installations that are “mixed up”:
Here is how you know that there is a problem:
This is about getting an ImportError about `_cffi_backend.so` with a message like `Symbol not found: _PyUnicodeUCS2_AsASCIIString`. This error occurs in Python 2 as soon as you mix “ucs2” and “ucs4” builds of Python. It means that you are now running a Python compiled with “ucs4”, but the extension module `_cffi_backend.so` was compiled by a different Python: one that was running “ucs2”. (If the opposite problem occurs, you get an error about `_PyUnicodeUCS4_AsASCIIString` instead.)
Here is the solution: a custom build of CFFI inside a virtualenv, though pyenv is also mentioned.
More generally, the solution that should always work is to download the sources of CFFI (instead of a prebuilt binary) and make sure that you build it with the same version of Python than the one that will use it. For example, with virtualenv:
virtualenv ~/venv
cd ~/path/to/sources/of/cffi
~/venv/bin/python setup.py build --force # forcing a rebuild to make sure
~/venv/bin/python setup.py install
This will compile and install CFFI in this virtualenv, using the Python from this virtualenv.
This post explains another approach to get it running. Here is another one. This all looks fragile. Yuck. I’m going to set up a `vagrant` box instead.
Here is a start. It doesn’t build right now and I’m stuck. Pythonistas, what am I doing wrong here?
I took a crack at it in Emacs Lisp for fun:
http://pastebin.com/Tat9xqcK
It draws the replacement name from the character’s Unicode name, drops combining characters, and has a table for custom replacements (em dash, en dash, etc.). Spacing around replacements might be needed (“copyright2016” vs. “copyright 2016”). It also doesn’t pluralize, so “12¢” would come out as “12cent”, which is where the custom table helps.
I said “for fun” since I imagine your need is part of a larger build process and Emacs wouldn’t really fit in.
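For anyone who wants to try the same idea outside Emacs, here is a rough Python sketch of the approach described in this comment: Unicode names, dropped combining characters, and a custom table. It is not a port of the pastebin code. The names `CUSTOM`, `describe`, and `transliterate_by_name` are invented here, and the NFD decomposition step is an added assumption so that accented letters keep their base letter.

```python
import unicodedata

# Hypothetical custom table, consulted before the Unicode name.
CUSTOM = {
    u"\u2014": u"-",      # em dash
    u"\u2013": u"-",      # en dash
    u"\u00a2": u"cents",  # the name alone would give "cent sign"
}


def describe(ch):
    """Return an ASCII stand-in for one non-ASCII character ('' to drop it)."""
    if ch in CUSTOM:
        return CUSTOM[ch]
    if unicodedata.combining(ch):
        return u""  # combining marks are dropped
    # Fall back to the lowercased Unicode name, or nothing if it has none.
    return unicodedata.name(ch, u"").lower()


def transliterate_by_name(text):
    # Assumption: decompose first so an accented letter becomes
    # a base letter followed by a combining mark.
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if ord(ch) < 128:
            out.append(ch)
            continue
        word = describe(ch)
        if word:
            # Pad so that "copyright" and "2016" do not run together.
            out.append(u" " + word + u" ")
    # Collapse the whitespace runs introduced by the padding.
    return u" ".join(u"".join(out).split())


print(transliterate_by_name(u"\u00a9 2016"))  # copyright sign 2016
print(transliterate_by_name(u"12\u00a2"))     # 12 cents
```

The Unicode names are wordier than a hand-picked table ("copyright sign" rather than "copyright"), so in practice the custom table would probably grow to cover the characters that show up most often.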
“Fun” is appropriate here because it got me thinking about the process and how it is not a simple search and replace. Exploratory programming is indeed perfect for this.
Cool, thank you Chris!