Dealing with character encodings is (sometimes) hard. It's especially confusing for those who've never done it before. Converting text from Unicode to ASCII can be tricky.
A lot of times, I'll import some data from a text file, and I just want to convert everything to ASCII and drop anything that's not ASCII (like MS Word's smart quotes). Luckily, for a byte string straight from a file, this is fairly easy:
mystring = mystring.decode('ascii', 'ignore')
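One catch worth spelling out: the one-liner above works on a byte string. If the text has already been decoded to Unicode, you'd go the other way with encode. Here's a rough sketch of both cases, assuming Python 3 (where bytes and str are separate types); the sample strings and variable names are made up for illustration.

```python
# Case 1: raw bytes, e.g. from a file opened in binary mode.
# \x93 and \x94 are Windows-1252 smart quotes (sample data, made up).
raw = b"\x93smart quotes\x94 and plain text"
from_bytes = raw.decode("ascii", "ignore")  # non-ASCII bytes are dropped

# Case 2: text that is already a Unicode string.
text = "\u201csmart quotes\u201d and plain text"
# Encode to ASCII, dropping what doesn't fit, then decode back to str.
from_text = text.encode("ascii", "ignore").decode("ascii")

print(from_bytes)
print(from_text)
```

Either way, 'ignore' silently throws characters away, so this only makes sense when losing them is acceptable.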
There are tons of great Python resources (and code!) for all your character encoding needs. In no particular order, here are a few I've found:
- A Crash Course in Character Encoding
- Dive Into Python's Chapter on Unicode
- Beautiful Soup gives you Unicode, Dammit and there's the companion: ASCII, Dammit
- There's also unaccent.py, which seems to convert various Unicode characters to their ASCII equivalents.
There are probably more, but most of these have helped me get the job done.
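If you'd rather keep a plain "e" for "é" instead of dropping the character entirely, the standard library can get you part of the way. This is not how unaccent.py does it, just a minimal sketch using unicodedata and assuming Python 3; the function name to_ascii is my own invention.

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII version of text (hypothetical helper)."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are non-ASCII, so 'ignore' drops them and keeps the base letter.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café naïve"))  # cafe naive
```

Characters with no decomposition (smart quotes, for example) still just disappear, so a lookup table like the one in unaccent.py may do better for those.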