Sunday 11:30 a.m.–12:20 p.m.

U is for Unicode: Solving the Mystery

Greg Back

Audience level:
Novice

Description

This talk will attempt to resolve some of the mystery and confusing behavior surrounding Unicode (and other text encoding issues) in Python. It will cover how Python handles text in general, the differences in Unicode text between Python 2 and Python 3, how various standard library APIs handle Unicode text, and a bit about detecting the encoding of unknown text.

Abstract

This talk will be presented as a series of "clues" to help understand text encoding issues (generally) and Unicode handling in Python (specifically). It will cover:

  1. Unicode and bytestring types in Python 2 and Python 3.
  2. Why the distinction between Unicode and UTF-8 is important.
  3. How system default settings can affect text handling.
  4. Some gotchas around Unicode normalization.
  5. How (and when) to "guess" at the encoding of unknown text.
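
As a rough illustration of clues 1, 2, 4, and 5, here is a minimal Python 3 sketch. It is not taken from the talk itself. The helper `guess_decode` and its list of candidate encodings are hypothetical, chosen only for demonstration; only the standard library is used.

```python
import unicodedata

# Clue 1: str holds Unicode code points; bytes holds raw encoded data.
text = "caf\u00e9"            # "café" as a str (4 code points)
data = text.encode("utf-8")   # the same text serialized as bytes

# Clue 2: Unicode is the character set, UTF-8 is one encoding of it.
# The accented letter is one code point but two bytes in UTF-8.
char_count = len(text)
byte_count = len(data)

# Clue 4: the same visible text can have different code point sequences.
composed = unicodedata.normalize("NFC", "cafe\u0301")    # e + combining accent -> single code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")   # single code point -> e + combining accent
same_without_normalizing = composed == decomposed        # False: compare only after normalizing


# Clue 5: a naive "guess" that tries candidate encodings in order.
# Hypothetical helper; the candidate order is an assumption, not a rule.
def guess_decode(raw, candidates=("utf-8",)):
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails,
    # which also means it can silently produce the wrong text.
    return raw.decode("latin-1"), "latin-1"
```

For example, `guess_decode(b"caf\xc3\xa9")` decodes as UTF-8, while `guess_decode(b"caf\xe9")` falls through to the latin-1 fallback. A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, which is why guessing is best kept as a last resort.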