The Entropy of English vs. Chinese

by Bob Carpenter

I (Bob) have long been fascinated by the idea of comparing the communication efficiency of different languages. Clearly there’s a noisy-channel problem that languages have in some way optimized through evolution. There was some interesting discussion recently by Mark Liberman on Language Log in the entry “Comparing Communication Efficiency Across Languages” and in his reply to a follow-up by Bob Moore, “Mailbag: comparative communication efficiency.”

Mark does a great job of pointing out what the noisy channel issues are and why languages might not all be expected to have the same efficiency. He cites grammatical marking issues like English requiring articles, plural markers, etc., on every noun.

The spoken side is even more interesting, and not just because spoken language is more “natural” in an evolutionary sense. Just how efficiently (and accurately) the sounds of a language are encoded in its characters plays a role in the efficiency of the writing system. For instance, Arabic orthography doesn’t usually encode short vowels, so you need to use context to sort them out. Vowel diacritics exist, but they are conventionally employed only in important texts, like the Qur’an.

I would add to Mark’s inventory of differences between English and Chinese the fact that English has a lot of borrowings on both the lexical and spelling side, which increase entropy. That is, we could probably eke out some gains by recoding “ph” as “f”, collapsing the distinction among reduced vowels, and so on; for instance, we wouldn’t have to code the difference between “Stephen” and “Steven”, which is present only in the written language (at least in my dialect).

There are lots of other differences. It may seem that Chinese doesn’t waste bits coding spaces between words, or encoding capitalized versus uncapitalized letters. But when I was working on language modeling in LingPipe, I tested the compressibility (with character n-grams ranging from 5 to 16) of English drawn from LDC’s Gigaword corpus, with and without case normalization. Surprisingly, the unnormalized version could be compressed more, indicating that even though case adds superficial distinctions (higher uniform-model entropy), it in fact added more information than it took away. Ditto for punctuation. I didn’t try removing spaces, but I should have.
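
As a rough way to check this kind of claim without a full language model, you can compare off-the-shelf compression rates on raw versus case-normalized text. A minimal sketch, assuming a local sample file (the file name is hypothetical, and bz2 is only a crude proxy for the character n-gram models I actually used):

    import bz2

    def bits_per_char(text):
        """Compressed size in bits divided by the number of characters."""
        compressed = bz2.compress(text.encode("utf-8"), 9)
        return 8.0 * len(compressed) / len(text)

    # Hypothetical sample file standing in for Gigaword text.
    with open("gigaword_sample.txt", encoding="utf-8") as f:
        raw = f.read()

    print("raw:             %.3f bits/char" % bits_per_char(raw))
    print("case-normalized: %.3f bits/char" % bits_per_char(raw.lower()))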

I also found, counterintuitively, that MEDLINE could be compressed more tightly than Gigaword English. So even though it looks worse to non-specialists, it’s actually more predictable.

So why can’t we measure entropy? First of all, even the Gigaword New York Times section is incredibly non-stationary. Evaluations on different samples show much more variance than would be expected if the data were stationary.
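
One way to see the non-stationarity is to measure compressibility chunk by chunk and look at the spread. A sketch along the same lines as above (again with a hypothetical file name and bz2 as a stand-in):

    import bz2
    import statistics

    def bits_per_char(text):
        compressed = bz2.compress(text.encode("utf-8"), 9)
        return 8.0 * len(compressed) / len(text)

    # Hypothetical file standing in for the Gigaword NYT section.
    with open("nyt_sample.txt", encoding="utf-8") as f:
        corpus = f.read()

    # Carve the corpus into equal-sized chunks and compress each separately.
    chunk_size = 100000
    chunks = [corpus[i:i + chunk_size]
              for i in range(0, len(corpus) - chunk_size + 1, chunk_size)]
    rates = [bits_per_char(chunk) for chunk in chunks]

    print("mean:  %.3f bits/char" % statistics.mean(rates))
    print("stdev: %.3f bits/char" % statistics.stdev(rates))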

Second of all, what’s English? We can only measure the compressibility of a corpus, and corpora vary by content.

Finally, why can’t we trust Brown et al.’s widely cited paper? Because the result depends on what background training data is used. They used a ton of data from sources “similar” to what they were testing, and the problem with this game is how close you’re allowed to get: given the test set, it’s pretty easy to engineer a training set by carefully culling data. We might instead try to compress a fixed corpus, but that leads to all the usual problems of overtraining; this is the approach of the Hutter Prize (based on compressing Wikipedia). So instead, we create baskets of corpora and evaluate those, with the result that there’s no clear “winning” compression method.
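
To make the training-data dependence concrete, here is a sketch that trains a character n-gram model on two different corpora and scores the same test set. The file names are hypothetical, and the add-one smoothing is a deliberately crude stand-in for the interpolated smoothing a real language model would use; the point is only that the bits-per-character estimate moves with the choice of training data.

    import math
    from collections import Counter

    def train_char_ngrams(text, n):
        """Count n-grams and their (n-1)-character contexts."""
        counts, contexts = Counter(), Counter()
        padded = " " * (n - 1) + text
        for i in range(len(text)):
            gram = padded[i:i + n]
            counts[gram] += 1
            contexts[gram[:-1]] += 1
        return counts, contexts

    def cross_entropy(text, counts, contexts, n, vocab_size):
        """Bits per character under add-one smoothing (a crude stand-in
        for properly interpolated character n-gram models)."""
        padded = " " * (n - 1) + text
        total = 0.0
        for i in range(len(text)):
            gram = padded[i:i + n]
            p = (counts[gram] + 1.0) / (contexts[gram[:-1]] + vocab_size)
            total -= math.log2(p)
        return total / len(text)

    # Hypothetical files: the same test set scored against two training sets.
    n = 5
    with open("test.txt", encoding="utf-8") as f:
        test = f.read()
    for train_file in ("similar_domain.txt", "different_domain.txt"):
        with open(train_file, encoding="utf-8") as f:
            train = f.read()
        counts, contexts = train_char_ngrams(train, n)
        vocab_size = len(set(train)) + 1  # +1 leaves room for unseen characters
        print("%s: %.3f bits/char"
              % (train_file, cross_entropy(test, counts, contexts, n, vocab_size)))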

3 Responses to “The Entropy of English vs. Chinese”

  1. Dylan Thurston Says:

    You talk about spelling distinctions in English not present in the spoken language, but oddly don’t mention the much more prevalent distinctions in Chinese not present in speech: there are generally many written characters representing a given spoken syllable. A native speaker has no trouble reconstructing the character from the spoken form (except for some names), but my understanding is that modern written Chinese uses somewhat fewer characters because of the extra distinctions available. (Ancient written Chinese was much more compressed.) This surely affects compressibility.

  2. Bob Carpenter Says:

    Dylan: Good point about the channel side of the problem (converting speech to characters); I was only thinking of the source side of the noisy channel (what sequences of characters are likely). It really is the whole source/channel system that should be measured for efficiency.

    Transduction can go either way between sounds and characters, depending on whether you’re listening (speech recognition) or producing (synthesis). If you view the channel as carrying characters encoded as sounds, then you have p(characters) and p(sounds|characters) as source and channel model. If you view the channel as carrying sounds encoded as characters, you have p(sounds) and p(characters|sounds) respectively.
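
    The two views are linked by Bayes’s rule; in the same notation,

        p(characters | sounds) = p(sounds | characters) p(characters) / p(sounds)

    so either factorization of the joint p(characters, sounds) supports decoding in the other direction.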

    The issue of “entropy of language X” is about just the source, p(characters) or p(sounds) depending on whether you’re modeling the language as sequences of characters or sequences of sounds.

    The truly interesting case is to take the source as “meaning” and the channel as carrying either characters or sounds. We don’t quite have a handle on that yet, I’m afraid.

    It’d be interesting if the various versions of Chinese (e.g. traditional vs. simplified vs. ancient — my knowledge of Chinese is very limited) changed the encoding rates. It’d be the clearest example of the point Mark Liberman was trying to make in the Language Log post.

  3. tiflo Says:

    Hi Bob,

    Somebody pointed me to this page and I thought you might be interested in work that one of the students here at Rochester, Ting Qian, has been doing on constant entropy rate (the distribution of entropy throughout discourses) in Chinese. There is a paper under submission covering written and spoken Chinese, along with a replication of Genzel and Charniak 2002 on English (using an approach that provides better control for potential confounds). The paper should soon be available at http://www.tqian.org/pub.html, but I can forward you a copy if you’re interested. Not quite what you were talking about, but very related.
