next up previous
Next: Bibliography

Information Theory: Entropy and Huffman Codes

David Falke

April 15, 2004

Information Theory is a mathematical model of communication that specifically deals with the transmission of information. Claude Shannon developed this theory in an essay entitled ``A Mathematical Theory of Communication" which was published in The Bell System Technical Journal, Vol. 27, July and October, 1948. Information Theory aims to describe data transmission in the language of mathematics to find realistic bounds for transmission rates over communication lines. The theory is grounded in probability and statistics.

A basic model for a general communication system is as follows:
$INFORMATION SOURCE \rightarrow TRANSMITTER \rightarrow SIGNAL \rightarrow RECEIVER \rightarrow DESTINATION$
In the context of this schema, information is basically defined as something a transmitter transmits to a receiver which the receiver does not already know. In this way, information is not redundant and actually is, in the everyday sense of the word, information. Another way to view information is by the amount of uncertainty in the outcome of a specific event. This concept is known as entropy and plays a large role in encoding information. One way that Shannon suggests to encode information is by using what is called a Huffman Code. Huffman Codes are based on probability distributions of event outcomes. In general, the lower the probability of a given event occurring, the higher the entropy of said event. Likewise, an encoding scheme becomes more complicated as it compresses data closer to its sources entropy.

Entropy is a measure of the uncertainty of the outcome of a an event. As stated in Cryptography with Coding Theory, this measure must satisfy the following four conditions:

  1. To each set of nonnegative numbers $p_1, \ldots, p_n$ with $p_1+ \cdots +p_n = 1$, the uncertainty is given by a number $H(p_1, \ldots ,p_n)$.
  2. $H$ should be a continuous function of the probability distribution, so a small change in the probability distribution should not drastically change the uncertainty.
  3. $H(1/n, \ldots, 1/n) \leq H(1/(n+1), \ldots, 1/(n+1))$ for all $n > 0$. In other words, in situations where all the outcomes are equally likely, the uncertainty increases when there are more possible outcomes.
  4. If $0 < q < 1$, then $H(p_1, \ldots ,qp_j,(1-q)p_j, \ldots ,p_n) = H(p_1, \ldots ,p_j, \ldots ,p_n)+p_jH(q,1-q)$.
Entropy is thus define as
$H(X) = - \sum_{x \in X} p(x) log_2 p(x)$,
where $p(x)$ is the probability of the outcome $x$ occurring for event $X$, and when $p(x) = 0$, $p(x) log_2 p(x)$ is defined as 0.

As one can see, the lower the probability of an event $x$ occurring, the higher its entropy. A more meaningful interpretation of entropy is given in Cryptography with Coding Theory as the number of yes-no questions needed to determine the outcome of the event. E.g., if $H(X) = e$, then one must ask an average of $e$ yes-no questions to determine the outcome of event $X$. This naturally leads to a binary representation of the outcome of an event as a string of bits. For instance, where 0 means no and 1 means yes, each bit represents the answer to a specific yes-no question. So, if $H(X) = e$, on average, one must use a binary string of $e$ bits to represent the outcome.

It is also important to note that entropy is at its minimum when $H(X) = 0$. I.e., for some $x \in X$, $x$ has a probability of 1. In this case one need not send any information to the receiver since the outcome is always guaranteed, in this case, to be $x$. Entropy is at its maximum when $H(X) = log_2 n$. In other words, for all $x_1, x_2 \in X$, the probability of $x_1$ is equal to the probability of $x_2$. In this case, a maximum amount of information must be sent to the receiver.

When transmitting information, it is natural to want to use a binary representation as described above to encode the outcome of the event. One way of doing this is with the use of Huffman Codes. Huffman Codes exploit the probabilistic nature of information to achieve the most efficient representation possible. One of the most well-known Huffman Codes is Morse Code. In Morse Code, the most frequently used letters are given the smallest representation, while the least frequently used letters are given the largest representation. The overall goal of a Huffman Code is to achieve the lowest average amount of bits per outcome as possible. In this way, one is able to transmit as little data as possible over a communications line, allowing for maximum speed of data transmission. As explained above, entropy gives a theoretical lower bound for the average amount of data needed to represent the outcome of an event.

A simple example of a Huffman code is as follows: Given an alphabet of six letters (A, B, C, D, E, F), and a probability distribution (p(A) = 0.4, p(B) = 0.1, p(C) = 0.1, p(D) = 0.1, p(E) = 0.2, and p(F) = 0.1), we construct the following tree.

\includegraphics[height=7cm]{Huffmantree}

Now, to find the binary representation of a letter, we simply follow the edges coming out from the letter of interest. For example, we represent A as 1, and D as 1000. In this example, each letter is represented by an average of 2.4 symbols, while the entropy is approximately 2.32. Thus, we cannot find a representation much better than this one, if at all.

Another application of entropy is the English language. If we simply consider the probability of each letter on its own and find the entropy in this way, we get a result approximately 4.18. But things are not so simple. For instance, if one sees the letter $q$, one would expect the next letter to be $u$. Thus, the probability of $u$ following $q$ is extremely close to one. It follows that if a word contains the letter pair $qu$ one could simply send $q$. Thus, one would expect an entropy even lower than 4.18, since we can consider the probability distribution over letter pairs of the English language. But we can go further and consider letter triples, quadruples, and so on. Thus, a natural way to define the entropy of the English language is $H(English) = \lim_{x \to \infty} \frac{H(L^x)}{x}$, where $L^x$ denote the $x$-letter combinations in the English language, and $H(L^x)$ the entropy of $L^x$. Surprisingly, it turns out that the entropy of the English language is about 1, or a little more.

As one can see, entropy and Huffman Codes are extremely useful for efficient transmission of information. Entropy gives a theoretical lower bound for the number of symbols needed to represent the outcome of a given event. Huffman Codes provide a way to near this lower bound. Most importantly, Information Theory allows a way of expressing and talking about information that does not rely on the semantic value of the information. By allowing such a representation, Shannon was able to develop his theory into a complete mathematical model. His ideas and insights have influenced the development of communication since the date of publication of his theory.




next up previous
Next: Bibliography
Dave Falke 2004-04-29