Posted on April 20, 2020

*A note: this post is mostly aimed at precocious high school students and early undergraduates. If you arrived here from ML twitter, you’ll probably find the content pretty elementary compared to some of my other stuff.*

Logarithms are magic.

At first glance, the logarithm wouldn’t necessarily strike you as a particularly significant function. The grade school introduction to the logarithm is thoroughly uninspiring: the logarithm (base c) of x is the power you need to raise c to in order to get x. For example \(\log_2 ( 8) = 3\), because \(2^3 = 8\). If you were like me, after you learned this fact you then had to compute a bunch of logarithms by hand, and to solve word problems about the doubling time of investments. If you went on to learn computer science, you probably saw logarithms again when you talked about the running time of binary search. You probably also came across them in chemistry and physics, where they were used in the formulas for the entropy of a system. You might have been told that entropy describes how chaotic a system is, but if your lectures were anything like mine you probably didn’t dive deep into why, in all the mathematical universe, the logarithm is the function that best describes chaos. This entropy idea might also have come up again when you had lectures on information theory and compression, where it described how predictable a string was.

The intuition behind why the logarithm occurs in all of these contexts has to do with the number of bits it takes to describe a state. In the rest of this post, \(\log\) refers implicitly to base 2, and \(\ln\) refers the the *natural logarithm*, which is the logarithm base e. The log base 2 of an integer \(n\) tells you roughly how many digits are required to write out \(n\) in binary. For example, \(\log (1024) = 10\), because \(1024 = 1000000000\), which has 10 binary digits. For non-integral powers of 2 this interpretation is correct up to a rounding error. In information theory, it’s helpful to think about a ‘unit’ of information as corresponding to a single character in a string. When the character is in the set {0,1}, we call the unit of information a bit. So it takes 10 bits to write out the number 1024 in binary. More generally, given a set of \(n\) elements, we require $n $ bits to specify a specific element in that set.

This interpretation of the logarithm as describing the number of digits necessary to express a number is pretty obvious, but also pretty deep. Because information theory is basically the study of writing sequences of numbers down using as few digits as possible, it makes a lot of sense that the logarithm would be important. I’m pretty sure Claude Shannon rolled over in his grave when I wrote that last sentence, so I’ll clarify what I mean by ‘writing sequences of numbers down using as few digits as possible’ before we continue.

One of the crucial ideas in information theory is the idea of a code. A code is a function \(C : \mathcal{M} \rightarrow \Sigma\) that takes as input some set of size \(m\), \(\mathcal{M} \simeq \{1, \dots, m\}\), and gives as output a string of characters from some alphabet \(\Sigma\). In theory the alphabet can be anything, but it’s usually taken to be binary strings, so \(\Sigma = \{0, 1\}\). For example, a Cesar cipher would take as input an enumeration of the english alphabet \(\{a, \dots, z\} \equiv \{1, \dots, 26\}\), and the mapping \(C\) would satisfy \(C(x) =x + k\) mod 26 for some fixed \(k\). If we were a computer, we would write \(C(x)\) as a binary string, which we’ll denote \([x+k]_2\). Writing down a string of length \(n\) in this scheme would take \(n \times \log_2(26)\) binary characters.

But suppose that Cesar were trying to save on paper – is there a more efficient way of encoding messages so that the encoded messages are shorter than \(n \times \log_2 (26)\)? It turns out that the answer is yes! Going deep into optimal codes is well beyond the scope of this blogpost (see here for an example of Huffman codes), but the key idea is to take advantage of the fact that letters that appear more frequently should get shorter code words, and letters that appear less frequently should get longer code words.

For an example of how this might work, suppose Cesar wanted to invade Canada, and needed to send the message “Go to Toronto” to his general. Naively, he could encode this message using the code written above, by converting each letter to binary and then writing out the corresponding sequence of 1s and 0s on a sheet of paper. But suppose it was a small sheet of paper. Invading Canada is expensive, after all, and the military budget doesn’t have room for superfluous paper expenditure. The naive approach to encoding ‘go to toronto’ would be to encode each letter as its order in the alphabet, then write out that number in base 2. To make it clear where each number begins and ends, we have to make each encoding the same length – so even though the letter ‘g’ is number 7 in the alphabet, we have to write it as ‘00111’ instead of ‘111’. This gives us the following coded message.

0011101111 1010001111 10100011111001001111011101010001111

That’s a pretty long string, and it contains a lot of obvious redundancies – for starters, we only have 5 distinct characters in the message, but we’re using enough bits for each letter to distinguish up to 32 distinct characters! The letter ‘o’ appears 5 times – that’s almost half of the length of the string – but we use the same number of bits to encode the ‘o’ as we do ‘r’, which only occurs once. The Huffman code addresses all of these redundancies, and gives us the following much shorter string. It gives the map: ‘o’:0, ‘t’:11, ‘g’:1011, ‘r’:1010, ‘n’:100, which leaves us with the following much shorter message.

10110 110 11010100100110

The length of the message (ignoring spaces) is now 22 rather than 56.

Although this procedure might have seemed totally unrelated to logarithms and entropy, it turns out that this procedure of encoding strings as bits actually gives us an estimate of the ‘entropy’ of the string. The original message had length 11, and its encoding had length 22. This tells us that we need on average about 2 bits per character in the message in order to transmit it as efficiently as possible. Now let’s look at the probability distribution that corresponds to randomly sampling a letter from the string:

\[X \sim \textit{Uniform(}\text{'go to toronto'}) \quad \overset{D}{=} \quad P(X=g) = \frac{1}{11} \quad \dots \quad P(X=o) = \frac{5}{11} \]

The entropy of a probability distribution \(p\) on a discrete set \(x_1, \dots, x_n\) is defined as follows.

\[H(p) = -\sum_{i=1}^n p(x_i) \log p(x_i) \]

If we calculate this for the distribution we defined above, then we see something interesting.

\[H(P) = 1.97 \approx 2 = \frac{\text{len(code)}}{\text{len(original message)}} \]

Coincidence? Not a chance! Let’s go back to the Huffman code for our message. The code for ‘o’ is 0, which has length \(1 \approx -\log_2 (\frac{5}{11})\). Similarly, the average code length for the letters that appear once is \(\frac{11}{3} \approx -\log_2 \frac{1}{11}\). Notice a pattern here? The entropy of the distribution is telling us (up to a rounding error) the average number of bits it will take to encode characters sampled from that distribution given an optimal code! In fact, Shannon figured out in the 1950s that you can use the entropy of a distribution to get a lower bound on the length of the optimal coding for it (see e.g. here for a proof). This blog post is getting pretty long so I won’t go into that proof here, but it’s quite neat.

**Key takeaways.** The entropy of a probability distribution gives you an idea of how much information (measured as the average codeword length of an optimal encoding of the items in the distributiion) it takes to describe samples from that distribution. Because the logarithm tells you how long a string needs to be to enumerate a set of a given size, it appears in the formula for entropy as a measure of optimal codeword length. In probability distributions where only one event happens frequently (i.e. * predictable distributions *), we get a small average codeword length because we can use one bit to describe that event. In distributions where many possible events happen frequently (i.e. *unpredictable* or *chaotic* settings) we need longer codewords to be able to enumerate all of the possible events. This is why people use entropy to measure the chaos in a system.

If you found this post interesting, I’ve included additional reading material below. Otherwise, happy quarantine!

** Footnotes ** 1. The height of a full binary tree with \(2^n\) leaf nodes is \(n+1\). We can interpret the binary tree as implicitly encoding each of its leaf nodes with a binary number by travelling from the root to the leaf, and putting a 1 down for each left edge, and a 0 for each right edge. 2. A proof of the optimality of the Huffman code can be found here.

** Background reading **

- Calculus: Khan academy will cover the basics of derivatives and integrals, to the level needed to understand this post.
- Trees: not strictly necessary, but they provide nice intuition. See trees, how trees can encode strings, trees and binary search.