The Log-Sum-Exp Trick.

Introduction

When we multiply many large or small numbers, the result can easily underflow or overflow. This happens often when working with small probabilities. To avoid this, we work in the $\log$ scale. Eventually, we may need to add the original numbers, which means moving out of the log scale. In such cases, we come across the log-sum-exp expression,

\[\text{LSE}(\mathbf{x}) = \log \left( \sum_{i=1}^{n} \exp(x_i) \right) \tag{1}\]

The problem? If even one $x_i$ is large, $\exp(x_i)$ overflows and the whole expression evaluates to infinity (`inf`). The log-sum-exp trick is a clever way to avoid such numerical overflow.

Derivation

The trick relies on shifting the center of the exponentials. Let $a$ be the maximum value in our vector $x$.

\[a = \max(x_1, x_2, \ldots, x_n) \tag{2}\]

We can rewrite the expression in Equation $(1)$ as

\[y = \log \left( \sum_{i=1}^{n} \exp(x_i) \right)\] \[y = \log \left( \sum_{i=1}^{n} \exp(x_i + a - a) \right)\] \[y = \log \left( \sum_{i=1}^{n} \exp(a) \exp(x_i - a) \right)\] \[y = \log \left( \exp(a) \sum_{i=1}^{n} \exp(x_i - a) \right)\]

Using the log rule $\log(AB) = \log A + \log B$,

\[y = \log( \exp(a)) + \log \left( \sum_{i=1}^{n} \exp(x_i - a) \right)\] \[\text{LSE}(\mathbf{x}) = a + \log \left( \sum_{i=1}^{n} \exp(x_i - a) \right) \tag{3}\]

Why it works

By subtracting the maximum value $a$, the largest term in the exponent becomes $\exp(a - a) = \exp(0) = 1$. All other terms become exponentials of negative numbers, which yield values between 0 and 1. The sum is therefore bounded between $1$ and $n$, and the risk of overflow is eliminated.
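Equation $(3)$ can be sketched directly in Python (a minimal illustration; the function name `logsumexp` is our own, and in practice library routines such as `scipy.special.logsumexp` implement the same idea):

```python
import math

def logsumexp(xs):
    # LSE(x) = a + log(sum(exp(x_i - a))), where a = max(x)
    a = max(xs)
    return a + math.log(sum(math.exp(x - a) for x in xs))

# The naive form, math.log(sum(math.exp(x) for x in xs)), raises
# OverflowError for xs = [1000.0, 1000.0]; the shifted form is fine:
print(logsumexp([1000.0, 1000.0]))  # 1000 + log(2)
```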

Applications

1. Softmax Normalization

An immediate application of log-sum-exp is the softmax normalization. Take the softmax function,

\[\text{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} \tag{4}\]

If any $z_i$ is larger than about 710 (the overflow threshold for 64-bit floats in Python), $\exp(z_i)$ overflows, making the whole expression `inf`. To avoid this, we convert the expression to the log scale and apply the log-sum-exp trick.

Taking $\log$ on both sides of Equation $(4)$,

\[\log (\text{Softmax}(z_i)) = z_i - \log \left( {\sum_{j=1}^{N} \exp(z_j)}\right )\]

Substituting our $\text{LSE}$ definition,

\[\log (\text{Softmax}(z_i)) = z_i - \text{LSE}(\mathbf{z})\]

To get the actual probabilities, we simply exponentiate the result at the very end:

\[\text{Softmax}(z_i) = \exp(z_i - \text{LSE}(\mathbf{z})) \tag{5}\]

This ensures we never compute $\exp(z_i)$ directly for a large $z_i$, preventing overflow.
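As a sketch, Equation $(5)$ translates to the following Python (the name `stable_softmax` is our own; libraries such as `scipy.special.softmax` implement the same technique internally):

```python
import math

def stable_softmax(zs):
    # Softmax(z_i) = exp(z_i - LSE(z)), computed via the max-shift:
    # exp(z_i - a) / sum(exp(z_j - a)) with a = max(z)
    a = max(zs)
    shifted = [math.exp(z - a) for z in zs]   # every term lies in (0, 1]
    total = sum(shifted)
    return [s / total for s in shifted]

# Works even where the naive ratio exp(z_i) / sum(exp(z_j)) would overflow:
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```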

2. Averaging Values Stored in Log Space

If we have values stored as logs (e.g. $\log u_1$, $\log u_2$, and so on) and we want the log of their average, $\log \left( \frac{1}{N} \sum_{i=1}^{N} u_i \right)$, we can use $\text{LSE}$:

\[\log \left( \frac{1}{N} \sum_{i=1}^{N} u_i \right)\] \[= \log \left( \frac{1}{N} \sum_{i=1}^{N} \exp (\log u_i) \right)\] \[= \log \left( \frac{1}{N}\right) + \log \left( \sum_{i=1}^{N} \exp (\log u_i) \right)\] \[= \log \left( \frac{1}{N}\right) + \text{LSE}(\log \textbf{u})\]

This allows us to compute averages of probabilities that are too small to represent as ordinary floating-point numbers.
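A small sketch of this log-space average, again using the stable form of $\text{LSE}$ (the helper name `log_mean` is ours):

```python
import math

def log_mean(log_us):
    # log( (1/N) * sum(u_i) ) computed from the log u_i values:
    # log(1/N) + LSE(log u), with LSE evaluated via the max-shift trick.
    n = len(log_us)
    a = max(log_us)
    lse = a + math.log(sum(math.exp(x - a) for x in log_us))
    return math.log(1.0 / n) + lse

# Each u_i = exp(-2000) underflows to 0.0 as a float, yet the log of
# their mean is recovered without ever leaving log space:
print(log_mean([-2000.0, -2000.0]))  # ≈ -2000.0
```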