An intuitive POV of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a widely used algorithm for dimensionality reduction. It is easy to implement and elegant. However, I struggled to get an intuitive understanding of this algorithm. Questions like “Why is the covariance matrix used?” or “What is the relationship between eigenvectors and principal components?” would always leave me puzzled.
In this post, I will try to answer these and many other questions that I struggled to understand.
First of all, let’s revisit PCA.
Represent data as a matrix.
Assume we have $n$ data points, and each data point consists of $m$ measurements (or features). For example, suppose we have 100 images, each $16 \times 16$ pixels. This means there are 100 ($n$) data points and each data point has $16 \times 16 = 256$ ($m$) measurements. We arrange this dataset into a matrix $X \in \mathbb{R}^{m \times n}$. Each column corresponds to a data point and each row corresponds to a specific measurement across all data points.
\[X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(n)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ x_m^{(1)} & x_m^{(2)} & \cdots & x_m^{(n)} \end{bmatrix}\]
- The subscript $i \in \{1, \dots, m\}$ indexes the feature, and
- The superscript $j \in \{1, \dots, n\}$ indexes the data point.
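To make this concrete, here is a small NumPy sketch (with random placeholder images, since we only care about the shapes) that arranges 100 images of $16 \times 16$ pixels into such a matrix:

```python
import numpy as np

# Hypothetical example: 100 images, each 16x16 pixels (placeholder data).
n, height, width = 100, 16, 16
images = np.random.rand(n, height, width)

# Flatten each image into a 256-dimensional column vector and stack
# them side by side: X has shape (m, n) = (256, 100).
X = images.reshape(n, height * width).T
print(X.shape)  # (256, 100)
```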
Centering the data
We then center the data by subtracting the mean of each feature (i.e., each row). The mean of the $i$-th feature across all data points is given by:
\[\mu_i = \frac{1}{n} \sum_{j=1}^{n} x_i^{(j)}\]
We can collect all feature means into a mean vector:
\[\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix} \in \mathbb{R}^m\]
In matrix form, subtracting $\mu$ from every column of $X$:
\[\tilde{X} = X - \mu\]
Calculate the covariance matrix
Then, we define the covariance matrix of $X$ as a matrix where the element at position $(i, j)$ represents the covariance between feature $i$ and feature $j$.
Formally, the covariance between feature $i$ and feature $j$ is given by:
\[\mathrm{Cov}(i, j) = \frac{1}{n} \sum_{k=1}^{n} \tilde{x}_i^{(k)} \tilde{x}_j^{(k)}\]
Using this, the covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ is defined as:
\[\Sigma = \begin{bmatrix} \mathrm{Cov}(1,1) & \mathrm{Cov}(1,2) & \cdots & \mathrm{Cov}(1,m) \\ \mathrm{Cov}(2,1) & \mathrm{Cov}(2,2) & \cdots & \mathrm{Cov}(2,m) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(m,1) & \mathrm{Cov}(m,2) & \cdots & \mathrm{Cov}(m,m) \end{bmatrix}\]
In compact matrix form, the covariance matrix can be written as:
\[\Sigma = \frac{1}{n} \tilde{X} \tilde{X}^T\]
One important property of the covariance matrix is that it is symmetric. Symmetric matrices have nice properties that make our lives easier. Keep in mind that the covariance matrix is symmetric, since that will come up later in the explanation.
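Here is a minimal NumPy sketch of the centering and covariance steps (note that the formula above uses a $\frac{1}{n}$ normalization, whereas `np.cov` defaults to $\frac{1}{n-1}$):

```python
import numpy as np

def covariance_matrix(X):
    """Center each feature (row) and compute Sigma = (1/n) * X_tilde @ X_tilde.T."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)   # mean vector, shape (m, 1)
    X_tilde = X - mu                     # centered data
    Sigma = (X_tilde @ X_tilde.T) / n
    return X_tilde, Sigma

X = np.random.rand(256, 100)             # placeholder data
X_tilde, Sigma = covariance_matrix(X)
print(Sigma.shape)                        # (256, 256)
print(np.allclose(Sigma, Sigma.T))        # True: the covariance matrix is symmetric
```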
Compute eigenvectors and eigenvalues.
Next, we perform an eigendecomposition of the covariance matrix $\Sigma$ to identify the principal components.
We compute the eigenvalues and eigenvectors of $\Sigma$:
\[\Sigma v_i = \lambda_i v_i\]
where:
- $v_i \in \mathbb{R}^m$ is the $i$-th eigenvector, and
- $\lambda_i \in \mathbb{R}$ is the corresponding eigenvalue.
We can stack all eigenvectors into a matrix:
\[V = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_m \\ | & | & & | \end{bmatrix}\]
and the eigenvalues into a diagonal matrix:
\[\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{bmatrix}\]
This gives the eigendecomposition of the covariance matrix:
\[\Sigma = V \Lambda V^T\]
The eigenvectors corresponding to the largest eigenvalues capture the directions of maximum variance in the data. These directions are called the principal components.
To reduce dimensionality, we select the top $k$ eigenvectors (corresponding to the largest eigenvalues) and form:
\[V_k \in \mathbb{R}^{m \times k}\]
Finally, we project the centered data onto these principal components:
\[Z = V_k^T \tilde{X}\]
where $Z \in \mathbb{R}^{k \times n}$ is the lower-dimensional representation of the data.
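Putting the steps together, a sketch of the whole algorithm could look like this (I use `np.linalg.eigh`, which is meant for symmetric matrices and returns eigenvalues in ascending order, so we re-sort them; the variable names are my own):

```python
import numpy as np

def pca(X, k):
    """Project the columns of X (shape m x n) onto the top-k principal components."""
    n = X.shape[1]
    X_tilde = X - X.mean(axis=1, keepdims=True)   # center each feature
    Sigma = (X_tilde @ X_tilde.T) / n             # covariance matrix, shape (m, m)

    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]             # sort descending by eigenvalue
    V_k = eigvecs[:, order[:k]]                   # top-k eigenvectors, shape (m, k)

    Z = V_k.T @ X_tilde                           # projected data, shape (k, n)
    return Z, V_k, eigvals[order]

X = np.random.rand(256, 100)                      # placeholder data
Z, V_k, eigvals = pca(X, k=10)
print(Z.shape)                                    # (10, 100)
```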
The algorithm is quite easy to follow; anyone with a working knowledge of linear algebra can understand it. But to me, the algorithm on its own is not intuitive. There are a lot of whys that I always struggled with. Here are some of them:
- Why do we center the data at all? Why can’t we skip this step?
- Why do we calculate the covariance matrix? Why not any other matrix?
- How do the eigenvectors of the covariance matrix correspond to the principal component directions?
To answer these, we need to rethink PCA through a different lens.
PCA is all about learning a linear transformation.
PCA is about finding a linear transformation that turns our raw data $X$ into a better-structured form. How is “better” defined?
PCA defines “better” using two principles:
- Features should be uncorrelated (no redundancy)
- The representation should preserve maximum variance
We are looking for a transformation matrix $P$ such that:
\[PX = Y \tag{1}\]
You can think of this as rotating the coordinate system.
Data should not be redundant.
We want our new data $Y$ to not be redundant. The same information should not be captured by multiple features.
In linear algebra terms, we want the features to be uncorrelated.
So the covariance matrix of $Y$ should be diagonal:
\[C_Y = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{bmatrix}\]
Recall:
\[C_Y = \frac{1}{n} YY^T \tag{2}\]
The entries on the main diagonal are the variances of the individual features, while the off-diagonal entry at $(i, j)$ is the covariance between the $i$-th and $j$-th features.
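To see what redundancy looks like in a covariance matrix, here is a toy example with two features where the second is almost a copy of the first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two redundant features: the second is mostly a copy of the first.
f1 = rng.normal(size=1000)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=1000)
X = np.vstack([f1, f2])                        # shape (2, 1000)

X_tilde = X - X.mean(axis=1, keepdims=True)
C_X = (X_tilde @ X_tilde.T) / X.shape[1]
print(np.round(C_X, 2))
# The large off-diagonal entries expose the redundancy; after PCA,
# we want those off-diagonal entries to be (close to) zero.
```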
Preserve maximum information.
We also want to preserve as much information as possible.
PCA assumes that variance = information.
So we want directions where variance is highest.
Eigenvectors <-> Principal Components.
We now connect everything.
From (1) and (2):
\[C_Y = \frac{1}{n} (PX)(PX)^T\]
\[C_Y = \frac{1}{n} P X X^T P^T\]
\[C_Y = P C_X P^T \tag{3}\]
where $C_X = \frac{1}{n} X X^T$ is the covariance matrix of the (centered) data. Our goal is to choose $P$ such that $C_Y$ becomes diagonal.
We know that $C_X$ is symmetric, so it can be diagonalized:
\[C_X = E D E^T \tag{4}\]
Substituting:
\[C_Y = P(EDE^T)P^T\]
\[C_Y = (PE)D(PE)^T \tag{5}\]
Now here is the key idea:
If we choose:
\[P = E^T\]
then:
\[C_Y = (E^T E) D (E^T E)^T\]
\[C_Y = I D I = D\]
because $E$ is orthogonal, so $E^T E = I$.
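Here is a quick numerical sanity check of this result on some synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))               # toy data, shape (m, n)
X = X - X.mean(axis=1, keepdims=True)       # center so that C_X = (1/n) X X^T

n = X.shape[1]
C_X = (X @ X.T) / n
_, E = np.linalg.eigh(C_X)                  # columns of E are eigenvectors of C_X

P = E.T                                     # choose P = E^T
Y = P @ X
C_Y = (Y @ Y.T) / n

# Off-diagonal entries of C_Y vanish (up to numerical precision).
off_diag = C_Y - np.diag(np.diag(C_Y))
print(np.allclose(off_diag, 0))             # True
```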
Key result
By choosing $P = E^T$, we rotate the data into a coordinate system where:
- covariance matrix becomes diagonal
- features are uncorrelated
- variances are eigenvalues
Hence, $P$ is the matrix whose rows $p_i$ are the eigenvectors of the covariance matrix $C_X$ (equivalently, of $XX^T$, since the $\frac{1}{n}$ scaling does not change the eigenvectors).
To sum it all up, the reason we compute the covariance matrix $C_X$ is that its eigenvectors become the rows of the linear transformation matrix $P$, which maps our raw data into a “better”-structured form. Here, “better” means that the new features are uncorrelated with one another and that maximum information is preserved by keeping the directions of highest variance.