Principal Component Analysis (PCA)
PCA is used to reduce the dimensionality of a data set by finding a new, smaller set of variables that retains most of the information in the sample, which makes it useful for regression and classification. So basically, compression while keeping the substance.
It is, at heart, an optimization problem where we need to maximize a sum. For example, when you project a point $x$ onto a unit vector $u$, you get a new point whose magnitude is:
$$u^\top x$$
If $\|u\| = 1$ and $(u^\top x)^2$ is the amount of information stored about a point $x$, then the optimization problem we need to solve is
$$\max_{\|u\| = 1} \; \sum_{i=1}^{n} \left(u^\top x_i\right)^2$$
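As a quick, made-up numeric illustration of this objective (the points and the candidate direction `u` below are arbitrary, not taken from any particular data set), the sum of squared projection magnitudes can be computed directly:

```python
import numpy as np

# Arbitrary, roughly centered 2-D points, purely for illustration
X = np.array([[ 2.0,  1.0 ],
              [-1.5, -0.5 ],
              [ 0.5,  0.25],
              [-1.0, -0.75]])

# A candidate direction, normalized so that ||u|| = 1
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)

# Magnitude of each point's projection onto u, i.e. u^T x
projections = X @ u

# The sum that PCA maximizes over all unit vectors u
print("sum of squared projections:", np.sum(projections ** 2))
```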
There are a few steps to carrying out PCA, and they involve Lagrange functions, eigenvalues, and eigenvectors:
- Standardization: Ensuring that each variable has a mean of 0 and a standard deviation of 1.
The standardized value of each feature is
$$z = \frac{x - \mu}{\sigma}$$
Here,
- $\mu$ is the mean of the independent features,
- $\sigma$ is the standard deviation of the independent features.
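A minimal sketch of this step, assuming the data sits in a NumPy array `X` with one sample per row (the array values and variable names are illustrative only):

```python
import numpy as np

# Illustrative data matrix: rows are samples, columns are features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Standardize each feature: subtract its mean, divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma
```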
- Covariance Matrix Computation
To find the covariance between two features $x$ and $y$, we can use the formula:
$$\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$
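Continuing the same sketch, the covariance matrix of the standardized data `X_std` from the previous snippet can be computed with NumPy's `np.cov`, or equivalently from the formula above:

```python
# Covariance matrix of the standardized data (reuses X_std from the sketch above).
# rowvar=False tells np.cov that columns, not rows, are the variables.
C = np.cov(X_std, rowvar=False)

# Equivalent, since the columns of X_std have zero mean:
# C = X_std.T @ X_std / (len(X_std) - 1)
```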
- Compute the eigenvalues ($\lambda$) and eigenvectors ($v$) of the covariance matrix to identify the principal components
For the covariance matrix $C$, an eigenvector $v$ and its eigenvalue $\lambda$ satisfy
$$Cv = \lambda v \quad\Rightarrow\quad (C - \lambda I)\,v = 0,$$
where $C - \lambda I$ needs to be a singular matrix (i.e. non-invertible) for a non-zero $v$ to exist. Therefore, we can find the eigenvalues by using the equation:
$$\det(C - \lambda I) = 0$$
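Still continuing the sketch, NumPy's `np.linalg.eigh` solves this eigenvalue problem for the symmetric covariance matrix `C`; sorting by eigenvalue and projecting onto the leading eigenvectors then yields the reduced data (the choice of `k` below is illustrative):

```python
# Eigenvalues and eigenvectors of the covariance matrix C from the sketch above.
# eigh is used because the covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Sort from largest to smallest eigenvalue; the leading eigenvectors
# are the principal components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the first k principal components and project the data onto them
k = 1
X_reduced = X_std @ eigenvectors[:, :k]
```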
Further down the line, I will return with a full C++ or Python implementation of PCA, as promised.