Principal Component Analysis for Visualization

Last Updated on October 27, 2021

Principal component analysis (PCA) is an unsupervised machine learning technique. Perhaps the most popular use of principal component analysis is dimensionality reduction. Besides using PCA as a data preparation technique, we can also use it to help visualize data. A picture is worth a thousand words. With the data visualized, it is easier for us to get some insights and decide on the next step in our machine learning models.

In this tutorial, you will discover how to visualize data using PCA, as well as how to use visualization to help determine the parameter for dimensionality reduction.

After completing this tutorial, you will know:

  • How to visualize high dimensional data
  • What explained variance is in PCA
  • How to visually observe the explained variance from the result of PCA of high dimensional data

Let's get started.

Principal Component Analysis for Visualization
Photo by Levan Gokadze, some rights reserved.

Tutorial Overview

This tutorial is divided into two parts; they are:

  • Scatter plot of high dimensional data
  • Visualizing the explained variance


For this tutorial, we assume that you are already familiar with the basics of principal component analysis.

Scatter plot of high dimensional data

Visualization is a crucial step to get insights from data. From a visualization we can learn whether a pattern can be observed, and hence estimate which machine learning model is suitable.

It is easy to depict things in two dimensions. Normally a scatter plot with x- and y-axes is two dimensional. Depicting things in three dimensions is a bit challenging but not impossible. Matplotlib, for example, can plot in 3D. The only problem is that on paper or on screen, we can only look at a 3D plot from one viewpoint or projection at a time. In matplotlib, this is controlled by the degrees of elevation and azimuth. Depicting things in four or five dimensions is impossible because we live in a three-dimensional world and have no idea of how things in such high dimensions would look.

This is where a dimensionality reduction technique such as PCA comes into play. We can reduce the dimension to two or three so we can visualize it. Let's start with an example.

We start with the wine dataset, which is a classification dataset with 13 features (i.e., the dataset is 13 dimensional) and 3 classes. There are 178 samples:
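As a minimal sketch, assuming the copy of the wine dataset bundled with scikit-learn is the one meant here, loading it and checking its size could look like this (the variable names winedata, X and y are our own choice):

```python
from sklearn.datasets import load_wine

# Load the wine dataset that ships with scikit-learn
winedata = load_wine()
X, y = winedata["data"], winedata["target"]

print(X.shape)  # (number of samples, number of features)
print(y.shape)  # one class label per sample
```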

Among the 13 features, we can pick any two and plot them with matplotlib (we color-code the different classes using the c argument):
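A possible version of such a plot is sketched below. The choice of columns 1 and 2 is arbitrary, and the output file name is a placeholder of ours; in an interactive session you may prefer plt.show() over saving to a file.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

# Columns 1 and 2 are an arbitrary pick of two out of the 13 features
fig, ax = plt.subplots()
ax.scatter(X[:, 1], X[:, 2], c=y)  # c=y colors each point by its class
ax.set_xlabel(winedata["feature_names"][1])
ax.set_ylabel(winedata["feature_names"][2])
fig.savefig("wine_2d.png")  # hypothetical file name
```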

Or we can pick any three and show them in 3D:
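The sketch below shows one way to do this with matplotlib's 3D projection. The three columns, the viewing angles and the file name are illustrative assumptions; changing elev and azim in view_init() gives a different viewpoint of the same points.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Columns 1, 2 and 3 are an arbitrary pick of three features
ax.scatter(X[:, 1], X[:, 2], X[:, 3], c=y)
# Elevation and azimuth (in degrees) select the viewpoint
ax.view_init(elev=30, azim=45)
fig.savefig("wine_3d.png")  # hypothetical file name
```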

But this does not reveal much of what the data looks like, because the majority of the features are not shown. We now resort to principal component analysis:
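A minimal sketch of this step with scikit-learn's PCA class follows, keeping the names X and Xt used in the text. The legend and the file name are our own additions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

# Fit PCA on the raw features and project the data onto the principal axes
pca = PCA()
Xt = pca.fit_transform(X)

# Plot only the first two principal components
fig, ax = plt.subplots()
scatter = ax.scatter(Xt[:, 0], Xt[:, 1], c=y)
handles, _ = scatter.legend_elements()
ax.legend(handles, list(winedata["target_names"]))
fig.savefig("wine_pca.png")  # hypothetical file name
```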

Here we transform the input data X by PCA into Xt. We consider only the first two columns, which contain the most information, and plot them in two dimensions. We can see that the purple class is quite distinctive, but there is still some overlap. If we scale the data before PCA, the result would be different:
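One way to scale before PCA, sketched here under the assumption that a scikit-learn Pipeline is acceptable, is to chain StandardScaler and PCA (the step names "scaler" and "pca" and the file name are arbitrary labels of ours):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

# Standardize every feature to zero mean and unit variance, then apply PCA
pca = PCA()
pipe = Pipeline([("scaler", StandardScaler()), ("pca", pca)])
Xt = pipe.fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(Xt[:, 0], Xt[:, 1], c=y)
fig.savefig("wine_pca_scaled.png")  # hypothetical file name
```

Without scaling, the feature with the largest numeric range dominates the first principal axis; after scaling, all 13 features contribute on an equal footing.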

Because PCA is sensitive to scale, if we normalize each feature with StandardScaler we can see a better result. Here the different classes are more distinctive. From this plot, we are confident that a simple model such as SVM can classify this dataset with high accuracy.
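To check that impression, a quick sketch is to cross-validate an SVM on the scaled features. The default SVC settings and 5-fold cross-validation are assumptions of ours, not a tuned setup.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

# Scale the features, then fit a support vector classifier with default settings
clf = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(clf, X, y, cv=5)
print("Mean accuracy over 5 folds:", scores.mean())
```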

Putting these together, the following is the complete code to generate the visualizations:
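The original listing is not available here, so the script below is a reconstruction under our own assumptions: it draws the four views discussed above as panels of a single figure, with arbitrary feature choices and a placeholder file name.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

winedata = load_wine()
X, y = winedata["data"], winedata["target"]

fig = plt.figure(figsize=(10, 8))

# Panel 1: two raw features
ax1 = fig.add_subplot(2, 2, 1)
ax1.scatter(X[:, 1], X[:, 2], c=y)
ax1.set_title("Two features")

# Panel 2: three raw features in 3D
ax2 = fig.add_subplot(2, 2, 2, projection="3d")
ax2.scatter(X[:, 1], X[:, 2], X[:, 3], c=y)
ax2.set_title("Three features")

# Panel 3: first two principal components without scaling
Xt = PCA().fit_transform(X)
ax3 = fig.add_subplot(2, 2, 3)
ax3.scatter(Xt[:, 0], Xt[:, 1], c=y)
ax3.set_title("PCA")

# Panel 4: first two principal components after standardization
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA())])
Xt_scaled = pipe.fit_transform(X)
ax4 = fig.add_subplot(2, 2, 4)
ax4.scatter(Xt_scaled[:, 0], Xt_scaled[:, 1], c=y)
ax4.set_title("PCA after scaling")

fig.savefig("wine_overview.png")  # hypothetical file name
```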

If we apply the same method on a different dataset, such as MNIST handwritten digits, the scatter plot does not show distinctive boundaries, and therefore a more complicated model such as a neural network is needed to classify it:
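As a stand-in that needs no download, the sketch below uses scikit-learn's small 8×8 digits dataset rather than the full MNIST data. This substitution is our assumption, and the picture for MNIST itself will differ in detail.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Note: this is the small 8x8 digits dataset, not the full MNIST dataset
digits = load_digits()
X, y = digits["data"], digits["target"]

# Keep only the first two principal components for plotting
Xt = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(Xt[:, 0], Xt[:, 1], c=y, cmap="tab10", s=8)
fig.savefig("digits_pca.png")  # hypothetical file name
```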

Visualizing the explained variance

PCA in essence rearranges the features by their linear combinations. Hence it is called a feature extraction technique. One characteristic of PCA is that the first principal component holds the most information about the dataset. The second principal component is more informative than the third, and so on.

To illustrate this idea, we can remove the principal components from the original dataset in steps and see what the dataset looks like. Let's consider a dataset with fewer features, and show two features in a plot:
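A sketch of such a plot, assuming the iris dataset bundled with scikit-learn and an arbitrary choice of the first two features (the file name is a placeholder):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]

# Show the first two of the four features, colored by class
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y)
ax.set_xlabel(irisdata["feature_names"][0])
ax.set_ylabel(irisdata["feature_names"][1])
fig.savefig("iris_2d.png")  # hypothetical file name
```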

The dataset here is the iris dataset, which has only four features. The features are on comparable scales and hence we can skip the scaler. With 4-feature data, PCA can produce at most 4 principal components:
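In scikit-learn the principal axes are available from the fitted model as the components_ attribute, one axis per row. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]

# Fit PCA without limiting the number of components
pca = PCA().fit(X)

# Each row is one principal axis, ordered from most to least informative
print(pca.components_)
```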

For example, the first row of the resulting components is the first principal axis on which the first principal component is created. For any data point $p$ with features $p=(a,b,c,d)$, since the principal axis is denoted by the vector $v=(0.36,-0.08,0.86,0.36)$, the first principal component of this data point has the value $0.36 \times a - 0.08 \times b + 0.86 \times c + 0.36 \times d$ on the principal axis. Using the vector dot product, this value can be denoted by
$$p \cdot v$$
Therefore, with the dataset $X$ as a $150 \times 4$ matrix (150 data points, each with 4 features), we can map each data point to its value on this principal axis by matrix-vector multiplication:
$$X \cdot v$$
and the result is a vector of length 150. Now if we remove from each data point the corresponding value along the principal axis vector, that would be
$$X - (X \cdot v) \cdot v^T$$
where the transposed vector $v^T$ is a row and $X \cdot v$ is a column. The product $(X \cdot v) \cdot v^T$ follows matrix-matrix multiplication and the result is a $150 \times 4$ matrix, the same dimension as $X$.

If we plot the first two features of $X - (X \cdot v) \cdot v^T$, it looks like this:
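The removal of the first principal component can be sketched as follows, using the array names Xmean, value, pc1 and Xremove that the text refers to. The plotting details and the file name are our own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)

# Center the features, then project every point on the first principal axis
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[0]  # vector of length 150

# Turn the projected values back into 4-dimensional vectors along the axis
pc1 = value.reshape(-1, 1) @ pca.components_[0].reshape(1, -1)

# Remove the first principal component from the data
Xremove = X - pc1

fig, ax = plt.subplots()
ax.scatter(Xremove[:, 0], Xremove[:, 1], c=y)
fig.savefig("iris_remove_pc1.png")  # hypothetical file name
```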

The numpy array Xmean shifts the features of X so that they are centered at zero. This is required for PCA. Then the array value is computed by matrix-vector multiplication.
The array value is the magnitude of each data point mapped onto the principal axis. So if we multiply this value by the principal axis vector we get back an array pc1. Removing this from the original dataset X, we get a new array Xremove. In the plot we see that the points on the scatter plot have crumbled together and the cluster of each class is less distinctive than before. This means we removed a lot of information by removing the first principal component. If we repeat the same process again, the points crumble further:
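Repeating the step for the second principal axis could look like the sketch below. The loop is our own way of writing it, and np.outer is equivalent to the reshape-and-multiply form described above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)
Xmean = X - X.mean(axis=0)

# Subtract the first and then the second principal component
Xremove = X.copy()
for axis in pca.components_[:2]:
    value = Xmean @ axis
    Xremove = Xremove - np.outer(value, axis)

fig, ax = plt.subplots()
ax.scatter(Xremove[:, 0], Xremove[:, 1], c=y)
fig.savefig("iris_remove_pc2.png")  # hypothetical file name
```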

This looks like a straight line, but it actually is not. If we repeat once more, all points collapse into a straight line:
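The same removal can be written for any number of components. The helper remove_components below is a hypothetical function of ours, not part of scikit-learn; it also lets us check numerically how many dimensions are left after each step.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)


def remove_components(X, pca, k):
    """Subtract the first k principal components from X (hypothetical helper)."""
    Xmean = X - X.mean(axis=0)
    axes = pca.components_[:k]  # shape (k, n_features)
    return X - (Xmean @ axes.T) @ axes


# After removing three components only one direction of variation is left
Xremove = remove_components(X, pca, 3)
centered = Xremove - Xremove.mean(axis=0)
print("Rank after removing 3 components:", np.linalg.matrix_rank(centered, tol=1e-8))

# Removing all four leaves every point at the feature means
Xpoint = remove_components(X, pca, 4)
print("All points identical:", np.allclose(Xpoint, X.mean(axis=0)))
```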

The points all fall on a straight line because we removed three principal components from data that has only four features. Hence our data matrix becomes rank 1. You can try repeating this process once more, and the result would be all points collapsing into a single point. The amount of information removed in each step as we remove the principal components can be found from the corresponding explained variance ratio of the PCA:
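In scikit-learn, these ratios are exposed on the fitted model as the explained_variance_ratio_ attribute. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)

# Fraction of the total variance carried by each principal component
print(pca.explained_variance_ratio_)
```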

Here we can see that the first component explains 92.5% of the variance and the second component explains 5.3%. If we remove the first two principal components, the remaining variance is only 2.2%, hence visually the plot after removing two components looks like a straight line. In fact, when we check with the plots above, not only do we see that the points have crumbled, but the ranges of the x- and y-axes also become smaller as we remove the components.

In terms of machine learning, we can consider using only one single feature for classification in this dataset, namely the first principal component. We should expect to achieve no less than 90% of the accuracy obtained with the full set of features:
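A sketch of that comparison follows. The linear-kernel SVC, the one-third test split and the fixed random_state are assumptions of ours, and the exact accuracies depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Baseline: a classifier trained on all four features
clf = SVC(kernel="linear").fit(X_train, y_train)
acc_full = clf.score(X_test, y_test)
print("Using all features, accuracy:", acc_full)

# Same classifier trained on the first principal component only
pca = PCA(n_components=1).fit(X_train)
clf_pc1 = SVC(kernel="linear").fit(pca.transform(X_train), y_train)
acc_pc1 = clf_pc1.score(pca.transform(X_test), y_test)
print("Using PC1 only, accuracy:", acc_pc1)
```

Note that the PCA is fitted on the training split only, so no information from the test set leaks into the projection.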

The other use of the explained variance is in compression. Given that the explained variance of the first principal component is large, if we need to store the dataset, we can store only the projected values on the first principal axis ($X \cdot v$), as well as the vector $v$ of the principal axis. Then we can approximately reproduce the original dataset by multiplying them:
$$X \approx (X \cdot v) \cdot v^T$$
In this way, we need storage for only one value per data point instead of four values for four features. The approximation is more accurate if we store the projected values on multiple principal axes and add up multiple principal components.
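The sketch below illustrates the idea for one and for two stored principal axes. Since PCA works on centered data, we also keep the feature means and add them back when reconstructing; the relative error measure is our own choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)

mean = X.mean(axis=0)
Xmean = X - mean

errors = {}
for k in (1, 2):
    axes = pca.components_[:k]       # the k principal axes we keep
    stored = Xmean @ axes.T          # k values per data point instead of 4
    approx = stored @ axes + mean    # approximate reconstruction of X
    # Reconstruction error relative to the spread of the centered data
    errors[k] = np.linalg.norm(X - approx) / np.linalg.norm(Xmean)
    print(f"Using {k} component(s), relative error: {errors[k]:.4f}")
```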

Putting these together, the following is the complete code to generate the visualizations:
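The original listing is not available here, so the script below is a reconstruction under our own assumptions: it draws the iris data with zero to three principal components removed as panels of one figure and prints the explained variance ratio (the file name is a placeholder).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

irisdata = load_iris()
X, y = irisdata["data"], irisdata["target"]
pca = PCA().fit(X)
Xmean = X - X.mean(axis=0)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
variances = []
for k, ax in enumerate(axes.ravel()):
    # Remove the first k principal components (k = 0 keeps the data as is)
    kept_axes = pca.components_[:k]
    Xremove = X - (Xmean @ kept_axes.T) @ kept_axes
    variances.append(Xremove[:, 0].var())
    ax.scatter(Xremove[:, 0], Xremove[:, 1], c=y)
    ax.set_title(f"{k} principal component(s) removed")

fig.savefig("iris_remove_components.png")  # hypothetical file name

print("Explained variance ratio:", pca.explained_variance_ratio_)
```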

In this tutorial, you discovered how to visualize data using principal component analysis.

Specifically, you learned:

  • How to visualize a high dimensional dataset in 2D using PCA
  • How to use the plot in PCA dimensions to help choose an appropriate machine learning model
  • How to observe the explained variance ratio of PCA
  • What the explained variance ratio means for machine learning

