Machine Learning · Intermediate

PCA (Principal Component Analysis) in Python

Learn PCA in Python with scikit-learn. Reduce high-dimensional data to 2D, visualize hidden structure, and understand explained variance — runnable in your browser.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.

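The embedded snippet doesn't render outside the browser. As a stand-in, here is a minimal sketch of its core workflow: standardize the iris features, fit a 2-component PCA, and read off the explained variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the four iris features, then compress them to two components
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)                       # 150 flowers, now 2 coordinates each
print(pca.explained_variance_ratio_)  # variance kept by PC1 and PC2
```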

How it works

PCA is the Swiss Army knife of dimensionality reduction. When you have 50 columns, or 500, or 50,000, PCA takes that mountain of features and compresses it into a handful of new ones that capture most of what matters. The result: data you can actually visualize, models that train faster, and noise that quietly disappears.

What PCA Actually Does

Imagine your data as a cloud of points floating in N-dimensional space. PCA finds the directions along which the cloud is most stretched out — those are the principal components.

  • PC1 is the single direction with the most variance — the longest axis of the cloud.
  • PC2 is the next-best direction, perpendicular to PC1.
  • PC3 is perpendicular to both, pointing in the next-best direction.
  • ...and so on.
  • Each component captures less variance than the one before it.

By the time you get to PC50 in a 50-dimensional dataset, that component is usually capturing nothing but noise. The whole game is keeping the first few components and throwing the rest away.
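The stretched-cloud picture can be sketched from scratch with NumPy on synthetic data (the shape of the cloud here is made up for illustration): the principal components fall out as eigenvectors of the covariance matrix, sorted by eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# A 2-D cloud stretched along one direction: the second column mostly copies the first
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                 # center the cloud
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]       # largest variance first: PC1, then PC2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())          # fraction of variance along PC1 vs PC2
print(eigvecs[:, 0] @ eigvecs[:, 1])    # perpendicular: dot product is ~0
```

`sklearn.decomposition.PCA` does essentially this (via SVD) plus the bookkeeping around transforming and inverting.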

Why "Standardize First" Is Non-Negotiable

PCA finds directions of maximum variance, and variance is sensitive to scale. If one feature is in millimeters (range 0–1000) and another is in meters (range 0–1), the millimeter feature will dominate every principal component just because its numbers are bigger.

Always run `StandardScaler` on your features before PCA. It centers them at zero and scales them to unit variance, putting every feature on equal footing.
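A small sketch of the scale problem on made-up data: two independent columns, one in millimeters and one in meters. Without scaling, PC1 is essentially the millimeter column; after `StandardScaler`, the variance splits roughly evenly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Two independent made-up features: one in millimeters, one in meters
mm = rng.normal(500, 150, size=200)   # big numbers
m = rng.normal(0.5, 0.15, size=200)   # small numbers
X = np.column_stack([mm, m])

# Unscaled: the millimeter column wins on magnitude alone
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# Scaled: both features contribute on equal footing
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
print(pipe.named_steps["pca"].explained_variance_ratio_)
```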

Reading The Output

After `pca.fit(X)`, the whole story is in one attribute and its running total:

  • `explained_variance_ratio_` — array showing what fraction of the total variance each component captures. `[0.73, 0.23, 0.04, ...]` means PC1 alone explains 73% of the variation in your data.
  • `np.cumsum(explained_variance_ratio_)` — the running total. Tells you "if I keep the first K components, I keep this much of the original information".

A common rule of thumb: keep enough components to retain 95% of the variance. The snippet above shows exactly how to find that number from the cumulative variance curve.
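That search can be sketched on the built-in digits data (the K you get depends on the dataset):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_digits().data)  # 64 pixel features
pca = PCA().fit(X)                                      # keep every component for now

cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cum >= 0.95)) + 1   # first position where the running total crosses 95%
print(f"keep {k} of 64 components to retain {cum[k - 1]:.1%} of the variance")
```

Equivalently, you can pass a float straight to the constructor: `PCA(n_components=0.95)` picks K for you.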

The Killer Use Case: Visualization

Humans can see 2D scatter plots. Sometimes 3D. Never 64D. PCA is how you visualize high-dimensional data:

  • Iris data has 4 features. PCA → 2D and you can see the three species form clear groups, with the same colors clustering together.
  • The digits dataset has 64 features (8×8 pixel images). PCA → 2D and you can see different digits naturally separating, even though PCA had no idea what the labels were.
  • This is one of the fastest sanity checks for any dataset: if PCA-to-2D shows zero structure, no model is going to find structure either.
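A plot-free sketch of that sanity check on iris: project to 2D, then use the labels only afterwards, to locate each species' average position.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(iris.data))

# PCA never saw the labels; check the species still occupy separate regions
centers = np.array([X2[iris.target == t].mean(axis=0) for t in np.unique(iris.target)])
for name, center in zip(iris.target_names, centers):
    print(name, np.round(center, 2))
```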

PCA As Compression

The `inverse_transform` method is one of PCA's most underrated tricks. It takes a compressed vector and rebuilds the original-shape data — imperfectly, but recognizably. The image reconstruction demo at the bottom of the snippet shows this in action: at 1 component the digit is a blurry blob, at 4 you can almost guess it, at 8 it's clearly a digit, and by 16 it's nearly identical to the original.

This is exactly the principle behind JPEG compression (which uses the discrete cosine transform instead of PCA, but the same idea).
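The same reconstruction experiment can be run numerically, tracking mean squared error instead of drawing images (a sketch on one digit, with the component counts from the description above):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data   # 1797 digits, 64 pixels each
digit = X[:1]            # one 8x8 digit to compress and rebuild

errors = []
for k in (1, 4, 8, 16):
    pca = PCA(n_components=k).fit(X)
    rebuilt = pca.inverse_transform(pca.transform(digit))  # compress, then decompress
    errors.append(float(np.mean((digit - rebuilt) ** 2)))
    print(f"{k:2d} components -> mean squared error {errors[-1]:.2f}")
```

The error shrinks as more components are kept, which is the numerical version of the blob sharpening into a digit.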

Other Things PCA Quietly Solves

  • Speeds up training — fewer features means faster model training. Reducing a 1000-feature dataset to the components that keep 95% of its variance can cut training time by 10×.
  • Removes noise — random noise spreads across all components, while real signal concentrates in the first few. Throwing away the late components is implicit denoising.
  • Decorrelates features — PCA components are orthogonal by construction, which is exactly what some models (like linear regression) want.
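That last point is easy to verify: the covariance matrix of the PCA scores is diagonal up to floating-point noise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

Z = PCA().fit_transform(load_iris().data)   # scores for all four components

cov = np.cov(Z, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())   # effectively zero: components are uncorrelated
```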
Where PCA Falls Down

  • It's linear. PCA only finds directions defined by linear combinations of features. If your data has a curved or twisted structure (think a Swiss roll), PCA will project it to a useless mess. For non-linear cases, try t-SNE or UMAP — they're slower and harder to interpret, but they handle curved structure beautifully.
  • The components are not always interpretable. PC1 is some weighted mix of all your original features — useful mathematically, hard to explain to a stakeholder.
  • It assumes variance equals importance. Sometimes the most important signal is in a low-variance direction. PCA will throw it out.
When To Reach For PCA

  • You want to plot high-dimensional data in 2D or 3D.
  • Your model is too slow because you have hundreds or thousands of features.
  • Your features are highly correlated and you want to decorrelate them.
  • You suspect noise is overwhelming your signal.
When To Skip It

  • You only have a few features. PCA on 5 features is usually pointless.
  • Your features have specific real-world meanings you need to preserve (PCA components don't).
  • Your data is non-linear — try t-SNE/UMAP instead.
Run the snippet above and you'll see iris flowers cleanly cluster in 2D after losing two of their four dimensions, watch all ten digit classes find their own corners of a 2D map after being squashed from 64 dimensions, and see a digit get progressively rebuilt from a handful of components.
