
Confusion Matrix & Classification Metrics in Python

Understand the confusion matrix in Python with scikit-learn. Precision, recall, F1, and accuracy on a plotted heatmap — runnable in your browser, no setup.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.


How it works

If you only learn one thing about evaluating classifiers, learn this: accuracy alone is a lie waiting to happen. A model that predicts "not fraud" for every transaction will be 99.9% accurate on most fraud datasets — and completely worthless. The confusion matrix is the antidote.

What The Confusion Matrix Actually Shows

For a binary classifier, it's a 2×2 grid that breaks every prediction into one of four buckets:

                      Predicted Negative        Predicted Positive
Actually Negative     True Negative (TN) ✓      False Positive (FP)
Actually Positive     False Negative (FN)       True Positive (TP) ✓

The diagonal cells are correct predictions. The off-diagonal cells are your two types of mistakes — and they're rarely equally bad.

  • False positive = false alarm. Spam filter flags a legitimate email. Cancer screen says "malignant" when it isn't.
  • False negative = missed it. Spam filter lets spam through. Cancer screen says "benign" when it isn't.
In most real applications one of these is much worse than the other. A medical screening that misses cancer (FN) is catastrophic; a false positive just means a follow-up test. A spam filter that blocks important email (FP) is far more annoying than one that lets some spam through (FN). The whole point of looking past accuracy is to see which kind of error your model is making.
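Here is a minimal sketch of how scikit-learn produces those four counts. The labels below are toy values made up for illustration; `confusion_matrix` returns the grid with rows as actual classes and columns as predicted classes, and `ravel()` unpacks the four cells:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive, 0 = negative (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major: TN, FP, FN, TP
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # → TN=3 FP=1 FN=1 TP=3
```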

    The Four Metrics That Actually Matter

    From the four cells, you can compute four metrics. Each one answers a different question.

    Accuracy — "How often is the model right?"

    accuracy = (TP + TN) / total

    Useful only when classes are roughly balanced. Useless when they're not.
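To make the formula concrete, here is accuracy computed on a small set of toy labels (made up for illustration), where 6 of 8 predictions are correct:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# (TP + TN) / total = (3 + 3) / 8
acc = accuracy_score(y_true, y_pred)
print(acc)  # → 0.75
```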

    Precision — "When the model says positive, is it right?"

    precision = TP / (TP + FP)

    Use when false positives are expensive. Examples: marking emails as spam (you don't want to lose real email), recommending products (you don't want to annoy users with garbage suggestions).
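On the same kind of toy labels (made up for illustration), precision counts how many of the positive calls were right:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# TP / (TP + FP) = 3 / (3 + 1)
prec = precision_score(y_true, y_pred)
print(prec)  # → 0.75
```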

    Recall (also called Sensitivity) — "How many real positives did the model catch?"

    recall = TP / (TP + FN)

    Use when false negatives are expensive. Examples: medical screening (don't miss a cancer), fraud detection (don't let a fraudulent charge through), security alerts (don't miss an attack).
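With toy labels (made up for illustration), recall counts how many of the four real positives were caught:

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# TP / (TP + FN) = 3 / (3 + 1)
rec = recall_score(y_true, y_pred)
print(rec)  # → 0.75
```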

    F1 Score — "Balance of precision and recall"

    F1 = 2 · (precision · recall) / (precision + recall)

    F1 is the harmonic mean — it punishes models that are great at one metric but terrible at the other. If precision = 1.0 and recall = 0.0, the average is 0.5 but F1 is 0.0. Use F1 when you don't have a strong reason to prefer precision over recall, or vice versa.
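A quick sketch with toy labels (made up for illustration) shows both behaviors: F1 agrees with precision and recall when they match, and the harmonic mean collapses to zero when one of them is zero:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

f1 = f1_score(y_true, y_pred)  # precision and recall are both 0.75 here
print(f1)  # → 0.75

# The harmonic mean punishes lopsided models:
p, r = 1.0, 0.0
lopsided_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
print(lopsided_f1)  # → 0.0, even though the arithmetic mean is 0.5
```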

    The Imbalanced Data Trap

    The demo at the bottom of the snippet is the lesson every ML beginner has to learn the hard way. Take a dataset that's 99% class 0, build a "model" that always predicts 0, and you get 99% accuracy. The confusion matrix instantly exposes it: recall is 0%, precision is undefined, and the model has caught zero positives.

    This is not a contrived example — it's roughly what real fraud, churn, disease, and rare-event datasets look like. Always check recall and precision, especially on imbalanced data.
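The trap is easy to reproduce with synthetic labels (the 99%/1% split below is made up for illustration). A "model" that always predicts the majority class scores 99% accuracy while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic data: 10 positives, 990 negatives
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# Precision is 0/0 here; zero_division=0 reports the undefined value as 0
prec = precision_score(y_true, y_pred, zero_division=0)
print(acc, rec, prec)  # → 0.99 0.0 0.0
```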

    How To Move The Trade-Off

    Precision and recall pull against each other. The default decision threshold of 0.5 is just one point on a curve.

  • Want higher recall? Lower the threshold (e.g. predict positive when predict_proba ≥ 0.3). You'll catch more positives, at the cost of more false alarms.
  • Want higher precision? Raise the threshold (e.g. ≥ 0.7). Fewer false alarms, but more misses.
  • For classifiers that expose probabilities (predict_proba), you can sweep through every threshold and plot the precision-recall curve — the right tool for picking an operating point on imbalanced problems.
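A minimal sketch of moving the threshold, using a synthetic imbalanced dataset from make_classification (the sample count, class weights, and thresholds below are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Lowering the threshold can only add positive predictions, so recall never drops
recall_at_05 = recall_score(y_te, (proba >= 0.5).astype(int))
recall_at_03 = recall_score(y_te, (proba >= 0.3).astype(int))
print(recall_at_05, recall_at_03)

# Sweep every threshold at once to get the full trade-off curve
precision, recall, thresholds = precision_recall_curve(y_te, proba)
```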

    Multi-class Confusion Matrices

    For more than two classes, the matrix becomes N×N. Each row is a true class, each column is a predicted class. The diagonal is still correct predictions. Off-diagonal cells show which classes the model confuses with which — often the most useful insight in the whole evaluation.

    For multi-class precision/recall, scikit-learn supports three averaging strategies:

  • average='macro' — unweighted average across classes (treats all classes equally).
  • average='weighted' — weighted by class size.
  • average='micro' — pools all classes together (equivalent to accuracy for single-label problems).
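The three strategies diverge once class sizes differ. Here is a sketch on made-up 3-class labels with unequal class sizes:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Toy 3-class labels (made up): class 0 has 4 samples, class 1 has 2, class 2 has 1
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

print(confusion_matrix(y_true, y_pred))  # 3×3: rows = true class, columns = predicted

macro = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")
micro = precision_score(y_true, y_pred, average="micro")
print(macro, weighted, micro)

# micro-averaging pools everything, so it equals plain accuracy
print(micro == accuracy_score(y_true, y_pred))  # → True
```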
    The One-Liner That Replaces All Of This

    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred, target_names=class_names))

    That single function gives you per-class precision, recall, F1, and support. In real projects you'll print this every time you train a model.

    Run the snippet above to see a real confusion matrix on the breast cancer dataset, the four metrics computed individually and as a full report, and a brutal demonstration of how a 99%-accurate model can be completely useless.
