Confusion Matrix & Classification Metrics in Python
Understand the confusion matrix in Python with scikit-learn. Precision, recall, F1, and accuracy on a plotted heatmap — runnable in your browser, no setup.
Try it yourself
Run this code directly in your browser. Click "Open in full editor" to experiment further.
How it works
If you only learn one thing about evaluating classifiers, learn this: accuracy alone is a lie waiting to happen. A model that predicts "not fraud" for every transaction will be 99.9% accurate on most fraud datasets — and completely worthless. The confusion matrix is the antidote.
What The Confusion Matrix Actually Shows
For a binary classifier, it's a 2×2 grid that breaks every prediction into one of four buckets:
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actually Negative | True Negative (TN) ✓ | False Positive (FP) ✗ |
| Actually Positive | False Negative (FN) ✗ | True Positive (TP) ✓ |
The diagonal cells are correct predictions. The off-diagonal cells are your two types of mistakes — and they're rarely equally bad.
In most real applications one of these is much worse than the other. A medical screening that misses cancer (FN) is catastrophic; a false positive just means a follow-up test. A spam filter that blocks important email (FP) is much more annoying than one that lets some spam through (FN). The whole point of looking past accuracy is to see which kind of error your model is making.
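A minimal sketch of computing those four cells with scikit-learn's `confusion_matrix` (the toy labels here are made up for illustration):

```python
# Computing the four cells of a binary confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]  # made-up ground truth
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]  # made-up predictions

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

Note the row/column convention: scikit-learn puts true classes on rows and predicted classes on columns, matching the table above.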
The Four Metrics That Actually Matter
From the four cells, you can compute four metrics. Each one answers a different question.
Accuracy — "How often is the model right?"
`accuracy = (TP + TN) / total`

Useful only when classes are roughly balanced. Useless when they're not.
Precision — "When the model says positive, is it right?"
`precision = TP / (TP + FP)`

Use when false positives are expensive. Examples: marking emails as spam (you don't want to lose real email), recommending products (you don't want to annoy users with garbage suggestions).
Recall (also called Sensitivity) — "How many real positives did the model catch?"
`recall = TP / (TP + FN)`

Use when false negatives are expensive. Examples: medical screening (don't miss a cancer), fraud detection (don't let a fraudulent charge through), security alerts (don't miss an attack).
F1 Score — "Balance of precision and recall"
`F1 = 2 · (precision · recall) / (precision + recall)`

F1 is the harmonic mean — it punishes models that are great at one metric but terrible at the other. If precision = 1.0 and recall = 0.0, the average is 0.5 but F1 is 0.0. Use F1 when you don't have a strong reason to prefer precision over recall, or vice versa.
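The four formulas above can be checked by hand against scikit-learn. A small sketch with made-up labels (the counts work out to TN=3, FP=1, FN=1, TP=3):

```python
# Computing the four metrics by hand and verifying against scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # illustrative labels
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# From the confusion matrix of these labels: TN=3, FP=1, FN=1, TP=3
tp, tn, fp, fn = 3, 3, 1, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy}  precision={precision}  recall={recall}  f1={f1}")

# scikit-learn computes the same numbers
assert accuracy  == accuracy_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
assert recall    == recall_score(y_true, y_pred)
assert f1        == f1_score(y_true, y_pred)
```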
The Imbalanced Data Trap
The demo at the bottom of the snippet is the lesson every ML beginner has to learn the hard way. Take a dataset that's 99% class 0, build a "model" that always predicts 0, and you get 99% accuracy. The confusion matrix instantly exposes it: recall is 0%, precision is undefined, and the model has caught zero positives.
This is not a contrived example — it's roughly what real fraud, churn, disease, and rare-event datasets look like. Always check recall and precision, especially on imbalanced data.
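The trap can be reproduced in a few lines. A sketch with a synthetic 99%-negative dataset (the sample counts are illustrative):

```python
# The imbalanced-data trap: a "model" that always predicts the majority class.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic dataset: 990 negatives, 10 positives (99% class 0)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # always predict 0

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.99, looks great
print("recall   :", recall_score(y_true, y_pred))     # 0.0, caught zero positives
# Precision is TP / (TP + FP) = 0 / 0: undefined; zero_division=0 reports it as 0
print("precision:", precision_score(y_true, y_pred, zero_division=0))
```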
How To Move The Trade-Off
Precision and recall pull against each other. The default decision threshold of 0.5 is just one point on a curve.
To favor recall, lower the threshold (e.g., predict positive whenever `predict_proba` ≥ 0.3). You'll catch more positives, at the cost of more false alarms.

For classifiers that expose probabilities (`predict_proba`), you can sweep through every threshold and plot the precision-recall curve — the right tool for picking an operating point on imbalanced problems.
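A sketch of sweeping the threshold on a synthetic imbalanced dataset (the dataset and model choices here are assumptions for illustration):

```python
# Moving the precision/recall trade-off by changing the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

# Synthetic ~90/10 imbalanced dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of the positive class

# Lowering the threshold below the default 0.5 trades precision for recall
for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y, y_pred, zero_division=0)
    r = recall_score(y, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")

# Full sweep: one (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y, proba)
```

Recall can only go up (or stay flat) as the threshold drops, since every positive caught at a high threshold is still caught at a lower one.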
Multi-class Confusion Matrices
For more than two classes, the matrix becomes N×N. Each row is a true class, each column is a predicted class. The diagonal is still correct predictions. Off-diagonal cells show which classes the model confuses with which — often the most useful insight in the whole evaluation.
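A small sketch with three hypothetical classes, showing how the off-diagonal cells reveal which classes get confused:

```python
# A 3x3 confusion matrix; class names and labels are made up for illustration.
from sklearn.metrics import confusion_matrix

classes = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "bird"]

# labels= fixes the row/column order to match `classes`
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
# Cell cm[i, j] counts samples of true class i predicted as class j,
# e.g. cm[0, 1] is how often "cat" was mistaken for "dog".
```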
For multi-class precision/recall, scikit-learn supports three averaging strategies:
- `average='macro'` — unweighted average across classes (treats all classes equally).
- `average='weighted'` — weighted by class size.
- `average='micro'` — pools all classes together (equivalent to accuracy for single-label problems).

The One-Liner That Replaces All Of This
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=class_names))

That single function gives you per-class precision, recall, F1, and support. In real projects you'll print this every time you train a model.
Run the snippet above to see a real confusion matrix on the breast cancer dataset, the four metrics computed individually and as a full report, and a brutal demonstration of how a 99%-accurate model can be completely useless.
Related examples
Logistic Regression in Python
Learn logistic regression in Python with scikit-learn. Binary classification, decision boundary, probabilities, and ROC curve — all explained and runnable in your browser.
Decision Tree Classifier in Python
Build a decision tree classifier in Python with scikit-learn. Train, visualize the actual tree, predict, and learn how to avoid overfitting — runnable in your browser.
PCA (Principal Component Analysis) in Python
Learn PCA in Python with scikit-learn. Reduce high-dimensional data to 2D, visualize hidden structure, and understand explained variance — runnable in your browser.