Logistic Regression in Python
Learn logistic regression in Python with scikit-learn. Binary classification, decision boundary, probabilities, and ROC curve — all explained and runnable in your browser.
Try it yourself
Run this code directly in your browser. Click "Open in full editor" to experiment further.
How it works
Logistic regression is the most-used classification algorithm on the planet. Banks use it to score loan applications, doctors use it to estimate disease risk, ad networks use it to predict click-through. It is fast, interpretable, and almost always the first model you should try for a binary classification problem.
The Confusing Name
Despite "regression" in the name, logistic regression is a classifier. It predicts a class label, not a continuous number. The "regression" part comes from the math: under the hood it's running a regression on the log-odds of the probability — but you almost never need to think about that.
What you do need to think about: it predicts a probability between 0 and 1, then turns that probability into a class by checking if it's above or below a threshold (default: 0.5).
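You can verify that equivalence yourself. Here is a minimal sketch on synthetic data (the dataset parameters are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy two-class data, just for illustration
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
model = LogisticRegression().fit(X, y)

# predict() is exactly predict_proba() thresholded at 0.5
proba_class1 = model.predict_proba(X)[:, 1]
manual_labels = (proba_class1 > 0.5).astype(int)
print(np.array_equal(model.predict(X), manual_labels))  # True
```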
How The Math Works
Logistic regression starts with the same idea as linear regression — compute a weighted sum of the inputs:
z = w₀ + w₁·x₁ + w₂·x₂ + ...

But then it squashes that number through the sigmoid function to turn it into a probability:

P(class 1) = 1 / (1 + e^(-z))

The sigmoid maps any real number to (0, 1). Big positive z → probability near 1. Big negative z → probability near 0. z = 0 → probability exactly 0.5, which is where the decision boundary lives.
For 2D data, that boundary is a straight line. For 3D, it's a plane. For higher dimensions, a hyperplane. Logistic regression is fundamentally a linear classifier — if your two classes are tangled in a non-linear way, this isn't the right tool.
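The sigmoid itself is one line of NumPy. A quick sketch to see the squashing in action:

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5.0))  # ≈ 0.0067, confident class 0
print(sigmoid(0.0))   # 0.5, right on the decision boundary
print(sigmoid(5.0))   # ≈ 0.9933, confident class 1
```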
The Magic Method: `predict_proba`
This is what separates logistic regression from a black-box classifier. Instead of just saying "class 1", it tells you how confident it is:
model.predict(x)        # array([1]) — just the label
model.predict_proba(x)  # array([[0.12, 0.88]]) — P(class 0), P(class 1)

Why this matters in practice: you can rank predictions by confidence, shift the threshold to trade precision against recall, or report the probability itself as a risk score instead of a bare label.
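The calls above assume an already-fitted model and a single sample x. A self-contained version might look like this (synthetic data, so your exact probabilities will differ from the ones shown):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression().fit(X, y)

x = X[:1]                      # one sample, kept 2-D as scikit-learn expects
print(model.predict(x))        # e.g. array([1])
print(model.predict_proba(x))  # e.g. array([[0.12, 0.88]])

# With probabilities you can set your own threshold, e.g. only flag
# class 1 when the model is at least 90% sure:
confident = (model.predict_proba(X)[:, 1] >= 0.9).astype(int)
```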
Reading The Decision Boundary Plot
The filled contour in the snippet shows P(class 1) across the entire feature space — red where the model is confident it's class 0, blue where it's confident it's class 1, and a fuzzy zone in between. The black line is the decision boundary at exactly P = 0.5. Points on the wrong side are misclassifications.
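One way to reproduce that kind of plot yourself (a sketch; the grid resolution and colormap are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
model = LogisticRegression().fit(X, y)

# Evaluate P(class 1) on a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
proba = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=20, cmap="RdBu", alpha=0.7)  # probability surface
plt.contour(xx, yy, proba, levels=[0.5], colors="black")        # boundary at P = 0.5
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu", edgecolors="k")
plt.show()
```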
ROC Curve and AUC — The Honest Way To Evaluate
Accuracy hides a lot. A model that always predicts "not spam" will be 99% accurate on a dataset that's 99% non-spam — and completely useless. The ROC curve is the antidote.
It plots the true positive rate (the fraction of actual class-1 points you caught) against the false positive rate (the fraction of class-0 points you wrongly flagged) across every possible decision threshold. A perfect model hugs the top-left corner. A random model is the diagonal line.
AUC (area under the curve) summarizes the whole curve as a single number: 1.0 is a perfect classifier, 0.5 is random guessing, and values in between are the probability that the model ranks a randomly chosen class-1 point above a randomly chosen class-0 point.
Unlike accuracy, AUC is threshold-independent and works fine on imbalanced datasets.
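Computing both in scikit-learn is a few lines. A sketch, using a deliberately imbalanced synthetic dataset (the 90/10 split is an arbitrary choice for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # ROC needs scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # the random-guess diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```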
Regularization (And Why It's On By Default)
scikit-learn's LogisticRegression applies L2 regularization out of the box (controlled by C — smaller C means more regularization). This prevents the weights from blowing up when features are correlated or when you have more features than samples. You almost never want to turn it off. If you have very high-dimensional data with lots of irrelevant features, switch to L1 (penalty='l1', solver='liblinear') — it forces irrelevant weights to zero.
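To see the effect, compare the fitted weights at different strengths (a sketch; the C values and data shape are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, only 2 actually informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=2,
                           random_state=0)

for C in (100.0, 1.0, 0.01):                 # smaller C = stronger regularization
    l2 = LogisticRegression(C=C).fit(X, y)
    print(f"C={C}: max |weight| = {np.abs(l2.coef_).max():.2f}")

# L1 zeroes out irrelevant weights entirely
l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("non-zero weights with L1:", np.count_nonzero(l1.coef_))
```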
When To Use It (And When Not To)
Use logistic regression when:

- You need a fast, interpretable first model for a binary classification problem.
- You care about the probability of each prediction, not just the label.
- The classes are roughly separable by a straight line (or hyperplane) in your feature space.

Reach for something else when:

- The two classes are tangled in a non-linear way: a decision tree, random forest, or kernel SVM will fit the boundary better.
- You have more than two classes. Logistic regression extends to multiclass (multi_class='multinomial'), but trees often do better.

Quick Practical Tips
- Scale your features (StandardScaler). Convergence is faster and the regularization behaves predictably.
- Use stratify=y in train_test_split for classification: it keeps class proportions consistent in train and test.
- Multiclass works out of the box with the same LogisticRegression class, no extra config needed.

Run the snippet above and you'll see the model's decision boundary cut cleanly between two classes, get back actual probabilities (not just labels), and watch the ROC curve sweep into the top-left corner — confirming the model is genuinely good, not just lucky.
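To tie the tips together, here is a minimal end-to-end sketch (synthetic data; every call is standard scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)   # stratified split keeps class balance

# Scaling and model in one object, so the test set is scaled
# with statistics learned from the training set only
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```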
Related examples
Linear Regression in Python (from Scratch)
Build linear regression in Python from scratch with NumPy, then compare to scikit-learn. Step-by-step math, runnable code, and a real plotted fit.
Decision Tree Classifier in Python
Build a decision tree classifier in Python with scikit-learn. Train, visualize the actual tree, predict, and learn how to avoid overfitting — runnable in your browser.
Confusion Matrix & Classification Metrics in Python
Understand the confusion matrix in Python with scikit-learn. Precision, recall, F1, and accuracy on a plotted heatmap — runnable in your browser, no setup.