Machine Learning · Intermediate

Logistic Regression in Python

Learn logistic regression in Python with scikit-learn. Binary classification, decision boundary, probabilities, and ROC curve — all explained and runnable in your browser.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.


How it works

Logistic regression is one of the most widely used classification algorithms. Banks use it to score loan applications, doctors use it to estimate disease risk, and ad networks use it to predict click-through rates. It is fast, interpretable, and almost always the first model you should try on a binary classification problem.

The Confusing Name

Despite "regression" in the name, logistic regression is a classifier. It predicts a class label, not a continuous number. The "regression" part comes from the math: under the hood it's running a regression on the log-odds of the probability — but you almost never need to think about that.

What you do need to think about: it predicts a probability between 0 and 1, then turns that probability into a class by checking if it's above or below a threshold (default: 0.5).

How It Works

Logistic regression starts with the same idea as linear regression — compute a weighted sum of the inputs:

z = w₀ + w₁·x₁ + w₂·x₂ + ...

But then it squashes that number through the sigmoid function to turn it into a probability:

P(class 1) = 1 / (1 + e^(-z))

Sigmoid maps any real number to (0, 1). Big positive z → probability near 1. Big negative z → probability near 0. z = 0 → probability exactly 0.5, which is where the decision boundary lives.
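A minimal numeric check of that mapping (plain NumPy; the values of z are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Big positive z -> near 1, big negative z -> near 0, z = 0 -> exactly 0.5
for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z = {z:+.1f}  ->  P(class 1) = {sigmoid(z):.3f}")
```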

For 2D data, that boundary is a straight line. For 3D, it's a plane. For higher dimensions, a hyperplane. Logistic regression is fundamentally a linear classifier — if your two classes are tangled in a non-linear way, this isn't the right tool.

The Magic Method: `predict_proba`

This is what separates logistic regression from a black-box classifier. Instead of just saying "class 1", it tells you how confident it is:

```python
model.predict(x)        # array([1])            — just the label
model.predict_proba(x)  # array([[0.12, 0.88]]) — P(class 0), P(class 1)
```

Why this matters in practice:

  • Risk scoring — "this loan has an 87% chance of default" is more useful than just "reject".
  • Threshold tuning — if false positives are costlier than false negatives, raise the threshold from 0.5 to 0.7.
  • Ranking — sort customers by P(churn) and call the top 100, even if none of them crossed 0.5.
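A runnable sketch of the label/probability distinction, including threshold tuning, on a synthetic dataset from make_classification (all parameters here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 2-feature binary dataset
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
model = LogisticRegression().fit(X, y)

x = X[:1]                      # one sample, shape (1, 2)
print(model.predict(x))        # hard label at the default 0.5 threshold
print(model.predict_proba(x))  # [[P(class 0), P(class 1)]]

# Threshold tuning: flag class 1 only when the model is at least 70% sure
p1 = model.predict_proba(X)[:, 1]
strict = (p1 >= 0.7).astype(int)
print(f"0.5 threshold flags {model.predict(X).sum()}, "
      f"0.7 threshold flags {strict.sum()}")
```

Raising the threshold can only shrink the set of flagged points, which is exactly the false-positive/false-negative trade you are making.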
Reading The Decision Boundary Plot

The filled contour in the snippet shows P(class 1) across the entire feature space — red where the model is confident it's class 0, blue where it's confident it's class 1, and a fuzzy zone in between. The black line is the decision boundary at exactly P = 0.5. Points on the wrong side are misclassifications.
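For 2D data you can read the boundary straight off the fitted model: set z = w₀ + w₁·x₁ + w₂·x₂ = 0 and solve for x₂. A sketch on synthetic data (dataset and seed are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)

# The boundary is the line where z = w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2
w0 = model.intercept_[0]
w1, w2 = model.coef_[0]
slope = -w1 / w2
intercept = -w0 / w2
print(f"boundary: x2 = {slope:.2f} * x1 + {intercept:.2f}")

# Sanity check: any point on that line scores P = 0.5
x1 = 1.0
x2 = slope * x1 + intercept
print(model.predict_proba([[x1, x2]])[0, 1])  # ~0.5
```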

ROC Curve and AUC — The Honest Way To Evaluate

Accuracy hides a lot. A model that always predicts "not spam" will be 99% accurate on a dataset that's 99% non-spam — and completely useless. The ROC curve is the antidote.

It plots the true positive rate (how many actual class-1 points you caught) against the false positive rate (how many class-0 points you wrongly flagged) across every possible decision threshold. A perfect model hugs the top-left corner. A random model is the diagonal line.

AUC (area under the curve) summarizes the whole curve as a single number:

  • 1.0 — perfect classifier
  • 0.9+ — excellent
  • 0.8 — solid
  • 0.7 — okay
  • 0.5 — random guessing
  • < 0.5 — your labels are flipped

Unlike accuracy, AUC is threshold-independent and works fine on imbalanced datasets.
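A minimal sketch of computing the ROC curve and AUC with scikit-learn's roc_curve and roc_auc_score (synthetic data; parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# ROC/AUC need the probabilities, not the hard labels
p1 = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, p1)
auc = roc_auc_score(y_test, p1)
print(f"AUC = {auc:.3f}")  # 0.5 = random, 1.0 = perfect
```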

Regularization (And Why It's On By Default)

scikit-learn's LogisticRegression applies L2 regularization out of the box (controlled by C; smaller C means more regularization). This prevents the weights from blowing up when features are correlated or when you have more features than samples. You almost never want to turn it off. If you have very high-dimensional data with lots of irrelevant features, switch to L1 (penalty='l1', solver='liblinear') — it forces irrelevant weights to zero.
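A sketch of how C changes the fitted weights, and of L1 zeroing out irrelevant ones (synthetic data with mostly noise features; all parameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Many features, few of them informative — where regularization matters
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Smaller C = stronger L2 penalty = smaller weights overall
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:>6}  total |w| = {np.abs(model.coef_).sum():.2f}")

# L1 drives irrelevant weights to exactly zero
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(X, y)
print("nonzero weights with L1:", np.count_nonzero(lasso.coef_))
```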

When To Use It (And When Not To)

Use logistic regression when:

  • You need to explain why the model made each prediction (the coefficients are interpretable).
  • You need probabilities, not just labels.
  • You have a roughly linear decision boundary.
  • You want a fast, dependable baseline before trying anything fancier.
Reach for something else when:

  • The relationship between features and class is highly non-linear → try Random Forest or Gradient Boosting.
  • You have more than two classes → still works (multi_class='multinomial'), but trees often do better.
  • You have huge feature spaces (text, images) → linear SVMs or neural networks may scale better.
Quick Practical Tips

  • Always scale your features before training (StandardScaler); convergence is faster and the regularization behaves predictably.
  • Always use `stratify=y` in train_test_split for classification — keeps class proportions consistent in train and test.
  • For multi-class problems, scikit-learn handles it automatically — same LogisticRegression class, no extra config needed.
  • Don't compare two logistic regression coefficients directly unless the features are on the same scale.
Run the snippet above and you'll see the model's decision boundary cut cleanly between two classes, get back actual probabilities (not just labels), and watch the ROC curve sweep into the top-left corner — confirming the model is genuinely good, not just lucky.
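Putting the tips together, a baseline workflow might look like this (a sketch on synthetic data; dataset and split parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# stratify=y keeps class proportions consistent in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)

# Scaling inside a pipeline avoids leaking test-set statistics into training
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```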
