Scientific Computing · Intermediate

Linear Regression in Python (from Scratch)

Build linear regression in Python from scratch with NumPy, then compare to scikit-learn. Step-by-step math, runnable code, and a real plotted fit.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.


How it works

Linear regression is the hello world of machine learning. Every model you'll ever build — random forests, gradient boosters, neural networks — is in some sense a fancier answer to the same question: given some inputs, what's the best line (or curve, or surface) through this cloud of points?

The Goal in Plain English

You have a bunch of (x, y) pairs. You suspect there's a roughly linear relationship — when x goes up, y tends to go up (or down) too. You want a line y = w₁·x + w₀ that passes as close as possible to all the points at once.

"As close as possible" means: minimize the sum of squared distances between your line and each data point. That's what "least squares" means. Squared, not absolute, because the math becomes elegant — and because it punishes big misses harder than small ones.
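As a minimal sketch (with made-up points, not the snippet's data), the squared-error objective looks like this:

```python
import numpy as np

# Hypothetical data points, chosen to lie near the line y = 2.5x + 4
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.1, 6.4, 9.2, 11.3])

def squared_loss(w1, w0):
    """Sum of squared vertical distances from the line to each point."""
    residuals = y - (w1 * x + w0)
    return np.sum(residuals ** 2)

# A line close to the data has a far smaller loss than a bad guess
print(squared_loss(2.5, 4.0))   # small
print(squared_loss(0.0, 0.0))   # large
```

Least squares is just the search for the (w₁, w₀) pair that makes this number as small as possible.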

Three Ways to Solve It

The code above shows the same problem solved three different ways, and they all land on the same answer. That's not a coincidence — they're all attacking the same convex optimization problem.

1. `np.polyfit` — the one-liner

slope, intercept = np.polyfit(x, y, deg=1)

Fastest way to get an answer. Behind the scenes it's doing the same math as the next two methods. Use this when you just need a quick fit and don't care about the mechanics.
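A self-contained sketch, using synthetic data (the true slope 2.5 and intercept 4.0 here are assumptions, chosen to match the snippet's expected answer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# Noisy points scattered around a known line
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")
```

With only mild noise, the recovered slope and intercept land close to the true values.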

2. The Normal Equation — the closed-form answer

This is the math that's actually being computed:

w = (Xᵀ X)⁻¹ Xᵀ y

Given the design matrix X (with a column of 1s for the intercept) and the target y, this formula gives you the exact best weights in one shot. No iteration. No learning rate. Just linear algebra.

In practice you don't compute the inverse directly — `np.linalg.lstsq` solves the same problem in a numerically stable way. Always prefer `lstsq` to `inv()` for least squares.
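Here's a sketch of the normal-equation approach via `lstsq`, on the same kind of synthetic data (true slope 2.5 and intercept 4.0 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=x.size)

# Design matrix X: a column of 1s (for the intercept) next to x
X = np.column_stack([np.ones_like(x), x])

# Solves min ||Xw - y||² without ever forming (XᵀX)⁻¹ explicitly
w, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = w
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")
```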

3. Gradient Descent — the iterative way

This is how neural networks learn. Start with random weights, compute how wrong they are, then nudge them in the direction that makes them less wrong. Repeat thousands of times.

For linear regression it's overkill — the closed-form answer is right there for free. But understanding gradient descent on this simple problem gives you the intuition that scales all the way up to deep learning.

The key knob is the learning rate. Too small and it crawls. Too large and it overshoots and explodes. 0.01 is a sane starting point for normalized data.
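A bare-bones gradient descent loop might look like this (synthetic data again; the learning rate and step count are illustrative choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=x.size)

w1, w0 = 0.0, 0.0   # start from a deliberately bad guess
lr = 0.01           # learning rate: the key knob

for _ in range(5000):
    err = (w1 * x + w0) - y
    # Gradients of the mean squared error with respect to each weight
    grad_w1 = 2.0 * np.mean(err * x)
    grad_w0 = 2.0 * np.mean(err)
    w1 -= lr * grad_w1
    w0 -= lr * grad_w0

print(f"slope ≈ {w1:.2f}, intercept ≈ {w0:.2f}")
```

Try bumping `lr` to 0.1 and watch the weights blow up to NaN — that's the overshoot-and-explode failure mode in action.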

How Do You Know If The Fit Is Good?

The standard answer is R-squared (R²). It usually lands between 0 and 1, though it can go negative for a truly bad model:

  • R² = 1 → the line passes through every point perfectly (suspicious — you might be overfitting)
  • R² = 0.85 → the line explains 85% of the variation in y. Solid fit.
  • R² = 0.1 → the line is barely better than predicting the average. Time to try a different model.
  • R² < 0 → your "model" is worse than just predicting the mean. Something is wrong.
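R² is simple to compute by hand — one minus the ratio of residual variance to total variance. A sketch with made-up numbers:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([4.1, 6.4, 9.2, 11.3])
good = np.array([4.0, 6.5, 9.0, 11.5])   # predictions close to y
bad = np.full(4, 20.0)                    # worse than predicting the mean

print(r_squared(y, good))   # close to 1
print(r_squared(y, bad))    # negative
```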
When Linear Regression Is The Wrong Tool

It only works well when the underlying relationship is roughly linear. Real-world warning signs:

  • The residuals (errors) form a pattern instead of looking random — your model is missing something.
  • One outlier dominates the fit — least squares is very sensitive to outliers because errors get squared.
  • Multiple input features are highly correlated with each other — the weights become unstable.
  • For non-linear patterns, polynomial regression (deg=2 or higher in polyfit) or tree-based models will do better.
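To see the last point concretely, here's a sketch fitting a line versus a degree-2 polynomial to deliberately curved (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = x ** 2 + rng.normal(scale=0.3, size=x.size)  # quadratic, not linear

lin = np.polyval(np.polyfit(x, y, deg=1), x)
quad = np.polyval(np.polyfit(x, y, deg=2), x)

sse_lin = np.sum((y - lin) ** 2)
sse_quad = np.sum((y - quad) ** 2)
print(sse_lin, sse_quad)  # the quadratic fit has far smaller error
```

The straight line's residuals form a clear U-shaped pattern here — exactly the first warning sign above.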

scikit-learn vs. From Scratch

In real projects you'll always reach for scikit-learn:

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
model.predict(X_new)

It handles edge cases, multiple features, and integrates with the rest of the ML ecosystem (cross-validation, pipelines, scaling). But every single thing it does is what the from-scratch code above is doing — just with more polish.
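You can check the agreement directly. A sketch comparing scikit-learn's fit against `np.polyfit` on synthetic data (the true slope and intercept here are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 4.0 + rng.normal(scale=1.0, size=x.size)

# scikit-learn expects a 2-D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(np.isclose(model.coef_[0], slope_np))      # same slope
print(np.isclose(model.intercept_, intercept_np))  # same intercept
```

Both solvers are minimizing the same convex loss, so they agree to numerical precision.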

Run the snippet to see all three methods produce the same slope (~2.5) and intercept (~4.0), watch the fitted line draw itself through the noisy data, and confirm scikit-learn agrees with the math you just wrote by hand.
