K-Means Clustering in Python
Learn K-Means clustering in Python with scikit-learn. Visualize clusters forming, pick the right K with the elbow method, and run it all in your browser.
How it works
K-Means is the most popular clustering algorithm in the world, and for good reason: the idea is simple, it's blazingly fast, and the results are easy to explain to anyone — even people who don't know what machine learning is.
What "Clustering" Even Means
Clustering is unsupervised learning — you give the algorithm a pile of points and it tries to figure out how they naturally group together, without anyone telling it what the right answer looks like. There are no labels, no "correct" outputs to learn from. Just the geometry of the data.
Use cases you've definitely encountered: customer segmentation in marketing, color quantization in image compression, and grouping similar documents or products for recommendations.
How K-Means Actually Works
The algorithm is so simple you can explain it on a napkin:
1. Pick K — decide how many clusters you want.
2. Drop K random points somewhere in your data — these are your initial "centroids".
3. Assign every data point to its nearest centroid.
4. Move each centroid to the average position of the points assigned to it.
5. Repeat steps 3 and 4 until the centroids stop moving.
That's it. No gradients, no probabilities, no neural networks. Just "find center, assign points, move center, repeat".
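The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the loop, not scikit-learn's implementation, and it doesn't handle edge cases like a cluster ending up empty:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's-algorithm loop: assign, move, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: use K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs this converges in a handful of iterations, which is exactly why the algorithm is so fast in practice.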
The Hardest Part: Picking K
K-Means won't tell you how many clusters are in your data — you have to choose. The standard trick is the elbow method: fit the model for a range of K values, plot the inertia (total within-cluster squared distance) for each, and pick the K where the curve stops dropping steeply and flattens out, forming an "elbow".
In the snippet above, the elbow lands cleanly at K=4 — which matches how the data was actually generated.
For harder cases there's also the silhouette score (sklearn.metrics.silhouette_score), which gives every choice of K a single number — pick the K with the highest score.
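Here is a sketch of both model-selection tools on synthetic data: inertia for the elbow plot, and silhouette_score as a single "best K" number. The four explicit blob centers are an assumption chosen so the answer is clearly K=4:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four obvious clusters (illustrative choice)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.6,
    random_state=42,
)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  inertia={km.inertia_:9.1f}  silhouette={sil:.3f}")
```

Inertia always shrinks as K grows, so you look for the bend; silhouette actually peaks at the best K, which makes it easier to automate.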
Things K-Means Is Bad At
It's a great default, but it has well-known weaknesses:
- It assumes clusters are roughly round and similarly sized. For moon- or ring-shaped clusters, use DBSCAN instead.
- It's driven entirely by distances, so run StandardScaler first if your features are on different scales.
- Centroids are means, and means get dragged by outliers. Consider KMedoids if outliers are a problem.
- A bad random start can give a bad result, which is why scikit-learn has n_init=10. It runs the whole thing 10 times with different random starts and keeps the best one. Always do this.
The API You Actually Need
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)
kmeans.labels_ # cluster index for each training point
kmeans.cluster_centers_ # coordinates of each centroid
kmeans.inertia_ # tightness score (lower is better)
kmeans.predict(X_new) # which cluster does a NEW point belong to?
A Few Pro Tips
- Always set random_state so your results are reproducible.
- Always pass n_init=10 (in newer scikit-learn it defaults to 'auto', but explicit is better).
- Scale your features first: StandardScaler().fit_transform(X).
- Huge dataset? Use MiniBatchKMeans: same idea, much faster.
- The predict() method is what makes K-Means useful in production: train on historical data, then assign new incoming points to existing clusters in real time.
Run the snippet above and you'll see four clean clusters get discovered automatically, an elbow chart pointing at the right K, and the model assigning brand-new points to the cluster they obviously belong to.
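Putting those tips together, here is a sketch of the production workflow: scale, fit on historical data, then route new incoming points with predict(). make_blobs stands in for real historical data, which is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# "Historical" data (synthetic stand-in)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

# Fit the scaler once on training data and reuse it for every new point
scaler = StandardScaler().fit(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaler.transform(X))

# Later, new points arrive: transform with the SAME scaler, then predict
new_points = X[:3] + 0.05  # pretend these are fresh observations
print(kmeans.predict(scaler.transform(new_points)))
```

Persisting the fitted scaler alongside the model matters: a new point scaled with different statistics would land in the wrong part of the space and could be assigned to the wrong cluster.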