Run Scikit-learn Online – Free Sklearn Online Compiler
Train machine learning models in your browser with our free online sklearn compiler. No installation or signup required - Try It Now.
Try This Scikit-learn Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# --- Load the classic Iris dataset ---
iris = load_iris()
X, y = iris.data, iris.target

# --- Split into training and test sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# --- Train a Random Forest classifier ---
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# --- Evaluate on the test set ---
y_pred = clf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
What You Can Do With Scikit-learn Online
Classification & Regression
Train classifiers like RandomForest, SVM, and LogisticRegression, or build regressors with LinearRegression and DecisionTreeRegressor — all running directly in your browser.
Built-in Datasets & Metrics
Load classic datasets like Iris, Wine, and Digits instantly. Evaluate models with accuracy_score, classification_report, confusion_matrix, and more.
No Setup Needed
Scikit-learn, NumPy, and Pandas are pre-installed. Open the editor and start building ML models immediately — zero configuration required.
How to Build ML Models with Scikit-learn Online
Ready to train machine learning models without any local setup? Our sklearn online compiler gives you immediate access to the full scikit-learn library right inside your browser. Here is a quick workflow:
- Import the Library: Start by typing `from sklearn.ensemble import RandomForestClassifier` or any estimator you need in the code editor.
- Load or Create Data: Use built-in datasets like `load_iris()` or `load_breast_cancer()`, or create your own with NumPy arrays and Pandas DataFrames.
- Split Train and Test: Call `train_test_split(X, y, test_size=0.2, random_state=42)` so you can measure how the model behaves on data it has never seen.
- Preprocess with a Pipeline: Wrap a `StandardScaler` and your estimator in `Pipeline`; this prevents leakage and keeps train and test transformations identical.
- Train Your Model: Call `.fit(X_train, y_train)` on the pipeline. The same one-liner works for every sklearn estimator.
- Predict on Held-out Data: Use `.predict(X_test)` for hard labels and `.predict_proba(X_test)` for class probabilities when you need confidence scores.
- Evaluate Results: Use `accuracy_score()`, `classification_report()`, or `confusion_matrix()` to measure performance, and switch to `f1_score` or `roc_auc_score` when classes are imbalanced.
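The workflow above can be sketched end to end in one short script. This is a minimal illustration, assuming the built-in breast cancer dataset and a `LogisticRegression` estimator; both are example choices, and any other estimator can be swapped into the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load a built-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler and estimator live in one object, so the scaler is fitted
# on the training rows only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)               # hard class labels
y_proba = pipe.predict_proba(X_test)[:, 1]  # probability of class 1

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1:       {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, y_proba):.3f}")
```

Note that `roc_auc_score` takes the probabilities, while `accuracy_score` and `f1_score` take the hard labels.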
If you want to dive deeper into feature engineering, hyperparameter tuning, or pipelines, head over to the official scikit-learn documentation.
The Scikit-learn Workflow in 6 Steps
Almost every supervised learning project — fraud detection, churn prediction, medical diagnosis — follows the same six steps. Once this rhythm is muscle memory, you can swap the dataset and the algorithm without rewriting the scaffolding. Here is the canonical pipeline applied to the breast cancer dataset shipped with sklearn.
1. Load data
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
Sklearn ships several toy datasets so you can experiment without hunting for CSVs.

2. Split into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Holding out 20% gives you an honest estimate of generalisation; `stratify=y` preserves class balance.

3. Pick a model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42)
Random Forest is a strong, low-tuning default for tabular problems, and a great first baseline.

4. Fit on the training data
model.fit(X_train, y_train)
Every sklearn estimator implements the same `.fit()` contract: once you know one, you know them all.

5. Predict on the test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
Use `.predict()` for labels; use `.predict_proba()` when you need probability estimates for thresholds or ROC curves. Random Forest probabilities are not calibrated by default.

6. Evaluate
from sklearn.metrics import accuracy_score, classification_report

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
Accuracy is a fine first glance; `classification_report` gives you precision, recall, and F1 per class.
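Step 5 computes `y_proba`, but step 6 only scores the hard labels. The sketch below shows one way the probabilities might be used: a ROC AUC score and a custom decision threshold. The threshold value of 0.7 is an arbitrary assumption chosen for illustration, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Same setup as the six steps above.
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class (label 1) for each test row.
y_proba = model.predict_proba(X_test)[:, 1]

# ROC AUC is computed from probabilities, not from hard labels.
auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.3f}")

# Hypothetical stricter cut-off: predict class 1 only when the model
# is at least 70% sure.
threshold = 0.7
y_pred_strict = (y_proba >= threshold).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred_strict)
print(cm)
```

Raising the threshold trades recall on class 1 for precision; which direction is right depends on the cost of each kind of error in your problem.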
Choosing the Right Algorithm — A Decision Guide
The biggest source of wasted hours in early ML projects is starting with the wrong model. There is no single best estimator, but there are reliable defaults for each shape of problem. Use this as a cheat sheet, not a rulebook.
| Situation | Reach for | Why |
|---|---|---|
| Quick linear baseline, binary classification | LogisticRegression | Fast, interpretable, calibrated probabilities out of the box. |
| Default tabular classifier or regressor | RandomForestClassifier / Regressor | Handles mixed scales, robust to outliers, almost no tuning needed. |
| Top accuracy on tabular data | HistGradientBoostingClassifier | Modern histogram-based boosting; handles missing values natively and trains much faster than GradientBoostingClassifier once you have more than about 10,000 samples. |
| Continuous output prediction | LinearRegression / Ridge | Start linear, add Ridge for regularisation if features correlate. |
| No labels — find groups | KMeans | Simple, fast clustering when you roughly know the number of clusters. |
| Small dataset, low-dimensional | KNeighborsClassifier | No training step; great for teaching and tiny problems. Always scale features first. |
| High-dim, small-n (e.g. text TF-IDF) | SVC / LinearSVC | SVMs shine when features outnumber rows; pair with StandardScaler on dense numeric features (sparse TF-IDF vectors are already L2-normalised by default). |
Rule of thumb: start with Logistic Regression as a sanity baseline, then jump to Random Forest or HistGradientBoosting. If neither closes about 80% of the gap to your target score, the bottleneck is almost always data quality or feature engineering, not the algorithm.
Pipelines & Preprocessing — The Often-Skipped Step
A model is only as good as the data going in. Sklearn's Pipeline bundles preprocessing and the estimator into a single object that follows the same fit/predict contract. StandardScaler centres and rescales numeric features. OneHotEncoder turns categorical strings into binary columns. ColumnTransformer applies different transforms to different columns — numeric ones get scaled, categorical ones get encoded — without leaking the test set into the training statistics.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# X_train and X_test are assumed to be pandas DataFrames with these columns.
numeric_cols = ["age", "income"]
categorical_cols = ["country", "plan"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))

Why this matters: if you scale the entire dataset before splitting, the test set's mean and standard deviation leak into training, and your accuracy looks better than reality. A pipeline solves this for free because fit only sees X_train. As a bonus, the same object plugs straight into GridSearchCV and cross_val_score, so cross-validation correctly refits the scaler on each fold.
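The snippet above leaves `X_train` and `y_train` undefined, so here is a self-contained sketch that can be pasted and run as-is. The DataFrame is synthetic, made-up data, and the target rule (high income or a "pro" plan) is a hypothetical invented only so that the model has some signal to learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data purely for illustration.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "country": rng.choice(["US", "DE", "IN"], size=n),
    "plan": rng.choice(["free", "pro"], size=n),
})
# Hypothetical target: 1 when income is above 50k or the plan is "pro".
y = ((df["income"] > 50_000) | (df["plan"] == "pro")).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=42, stratify=y
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "plan"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print("Test accuracy:", test_acc)

# The whole pipeline is refitted inside each fold, so the scaler and
# encoder never see the validation rows of that fold.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores)
```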
Scikit-learn vs PyTorch vs TensorFlow vs XGBoost
Scikit-learn is the right tool for classical machine learning — tabular data, pipelines, model selection, and any project under a few million rows. It is also, by a wide margin, the best library to learn ML on. The API is consistent, the docs are excellent, and the abstractions match how textbooks teach the subject.
PyTorch and TensorFlow own deep learning. If your data is images, audio, or natural-language text — anything where representations need to be learned end-to-end on a GPU — sklearn is the wrong layer. Use PyTorch for research and most modern training stacks; TensorFlow / Keras still has a strong production and mobile story.
XGBoost, LightGBM, and CatBoost dominate Kaggle-style tabular competitions. They squeeze out extra accuracy that sklearn's built-in boosting often cannot match, and LightGBM in particular trains very fast on big tables. Learn sklearn first, then graduate to these when you need every last percentage point.
In practice the stack is layered, not exclusive: sklearn for preprocessing and evaluation, XGBoost or LightGBM for the final model, PyTorch when the data stops being tabular. PythonHere covers the first layer.
5 Gotchas That Trip Up Beginners
1. Forgetting to scale features for distance-based models. KNN, SVM, and KMeans all compute distances, so a feature measured in millions will dominate one measured in 0–1. Always wrap them in a pipeline with `StandardScaler`; tree-based models like Random Forest do not need scaling.

2. Data leakage from preprocessing before the split. Calling `scaler.fit(X)` on the full dataset and then splitting leaks test statistics into training. The fix is mechanical: split first, then fit the scaler inside a `Pipeline` so each fold sees only its own training data.

3. Using accuracy on imbalanced datasets. If 99% of your rows are class 0, predicting "always 0" scores 99% accuracy and detects nothing. Switch to `f1_score`, `roc_auc_score`, or `precision_recall_curve`, and consider `class_weight="balanced"` on the estimator.

4. Not setting random_state. Without `random_state` set on `train_test_split` and on the estimator, every run produces a slightly different score. You will chase phantom regressions. Pin a seed (42 is traditional) for any code you plan to share or compare.

5. Confusing .predict() and .predict_proba(). `.predict()` returns the chosen class label; `.predict_proba()` returns the probability for each class. Threshold tuning, ROC curves, and calibrated decision-making all need the probabilities; accuracy on hard labels alone will quietly hide model quality.
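Gotcha 3 is easy to reproduce. The sketch below assumes a synthetic dataset from `make_classification` with roughly 1% positives and uses `DummyClassifier` as the "always predict the majority class" model; the dataset sizes are arbitrary illustration values.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data: about 99% class 0 and 1% class 1, no label noise.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01],
    flip_y=0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A "model" that ignores the features and always predicts the majority class.
always_zero = DummyClassifier(strategy="most_frequent")
always_zero.fit(X_train, y_train)
y_pred = always_zero.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
bal_acc = balanced_accuracy_score(y_test, y_pred)

print(f"Accuracy:          {acc:.3f}")      # looks excellent
print(f"F1 (class 1):      {f1:.3f}")       # no positives are ever found
print(f"Balanced accuracy: {bal_acc:.3f}")  # no better than chance
```

Any real model should be compared against this kind of dummy baseline before its accuracy is taken at face value.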
Frequently Asked Questions
Can I run scikit-learn online without installing Python?
Yes. PythonHere runs Python entirely in your browser using WebAssembly (Pyodide). Scikit-learn, NumPy, and Pandas are pre-loaded — no installation required.
Which sklearn models are supported?
All major scikit-learn estimators are available — classification (RandomForest, SVM, KNN, LogisticRegression, HistGradientBoostingClassifier), regression (LinearRegression, Ridge, Lasso), clustering (KMeans, DBSCAN), dimensionality reduction (PCA, t-SNE), and preprocessing tools (StandardScaler, OneHotEncoder, LabelEncoder).
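The unsupervised estimators follow the same fit-based pattern as the classifiers. As a rough sketch, assuming the Iris dataset and three clusters (an example choice that happens to match the number of species), PCA and KMeans can be combined like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first: both PCA and KMeans are sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the four features to two principal components.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cluster in the reduced space; the labels y are not used for fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)

# Compare the clusters with the true species, for illustration only.
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```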
Is it free?
100% free, forever. No account, no credit card, no time limit.
Can I use NumPy and Pandas with sklearn here?
Yes. NumPy and Pandas are available alongside scikit-learn. Use import numpy as np and import pandas as pd directly in the editor.
Which scikit-learn version runs in the browser?
Pyodide ships a recent stable scikit-learn build compiled to WebAssembly — typically a 1.x release that supports modern APIs like Pipeline, ColumnTransformer, HistGradientBoostingClassifier, and the standard model_selection utilities. Run import sklearn; print(sklearn.__version__) in the editor to check the exact version your session loaded.
Can I train models on my own dataset here?
Absolutely. Paste a CSV string and load it with pandas.read_csv(io.StringIO(csv_text)), or upload data through the Pyodide file API. The dataset lives in browser memory only — nothing is uploaded to our servers, which makes the editor safe for sensitive exploration.
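A minimal sketch of the pasted-CSV route, assuming a tiny made-up dataset; the column names `hours_studied`, `hours_slept`, and `passed` are hypothetical and stand in for your own data.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up example data; replace with your own pasted CSV text.
csv_text = """hours_studied,hours_slept,passed
1,4,0
2,5,0
3,6,0
4,5,0
6,7,1
7,8,1
8,6,1
9,7,1
"""

# StringIO lets pandas read the string as if it were a file.
df = pd.read_csv(io.StringIO(csv_text))
X = df[["hours_studied", "hours_slept"]]
y = df["passed"]

model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X, y)

# Predict for a new, hypothetical student.
new_row = pd.DataFrame({"hours_studied": [7], "hours_slept": [8]})
prediction = model.predict(new_row)
print("Prediction:", prediction[0])
```

With only eight rows there is no meaningful train/test split here; the point is the loading step, not the model quality.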
How do I split data into train and test sets?
Use train_test_split from sklearn.model_selection. Pass your features X, target y, a test_size (typically 0.2 or 0.3), and always set random_state for reproducible splits. For imbalanced classification, add stratify=y so both splits keep the same class distribution.
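To see what `stratify=y` does, the class proportions can be printed for each split. This sketch assumes the built-in breast cancer dataset as example data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each class in the full data and in both splits.
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    proportions = np.bincount(labels) / len(labels)
    print(f"{name:5s} n={len(labels):3d} class shares={proportions.round(3)}")
```

The three rows should show nearly identical class shares, which is the property stratification is meant to guarantee.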
Can I save trained models with pickle or joblib?
You can pickle and unpickle inside the same browser session, and joblib is available too. Persisting beyond a tab refresh requires extra plumbing — you can serialise to bytes, base64-encode them, and download via a Blob, then re-upload to load. For long-term model storage and serving, train here but deploy on a real backend.
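The Python side of that round trip might look like the sketch below. It covers only serialising to bytes, base64-encoding, and restoring; the browser download via a Blob is JavaScript plumbing and is not shown. The Iris model is an example choice.

```python
import base64
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialise the fitted model to bytes, then to a text-safe base64 string.
model_bytes = pickle.dumps(model)
encoded = base64.b64encode(model_bytes).decode("ascii")
print(f"Encoded model length: {len(encoded)} characters")

# Later (for example after re-uploading the string): decode and unpickle.
restored = pickle.loads(base64.b64decode(encoded))

# The restored model should make exactly the same predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match:", same)
```

Only unpickle data you created yourself: loading a pickle executes arbitrary code, and a model pickled under one scikit-learn version is not guaranteed to load under another.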
Does it support GridSearchCV and cross-validation?
Yes. GridSearchCV, RandomizedSearchCV, cross_val_score, KFold, StratifiedKFold and the rest of sklearn.model_selection all work. Keep n_jobs=1 in the browser — Pyodide is single-threaded — and prefer a small param grid so the search finishes in seconds instead of minutes.
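A small search along those lines could look like this. The grid values, the SVC estimator, and the Iris dataset are illustrative assumptions; the step names `scale` and `model` are arbitrary labels that determine the `model__` prefix in the grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", SVC()),
])

# Deliberately tiny grid: 3 x 2 = 6 candidates, 5 folds each.
param_grid = {
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# n_jobs=1 keeps everything on the single available thread.
search = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=1)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because the scaler sits inside the pipeline, it is refitted on the training portion of every fold for every candidate.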
Can I use this for production ML?
No. PythonHere is a learning, prototyping, and teaching environment. Pyodide runs single-threaded inside a browser tab, has no GPU access, and dies when the tab closes. For production, train and serve models from a proper Python environment — FastAPI, AWS SageMaker, Modal, Vertex AI, or any container platform.
Does it support deep learning?
Not really — PyTorch, TensorFlow, and JAX are too heavy or GPU-bound to run well in WebAssembly. Scikit-learn does ship a basic neural network (MLPClassifier and MLPRegressor) which is fine for tiny tabular problems. For real deep learning, use a Colab notebook or a local environment with a GPU.
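For a sense of what the built-in neural network looks like, here is a minimal sketch using `MLPClassifier` on the breast cancer dataset. The single 16-unit hidden layer and the iteration cap are arbitrary assumptions picked to keep training short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Neural networks are sensitive to feature scale, so scale first.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
mlp.fit(X_train, y_train)

mlp_acc = mlp.score(X_test, y_test)
print(f"MLP test accuracy: {mlp_acc:.3f}")
```

If scikit-learn prints a ConvergenceWarning, the optimiser hit `max_iter` before converging; raising the cap or simplifying the network are the usual remedies.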
Why is training slow in the browser?
Pyodide runs a CPython interpreter that has been compiled to WebAssembly, and it executes single-threaded in your tab. Native sklearn on a desktop benefits from optimised BLAS builds, OpenMP, and multiple cores; the browser version does not. Expect roughly a 2–5x slowdown versus local Python. Keep datasets under ~100k rows and prefer fast estimators (LogisticRegression, HistGradientBoostingClassifier) for the smoothest experience.