Run Scikit-learn Online – Free Sklearn Online Compiler

Train machine learning models in your browser with our free online sklearn compiler. No installation or signup required - Try It Now.

Try This Scikit-learn Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# --- Load the classic Iris dataset ---
iris = load_iris()
X, y = iris.data, iris.target

# --- Split into training and test sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples:     {len(X_test)}")

# --- Train a Random Forest classifier ---
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# --- Evaluate on the test set ---
y_pred = clf.predict(X_test)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

What You Can Do With Scikit-learn Online

Classification & Regression

Train classifiers like RandomForest, SVM, and LogisticRegression, or build regressors with LinearRegression and DecisionTreeRegressor — all running directly in your browser.
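
The classifier example above has a regression counterpart. As a minimal sketch, the snippet below fits LinearRegression and DecisionTreeRegressor on the built-in diabetes dataset; the dataset choice, max_depth=3, and the 25% test split are assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Built-in regression dataset: 442 samples, 10 numeric features.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Illustrative settings only; max_depth=3 is an arbitrary choice.
regressors = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=42),
}

r2_scores = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, reg.predict(X_test))
    print(f"{name}: R^2 = {r2_scores[name]:.3f}")
```

Regressors follow the same fit and predict calls as classifiers; only the metric changes (r2_score here instead of accuracy_score).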

Built-in Datasets & Metrics

Load classic datasets like Iris, Wine, and Digits instantly. Evaluate models with accuracy_score, classification_report, confusion_matrix, and more.
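
A short sketch of a built-in dataset paired with a metric: it loads Wine, fits a scaled LogisticRegression (an arbitrary choice of model for this illustration), and prints the confusion matrix.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42, stratify=wine.target
)

# Scaling plus a linear classifier; any other estimator could be swapped in.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

Off-diagonal entries show which classes the model confuses with each other, something a single accuracy number cannot tell you.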

No Setup Needed

Scikit-learn, NumPy, and Pandas are pre-installed. Open the editor and start building ML models immediately — zero configuration required.

How to Build ML Models with Scikit-learn Online

Ready to train machine learning models without any local setup? Our sklearn online compiler gives you immediate access to the full scikit-learn library right inside your browser. Here is a quick workflow:

  1. Import the Library: Start by typing from sklearn.ensemble import RandomForestClassifier or any estimator you need in the code editor.
  2. Load or Create Data: Use built-in datasets like load_iris() or load_breast_cancer(), or create your own with NumPy arrays and Pandas DataFrames.
  3. Split Train and Test: Call train_test_split(X, y, test_size=0.2, random_state=42) so you can measure how the model behaves on data it has never seen.
  4. Preprocess with a Pipeline: Wrap a StandardScaler and your estimator in a Pipeline. This prevents leakage and keeps train and test transformations identical.
  5. Train Your Model: Call .fit(X_train, y_train) on the pipeline. The same one-liner works for every sklearn estimator.
  6. Predict on Held-out Data: Use .predict(X_test) for hard labels and .predict_proba(X_test) for class probabilities when you need confidence scores.
  7. Evaluate Results: Use accuracy_score(), classification_report(), or confusion_matrix() to measure performance — and switch to f1_score or roc_auc_score when classes are imbalanced.
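
The seven steps above can be sketched end to end as follows. This is one possible arrangement, assuming the breast cancer dataset and a LogisticRegression estimator; both are illustrative choices, and any other dataset or estimator fits the same skeleton.

```python
# Step 1: import what you need.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 2: load data.
X, y = load_breast_cancer(return_X_y=True)

# Step 3: split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: bundle preprocessing and the estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step 5: train. The scaler is fitted on X_train only.
pipe.fit(X_train, y_train)

# Step 6: predict hard labels and class probabilities.
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)  # one column per class

# Step 7: evaluate.
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1:       {f1_score(y_test, y_pred):.3f}")
```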

If you want to dive deeper into feature engineering, hyperparameter tuning, or pipelines, head over to the official scikit-learn documentation.

The Scikit-learn Workflow in 6 Steps

Almost every supervised learning project — fraud detection, churn prediction, medical diagnosis — follows the same six steps. Once this rhythm is muscle memory, you can swap the dataset and the algorithm without rewriting the scaffolding. Here is the canonical pipeline applied to the breast cancer dataset shipped with sklearn.

  1. Load data

    from sklearn.datasets import load_breast_cancer
    
    data = load_breast_cancer()
    X, y = data.data, data.target

    Sklearn ships several toy datasets so you can experiment without hunting for CSVs.

  2. Split into train and test

    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    Holding out 20% gives you an honest estimate of generalisation; stratify=y preserves class balance.

  3. Pick a model

    from sklearn.ensemble import RandomForestClassifier
    
    model = RandomForestClassifier(n_estimators=200, random_state=42)

    Random Forest is a strong, low-tuning default for tabular problems — great as a first baseline.

  4. Fit on the training data

    model.fit(X_train, y_train)

    Every sklearn estimator implements the same .fit() contract — once you know one, you know them all.

  5. Predict on the test set

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

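    Use .predict() for labels and .predict_proba() when you need class probabilities for threshold tuning or ROC curves. Random Forest probabilities are not guaranteed to be well calibrated, so treat them as scores unless you calibrate them.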

  6. Evaluate

    from sklearn.metrics import accuracy_score, classification_report
    
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
    print(classification_report(y_test, y_pred))

    Accuracy is a fine first glance — classification_report gives you precision, recall, and F1 per class.
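
Step 5 computes y_proba, but the evaluation in step 6 only uses hard labels. As a sketch of how the probabilities can feed a ranking metric, the self-contained snippet below repeats the six steps and adds roc_auc_score and confusion_matrix; these two metrics are additions for illustration and not part of the walkthrough above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

# ROC AUC is computed from the scores, not from the thresholded labels.
auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.3f}")

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```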

Choosing the Right Algorithm — A Decision Guide

The biggest source of wasted hours in early ML projects is starting with the wrong model. There is no single best estimator, but there are reliable defaults for each shape of problem. Use this as a cheat sheet, not a rulebook.

Situation | Reach for | Why
Quick linear baseline, binary classification | LogisticRegression | Fast, interpretable, and usually well-calibrated probabilities out of the box.
Default tabular classifier or regressor | RandomForestClassifier / RandomForestRegressor | Handles mixed scales, robust to outliers, almost no tuning needed.
Top accuracy on tabular data | HistGradientBoostingClassifier | Modern histogram-based boosting; handles missing values natively and trains much faster than GradientBoostingClassifier once n exceeds roughly 10k rows.
Continuous output prediction | LinearRegression / Ridge | Start linear, add Ridge for regularisation if features correlate.
No labels, find groups | KMeans | Simple, fast clustering when you roughly know the number of clusters.
Small dataset, low-dimensional | KNeighborsClassifier | No training step; great for teaching and tiny problems. Always scale features first.
High-dim, small-n (e.g. text TF-IDF) | SVC / LinearSVC | SVMs shine when features outnumber rows; scale dense numeric features with StandardScaler first (TF-IDF vectors are already normalised).

Rule of thumb: start with Logistic Regression as a sanity baseline, then jump to Random Forest or HistGradientBoosting. If neither closes about 80% of the gap between that baseline and your target, the bottleneck is almost always data quality or feature engineering, not the algorithm.

Pipelines & Preprocessing — The Often-Skipped Step

A model is only as good as the data going in. Sklearn's Pipeline bundles preprocessing and the estimator into a single object that follows the same fit/predict contract. StandardScaler centres and rescales numeric features. OneHotEncoder turns categorical strings into binary columns. ColumnTransformer applies different transforms to different columns — numeric ones get scaled, categorical ones get encoded — without leaking the test set into the training statistics.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

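# Assumes X_train / X_test are pandas DataFrames containing these
# (hypothetical) columns, with y_train / y_test as the matching labels.
numeric_cols = ["age", "income"]
categorical_cols = ["country", "plan"]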

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))

Why this matters: if you scale the entire dataset before splitting, the test set's mean and standard deviation leak into training — your accuracy looks better than reality. A pipeline solves this for free because fit only sees X_train. As a bonus, the same object plugs straight into GridSearchCV and cross_val_score, so cross-validation correctly refits the scaler on each fold.
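
To see the cross-validation point in action, here is a runnable sketch that passes the same kind of pipeline to cross_val_score. The DataFrame is synthetic and made up purely for illustration (column names, value ranges, and the income-based target are all assumptions), so the scores say nothing about real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, hypothetical data: not taken from any real dataset.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "country": rng.choice(["US", "DE", "IN"], size=n),
    "plan": rng.choice(["free", "pro"], size=n),
})
# Hypothetical target: 1 when income is above the median.
y = (X["income"] > X["income"].median()).astype(int)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "plan"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold clones the pipeline and refits the scaler and encoder
# on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
```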

Scikit-learn vs PyTorch vs TensorFlow vs XGBoost

Scikit-learn is the right tool for classical machine learning — tabular data, pipelines, model selection, and any project under a few million rows. It is also, by a wide margin, the best library to learn ML on. The API is consistent, the docs are excellent, and the abstractions match how textbooks teach the subject.

PyTorch and TensorFlow own deep learning. If your data is images, audio, or natural-language text — anything where representations need to be learned end-to-end on a GPU — sklearn is the wrong layer. Use PyTorch for research and most modern training stacks; TensorFlow / Keras still has a strong production and mobile story.

XGBoost, LightGBM, and CatBoost dominate Kaggle-style tabular competitions. They squeeze out extra accuracy that sklearn's built-in boosting often cannot match, and LightGBM in particular trains very fast on big tables. Learn sklearn first, then graduate to these when you need every last percentage point.

In practice the stack is layered, not exclusive: sklearn for preprocessing and evaluation, XGBoost or LightGBM for the final model, PyTorch when the data stops being tabular. PythonHere covers the first layer.

5 Gotchas That Trip Up Beginners

  • 1. Forgetting to scale features for distance-based models. KNN, SVM, and KMeans all compute distances, so a feature measured in millions will dominate one measured in 0–1. Always wrap them in a pipeline with StandardScaler; tree-based models like Random Forest do not need scaling.

  • 2. Data leakage from preprocessing before the split. Calling scaler.fit(X) on the full dataset and then splitting leaks test statistics into training. The fix is mechanical: split first, then fit the scaler inside a Pipeline so each fold sees only its own training data.

  • 3. Using accuracy on imbalanced datasets. If 99% of your rows are class 0, predicting "always 0" scores 99% accuracy and detects nothing. Switch to f1_score, roc_auc_score, or precision_recall_curve, and consider class_weight="balanced" on the estimator.

  • 4. Not setting random_state. Without random_state set on train_test_split and on the estimator, every run produces a slightly different score. You will chase phantom regressions. Pin a seed (42 is traditional) for any code you plan to share or compare.

  • 5. Confusing .predict() and .predict_proba(). .predict() returns the chosen class label; .predict_proba() returns the probability for each class. Threshold tuning, ROC curves, and calibrated decision-making all need the probabilities — accuracy on hard labels alone will quietly hide model quality.
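
Gotcha 3 is easy to reproduce. The sketch below uses synthetic labels (990 negatives, 10 positives, an assumption chosen to match the 99% figure) and DummyClassifier as the "always predict 0" model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 990 negatives and 10 positives (99% / 1%).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Always predicts the most frequent class, which is 0 here.
always_majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = always_majority.predict(X)

acc = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, zero_division=0)  # no positives predicted
rec = recall_score(y, y_pred)

print(f"Accuracy: {acc:.2f}")
print(f"F1:       {f1:.2f}")
print(f"Recall:   {rec:.2f}")
```

Accuracy comes out at 0.99 while F1 and recall on the positive class are both 0, which is why accuracy alone is misleading on imbalanced data.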

Frequently Asked Questions

Can I run scikit-learn online without installing Python?

Yes. PythonHere runs Python entirely in your browser using WebAssembly (Pyodide). Scikit-learn, NumPy, and Pandas are pre-loaded — no installation required.

Which sklearn models are supported?

All major scikit-learn estimators are available — classification (RandomForest, SVM, KNN, LogisticRegression, HistGradientBoostingClassifier), regression (LinearRegression, Ridge, Lasso), clustering (KMeans, DBSCAN), dimensionality reduction (PCA, t-SNE), and preprocessing tools (StandardScaler, OneHotEncoder, LabelEncoder).

Is it free?

100% free, forever. No account, no credit card, no time limit.

Can I use NumPy and Pandas with sklearn here?

Yes. NumPy and Pandas are available alongside scikit-learn. Use import numpy as np and import pandas as pd directly in the editor.

Which scikit-learn version runs in the browser?

Pyodide ships a recent stable scikit-learn build compiled to WebAssembly — typically a 1.x release that supports modern APIs like Pipeline, ColumnTransformer, HistGradientBoostingClassifier, and the standard model_selection utilities. Run import sklearn; print(sklearn.__version__) in the editor to check the exact version your session loaded.

Can I train models on my own dataset here?

Absolutely. Paste a CSV string and load it with pandas.read_csv(io.StringIO(csv_text)), or upload data through the Pyodide file API. The dataset lives in browser memory only — nothing is uploaded to our servers, which makes the editor safe for sensitive exploration.
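
A minimal sketch of the CSV-string route. The column names and values are hypothetical placeholders, and DecisionTreeClassifier is just a convenient estimator for a four-row example.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical CSV pasted into the editor as a string.
csv_text = """height_cm,weight_kg,label
150,50,small
160,58,small
172,70,large
185,90,large
"""

df = pd.read_csv(io.StringIO(csv_text))
X = df[["height_cm", "weight_kg"]]
y = df["label"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict for one new row, using the same column names as in training.
new_row = pd.DataFrame({"height_cm": [180], "weight_kg": [85]})
pred = clf.predict(new_row)
print(pred[0])
```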

How do I split data into train and test sets?

Use train_test_split from sklearn.model_selection. Pass your features X, target y, a test_size (typically 0.2 or 0.3), and always set random_state for reproducible splits. For imbalanced classification, add stratify=y so both splits keep the same class distribution.
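
The effect of stratify=y is easiest to see on a deliberately imbalanced target. The labels below are synthetic (90 of class 0, 10 of class 1), chosen only so the proportions are easy to check by eye.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification both splits keep the 10% positive rate.
print("Train positive rate:", y_train.mean())
print("Test positive rate: ", y_test.mean())
```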

Can I save trained models with pickle or joblib?

You can pickle and unpickle inside the same browser session, and joblib is available too. Persisting beyond a tab refresh requires extra plumbing — you can serialise to bytes, base64-encode them, and download via a Blob, then re-upload to load. For long-term model storage and serving, train here but deploy on a real backend.
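
The serialise-and-encode part can be sketched in pure Python as below. The browser download itself (the Blob step) depends on JavaScript interop and is not shown; the Iris model is only a stand-in for whatever you trained.

```python
import base64
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialise to bytes, then to a base64 string that can be copied or downloaded.
encoded = base64.b64encode(pickle.dumps(model)).decode("ascii")
print(f"Encoded model length: {len(encoded)} characters")

# Later, after re-uploading the text: decode and unpickle.
# Only unpickle data you trust; pickle can execute arbitrary code.
restored = pickle.loads(base64.b64decode(encoded))

same = (restored.predict(X) == model.predict(X)).all()
print("Predictions match:", same)
```

A pickled model is generally only safe to load with the same scikit-learn version that produced it, so note the version alongside the saved text.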

Does it support GridSearchCV and cross-validation?

Yes. GridSearchCV, RandomizedSearchCV, cross_val_score, KFold, StratifiedKFold and the rest of sklearn.model_selection all work. Keep n_jobs=1 in the browser — Pyodide is single-threaded — and prefer a small param grid so the search finishes in seconds instead of minutes.
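
A small search that stays within those limits might look like the following. The grid values, the SVC estimator, and the Iris dataset are illustrative assumptions; the point is the shape of the call and n_jobs=1.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Small grid: 3 x 2 = 6 candidates, each fitted on 5 folds.
# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=1,  # single-threaded, as recommended for the browser
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```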

Can I use this for production ML?

No. PythonHere is a learning, prototyping, and teaching environment. Pyodide runs single-threaded inside a browser tab, has no GPU access, and dies when the tab closes. For production, train and serve models from a proper Python environment — FastAPI, AWS SageMaker, Modal, Vertex AI, or any container platform.

Does it support deep learning?

Not really — PyTorch, TensorFlow, and JAX are too heavy or GPU-bound to run well in WebAssembly. Scikit-learn does ship a basic neural network (MLPClassifier and MLPRegressor) which is fine for tiny tabular problems. For real deep learning, use a Colab notebook or a local environment with a GPU.
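
For a sense of what the built-in neural network looks like, here is a minimal MLPClassifier sketch on Iris. The single hidden layer of 16 units and max_iter=2000 are arbitrary illustrative settings, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Neural networks are sensitive to feature scale, so scale first.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42),
)
mlp.fit(X_train, y_train)

mlp_accuracy = mlp.score(X_test, y_test)
print(f"MLP test accuracy: {mlp_accuracy:.3f}")
```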

Why is training slow in the browser?

Pyodide runs the CPython interpreter compiled to WebAssembly, single-threaded, inside your tab. Native sklearn on a desktop benefits from optimised multithreaded BLAS, OpenMP, and multiple cores; the browser version does not. Expect a noticeable slowdown versus local Python, roughly 2–5x depending on the workload. Keep datasets under ~100k rows and prefer fast estimators (LogisticRegression, HistGradientBoostingClassifier) for the smoothest experience.

Start Running Scikit-learn in Your Browser

Free forever. No install. No signup.

Open the Sklearn Editor →