Data ScienceIntermediate

Data Analysis with Pandas in Python

Learn Pandas for data analysis in Python! A beginner-friendly guide to DataFrames, filtering data, grouping, and handling messy missing data just like Excel.

Try it yourself

Run this code directly in your browser. Click "Open in full editor" to experiment further.

Loading...

Click Run to see output

Or press Ctrl + Enter

How it works

Pandas is the bedrock of Python data analysis. ๐Ÿผ If NumPy gave Python its math muscles, pandas gave it a brain for tables. Whenever you need to wrangle a CSV, clean messy survey data, summarise sales, or prep features for a model, pandas is the first tool you reach for. Even with Polars and DuckDB rising fast in 2026, it remains the most-imported data library in Python.

Why not just use plain lists and dicts? Because once your data has more than a handful of rows, raw Python becomes painful โ€” loops to filter, group, average, and handle missing values. A pandas dataframe turns all of that into one-liners running on optimised C code, with aligned indexes, label-based access, NaN-aware math, and a whole ecosystem (matplotlib, seaborn, scikit-learn, plotly) that speaks pandas natively.

Series and DataFrame

Pandas has two core types:

  • `Series` โ€” a single labelled column. Think of it as a dict-meets-numpy-array.
  • `DataFrame` โ€” a 2D table of Series sharing the same index. Each column can have its own dtype (int, float, string, datetime, category).
  • import pandas as pd
    s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    df = pd.DataFrame({'name': ['Ada', 'Linus'], 'age': [36, 54]})

    That's it. Every method you'll learn hangs off these two objects.

    Loading Data

    Pandas reads almost anything:

  • pd.read_csv('file.csv') โ€” the workhorse; supports compression, chunking, and dtype hints.
  • pd.read_excel('file.xlsx', sheet_name='Q1') โ€” needs openpyxl or xlrd.
  • pd.read_json('file.json') โ€” records or column-oriented.
  • pd.read_sql(query, conn) โ€” pulls a query result straight into a dataframe.
  • pd.read_parquet('file.parquet') โ€” columnar, compressed, fast โ€” the preferred format in 2026.
  • For anything bigger than ~100 MB, prefer Parquet over CSV. It's 5โ€“10ร— faster to read and preserves dtypes.

    The Six Operations You'll Use Daily

    Selecting โ€” df['col'] for one column, df[['a', 'b']] for several. For rows use df.loc[label] (label-based) or df.iloc[position] (integer position). Boolean indexing like df[df['age'] > 30] is the most common filter.

    Filtering โ€” combine conditions with & and | (and parentheses!): df[(df['age'] > 25) & (df['city'] == 'Paris')]. Use .query("age > 25 and city == 'Paris'") for readable filters on long expressions.

    Group-by โ€” df.groupby('city')['salary'].mean() collapses rows into per-group summaries. The split-apply-combine engine of pandas.

    Joining / merging โ€” pd.merge(left, right, on='id', how='left') mirrors SQL joins. df.join() is the index-based shortcut.

    Sorting โ€” df.sort_values('salary', ascending=False) for value sort, df.sort_index() for index sort.

    Aggregating โ€” df.agg({'salary': 'mean', 'age': ['min', 'max']}) runs multiple aggregations in one pass.

    Group-By: The Killer Feature

    The groupby pandas pattern is split-apply-combine: split the table into groups, apply a function to each, combine the results. Once it clicks, half of data analysis becomes one line.

    sales = pd.DataFrame({
        'region': ['EU', 'US', 'EU', 'US', 'APAC'],
        'product': ['A', 'A', 'B', 'B', 'A'],
        'revenue': [100, 250, 80, 300, 150]
    })
    sales.groupby('region')['revenue'].sum()
    # region
    # APAC    150
    # EU      180
    # US      550

    Multi-key groupings work too: sales.groupby(['region', 'product']).agg(total=('revenue', 'sum'), n=('revenue', 'count')). This is your pivot table on steroids.

    Common Pitfalls

    1. `SettingWithCopyWarning` โ€” chained assignment like df[df.age > 30]['salary'] = 0 modifies a copy, not df. Use .loc: df.loc[df.age > 30, 'salary'] = 0. Pandas 3.0 makes this an error, so enable pd.options.mode.copy_on_write = True today to future-proof your code.

    2. `inplace=True` confusion โ€” it's being deprecated and rarely saves memory anyway. Prefer df = df.dropna() over df.dropna(inplace=True).

    3. `.iloc` vs `.loc` โ€” .iloc[0] is the first row, .loc[0] is the row with index label 0. After filtering, those are usually different.

    4. Mutating during iteration โ€” don't loop with iterrows() and mutate. The view/copy semantics will bite you. Use vectorised ops or .assign().

    5. Mixed dtypes in one column โ€” a column of mostly numbers with a few strings becomes object dtype and loses all numeric speed.

    6. `NaN` propagation โ€” NaN + anything == NaN. Use .fillna(), .dropna(), or aggregations like .mean(skipna=True) (default).

    7. Looping rows instead of vectorising โ€” for i, row in df.iterrows(): ... is 100โ€“1000ร— slower than the vectorised equivalent. If you find yourself writing a row loop, stop and think about a column expression.

    Performance Tips

  • Vectorise everything. df['c'] = df['a'] + df['b'] beats any loop.
  • For genuinely hot inner loops, drop to df.to_numpy() and work in NumPy.
  • Convert repeated string columns to category dtype โ€” huge memory and groupby wins.
  • Use pd.read_csv(..., chunksize=100_000) to stream files that don't fit in RAM.
  • For datasets above ~1 GB, consider Polars or DuckDB โ€” they're typically 5โ€“10ร— faster and use less memory.
  • Enable PyArrow-backed strings: pd.options.future.infer_string = True.
  • Pandas vs NumPy vs Polars vs Spark

    ToolBest forAvoid when
    NumPyPure numeric arrays, math, ML internalsYou need labels, mixed dtypes, or SQL-like joins
    PandasTabular analysis up to a few GB, ecosystem glueMulti-GB data, strict typing, parallelism
    PolarsFast multi-core dataframes, lazy queries, big CSV/ParquetYou need a niche pandas integration
    Spark / DaskDistributed terabyte-scale dataSingle-machine work โ€” overkill

    Real-World Uses

    This python data analysis stack shows up everywhere: financial back-testing and risk reporting, A/B test analysis at startups, ETL pipelines stitching APIs into warehouses, log and clickstream analysis, scientific datasets in genomics and climate research, dashboards built with Plotly Dash or Streamlit, feature engineering for scikit-learn and XGBoost models, and time-series forecasting with Prophet or statsmodels. Most production data teams in 2026 still write their final "last mile" transforms in pandas even when Polars or DuckDB does the heavy lifting upstream.

    Frequently Asked Questions

    When should I use a DataFrame vs a dictionary? โ€” If you have one row, a dict is fine. If you have many rows that share the same keys, use a dataframe. The moment you want to filter, group, or aggregate, the dataframe wins.

    How do I handle missing values? โ€” df.isna().sum() to find them, df.fillna(value) to impute, df.dropna() to remove. For numeric columns, mean/median imputation is common; for categoricals, fill with 'Unknown' or the mode.

    Why is pandas slow on large datasets? โ€” Pandas is single-threaded and copies data often. For files above 1 GB, switch to Polars (multi-core, lazy) or DuckDB (SQL on Parquet).

    What's the difference between `.apply()` and a vectorised operation? โ€” .apply() runs Python code per row or column โ€” readable but slow. Vectorised ops (df['a'] * 2, df['a'].str.upper()) push the loop into C. Always try the vectorised form first.

    How do I read a CSV that doesn't fit in memory? โ€” Use chunksize= to stream, filter each chunk, and concat. Or convert once to Parquet and query lazily with Polars or DuckDB.

    Should I use pandas or Polars for new projects in 2026? โ€” For small data, teaching, notebooks, and ML feature engineering: pandas. For ETL pipelines, multi-GB files, or anything performance-sensitive: Polars. They convert to each other in milliseconds, so a hybrid workflow is the modern default.

    Run the code above to see a pandas dataframe come to life and try a real pandas tutorial in your browser โ€” no install required.

    Related examples