Data Analysis with Pandas in Python
Learn Pandas for data analysis in Python! A beginner-friendly guide to DataFrames, filtering data, grouping, and handling messy missing data just like Excel.
Try it yourself
Run this code directly in your browser. Click "Open in full editor" to experiment further.
Click Run to see output
Or press Ctrl + Enter
How it works
Pandas is the bedrock of Python data analysis. ๐ผ If NumPy gave Python its math muscles, pandas gave it a brain for tables. Whenever you need to wrangle a CSV, clean messy survey data, summarise sales, or prep features for a model, pandas is the first tool you reach for. Even with Polars and DuckDB rising fast in 2026, it remains the most-imported data library in Python.
Why not just use plain lists and dicts? Because once your data has more than a handful of rows, raw Python becomes painful โ loops to filter, group, average, and handle missing values. A pandas dataframe turns all of that into one-liners running on optimised C code, with aligned indexes, label-based access, NaN-aware math, and a whole ecosystem (matplotlib, seaborn, scikit-learn, plotly) that speaks pandas natively.
Series and DataFrame
Pandas has two core types:
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'name': ['Ada', 'Linus'], 'age': [36, 54]})That's it. Every method you'll learn hangs off these two objects.
Loading Data
Pandas reads almost anything:
pd.read_csv('file.csv') โ the workhorse; supports compression, chunking, and dtype hints.pd.read_excel('file.xlsx', sheet_name='Q1') โ needs openpyxl or xlrd.pd.read_json('file.json') โ records or column-oriented.pd.read_sql(query, conn) โ pulls a query result straight into a dataframe.pd.read_parquet('file.parquet') โ columnar, compressed, fast โ the preferred format in 2026.For anything bigger than ~100 MB, prefer Parquet over CSV. It's 5โ10ร faster to read and preserves dtypes.
The Six Operations You'll Use Daily
Selecting โ df['col'] for one column, df[['a', 'b']] for several. For rows use df.loc[label] (label-based) or df.iloc[position] (integer position). Boolean indexing like df[df['age'] > 30] is the most common filter.
Filtering โ combine conditions with & and | (and parentheses!): df[(df['age'] > 25) & (df['city'] == 'Paris')]. Use .query("age > 25 and city == 'Paris'") for readable filters on long expressions.
Group-by โ df.groupby('city')['salary'].mean() collapses rows into per-group summaries. The split-apply-combine engine of pandas.
Joining / merging โ pd.merge(left, right, on='id', how='left') mirrors SQL joins. df.join() is the index-based shortcut.
Sorting โ df.sort_values('salary', ascending=False) for value sort, df.sort_index() for index sort.
Aggregating โ df.agg({'salary': 'mean', 'age': ['min', 'max']}) runs multiple aggregations in one pass.
Group-By: The Killer Feature
The groupby pandas pattern is split-apply-combine: split the table into groups, apply a function to each, combine the results. Once it clicks, half of data analysis becomes one line.
sales = pd.DataFrame({
'region': ['EU', 'US', 'EU', 'US', 'APAC'],
'product': ['A', 'A', 'B', 'B', 'A'],
'revenue': [100, 250, 80, 300, 150]
})
sales.groupby('region')['revenue'].sum()
# region
# APAC 150
# EU 180
# US 550Multi-key groupings work too: sales.groupby(['region', 'product']).agg(total=('revenue', 'sum'), n=('revenue', 'count')). This is your pivot table on steroids.
Common Pitfalls
1. `SettingWithCopyWarning` โ chained assignment like df[df.age > 30]['salary'] = 0 modifies a copy, not df. Use .loc: df.loc[df.age > 30, 'salary'] = 0. Pandas 3.0 makes this an error, so enable pd.options.mode.copy_on_write = True today to future-proof your code.
2. `inplace=True` confusion โ it's being deprecated and rarely saves memory anyway. Prefer df = df.dropna() over df.dropna(inplace=True).
3. `.iloc` vs `.loc` โ .iloc[0] is the first row, .loc[0] is the row with index label 0. After filtering, those are usually different.
4. Mutating during iteration โ don't loop with iterrows() and mutate. The view/copy semantics will bite you. Use vectorised ops or .assign().
5. Mixed dtypes in one column โ a column of mostly numbers with a few strings becomes object dtype and loses all numeric speed.
6. `NaN` propagation โ NaN + anything == NaN. Use .fillna(), .dropna(), or aggregations like .mean(skipna=True) (default).
7. Looping rows instead of vectorising โ for i, row in df.iterrows(): ... is 100โ1000ร slower than the vectorised equivalent. If you find yourself writing a row loop, stop and think about a column expression.
Performance Tips
df['c'] = df['a'] + df['b'] beats any loop.df.to_numpy() and work in NumPy.category dtype โ huge memory and groupby wins.pd.read_csv(..., chunksize=100_000) to stream files that don't fit in RAM.pd.options.future.infer_string = True.Pandas vs NumPy vs Polars vs Spark
| Tool | Best for | Avoid when |
|---|---|---|
| NumPy | Pure numeric arrays, math, ML internals | You need labels, mixed dtypes, or SQL-like joins |
| Pandas | Tabular analysis up to a few GB, ecosystem glue | Multi-GB data, strict typing, parallelism |
| Polars | Fast multi-core dataframes, lazy queries, big CSV/Parquet | You need a niche pandas integration |
| Spark / Dask | Distributed terabyte-scale data | Single-machine work โ overkill |
Real-World Uses
This python data analysis stack shows up everywhere: financial back-testing and risk reporting, A/B test analysis at startups, ETL pipelines stitching APIs into warehouses, log and clickstream analysis, scientific datasets in genomics and climate research, dashboards built with Plotly Dash or Streamlit, feature engineering for scikit-learn and XGBoost models, and time-series forecasting with Prophet or statsmodels. Most production data teams in 2026 still write their final "last mile" transforms in pandas even when Polars or DuckDB does the heavy lifting upstream.
Frequently Asked Questions
When should I use a DataFrame vs a dictionary? โ If you have one row, a dict is fine. If you have many rows that share the same keys, use a dataframe. The moment you want to filter, group, or aggregate, the dataframe wins.
How do I handle missing values? โ df.isna().sum() to find them, df.fillna(value) to impute, df.dropna() to remove. For numeric columns, mean/median imputation is common; for categoricals, fill with 'Unknown' or the mode.
Why is pandas slow on large datasets? โ Pandas is single-threaded and copies data often. For files above 1 GB, switch to Polars (multi-core, lazy) or DuckDB (SQL on Parquet).
What's the difference between `.apply()` and a vectorised operation? โ .apply() runs Python code per row or column โ readable but slow. Vectorised ops (df['a'] * 2, df['a'].str.upper()) push the loop into C. Always try the vectorised form first.
How do I read a CSV that doesn't fit in memory? โ Use chunksize= to stream, filter each chunk, and concat. Or convert once to Parquet and query lazily with Polars or DuckDB.
Should I use pandas or Polars for new projects in 2026? โ For small data, teaching, notebooks, and ML feature engineering: pandas. For ETL pipelines, multi-GB files, or anything performance-sensitive: Polars. They convert to each other in milliseconds, so a hybrid workflow is the modern default.
Run the code above to see a pandas dataframe come to life and try a real pandas tutorial in your browser โ no install required.
Related examples
NumPy Array Operations in Python
Learn NumPy basics in Python! A fun and easy guide to super-fast arrays, matrices, and data science math without using slow for-loops.
Data Visualization with Matplotlib
Learn data visualization in Python with Matplotlib! A fun guide to creating line plots, scatter plots, and bar charts the recommended object-oriented way.