Python Pandas for Data Analysis: A Practical Intro
Load CSVs, explore DataFrames, filter and group data, and export results—ideal after you have Python installed.
Data & machine learning Beginner 7 min read
pandas wraps NumPy arrays with labeled indexes so you can align time series and join tables in memory. read_csv parses files into typed columns; info() and describe() surface bad dtypes, null counts, and suspicious ranges before they skew your aggregates.
Install Python in a venv, then pip install pandas (often with pyarrow or matplotlib).
Core objects
Why Series vs DataFrame: A Series is a single column with an index; a DataFrame is a table—operations broadcast along columns.
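A minimal sketch of the distinction (the column names here are illustrative, not from any particular dataset):

```python
import pandas as pd

# A Series: one labeled column of values with its own index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="revenue")

# A DataFrame: a table of aligned columns sharing one index.
df = pd.DataFrame({"revenue": [10, 20, 30], "units": [1, 2, 3]})

# Operations broadcast along columns: this doubles every revenue value
# and assigns the result as a new column.
df["revenue_x2"] = df["revenue"] * 2
```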
Load and peek
Why head/info/describe: Confirms delimiter parsing worked and shows null counts and numeric ranges before you trust aggregates.
import pandas as pd

df = pd.read_csv("data.csv")
df.head()      # first rows: did the delimiter and header parse correctly?
df.info()      # dtypes and non-null counts per column
df.describe()  # min/max/mean/quartiles for numeric columns
Select and filter
Why boolean indexing: Vectorized filters are faster than Python loops and keep code readable.
df[df["region"] == "EU"]
df.loc[:, ["date", "revenue"]]
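A common stumbling block when combining filters: use `&`/`|` with parentheses around each clause, since Python's `and`/`or` do not work on Series. A small self-contained sketch (the toy data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "US", "EU"],
    "revenue": [100, 250, 80],
})

# Parenthesize each condition; & and | are element-wise on Series.
eu_big = df[(df["region"] == "EU") & (df["revenue"] >= 100)]

# .loc can filter rows and select a column in one step.
eu_rev = df.loc[df["region"] == "EU", "revenue"]
```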
Group and aggregate
df.groupby("product_id")["revenue"].sum()
What this does: Splits rows by product_id, computes per-group sum—SQL GROUP BY analogue.
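Beyond a single sum, `.agg` with named aggregation computes several stats per group in one pass; the output column names go on the left. A sketch with illustrative toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": ["A", "A", "B"],
    "revenue": [10.0, 15.0, 7.0],
})

# Named aggregation: one pass, clearly labeled result columns.
summary = df.groupby("product_id").agg(
    total=("revenue", "sum"),
    orders=("revenue", "count"),
)
```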
Joins
pd.merge(df_orders, df_customers, on="customer_id", how="left")
Why specify how: Defaults differ from SQL habits; left joins keep all orders even if a customer row is missing.
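Two merge parameters worth knowing: `validate` makes the merge fail loudly on unexpected duplicate keys, and `indicator` labels where each row came from. The toy frames below are illustrative:

```python
import pandas as pd

df_orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [9.5, 20.0, 5.0]})
df_customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# validate="many_to_one" raises if customer_id is duplicated on the right;
# indicator=True adds a _merge column: "both" or "left_only" here.
merged = pd.merge(
    df_orders, df_customers,
    on="customer_id", how="left",
    validate="many_to_one", indicator=True,
)
```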
Missing data
Use isna() to locate gaps, fillna() for explicit defaults, or dropna() for explicit drops—and document your assumptions for reproducibility.
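A short sketch of that workflow on an illustrative single-column frame:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, None, 80.0]})

# Count nulls per column before deciding how to handle them.
null_counts = df.isna().sum()

# Fill with an explicit, documented default...
filled = df.fillna({"revenue": 0.0})

# ...or drop the affected rows and note how many you lost.
dropped = df.dropna(subset=["revenue"])
```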
Frequently asked questions
Pandas vs SQL?
Use SQL in the database for big aggregates; pandas shines for exploration and glue code in Python.
Performance?
For huge tables, consider Polars, DuckDB, or push work into the database.
Notebooks?
Jupyter is fine; keep notebooks under Git with cleared outputs or export scripts for production.
Time zones?
Parse with pd.to_datetime(..., utc=True) and normalize early.
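A sketch of that normalize-early pattern, using illustrative mixed-offset timestamps:

```python
import pandas as pd

# Mixed-offset strings parse cleanly when normalized to UTC up front.
ts = pd.to_datetime(
    ["2024-01-01T09:00:00+01:00", "2024-01-01T08:00:00+00:00"],
    utc=True,
)

# Convert to a display zone only at the edges of your pipeline;
# tz_convert changes the label, not the underlying instant.
local = ts.tz_convert("Europe/Berlin")
```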