Python Pandas for Data Analysis: A Practical Intro

Load CSVs, explore DataFrames, filter and group data, and export results—ideal after you have Python installed.


pandas wraps NumPy arrays with labeled indexes so you can align time series and join tables in memory. read_csv parses files into typed columns; info() and describe() give you exploratory stats that catch bad dtypes and unexpected nulls before you aggregate incorrectly.

Install Python in a venv, then pip install pandas (often with pyarrow or matplotlib).

Core objects

Why Series vs DataFrame: A Series is a single column with an index; a DataFrame is a table—operations broadcast along columns.
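A minimal sketch of the distinction (the column names and values here are invented for illustration):

```python
import pandas as pd

# A Series: one column of values with an index
s = pd.Series([10, 20, 30], name="revenue")

# A DataFrame: a table of labeled columns sharing one index
df = pd.DataFrame({"region": ["EU", "US", "EU"], "revenue": [10, 20, 30]})

# Selecting a single column of a DataFrame returns a Series
col = df["revenue"]
```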

Load and peek

Why head/info/describe: Confirms delimiter parsing worked and shows null counts and numeric ranges before you trust aggregates.

import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.info()
df.describe()

Select and filter

Why boolean indexing: Vectorized filters are faster than Python loops and keep code readable.

df[df["region"] == "EU"]
df.loc[:, ["date", "revenue"]]
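Conditions can be combined with & and | (parenthesized, since these are bitwise operators that bind tighter than comparisons); the data below is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "US", "EU", "EU"],
    "revenue": [100, 250, 300, 50],
})

# Parentheses are required around each condition
high_eu = df[(df["region"] == "EU") & (df["revenue"] > 80)]
```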

Group and aggregate

df.groupby("product_id")["revenue"].sum()

What this does: Splits rows by product_id and computes a per-group sum; it is the analogue of SQL's GROUP BY.
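The same split-apply-combine mechanism extends to several aggregates at once via .agg; the table below is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": ["A", "B", "A", "B"],
    "revenue": [10.0, 20.0, 30.0, 40.0],
})

# One aggregate per group
per_product = df.groupby("product_id")["revenue"].sum()

# Several aggregates at once, one column per function
stats = df.groupby("product_id")["revenue"].agg(["sum", "mean", "count"])
```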

Joins

pd.merge(df_orders, df_customers, on="customer_id", how="left")

Why specify how: Defaults differ from SQL habits; left joins keep all orders even if a customer row is missing.
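A toy example (both tables invented) showing why the left join matters: order 3 references a customer_id missing from df_customers, and how="left" keeps that order with NaN in the customer columns instead of silently dropping it:

```python
import pandas as pd

df_orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 99]})
df_customers = pd.DataFrame({"customer_id": [10, 20], "name": ["Ada", "Grace"]})

# All three orders survive; name is NaN for the unmatched customer_id 99
merged = pd.merge(df_orders, df_customers, on="customer_id", how="left")
```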

Missing data

Use isna() to locate gaps, fillna() to impute, or dropna() for explicit drops; document your assumptions so the analysis stays reproducible.
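A short sketch of the three options (the column and values are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"revenue": [100.0, np.nan, 300.0]})

null_count = df["revenue"].isna().sum()   # count missing values
filled = df["revenue"].fillna(0)          # explicit imputation
dropped = df.dropna(subset=["revenue"])   # or drop, stating why
```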

Frequently asked questions

Pandas vs SQL?

Use SQL in the database for big aggregates; pandas shines for exploration and glue code in Python.

Performance?

For huge tables, consider Polars, DuckDB, or push work into the database.

Notebooks?

Jupyter is fine; keep notebooks under Git with cleared outputs or export scripts for production.

Time zones?

Parse with pd.to_datetime(..., utc=True) and normalize early.
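For example (the timestamps below are invented), utc=True yields a tz-aware UTC index regardless of the offsets in the input, which you can convert for display at the edges:

```python
import pandas as pd

# Mixed-offset inputs all normalize to UTC
ts = pd.to_datetime(
    ["2024-01-01T09:00:00+01:00", "2024-01-01T09:00:00-05:00"],
    utc=True,
)

# Convert only when presenting, keep UTC internally
local = ts.tz_convert("Europe/Berlin")
```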