Practical Data Analysis with pandas for Developers

pandas is one of the most useful Python libraries when your work involves tabular data. If you regularly open spreadsheets, clean messy rows, combine files, calculate summaries, or build charts, pandas gives you a programmable way to do the same work with less manual effort.

The central idea is simple: pandas lets you treat a table as a Python object. You can load data, inspect it, clean it, filter it, join it with another table, calculate statistics, create pivot-style summaries, and export the result. Instead of hiding logic inside spreadsheet formulas and manual clicks, the analysis becomes readable code that can be rerun whenever new data arrives.

This post walks through a practical pandas workflow from beginning to end. The examples use small in-memory datasets so the behavior is easy to see, but the same techniques apply when the data comes from CSV files, Excel workbooks, databases, or other systems.

The Problem

A typical data analysis task often looks like this:

Receive a table from a spreadsheet, CSV file, or export from another system.
Check the shape, column names, data types, and missing values.
Select only the rows and columns that matter.
Fix messy values, missing data, duplicated records, and inconsistent text.
Combine the data with lookup tables or other exports.
Calculate summaries by category, region, product, date, or customer.
Create charts to communicate the result.
Export the cleaned or summarized data for someone else.

In a spreadsheet, these steps often become a mixture of formulas, filters, hidden columns, manual copy-paste work, and pivot tables. That can be fine for small one-off tasks, but it gets risky when the same analysis must be repeated every week or every month.

pandas helps by making each step explicit:

Raw data
  |
  v
DataFrame
  |
  v
Inspect, clean, filter, and enrich
  |
  v
Aggregate and visualize
  |
  v
Export result

The main object you work with is the DataFrame.

DataFrame and Series

A DataFrame is a two-dimensional table with row labels, column labels, and data. It is similar to a spreadsheet range, but it is controlled with Python code.

A Series is one-dimensional. You usually get a Series when you select one column or one row from a DataFrame.

A DataFrame has three important parts:

Part	What it means
Index	Row labels. These can be numbers, IDs, dates, or other labels.
Columns	Column labels. These describe the fields in the table.
Data	The actual values inside the table. Each column can have its own data type.

Here is a small order dataset:

import pandas as pd

rows = [
    ["North", "Keyboard", "paid", 2, 125.0],
    ["South", "Mouse", "paid", 5, 35.0],
    ["North", "Monitor", "pending", 1, 299.0],
    ["West", "Keyboard", "paid", 3, 120.0],
    ["South", "Monitor", "cancelled", 1, 310.0],
]

orders = pd.DataFrame(
    data=rows,
    columns=["region", "product", "status", "quantity", "unit_price"],
    index=[5001, 5002, 5003, 5004, 5005],
)

orders.index.name = "order_id"
orders.columns.name = "fields"

orders

The custom index is useful because the row label now represents an order_id instead of a generic row number.

A few quick inspection methods are worth learning early:

orders.info()
orders.dtypes
orders.head(3)
orders.tail(2)
orders.describe()

Use these as your first sanity check:

info() shows column names, non-missing counts, data types, index type, and memory usage.
dtypes shows the data type of each column.
head() shows the first rows.
tail() shows the last rows.
describe() gives basic statistics for numeric columns, such as count, mean, standard deviation, min, max, and percentiles.

Working with the Index

The index is not the same as a database primary key. A pandas index can contain duplicates, but a clean unique index is often easier to reason about.

You can turn the index into a normal column:

orders_with_id = orders.reset_index()
orders_with_id

You can set a column as the index:

orders_by_id = orders_with_id.set_index("order_id")
orders_by_id

You can reindex to align the table to a specific set of row labels:

requested_orders = orders.reindex([5002, 5003, 5099])
requested_orders

When pandas cannot find a requested label, it creates a row with missing values. Missing values are usually represented as NaN, which means "not a number", but pandas also uses it as a general missing-value marker in many table operations.

Sorting by index is also common:

orders.sort_index()

Sorting by values is different. Here, the data is sorted by region first, then by unit price:

orders.sort_values(["region", "unit_price"])

Working with Columns

Column names should be readable and consistent. You can rename columns without changing the original DataFrame:

renamed_orders = orders.rename(
    columns={
        "unit_price": "price_each",
        "quantity": "items",
    }
)

renamed_orders

Many pandas methods return a new DataFrame instead of editing the existing one. That is why the result is assigned to renamed_orders. If you call a method and do not assign the result, the original variable still points to the original data.

You can drop columns or rows:

smaller_orders = orders.drop(
    columns=["status"],
    index=[5005],
)

smaller_orders

You can transpose a table, which swaps rows and columns:

orders.T

Transposing is not something you use every day, but it can be useful when a table is easier to inspect with rows and columns switched.

You can also reorder columns explicitly:

orders.loc[:, ["product", "region", "quantity", "unit_price", "status"]]

That example uses loc, which is one of the most important pandas selection tools.

Selecting Data with loc

Use loc when you want to select by labels.

The general pattern is:

dataframe.loc[row_selection, column_selection]

Examples:

orders.loc[5001, "product"]

Selecting one row and multiple columns returns a Series:

orders.loc[5001, ["product", "quantity", "unit_price"]]

Selecting multiple rows and multiple columns returns a DataFrame:

orders.loc[[5001, 5003, 5005], ["region", "product", "status"]]

Label slicing with loc includes both the start and stop labels:

orders.loc[5001:5003, ["region", "product"]]

That is an important detail. Standard Python slicing excludes the stop position, but label-based slicing with pandas includes the final label.

For column-only selection, pandas offers a shorthand:

orders["product"]

That returns a Series.

Use double brackets when you want a DataFrame with one or more columns:

orders[["product", "status"]]

Selecting Data with iloc

Use iloc when you want to select by integer position.

The general pattern is:

dataframe.iloc[row_position, column_position]

Examples:

orders.iloc[0, 1]

That returns the value in the first row and second column.

Select specific rows and one column:

orders.iloc[[0, 2, 4], 1]

Select a position-based rectangular subset:

orders.iloc[:3, [0, 1, 3]]

With iloc, slicing behaves like normal Python slicing: the start is included, the stop is excluded.

Use loc when labels matter. Use iloc when positions matter.

Filtering with Boolean Indexing

Boolean indexing means you create a Series of True and False values, then use it to filter rows.

For example, select paid orders from the North region:

paid_north = (orders["status"] == "paid") & (orders["region"] == "North")
orders.loc[paid_north, :]

The parentheses are not optional when combining conditions. Without them, Python may evaluate the expression in a different order than you expect.

Boolean operators in pandas look different from normal Python operators:

Meaning	pandas operator
And	`&`
Or	`
Not	`~`

Select rows where the region is one of several allowed values:

orders.loc[orders["region"].isin(["North", "West"]), :]

Filter using the index:

orders.loc[orders.index > 5002, :]

You can also apply a boolean DataFrame to another DataFrame. Values that do not pass the condition become missing:

monthly_sales = pd.DataFrame(
    data=[
        [1200.0, 800.0, 1500.0],
        [950.0, 1100.0, 1700.0],
    ],
    columns=["North", "South", "West"],
    index=["January", "February"],
)

monthly_sales[monthly_sales >= 1000.0]

This is useful when you want to keep the table shape but hide values that fail a condition.

Setting Data Safely

Before changing a DataFrame, make a copy when you still need the original version:

clean_orders = orders.copy()

Set a single value by label:

clean_orders.loc[5003, "status"] = "paid"

Set multiple values:

clean_orders.loc[[5001, 5002], "quantity"] = [3, 6]

Set values with boolean indexing:

large_order = clean_orders["quantity"] >= 5
clean_orders.loc[large_order, "status"] = "priority"

Replace values across the whole table or selected columns:

standardized = clean_orders.replace("cancelled", "canceled")

Add a new column:

clean_orders["revenue"] = clean_orders["quantity"] * clean_orders["unit_price"]
clean_orders

This is where pandas starts to feel powerful. The multiplication is vectorized, so pandas calculates the result for the whole column without you writing a loop.

Handling Missing Data

Real data often contains blanks. pandas gives you direct tools for finding, dropping, and filling missing values.

Create a small, messy table:

messy_orders = clean_orders.copy()
messy_orders.loc[5002, "unit_price"] = None
messy_orders.loc[5005, ["region", "product", "status", "quantity", "unit_price", "revenue"]] = None

messy_orders

Drop rows that contain at least one missing value:

messy_orders.dropna()

Drop only rows where all values are missing:

messy_orders.dropna(how="all")

Check missing values:

messy_orders.isna()

Fill selected columns with meaningful defaults:

filled_orders = messy_orders.fillna(
    {
        "status": "unknown",
        "unit_price": messy_orders["unit_price"].mean(),
        "quantity": 0,
        "revenue": 0,
    }
)

filled_orders

A good missing-data strategy depends on the meaning of the column. Filling a missing status with "unknown" can be reasonable. Filling a missing price with the mean may be acceptable for a demo, but in a real business process, you should decide whether that value should be corrected upstream instead.

Handling Duplicate Data

Duplicated records can break summaries. pandas gives you tools to detect and remove them.

with_duplicates = pd.concat([orders, orders.loc[[5002]]])
with_duplicates

Check whether a column has unique values:

with_duplicates["product"].is_unique

Return unique values from a column:

with_duplicates["region"].unique()

Find duplicated rows:

with_duplicates.duplicated()

Mark all rows involved in a duplicate, not only the later repeated rows:

with_duplicates.duplicated(keep=False)

Drop duplicate rows:

deduplicated_orders = with_duplicates.drop_duplicates()
deduplicated_orders

The default behavior keeps the first occurrence and removes later duplicates. That is often fine, but not always. For production workflows, decide whether the first, last, or no duplicate record should be kept.

Arithmetic Operations and Data Alignment

pandas aligns data by labels before calculating. This is one of the most important differences compared with many manual spreadsheet workflows.

Consider two regional sales tables. They do not have the same months or the same columns:

plan = pd.DataFrame(
    data=[
        [1000.0, 1200.0],
        [1100.0, 1300.0],
    ],
    columns=["North", "South"],
    index=["January", "February"],
)

actual = pd.DataFrame(
    data=[
        [900.0, 1250.0],
        [1400.0, 700.0],
    ],
    columns=["South", "West"],
    index=["February", "March"],
)

plan + actual

pandas does not simply add the first column to the first column and the first row to the first row. It aligns by index and column labels. Where a label exists only on one side, the result is missing.

When you want missing labels to behave like zero during an arithmetic operation, use the method form with fill_value:

combined = plan.add(actual, fill_value=0)
combined

pandas supports arithmetic through normal operators and method names:

Operation	Operator	Method
Multiplication	`*`	`mul`
Addition	`+`	`add`
Subtraction	`-`	`sub`
Division	`/`	`div`
Power	`**`	`pow`

For business analysis, arithmetic is often done on columns:

margin_orders = clean_orders.copy()
margin_orders["gross_amount"] = margin_orders["quantity"] * margin_orders["unit_price"]
margin_orders["discount"] = 0.10
margin_orders["net_amount"] = margin_orders["gross_amount"] * (1 - margin_orders["discount"])

margin_orders[["region", "product", "gross_amount", "net_amount"]]

This approach is clearer than creating helper columns manually in a spreadsheet each time new data arrives.

Cleaning Text Columns

Text columns often contain extra spaces, inconsistent capitalization, or values that should follow a standard format.

pandas exposes string operations through the .str accessor:

customers = pd.DataFrame(
    data=[
        ["  anna  "],
        ["BOB"],
        ["  clara"],
        ["john  "],
    ],
    columns=["name"],
)

clean_names = customers["name"].str.strip().str.capitalize()
clean_names

You can also create boolean filters from text operations:

clean_names.str.startswith("J")

Text cleaning is a good place to chain small operations. Keep the chain readable and assign the cleaned result to a named variable or a new column:

customers["normalized_name"] = customers["name"].str.strip().str.capitalize()
customers

Combining DataFrames with concat

Use concat when you want to stack tables vertically or place them side by side.

Create another batch of orders with the same columns:

more_rows = [
    ["East", "Mouse", "paid", 2, 32.0],
    ["North", "Laptop Stand", "pending", 1, 80.0],
]

more_orders = pd.DataFrame(
    data=more_rows,
    columns=["region", "product", "status", "quantity", "unit_price"],
    index=[5006, 5007],
)

all_orders = pd.concat([orders, more_orders])
all_orders

Vertical concatenation is useful when you receive multiple files with the same structure.

You can also concatenate horizontally with axis=1:

risk_flags = pd.DataFrame(
    data=["low", "medium", "high", "low", "medium"],
    columns=["risk_level"],
    index=[5001, 5002, 5003, 5004, 5005],
)

orders_with_risk = pd.concat([orders, risk_flags], axis=1)
orders_with_risk

Horizontal concatenation depends on index alignment. If the indexes do not match, pandas will align where possible and fill the rest with missing values.

Combining DataFrames with join and merge

Use join when you want to combine tables horizontally by index.

shipment_status = pd.DataFrame(
    data=[
        ["packed"],
        ["shipped"],
        ["waiting"],
    ],
    columns=["shipment_state"],
    index=[5001, 5002, 5010],
)

orders.join(shipment_status, how="left")

Common join types:

Join type	Result
`inner`	Keep only labels that exist in both tables.
`left`	Keep all labels from the left table and match from the right table where possible.
`right`	Keep all labels from the right table and match from the left table where possible.
`outer`	Keep the union of labels from both tables.

Use merge when the matching key is stored in one or more columns instead of the index.

region_lookup = pd.DataFrame(
    data=[
        ["North", "Nordic Desk"],
        ["South", "Southern Desk"],
        ["West", "Western Desk"],
    ],
    columns=["region", "team"],
)

orders.reset_index().merge(region_lookup, how="left", on=["region"])

A left merge is similar to a lookup operation: keep every row from the left table, and bring in matching values from the right table.

Descriptive Statistics

Once the data is clean enough, the next step is usually summarization.

Basic descriptive statistics are available as methods:

clean_orders["unit_price"].mean()
clean_orders["quantity"].sum()
clean_orders["status"].count()

For a DataFrame, many statistics operate column by column by default:

numeric_summary = clean_orders[["quantity", "unit_price"]].mean()
numeric_summary

To calculate row-wise statistics, provide axis=1:

score_table = pd.DataFrame(
    data=[
        [80, 90, 75],
        [60, 70, 85],
    ],
    columns=["test_a", "test_b", "test_c"],
    index=["candidate_1", "candidate_2"],
)

score_table.mean(axis=1)

By default, missing values are ignored by many descriptive statistics such as mean and sum. That behavior is similar to how spreadsheet averages usually ignore blank cells.

Grouping Data

groupby lets you calculate statistics per category.

Average unit price by region:

clean_orders.groupby(["region"])["unit_price"].mean()

Total quantity by product:

clean_orders.groupby(["product"])["quantity"].sum()

You can also aggregate multiple numeric columns. The next example calculates the range within each region by subtracting the smallest value from the largest value:

numeric_selection = clean_orders[["region", "quantity", "unit_price"]]

numeric_selection.groupby(["region"]).agg(lambda values: values.max() - values.min())

Lambda expressions are small anonymous functions. They are useful when the aggregation is simple and used only once.

For larger codebases, prefer named functions when the logic needs a meaningful name or tests.

Pivot Tables

A pivot table summarizes values across two dimensions. In pandas, use pd.pivot_table.

Create a transaction-style table where each row is one sale:

sales_rows = [
    ["Keyboard", "North", 250.0],
    ["Keyboard", "South", 175.0],
    ["Mouse", "North", 70.0],
    ["Mouse", "West", 96.0],
    ["Monitor", "South", 620.0],
    ["Monitor", "North", 299.0],
]

sales = pd.DataFrame(
    data=sales_rows,
    columns=["product", "region", "revenue"],
)

sales_pivot = pd.pivot_table(
    sales,
    index="product",
    columns="region",
    values="revenue",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
)

sales_pivot

The arguments mean:

Argument	Purpose
`index`	Values from this column become row labels.
`columns`	Values from this column become column labels.
`values`	Numeric values to aggregate.
`aggfunc`	Aggregation function, such as `sum` or `mean`.
`margins`	Adds total rows and columns.
`margins_name`	Name used for the total label.

Pivot tables are useful when you want to answer questions like:

Which product creates the most revenue per region?
Which region contributes most to each product?
Which product-region combinations are missing?
What are the totals across rows and columns?

Plotting DataFrames

Tables are good for exact values. Charts are better for patterns.

pandas can create plots directly from a DataFrame. By default, the plotting backend is Matplotlib.

quarterly_sales = pd.DataFrame(
    data=[
        [12000.0, 8000.0, 10000.0],
        [15000.0, 9500.0, 11000.0],
        [13000.0, 10500.0, 12500.0],
        [17000.0, 11500.0, 14000.0],
    ],
    index=["Q1", "Q2", "Q3", "Q4"],
    columns=["North", "South", "West"],
)

quarterly_sales.index.name = "quarter"
quarterly_sales.columns.name = "region"

quarterly_sales.plot()

A line plot is the default when you call plot().

You can create a bar chart:

quarterly_sales.plot.bar()

pandas supports several plot types:

Plot type	Method
Line chart	`df.plot()` or `df.plot.line()`
Vertical bar chart	`df.plot.bar()`
Horizontal bar chart	`df.plot.barh()`
Histogram	`df.plot.hist()`
Box plot	`df.plot.box()`
Area chart	`df.plot.area()`
Scatter plot	`df.plot.scatter()`
Pie chart	`df.plot.pie()`

For interactive charts in notebooks, pandas can use Plotly as the plotting backend:

pd.set_option("plotting.backend", "plotly")

quarterly_sales.plot()
quarterly_sales.plot.bar(barmode="group")

With an interactive backend, charts can support actions such as zooming, hovering over data points, and toggling series from the legend. When you switch plotting backends, check the plotting library documentation for backend-specific arguments.

Importing and Exporting Data

In real projects, you rarely type data by hand. pandas includes reader functions for loading data and writer methods for exporting data.

Common examples:

Format or system	Import function	Export method
CSV	`pd.read_csv()`	`df.to_csv()`
Excel	`pd.read_excel()`	`df.to_excel()`
JSON	`pd.read_json()`	`df.to_json()`
HTML	`pd.read_html()`	`df.to_html()`
Clipboard	`pd.read_clipboard()`	`df.to_clipboard()`
Parquet	`pd.read_parquet()`	`df.to_parquet()`

Export to CSV:

clean_orders.to_csv("clean_orders.csv")

Read a CSV file:

loaded_orders = pd.read_csv("clean_orders.csv")

If the CSV uses a separator other than a comma, provide sep:

semicolon_orders = pd.read_csv("orders_semicolon.csv", sep=";")

On Windows, use raw strings for paths so backslashes are handled literally:

clean_orders.to_csv(r"C:\reports\clean_orders.csv")

After loading a larger file, start with exploration:

loaded_orders.info()
loaded_orders.head()
loaded_orders.tail(3)
loaded_orders.describe()

This should become a habit. Before writing filters, joins, or summaries, confirm that the file loaded as expected.

A Practical End-to-End Workflow

Here is a compact workflow that combines the main ideas.

import pandas as pd

raw_orders = pd.DataFrame(
    data=[
        [" north ", "Keyboard", "paid", 2, 125.0],
        ["South", "Mouse", "paid", 5, 35.0],
        ["NORTH", "Monitor", None, 1, 299.0],
        ["West", "Keyboard", "paid", 3, 120.0],
        ["South", "Monitor", "cancelled", 1, None],
    ],
    columns=["region", "product", "status", "quantity", "unit_price"],
    index=[5001, 5002, 5003, 5004, 5005],
)

raw_orders.index.name = "order_id"

cleaned = raw_orders.copy()
cleaned["region"] = cleaned["region"].str.strip().str.capitalize()
cleaned["status"] = cleaned["status"].fillna("unknown")
cleaned["unit_price"] = cleaned["unit_price"].fillna(cleaned["unit_price"].mean())
cleaned["revenue"] = cleaned["quantity"] * cleaned["unit_price"]

paid_only = cleaned.loc[cleaned["status"].isin(["paid"]), :]

summary = pd.pivot_table(
    paid_only,
    index="product",
    columns="region",
    values="revenue",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
)

summary.to_csv("paid_order_summary.csv")
summary

The workflow does the following:

Builds a DataFrame.
Names the index.
Copies the original data before changes.
Cleans inconsistent region values.
Fills missing status values.
Fills missing price values.
Adds a calculated revenue column.
Filters to paid orders.
Builds a pivot-style summary.
Exports the result to CSV.

The exact business rules will differ in real systems, but the pattern is reusable.

Common Mistakes to Watch For

Forgetting that many methods return copies

This does not change orders:

orders.rename(columns={"unit_price": "price_each"})

Assign the result:

orders_renamed = orders.rename(columns={"unit_price": "price_each"})

Mixing up loc and iloc

Use loc for labels:

orders.loc[5001, "product"]

Use iloc for positions:

orders.iloc[0, 1]

Missing parentheses in boolean filters

This is the safer pattern:

selected = (orders["region"] == "North") & (orders["status"] == "paid")
orders.loc[selected, :]

Each condition is wrapped in parentheses.

Assuming arithmetic is position-based

pandas aligns by labels. This is usually a feature, not a problem. It prevents accidental mismatches, but it can produce missing values when labels do not overlap.

Use method forms with fill_value when missing labels should be treated as zero:

plan.add(actual, fill_value=0)

Ignoring missing values after joins

Left, right, and outer joins can create missing values. Always inspect the result after combining tables:

joined = orders.join(shipment_status, how="left")
joined.isna()

Cleaning text too late

String inconsistencies can break grouping and merging. For example, "North", " north ", and "NORTH" may represent the same category but behave like different values until normalized.

Checklist for pandas Data Analysis

Use this checklist when starting a new pandas task:

Load the data into a DataFrame.
Run info() to check columns, missing counts, data types, and memory usage.
Run head() and tail() to confirm the file loaded correctly.
Rename confusing columns.
Decide what the index should represent.
Normalize text columns that will be used for filtering, grouping, or joining.
Check missing values with isna().
Decide whether to drop or fill missing values.
Check duplicates with duplicated() and important columns with is_unique.
Use loc for label-based selection.
Use iloc for position-based selection.
Use boolean indexing for filters.
Use concat to stack similar tables.
Use join when matching by index.
Use merge when matching by columns.
Use groupby for category-based statistics.
Use pivot_table for two-dimensional summaries.
Plot the result when visual comparison is easier than reading numbers.
Export the cleaned or summarized result.

When pandas May Not Be Enough

pandas is very capable, but it has practical limits.

The most important limitation is memory. A pandas DataFrame must fit into available RAM. If the data is too large, you may hit memory problems.

Another limitation is that many core operations are mostly single-threaded. Some functions can use multiple cores, but common operations such as grouping may not use all available CPU cores.

For larger workloads, two common alternatives are worth knowing:

Polars: a DataFrame library implemented in Rust. It focuses on speed, memory efficiency, parallel execution, and can work with data larger than memory.
DuckDB: an in-process analytical database implemented in C++. It can query pandas DataFrame objects and can handle larger-than-memory datasets.

For many everyday spreadsheet replacement tasks, pandas is still an excellent starting point. When data size or performance becomes a real bottleneck, consider whether one of these alternatives fits the workload better.

Conclusion

pandas turns tabular data work into readable, repeatable Python code. The main workflow is straightforward: load a DataFrame, inspect it, clean it, select the rows and columns you need, combine it with related data, calculate summaries, create charts, and export the result.

The most important concepts to practice are DataFrame, Series, index alignment, loc, iloc, boolean indexing, missing data handling, duplicates, concat, join, merge, groupby, and pivot_table.

Once these tools feel natural, many manual spreadsheet tasks become scripts that are easier to review, rerun, debug, and share.