Part 9: More Pandas
Kerry Back, Rice University
Sorting Data
import pandas as pd
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
'salary': [50000, 80000, 55000, 60000, 85000]
})
# Sort by salary (descending)
print(employees.sort_values('salary', ascending=False))
name department salary
4 Eve Engineering 85000
1 Bob Engineering 80000
3 Diana Marketing 60000
2 Charlie Sales 55000
0 Alice Sales 50000
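sort_values() also accepts a list of columns; for example, to sort by department and then by salary (highest first) within each department:
# Sort by department, then by salary descending within each department
print(employees.sort_values(['department', 'salary'], ascending=[True, False]))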
Practice: Sorting Data
Exercise 1 (with Gemini): Ask Gemini to “sort a DataFrame by the ‘name’ column in alphabetical order”
Exercise 2 (on your own): Type employees.sort_values('salary') and run it to sort by salary in ascending order.
Grouping and Aggregation
Group data by categories and compute statistics:
# Average salary by department
dept_avg = employees.groupby('department')['salary'].mean()
print(dept_avg)
department
Engineering 82500.0
Marketing 60000.0
Sales 52500.0
Name: salary, dtype: float64
Common aggregation functions: mean(), median(), min(), max(), std(), count()
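To compute several of these at once, pass a list of names to .agg():
# Several statistics per department in one call
print(employees.groupby('department')['salary'].agg(['mean', 'min', 'max']))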
Practice: Grouping Data
Exercise 1 (with Gemini): Ask Gemini to “group a DataFrame by department and calculate the maximum salary in each department”
Exercise 2 (on your own): Type employees.groupby('department')['salary'].count() and run it to count employees per department.
Pivot Tables
Reshape data to summarize by multiple dimensions:
import random
random.seed(42)
sales = pd.DataFrame({
'product': ['Widget A', 'Widget B', 'Widget C'] * 6,
'region': ['North'] * 9 + ['South'] * 9,
'revenue': [random.randint(100, 500) for _ in range(18)]
})
pivot = sales.pivot_table(values='revenue',
index='product',
columns='region',
aggfunc='sum')
print(pivot)
region North South
product
Widget A 1120 847
Widget B 568 706
Widget C 814 996
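pivot_table() can also append row and column totals; a minimal sketch (pandas labels the totals 'All'):
# margins=True adds 'All' row/column totals
pivot_totals = sales.pivot_table(values='revenue',
                                 index='product',
                                 columns='region',
                                 aggfunc='sum',
                                 margins=True)
print(pivot_totals)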
Handling Missing Data
import numpy as np
data = pd.DataFrame({
'product': ['Widget A', 'Widget B', None, 'Widget C'],
'price': [10.5, None, 15.0, 12.75],
'quantity': [100, 200, 150, None]
})
print("Missing values per column:")
print(data.isnull().sum())
Missing values per column:
product 1
price 1
quantity 1
dtype: int64
Filling Missing Values
# Fill missing values
filled = data.copy()
filled['product'] = filled['product'].fillna('Unknown')
filled['price'] = filled['price'].fillna(data['price'].mean())
filled['quantity'] = filled['quantity'].fillna(0)
print(filled)
product price quantity
0 Widget A 10.50 100.0
1 Widget B 12.75 200.0
2 Unknown 15.00 150.0
3 Widget C 12.75 0.0
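If you prefer to discard incomplete rows instead, dropna() removes any row with a missing value:
# Drop rows containing any missing value
print(data.dropna())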
Practice: Handling Missing Data
Exercise 1 (with Gemini): Ask Gemini to “check for missing values in a DataFrame and fill them with the column mean”
Exercise 2 (on your own): Type data.isnull().sum() and run it to count missing values per column.
Merging DataFrames
Combine data from multiple sources:
employees_info = pd.DataFrame({
'id': [101, 102, 103],
'name': ['Alice', 'Bob', 'Charlie']
})
salaries = pd.DataFrame({
'id': [101, 102, 104],
'salary': [50000, 80000, 60000]
})
# Inner merge - only matching records
inner = pd.merge(employees_info, salaries, on='id')
print("Inner merge:")
print(inner)
Inner merge:
id name salary
0 101 Alice 50000
1 102 Bob 80000
Types of Merges
- Inner: Only matching records from both DataFrames
- Left: All records from left, matching from right
- Right: All records from right, matching from left
- Outer: All records from both DataFrames (see the example below)
# Left merge - all employees
left = pd.merge(employees_info, salaries, on='id', how='left')
print("Left merge:")
print(left)
Left merge:
id name salary
0 101 Alice 50000.0
1 102 Bob 80000.0
2 103 Charlie NaN
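An outer merge keeps every record from both DataFrames, filling the gaps with NaN:
# Outer merge - all records from both DataFrames
outer = pd.merge(employees_info, salaries, on='id', how='outer')
print("Outer merge:")
print(outer)
Outer merge:
id name salary
0 101 Alice 50000.0
1 102 Bob 80000.0
2 103 Charlie NaN
3 104 NaN 60000.0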
Practice: Merging DataFrames
Exercise 1 (with Gemini): Ask Gemini to “merge two DataFrames on a common ‘id’ column using an inner join”
Exercise 2 (on your own): Type pd.merge(employees_info, salaries, on='id') and run it to perform an inner merge.
Concatenating DataFrames
Stack DataFrames that share the same structure:
q1 = pd.DataFrame({
'product': ['Widget A', 'Widget B'],
'sales': [100, 150]
})
q2 = pd.DataFrame({
'product': ['Widget A', 'Widget B'],
'sales': [120, 140]
})
all_sales = pd.concat([q1, q2])
print(all_sales)
product sales
0 Widget A 100
1 Widget B 150
0 Widget A 120
1 Widget B 140
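Notice that the original row labels repeat (0, 1, 0, 1). Pass ignore_index=True to build a fresh index:
all_sales = pd.concat([q1, q2], ignore_index=True)
print(all_sales)
product sales
0 Widget A 100
1 Widget B 150
2 Widget A 120
3 Widget B 140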
Reading and Writing Files
# Read Excel file
df = pd.read_excel('data.xlsx')
# Read CSV file
df = pd.read_csv('data.csv')
# Write to Excel
df.to_excel('output.xlsx', index=False)
# Write to CSV
df.to_csv('output.csv', index=False)
Common formats: Excel (.xlsx), CSV (.csv), JSON (.json)
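JSON works the same way; a minimal sketch (the shape of the JSON read or written depends on the orient argument):
# Read JSON file
df = pd.read_json('data.json')
# Write to JSON
df.to_json('output.json')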
String Processing
Clean and transform text data:
customer_data = pd.DataFrame({
'name': [' John Smith ', 'JANE DOE', 'bob jones'],
'email': ['john@email.com', 'JANE@GMAIL.COM', 'bob@yahoo.com']
})
# Clean names
customer_data['name_clean'] = customer_data['name'].str.strip().str.title()
# Lowercase emails
customer_data['email_clean'] = customer_data['email'].str.lower()
print(customer_data[['name', 'name_clean', 'email_clean']])
name name_clean email_clean
0 John Smith John Smith john@email.com
1 JANE DOE Jane Doe jane@gmail.com
2 bob jones Bob Jones bob@yahoo.com
Common String Methods
Useful .str methods:
- .str.lower(), .str.upper() - Change case
- .str.strip() - Remove whitespace
- .str.contains() - Check for substring
- .str.split() - Split strings
- .str.replace() - Replace text
# Extract email domain
customer_data['domain'] = customer_data['email_clean'].str.split('@').str[1]
print(customer_data[['email_clean', 'domain']])
email_clean domain
0 john@email.com email.com
1 jane@gmail.com gmail.com
2 bob@yahoo.com yahoo.com
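.str.contains() returns a boolean Series, useful for flagging or filtering rows; for example:
# Flag Gmail addresses
customer_data['is_gmail'] = customer_data['email_clean'].str.contains('gmail')
print(customer_data[['email_clean', 'is_gmail']])
email_clean is_gmail
0 john@email.com False
1 jane@gmail.com True
2 bob@yahoo.com False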
Practice: String Processing
Exercise 1 (with Gemini): Ask Gemini to “convert a column of names to lowercase and remove leading/trailing spaces”
Exercise 2 (on your own): Type customer_data['name'].str.upper() and run it to convert names to uppercase.
Summary
Key Operations:
- Sort: sort_values()
- Group: groupby() with aggregation functions
- Pivot: pivot_table() for multi-dimensional summaries
- Missing Data: isnull(), fillna(), dropna()
- Merge: pd.merge() with different join types
- Concatenate: pd.concat() for stacking data
- String Operations: .str accessor for text processing
These tools form the foundation of data cleaning and analysis!