Part 9: More Pandas

Kerry Back, Rice University

Sorting Data

import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
    'salary': [50000, 80000, 55000, 60000, 85000]
})

# Sort by salary (descending)
print(employees.sort_values('salary', ascending=False))
      name   department  salary
4      Eve  Engineering   85000
1      Bob  Engineering   80000
3    Diana    Marketing   60000
2  Charlie        Sales   55000
0    Alice        Sales   50000
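sort_values also accepts a list of columns, sorting by each in turn. A small sketch (the employees DataFrame is redefined here so the snippet runs on its own):

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
    'salary': [50000, 80000, 55000, 60000, 85000]
})

# Sort by department (A-Z), then by salary within each department (high to low)
by_dept = employees.sort_values(['department', 'salary'],
                                ascending=[True, False])
print(by_dept)
```

The ascending argument takes one flag per column, so each sort key can go in its own direction.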

Practice: Sorting Data

Exercise 1 (with Gemini): Ask Gemini to “sort a DataFrame by the ‘name’ column in alphabetical order”

Exercise 2 (on your own): Type employees.sort_values('salary') and run it to sort by salary ascending.

Grouping and Aggregation

Group data by categories and compute statistics:

# Average salary by department
dept_avg = employees.groupby('department')['salary'].mean()
print(dept_avg)
department
Engineering    82500.0
Marketing      60000.0
Sales          52500.0
Name: salary, dtype: float64

Common aggregation functions: mean(), median(), min(), max(), std(), count()
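Several of these can be computed in one pass with .agg, which accepts a list of function names. A sketch (employees redefined so the snippet is self-contained):

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
    'salary': [50000, 80000, 55000, 60000, 85000]
})

# Several statistics per department at once; the result is a DataFrame
# with one column per aggregation function
stats = employees.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```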

Practice: Grouping Data

Exercise 1 (with Gemini): Ask Gemini to “group a DataFrame by department and calculate the maximum salary in each department”

Exercise 2 (on your own): Type employees.groupby('department')['salary'].count() and run it to count employees per department.

Pivot Tables

Reshape data to summarize by multiple dimensions:

import random
random.seed(42)

sales = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Widget C'] * 6,
    'region': ['North'] * 9 + ['South'] * 9,
    'revenue': [random.randint(100, 500) for _ in range(18)]
})

pivot = sales.pivot_table(values='revenue',
                          index='product',
                          columns='region',
                          aggfunc='sum')
print(pivot)
region    North  South
product               
Widget A   1120    847
Widget B    568    706
Widget C    814    996
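pivot_table can also append row and column totals via margins=True. A sketch with a small made-up dataset (the numbers here are for illustration only, chosen so the totals are easy to check by hand):

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['Widget A', 'Widget A', 'Widget B', 'Widget B'],
    'region':  ['North', 'South', 'North', 'South'],
    'revenue': [100, 200, 300, 400]
})

# margins=True adds an 'All' row and column containing totals
pivot = sales.pivot_table(values='revenue', index='product',
                          columns='region', aggfunc='sum', margins=True)
print(pivot)
```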

Handling Missing Data

import numpy as np

data = pd.DataFrame({
    'product': ['Widget A', 'Widget B', None, 'Widget C'],
    'price': [10.5, np.nan, 15.0, 12.75],
    'quantity': [100, 200, 150, np.nan]
})

print("Missing values per column:")
print(data.isnull().sum())
Missing values per column:
product     1
price       1
quantity    1
dtype: int64
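An alternative to filling is dropping rows that contain missing values with dropna (mentioned in the summary below). A sketch using the same data, redefined so it runs standalone:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'product': ['Widget A', 'Widget B', None, 'Widget C'],
    'price': [10.5, np.nan, 15.0, 12.75],
    'quantity': [100, 200, 150, np.nan]
})

# Drop any row containing at least one missing value
complete = data.dropna()
print(complete)          # only the 'Widget A' row survives

# Drop only rows missing a specific column
has_price = data.dropna(subset=['price'])
print(len(has_price))    # 3 rows have a price
```

Dropping is simplest but discards data; filling keeps every row at the cost of inventing values, so the right choice depends on the analysis.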

Filling Missing Values

# Fill missing values
filled = data.copy()
filled['product'] = filled['product'].fillna('Unknown')
filled['price'] = filled['price'].fillna(data['price'].mean())
filled['quantity'] = filled['quantity'].fillna(0)

print(filled)
    product  price  quantity
0  Widget A  10.50     100.0
1  Widget B  12.75     200.0
2   Unknown  15.00     150.0
3  Widget C  12.75       0.0

Practice: Handling Missing Data

Exercise 1 (with Gemini): Ask Gemini to “check for missing values in a DataFrame and fill them with the column mean”

Exercise 2 (on your own): Type data.isnull().sum() and run it to count missing values per column.

Merging DataFrames

Combine data from multiple sources:

employees_info = pd.DataFrame({
    'id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

salaries = pd.DataFrame({
    'id': [101, 102, 104],
    'salary': [50000, 80000, 60000]
})

# Inner merge - only matching records
inner = pd.merge(employees_info, salaries, on='id')
print("Inner merge:")
print(inner)
Inner merge:
    id   name  salary
0  101  Alice   50000
1  102    Bob   80000

Types of Merges

  • Inner: Only matching records from both DataFrames
  • Left: All records from left, matching from right
  • Right: All records from right, matching from left
  • Outer: All records from both DataFrames

# Left merge - all employees
left = pd.merge(employees_info, salaries, on='id', how='left')
print("Left merge:")
print(left)
Left merge:
    id     name   salary
0  101    Alice  50000.0
1  102      Bob  80000.0
2  103  Charlie      NaN
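An outer merge keeps all ids from both sides, filling the gaps with NaN. A sketch (both DataFrames redefined so the snippet runs on its own):

```python
import pandas as pd

employees_info = pd.DataFrame({
    'id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

salaries = pd.DataFrame({
    'id': [101, 102, 104],
    'salary': [50000, 80000, 60000]
})

# Outer merge - all ids from both DataFrames; Charlie (103) gets NaN salary,
# and id 104 gets NaN name
outer = pd.merge(employees_info, salaries, on='id', how='outer')
print(outer)
```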

Practice: Merging DataFrames

Exercise 1 (with Gemini): Ask Gemini to “merge two DataFrames on a common ‘id’ column using an inner join”

Exercise 2 (on your own): Type pd.merge(employees_info, salaries, on='id') and run it to perform an inner merge.

Concatenating DataFrames

Stack DataFrames that share the same structure:

q1 = pd.DataFrame({
    'product': ['Widget A', 'Widget B'],
    'sales': [100, 150]
})

q2 = pd.DataFrame({
    'product': ['Widget A', 'Widget B'],
    'sales': [120, 140]
})

all_sales = pd.concat([q1, q2])
print(all_sales)
    product  sales
0  Widget A    100
1  Widget B    150
0  Widget A    120
1  Widget B    140
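Note the repeated index labels (0, 1, 0, 1) above: concat keeps each piece's original index. Passing ignore_index=True renumbers the stacked rows. A sketch:

```python
import pandas as pd

q1 = pd.DataFrame({'product': ['Widget A', 'Widget B'], 'sales': [100, 150]})
q2 = pd.DataFrame({'product': ['Widget A', 'Widget B'], 'sales': [120, 140]})

# ignore_index=True builds a fresh 0..n-1 index for the combined frame
all_sales = pd.concat([q1, q2], ignore_index=True)
print(all_sales)
```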

Reading and Writing Files

# Read Excel file
df = pd.read_excel('data.xlsx')

# Read CSV file
df = pd.read_csv('data.csv')

# Write to Excel
df.to_excel('output.xlsx', index=False)

# Write to CSV
df.to_csv('output.csv', index=False)

Common formats: Excel (.xlsx), CSV (.csv), JSON (.json)
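A quick CSV round trip, writing to a temporary file so the sketch is self-contained (the filename is made up for illustration):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'product': ['Widget A', 'Widget B'], 'sales': [100, 150]})

# index=False keeps the row index out of the file, so reading it back
# reproduces the original DataFrame
path = os.path.join(tempfile.gettempdir(), 'sales_roundtrip.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path)
print(df2.equals(df))  # True: the data survives the round trip
```

Without index=False, the index would be written as an extra unnamed column and appear as new data when read back.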

String Processing

Clean and transform text data:

customer_data = pd.DataFrame({
    'name': ['  John Smith  ', 'JANE DOE', 'bob jones'],
    'email': ['john@email.com', 'JANE@GMAIL.COM', 'bob@yahoo.com']
})

# Clean names
customer_data['name_clean'] = customer_data['name'].str.strip().str.title()

# Lowercase emails
customer_data['email_clean'] = customer_data['email'].str.lower()

print(customer_data[['name', 'name_clean', 'email_clean']])
             name  name_clean     email_clean
0    John Smith    John Smith  john@email.com
1        JANE DOE    Jane Doe  jane@gmail.com
2       bob jones   Bob Jones   bob@yahoo.com

Common String Methods

Useful .str methods:

  • .str.lower(), .str.upper() - Change case
  • .str.strip() - Remove whitespace
  • .str.contains() - Check for substring
  • .str.split() - Split strings
  • .str.replace() - Replace text

# Extract email domain
customer_data['domain'] = customer_data['email_clean'].str.split('@').str[1]
print(customer_data[['email_clean', 'domain']])
      email_clean     domain
0  john@email.com  email.com
1  jane@gmail.com  gmail.com
2   bob@yahoo.com  yahoo.com
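.str.contains is handy for filtering rows by substring. A sketch using the cleaned emails above (the column is redefined here so the snippet runs standalone):

```python
import pandas as pd

customer_data = pd.DataFrame({
    'email_clean': ['john@email.com', 'jane@gmail.com', 'bob@yahoo.com']
})

# .str.contains returns a boolean Series, which filters the rows
gmail_users = customer_data[customer_data['email_clean'].str.contains('gmail')]
print(gmail_users)
```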

Practice: String Processing

Exercise 1 (with Gemini): Ask Gemini to “convert a column of names to lowercase and remove leading/trailing spaces”

Exercise 2 (on your own): Type customer_data['name'].str.upper() and run it to convert names to uppercase.

Summary

Key Operations:

  • Sort: sort_values()
  • Group: groupby() with aggregation functions
  • Pivot: pivot_table() for multi-dimensional summaries
  • Missing Data: isnull(), fillna(), dropna()
  • Merge: pd.merge() with different join types
  • Concatenate: pd.concat() for stacking data
  • String Operations: .str accessor for text processing

These tools form the foundation of data cleaning and analysis!