Part 8: Intro to Pandas

Kerry Back, Rice University

What is Pandas?

Pandas is Python’s primary library for data analysis and manipulation.

Key features:

  • Works with tabular data (like Excel spreadsheets)
  • Two main structures: Series (1D) and DataFrame (2D)
  • Powerful tools for filtering, grouping, and transforming data
  • Integrates with other data science libraries such as NumPy and Matplotlib

import pandas as pd  # pd is the conventional alias

Creating a DataFrame

Create a DataFrame from a dictionary: each key becomes a column name, and each list supplies that column's values (all lists must have the same length):

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")  # (rows, columns)
      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Tokyo   90000
3    Diana   28     Paris   75000

Shape: (4, 4)
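
A dictionary of lists is not the only possible input. As a minimal sketch (the names records and df_rows are illustrative and not part of the slides), the same kind of table can be built from a list of row dictionaries, where the keys again become column names:

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York', 'Salary': 70000},
    {'Name': 'Bob', 'Age': 30, 'City': 'London', 'Salary': 80000},
]
df_rows = pd.DataFrame(records)

print(list(df_rows.columns))  # column names taken from the dictionary keys
print(df_rows.shape)          # (rows, columns)
print(df_rows.dtypes)         # pandas infers a type for each column
```

Which form is more convenient usually depends on whether the data arrives column by column or row by row.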

Practice: Creating DataFrames

Exercise 1 (with Gemini): Ask Gemini to “create a DataFrame with three columns: ‘product’, ‘price’, and ‘quantity’ with at least 3 products”

Exercise 2 (on your own): Type data = {'a': [1, 2, 3], 'b': [4, 5, 6]} then df = pd.DataFrame(data) then print(df) and run it.

Understanding Series

A Series is a one-dimensional labeled array - essentially one column of data.

# Create a Series with custom index
products = pd.Series(
    [29.99, 999.99, 79.99, 299.99],
    index=['Mouse', 'Laptop', 'Keyboard', 'Monitor']
)
print(products)
print(f"\nType: {type(products)}")
Mouse        29.99
Laptop      999.99
Keyboard     79.99
Monitor     299.99
dtype: float64

Type: <class 'pandas.core.series.Series'>

Key point: Each DataFrame column is a Series!
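
To check the key point directly, the sketch below pulls one column out of a small DataFrame (df_demo is an illustrative name, not from the slides) and inspects what comes back:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

ages = df_demo['Age']                    # pull out one column
print(type(ages))                        # a pandas Series
print(ages.name)                         # the Series keeps the column name
print(ages.index.equals(df_demo.index))  # and shares the DataFrame's row labels
```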

Accessing Series Elements

Series support both label-based and position-based indexing:

# Label-based indexing
print("By label - Laptop:", products['Laptop'])

# Position-based with .iloc (like lists)
print("By position - first item:", products.iloc[0])
print("Last item:", products.iloc[-1])

# Multiple elements
print("\nFirst 3 items:")
print(products.iloc[:3])
By label - Laptop: 999.99
By position - first item: 29.99
Last item: 299.99

First 3 items:
Mouse        29.99
Laptop      999.99
Keyboard     79.99
dtype: float64
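
One difference between the two styles is easy to miss: position slices with .iloc stop before the end position (like lists), while label slices with .loc include the end label. A short sketch using the same products Series:

```python
import pandas as pd

products = pd.Series(
    [29.99, 999.99, 79.99, 299.99],
    index=['Mouse', 'Laptop', 'Keyboard', 'Monitor']
)

by_position = products.iloc[0:2]             # positions 0 and 1; stops before 2
by_label = products.loc['Mouse':'Keyboard']  # includes the 'Keyboard' entry

print(by_position)
print(by_label)
```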

Series Summary Statistics

Series have built-in methods for common statistics:

print(f"Average price: ${products.mean():.2f}")
print(f"Most expensive: ${products.max()}")
print(f"Cheapest: ${products.min()}")
print(f"Standard deviation: ${products.std():.2f}")
Average price: $352.49
Most expensive: $999.99
Cheapest: $29.99
Standard deviation: $447.32
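
Note that .std() computes the sample standard deviation by default (it divides by n - 1). The sketch below compares it with the population version (ddof=0) and shows .describe(), which returns several of these statistics in one call:

```python
import pandas as pd

products = pd.Series(
    [29.99, 999.99, 79.99, 299.99],
    index=['Mouse', 'Laptop', 'Keyboard', 'Monitor']
)

sample_std = products.std()            # default ddof=1: divide by n - 1
population_std = products.std(ddof=0)  # divide by n instead

print(f"Sample std: {sample_std:.2f}")
print(f"Population std: {population_std:.2f}")

summary = products.describe()  # count, mean, std, min, quartiles, max
print(summary)
```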

Practice: Series Operations

Exercise 1 (with Gemini): Ask Gemini to “create a Series with 5 numbers and calculate the mean and standard deviation”

Exercise 2 (on your own): Type s = pd.Series([10, 20, 30, 40]) then print(s.mean()) and run it.

Creating a Larger Dataset

Let’s create a more realistic dataset with employee information:

# Create a larger DataFrame for analysis
employees_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 28, 32, 45, 29, 38],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Marketing', 'Engineering', 'HR', 'Marketing'],
    'Salary': [70000, 65000, 80000, 60000, 72000, 90000, 62000, 75000],
    'Years': [2, 5, 8, 3, 6, 12, 4, 9],
    'Education': [16, 14, 18, 16, 14, 20, 16, 16]
}

df = pd.DataFrame(employees_data)
df.head()
      Name  Age   Department  Salary  Years  Education
0    Alice   25  Engineering   70000      2         16
1      Bob   30    Marketing   65000      5         14
2  Charlie   35  Engineering   80000      8         18
3    Diana   28           HR   60000      3         16
4      Eve   32    Marketing   72000      6         14

DataFrame Attributes

Key attributes for a first look at your data are .shape (rows, columns) and .columns (column names):

print(f"Shape: {df.shape}")
print(f"Column names: {list(df.columns[:5])}...")  # First 5
print(f"\nFirst 3 rows:")
print(df.head(3))
Shape: (8, 6)
Column names: ['Name', 'Age', 'Department', 'Salary', 'Years']...

First 3 rows:
      Name  Age   Department  Salary  Years  Education
0    Alice   25  Engineering   70000      2         16
1      Bob   30    Marketing   65000      5         14
2  Charlie   35  Engineering   80000      8         18

Useful methods:

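  • .head(n) - first n rows (default 5)
  • .tail(n) - last n rows (default 5)
  • .info() - column names, data types, non-null counts, and memory usage
  • .describe() - summary statistics (count, mean, std, min, quartiles, max) for numeric columns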
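
A quick sketch of these methods on a small table (df_demo is an illustrative name; the values are a subset of the employee data):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [70000, 65000, 80000, 60000],
})

last_two = df_demo.tail(2)    # last 2 rows
print(last_two)

df_demo.info()                # prints its report directly; returns None

summary = df_demo.describe()  # numeric columns only by default
print(summary)
```

Because 'Name' holds text, it is left out of the .describe() table here.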

Selecting Columns

Select one or more columns from a DataFrame:

# Single column (returns Series)
salaries = df['Salary']
print(f"Salaries - type: {type(salaries)}")
print(salaries.head())

# Multiple columns (returns DataFrame)
subset = df[['Name', 'Salary', 'Department']]
print(f"\nSubset shape: {subset.shape}")
print(subset.head(3))
Salaries - type: <class 'pandas.core.series.Series'>
0    70000
1    65000
2    80000
3    60000
4    72000
Name: Salary, dtype: int64

Subset shape: (8, 3)
      Name  Salary   Department
0    Alice   70000  Engineering
1      Bob   65000    Marketing
2  Charlie   80000  Engineering
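
The number of brackets matters even for a single column. As a small sketch (df_demo is an illustrative name), single brackets return a Series, while a list containing one name returns a one-column DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 65000, 80000],
})

as_series = df_demo['Salary']   # single brackets: Series
as_frame = df_demo[['Salary']]  # list with one name: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```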

Practice: Selecting Columns

Exercise 1 (with Gemini): Ask Gemini to “select the ‘Name’ and ‘Age’ columns from a DataFrame”

Exercise 2 (on your own): Type df['Name'] and run it to select a single column.

Selecting Rows

Use .iloc for position-based and .loc for label-based row selection. With the default index (0, 1, 2, …) labels and positions coincide; the example uses .iloc:

# First row
print("First row:")
print(df.iloc[0])

# Multiple rows by position
print("\nFirst 3 rows:")
print(df.iloc[0:3])
First row:
Name                Alice
Age                    25
Department    Engineering
Salary              70000
Years                   2
Education              16
Name: 0, dtype: object

First 3 rows:
      Name  Age   Department  Salary  Years  Education
0    Alice   25  Engineering   70000      2         16
1      Bob   30    Marketing   65000      5         14
2  Charlie   35  Engineering   80000      8         18
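
The difference between .loc and .iloc only shows up once labels and positions stop matching, for example after dropping rows. A minimal sketch, assuming a small illustrative table named df_demo:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Salary': [70000, 65000, 80000, 60000],
})

# Drop the first row; the remaining rows keep their original labels 1, 2, 3
later_rows = df_demo.iloc[1:]

by_position = later_rows.iloc[1]['Name']  # second row of later_rows
by_label = later_rows.loc[1, 'Name']      # row whose label is 1

print("Position 1:", by_position)
print("Label 1:", by_label)

# Label slices with .loc include both endpoints
print(later_rows.loc[1:2])
```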

Filtering Data

Filter rows based on conditions using boolean indexing:

# Filter for high earners
high_earners = df[df['Salary'] > 70000]
print(f"People earning >$70,000: {len(high_earners)}")
print(high_earners)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
eng_high = df[(df['Salary'] > 70000) & (df['Department'] == 'Engineering')]
print(f"\nEngineering high earners:")
print(eng_high[['Name', 'Salary', 'Department']])
People earning >$70,000: 4
      Name  Age   Department  Salary  Years  Education
2  Charlie   35  Engineering   80000      8         18
4      Eve   32    Marketing   72000      6         14
5    Frank   45  Engineering   90000     12         20
7    Henry   38    Marketing   75000      9         16

Engineering high earners:
      Name  Salary   Department
2  Charlie   80000  Engineering
5    Frank   90000  Engineering
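
The example above only uses &. The sketch below shows the other common ways to combine or build conditions: | (or), ~ (not), and .isin() for matching any value in a list. The table df_demo is illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Diana', 'Frank'],
    'Department': ['Engineering', 'Marketing', 'HR', 'Engineering'],
    'Salary': [70000, 65000, 60000, 90000],
})

# OR: a row is kept if either condition holds
hr_or_high = df_demo[(df_demo['Department'] == 'HR') | (df_demo['Salary'] > 80000)]

# NOT: ~ inverts a boolean condition
not_eng = df_demo[~(df_demo['Department'] == 'Engineering')]

# isin: keep rows whose value matches any item in the list
hr_mkt = df_demo[df_demo['Department'].isin(['HR', 'Marketing'])]

print(hr_or_high)
print(not_eng)
print(hr_mkt)
```

Python's and, or, and not keywords do not work element by element on a Series, which is why the symbols are used instead.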

Practice: Filtering Data

Exercise 1 (with Gemini): Ask Gemini to “filter a DataFrame to show only rows where Age is greater than 30”

Exercise 2 (on your own): Type df[df['Salary'] > 65000] and run it to filter rows.

Adding and Modifying Columns

Create new columns or modify existing ones:

# Create new calculated column (10% bonus)
df['Bonus'] = df['Salary'] * 0.10

print(df[['Name', 'Salary', 'Bonus']].head())

# Create experience category; bins are (0, 5], (5, 10], (10, 20], so exactly 5 years counts as Junior
df['Experience'] = pd.cut(
    df['Years'],
    bins=[0, 5, 10, 20],
    labels=['Junior', 'Mid-level', 'Senior']
)

print(f"\nExperience levels:")
print(df['Experience'].value_counts())
      Name  Salary   Bonus
0    Alice   70000  7000.0
1      Bob   65000  6500.0
2  Charlie   80000  8000.0
3    Diana   60000  6000.0
4      Eve   72000  7200.0

Experience levels:
Experience
Junior       4
Mid-level    3
Senior       1
Name: count, dtype: int64
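
By default pd.cut treats each bin as open on the left and closed on the right, which is why someone with exactly 5 years lands in Junior. The sketch below compares that default with right=False on a few illustrative values (the years Series here is made up for the comparison):

```python
import pandas as pd

years = pd.Series([2, 5, 8, 10, 12])
labels = ['Junior', 'Mid-level', 'Senior']
bins = [0, 5, 10, 20]

right_closed = pd.cut(years, bins=bins, labels=labels)              # (0, 5], (5, 10], (10, 20]
left_closed = pd.cut(years, bins=bins, labels=labels, right=False)  # [0, 5), [5, 10), [10, 20)

comparison = pd.DataFrame({
    'Years': years,
    'right=True': right_closed,
    'right=False': left_closed,
})
print(comparison)
```

With the default setting, a value of exactly 0 falls outside the first bin and becomes NaN, so it is worth checking the edges of the data before choosing bins.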

Practice: Adding Columns

Exercise 1 (with Gemini): Ask Gemini to “add a new column to a DataFrame that calculates 15% tax on a Salary column”

Exercise 2 (on your own): Type df['Double_Salary'] = df['Salary'] * 2 then print(df[['Name', 'Salary', 'Double_Salary']].head()) and run it.

Working with Indexes

The index provides row labels (default is 0, 1, 2, …):

# Create small example
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 80000, 90000],
    'Dept': ['Eng', 'Mkt', 'Eng']
})

# Set Name as index
indexed = employees.set_index('Name')
print(indexed)

# Access by index label
print(f"\nBob's data:\n{indexed.loc['Bob']}")
         Salary Dept
Name                
Alice     70000  Eng
Bob       80000  Mkt
Charlie   90000  Eng

Bob's data:
Salary    80000
Dept        Mkt
Name: Bob, dtype: object
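
Two follow-ups that often come next, shown as a sketch on the same small table: .loc can take a row label and a column name together to get a single value, and .reset_index() turns the index back into an ordinary column.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 80000, 90000],
    'Dept': ['Eng', 'Mkt', 'Eng']
})

indexed = employees.set_index('Name')  # returns a new DataFrame; employees is unchanged

bob_salary = indexed.loc['Bob', 'Salary']  # row label, then column name
print("Bob's salary:", bob_salary)

restored = indexed.reset_index()  # 'Name' becomes a column again; index is 0, 1, 2
print(restored)
```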

Summary

What we learned:

  • DataFrames are 2D tables, Series are 1D columns
  • Create DataFrames from dictionaries or read from files
  • Select columns with df['col'] or df[['col1', 'col2']]
  • Select rows with .iloc[position] or .loc[label]
  • Filter using boolean conditions: df[df['col'] > value]
  • Add columns by assignment: df['new_col'] = values
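
Reading from files was mentioned but not demonstrated in this part. As a minimal sketch, pd.read_csv loads comma-separated data; here the "file" is an in-memory string (csv_text, an illustrative stand-in) so the example runs without any file on disk. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file
csv_text = """Name,Age,Salary
Alice,25,70000
Bob,30,65000
"""

df_csv = pd.read_csv(io.StringIO(csv_text))  # first line is used as column names

print(df_csv)
print(df_csv.dtypes)  # numeric columns are parsed as numbers, not text
```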

Pandas is essential for data analysis in Python!