import pandas as pd

Pandas is Python’s primary library for data analysis and manipulation.
Key features:
Create DataFrames from dictionaries, where keys become column names:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")  # (rows, columns)

      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Tokyo   90000
3    Diana   28     Paris   75000

Shape: (4, 4)
Exercise 1 (with Gemini): Ask Gemini to “create a DataFrame with three columns: ‘product’, ‘price’, and ‘quantity’ with at least 3 products”
Exercise 2 (on your own): Type data = {'a': [1, 2, 3], 'b': [4, 5, 6]} then df = pd.DataFrame(data) then print(df) and run it.
A Series is a one-dimensional labeled array - essentially one column of data.
# Create a Series with custom index
products = pd.Series(
    [29.99, 999.99, 79.99, 299.99],
    index=['Mouse', 'Laptop', 'Keyboard', 'Monitor']
)
print(products)
print(f"\nType: {type(products)}")

Mouse        29.99
Laptop      999.99
Keyboard     79.99
Monitor     299.99
dtype: float64

Type: <class 'pandas.core.series.Series'>
Key point: Each DataFrame column is a Series!
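A quick way to check this point is to select one column and inspect its type. The sketch below uses a small illustrative DataFrame named `small`; the name and its values are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical two-column DataFrame, used only for illustration
small = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

age_column = small['Age']                  # select a single column
print(type(age_column))                    # the column comes back as a Series
print(age_column.name)                     # the Series keeps the column name
print(isinstance(age_column, pd.Series))
```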
Series support both label-based and position-based indexing:
# Label-based indexing
print("By label - Laptop:", products['Laptop'])
# Position-based with .iloc (like lists)
print("By position - first item:", products.iloc[0])
print("Last item:", products.iloc[-1])
# Multiple elements
print("\nFirst 3 items:")
print(products.iloc[:3])

By label - Laptop: 999.99
By position - first item: 29.99
Last item: 299.99

First 3 items:
Mouse        29.99
Laptop      999.99
Keyboard     79.99
dtype: float64
Series have built-in methods for common statistics:
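The code for this step is missing here, so the following is a minimal sketch that reuses the `products` prices from above. The methods called (`mean`, `median`, `min`, `max`, `idxmax`, `std`) are standard Series methods; which ones to show is a choice made for this example.

```python
import pandas as pd

# Same prices as the products Series shown earlier
products = pd.Series(
    [29.99, 999.99, 79.99, 299.99],
    index=['Mouse', 'Laptop', 'Keyboard', 'Monitor']
)

print(f"Mean price: {products.mean():.2f}")
print(f"Median price: {products.median():.2f}")
print(f"Cheapest: {products.min()}")
# idxmax returns the index label of the largest value
print(f"Most expensive: {products.max()} ({products.idxmax()})")
# std computes the sample standard deviation by default (ddof=1)
print(f"Standard deviation: {products.std():.2f}")
```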
Exercise 1 (with Gemini): Ask Gemini to “create a Series with 5 numbers and calculate the mean and standard deviation”
Exercise 2 (on your own): Type s = pd.Series([10, 20, 30, 40]) then print(s.mean()) and run it.
Let’s create a more realistic dataset with employee information:
# Create a larger DataFrame for analysis
employees_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 28, 32, 45, 29, 38],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Marketing', 'Engineering', 'HR', 'Marketing'],
    'Salary': [70000, 65000, 80000, 60000, 72000, 90000, 62000, 75000],
    'Years': [2, 5, 8, 3, 6, 12, 4, 9],
    'Education': [16, 14, 18, 16, 14, 20, 16, 16]
}
df = pd.DataFrame(employees_data)
df.head()

| | Name | Age | Department | Salary | Years | Education |
|---|---|---|---|---|---|---|
| 0 | Alice | 25 | Engineering | 70000 | 2 | 16 |
| 1 | Bob | 30 | Marketing | 65000 | 5 | 14 |
| 2 | Charlie | 35 | Engineering | 80000 | 8 | 18 |
| 3 | Diana | 28 | HR | 60000 | 3 | 16 |
| 4 | Eve | 32 | Marketing | 72000 | 6 | 14 |
Essential attributes to understand your data:
print(f"Shape: {df.shape}")
print(f"Column names: {list(df.columns[:5])}...") # First 5
print(f"\nFirst 3 rows:")
print(df.head(3))

Shape: (8, 6)
Column names: ['Name', 'Age', 'Department', 'Salary', 'Years']...

First 3 rows:
      Name  Age   Department  Salary  Years  Education
0    Alice   25  Engineering   70000      2         16
1      Bob   30    Marketing   65000      5         14
2  Charlie   35  Engineering   80000      8         18
Useful methods:
.head(n) - first n rows
.tail(n) - last n rows
.info() - comprehensive overview
.describe() - statistical summary

Select one or more columns from a DataFrame:
# Single column (returns Series)
salaries = df['Salary']
print(f"Salaries - type: {type(salaries)}")
print(salaries.head())
# Multiple columns (returns DataFrame)
subset = df[['Name', 'Salary', 'Department']]
print(f"\nSubset shape: {subset.shape}")
print(subset.head(3))

Salaries - type: <class 'pandas.core.series.Series'>
0    70000
1    65000
2    80000
3    60000
4    72000
Name: Salary, dtype: int64

Subset shape: (8, 3)
      Name  Salary   Department
0    Alice   70000  Engineering
1      Bob   65000    Marketing
2  Charlie   80000  Engineering
Exercise 1 (with Gemini): Ask Gemini to “select the ‘Name’ and ‘Age’ columns from a DataFrame”
Exercise 2 (on your own): Type df['Name'] and run it to select a single column.
Use .loc for label-based or .iloc for position-based row selection:
# First row
print("First row:")
print(df.iloc[0])
# Multiple rows by position
print("\nFirst 3 rows:")
print(df.iloc[0:3])

First row:
Name                Alice
Age                    25
Department    Engineering
Salary              70000
Years                   2
Education              16
Name: 0, dtype: object

First 3 rows:
      Name  Age   Department  Salary  Years  Education
0    Alice   25  Engineering   70000      2         16
1      Bob   30    Marketing   65000      5         14
2  Charlie   35  Engineering   80000      8         18
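The example above only uses .iloc, so here is a short sketch of the label-based .loc for comparison. It uses a small stand-in DataFrame called `people` (a name and data assumed for this illustration) with the default integer index, where labels and positions look the same but slices behave differently.

```python
import pandas as pd

# Small stand-in DataFrame with the default 0, 1, 2, ... index
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Salary': [70000, 65000, 80000, 60000]
})

# .loc selects by label; with the default index the labels are integers
print(people.loc[0])

# Unlike .iloc, a .loc slice includes its end label
by_label = people.loc[0:2]       # labels 0, 1 and 2
by_position = people.iloc[0:2]   # positions 0 and 1
print(len(by_label), len(by_position))

# .loc can also take a row label and a column name together
print(people.loc[1, 'Name'])
```

The difference in slice endpoints is easy to miss when the index is the default one, so it is worth checking the length of the result when switching between the two.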
Filter rows based on conditions using boolean indexing:
# Filter for high earners
high_earners = df[df['Salary'] > 70000]
print(f"People earning >$70,000: {len(high_earners)}")
print(high_earners)
# Multiple conditions with & (and) or | (or)
eng_high = df[(df['Salary'] > 70000) & (df['Department'] == 'Engineering')]
print(f"\nEngineering high earners:")
print(eng_high[['Name', 'Salary', 'Department']])

People earning >$70,000: 4
      Name  Age   Department  Salary  Years  Education
2  Charlie   35  Engineering   80000      8         18
4      Eve   32    Marketing   72000      6         14
5    Frank   45  Engineering   90000     12         20
7    Henry   38    Marketing   75000      9         16

Engineering high earners:
      Name  Salary   Department
2  Charlie   80000  Engineering
5    Frank   90000  Engineering
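The code comment mentions | (or), but only & is demonstrated. The sketch below shows | and also ~ (not) on a small stand-in DataFrame named `staff`; the name and rows are assumptions for this example.

```python
import pandas as pd

# Small stand-in DataFrame, used only for illustration
staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR'],
    'Salary': [70000, 65000, 80000, 60000]
})

# OR: rows that are in HR, or that earn more than 75000.
# Each condition needs its own parentheses, as with &.
hr_or_high = staff[(staff['Department'] == 'HR') | (staff['Salary'] > 75000)]
print(hr_or_high)

# NOT: invert a boolean mask with ~ to get everyone outside Engineering
not_eng = staff[~(staff['Department'] == 'Engineering')]
print(not_eng)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why the operators &, |, and ~ are used here.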
Exercise 1 (with Gemini): Ask Gemini to “filter a DataFrame to show only rows where Age is greater than 30”
Exercise 2 (on your own): Type df[df['Salary'] > 65000] and run it to filter rows.
Create new columns or modify existing ones:
# Create new calculated column (10% bonus)
df['Bonus'] = df['Salary'] * 0.10
print(df[['Name', 'Salary', 'Bonus']].head())
# Create experience category
df['Experience'] = pd.cut(
    df['Years'],
    bins=[0, 5, 10, 20],
    labels=['Junior', 'Mid-level', 'Senior']
)
print(f"\nExperience levels:")
print(df['Experience'].value_counts())

      Name  Salary   Bonus
0    Alice   70000  7000.0
1      Bob   65000  6500.0
2  Charlie   80000  8000.0
3    Diana   60000  6000.0
4      Eve   72000  7200.0

Experience levels:
Experience
Junior       4
Mid-level    3
Senior       1
Name: count, dtype: int64
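To see how pd.cut decides which bin a value falls into, it can help to drop the labels and look at the intervals themselves. This is a small sketch with made-up values in a Series named `years`; it relies on pd.cut's default behavior, where each bin includes its right edge but not its left edge.

```python
import pandas as pd

# Illustrative values only, not the employee data
years = pd.Series([2, 5, 8, 12])

# Without labels, pd.cut returns the interval each value falls into
intervals = pd.cut(years, bins=[0, 5, 10, 20])
print(intervals)

# Values outside every bin become NaN: 0 is excluded because the
# first bin is open on the left, and 25 is above the last edge
outside = pd.cut(pd.Series([0, 25]), bins=[0, 5, 10, 20])
print(outside.isna().tolist())
```

With these defaults, someone with exactly 5 years lands in the first bin, and someone with 0 years would get no category unless the bins are adjusted (for example with `include_lowest=True`).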
Exercise 1 (with Gemini): Ask Gemini to “add a new column to a DataFrame that calculates 15% tax on a Salary column”
Exercise 2 (on your own): Type df['Double_Salary'] = df['Salary'] * 2 then print(df[['Name', 'Salary', 'Double_Salary']].head()) and run it.
The index provides row labels (default is 0, 1, 2, …):
# Create small example
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [70000, 80000, 90000],
    'Dept': ['Eng', 'Mkt', 'Eng']
})
# Set Name as index
indexed = employees.set_index('Name')
print(indexed)
# Access by index label
print(f"\nBob's data:\n{indexed.loc['Bob']}")

         Salary Dept
Name
Alice     70000  Eng
Bob       80000  Mkt
Charlie   90000  Eng

Bob's data:
Salary    80000
Dept        Mkt
Name: Bob, dtype: object
What we learned:
Select columns: df['col'] or df[['col1', 'col2']]
Select rows: .iloc[position] or .loc[label]
Filter rows: df[df['col'] > value]
Create columns: df['new_col'] = values

Pandas is essential for data analysis in Python!