A pairplot draws a matrix of scatter plots, one for each pair of numeric variables, with each variable's own distribution on the diagonal.
```{python}
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
url = 'https://faculty.utrgv.edu/diego.escobari/teaching/Datasets/WAGE1.xls'
wages = pd.read_excel(url, header=None)
columns = ['wage', 'educ', 'exper', 'tenure', 'nonwhite', 'female',
           'married', 'numdep', 'smsa', 'northcen', 'south', 'west',
           'construc', 'ndurman', 'trcommpu', 'trade', 'services',
           'profserv', 'profocc', 'clerocc', 'servocc', 'lwage',
           'expersq', 'tenursq']
wages.columns = columns
# Pairplot with corner=True shows only lower triangle
sns.pairplot(wages[['wage', 'educ', 'exper', 'tenure']], corner=True)
plt.show()
```

Goal: Visually explore what determines wages

Terminology:

- Target variable: wage (numeric)
- Explanatory variables: education, experience, gender, etc. (numeric or categorical)

Scatter plot: Shows how two numeric variables move together (the direction and strength of their correlation)
```{python}
plt.figure(figsize=(8, 5))
sns.scatterplot(data=wages, x='educ', y='wage', alpha=0.5)
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education')
plt.show()
```

Regression plot: Adds trend line to see direction and strength
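
To make the trend line concrete, here is a minimal sketch using `sns.regplot()`. It runs on a small synthetic table (the hypothetical `demo`, with made-up `educ` and `wage` values) rather than the WAGE1 data, so the numbers are illustrative assumptions only. Replacing `demo` with `wages` should give the corresponding plot for the real data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the wage data (made-up numbers, not WAGE1)
rng = np.random.default_rng(0)
educ = rng.integers(8, 19, size=200)                 # years of education, 8 to 18
wage = 1.0 + 0.5 * educ + rng.normal(0, 2, size=200)  # assumed upward trend plus noise
demo = pd.DataFrame({'educ': educ, 'wage': wage})

plt.figure(figsize=(8, 5))
# regplot draws the scatter plus a fitted straight line with a confidence band
ax = sns.regplot(data=demo, x='educ', y='wage',
                 scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education with Trend Line')
plt.show()
```

The slope of the line gives the direction of the relationship; how tightly the points cluster around it gives a rough sense of its strength.
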
Exercise 1 (with Gemini): Ask Gemini to “create a regression plot showing the relationship between two numeric variables”
Exercise 2 (on your own): Type `sns.regplot(x=[1, 2, 3, 4], y=[2, 4, 3, 5])` then `plt.show()` and run it.
Box plot: Shows distribution of wages within each category (preferred)
```{python}
# Create categorical variable
wages['Gender'] = wages['female'].map({0: 'Male', 1: 'Female'})
plt.figure(figsize=(8, 5))
sns.boxplot(data=wages, x='Gender', y='wage')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage Distribution by Gender')
plt.show()
```

Alternatives: Strip plot, violin plot, or bar plot with error bars
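
The three alternatives can be compared side by side. The sketch below uses a made-up two-group table (the hypothetical `demo`), not the WAGE1 data, and places the plots in one row of subplots purely for comparison.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic two-group data (made-up numbers, not WAGE1)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Gender': ['Male'] * 50 + ['Female'] * 50,
    'wage': np.concatenate([rng.normal(7, 2, 50),
                            rng.normal(5, 1.5, 50)]).clip(min=1),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

# Strip plot: every observation is drawn as a point
sns.stripplot(data=demo, x='Gender', y='wage', alpha=0.5, ax=axes[0])
axes[0].set_title('Strip plot')

# Violin plot: a smoothed density of the distribution per category
sns.violinplot(data=demo, x='Gender', y='wage', ax=axes[1])
axes[1].set_title('Violin plot')

# Bar plot: bar height is the group mean; error bars default to a 95% confidence interval
sns.barplot(data=demo, x='Gender', y='wage', ax=axes[2])
axes[2].set_title('Bar plot (mean with error bars)')

axes[0].set_ylabel('Hourly Wage ($)')
plt.tight_layout()
plt.show()
```

The strip and violin plots show the whole distribution, while the bar plot reduces each group to a mean and an interval.
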
Exercise 1 (with Gemini): Ask Gemini to “create a box plot comparing test scores across different classes”
Exercise 2 (on your own): Type `sns.boxplot(x=['A', 'A', 'B', 'B'], y=[10, 12, 15, 17])` then `plt.show()` and run it.
Grouped box plots: Use hue for a second categorical variable
```{python}
wages['Marital_Status'] = wages['married'].map({0: 'Not Married', 1: 'Married'})
plt.figure(figsize=(10, 6))
sns.boxplot(data=wages, x='Gender', y='wage', hue='Marital_Status')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage by Gender and Marital Status')
plt.show()
```

Heatmap: Matrix showing mean wage for each combination of two categorical variables
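
One way to build such a heatmap is to compute the group means with `pivot_table()` and pass the resulting matrix to `sns.heatmap()`. The sketch below uses eight made-up rows (the hypothetical `demo`) so the means are easy to check by hand; it is not the WAGE1 data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up table (not WAGE1) with two categorical variables and one numeric target
demo = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'Marital_Status': ['Married', 'Married', 'Not Married', 'Not Married',
                       'Married', 'Married', 'Not Married', 'Not Married'],
    'wage': [8.0, 10.0, 5.0, 7.0, 5.0, 6.0, 4.0, 5.0],
})

# Rows = Gender, columns = Marital_Status, cells = mean wage of that combination
mean_wage = demo.pivot_table(values='wage', index='Gender',
                             columns='Marital_Status', aggfunc='mean')

plt.figure(figsize=(6, 4))
ax = sns.heatmap(mean_wage, annot=True, fmt='.2f', cmap='Blues')
plt.title('Mean Hourly Wage ($) by Gender and Marital Status')
plt.show()
```

With the real data, the same two lines (`pivot_table` then `heatmap`) should work on `wages` once the `Gender` and `Marital_Status` columns have been created.
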
Exercise 1 (with Gemini): Ask Gemini to “create a heatmap showing average values for different category combinations”
Exercise 2 (on your own): Type `sns.heatmap([[1, 2], [3, 4]], annot=True)` then `plt.show()` and run it.
Scatter plot with hue: Distinguish categories by color
```{python}
plt.figure(figsize=(10, 6))
sns.scatterplot(data=wages, x='educ', y='wage', hue='Gender', alpha=0.6)
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education by Gender')
plt.show()
```

Alternative: Faceted scatter plots for clearer visualization with multiple categorical variables
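
A faceted version can be sketched with `sns.relplot()`, which draws one scatter panel per category combination through its `col` and `row` parameters. The data below are synthetic (the hypothetical `demo`, made-up values), so only the layout is meaningful.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data (made-up numbers, not WAGE1)
rng = np.random.default_rng(2)
n = 200
demo = pd.DataFrame({
    'educ': rng.integers(8, 19, size=n),
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Marital_Status': rng.choice(['Married', 'Not Married'], size=n),
})
demo['wage'] = 1.0 + 0.5 * demo['educ'] + rng.normal(0, 2, size=n)

# One panel per (Marital_Status, Gender) pair; axes are shared by default
g = sns.relplot(data=demo, x='educ', y='wage',
                col='Gender', row='Marital_Status',
                kind='scatter', alpha=0.6, height=4)
g.set_axis_labels('Years of Education', 'Hourly Wage ($)')
plt.show()
```

Because the panels share axes, the groups can be compared directly without overlapping colors.
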
Recommended Seaborn Functions:

- `sns.boxplot()`: Best for categorical variables
- `sns.scatterplot()`: Best for numeric variables
- `sns.regplot()`: Adds trend line to scatter plots
- `sns.heatmap()`: Good for two categorical variables
- Use `hue`, `size`, `style` parameters to add additional variables

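
As an illustration of the last point, the sketch below maps three extra variables onto a single scatter plot through `hue`, `style`, and `size`. The table is synthetic (the hypothetical `demo`, made-up values), and the choice of which variable goes to which parameter is only one reasonable assumption.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data (made-up numbers, not WAGE1)
rng = np.random.default_rng(3)
n = 150
demo = pd.DataFrame({
    'educ': rng.integers(8, 19, size=n),
    'exper': rng.integers(0, 41, size=n),
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Marital_Status': rng.choice(['Married', 'Not Married'], size=n),
})
demo['wage'] = (1.0 + 0.5 * demo['educ'] + 0.05 * demo['exper']
                + rng.normal(0, 2, size=n))

plt.figure(figsize=(10, 6))
ax = sns.scatterplot(
    data=demo, x='educ', y='wage',
    hue='Gender',            # color encodes a categorical variable
    style='Marital_Status',  # marker shape encodes a second categorical variable
    size='exper',            # marker size encodes a numeric variable
    sizes=(20, 150), alpha=0.6,
)
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education by Gender, Marital Status, and Experience')
plt.show()
```

Three encodings at once can get crowded; with many categories, facets are often easier to read.
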

Make it interpretable:

- Always label axes clearly (“Hourly Wage ($)”, not just “wage”)
- Use meaningful titles
- Order categories logically (Low → High, not alphabetical)
- Consider log scale if values span wide ranges

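
The ordering and log-scale points can be sketched as follows. The `Education_Level` column and its Low/Medium/High values are hypothetical (they are not part of WAGE1), and the wages are made-up right-skewed numbers.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic, right-skewed wages for three hypothetical education levels (not WAGE1)
rng = np.random.default_rng(4)
levels = rng.choice(['High', 'Low', 'Medium'], size=300)
base = pd.Series(levels).map({'Low': 5.0, 'Medium': 9.0, 'High': 18.0})
wage = base * rng.lognormal(mean=0.0, sigma=0.5, size=300)
demo = pd.DataFrame({'Education_Level': levels, 'wage': wage.to_numpy()})

# Explicit logical order instead of the default order of appearance
level_order = ['Low', 'Medium', 'High']

plt.figure(figsize=(8, 5))
ax = sns.boxplot(data=demo, x='Education_Level', y='wage', order=level_order)
ax.set_yscale('log')  # log scale keeps the long upper tail from dominating the plot
plt.xlabel('Education Level')
plt.ylabel('Hourly Wage ($, log scale)')
plt.title('Wage Distribution by Education Level')
plt.show()
```
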

Handle overlapping points:

- Use transparency (`alpha` parameter)
- Bin continuous variables (`x_bins` in `regplot`)

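
Both options are shown side by side in the sketch below, again on synthetic data (the hypothetical `demo`, made-up values). The choice of `alpha=0.3` and `x_bins=8` is arbitrary and would normally be tuned to the data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data (made-up, not WAGE1): integer years of experience cause heavy overlap
rng = np.random.default_rng(5)
n = 500
exper = rng.integers(0, 41, size=n)
wage = 4.0 + 0.1 * exper + rng.normal(0, 2, size=n)
demo = pd.DataFrame({'exper': exper, 'wage': wage})

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Option 1: transparency, so dense regions appear darker
sns.scatterplot(data=demo, x='exper', y='wage', alpha=0.3, ax=axes[0])
axes[0].set_title('Transparency (alpha=0.3)')

# Option 2: x_bins groups x into 8 bins and plots the mean of y per bin with an interval;
# the regression line is still fit to the original, unbinned data
sns.regplot(data=demo, x='exper', y='wage', x_bins=8, ax=axes[1])
axes[1].set_title('Binned means (x_bins=8)')

for panel in axes:
    panel.set_xlabel('Years of Experience')
axes[0].set_ylabel('Hourly Wage ($)')
plt.tight_layout()
plt.show()
```
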

Look for patterns:

- Non-linear relationships (curved scatter plots)
- Outliers (unusual wage values)
- Interactions (effect of one variable depends on another)

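
For the first pattern, one possible check is to let `sns.regplot()` fit a quadratic instead of a straight line through its `order` parameter. The sketch below assumes a made-up concave wage-experience relationship (the hypothetical `demo`); it is not an estimate from WAGE1.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data with an assumed concave shape (made-up numbers, not WAGE1)
rng = np.random.default_rng(6)
n = 300
exper = rng.uniform(0, 40, size=n)
wage = 4.0 + 0.4 * exper - 0.008 * exper**2 + rng.normal(0, 1, size=n)
demo = pd.DataFrame({'exper': exper, 'wage': wage})

plt.figure(figsize=(8, 5))
# order=2 fits a second-degree polynomial, so curvature becomes visible
ax = sns.regplot(data=demo, x='exper', y='wage', order=2,
                 scatter_kws={'alpha': 0.4}, line_kws={'color': 'red'})
plt.xlabel('Years of Experience')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Experience with Quadratic Fit')
plt.show()

# A negative leading coefficient indicates a curve that flattens and then bends down
quad_coef = np.polyfit(demo['exper'], demo['wage'], 2)[0]
print(f'Quadratic coefficient: {quad_coef:.4f}')
```

If the fitted curve bends clearly away from a straight line, a linear trend line would understate the relationship at some values and overstate it at others.
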

Choose the right plot:

- Start with box plots for categorical variables (most informative)
- Use scatter plots for numeric relationships
- Add regression lines to show trends
- Add `hue`/`size`/`style` to show effects of additional variables