Part 11: More Visualization

Kerry Back, Rice University

Overview

  • Pairplots for exploring relationships
  • Visualizing effects of explanatory variables on a target variable
  • Choosing the right plot type
  • Best practices for interpretable visualizations

Seaborn Pairplot

A pairplot creates a matrix of scatter plots showing the relationship between each pair of numeric variables, with each variable's own distribution on the diagonal.

```{python}
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

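url = 'https://faculty.utrgv.edu/diego.escobari/teaching/Datasets/WAGE1.xls'
# The file has no header row, so read it with header=None and name the columns below
# (reading .xls files with pandas requires the xlrd package)
wages = pd.read_excel(url, header=None)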
columns = ['wage', 'educ', 'exper', 'tenure', 'nonwhite', 'female',
           'married', 'numdep', 'smsa', 'northcen', 'south', 'west',
           'construc', 'ndurman', 'trcommpu', 'trade', 'services',
           'profserv', 'profocc', 'clerocc', 'servocc', 'lwage',
           'expersq', 'tenursq']
wages.columns = columns

# Pairplot with corner=True shows only lower triangle
sns.pairplot(wages[['wage', 'educ', 'exper', 'tenure']], corner=True)
plt.show()
```

Understanding Relationships

Goal: Visually explore what determines wages

  • Which factors most strongly influence wages?
  • How do different variables interact to affect wages?
  • Are there patterns or outliers to investigate?

Terminology:

  • Target variable: wage (numeric)
  • Explanatory variables: education, experience, gender, etc. (numeric or categorical)

Single Numeric Explanatory Variable

Scatter plot: Shows relationship and correlation

```{python}
plt.figure(figsize=(8, 5))
sns.scatterplot(data=wages, x='educ', y='wage', alpha=0.5)
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education')
plt.show()
```

Regression plot: Adds trend line to see direction and strength

```{python}
plt.figure(figsize=(8, 5))
sns.regplot(data=wages, x='educ', y='wage', scatter_kws={'alpha':0.5})
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education with Trend Line')
plt.show()
```

Practice: Regression Plots

Exercise 1 (with Gemini): Ask Gemini to “create a regression plot showing the relationship between two numeric variables”

Exercise 2 (on your own): Type sns.regplot(x=[1, 2, 3, 4], y=[2, 4, 3, 5]) then plt.show() and run it.

Single Categorical Explanatory Variable

Box plot: Shows distribution of wages within each category (preferred)

```{python}
# Create categorical variable
wages['Gender'] = wages['female'].map({0: 'Male', 1: 'Female'})

plt.figure(figsize=(8, 5))
sns.boxplot(data=wages, x='Gender', y='wage')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage Distribution by Gender')
plt.show()
```

Alternatives: Strip plot, violin plot, or bar plot with error bars
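
The three alternatives can be compared side by side. The sketch below is only illustrative: it uses a small synthetic table (the `demo` DataFrame and its values are made up, not the WAGE1 data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the wage data (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Gender': np.repeat(['Male', 'Female'], 50),
    'wage': np.concatenate([rng.lognormal(1.9, 0.5, 50),
                            rng.lognormal(1.6, 0.5, 50)]),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# Strip plot: one dot per observation, jittered horizontally
sns.stripplot(data=demo, x='Gender', y='wage', alpha=0.5, ax=axes[0])
axes[0].set_title('Strip plot')

# Violin plot: smoothed density of wages within each category
sns.violinplot(data=demo, x='Gender', y='wage', ax=axes[1])
axes[1].set_title('Violin plot')

# Bar plot: mean wage per category, with error bars around the mean
sns.barplot(data=demo, x='Gender', y='wage', ax=axes[2])
axes[2].set_title('Bar plot (mean with error bars)')

# Label the shared y-axis once, on the left panel only
axes[0].set_ylabel('Hourly Wage ($)')
for ax in axes[1:]:
    ax.set_ylabel('')

plt.tight_layout()
plt.show()
```

The strip and violin plots show the full distribution, while the bar plot reduces each category to a mean and its uncertainty.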

Practice: Box Plots

Exercise 1 (with Gemini): Ask Gemini to “create a box plot comparing test scores across different classes”

Exercise 2 (on your own): Type sns.boxplot(x=['A', 'A', 'B', 'B'], y=[10, 12, 15, 17]) then plt.show() and run it.

Multiple Categorical Variables

Grouped box plots: Use hue for a second categorical variable

```{python}
wages['Marital_Status'] = wages['married'].map({0: 'Not Married', 1: 'Married'})

plt.figure(figsize=(10, 6))
sns.boxplot(data=wages, x='Gender', y='wage', hue='Marital_Status')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage by Gender and Marital Status')
plt.show()
```

Heatmap: Matrix showing mean wage for each combination

```{python}
pivot_data = wages.pivot_table(values='wage',
                               index='Gender',
                               columns='Marital_Status')
plt.figure(figsize=(8, 5))
sns.heatmap(pivot_data, annot=True, fmt='.2f', cmap='YlOrRd')
plt.title('Mean Wage by Gender and Marital Status')
plt.show()
```

Practice: Heatmaps

Exercise 1 (with Gemini): Ask Gemini to “create a heatmap showing average values for different category combinations”

Exercise 2 (on your own): Type sns.heatmap([[1, 2], [3, 4]], annot=True) then plt.show() and run it.

Numeric + Categorical Variables

Scatter plot with hue: Distinguish categories by color

```{python}
plt.figure(figsize=(10, 6))
sns.scatterplot(data=wages, x='educ', y='wage', hue='Gender', alpha=0.6)
plt.xlabel('Years of Education')
plt.ylabel('Hourly Wage ($)')
plt.title('Wage vs Education by Gender')
plt.show()
```

Alternative: Faceted scatter plots for clearer visualization with multiple categorical variables
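
One way to facet is `sns.relplot`, which draws a separate scatter panel for each combination of the `row` and `col` variables. This is a minimal sketch on synthetic data (the `demo` DataFrame and its values are made up); with the real data, the same call should work on `wages` once the `Gender` and `Marital_Status` columns exist.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the wage data (values are made up for illustration)
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    'educ': rng.integers(8, 19, n),
    'Gender': rng.choice(['Male', 'Female'], n),
    'Marital_Status': rng.choice(['Married', 'Not Married'], n),
})
demo['wage'] = 0.5 * demo['educ'] + rng.gamma(2.0, 1.5, n)

# One scatter panel per (Marital_Status, Gender) combination
g = sns.relplot(data=demo, x='educ', y='wage',
                col='Gender', row='Marital_Status',
                kind='scatter', alpha=0.6, height=3, aspect=1.3)
g.set_axis_labels('Years of Education', 'Hourly Wage ($)')
plt.show()
```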

Key Recommendations

Recommended Seaborn Functions:

  • sns.boxplot(): best for categorical variables
  • sns.scatterplot(): best for numeric variables
  • sns.regplot(): adds trend line to scatter plots
  • sns.heatmap(): good for two categorical variables
  • Use hue, size, style parameters to add additional variables
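
As an illustration of the hue, size, and style parameters, the sketch below encodes three extra variables in a single scatter plot. The `demo` DataFrame is synthetic (made-up values, with column names chosen to mirror the wage data).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the wage data (values are made up for illustration)
rng = np.random.default_rng(2)
n = 150
demo = pd.DataFrame({
    'educ': rng.integers(8, 19, n),
    'exper': rng.integers(0, 41, n),
    'Gender': rng.choice(['Male', 'Female'], n),
    'Marital_Status': rng.choice(['Married', 'Not Married'], n),
})
demo['wage'] = 0.5 * demo['educ'] + 0.05 * demo['exper'] + rng.gamma(2.0, 1.5, n)

plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=demo, x='educ', y='wage',
                     hue='Gender',            # color encodes gender
                     size='exper',            # marker size encodes experience
                     style='Marital_Status',  # marker shape encodes marital status
                     alpha=0.6)
ax.set_xlabel('Years of Education')
ax.set_ylabel('Hourly Wage ($)')
ax.set_title('Wage vs Education with hue, size, and style')
plt.show()
```

Three encodings at once is usually the upper limit for readability; beyond that, faceting tends to be clearer.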

Make it interpretable:

  • Always label axes clearly (“Annual Wage ($)”, not just “wage”)
  • Use meaningful titles
  • Order categories logically (Low → High, not alphabetical)
  • Consider log scale if values span wide ranges
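
Two of these points map directly onto code: the `order` argument of `sns.boxplot` fixes the category order, and `set_yscale('log')` switches the axis to a log scale. The sketch below assumes a hypothetical `Education` column with three levels and uses made-up wage values.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data: 'Education' is a hypothetical column, values are made up
rng = np.random.default_rng(3)
levels = ['High School', 'College', 'Graduate']   # logical order, low to high
demo = pd.DataFrame({'Education': rng.choice(levels, 300)})
base = demo['Education'].map({'High School': 2.0, 'College': 2.6, 'Graduate': 3.2})
demo['wage'] = np.exp(base + rng.normal(0, 0.6, 300))

plt.figure(figsize=(8, 5))
# order= fixes the category order instead of leaving it to the data
ax = sns.boxplot(data=demo, x='Education', y='wage', order=levels)
# Log scale keeps the high-wage tail from compressing the rest of the plot
ax.set_yscale('log')
ax.set_xlabel('Highest Education Level')
ax.set_ylabel('Hourly Wage ($, log scale)')
ax.set_title('Wage by Education Level')
plt.show()
```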

Best Practices

Handle overlapping points:

  • Use transparency (alpha parameter)
  • Bin continuous variables (x_bins in regplot)
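
Both approaches can be sketched with `sns.regplot`. With `x_bins`, seaborn replaces the raw points with the mean and a confidence interval in each bin, while the regression line is still fit to the original data. The `demo` DataFrame below is synthetic (made-up values).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the wage data (values are made up for illustration)
rng = np.random.default_rng(4)
n = 400
demo = pd.DataFrame({'exper': rng.uniform(0, 40, n)})
demo['wage'] = 5 + 0.1 * demo['exper'] + rng.gamma(2.0, 1.5, n)

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Left: every point drawn, with transparency to reveal dense regions
sns.regplot(data=demo, x='exper', y='wage',
            scatter_kws={'alpha': 0.3}, ax=axes[0])
axes[0].set_title('Transparency (alpha=0.3)')

# Right: points replaced by bin means with confidence intervals
sns.regplot(data=demo, x='exper', y='wage', x_bins=8, ax=axes[1])
axes[1].set_title('Binned (x_bins=8)')

for ax in axes:
    ax.set_xlabel('Years of Experience')
axes[0].set_ylabel('Hourly Wage ($)')
axes[1].set_ylabel('')
plt.tight_layout()
plt.show()
```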

Look for patterns:

  • Non-linear relationships (curved scatter plots)
  • Outliers (unusual wage values)
  • Interactions (effect of one variable depends on another)
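
Two of these patterns can be made visible with trend lines. In the sketch below, `order=2` in `sns.regplot` fits a quadratic instead of a straight line, and `sns.lmplot` with `hue` fits a separate line per group, so differing slopes suggest an interaction. The data are synthetic, and the curvature and the slope difference are built in on purpose; they are not findings about real wages.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data with a built-in curve and a built-in interaction (made up)
rng = np.random.default_rng(5)
n = 300
demo = pd.DataFrame({
    'educ': rng.integers(8, 19, n),
    'exper': rng.uniform(0, 40, n),
    'Gender': rng.choice(['Male', 'Female'], n),
})
slope = np.where(demo['Gender'] == 'Male', 0.8, 0.4)  # assumed, for illustration
demo['wage'] = (slope * demo['educ']
                + 0.4 * demo['exper'] - 0.008 * demo['exper'] ** 2
                + rng.normal(0, 1.5, n))

# Non-linear relationship: quadratic trend line
plt.figure(figsize=(8, 5))
ax_curve = sns.regplot(data=demo, x='exper', y='wage', order=2,
                       scatter_kws={'alpha': 0.4})
ax_curve.set_xlabel('Years of Experience')
ax_curve.set_ylabel('Hourly Wage ($)')
plt.show()

# Interaction: one trend line per group; different slopes suggest an interaction
g = sns.lmplot(data=demo, x='educ', y='wage', hue='Gender',
               scatter_kws={'alpha': 0.4})
g.set_axis_labels('Years of Education', 'Hourly Wage ($)')
plt.show()
```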

Choose the right plot:

  • Start with box plots for categorical variables (most informative)
  • Use scatter plots for numeric relationships
  • Add regression lines to show trends
  • Add hue/size/style to show effects of additional variables