2025-08-22

Day 6: Advanced EDA & Patterns

  • Today, we’ll dive deeper into Exploratory Data Analysis (EDA).
  • We’ll focus on:
    • Identifying patterns, relationships, and anomalies.
    • Building more complex visualizations using ggplot2.

Recap: What is EDA?

  • EDA is the process of analyzing data sets to summarize their main characteristics.
  • It often involves visual methods.
  • EDA helps us understand the data before formal modeling.
  • Today, we’ll go beyond basic plots to uncover deeper insights.

Beyond Basic Scatter Plots: Faceting

  • Faceting allows us to split a single plot into multiple subplots.
  • This is based on the levels of one or more categorical variables.
  • It’s incredibly powerful for comparing distributions or relationships across different groups.
  • What if we want to see this relationship by cylinder count?
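# Assumed setup (not shown on the slides): packages, plus the 'model' and 'cyl_factor' columns used in later examples
library(ggplot2); library(dplyr)
mtcars <- mtcars %>% tibble::rownames_to_column("model") %>% mutate(cyl_factor = factor(cyl))
# Basic scatter plot
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() +
  labs(title = "HP vs MPG (Overall)", x = "Horsepower", y = "Miles Per Gallon")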

Example: Faceting by Categorical Variable

  • We can easily create subplots using facet_wrap() or facet_grid().
  • facet_wrap() is ideal for a single categorical variable.
  • facet_grid() lays panels out in rows and columns, which suits two categorical variables.
# Faceting by cylinder count
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl_factor) +
  labs(title = "HP vs MPG by Cylinder Count", x = "Horsepower", y = "Miles Per Gallon")
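
For two categorical variables, a facet_grid() layout might look like the sketch below. It works on a local copy of the built-in mtcars data; the names cars, cyl_factor, and am_label are illustrative and are created inside the example.

```r
library(ggplot2)

# Work on a local copy so the built-in dataset is left untouched
cars <- datasets::mtcars
cars$cyl_factor <- factor(cars$cyl)
cars$am_label <- factor(cars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

# Rows = transmission type, columns = cylinder count
p_grid <- ggplot(cars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_grid(am_label ~ cyl_factor) +
  labs(title = "HP vs MPG by Transmission and Cylinder Count",
       x = "Horsepower", y = "Miles Per Gallon")
print(p_grid)
```

The formula reads rows ~ columns, so swapping the two sides transposes the grid.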

Exploring Relationships with geom_smooth()

  • geom_smooth() adds a smoothed conditional mean to a plot.
  • It’s useful for visualizing trends in data.
# Scatter plot with a smooth line
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # Add linear regression line, hide the confidence band
  labs(title = "HP vs MPG with Linear Trend", x = "Horsepower", y = "Miles Per Gallon")
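
A straight line is only one choice of smoother. As a sketch, method = "loess" fits a local regression curve that can bend with the data, and se = TRUE draws a confidence band around it. The object name p_loess is illustrative.

```r
library(ggplot2)

# Same scatter plot, but with a flexible loess curve and its confidence band
p_loess <- ggplot(datasets::mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
  labs(title = "HP vs MPG with Loess Trend",
       x = "Horsepower", y = "Miles Per Gallon")
print(p_loess)
```

Comparing this curve with the straight line is a quick visual check of whether a linear trend is a reasonable summary.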

Example: Adding Regression Lines to Faceted Plots

  • Combining geom_smooth() with faceting shows how trends vary across different groups.
  • This provides a richer understanding of conditional relationships.
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl_factor) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "HP vs MPG by Cylinder Count with Linear Trends", x = "Horsepower", y = "Miles Per Gallon")
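
The per-panel lines can also be put into numbers. The sketch below fits mpg ~ hp separately within each cylinder group using base R and compares the slopes with the overall fit. The names cars, slopes, and overall are illustrative.

```r
cars <- datasets::mtcars

# Fit mpg ~ hp separately within each cylinder group and keep the slope
slopes <- sapply(split(cars, cars$cyl), function(d) {
  coef(lm(mpg ~ hp, data = d))[["hp"]]
})

# Slope of the single line fitted to all cars together
overall <- coef(lm(mpg ~ hp, data = cars))[["hp"]]

print(round(slopes, 3))   # one slope per cylinder group
print(round(overall, 3))  # slope ignoring the grouping
```

If the group slopes differ noticeably from the overall slope, the relationship depends on the grouping variable, which is exactly what the faceted plot is meant to reveal.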

Correlation Matrices

  • A correlation matrix is a table showing correlation coefficients between variables.
  • Each cell shows the correlation between two variables.
  • It’s a powerful way to quickly identify linear relationships.
# Calculate correlation matrix for numerical variables
cor_matrix <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")])
print(round(cor_matrix, 2))
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
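
Reading an 11 x 11 table by eye is slow, so it can help to pull out one row and sort it. A minimal sketch, using the built-in mtcars data (all columns numeric) and illustrative names mpg_cor and mpg_ranked:

```r
cars <- datasets::mtcars
cor_matrix <- cor(cars)

# Correlations of every other variable with mpg
mpg_cor <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]

# Order by absolute strength, strongest first
mpg_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_ranked, 2))
```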

Example: Visualizing via Heatmap

  • A heatmap provides an intuitive visual representation of the correlation matrix.
  • It makes strong positive and negative correlations immediately apparent.
# Convert correlation matrix to long format for ggplot2
library(reshape2) # provides melt(); on a matrix it yields columns Var1, Var2, value
melted_cor <- melt(cor_matrix)
ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
                       limits = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix of mtcars variables")
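
If an extra reshaping package is not wanted, the long format can be built with base R instead. The sketch below is one possible variant, not the slides' method: it uses as.table() and as.data.frame(), and adds the rounded coefficient as a text label on each tile. The names long_cor and p_heat are illustrative.

```r
library(ggplot2)

cor_matrix <- cor(datasets::mtcars)

# as.table() + as.data.frame() gives one row per cell: Var1, Var2, Freq
long_cor <- as.data.frame(as.table(cor_matrix))
names(long_cor) <- c("Var1", "Var2", "value")

p_heat <- ggplot(long_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", value)), size = 3) +  # print each coefficient
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix of mtcars variables (labelled)", x = NULL, y = NULL)
print(p_heat)
```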

Detecting Anomalies: What are They?

  • Anomalies (or outliers) are data points that significantly deviate from the majority of the data.
  • Identifying them is crucial because they can indicate:
    • Errors in data collection.
    • Rare events.
    • Important insights.

Density (or histogram) Plots

  • Density plots (or kernel density estimates) show the distribution of a numerical variable.
  • Anomalies might appear as:
    • Small, isolated peaks.
    • Points in very low-density regions.
ggplot(mtcars, aes(x = hp)) +
  geom_density(fill = "lightblue", alpha = 0.7, adjust = 0.3) +
  labs(title = "Density Plot of Horsepower (mtcars)", x = "Horsepower", y = "Density")
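
The adjust argument multiplies the default bandwidth, so small values make the curve bumpier and can create isolated peaks that are not real anomalies. The sketch below overlays two bandwidths and adds a rug of the individual observations. Object names are illustrative.

```r
library(ggplot2)

hp <- datasets::mtcars$hp

# density() reports the bandwidth actually used; 'adjust' scales the default
d_default <- density(hp)
d_narrow  <- density(hp, adjust = 0.3)
print(c(default = d_default$bw, narrow = d_narrow$bw))

# Overlay both curves; the rug marks each car on the x axis
p_bw <- ggplot(datasets::mtcars, aes(x = hp)) +
  geom_density(aes(colour = "adjust = 1"), adjust = 1) +
  geom_density(aes(colour = "adjust = 0.3"), adjust = 0.3) +
  geom_rug() +
  labs(title = "Effect of Bandwidth on the Horsepower Density",
       x = "Horsepower", y = "Density", colour = "Bandwidth")
print(p_bw)
```

A peak that disappears at the default bandwidth is worth checking against the rug before it is treated as an anomaly.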

Statistical Methods for Anomaly Detection: IQR Method

  • The Interquartile Range (IQR) method defines outliers as points outside a certain range.
  • Outliers are typically below \(Q1 - 1.5 \times IQR\) or above \(Q3 + 1.5 \times IQR\).
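  • Because quartiles are barely affected by extreme values, this method is more robust than rules based on the mean and standard deviation. On strongly skewed data, however, it can flag many points in the long tail.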
# Calculate Q1, Q3, and IQR for 'hp'
Q1 <- quantile(mtcars$hp, 0.25)
Q3 <- quantile(mtcars$hp, 0.75)
IQR_val <- Q3 - Q1

# Define outlier bounds
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

# Identify outliers based on IQR method (filter() and select() come from dplyr)
mtcars %>%
  filter(hp < lower_bound | hp > upper_bound) %>%
  select(model, hp)
##           model  hp
## 1 Maserati Bora 335
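
The same rule can be wrapped in a small function and applied to several columns at once. This is a sketch in base R; flag_iqr_outliers() and its k argument are hypothetical names introduced here, not part of any package.

```r
# Hypothetical helper: TRUE where x lies outside [Q1 - k*IQR, Q3 + k*IQR]
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

cars <- datasets::mtcars
numeric_cols <- c("mpg", "disp", "hp", "wt", "qsec")

# Number of flagged cars per variable
outlier_counts <- sapply(cars[numeric_cols],
                         function(col) sum(flag_iqr_outliers(col)))
print(outlier_counts)

# Which car is flagged on horsepower
print(rownames(cars)[flag_iqr_outliers(cars$hp)])
```

Raising k (for example to 3) makes the rule stricter, so only more extreme points are flagged.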

Using Box Plots for Anomaly Detection

  • Box plots are excellent for visualizing the distribution of numerical data.
  • The box edges mark Q1 and Q3, and each whisker extends to the most extreme data point within 1.5 × IQR of its box edge.
  • Outliers are typically plotted as individual points beyond the ‘whiskers’ of the box plot.
# Using 'hp' from mtcars for anomaly detection
ggplot(mtcars, aes(y = hp)) +
  geom_boxplot() +
  labs(title = "Box Plot of Horsepower (mtcars)", y = "Horsepower")
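
Whether a point counts as an outlier depends on the group it is compared with. The sketch below draws one box per cylinder count and lists the values flagged within each group. Names are illustrative, and boxplot.stats() is base R's version of the rule (it uses hinges, so its quartiles can differ slightly from ggplot2's).

```r
library(ggplot2)

cars <- datasets::mtcars
cars$cyl_factor <- factor(cars$cyl)

# One box per cylinder group; outliers are judged within each group
p_box <- ggplot(cars, aes(x = cyl_factor, y = hp)) +
  geom_boxplot() +
  labs(title = "Box Plot of Horsepower by Cylinder Count",
       x = "Cylinders", y = "Horsepower")
print(p_box)

# Values beyond the whiskers in each group, according to boxplot.stats()
out_by_cyl <- tapply(cars$hp, cars$cyl_factor,
                     function(x) boxplot.stats(x)$out)
print(out_by_cyl)
```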

Statistical Methods for Anomaly Detection: Z-score

  • The Z-score measures how many standard deviations an observation is from the mean.
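  • A common threshold for outliers is an absolute Z-score greater than 2 or 3.
  • This method relies more heavily on the data being roughly normal than the IQR method does. However, its output is easier to interpret: it is a distance from the mean in standard deviations.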
# Calculate Z-scores for 'hp' (scale() returns a one-column matrix, so convert it to a plain vector)
mtcars$hp_zscore <- as.numeric(scale(mtcars$hp))

# Identify outliers based on |Z-score| > 2
mtcars %>% filter(abs(hp_zscore) > 2) %>% select(model, hp, hp_zscore)
##           model  hp hp_zscore
## 1 Maserati Bora 335  2.746567
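
To make the definition concrete, the Z-score can be computed by hand and checked against scale(). The sketch below also counts how many cars exceed each of the two common thresholds. Variable names are illustrative.

```r
hp <- datasets::mtcars$hp

# Z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (hp - mean(hp)) / sd(hp)
z_scale  <- as.numeric(scale(hp))

print(all.equal(z_manual, z_scale))  # both approaches agree
print(sum(abs(z_manual) > 2))        # cars beyond 2 standard deviations
print(sum(abs(z_manual) > 3))        # cars beyond 3 standard deviations
```

The choice of threshold matters: a point flagged at 2 standard deviations may not be flagged at 3.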

Challenges in Anomaly Detection

  • Defining “Normal”: What constitutes normal behavior can be subjective and context-dependent.
  • High Dimensionality: As the number of variables increases, the concept of distance and density becomes less intuitive.
  • Evolving Patterns: Normal behavior can change over time, requiring adaptive anomaly detection systems.

Visualizing Categorical Data Relationships

  • When working with two or more categorical variables, we often want to see their joint distribution.
  • Stacked bar charts are excellent for this, showing proportions within groups.
  • Grouped bar charts can also be used to compare counts across categories.
# Example: Stacked bar chart of transmission type by cylinder count
ggplot(mtcars, aes(x = cyl_factor, fill = as.factor(am))) +
  geom_bar(position = "fill") + # 'fill' shows proportions
  labs(title = "Proportion of Transmission Types by Cylinder",
       x = "Cylinders", y = "Proportion", fill = "Transmission (0=Auto, 1=Manual)") +
  theme_minimal()

Grouped Bar Charts

  • Grouped bar charts allow for direct comparison of counts or values across different categories.
  • They are useful when you want to compare the absolute numbers of sub-groups side-by-side.
# Example: Grouped bar chart of car count by cylinder and transmission type
ggplot(mtcars, aes(x = cyl_factor, fill = as.factor(am))) +
  geom_bar(position = "dodge") + # 'dodge' places bars side-by-side
  labs(title = "Car Count by Cylinder and Transmission (Grouped)",
       x = "Cylinders", y = "Count", fill = "Transmission (0=Auto, 1=Manual)") +
  theme_minimal()
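
Both bar charts are drawn from the same cross-tabulation: the grouped chart shows the counts, and the stacked chart with position = "fill" shows the row proportions. A minimal base R sketch of the numbers behind them (names counts and props are illustrative):

```r
cars <- datasets::mtcars

# Counts behind the grouped bar chart
counts <- table(Cylinders = cars$cyl, Transmission = cars$am)
print(counts)

# Row proportions behind the stacked 'fill' chart (each cylinder group sums to 1)
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Looking at the counts alongside the proportions guards against over-reading a proportion that is based on only a few cars.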

Data Transformation for Visualization

  • Sometimes, raw data distributions can be skewed, making visualizations hard to interpret.
  • Transformations (like log or square root) can make data more symmetric.
  • This helps reveal patterns that might be hidden in the original scale.
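# 'income_data' is not defined on the slides; here it is simulated as right-skewed (log-normal) data for illustration
set.seed(42)
income_data <- data.frame(income = rlnorm(1000, meanlog = 10.5, sdlog = 0.8))
par(mfrow = c(1, 2)) # Arrange plots side by side
hist(income_data$income, main = "Original Income Distribution", xlab = "Income")
hist(log(income_data$income), main = "Log-transformed Income Distribution", xlab = "Log(Income)")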
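
The effect of a transformation on symmetry can also be quantified. The sketch below uses simulated log-normal values as a stand-in for income (an assumption, since no income data is provided) and a hypothetical helper, sample_skewness(), that computes the moment-based skewness coefficient.

```r
# Simulated right-skewed data; a stand-in for real income values
set.seed(42)
income <- rlnorm(1000, meanlog = 10.5, sdlog = 0.8)

# Hypothetical helper: third central moment divided by variance^(3/2)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Values near 0 indicate a roughly symmetric distribution
skew <- c(raw  = sample_skewness(income),
          sqrt = sample_skewness(sqrt(income)),
          log  = sample_skewness(log(income)))
print(round(skew, 2))
```

For data like this, the square root reduces the skew and the log reduces it further, which is why the log scale is often the first thing to try for income-like variables.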

Heatmaps for Multi-variable Relationships

  • Heatmaps are excellent for visualizing how one variable changes across the combinations of two others.
  • They are especially useful when the two axis variables are categorical.
  • They use color intensity to represent the third variable’s value.
# Average mpg per cylinder/gear combination (group_by() and summarise() come from dplyr)
mtcars_summary <- mtcars %>% group_by(cyl_factor, gear) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")
ggplot(mtcars_summary, aes(x = cyl_factor, y = as.factor(gear), fill = avg_mpg)) +
  geom_tile(color = "white") + scale_fill_gradient(low = "yellow", high = "red") +
  labs(title = "Average MPG by Cylinder and Gear", x = "Cylinders", y = "Gears", fill = "Average MPG") +
  theme_minimal()
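
The tiles in this heatmap correspond to a small table of group means, which can be inspected directly. In the base R sketch below (names avg_mpg and group_n are illustrative), a combination with no cars shows up as NA, which is why the heatmap has a blank cell.

```r
cars <- datasets::mtcars

# Mean mpg for every cylinder x gear combination (NA where no cars exist)
avg_mpg <- tapply(cars$mpg, list(Cylinders = cars$cyl, Gears = cars$gear), mean)
print(round(avg_mpg, 1))

# Number of cars behind each tile
group_n <- table(Cylinders = cars$cyl, Gears = cars$gear)
print(group_n)
```

Checking the counts matters because a tile's color looks equally confident whether it averages one car or twelve.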

Key Takeaways from Day 6

Today, we explored advanced EDA techniques:

  • Faceting: Comparing relationships across groups.
  • geom_smooth(): Visualizing trends.
  • Correlation Matrices & Heatmaps: Understanding multi-variable linear relationships.
  • Box & Density Plots: Identifying anomalies/outliers visually.
  • Statistical Anomaly Detection: Using Z-score and IQR methods.
  • Challenges in Anomaly Detection: Understanding practical difficulties.
  • Data Transformation: Making skewed distributions easier to visualize through log or square-root transforms.
  • Visualizing Categorical Relationships: Using stacked and grouped bar charts.