library(seedhash)
gen <- SeedHashGenerator$new("ANLY530-Week04-Regression")
MASTER_SEED <- gen$generate_seeds(1)
set.seed(MASTER_SEED)

Random Seed Source: seedhash package — input string: "ANLY530-Week04-Regression" → seed: -887874702

Introduction to Regression

Welcome to the world of regression! Regression analysis is a cornerstone of predictive modeling. At its heart, it’s about understanding and predicting a continuous value. Unlike classification, where we predict a category (like “Yes/No” or “Cat/Dog”), regression helps us answer questions like:

  • How much will a house sell for?
  • What will the temperature be tomorrow?
  • How many units will we sell next quarter?

The goal is to find a mathematical relationship between a set of predictor variables (features) and a target variable (the continuous value we want to predict).

Learning Objectives:

  1. Understand and implement Linear Regression, the foundational regression model.
  2. Explore different training methods like the Normal Equation and Gradient Descent.
  3. Learn to evaluate regression models using metrics like MSE and RMSE.
  4. Build more advanced models like Support Vector Regression (SVR), Decision Tree Regressors, and Random Forest Regressors.
  5. Appreciate the trade-offs between different regression models.

Required Packages

We will use a few key packages for this tutorial.

  • tidyverse: For data manipulation and visualization (ggplot2, dplyr).
  • rpart & rpart.plot: For Decision Trees.
  • e1071: For Support Vector Machines.
  • randomForest: For Random Forest models.
  • gridExtra: For multi-panel plot layouts.
# install.packages(c("tidyverse", "rpart", "rpart.plot", "e1071", "randomForest", "gridExtra"))

library(tidyverse)
library(rpart)
library(rpart.plot)
library(e1071)
library(randomForest)
library(gridExtra)
library(scales)

Part 1: Linear Regression - The Foundation

Linear Regression is the simplest and most famous regression model. It assumes a linear relationship between the features and the target variable. In the case of a single feature, this means fitting a straight line to the data.

The Equation of a Line:

\[ \hat{y} = \theta_0 + \theta_1 x_1 \]

  • \(\hat{y}\) (y-hat) is the predicted value.
  • \(x_1\) is the feature value.
  • \(\theta_0\) is the intercept (the value of y when x=0).
  • \(\theta_1\) is the coefficient or slope (how much y changes for a one-unit increase in x).

Our goal is to find the best values for \(\theta_0\) and \(\theta_1\) that make our line fit the data as closely as possible.

Generating Sample Data

Let’s create some data that has a clear linear trend, but with some random noise, which is typical of real-world data.

set.seed(MASTER_SEED)
# Create a dataset with a clear linear trend + noise
X <- 2 * runif(100, 0, 2) # 100 feature values in [0, 4]
y <- 4 + 3 * X + rnorm(100) # y = 4 + 3x + noise

linear_data <- data.frame(X, y)

# Plot the data
ggplot(linear_data, aes(x = X, y = y)) +
  geom_point(alpha = 0.7) +
  labs(title = "Sample Data for Linear Regression", x = "Feature (X)", y = "Target (y)")

What is a Residual?

Before we fit a model, we need to understand what we are trying to minimize. A residual is the vertical distance between a data point and the model’s prediction — the error the model makes on a single observation.

\[e_i = y_i - \hat{y}_i\]

# Fit the model first so we can show residuals
model_lm_tmp <- lm(y ~ X, data = linear_data)
fitted_tmp   <- predict(model_lm_tmp)

set.seed(MASTER_SEED)
resid_show <- data.frame(
  X       = linear_data$X,
  y       = linear_data$y,
  fitted  = fitted_tmp,
  resid   = linear_data$y - fitted_tmp
) %>% slice_sample(n = 18)  # highlight 18 representative points

ggplot(linear_data, aes(x = X, y = y)) +
  geom_point(alpha = 0.25, color = "grey60", size = 1.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#1976D2", linewidth = 1.3) +
  geom_segment(data = resid_show,
    aes(x = X, xend = X, y = y, yend = fitted),
    color = "firebrick", linewidth = 0.8, alpha = 0.85) +
  geom_point(data = resid_show, aes(x = X, y = y),
    color = "firebrick", size = 2.5) +
  labs(
    title    = "Visualising Residuals",
    subtitle = "Red vertical segments = residuals (actual − predicted). OLS minimises the sum of their squares.",
    x = "Feature (X)", y = "Target (y)"
  ) +
  theme_minimal()

The Ordinary Least Squares (OLS) criterion finds the unique line that minimises the Mean Squared Error (MSE):

\[\text{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2\]

Squaring the residuals has two benefits: it penalises large errors more heavily, and it makes the objective differentiable so we can solve it analytically.

How to Find the “Best-Fit” Line?

As shown above, the standard criterion is Ordinary Least Squares (OLS): we choose the line that minimizes the Sum of Squared Errors (SSE), or equivalently the Mean Squared Error (MSE), the average of the squared residuals.

We want to find the line (the \(\theta\) values) that makes this MSE as small as possible.

There are two main ways to do this:

  1. The Normal Equation: A direct, analytical formula.
  2. Gradient Descent: An iterative optimization algorithm.

Method 1: The Normal Equation

The Normal Equation is a mathematical formula that gives you the optimal \(\theta\) values directly.

\[\hat{\theta} = (X^T X)^{-1} X^T y\]

This looks intimidating, but it’s just a series of matrix operations that R can handle easily.

# Add x0 = 1 to each instance for the intercept term
X_b <- cbind(1, linear_data$X) 

# Calculate theta using the Normal Equation
theta_best <- solve(t(X_b) %*% X_b) %*% t(X_b) %*% linear_data$y

# Print the results
cat(paste("Intercept (theta0):", round(theta_best[1], 2), "\n"))
## Intercept (theta0): 3.89
cat(paste("Coefficient (theta1):", round(theta_best[2], 2), "\n"))
## Coefficient (theta1): 3.03
# R's built-in lm() function does the same thing
model_lm <- lm(y ~ X, data = linear_data)
print(summary(model_lm))
## 
## Call:
## lm(formula = y ~ X, data = linear_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9741 -0.5703  0.0813  0.6243  2.3544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88981    0.17246   22.55   <2e-16 ***
## X            3.02880    0.07438   40.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8634 on 98 degrees of freedom
## Multiple R-squared:  0.9442, Adjusted R-squared:  0.9436 
## F-statistic:  1658 on 1 and 98 DF,  p-value: < 2.2e-16

The Normal Equation is fast and precise for smaller datasets. However, it can become very slow if you have a huge number of features, as inverting a large matrix is computationally expensive.
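The explicit inverse is also numerically fragile when \(X^T X\) is ill-conditioned. R's lm() avoids forming \(X^T X\) at all by solving the least-squares problem through a QR decomposition. A minimal, self-contained sketch comparing the two routes (the data and variable names here are illustrative, not the tutorial's objects):

```r
set.seed(1)
X  <- runif(100, 0, 2)
y  <- 4 + 3 * X + rnorm(100)
Xb <- cbind(1, X)                                  # add x0 = 1 for the intercept

theta_inv <- solve(t(Xb) %*% Xb) %*% t(Xb) %*% y   # explicit Normal Equation
theta_qr  <- qr.solve(Xb, y)                       # QR-based least squares

all.equal(as.vector(theta_inv), as.vector(theta_qr))  # TRUE up to numerical tolerance
```

Both routes recover the same coefficients on well-behaved data; the QR route simply degrades more gracefully as features become correlated.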

Visualising the MSE Cost Surface

It is easier to understand why Gradient Descent works when you can see the landscape it is navigating. For two parameters (\(\theta_0\) and \(\theta_1\)), the MSE is a bowl-shaped surface, i.e. a convex function with a single global minimum. On such a surface, Gradient Descent with a suitable learning rate is guaranteed to converge to that global minimum.

# Build MSE grid over a range around the optimal parameters
t0_seq <- seq(theta_best[1] - 3,   theta_best[1] + 3,   length.out = 120)
t1_seq <- seq(theta_best[2] - 1.8, theta_best[2] + 1.8, length.out = 120)

cost_grid <- expand.grid(theta0 = t0_seq, theta1 = t1_seq)
cost_grid$mse <- apply(cost_grid, 1, function(row) {
  pred <- row["theta0"] + row["theta1"] * linear_data$X
  mean((linear_data$y - pred)^2)
})

# Simulate a GD path starting from a fixed off-minimum point
th_path <- c(theta_best[1] - 2.6, theta_best[2] - 1.5)
gd_path <- data.frame(t0 = th_path[1], t1 = th_path[2])
m_cs <- nrow(linear_data)
for (i in 1:60) {
  X_b_cs <- cbind(1, linear_data$X)
  th_mat  <- matrix(th_path, ncol = 1)
  grad    <- 2 / m_cs * t(X_b_cs) %*% (X_b_cs %*% th_mat - linear_data$y)
  th_path <- th_path - 0.1 * as.vector(grad)
  gd_path <- rbind(gd_path, data.frame(t0 = th_path[1], t1 = th_path[2]))
}

ggplot(cost_grid, aes(x = theta0, y = theta1, z = mse)) +
  geom_contour_filled(bins = 18, alpha = 0.85) +
  geom_path(data = gd_path, aes(x = t0, y = t1, z = NULL),
    color = "white", linewidth = 1.1,
    arrow = arrow(length = unit(0.18, "cm"), ends = "last", type = "closed")) +
  geom_point(data = data.frame(t0 = theta_best[1], t1 = theta_best[2]),
    aes(x = t0, y = t1, z = NULL), color = "red", size = 5, shape = 8) +
  labs(
    title    = "MSE Cost Surface — the Bowl that Gradient Descent Navigates",
    subtitle = "Each contour is a level of equal MSE. White path = GD trajectory. Red star = optimum.",
    x        = expression(theta[0]~"(intercept)"),
    y        = expression(theta[1]~"(slope)"),
    fill     = "MSE"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

The concentric rings (contour lines) represent equal levels of MSE. The algorithm starts far from the centre and follows the steepest downhill direction at each step, curving steadily inward to the red-star minimum.

Method 2: Gradient Descent

Gradient Descent is a more general-purpose optimization algorithm. Imagine you are on a mountain in thick fog and you want to get to the bottom. A good strategy would be to feel the slope of the ground beneath your feet and take a step in the steepest downhill direction. You repeat this until you reach the valley floor.

That’s exactly what Gradient Descent does:

  1. It starts with random values for \(\theta\).
  2. It measures the gradient (slope) of the cost function (MSE) with respect to each \(\theta\).
  3. It updates each \(\theta\) by taking a small step in the direction of the negative gradient.
  4. It repeats this process until it converges to the minimum.

The Learning Rate (\(\eta\)): This is a critical hyperparameter that controls the size of the steps. Too small, and the algorithm will take too long to converge. Too large, and it might overshoot the minimum and diverge.

Visualizing Gradient Descent

Let’s visualize the process of finding the best line.

set.seed(MASTER_SEED + 1)
# A simplified manual implementation for visualization
eta <- 0.1 # Learning rate
n_iterations <- 1000
m <- nrow(linear_data)

theta <- matrix(rnorm(2), ncol=1) # Random initialization

# Create a plot object
p <- ggplot(linear_data, aes(x = X, y = y)) + geom_point(alpha=0.6)

# Run Gradient Descent and plot the first few lines
for (iteration in 1:n_iterations) {
  gradients <- 2/m * t(X_b) %*% (X_b %*% theta - y)
  theta <- theta - eta * gradients
  
  # Plot first 10 iterations
  if(iteration <= 10) {
    p <- p + geom_abline(intercept = theta[1], slope = theta[2], color=rainbow(10)[iteration], alpha=0.5)
  }
}

# Add the final line
p <- p + geom_abline(intercept = theta[1], slope = theta[2], color = "black", linewidth = 1.2) +
        labs(title = "Gradient Descent: Finding the Best Line", 
             subtitle = "Colored lines show first 10 iterations, black line is the final fit")
print(p)

cat(paste("Final Intercept:", round(theta[1], 2), "\n"))
## Final Intercept: 3.89
cat(paste("Final Coefficient:", round(theta[2], 2), "\n"))
## Final Coefficient: 3.03

As you can see, the algorithm starts with a random line and iteratively adjusts it until it converges on the best-fit line, which is identical to the one found by the Normal Equation.

Effect of the Learning Rate

The learning rate \(\eta\) is perhaps the most important hyperparameter to tune. The plot below shows what happens with three very different values.

simulate_gd_mse <- function(eta, n_iter = 80, start_offset = c(-2.6, -1.5)) {
  th <- matrix(theta_best + start_offset, ncol = 1)
  X_b_lr <- cbind(1, linear_data$X)
  m_lr   <- nrow(linear_data)
  path   <- numeric(n_iter + 1)
  path[1] <- mean((linear_data$y - (X_b_lr %*% th))^2)
  for (i in seq_len(n_iter)) {
    grad   <- 2 / m_lr * t(X_b_lr) %*% (X_b_lr %*% th - linear_data$y)
    th     <- th - eta * grad
    path[i + 1] <- mean((linear_data$y - (X_b_lr %*% th))^2)
  }
  data.frame(Iteration = 0:n_iter, MSE = path,
             Rate = sprintf("η = %.2f", eta))   # sprintf keeps trailing zeros
}

lr_df <- bind_rows(
  simulate_gd_mse(0.02),
  simulate_gd_mse(0.10),
  simulate_gd_mse(0.50)
) %>%
  mutate(
    MSE  = pmin(MSE, 500),         # cap diverging values for readability
    Rate = factor(Rate, levels = c("η = 0.02", "η = 0.10", "η = 0.50"))
  )

ggplot(lr_df, aes(x = Iteration, y = MSE, color = Rate)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c(
    "η = 0.02" = "#FF9800",
    "η = 0.10" = "#4CAF50",
    "η = 0.50" = "#E91E63"
  )) +
  scale_y_log10(labels = scales::comma) +
  geom_hline(yintercept = min(lr_df$MSE[lr_df$Rate == "η = 0.10"]),
    linetype = "dashed", color = "grey50") +
  labs(
    title    = "Learning Rate Effect on Gradient Descent Convergence",
    subtitle = "Too small (orange) → slow. Just right (green) → fast. Too large (pink) → overshoots and diverges.",
    x = "Iteration", y = "MSE (log scale)", color = "Learning Rate"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Rule of thumb: Start with \(\eta = 0.1\) and halve it if training loss oscillates; double it if convergence is too slow. Modern optimisers like Adam adapt \(\eta\) automatically.

Evaluating a Regression Model

Once we have a model, how do we know if it’s any good?

  • R-squared (R²): The proportion of the variance in the target variable that is predictable from the feature(s). For an OLS fit with an intercept it ranges from 0 to 1 on the training data, with higher values being better. An R² of 0.85 means that 85% of the variation in y is explained by X.
  • Mean Squared Error (MSE): The average of the squared differences between the actual and predicted values. It’s in squared units, so it’s hard to interpret directly.
  • Root Mean Squared Error (RMSE): The square root of the MSE. This is a great metric because it’s in the same units as the target variable. An RMSE of 5.0 means the model is, on average, off by about 5.0 units.
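These three metrics can be computed from any vector of predictions, not just lm() output. A small reusable helper (the function name is ours, not from any package):

```r
# Compute R-squared, MSE, and RMSE from actual vs. predicted values
regression_metrics <- function(actual, predicted) {
  resid  <- actual - predicted
  ss_res <- sum(resid^2)
  ss_tot <- sum((actual - mean(actual))^2)
  c(r_squared = 1 - ss_res / ss_tot,
    mse       = mean(resid^2),
    rmse      = sqrt(mean(resid^2)))
}

# A perfect prediction gives R^2 = 1, MSE = 0, RMSE = 0
regression_metrics(c(1, 2, 3), c(1, 2, 3))
```

Because it only needs two numeric vectors, the same helper works unchanged for the SVR, tree, and forest models later in this tutorial.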
# Get summary of our lm() model
model_summary <- summary(model_lm)

# Extract metrics
r_squared <- model_summary$r.squared
residuals <- model_summary$residuals
mse <- mean(residuals^2)
rmse <- sqrt(mse)

cat(paste("R-squared:", round(r_squared, 3), "\n"))
## R-squared: 0.944
cat(paste("MSE:", round(mse, 3), "\n"))
## MSE: 0.73
cat(paste("RMSE:", round(rmse, 3), "\n"))
## RMSE: 0.855

Confidence vs. Prediction Intervals

A fitted regression line is just an estimate. There are two distinct types of uncertainty we can quantify:

  • Confidence Interval: Where does the true mean response lie? This is uncertainty about the line itself.
  • Prediction Interval: Where will a single new observation fall? This is always wider because it also includes the natural scatter of individual points around the mean.

\[\text{CI:}\quad \hat{y} \pm t^* \cdot \text{SE}(\hat{y}) \qquad \text{PI:}\quad \hat{y} \pm t^* \cdot \sqrt{\text{SE}(\hat{y})^2 + \hat{\sigma}^2}\]

# compute both intervals over a fine prediction grid
pred_grid  <- data.frame(X = seq(min(linear_data$X), max(linear_data$X), length.out = 300))
ci_mat     <- predict(model_lm, newdata = pred_grid, interval = "confidence", level = 0.95)
pi_mat     <- predict(model_lm, newdata = pred_grid, interval = "prediction", level = 0.95)

band_df <- data.frame(pred_grid,
  fit    = ci_mat[, "fit"],
  ci_lwr = ci_mat[, "lwr"], ci_upr = ci_mat[, "upr"],
  pi_lwr = pi_mat[, "lwr"], pi_upr = pi_mat[, "upr"]
)

ggplot(band_df, aes(x = X)) +
  geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), fill = "#BBDEFB", alpha = 0.55) +
  geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), fill = "#1976D2", alpha = 0.40) +
  geom_line(aes(y = fit), color = "#0D47A1", linewidth = 1.3) +
  geom_point(data = linear_data, aes(x = X, y = y), alpha = 0.35, color = "grey40", size = 1.5) +
  annotate("text", x = 2.8, y = 16.5,
    label = "95 % Prediction Interval\n(where a new observation will land)",
    color = "#1565C0", size = 3.4, hjust = 0) +
  annotate("text", x = 2.8, y = 14.5,
    label = "95 % Confidence Interval\n(uncertainty in the true mean line)",
    color = "#0D47A1", size = 3.4, hjust = 0) +
  labs(
    title    = "Confidence Band vs. Prediction Band",
    subtitle = "The prediction band is always wider — it accounts for individual-level noise on top of line-estimation uncertainty.",
    x = "Feature (X)", y = "Target (y)"
  ) +
  theme_minimal()

Residual Diagnostics

After fitting any model, always examine the residuals. Three key assumptions of linear regression can be checked visually:

Plot                   What to look for                           Violation signal
Residuals vs. Fitted   Horizontal band around zero — no pattern   Funnel shape → heteroscedasticity; curve → non-linearity
Normal Q-Q             Points on the diagonal line                Heavy tails or S-curve → non-normal errors

diag_df <- data.frame(
  fitted   = fitted(model_lm),
  residual = residuals(model_lm),
  std_res  = rstandard(model_lm)
)

p_rv_f <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(alpha = 0.55, color = "#1976D2", size = 1.8) +
  geom_smooth(method = "loess", se = FALSE, color = "firebrick",
    linewidth = 0.9, formula = y ~ x) +
  labs(title = "Residuals vs. Fitted",
       subtitle = "Good: points randomly scattered around 0",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()

p_qq <- ggplot(diag_df, aes(sample = std_res)) +
  stat_qq(alpha = 0.55, color = "#1976D2", size = 1.8) +
  stat_qq_line(color = "firebrick", linewidth = 0.9) +
  labs(title = "Normal Q-Q",
       subtitle = "Good: points hug the diagonal line",
       x = "Theoretical Quantiles", y = "Standardised Residuals") +
  theme_minimal()

grid.arrange(p_rv_f, p_qq, ncol = 2,
  top = "Residual Diagnostics — Checking Linear Regression Assumptions")

Both plots look healthy for our synthetic data: residuals scatter randomly around zero (no pattern) and closely follow the normal diagonal.


Part 2: Support Vector Regression (SVR)

Support Vector Machines (SVMs) are famous for classification, but they can be cleverly adapted for regression. This is called Support Vector Regression (SVR).

Instead of trying to find a line that minimizes the error for all points, SVR tries to fit as many data points as possible inside a “street” or “tube” while limiting margin violations.

  • The Epsilon (ε) Tube: The width of the street. Points inside the tube do not contribute to the error. This is why SVR is called epsilon-insensitive.
  • Support Vectors: The points that lie on the edge of or outside the tube. These are the critical points that “support” or define the regression line.
  • The C Hyperparameter: Controls the trade-off between keeping the model flat (heavily regularised) and penalising margin violations (points outside the tube). A small C tolerates more violations and gives a simpler, smoother fit; a large C penalises violations strongly and can overfit. Note that the width of the street itself is set by ε, not by C.
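The "points inside the tube do not contribute" idea corresponds to the epsilon-insensitive loss \(L_\varepsilon(e) = \max(0, |e| - \varepsilon)\). A one-line sketch (our own helper for illustration, not an e1071 function):

```r
# Epsilon-insensitive loss: errors smaller than epsilon cost nothing
eps_loss <- function(error, epsilon = 0.5) pmax(0, abs(error) - epsilon)

eps_loss(c(-1.2, -0.3, 0, 0.4, 0.9), epsilon = 0.5)
# the errors of size 0.3 and 0.4 fall inside the tube, so they contribute zero loss
```

Compare this with squared error, which penalises every nonzero residual: SVR simply declines to chase points it already predicts "well enough".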

Visualizing SVR

Let’s visualize how SVR works on our data.

# Build an SVR model
# We use the e1071 package
# Note: SVR is sensitive to feature scale; e1071's svm() scales inputs by default (scale = TRUE)
svr_model <- svm(y ~ X, data = linear_data, type = 'eps-regression', kernel = 'linear', epsilon = 0.5, cost = 1)

linear_data$svr_pred <- predict(svr_model, linear_data)

# Identify support vectors
support_vectors <- linear_data[svr_model$index,]

# Plot the results
ggplot(linear_data, aes(x = X, y = y)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = svr_pred), color = "blue", linewidth = 1) +
  geom_ribbon(aes(ymin = svr_pred - svr_model$epsilon, ymax = svr_pred + svr_model$epsilon), 
              fill = "blue", alpha = 0.2) + # The epsilon tube
  geom_point(data = support_vectors, aes(x=X, y=y), color = "red", size = 4, shape = 1) + # Highlight support vectors
  labs(
    title = "Support Vector Regression (SVR)",
    subtitle = "The model fits a line inside the blue Epsilon-Insensitive Tube. Red points are Support Vectors."
  )

SVR is powerful because it can use the kernel trick (just like for classification) to model complex, non-linear relationships by changing the kernel parameter from linear to something like radial (RBF).

The Kernel Trick: Linear vs. RBF SVR

The kernel trick implicitly maps the data into a much higher-dimensional feature space where a linear boundary becomes non-linear in the original space. Let us demonstrate on data that follows a sine curve — something a linear kernel obviously cannot fit.

set.seed(MASTER_SEED + 2)
X_nl  <- sort(runif(90, 0, 4 * pi))
y_nl  <- sin(X_nl) + rnorm(90, sd = 0.25)
nl_df <- data.frame(X = X_nl, y = y_nl)

svr_lin <- svm(y ~ X, data = nl_df, type = "eps-regression",
               kernel = "linear", epsilon = 0.1, cost = 1)
svr_rbf <- svm(y ~ X, data = nl_df, type = "eps-regression",
               kernel = "radial", epsilon = 0.1, cost = 5, gamma = 0.3)

pred_grid_nl <- data.frame(X = seq(0, 4 * pi, length.out = 400))
pred_grid_nl$linear <- predict(svr_lin, pred_grid_nl)
pred_grid_nl$rbf    <- predict(svr_rbf, pred_grid_nl)

pred_long <- pred_grid_nl %>%
  pivot_longer(-X, names_to = "kernel", values_to = "pred") %>%
  mutate(kernel = recode(kernel,
    linear = "Linear Kernel  —  fits a straight line only",
    rbf    = "RBF Kernel  —  captures the sine curve"
  ))

ggplot(nl_df, aes(x = X, y = y)) +
  geom_point(alpha = 0.45, color = "grey50", size = 1.8) +
  geom_line(data = pred_long, aes(y = pred, color = kernel), linewidth = 1.2) +
  scale_color_manual(values = c(
    "Linear Kernel  —  fits a straight line only" = "#FF5722",
    "RBF Kernel  —  captures the sine curve"      = "#009688"
  )) +
  scale_x_continuous(breaks = c(0, pi, 2*pi, 3*pi, 4*pi),
    labels = c("0", "π", "2π", "3π", "4π")) +
  labs(
    title    = "SVR Kernel Trick: Linear vs. Radial Basis Function (RBF)",
    subtitle = "Data follows y = sin(x) + noise. The linear kernel cannot bend; RBF maps it into infinite dimensions.",
    x = "X", y = "y", color = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

The RBF (Gaussian) kernel effectively gives SVR the ability to fit any smooth curve. The gamma parameter controls how tightly the model follows individual points — larger values risk overfitting.


Part 3: Decision Tree Regression

As we saw in a previous tutorial, Decision Trees can also perform regression. Instead of predicting a class at each leaf node, a regression tree predicts a single continuous value.

  • How it Works: The tree recursively partitions the data into smaller and smaller regions. The goal at each split is to find the feature and split point that results in the greatest reduction in the Mean Squared Error (MSE) of the resulting child nodes.
  • Prediction: The final prediction for any data point is simply the average (mean) of all the training data points that fall into the same leaf node.
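We can verify the leaf-mean rule directly. A self-contained sketch (synthetic data, not the tutorial's linear_data) checking that rpart's fitted values equal the per-leaf averages of y:

```r
library(rpart)

set.seed(1)
df <- data.frame(X = runif(60, 0, 4))
df$y <- 4 + 3 * df$X + rnorm(60)

tree <- rpart(y ~ X, data = df, method = "anova", maxdepth = 2)

# tree$where gives each training row's terminal (leaf) node; the prediction
# for a row is exactly the mean of y over the rows sharing its leaf
leaf_means <- tapply(df$y, tree$where, mean)
all.equal(unname(leaf_means[as.character(tree$where)]),
          unname(predict(tree, df)))   # TRUE
```

This is also why regression-tree predictions are piecewise constant: every point in the same region of feature space gets the same leaf mean.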

Visualizing a Regression Tree

Let’s use the rpart package to build a regression tree.

# Build a regression tree
dt_model <- rpart(y ~ X, data = linear_data, method = "anova")

linear_data$dt_pred <- predict(dt_model, linear_data)

# Plot the tree's predictions
ggplot(linear_data, aes(x = X, y = y)) +
  geom_point(alpha = 0.4, color="gray") +
  geom_step(aes(y = dt_pred), color = "darkgreen", linewidth = 1.2) +
  labs(
    title = "Decision Tree Regression",
    subtitle = "The tree approximates the data with constant, step-wise predictions"
  )

# Plot the tree structure itself
rpart.plot(dt_model, main = "Structure of the Regression Tree")

Decision Trees naturally capture non-linearities but are prone to overfitting. The predictions create a step-function, which might not be ideal for all problems. We control overfitting by pruning the tree or setting hyperparameters like maxdepth or minsplit.

Underfitting vs. Overfitting: Tree Depth

The maxdepth hyperparameter directly controls the complexity (and therefore the bias-variance trade-off) of a regression tree. Let us fit four trees at increasing depths and compare their prediction curves.

depths <- c(1, 2, 4, 20)

depth_df <- map_dfr(depths, function(d) {
  mdl  <- rpart(y ~ X, data = linear_data, method = "anova",
                control = rpart.control(maxdepth = d, minsplit = 2, cp = -1))
  pred_g <- data.frame(X = seq(min(linear_data$X), max(linear_data$X), length.out = 400))
  pred_g$pred  <- predict(mdl, newdata = pred_g)
  pred_g$depth <- factor(paste0("Max Depth = ", d),
                          levels = paste0("Max Depth = ", depths))
  pred_g
})

# Training RMSE per depth for subtitle annotation
rmse_per_depth <- map_dfr(depths, function(d) {
  mdl  <- rpart(y ~ X, data = linear_data, method = "anova",
                control = rpart.control(maxdepth = d, minsplit = 2, cp = -1))
  data.frame(depth = d,
             rmse  = round(sqrt(mean((linear_data$y - predict(mdl, linear_data))^2)), 3))
})

depth_labels <- paste0("Depth ", rmse_per_depth$depth,
                        " (train RMSE = ", rmse_per_depth$rmse, ")")

ggplot() +
  geom_point(data = linear_data, aes(x = X, y = y),
    alpha = 0.25, color = "grey55", size = 1.2) +
  geom_line(data = depth_df, aes(x = X, y = pred),
    color = "#E91E63", linewidth = 1.05) +
  geom_smooth(data = linear_data, aes(x = X, y = y),
    method = "lm", se = FALSE, color = "#607D8B",
    linewidth = 0.6, linetype = "dashed", formula = y ~ x) +
  facet_wrap(~ depth, nrow = 2) +
  labs(
    title    = "Decision Tree Depth: Underfitting → Overfitting",
    subtitle = "Shallow trees are too simple (high bias); deep trees memorise every data point (high variance). Dashed = linear baseline.",
    x = "Feature (X)", y = "Target (y)"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 11))

Training RMSE drops as depth increases — but deep trees overfit and will generalise poorly.
Max Depth   Training RMSE
        1           1.788
        2           1.142
        4           0.642
       20           0.000

The tree with maxdepth = 20 achieves near-zero training RMSE by memorising the data. On unseen data it would perform much worse. This is the classic overfitting trap. A depth of 2–4 is usually a sensible starting point for regression trees.


Part 4: Random Forest Regression

Random Forests are an ensemble method that fixes many of the problems of individual Decision Trees, especially their instability and tendency to overfit.

How it Works:

  1. Bootstrapping: It creates many (e.g., 500) slightly different versions of the training dataset by sampling with replacement.
  2. Random Feature Selection: It builds a Decision Tree on each of these new datasets. However, at each split in the tree, it only considers a random subset of the features.
  3. Averaging: To make a prediction, it gets a prediction from every tree in the forest and averages them. This “wisdom of the crowd” approach results in a much smoother and more robust prediction.
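The three steps above can be sketched by hand with plain rpart trees. This is bagging only; the real randomForest additionally samples a random subset of features at every split, which matters once there is more than one feature (here the data is synthetic, for illustration):

```r
library(rpart)

set.seed(1)
df <- data.frame(X = runif(100, 0, 4))
df$y <- 4 + 3 * df$X + rnorm(100)

n_trees <- 25
preds <- sapply(seq_len(n_trees), function(i) {
  boot <- df[sample(nrow(df), replace = TRUE), ]        # step 1: bootstrap sample
  tree <- rpart(y ~ X, data = boot, method = "anova")   # step 2: fit a tree on it
  predict(tree, newdata = df)
})
bagged_pred <- rowMeans(preds)                          # step 3: average the crowd
```

Each column of preds is one tree's (jagged, step-wise) prediction; averaging the columns smooths out the individual trees' quirks.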

Visualizing a Random Forest

Let’s build a Random Forest and see how its predictions compare.

set.seed(MASTER_SEED)
# Build a Random Forest model
# ntree is the number of trees in the forest
rf_model <- randomForest(y ~ X, data = linear_data, ntree = 500)

linear_data$rf_pred <- predict(rf_model, linear_data)

# Plot the results
ggplot(linear_data, aes(x = X, y = y)) +
  geom_point(alpha = 0.4, color="gray") +
  geom_line(aes(y = rf_pred), color = "purple", linewidth = 1.2) +
  labs(
    title = "Random Forest Regression",
    subtitle = "Averaging 500 trees results in a much smoother and more robust fit"
  )

Notice how the Random Forest prediction is much smoother than the single Decision Tree. It captures the underlying trend of the data without the jagged steps.

Random Forests are one of the most powerful and widely used “out-of-the-box” models for both regression and classification because they are robust, highly accurate, and require less hyperparameter tuning than models like SVR.

OOB Error vs. Number of Trees

One of the most useful diagnostics for a Random Forest is the Out-of-Bag (OOB) error curve. Because each tree is trained on a bootstrap sample, roughly 37% of the data is not used to train each tree. This held-out portion can be used to estimate test error for free, without a separate validation set.
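The ~37% figure comes from a simple probability fact: the chance a given row is never drawn in a bootstrap sample of size n is \((1 - 1/n)^n \to e^{-1} \approx 0.368\). It is easy to confirm by simulation:

```r
set.seed(1)
n <- 1000
frac_oob <- replicate(500, {
  drawn <- sample(n, replace = TRUE)     # one bootstrap sample of row indices
  1 - length(unique(drawn)) / n          # fraction of rows never drawn
})
mean(frac_oob)   # close to exp(-1) ≈ 0.368
```

So each tree has, on average, about a third of the data available as its own private test set.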

set.seed(MASTER_SEED)
rf_oob <- randomForest(y ~ X, data = linear_data, ntree = 300, keep.forest = FALSE)

oob_df <- data.frame(
  ntree   = seq_len(300),
  oob_mse = rf_oob$mse,
  oob_r2  = rf_oob$rsq
)

best_n   <- which.min(oob_df$oob_mse)
best_mse <- round(min(oob_df$oob_mse), 4)

p_oob_mse <- ggplot(oob_df, aes(x = ntree, y = oob_mse)) +
  geom_line(color = "#9C27B0", linewidth = 1.1) +
  geom_vline(xintercept = best_n, linetype = "dashed", color = "firebrick") +
  annotate("text", x = best_n + 8, y = max(oob_df$oob_mse) * 0.92,
    label = sprintf("min OOB MSE ≈ %.4f\n(at %d trees)", best_mse, best_n),
    color = "firebrick", size = 3.2, hjust = 0) +
  labs(title = "OOB MSE vs. Number of Trees",
       x = "ntree", y = "OOB MSE") +
  theme_minimal()

p_oob_r2 <- ggplot(oob_df, aes(x = ntree, y = oob_r2)) +
  geom_line(color = "#00796B", linewidth = 1.1) +
  labs(title = "OOB R² vs. Number of Trees",
       x = "ntree", y = "OOB R²") +
  theme_minimal()

grid.arrange(p_oob_mse, p_oob_r2, ncol = 2,
  top = "Random Forest: Out-of-Bag Error Convergence (no test set needed)")

The OOB error drops rapidly for the first 50–100 trees and then plateaus. Adding more trees beyond this point wastes computation without improving accuracy. The OOB curve is a free convergence diagnostic — use it to choose ntree without a separate validation split.

Feature Importance

An added benefit of Random Forests is that they can tell you which features were most important for making predictions.

set.seed(MASTER_SEED)
# For a multi-feature dataset, this would be more interesting
# Let's add a random, useless feature to see what happens
linear_data$random_feature <- rnorm(nrow(linear_data))

rf_model_2 <- randomForest(y ~ X + random_feature, data = linear_data, ntree = 500, importance = TRUE)

# Get and plot importance
importance(rf_model_2) 
##                   %IncMSE IncNodePurity
## X              91.8406528     1093.2644
## random_feature  0.9900817      180.7725
varImpPlot(rf_model_2)

The plot clearly shows that our original feature X is far more important than the random_feature we added, which is exactly what we would expect.


Part 5: Model Comparison

Let us put all four models side by side on the same data and compare their predictions visually and numerically.

# Collect all predictions on the original data (sorted for smooth lines)
compare_df <- linear_data %>%
  select(X, y, svr_pred, dt_pred, rf_pred) %>%
  arrange(X) %>%
  pivot_longer(cols = c(svr_pred, dt_pred, rf_pred),
               names_to = "model", values_to = "pred") %>%
  mutate(model = recode(model,
    svr_pred = "SVR (linear)",
    dt_pred  = "Decision Tree",
    rf_pred  = "Random Forest"
  ))

# Also add lm predictions
lm_preds <- data.frame(
  X     = linear_data$X,
  y     = linear_data$y,
  pred  = predict(model_lm),
  model = "Linear Regression"
)

compare_all <- bind_rows(
  compare_df,
  lm_preds %>% select(X, y, pred, model)
)

ggplot(compare_all %>% filter(!is.na(pred)),
       aes(x = X, y = y)) +
  geom_point(alpha = 0.3, color = "grey50", size = 1.5) +
  geom_line(aes(y = pred, color = model), linewidth = 1.1) +
  scale_color_manual(values = c(
    "Linear Regression" = "#2196F3",
    "SVR (linear)"      = "#4CAF50",
    "Decision Tree"     = "#FF9800",
    "Random Forest"     = "#9C27B0"
  )) +
  labs(
    title    = "All Four Regression Models on the Same Data",
    subtitle = "Linear models produce straight lines; tree-based models are non-parametric",
    x = "Feature (X)", y = "Target (y)", color = "Model"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Training RMSE of each regression model (lower is better)

Model               Training RMSE
Linear Regression          0.8547
SVR (linear)               0.8672
Decision Tree              0.8520
Random Forest              0.5201

Note: Training RMSE is shown here for illustration only. In practice, always evaluate on a held-out test set or use cross-validation to get an unbiased estimate of generalization performance.
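To make the note concrete, here is a hedged sketch of the honest workflow: split the data, fit on the training portion only, and report RMSE on the untouched test portion. Synthetic data and all variable names are illustrative.

```r
# Held-out RMSE sketch: fit on train, evaluate on test (synthetic data)
set.seed(123)
n <- 300
df <- data.frame(X = runif(n, 0, 10))
df$y <- 3 + 2 * df$X + rnorm(n)      # true noise sd = 1

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit <- lm(y ~ X, data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$y, predict(fit, train))
test_rmse  <- rmse(test$y,  predict(fit, test))

c(train = train_rmse, test = test_rmse)   # both near the noise sd of 1
```

For a low-variance model like lm on well-specified data the two numbers are close; for a Random Forest, the gap between training and test RMSE is typically much larger, which is exactly why the 0.5201 above should not be taken at face value.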


Knowledge Check

Question 1: What does the Normal Equation compute directly, and when would you prefer Gradient Descent? Answer: The Normal Equation computes the optimal parameter vector \(\hat{\theta} = (X^T X)^{-1} X^T y\) in closed form, with no iteration needed. It is exact and fast for small feature counts. However, \(X^T X\) is an \(n \times n\) matrix (where \(n\) is the number of features), and inverting it costs \(O(n^3)\), so the approach becomes prohibitively slow when \(n\) is large (e.g., > 10,000). Gradient Descent scales to very large feature counts because each step needs only matrix-vector products, never a matrix inversion.
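The closed form is easy to verify numerically. A small sketch on synthetic data (variable names are illustrative), checked against lm(), which solves the same least-squares problem via QR decomposition:

```r
# Normal Equation: solve (X^T X) theta = X^T y directly
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 4 + 3 * x + rnorm(n)

X <- cbind(1, x)                             # design matrix: intercept column + feature
theta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y, via linear solve

coef_lm <- coef(lm(y ~ x))                   # lm() fits the same model
```

Note the use of solve(A, b) rather than solve(A) %*% b: solving the linear system is cheaper and numerically safer than forming the explicit inverse.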


Question 2: What is the epsilon-insensitive tube in SVR, and what role does the C parameter play? Answer: The epsilon tube is a margin of width \(2\varepsilon\) around the regression line; its width is set by \(\varepsilon\), not by C. Points inside the tube incur zero loss: SVR does not try to fit them more precisely than \(\varepsilon\). Points outside the tube (margin violations) are penalised, and C controls how heavily. A small C makes violations cheap, yielding a flatter, simpler model that may underfit; a large C makes violations costly, yielding a model that tracks the training data closely and can overfit.
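The epsilon-insensitive loss itself is a one-liner. A minimal sketch (function name eps_loss is hypothetical, not part of e1071):

```r
# Epsilon-insensitive loss: zero inside the tube, linear outside it
eps_loss <- function(residual, epsilon = 0.5) {
  pmax(0, abs(residual) - epsilon)
}

# Residuals within +/- epsilon incur zero loss; beyond that,
# the loss grows linearly with the distance from the tube edge.
eps_loss(c(-1.0, -0.3, 0, 0.2, 0.8))
```

Compare this with squared loss, which penalises every nonzero residual: the flat region of width \(2\varepsilon\) is what makes many training points irrelevant to the fit, so only the points on or outside the tube become support vectors.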


Question 3: Why does a Decision Tree produce step-function predictions for regression? Answer: Each leaf node of a regression tree predicts the mean of all training observations that fall into it. Since the tree recursively splits the feature space into rectangular regions (axis-aligned partitions), the predicted value is constant within each region — resulting in a step function. Deeper trees produce more (smaller) steps and can overfit; shallower trees produce fewer steps and may underfit.
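One way to see the step function directly is to count distinct predictions: since every leaf predicts a single constant, a fitted tree can emit at most as many distinct values as it has leaves. A quick sketch with rpart on synthetic data (variable names illustrative):

```r
library(rpart)

set.seed(7)
df <- data.frame(X = runif(200, 0, 10))
df$y <- sin(df$X) + rnorm(200, sd = 0.2)   # non-linear signal + noise

tree  <- rpart(y ~ X, data = df)
preds <- predict(tree, df)

# Each leaf predicts the mean of its training observations, so the
# number of distinct predicted values is at most the number of leaves.
n_leaves <- sum(tree$frame$var == "<leaf>")
c(distinct_predictions = length(unique(preds)), leaves = n_leaves)
```

Plotting preds against X would show flat segments with jumps at the split points, which is the step function described above.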


Question 4: How does a Random Forest reduce variance compared to a single Decision Tree? Answer: A single tree has high variance — small changes in training data lead to very different trees. Random Forests average predictions from ntree trees each trained on a bootstrap sample with random feature subsets. By the bias-variance decomposition, if trees have variance \(\sigma^2\) and pairwise correlation \(\rho\), the ensemble variance is \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\). The second term shrinks with more trees; the random feature subsets reduce \(\rho\), further reducing variance.
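The variance formula can be checked with a small Monte Carlo simulation. Below, equicorrelated "predictions" are generated from a shared component plus independent noise (synthetic draws, not actual trees, and all names are illustrative); averaging them should land near \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\).

```r
# Simulation: variance of an average of B equicorrelated estimators
set.seed(99)
B <- 50; sigma <- 1; rho <- 0.3; n_sims <- 20000

# Shared component contributes rho*sigma^2; idiosyncratic part the rest,
# so each column has variance sigma^2 and pairwise correlation rho.
shared <- rnorm(n_sims, sd = sqrt(rho) * sigma)
preds  <- shared + matrix(rnorm(n_sims * B, sd = sqrt(1 - rho) * sigma),
                          nrow = n_sims, ncol = B)

ensemble    <- rowMeans(preds)
theoretical <- rho * sigma^2 + (1 - rho) / B * sigma^2

c(simulated = var(ensemble), theoretical = theoretical)
```

Note that the \(\rho\sigma^2\) term does not shrink no matter how many trees you add, which is why the random feature subsets (which lower \(\rho\)) matter as much as the averaging itself.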

Summary

| Model             | Assumption            | Strengths                                         | Weaknesses                                        |
|-------------------|-----------------------|---------------------------------------------------|---------------------------------------------------|
| Linear Regression | Linear relationship   | Fast, interpretable, no tuning                    | Fails on non-linear patterns                      |
| SVR               | Margin + kernel trick | Works in high dimensions, kernel flexibility      | Sensitive to feature scaling, slow on large \(n\) |
| Decision Tree     | Recursive partitions  | Non-linear, no scaling needed                     | High variance, step-wise predictions, overfits    |
| Random Forest     | Ensemble of trees     | Low variance, robust, built-in feature importance | Less interpretable, slower prediction             |

The right model depends on your data and problem:

  • Start with Linear Regression as a baseline — it is fast and interpretable.
  • Use SVR when you have a small-to-medium dataset with complex, potentially non-linear patterns (switch to kernel = "radial").
  • Use Decision Trees when interpretability is essential.
  • Use Random Forests when you want the best out-of-the-box accuracy with minimal hyperparameter tuning.

References

[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. O’Reilly Media, 2019, ch. 4–7.

[2] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. CRC Press, 2020, ch. 7–10. Available: https://bradleyboehmke.github.io/HOML/

[3] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2011.

[4] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[5] Z. Huang, “seedhash: Deterministic Seed Generation from String Inputs Using MD5 Hashing,” R package, 2026. Available: https://github.com/melhzy/seedhash