library(seedhash)
gen <- SeedHashGenerator$new("ANLY530-Week04-Regression")
MASTER_SEED <- gen$generate_seeds(1)
set.seed(MASTER_SEED)

Random Seed Source: seedhash package — input string: "ANLY530-Week04-Regression" → seed: -887874702
Welcome to the world of regression! Regression analysis is a cornerstone of predictive modeling. At its heart, it is about understanding and predicting a continuous value. Unlike classification, where we predict a category (like “Yes/No” or “Cat/Dog”), regression helps us answer questions like “How much will this house sell for?” or “How many units will we sell next month?”
The goal is to find a mathematical relationship between a set of predictor variables (features) and a target variable (the continuous value we want to predict).
Learning Objectives:

- Fit and interpret a linear regression model using both the Normal Equation and lm().
- Understand how Gradient Descent minimises the MSE and how the learning rate affects convergence.
- Evaluate regression models with R-squared, MSE, and RMSE, and check assumptions with residual diagnostics.
- Apply Support Vector Regression, Decision Trees, and Random Forests to the same data and compare them.
We will use a few key packages for this tutorial:

- tidyverse: for data manipulation and visualization (ggplot2, dplyr)
- rpart & rpart.plot: for Decision Trees
- e1071: for Support Vector Machines
- randomForest: for Random Forest models
- gridExtra: for multi-panel plot layouts
- scales: for axis label formatting

# install.packages(c("tidyverse", "rpart", "rpart.plot", "e1071", "randomForest", "gridExtra"))
library(tidyverse)
library(rpart)
library(rpart.plot)
library(e1071)
library(randomForest)
library(gridExtra)
library(scales)

Linear Regression is the simplest and most famous regression model. It assumes a linear relationship between the features and the target variable. In the case of a single feature, this means fitting a straight line to the data.
The Equation of a Line:
\[ \hat{y} = \theta_0 + \theta_1 x_1 \]
Our goal is to find the best values for \(\theta_0\) and \(\theta_1\) that make our line fit the data as closely as possible.
Let’s create some data that has a clear linear trend, but with some random noise, which is typical of real-world data.
set.seed(MASTER_SEED)
# Create a dataset with a clear linear trend + noise
X <- 2 * runif(100, 0, 2) # 100 data points between 0 and 4
y <- 4 + 3 * X + rnorm(100) # y = 4 + 3x + noise
linear_data <- data.frame(X, y)
# Plot the data
ggplot(linear_data, aes(x = X, y = y)) +
geom_point(alpha = 0.7) +
labs(title = "Sample Data for Linear Regression", x = "Feature (X)", y = "Target (y)")

Before we fit a model, we need to understand what we are trying to minimize. A residual is the vertical distance between a data point and the model’s prediction — the error the model makes on a single observation.
\[e_i = y_i - \hat{y}_i\]
# Fit the model first so we can show residuals
model_lm_tmp <- lm(y ~ X, data = linear_data)
fitted_tmp <- predict(model_lm_tmp)
set.seed(MASTER_SEED)
resid_show <- data.frame(
X = linear_data$X,
y = linear_data$y,
fitted = fitted_tmp,
resid = linear_data$y - fitted_tmp
) %>% slice_sample(n = 18) # highlight 18 representative points
ggplot(linear_data, aes(x = X, y = y)) +
geom_point(alpha = 0.25, color = "grey60", size = 1.5) +
geom_smooth(method = "lm", se = FALSE, color = "#1976D2", linewidth = 1.3) +
geom_segment(data = resid_show,
aes(x = X, xend = X, y = y, yend = fitted),
color = "firebrick", linewidth = 0.8, alpha = 0.85) +
geom_point(data = resid_show, aes(x = X, y = y),
color = "firebrick", size = 2.5) +
labs(
title = "Visualising Residuals",
subtitle = "Red vertical segments = residuals (actual − predicted). OLS minimises the sum of their squares.",
x = "Feature (X)", y = "Target (y)"
) +
theme_minimal()

The Ordinary Least Squares (OLS) criterion finds the unique line that minimises the Mean Squared Error (MSE):
\[\text{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2\]
Squaring the residuals has two benefits: it penalises large errors more heavily, and it makes the objective differentiable so we can solve it analytically.
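To make the objective concrete, here is a minimal sketch that evaluates the MSE of two candidate lines by hand. It simulates its own data (`X_demo`, `y_demo`, with an arbitrary seed rather than MASTER_SEED) so it does not interfere with the tutorial's variables.

```r
# Minimal sketch: compute the MSE of a candidate line by hand.
# Uses its own simulated data so it does not clobber the tutorial's X and y.
set.seed(42) # arbitrary seed for this illustration
X_demo <- 2 * runif(100, 0, 2)
y_demo <- 4 + 3 * X_demo + rnorm(100)

mse_line <- function(theta0, theta1) {
  mean((y_demo - (theta0 + theta1 * X_demo))^2)
}

mse_line(0, 0)  # a poor line: large MSE
mse_line(4, 3)  # near the true parameters: MSE close to the noise variance
```

A line near the true parameters leaves only the irreducible noise as error, which is why its MSE sits close to 1 (the variance of the simulated noise).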
Minimising the Sum of Squared Errors (SSE) gives the same line as minimising the MSE, since the two differ only by the constant factor \(1/m\).
We want to find the line (the \(\theta\) values) that makes this MSE as small as possible.
There are two main ways to do this: a closed-form solution (the Normal Equation) and an iterative one (Gradient Descent).
The Normal Equation is a mathematical formula that gives you the optimal \(\theta\) values directly.
\[\hat{\theta} = (X^T X)^{-1} X^T y\]
This looks intimidating, but it’s just a series of matrix operations that R can handle easily.
# Add x0 = 1 to each instance for the intercept term
X_b <- cbind(1, linear_data$X)
# Calculate theta using the Normal Equation
theta_best <- solve(t(X_b) %*% X_b) %*% t(X_b) %*% linear_data$y
# Print the results
cat(paste("Intercept (theta0):", round(theta_best[1], 2), "\n"))
cat(paste("Coefficient (theta1):", round(theta_best[2], 2), "\n"))
## Intercept (theta0): 3.89
## Coefficient (theta1): 3.03
# R's built-in lm() function does the same thing
model_lm <- lm(y ~ X, data = linear_data)
print(summary(model_lm))
##
## Call:
## lm(formula = y ~ X, data = linear_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9741 -0.5703 0.0813 0.6243 2.3544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88981 0.17246 22.55 <2e-16 ***
## X 3.02880 0.07438 40.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8634 on 98 degrees of freedom
## Multiple R-squared: 0.9442, Adjusted R-squared: 0.9436
## F-statistic: 1658 on 1 and 98 DF, p-value: < 2.2e-16
The Normal Equation is fast and precise for smaller datasets. However, it can become very slow if you have a huge number of features, as inverting a large matrix is computationally expensive.
It is easier to understand why Gradient Descent works when you can see the landscape it is navigating. For two parameters (\(\theta_0\) and \(\theta_1\)), the MSE is a bowl-shaped surface — a convex function. Because the surface is convex, Gradient Descent with a suitable learning rate is guaranteed to reach the global minimum.
# Build MSE grid over a range around the optimal parameters
t0_seq <- seq(theta_best[1] - 3, theta_best[1] + 3, length.out = 120)
t1_seq <- seq(theta_best[2] - 1.8, theta_best[2] + 1.8, length.out = 120)
cost_grid <- expand.grid(theta0 = t0_seq, theta1 = t1_seq)
cost_grid$mse <- apply(cost_grid, 1, function(row) {
pred <- row["theta0"] + row["theta1"] * linear_data$X
mean((linear_data$y - pred)^2)
})
# Simulate a GD path starting from a fixed off-minimum point
th_path <- c(theta_best[1] - 2.6, theta_best[2] - 1.5)
gd_path <- data.frame(t0 = th_path[1], t1 = th_path[2])
m_cs <- nrow(linear_data)
for (i in 1:60) {
X_b_cs <- cbind(1, linear_data$X)
th_mat <- matrix(th_path, ncol = 1)
grad <- 2 / m_cs * t(X_b_cs) %*% (X_b_cs %*% th_mat - linear_data$y)
th_path <- th_path - 0.1 * as.vector(grad)
gd_path <- rbind(gd_path, data.frame(t0 = th_path[1], t1 = th_path[2]))
}
ggplot(cost_grid, aes(x = theta0, y = theta1, z = mse)) +
geom_contour_filled(bins = 18, alpha = 0.85) +
geom_path(data = gd_path, aes(x = t0, y = t1, z = NULL),
color = "white", linewidth = 1.1,
arrow = arrow(length = unit(0.18, "cm"), ends = "last", type = "closed")) +
geom_point(data = data.frame(t0 = theta_best[1], t1 = theta_best[2]),
aes(x = t0, y = t1, z = NULL), color = "red", size = 5, shape = 8) +
labs(
title = "MSE Cost Surface — the Bowl that Gradient Descent Navigates",
subtitle = "Each contour is a level of equal MSE. White path = GD trajectory. Red star = optimum.",
x = expression(theta[0]~"(intercept)"),
y = expression(theta[1]~"(slope)"),
fill = "MSE"
) +
theme_minimal() +
theme(legend.position = "right")

The concentric rings (contour lines) represent equal levels of MSE. The algorithm starts far from the centre and follows the steepest downhill direction at each step, spiralling inward to the red-star minimum.
Gradient Descent is a more general-purpose optimization algorithm. Imagine you are on a mountain in thick fog and you want to get to the bottom. A good strategy would be to feel the slope of the ground beneath your feet and take a step in the steepest downhill direction. You repeat this until you reach the valley floor.
That’s exactly what Gradient Descent does:

1. Start with random values for the parameters \(\theta\).
2. Compute the gradient of the MSE with respect to \(\theta\) — the direction of steepest ascent.
3. Take a step in the opposite (downhill) direction, scaled by the learning rate.
4. Repeat until the parameters stop changing meaningfully.
The Learning Rate (\(\eta\)): This is a critical hyperparameter that controls the size of the steps. Too small, and the algorithm will take too long to converge. Too large, and it might overshoot the minimum and diverge.
Let’s visualize the process of finding the best line.
set.seed(MASTER_SEED + 1)
# A simplified manual implementation for visualization
eta <- 0.1 # Learning rate
n_iterations <- 1000
m <- nrow(linear_data)
theta <- matrix(rnorm(2), ncol=1) # Random initialization
# Create a plot object
p <- ggplot(linear_data, aes(x = X, y = y)) + geom_point(alpha=0.6)
# Run Gradient Descent and plot the first few lines
for (iteration in 1:n_iterations) {
gradients <- 2/m * t(X_b) %*% (X_b %*% theta - y)
theta <- theta - eta * gradients
# Plot first 10 iterations
if(iteration <= 10) {
p <- p + geom_abline(intercept = theta[1], slope = theta[2], color=rainbow(10)[iteration], alpha=0.5)
}
}
# Add the final line
p <- p + geom_abline(intercept = theta[1], slope = theta[2], color = "black", linewidth = 1.2) +
labs(title = "Gradient Descent: Finding the Best Line",
subtitle = "Colored lines show first 10 iterations, black line is the final fit")
print(p)
cat(paste("Final Intercept:", round(theta[1], 2), "\n"))
cat(paste("Final Coefficient:", round(theta[2], 2), "\n"))
## Final Intercept: 3.89
## Final Coefficient: 3.03
As you can see, the algorithm starts with a random line and iteratively adjusts it until it converges on the best-fit line, which is identical to the one found by the Normal Equation.
The learning rate \(\eta\) is perhaps the most important hyperparameter to tune. The plot below shows what happens with three very different values.
simulate_gd_mse <- function(eta, n_iter = 80, start_offset = c(-2.6, -1.5)) {
th <- matrix(theta_best + start_offset, ncol = 1)
X_b_lr <- cbind(1, linear_data$X)
m_lr <- nrow(linear_data)
path <- numeric(n_iter + 1)
path[1] <- mean((linear_data$y - (X_b_lr %*% th))^2)
for (i in seq_len(n_iter)) {
grad <- 2 / m_lr * t(X_b_lr) %*% (X_b_lr %*% th - linear_data$y)
th <- th - eta * grad
path[i + 1] <- mean((linear_data$y - (X_b_lr %*% th))^2)
}
data.frame(Iteration = 0:n_iter, MSE = path,
Rate = sprintf("η = %.2f", eta)) # sprintf keeps trailing zeros
}
lr_df <- bind_rows(
simulate_gd_mse(0.02),
simulate_gd_mse(0.10),
simulate_gd_mse(0.50)
) %>%
mutate(
MSE = pmin(MSE, 500), # cap diverging values for readability
Rate = factor(Rate, levels = c("η = 0.02", "η = 0.10", "η = 0.50"))
)
ggplot(lr_df, aes(x = Iteration, y = MSE, color = Rate)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c(
"η = 0.02" = "#FF9800",
"η = 0.10" = "#4CAF50",
"η = 0.50" = "#E91E63"
)) +
scale_y_log10(labels = scales::comma) +
geom_hline(yintercept = min(lr_df$MSE[lr_df$Rate == "η = 0.10"]),
linetype = "dashed", color = "grey50") +
labs(
title = "Learning Rate Effect on Gradient Descent Convergence",
subtitle = "Too small (orange) → slow. Just right (green) → fast. Too large (pink) → overshoots and diverges.",
x = "Iteration", y = "MSE (log scale)", color = "Learning Rate"
) +
theme_minimal() +
theme(legend.position = "bottom")

Rule of thumb: Start with \(\eta = 0.1\) and halve it if training loss oscillates; double it if convergence is too slow. Modern optimisers like Adam adapt \(\eta\) automatically.
Once we have a model, how do we know if it’s any good? Three common metrics:

- R-squared (\(R^2\)): the proportion of the variance in y that can be explained by X. Values closer to 1 indicate a better fit.
- MSE: the mean of the squared residuals.
- RMSE: the square root of the MSE, expressed in the same units as y.

# Get summary of our lm() model
model_summary <- summary(model_lm)
# Extract metrics
r_squared <- model_summary$r.squared
residuals <- model_summary$residuals
mse <- mean(residuals^2)
rmse <- sqrt(mse)
cat(paste("R-squared:", round(r_squared, 3), "\n"))
cat(paste("MSE:", round(mse, 2), "\n"))
cat(paste("RMSE:", round(rmse, 3), "\n"))
## R-squared: 0.944
## MSE: 0.73
## RMSE: 0.855
A fitted regression line is just an estimate. There are two distinct types of uncertainty we can quantify: a confidence interval (CI) for the position of the true mean line, and a wider prediction interval (PI) for where an individual new observation will land.
\[\text{CI:}\quad \hat{y} \pm t^* \cdot \text{SE}(\hat{y}) \qquad \text{PI:}\quad \hat{y} \pm t^* \cdot \sqrt{\text{SE}(\hat{y})^2 + \hat{\sigma}^2}\]
# compute both intervals over a fine prediction grid
pred_grid <- data.frame(X = seq(min(linear_data$X), max(linear_data$X), length.out = 300))
ci_mat <- predict(model_lm, newdata = pred_grid, interval = "confidence", level = 0.95)
pi_mat <- predict(model_lm, newdata = pred_grid, interval = "prediction", level = 0.95)
band_df <- data.frame(pred_grid,
fit = ci_mat[, "fit"],
ci_lwr = ci_mat[, "lwr"], ci_upr = ci_mat[, "upr"],
pi_lwr = pi_mat[, "lwr"], pi_upr = pi_mat[, "upr"]
)
ggplot(band_df, aes(x = X)) +
geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), fill = "#BBDEFB", alpha = 0.55) +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), fill = "#1976D2", alpha = 0.40) +
geom_line(aes(y = fit), color = "#0D47A1", linewidth = 1.3) +
geom_point(data = linear_data, aes(x = X, y = y), alpha = 0.35, color = "grey40", size = 1.5) +
annotate("text", x = 2.8, y = 16.5,
label = "95 % Prediction Interval\n(where a new observation will land)",
color = "#1565C0", size = 3.4, hjust = 0) +
annotate("text", x = 2.8, y = 14.5,
label = "95 % Confidence Interval\n(uncertainty in the true mean line)",
color = "#0D47A1", size = 3.4, hjust = 0) +
labs(
title = "Confidence Band vs. Prediction Band",
subtitle = "The prediction band is always wider — it accounts for individual-level noise on top of line-estimation uncertainty.",
x = "Feature (X)", y = "Target (y)"
) +
theme_minimal()

After fitting any model, always examine the residuals. Three key assumptions of linear regression can be checked visually:
| Plot | What to look for | Violation signal |
|---|---|---|
| Residuals vs. Fitted | Horizontal band around zero — no pattern | Funnel shape → heteroscedasticity; curve → non-linearity |
| Normal Q-Q | Points on the diagonal line | Heavy tails or S-curve → non-normal errors |
diag_df <- data.frame(
fitted = fitted(model_lm),
residual = residuals(model_lm),
std_res = rstandard(model_lm)
)
p_rv_f <- ggplot(diag_df, aes(x = fitted, y = residual)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_point(alpha = 0.55, color = "#1976D2", size = 1.8) +
geom_smooth(method = "loess", se = FALSE, color = "firebrick",
linewidth = 0.9, formula = y ~ x) +
labs(title = "Residuals vs. Fitted",
subtitle = "Good: points randomly scattered around 0",
x = "Fitted values", y = "Residuals") +
theme_minimal()
p_qq <- ggplot(diag_df, aes(sample = std_res)) +
stat_qq(alpha = 0.55, color = "#1976D2", size = 1.8) +
stat_qq_line(color = "firebrick", linewidth = 0.9) +
labs(title = "Normal Q-Q",
subtitle = "Good: points hug the diagonal line",
x = "Theoretical Quantiles", y = "Standardised Residuals") +
theme_minimal()
grid.arrange(p_rv_f, p_qq, ncol = 2,
             top = "Residual Diagnostics — Checking Linear Regression Assumptions")

Both plots look healthy for our synthetic data: residuals scatter randomly around zero (no pattern) and closely follow the normal diagonal.
Support Vector Machines (SVMs) are famous for classification, but they can be cleverly adapted for regression. This is called Support Vector Regression (SVR).
Instead of trying to find a line that minimizes the error for all points, SVR tries to fit as many data points as possible inside a “street” or “tube” while limiting margin violations.
Two hyperparameters shape the fit. The epsilon (\(\epsilon\)) hyperparameter sets the width of the street (the tube); points inside it incur no penalty. The C hyperparameter controls how heavily margin violations (points outside the street) are penalised: a small C tolerates more violations and gives a flatter, simpler model, while a large C is less tolerant of violations and can overfit.

Let’s visualize how SVR works on our data.
# Build an SVR model
# We use the e1071 package
# Note: SVR can be sensitive to the scale of the data, but we'll ignore that for this simple example
svr_model <- svm(y ~ X, data = linear_data, type = 'eps-regression', kernel = 'linear', epsilon = 0.5, cost = 1)
linear_data$svr_pred <- predict(svr_model, linear_data)
# Identify support vectors
support_vectors <- linear_data[svr_model$index,]
# Plot the results
ggplot(linear_data, aes(x = X, y = y)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = svr_pred), color = "blue", linewidth = 1) +
geom_ribbon(aes(ymin = svr_pred - svr_model$epsilon, ymax = svr_pred + svr_model$epsilon),
fill = "blue", alpha = 0.2) + # The epsilon tube
geom_point(data = support_vectors, aes(x=X, y=y), color = "red", size = 4, shape = 1) + # Highlight support vectors
labs(
title = "Support Vector Regression (SVR)",
subtitle = "The model fits a line inside the blue Epsilon-Insensitive Tube. Red points are Support Vectors."
)

SVR is powerful because it can use the kernel trick (just like for classification) to model complex, non-linear relationships by changing the kernel parameter from linear to something like radial (RBF).
The kernel trick implicitly maps the data into a much higher-dimensional feature space where a linear boundary becomes non-linear in the original space. Let us demonstrate on data that follows a sine curve — something a linear kernel obviously cannot fit.
set.seed(MASTER_SEED + 2)
X_nl <- sort(runif(90, 0, 4 * pi))
y_nl <- sin(X_nl) + rnorm(90, sd = 0.25)
nl_df <- data.frame(X = X_nl, y = y_nl)
svr_lin <- svm(y ~ X, data = nl_df, type = "eps-regression",
kernel = "linear", epsilon = 0.1, cost = 1)
svr_rbf <- svm(y ~ X, data = nl_df, type = "eps-regression",
kernel = "radial", epsilon = 0.1, cost = 5, gamma = 0.3)
pred_grid_nl <- data.frame(X = seq(0, 4 * pi, length.out = 400))
pred_grid_nl$linear <- predict(svr_lin, pred_grid_nl)
pred_grid_nl$rbf <- predict(svr_rbf, pred_grid_nl)
pred_long <- pred_grid_nl %>%
pivot_longer(-X, names_to = "kernel", values_to = "pred") %>%
mutate(kernel = recode(kernel,
linear = "Linear Kernel — fits a straight line only",
rbf = "RBF Kernel — captures the sine curve"
))
ggplot(nl_df, aes(x = X, y = y)) +
geom_point(alpha = 0.45, color = "grey50", size = 1.8) +
geom_line(data = pred_long, aes(y = pred, color = kernel), linewidth = 1.2) +
scale_color_manual(values = c(
"Linear Kernel — fits a straight line only" = "#FF5722",
"RBF Kernel — captures the sine curve" = "#009688"
)) +
scale_x_continuous(breaks = c(0, pi, 2*pi, 3*pi, 4*pi),
labels = c("0", "π", "2π", "3π", "4π")) +
labs(
title = "SVR Kernel Trick: Linear vs. Radial Basis Function (RBF)",
subtitle = "Data follows y = sin(x) + noise. The linear kernel cannot bend; RBF maps it into infinite dimensions.",
x = "X", y = "y", color = NULL
) +
theme_minimal() +
theme(legend.position = "bottom")

The RBF (Gaussian) kernel effectively gives SVR the ability to fit any smooth curve. The gamma parameter controls how tightly the model follows individual points — larger values risk overfitting.
As we saw in a previous tutorial, Decision Trees can also perform regression. Instead of predicting a class at each leaf node, a regression tree predicts a single continuous value.
Let’s use the rpart package to build a regression tree.
# Build a regression tree
dt_model <- rpart(y ~ X, data = linear_data, method = "anova")
linear_data$dt_pred <- predict(dt_model, linear_data)
# Plot the tree's predictions
ggplot(linear_data, aes(x = X, y = y)) +
geom_point(alpha = 0.4, color="gray") +
geom_step(aes(y = dt_pred), color = "darkgreen", linewidth = 1.2) +
labs(
title = "Decision Tree Regression",
subtitle = "The tree approximates the data with constant, step-wise predictions"
)

Decision Trees naturally capture non-linearities but are prone to overfitting. The predictions form a step function, which might not be ideal for all problems. We control overfitting by pruning the tree or setting hyperparameters like maxdepth or minsplit.
The maxdepth hyperparameter directly controls the complexity (and therefore the bias-variance trade-off) of a regression tree. Let us fit four trees at increasing depths and compare their prediction curves.
depths <- c(1, 2, 4, 20)
depth_df <- map_dfr(depths, function(d) {
mdl <- rpart(y ~ X, data = linear_data, method = "anova",
control = rpart.control(maxdepth = d, minsplit = 2, cp = -1))
pred_g <- data.frame(X = seq(min(linear_data$X), max(linear_data$X), length.out = 400))
pred_g$pred <- predict(mdl, newdata = pred_g)
pred_g$depth <- factor(paste0("Max Depth = ", d),
levels = paste0("Max Depth = ", depths))
pred_g
})
# Training RMSE per depth for subtitle annotation
rmse_per_depth <- map_dfr(depths, function(d) {
mdl <- rpart(y ~ X, data = linear_data, method = "anova",
control = rpart.control(maxdepth = d, minsplit = 2, cp = -1))
data.frame(depth = d,
rmse = round(sqrt(mean((linear_data$y - predict(mdl, linear_data))^2)), 3))
})
depth_labels <- paste0("Depth ", rmse_per_depth$depth,
" (train RMSE = ", rmse_per_depth$rmse, ")")
ggplot() +
geom_point(data = linear_data, aes(x = X, y = y),
alpha = 0.25, color = "grey55", size = 1.2) +
geom_line(data = depth_df, aes(x = X, y = pred),
color = "#E91E63", linewidth = 1.05) +
geom_smooth(data = linear_data, aes(x = X, y = y),
method = "lm", se = FALSE, color = "#607D8B",
linewidth = 0.6, linetype = "dashed", formula = y ~ x) +
facet_wrap(~ depth, nrow = 2) +
labs(
title = "Decision Tree Depth: Underfitting → Overfitting",
subtitle = "Shallow trees are too simple (high bias); deep trees memorise every data point (high variance). Dashed = linear baseline.",
x = "Feature (X)", y = "Target (y)"
) +
theme_minimal() +
theme(strip.text = element_text(face = "bold", size = 11))

| Max Depth | Training RMSE |
|---|---|
| 1 | 1.788 |
| 2 | 1.142 |
| 4 | 0.642 |
| 20 | 0.000 |
The tree with maxdepth = 20 achieves near-zero training RMSE by memorising the data. On unseen data it would perform much worse. This is the classic overfitting trap. A depth of 2–4 is usually a sensible starting point for regression trees.
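A quick holdout experiment makes the trap explicit. This sketch uses its own freshly simulated data (`X_demo`, with an arbitrary seed rather than MASTER_SEED), so the exact numbers will differ from the table above:

```r
# Sketch: a train/test split exposes the deep tree's overfitting.
library(rpart)
set.seed(42) # arbitrary seed for this illustration
X_demo <- 2 * runif(100, 0, 2)
dat <- data.frame(X = X_demo, y = 4 + 3 * X_demo + rnorm(100))
train_idx <- sample(nrow(dat), 70)

# Fully grown tree: every split allowed, depth up to 20
deep_tree <- rpart(y ~ X, data = dat[train_idx, ], method = "anova",
                   control = rpart.control(maxdepth = 20, minsplit = 2, cp = -1))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
train_rmse <- rmse(dat$y[train_idx],  predict(deep_tree, dat[train_idx, ]))
test_rmse  <- rmse(dat$y[-train_idx], predict(deep_tree, dat[-train_idx, ]))
train_rmse  # near zero: the tree has memorised the training points
test_rmse   # noticeably larger: the memorised detail does not generalise
```

The gap between training and test RMSE is exactly what the table above cannot show, and is the reason held-out evaluation matters.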
Random Forests are an ensemble method that fixes many of the problems of individual Decision Trees, especially their instability and tendency to overfit.
How it Works:

1. Draw many bootstrap samples (random samples with replacement) from the training data.
2. Grow one decision tree on each bootstrap sample, considering only a random subset of features at each split.
3. Average the predictions of all the trees to form the final prediction.
Let’s build a Random Forest and see how its predictions compare.
set.seed(MASTER_SEED)
# Build a Random Forest model
# ntree is the number of trees in the forest
rf_model <- randomForest(y ~ X, data = linear_data, ntree = 500)
linear_data$rf_pred <- predict(rf_model, linear_data)
# Plot the results
ggplot(linear_data, aes(x = X, y = y)) +
geom_point(alpha = 0.4, color="gray") +
geom_line(aes(y = rf_pred), color = "purple", linewidth = 1.2) +
labs(
title = "Random Forest Regression",
subtitle = "Averaging 500 trees results in a much smoother and more robust fit"
)

Notice how the Random Forest prediction is much smoother than the single Decision Tree. It captures the underlying trend of the data without the jagged steps.
Random Forests are one of the most powerful and widely used “out-of-the-box” models for both regression and classification because they are robust, highly accurate, and require less hyperparameter tuning than models like SVR.
One of the most useful diagnostics for a Random Forest is the Out-of-Bag (OOB) error curve. Because each tree is trained on a bootstrap sample, roughly 37% of the data is not used to train each tree. This held-out portion can be used to estimate test error for free, without a separate validation set.
set.seed(MASTER_SEED)
rf_oob <- randomForest(y ~ X, data = linear_data, ntree = 300, keep.forest = FALSE)
oob_df <- data.frame(
ntree = seq_len(300),
oob_mse = rf_oob$mse,
oob_r2 = rf_oob$rsq
)
best_n <- which.min(oob_df$oob_mse)
best_mse <- round(min(oob_df$oob_mse), 4)
p_oob_mse <- ggplot(oob_df, aes(x = ntree, y = oob_mse)) +
geom_line(color = "#9C27B0", linewidth = 1.1) +
geom_vline(xintercept = best_n, linetype = "dashed", color = "firebrick") +
annotate("text", x = best_n + 8, y = max(oob_df$oob_mse) * 0.92,
label = sprintf("min OOB MSE ≈ %.4f\n(at %d trees)", best_mse, best_n),
color = "firebrick", size = 3.2, hjust = 0) +
labs(title = "OOB MSE vs. Number of Trees",
x = "ntree", y = "OOB MSE") +
theme_minimal()
p_oob_r2 <- ggplot(oob_df, aes(x = ntree, y = oob_r2)) +
geom_line(color = "#00796B", linewidth = 1.1) +
labs(title = "OOB R² vs. Number of Trees",
x = "ntree", y = "OOB R²") +
theme_minimal()
grid.arrange(p_oob_mse, p_oob_r2, ncol = 2,
             top = "Random Forest: Out-of-Bag Error Convergence (no test set needed)")

The OOB error drops rapidly for the first 50–100 trees and then plateaus. Adding more trees beyond this point wastes computation without improving accuracy. The OOB curve is a free convergence diagnostic — use it to choose ntree without a separate validation split.
An added benefit of Random Forests is that they can tell you which features were most important for making predictions.
set.seed(MASTER_SEED)
# For a multi-feature dataset, this would be more interesting
# Let's add a random, useless feature to see what happens
linear_data$random_feature <- rnorm(nrow(linear_data))
rf_model_2 <- randomForest(y ~ X + random_feature, data = linear_data, ntree = 500, importance = TRUE)
# Get and plot importance
importance(rf_model_2)
##                   %IncMSE IncNodePurity
## X              91.8406528     1093.2644
## random_feature  0.9900817      180.7725
varImpPlot(rf_model_2, main = "Variable Importance")

The plot clearly shows that our original feature X is far more important than the random_feature we added, which is exactly what we would expect.
Let us put all four models side by side on the same data and compare their predictions visually and numerically.
# Collect all predictions on the original data (sorted for smooth lines)
compare_df <- linear_data %>%
select(X, y, svr_pred, dt_pred, rf_pred) %>%
arrange(X) %>%
pivot_longer(cols = c(svr_pred, dt_pred, rf_pred),
names_to = "model", values_to = "pred") %>%
mutate(model = recode(model,
svr_pred = "SVR (linear)",
dt_pred = "Decision Tree",
rf_pred = "Random Forest"
))
# Also add lm predictions
lm_preds <- data.frame(
X = linear_data$X,
y = linear_data$y,
pred = predict(model_lm),
model = "Linear Regression"
)
compare_all <- bind_rows(compare_df, lm_preds)
ggplot(compare_all %>% filter(!is.na(pred)),
aes(x = X, y = y)) +
geom_point(alpha = 0.3, color = "grey50", size = 1.5) +
geom_line(aes(y = pred, color = model), linewidth = 1.1) +
scale_color_manual(values = c(
"Linear Regression" = "#2196F3",
"SVR (linear)" = "#4CAF50",
"Decision Tree" = "#FF9800",
"Random Forest" = "#9C27B0"
)) +
labs(
title = "All Four Regression Models on the Same Data",
subtitle = "Linear models produce straight lines; tree-based models are non-parametric",
x = "Feature (X)", y = "Target (y)", color = "Model"
) +
theme_minimal() +
theme(legend.position = "bottom")

| Model | Training RMSE |
|---|---|
| Linear Regression | 0.8547 |
| SVR (linear) | 0.8672 |
| Decision Tree | 0.8520 |
| Random Forest | 0.5201 |
Note: Training RMSE is shown here for illustration only. In practice, always evaluate on a held-out test set or use cross-validation to get an unbiased estimate of generalization performance.
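As a sketch of that practice, here is a 5-fold cross-validation of the linear model in base R, on its own freshly simulated data (arbitrary seed, not MASTER_SEED). Since the simulated noise has standard deviation 1, the honest RMSE estimate should land near 1:

```r
# Sketch: 5-fold cross-validated RMSE for the linear model (base R only).
set.seed(42) # arbitrary seed for this illustration
X_demo <- 2 * runif(100, 0, 2)
dat <- data.frame(X = X_demo, y = 4 + 3 * X_demo + rnorm(100))

folds <- sample(rep(1:5, length.out = nrow(dat)))   # random fold labels
cv_rmse <- sapply(1:5, function(k) {
  fit  <- lm(y ~ X, data = dat[folds != k, ])        # train on 4 folds
  pred <- predict(fit, newdata = dat[folds == k, ])  # predict held-out fold
  sqrt(mean((dat$y[folds == k] - pred)^2))
})
mean(cv_rmse)  # estimate of generalisation RMSE (near 1 for this data)
```

Unlike the training RMSE in the table, every prediction here is made on data the model never saw during fitting.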
SVR: the C parameter controls the regularisation trade-off. A small C tolerates more tube violations and yields a simpler, flatter model; a large C penalises violations heavily and can overfit. The epsilon parameter sets the tube width.

Random Forest: builds ntree trees, each trained on a bootstrap sample with random feature subsets. By the bias-variance decomposition, if trees have variance \(\sigma^2\) and pairwise correlation \(\rho\), the ensemble variance is \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\). The second term shrinks with more trees; the random feature subsets reduce \(\rho\), further reducing variance.
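The formula is easy to check numerically. In this sketch the values of \(\rho\) and \(\sigma^2\) are arbitrary illustrations, not quantities estimated from our forest:

```r
# Sketch: numeric illustration of the ensemble-variance formula
#   rho * sigma2 + (1 - rho) / B * sigma2   for B averaged trees.
rho    <- 0.3  # assumed pairwise correlation between trees (illustrative)
sigma2 <- 1.0  # assumed variance of a single tree (illustrative)
ensemble_var <- function(B) rho * sigma2 + (1 - rho) / B * sigma2

ensemble_var(1)    # 1.0: a single tree keeps its full variance
ensemble_var(500)  # approaches the floor rho * sigma2 = 0.3
```

The floor \(\rho\sigma^2\) explains why de-correlating the trees (via random feature subsets) matters as much as adding more of them.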
| Model | Assumption | Strengths | Weaknesses |
|---|---|---|---|
| Linear Regression | Linear relationship | Fast, interpretable, no tuning | Fails on non-linear patterns |
| SVR | Margin + kernel trick | Works in high dimensions, kernel flexibility | Sensitive to feature scaling, slow on large \(n\) |
| Decision Tree | Recursive partitions | Non-linear, no scaling needed | High variance, step-wise predictions, overfits |
| Random Forest | Ensemble of trees | Low variance, robust, built-in feature importance | Less interpretable, slower prediction |
The right model depends on your data and problem:

- If the relationship looks linear, start with Linear Regression: it is fast and interpretable.
- For non-linear patterns, try SVR with a non-linear kernel (kernel = "radial").
- If you need a model that is easy to explain, a shallow Decision Tree works well.
- For strong out-of-the-box accuracy, reach for a Random Forest.

[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. O’Reilly Media, 2019, ch. 4–7.
[2] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. CRC Press, 2020, ch. 7–10. Available: https://bradleyboehmke.github.io/HOML/
[3] C. Chang and C. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2011.
[4] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[5] Z. Huang, “seedhash: Deterministic Seed Generation from String Inputs Using MD5 Hashing,” R package, 2026. Available: https://github.com/melhzy/seedhash