library(seedhash)
gen <- SeedHashGenerator$new("ANLY530-Week06-NaiveBayes")
MASTER_SEED <- gen$generate_seeds(1)
set.seed(MASTER_SEED)

Random Seed Source: seedhash package — input string: "ANLY530-Week06-NaiveBayes" → seed: -120589735
Over the past five weeks we have built classifiers that work in very different ways — among them Decision Trees, Random Forests, and SVMs.
This week we take yet another approach — one rooted in probability. Instead of drawing boundaries or asking questions, a Naive Bayes classifier asks: “Given what I see, what is the most probable class?”
The idea is surprisingly simple. You start with a rough guess about how common each class is (the prior). Then, for every piece of evidence you observe, you update that guess using Bayes’ Theorem. The word “naive” comes from a bold simplifying assumption: every feature is independent of every other feature, given the class. This assumption is almost never true in reality — yet the classifier works remarkably well anyway.
“Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the ‘naive’ assumption of conditional independence between every pair of features given the value of the class variable.” — scikit-learn documentation [1]
“Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.” — Wikipedia, “Naive Bayes classifier” [2]
Learning Objectives: build intuition for Bayes' Theorem and the naive independence assumption, and fit and evaluate a Naive Bayes classifier with the e1071 package.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
tidyverse,
e1071,
caret,
gridExtra,
scales,
rpart,
randomForest,
knitr,
kableExtra
)
Imagine your email inbox. About 20% of all incoming messages are spam. You receive a new message containing the word “Urgent”. Should you be worried?
Here is what you know:

- 20% of all messages are spam: P(spam) = 0.20
- 60% of spam messages contain the word "Urgent": P(Urgent | spam) = 0.60
- 15% of all messages contain the word "Urgent": P(Urgent) = 0.15
Bayes’ Theorem lets you update your prior belief using the new evidence:
\[P(\text{spam} \mid \text{Urgent}) = \frac{P(\text{Urgent} \mid \text{spam}) \times P(\text{spam})}{P(\text{Urgent})} = \frac{0.60 \times 0.20}{0.15} = 0.80\]
After seeing the word “Urgent”, your belief that this email is spam jumps from 20% to 80%. That is the power of Bayes’ Theorem — it turns prior knowledge plus new evidence into an updated, more informed belief.
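The arithmetic above can be checked directly in R — a quick sketch with the three probabilities from the example hard-coded:

```r
# Bayes' Theorem for the spam example: posterior = likelihood * prior / evidence
prior_spam <- 0.20   # P(spam): 20% of all mail is spam
lik_urgent <- 0.60   # P("Urgent" | spam)
evidence   <- 0.15   # P("Urgent") across all mail
posterior  <- lik_urgent * prior_spam / evidence
posterior
## 0.8
```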
spam_df <- data.frame(
Stage = factor(c("Prior Belief\n(before seeing any words)",
"After Seeing 'Urgent'\n(posterior)"),
levels = c("Prior Belief\n(before seeing any words)",
"After Seeing 'Urgent'\n(posterior)")),
Spam = c(0.20, 0.80),
Ham = c(0.80, 0.20)
) %>%
pivot_longer(cols = c(Spam, Ham), names_to = "Class", values_to = "Probability")
ggplot(spam_df, aes(x = Class, y = Probability, fill = Class)) +
geom_col(width = 0.55) +
geom_text(aes(label = scales::percent(Probability, accuracy = 1)),
vjust = -0.4, fontface = "bold", size = 4.5) +
facet_wrap(~ Stage) +
scale_fill_manual(values = c(Ham = "#4CAF50", Spam = "#E53935")) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1.05)) +
labs(
title = "Bayes' Theorem in Action: Is This Email Spam?",
subtitle = "Seeing the word 'Urgent' shifts our belief from 20% spam to 80% spam",
x = NULL, y = "Probability"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
strip.text = element_text(face = "bold", size = 11))

\[P(C \mid X) = \frac{P(X \mid C) \times P(C)}{P(X)}\]
Every term has a name and a plain-English meaning:
| Symbol | Name | Plain English |
|---|---|---|
| \(P(C)\) | Prior | How common is this class before we look at any features? (Our initial guess.) |
| \(P(X \mid C)\) | Likelihood | If the class really is C, how likely are we to see these particular feature values? |
| \(P(X)\) | Evidence (Marginal) | How likely are these feature values across all classes? (A normalising constant.) |
| \(P(C \mid X)\) | Posterior | After seeing the features, how probable is class C? (Our updated, informed belief.) |
Key insight: The evidence \(P(X)\) is the same for every class. When comparing classes, it cancels out. So in practice we only need: \[P(C \mid X) \propto P(X \mid C) \times P(C)\] Pick the class with the largest numerator — that is the prediction.
Let’s visualise the full Bayesian update for three competing classes. Imagine a doctor diagnosing a patient with symptoms \(X\). There are three possible diseases, each with a different prior probability and a different likelihood of producing these symptoms.
diseases <- data.frame(
Disease = c("Cold", "Flu", "Allergy"),
Prior = c(0.50, 0.30, 0.20),
Likelihood = c(0.20, 0.70, 0.40)
)
diseases$Numerator <- diseases$Prior * diseases$Likelihood
diseases$Posterior <- diseases$Numerator / sum(diseases$Numerator)
diseases_long <- diseases %>%
select(Disease, Prior, Likelihood, Posterior) %>%
pivot_longer(-Disease, names_to = "Component", values_to = "Value") %>%
mutate(Component = factor(Component, levels = c("Prior", "Likelihood", "Posterior")))
ggplot(diseases_long, aes(x = Disease, y = Value, fill = Disease)) +
geom_col(width = 0.55) +
geom_text(aes(label = sprintf("%.0f%%", Value * 100)),
vjust = -0.4, fontface = "bold", size = 4) +
facet_wrap(~ Component, scales = "free_y") +
scale_fill_manual(values = c(Cold = "#42A5F5", Flu = "#EF5350", Allergy = "#FFA726")) +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.85)) +
labs(
title = "Bayesian Update: From Prior to Posterior",
subtitle = "Cold is most common (prior=50%), but Flu best explains the symptoms (likelihood=70%) → Flu wins the posterior",
x = NULL, y = NULL
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
strip.text = element_text(face = "bold", size = 12))

Even though Cold is the most common disease (highest prior), the Flu's much higher likelihood of producing these symptoms makes it the most probable diagnosis after we consider the evidence.
This is a classic dataset from the lecture. We have 14 days of weather observations and whether someone chose to play tennis.
tennis <- data.frame(
Day = 1:14,
Outlook = c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
"Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"),
Temp = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
"Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
Humidity = c("High","High","High","High","Normal","Normal","Normal",
"High","Normal","Normal","Normal","High","Normal","High"),
Windy = c("No","Yes","No","No","No","Yes","Yes",
"No","No","No","Yes","Yes","No","Yes"),
Play = c("No","No","Yes","Yes","Yes","No","Yes",
"No","Yes","Yes","Yes","Yes","Yes","No")
)
knitr::kable(tennis, caption = "The Play Tennis Dataset (14 observations)")

| Day | Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | No | No |
| 2 | Sunny | Hot | High | Yes | No |
| 3 | Overcast | Hot | High | No | Yes |
| 4 | Rain | Mild | High | No | Yes |
| 5 | Rain | Cool | Normal | No | Yes |
| 6 | Rain | Cool | Normal | Yes | No |
| 7 | Overcast | Cool | Normal | Yes | Yes |
| 8 | Sunny | Mild | High | No | No |
| 9 | Sunny | Cool | Normal | No | Yes |
| 10 | Rain | Mild | Normal | No | Yes |
| 11 | Sunny | Mild | Normal | Yes | Yes |
| 12 | Overcast | Mild | High | Yes | Yes |
| 13 | Overcast | Hot | Normal | No | Yes |
| 14 | Rain | Mild | High | Yes | No |
To apply Naive Bayes, we first count how often each feature value appears for each class.
features <- c("Outlook", "Temp", "Humidity", "Windy")
freq_plots <- lapply(features, function(feat) {
freq_df <- tennis %>%
count(!!sym(feat), Play) %>%
group_by(Play) %>%
mutate(Proportion = n / sum(n))
ggplot(freq_df, aes(x = !!sym(feat), y = Proportion, fill = Play)) +
geom_col(position = "dodge", width = 0.6) +
geom_text(aes(label = sprintf("%d\n(%.0f%%)", n, Proportion * 100)),
position = position_dodge(width = 0.6), vjust = -0.3, size = 3) +
scale_fill_manual(values = c(Yes = "#4CAF50", No = "#E53935")) +
scale_y_continuous(limits = c(0, 0.7), labels = scales::percent) +
labs(title = feat, x = NULL, y = "P(feature | Play)") +
theme_minimal(base_size = 10) +
theme(legend.position = "bottom")
})
do.call(grid.arrange, c(freq_plots, ncol = 2,
top = "Frequency Tables: P(Feature Value | Play = Yes/No)"))

How to read: Each bar shows the proportion of times a feature value appears within each class. For example, when Play = Yes, "Overcast" accounts for 4 out of 9 days (44%).
Suppose today’s weather is: Outlook = Sunny, Temp = Cool, Humidity = High, Windy = Yes.
We compute the numerator for each class:
\[P(\text{Yes} \mid X) \propto P(X \mid \text{Yes}) \times P(\text{Yes})\] \[P(\text{No} \mid X) \propto P(X \mid \text{No}) \times P(\text{No})\]
Using the naive independence assumption, \(P(X \mid C)\) becomes a product:
\[P(X \mid C) = P(\text{Sunny} \mid C) \times P(\text{Cool} \mid C) \times P(\text{High} \mid C) \times P(\text{Yes(Windy)} \mid C)\]
yes_count <- sum(tennis$Play == "Yes") # 9
no_count <- sum(tennis$Play == "No") # 5
total <- nrow(tennis) # 14
prior_yes <- yes_count / total
prior_no <- no_count / total
tennis_yes <- tennis[tennis$Play == "Yes", ]
tennis_no <- tennis[tennis$Play == "No", ]
# P(feature | Play = Yes)
p_sunny_yes <- sum(tennis_yes$Outlook == "Sunny") / yes_count
p_cool_yes <- sum(tennis_yes$Temp == "Cool") / yes_count
p_high_yes <- sum(tennis_yes$Humidity == "High") / yes_count
p_windy_yes <- sum(tennis_yes$Windy == "Yes") / yes_count
# P(feature | Play = No)
p_sunny_no <- sum(tennis_no$Outlook == "Sunny") / no_count
p_cool_no <- sum(tennis_no$Temp == "Cool") / no_count
p_high_no <- sum(tennis_no$Humidity == "High") / no_count
p_windy_no <- sum(tennis_no$Windy == "Yes") / no_count
numerator_yes <- p_sunny_yes * p_cool_yes * p_high_yes * p_windy_yes * prior_yes
numerator_no <- p_sunny_no * p_cool_no * p_high_no * p_windy_no * prior_no
posterior_yes <- numerator_yes / (numerator_yes + numerator_no)
posterior_no <- numerator_no / (numerator_yes + numerator_no)
cat("--- Likelihoods ---\n")

## --- Likelihoods ---
cat(sprintf("P(X | Yes) = %.4f × %.4f × %.4f × %.4f = %.6f\n",
p_sunny_yes, p_cool_yes, p_high_yes, p_windy_yes,
p_sunny_yes * p_cool_yes * p_high_yes * p_windy_yes))

## P(X | Yes) = 0.2222 × 0.3333 × 0.3333 × 0.3333 = 0.008230
cat(sprintf("P(X | No) = %.4f × %.4f × %.4f × %.4f = %.6f\n\n",
p_sunny_no, p_cool_no, p_high_no, p_windy_no,
p_sunny_no * p_cool_no * p_high_no * p_windy_no))

## P(X | No) = 0.6000 × 0.2000 × 0.8000 × 0.6000 = 0.057600
## --- Numerators (Likelihood × Prior) ---
cat(sprintf("Score(Yes) = %.6f × %.4f = %.6f\n",
p_sunny_yes * p_cool_yes * p_high_yes * p_windy_yes, prior_yes, numerator_yes))

## Score(Yes) = 0.008230 × 0.6429 = 0.005291
cat(sprintf("Score(No) = %.6f × %.4f = %.6f\n\n",
p_sunny_no * p_cool_no * p_high_no * p_windy_no, prior_no, numerator_no))

## Score(No) = 0.057600 × 0.3571 = 0.020571
## --- Posterior Probabilities ---
## P(Yes | X) = 0.2046 (20.5%)
## P(No | X) = 0.7954 (79.5%)
cat(sprintf("Prediction: %s\n", ifelse(posterior_yes > posterior_no, "Play Tennis!", "Don't Play")))

## Prediction: Don't Play
manual_df <- data.frame(
Class = c("Play = Yes", "Play = No"),
Posterior = c(posterior_yes, posterior_no)
)
ggplot(manual_df, aes(x = Class, y = Posterior, fill = Class)) +
geom_col(width = 0.5) +
geom_text(aes(label = sprintf("%.1f%%", Posterior * 100)),
vjust = -0.4, fontface = "bold", size = 5) +
scale_fill_manual(values = c("Play = Yes" = "#4CAF50", "Play = No" = "#E53935")) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1.1)) +
labs(
title = "Manual Naive Bayes Prediction",
subtitle = "Today: Sunny, Cool, High Humidity, Windy",
x = NULL, y = "Posterior Probability"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")

The full Bayes formula requires computing \(P(X \mid C)\) — the probability of seeing all features together, given a class. For \(n\) features, this joint probability has an astronomical number of combinations to estimate.
The naive assumption says: given the class, every feature is independent of every other feature. Mathematically:
\[P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)\]
This turns one impossibly complex probability into a simple product of \(n\) easy-to-estimate probabilities.
Analogy: Imagine predicting whether a student will pass an exam. Features include hours studied, hours slept, and cups of coffee. In reality, someone who studied many hours probably slept fewer hours and drank more coffee — the features are correlated. The naive assumption pretends these are unrelated, treating each one separately. It sounds absurd, but it works surprisingly well, for several reinforcing reasons summarised in the table below.
Let’s check with the Iris dataset — the same one we’ve used in previous weeks.
data(iris)
cor_mat <- cor(iris[, 1:4])
cor_df <- as.data.frame(as.table(cor_mat))
names(cor_df) <- c("Feature1", "Feature2", "Correlation")
ggplot(cor_df, aes(x = Feature1, y = Feature2, fill = Correlation)) +
geom_tile(color = "white", linewidth = 1) +
geom_text(aes(label = sprintf("%.2f", Correlation)),
color = "black", size = 4, fontface = "bold") +
scale_fill_gradient2(low = "#E53935", mid = "white", high = "#1976D2",
midpoint = 0, limits = c(-1, 1)) +
labs(
title = "Feature Correlation Matrix — Iris Dataset",
subtitle = "Petal.Length and Petal.Width are highly correlated (0.96) — clearly NOT independent!",
x = NULL, y = NULL
) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))

Petal.Length and Petal.Width have a correlation of 0.96 — nearly perfectly correlated. The naive independence assumption is clearly violated. Yet, as we will see shortly, Naive Bayes still achieves excellent accuracy on this dataset. This is why the method is both "naive" and surprisingly effective.
| Reason | Explanation |
|---|---|
| Classification only needs the correct ranking | NB doesn’t need perfect probability estimates — it just needs the correct class to have the HIGHEST score. Even biased estimates often preserve the right ordering. |
| Estimation errors cancel out | If P(x₁|C) is overestimated and P(x₂|C) is underestimated, the product can still be approximately correct. |
| Simple models generalise well | With fewer parameters than complex models, NB is less prone to overfitting, especially with small training sets. |
| Few parameters to estimate | For p features and K classes, NB estimates only p×K parameters. A full joint model would need K × (exponential in p) parameters. |
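The last row of the table can be made concrete. Assuming p binary features and K classes (the sizes below are hypothetical), the parameter counts diverge dramatically:

```r
p <- 20   # number of binary features (hypothetical)
K <- 3    # number of classes (hypothetical)
nb_params    <- p * K          # NB: one conditional probability per feature per class
joint_params <- K * (2^p - 1)  # full joint model over every feature combination
c(naive_bayes = nb_params, full_joint = joint_params)
## naive_bayes  full_joint
##          60     3145725
```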
Different types of data call for different assumptions about how feature values are distributed within each class. This is what distinguishes the three main Naive Bayes variants.
When features are continuous (like height, weight, or petal length), we assume each feature follows a normal (Gaussian) distribution within each class:
\[P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{k}^2}} \exp\left(-\frac{(x_i - \mu_{k})^2}{2\sigma_{k}^2}\right)\]
In plain English: for each class, we compute the mean (\(\mu\)) and standard deviation (\(\sigma\)) of each feature. To evaluate a new data point, we ask: “How likely is this value under each class’s bell curve?”
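As a quick sketch, we can estimate each species' mean and standard deviation for Petal.Length and then evaluate a hypothetical new measurement of 5.0 cm under each bell curve with dnorm():

```r
data(iris)
# Per-species mean and sd of Petal.Length — these ARE the Gaussian NB parameters
params <- aggregate(Petal.Length ~ Species, data = iris,
                    function(v) c(mean = mean(v), sd = sd(v)))
# Density of the value 5.0 under each species' fitted normal curve
dens <- mapply(dnorm,
               mean = params$Petal.Length[, "mean"],
               sd   = params$Petal.Length[, "sd"],
               MoreArgs = list(x = 5.0))
setNames(round(dens, 4), as.character(params$Species))
# virginica's curve is highest at 5.0, so this feature votes for virginica
```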
set.seed(MASTER_SEED)
iris_long <- iris %>%
pivot_longer(cols = 1:4, names_to = "Feature", values_to = "Value")
ggplot(iris_long, aes(x = Value, fill = Species, color = Species)) +
geom_density(alpha = 0.35, linewidth = 0.8) +
facet_wrap(~ Feature, scales = "free", ncol = 2) +
scale_fill_manual(values = c(setosa = "#42A5F5", versicolor = "#66BB6A", virginica = "#EF5350")) +
scale_color_manual(values = c(setosa = "#1565C0", versicolor = "#2E7D32", virginica = "#C62828")) +
labs(
title = "Gaussian NB Intuition: Each Feature Has a Bell Curve Per Class",
subtitle = "For a new flower, we ask: 'Under which species' bell curve is this measurement most likely?'",
x = "Feature Value", y = "Density"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold"))

How to read: Each colored curve is a Gaussian distribution estimated from training data. When a new flower arrives with Petal.Length = 5.0, we evaluate that value under each species' curve. The species whose curve gives the highest density "claims" that evidence.
Let’s zoom into a single test prediction and see how each feature contributes.
set.seed(MASTER_SEED)
test_point <- iris[100, ] # a versicolor
stats_df <- iris %>%
group_by(Species) %>%
summarise(across(1:4, list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
feature_names <- names(iris)[1:4]
species_levels <- levels(iris$Species)
contrib_list <- lapply(feature_names, function(feat) {
x_val <- test_point[[feat]]
x_range <- seq(min(iris[[feat]]) - 0.5, max(iris[[feat]]) + 0.5, length.out = 300)
curves <- lapply(species_levels, function(sp) {
mu <- stats_df[[paste0(feat, "_mean")]][stats_df$Species == sp]
sg <- stats_df[[paste0(feat, "_sd")]][stats_df$Species == sp]
data.frame(x = x_range, density = dnorm(x_range, mu, sg), Species = sp)
}) %>% do.call(rbind, .)
densities_at_x <- sapply(species_levels, function(sp) {
mu <- stats_df[[paste0(feat, "_mean")]][stats_df$Species == sp]
sg <- stats_df[[paste0(feat, "_sd")]][stats_df$Species == sp]
dnorm(x_val, mu, sg)
})
ggplot(curves, aes(x = x, y = density, fill = Species, color = Species)) +
geom_area(alpha = 0.25, position = "identity") +
geom_line(linewidth = 0.8) +
geom_vline(xintercept = x_val, linetype = "dashed", linewidth = 1, color = "black") +
annotate("text", x = x_val, y = max(curves$density) * 0.95,
label = sprintf("x = %.1f", x_val),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c(setosa = "#42A5F5", versicolor = "#66BB6A", virginica = "#EF5350")) +
scale_color_manual(values = c(setosa = "#1565C0", versicolor = "#2E7D32", virginica = "#C62828")) +
labs(title = feat, x = NULL, y = "Density") +
theme_minimal(base_size = 9) +
theme(legend.position = "none")
})
do.call(grid.arrange, c(contrib_list, ncol = 2,
top = sprintf("Gaussian NB: How Each Feature Evaluates Test Point #100 (Actual: %s)\nDashed line = observed value; tallest curve at the dashed line = strongest evidence for that species",
as.character(test_point$Species))))

When features represent counts or frequencies (such as word counts in a document), we use the Multinomial distribution:
\[P(\mathbf{x} \mid C_k) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_{i=1}^{n} p_{ki}^{x_i}\]
This is the go-to model for text classification. Each document is represented as a vector of word frequencies (the “bag of words” model), and the classifier learns which words are most frequent in each category.
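A hypothetical sketch of the multinomial likelihood with base R's dmultinom(): a three-word vocabulary, one document's count vector, and made-up per-class word probabilities:

```r
counts <- c(free = 2, meeting = 0, today = 1)  # word counts in one document
p_spam <- c(0.60, 0.10, 0.30)  # hypothetical P(word | spam)
p_ham  <- c(0.05, 0.60, 0.35)  # hypothetical P(word | ham)
lik_spam <- dmultinom(counts, prob = p_spam)
lik_ham  <- dmultinom(counts, prob = p_ham)
c(spam = lik_spam, ham = lik_ham)  # spam explains these counts far better
```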
When features are binary (present/absent, yes/no, 0/1), we use the Bernoulli distribution:
\[P(\mathbf{x} \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i} (1 - p_{ki})^{(1 - x_i)}\]
Unlike Multinomial NB, Bernoulli NB explicitly models the absence of features. This makes it especially useful for short text classification where knowing a word is missing is informative.
| Variant | Feature Type | Typical Use Case | Key Assumption | R / Python |
|---|---|---|---|---|
| Gaussian NB | Continuous (real-valued) | Sensor readings, measurements, medical data | Features follow a normal distribution within each class | e1071::naiveBayes() / GaussianNB |
| Multinomial NB | Counts / Frequencies | Text classification (word counts), spam filtering | Features are generated by a multinomial distribution | Not in e1071 (use quanteda) / MultinomialNB |
| Bernoulli NB | Binary (0/1) | Short text classification, survey data (yes/no) | Features are binary; explicitly penalises absent features | Not in e1071 (custom) / BernoulliNB |
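Since e1071 has no Bernoulli variant, here is a minimal from-scratch sketch. The function names (bernoulli_nb, predict_bnb) and the toy data are ours for illustration, not a library API:

```r
# Minimal Bernoulli Naive Bayes: X is a 0/1 matrix, y a factor of class labels.
# alpha = 1 gives Laplace smoothing (each feature has 2 possible values).
bernoulli_nb <- function(X, y, alpha = 1) {
  classes <- levels(y)
  priors  <- table(y) / length(y)
  # P(feature present | class), Laplace-smoothed, one column per class
  p_feat <- sapply(classes, function(cl) {
    Xc <- X[y == cl, , drop = FALSE]
    (colSums(Xc) + alpha) / (nrow(Xc) + 2 * alpha)
  })
  list(classes = classes, priors = priors, p_feat = p_feat)
}
predict_bnb <- function(model, x) {
  # Log-score per class: log prior + Bernoulli log-likelihood of each feature
  scores <- sapply(model$classes, function(cl) {
    p <- model$p_feat[, cl]
    log(model$priors[cl]) + sum(x * log(p) + (1 - x) * log(1 - p))
  })
  model$classes[which.max(scores)]
}

# Toy data: 3 binary features, two classes
X <- rbind(c(1, 0, 1), c(1, 1, 1), c(0, 0, 1), c(0, 1, 0), c(0, 0, 0))
y <- factor(c("a", "a", "a", "b", "b"))
m <- bernoulli_nb(X, y)
predict_bnb(m, c(1, 0, 1))  # "a"
```

Note how the `(1 - x) * log(1 - p)` term scores absent features explicitly — the property that distinguishes Bernoulli NB from Multinomial NB.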
We use the same Iris dataset and the same split ratio as previous weeks for a fair comparison.
data(iris)
set.seed(MASTER_SEED)
train_idx <- sample(1:nrow(iris), floor(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test <- iris[-train_idx, ]
cat(sprintf("Training set: %d observations\nTest set: %d observations\n",
nrow(iris_train), nrow(iris_test)))

## Training set: 105 observations
## Test set: 45 observations
R’s e1071::naiveBayes() automatically detects continuous features and fits Gaussian distributions.

nb_model <- naiveBayes(Species ~ ., data = iris_train)
nb_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## setosa versicolor virginica
## 0.2952381 0.3714286 0.3333333
##
## Conditional probabilities:
## Sepal.Length
## Y [,1] [,2]
## setosa 4.996774 0.3851267
## versicolor 5.917949 0.4988786
## virginica 6.634286 0.5651355
##
## Sepal.Width
## Y [,1] [,2]
## setosa 3.454839 0.4365308
## versicolor 2.735897 0.3013017
## virginica 2.991429 0.2934294
##
## Petal.Length
## Y [,1] [,2]
## setosa 1.435484 0.1835727
## versicolor 4.256410 0.4649815
## virginica 5.551429 0.5083835
##
## Petal.Width
## Y [,1] [,2]
## setosa 0.2612903 0.1229564
## versicolor 1.3128205 0.1949082
## virginica 2.0485714 0.2779880
Reading the output: For each feature and each species, the model reports a mean and a standard deviation. These are the \(\mu\) and \(\sigma\) parameters of the Gaussian distributions we saw in Part 4.
param_list <- lapply(1:4, function(i) {
feat <- names(iris)[i]
tbl <- nb_model$tables[[i]]
param_df <- data.frame(
Species = rownames(tbl),
Mean = tbl[, 1],
SD = tbl[, 2]
)
x_range <- seq(min(iris[[feat]]) - 1, max(iris[[feat]]) + 1, length.out = 300)
curves <- lapply(1:3, function(s) {
data.frame(
x = x_range,
density = dnorm(x_range, param_df$Mean[s], param_df$SD[s]),
Species = param_df$Species[s]
)
}) %>% do.call(rbind, .)
ggplot(curves, aes(x = x, y = density, fill = Species, color = Species)) +
geom_area(alpha = 0.3, position = "identity") +
geom_line(linewidth = 0.9) +
scale_fill_manual(values = c(setosa = "#42A5F5", versicolor = "#66BB6A", virginica = "#EF5350")) +
scale_color_manual(values = c(setosa = "#1565C0", versicolor = "#2E7D32", virginica = "#C62828")) +
labs(title = feat, x = feat, y = "Density") +
theme_minimal(base_size = 9) +
theme(legend.position = "bottom")
})
do.call(grid.arrange, c(param_list, ncol = 2,
top = "Gaussian NB: The Learned Bell Curves for Each Feature × Species\n(These curves ARE the model — no boundaries, no trees, just distributions)"))

This is the entire model. Unlike Decision Trees (which store a tree structure) or SVMs (which store support vectors), a Gaussian NB model is just a table of means and standard deviations. This makes it extremely compact and fast.
nb_pred <- predict(nb_model, iris_test)
nb_acc <- mean(nb_pred == iris_test$Species)
cat(sprintf("Naive Bayes Test Accuracy: %.1f%%\n\n", nb_acc * 100))

## Naive Bayes Test Accuracy: 93.3%

conf_mat <- table(Predicted = nb_pred, Actual = iris_test$Species)
conf_mat

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         19          0         0
##   versicolor      0         10         2
##   virginica       0          1        13
conf_df <- as.data.frame(conf_mat)
names(conf_df) <- c("Predicted", "Actual", "Count")
ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Count)) +
geom_tile(color = "white", linewidth = 1.5) +
geom_text(aes(label = Count), size = 6, fontface = "bold") +
scale_fill_gradient(low = "white", high = "#1976D2") +
labs(
title = "Confusion Matrix — Naive Bayes on Iris Test Set",
subtitle = sprintf("Overall Accuracy: %.1f%%", nb_acc * 100),
x = "Actual Class", y = "Predicted Class"
) +
theme_minimal(base_size = 12) +
coord_equal()

Unlike SVMs, Naive Bayes naturally produces probability estimates for each class.
nb_probs <- predict(nb_model, iris_test, type = "raw")
prob_df <- as.data.frame(nb_probs) %>%
mutate(
obs = 1:n(),
actual = iris_test$Species,
predicted = nb_pred,
correct = (predicted == actual),
confidence = apply(nb_probs, 1, max)
)
prob_long <- prob_df %>%
pivot_longer(cols = c(setosa, versicolor, virginica),
names_to = "class", values_to = "prob")
ggplot(prob_long, aes(x = class, y = factor(obs), fill = prob)) +
geom_tile(color = "white", linewidth = 0.3) +
geom_text(aes(label = sprintf("%.2f", prob)), size = 2.2) +
facet_grid(actual ~ ., scales = "free_y", space = "free") +
scale_fill_gradient(low = "white", high = "#1976D2") +
labs(
title = "Naive Bayes: Predicted Class Probabilities for Each Test Observation",
subtitle = "Each row is a test observation; columns show probability of each species",
x = "Predicted Class", y = "Observation", fill = "Probability"
) +
theme_minimal(base_size = 10) +
theme(strip.text.y = element_text(angle = 0, face = "bold"))

Let's visualise the decision regions — the same 2D view we used in Weeks 02-05 for a direct comparison.
iris_2d <- iris[, c("Petal.Length", "Petal.Width", "Species")]
train_2d <- iris_2d[train_idx, ]
test_2d <- iris_2d[-train_idx, ]
nb_2d <- naiveBayes(Species ~ ., data = train_2d)
grid_nb <- expand.grid(
Petal.Length = seq(0.5, 7.5, length.out = 200),
Petal.Width = seq(0, 2.8, length.out = 200)
)
grid_nb$pred <- predict(nb_2d, grid_nb)
palette_iris <- c(setosa = "#AED6F1", versicolor = "#A9DFBF", virginica = "#F9E79F")
point_colors <- c(setosa = "#1A5276", versicolor = "#1E8449", virginica = "#922B21")
ggplot() +
geom_tile(data = grid_nb,
aes(x = Petal.Length, y = Petal.Width, fill = pred), alpha = 0.5) +
geom_point(data = train_2d,
aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species),
size = 2) +
scale_fill_manual(values = palette_iris, guide = "none") +
scale_color_manual(values = point_colors) +
labs(
title = "Naive Bayes Decision Boundary (Petal Features, Gaussian)",
subtitle = "Smooth, curved boundaries emerge from the overlap of Gaussian distributions",
x = "Petal Length (cm)", y = "Petal Width (cm)", color = "Species"
) +
theme_minimal() +
theme(legend.position = "bottom")

Compare with previous weeks: Decision Trees create rectangular regions. Random Forests smooth those rectangles. SVMs draw curved margins around support vectors. Naive Bayes creates smooth, elliptical regions — reflecting the underlying Gaussian distributions.
Suppose in our tennis dataset, “Overcast” weather never appears when Play = No. Then:
\[P(\text{Overcast} \mid \text{No}) = 0\]
Since Naive Bayes multiplies all feature probabilities together, one zero wipes out the entire product — no matter how strong the evidence from other features:
\[P(X \mid \text{No}) = 0 \times \ldots = 0\]
This is clearly wrong. Just because we haven’t seen a combination in training doesn’t mean it’s impossible.
The solution is simple: add a small count (typically 1) to every feature-class combination:
\[P(x_i \mid C_k) = \frac{\text{count}(x_i, C_k) + \alpha}{\text{count}(C_k) + \alpha \times n_i}\]
where \(\alpha\) is the smoothing parameter (1 for Laplace smoothing) and \(n_i\) is the number of possible values for feature \(i\).
Analogy: Imagine you’re rating restaurants. You’ve eaten at Restaurant A ten times and loved it every time (10/10). Restaurant B is brand new — zero reviews. Does that mean Restaurant B has a 0% chance of being good? Of course not. Laplace smoothing is like giving every restaurant 1 positive and 1 negative review before anyone eats there — a modest starting point that gets overwhelmed by real data as reviews accumulate.
outlook_vals <- c("Sunny", "Overcast", "Rain")
raw_counts_no <- sapply(outlook_vals, function(v) sum(tennis_no$Outlook == v))
raw_probs_no <- raw_counts_no / no_count
alpha <- 1
n_categories <- length(outlook_vals)
smoothed_probs_no <- (raw_counts_no + alpha) / (no_count + alpha * n_categories)
smooth_df <- data.frame(
Outlook = rep(outlook_vals, 2),
Method = rep(c("Without Smoothing", "With Laplace Smoothing (α=1)"), each = 3),
Probability = c(raw_probs_no, smoothed_probs_no)
)
smooth_df$Method <- factor(smooth_df$Method,
levels = c("Without Smoothing", "With Laplace Smoothing (α=1)"))
ggplot(smooth_df, aes(x = Outlook, y = Probability, fill = Method)) +
geom_col(position = "dodge", width = 0.6) +
geom_text(aes(label = sprintf("%.3f", Probability)),
position = position_dodge(width = 0.6), vjust = -0.4,
fontface = "bold", size = 3.8) +
scale_fill_manual(values = c("Without Smoothing" = "#E53935",
"With Laplace Smoothing (α=1)" = "#4CAF50")) +
scale_y_continuous(limits = c(0, 0.7)) +
geom_hline(yintercept = 0, color = "grey30") +
annotate("text", x = "Overcast", y = 0.05,
label = "Zero!\nThis kills\nthe product",
color = "#E53935", fontface = "italic", size = 3.5) +
labs(
title = "Laplace Smoothing: Fixing the Zero-Frequency Problem",
subtitle = "P(Outlook | Play = No) — 'Overcast' never appeared with No, causing a zero probability",
x = "Outlook Value", y = "P(Outlook | Play = No)"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")

Before smoothing: P(Overcast | No) = 0 → entire product becomes 0. After smoothing: P(Overcast | No) = 0.125 → small but non-zero, allowing other features to still influence the decision.
| Alpha Value | Effect |
|---|---|
| α = 0 | No smoothing — zero probabilities possible, can crash the model |
| α = 1 (Laplace) | Standard fix — adds 1 pseudo-count to every category. Most common choice. |
| α < 1 (Lidstone) | Milder smoothing — useful when you have large training sets and want less bias |
| α → ∞ | All probabilities become 1/K (uniform) — the model ignores the data entirely |
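In e1071, smoothing is controlled by the laplace argument of naiveBayes() (it applies to categorical predictors; Gaussian features are unaffected). A self-contained sketch using just the Outlook column of the tennis data:

```r
library(e1071)
# Outlook and Play from the Play Tennis dataset (Part 3)
outlook <- factor(c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"))
play <- factor(c("No","No","Yes","Yes","Yes","No","Yes",
                 "No","Yes","Yes","Yes","Yes","Yes","No"))
nb_raw    <- naiveBayes(x = data.frame(Outlook = outlook), y = play, laplace = 0)
nb_smooth <- naiveBayes(x = data.frame(Outlook = outlook), y = play, laplace = 1)
nb_raw$tables$Outlook["No", "Overcast"]     # 0     — the zero-frequency problem
nb_smooth$tables$Outlook["No", "Overcast"]  # 0.125 — (0 + 1) / (5 + 1 × 3)
```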
Text classification is Naive Bayes's most famous application. The idea: represent each document as a bag of word counts, learn per-class word frequencies from labelled examples, and score new documents with Bayes' Theorem.
Let’s build a tiny spam filter from scratch to see the mechanics.
set.seed(MASTER_SEED)
emails <- data.frame(
text = c(
"buy cheap pills now discount",
"win free prize money today",
"urgent offer limited time deal",
"buy discount offer free shipping",
"meeting tomorrow at the office",
"lunch with team at noon today",
"project deadline is next week",
"please review the attached report",
"can we schedule a meeting today",
"free upgrade for your account now"
),
label = c("spam","spam","spam","spam",
"ham","ham","ham","ham","ham","spam")
)
knitr::kable(emails, col.names = c("Email Text", "Label"),
caption = "A Tiny Email Corpus for Spam Classification")

| Email Text | Label |
|---|---|
| buy cheap pills now discount | spam |
| win free prize money today | spam |
| urgent offer limited time deal | spam |
| buy discount offer free shipping | spam |
| meeting tomorrow at the office | ham |
| lunch with team at noon today | ham |
| project deadline is next week | ham |
| please review the attached report | ham |
| can we schedule a meeting today | ham |
| free upgrade for your account now | spam |
all_words <- emails %>%
mutate(words = strsplit(text, " ")) %>%
unnest(words) %>%
count(label, words) %>%
group_by(label) %>%
mutate(freq = n / sum(n)) %>%
ungroup()
top_words <- all_words %>%
group_by(words) %>%
summarise(total = sum(n)) %>%
top_n(15, total) %>%
pull(words)
word_plot_df <- all_words %>% filter(words %in% top_words)
ggplot(word_plot_df, aes(x = reorder(words, freq), y = freq, fill = label)) +
geom_col(position = "dodge", width = 0.6) +
coord_flip() +
scale_fill_manual(values = c(ham = "#4CAF50", spam = "#E53935")) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Word Frequencies: Spam vs. Ham",
subtitle = "Words like 'free', 'buy', 'discount' are strong spam indicators",
x = NULL, y = "Frequency Within Class", fill = "Class"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom")

spam_emails <- emails$text[emails$label == "spam"]
ham_emails <- emails$text[emails$label == "ham"]
spam_words <- unlist(strsplit(paste(spam_emails, collapse = " "), " "))
ham_words <- unlist(strsplit(paste(ham_emails, collapse = " "), " "))
vocabulary <- unique(c(spam_words, ham_words))
V <- length(vocabulary)
prior_spam_txt <- sum(emails$label == "spam") / nrow(emails)
prior_ham_txt <- sum(emails$label == "ham") / nrow(emails)
# Laplace-smoothed word probabilities
p_word_spam <- sapply(vocabulary, function(w) {
(sum(spam_words == w) + 1) / (length(spam_words) + V)
})
p_word_ham <- sapply(vocabulary, function(w) {
(sum(ham_words == w) + 1) / (length(ham_words) + V)
})
# Classify a new email
new_email <- "free discount meeting today"
new_words <- strsplit(new_email, " ")[[1]]
log_spam <- log(prior_spam_txt) + sum(log(p_word_spam[new_words]), na.rm = TRUE)
log_ham <- log(prior_ham_txt) + sum(log(p_word_ham[new_words]), na.rm = TRUE)
cat(sprintf("New email: '%s'\n\n", new_email))

## New email: 'free discount meeting today'
## Log-score (spam): -14.3931
## Log-score (ham): -15.4323
# Convert to probabilities using log-sum-exp trick
max_log <- max(log_spam, log_ham)
p_spam_new <- exp(log_spam - max_log) / (exp(log_spam - max_log) + exp(log_ham - max_log))
p_ham_new <- 1 - p_spam_new
cat(sprintf("P(spam | email) = %.1f%%\n", p_spam_new * 100))
cat(sprintf("P(ham | email) = %.1f%%\n", p_ham_new * 100))
cat(sprintf("Prediction: %s\n", ifelse(p_spam_new > p_ham_new, "SPAM", "HAM")))

## P(spam | email) = 73.9%
## P(ham | email) = 26.1%
## Prediction: SPAM
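The +1 added to every count in the smoothed estimates above is what keeps the classifier alive when a word never appears in one class. A minimal sketch with made-up counts (the word names and numbers here are illustrative, not taken from the email data):

```r
# Toy spam word counts: "meeting" never appears in spam
spam_counts <- c(free = 4, buy = 3, meeting = 0)
n_spam <- sum(spam_counts)    # total spam tokens
V      <- length(spam_counts) # vocabulary size

unsmoothed <- spam_counts / n_spam             # maximum-likelihood estimates
smoothed   <- (spam_counts + 1) / (n_spam + V) # Laplace (add-one) estimates

prod(unsmoothed) # 0: the single unseen word vetoes the spam class entirely
prod(smoothed)   # positive: every word keeps a small nonzero probability
```

One unseen word multiplies the entire likelihood by zero, no matter how strongly the other words point to spam; smoothing caps how much any single word can influence the result.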
Note: We use log-probabilities instead of raw probabilities. Multiplying many small numbers leads to numerical underflow (the result becomes indistinguishable from zero). Taking logarithms converts products to sums, avoiding this problem.
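The underflow the note warns about is easy to trigger. A quick sketch with 100 artificial word likelihoods of 1e-4 each:

```r
p <- rep(1e-4, 100) # 100 artificial word likelihoods

prod(p)     # 1e-400 is below .Machine$double.xmin, so the product underflows to 0
sum(log(p)) # about -921, finite and safe to compare across classes
```

Since the logarithm is monotonic, the class with the larger log-score is also the class with the larger probability, so nothing is lost by working in log space.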
Let’s put all four classification methods we’ve learned on the same 2D Iris task.
set.seed(MASTER_SEED)
grid_comp <- expand.grid(
Petal.Length = seq(0.5, 7.5, length.out = 200),
Petal.Width = seq(0, 2.8, length.out = 200)
)

# Decision Tree
dt_comp <- rpart(Species ~ Petal.Length + Petal.Width, data = train_2d,
method = "class", control = rpart.control(maxdepth = 4))
# Random Forest
rf_comp <- randomForest(Species ~ Petal.Length + Petal.Width, data = train_2d, ntree = 200)
# SVM (RBF)
svm_comp <- svm(Species ~ Petal.Length + Petal.Width, data = train_2d,
kernel = "radial", cost = 5, gamma = 1, scale = TRUE)
# Naive Bayes (already trained as nb_2d)
models_list <- list(
list(name = "Decision Tree", pred_fn = function(g) predict(dt_comp, g, type = "class")),
list(name = "Random Forest", pred_fn = function(g) predict(rf_comp, g)),
list(name = "SVM (RBF)", pred_fn = function(g) predict(svm_comp, g)),
list(name = "Naive Bayes", pred_fn = function(g) predict(nb_2d, g))
)
plots_comp <- lapply(models_list, function(item) {
grid_comp$pred <- item$pred_fn(grid_comp)
test_pred <- item$pred_fn(test_2d)
test_acc <- mean(test_pred == test_2d$Species)
ggplot() +
geom_tile(data = grid_comp, aes(x = Petal.Length, y = Petal.Width, fill = pred), alpha = 0.5) +
geom_point(data = test_2d, aes(x = Petal.Length, y = Petal.Width, color = Species),
size = 1.8, alpha = 0.8) +
scale_fill_manual(values = palette_iris, guide = "none") +
scale_color_manual(values = point_colors, guide = "none") +
labs(
title = item$name,
subtitle = sprintf("Test accuracy: %.1f%%", test_acc * 100),
x = "Petal Length", y = "Petal Width"
) +
theme_minimal(base_size = 9) +
theme(plot.title = element_text(face = "bold"))
})
do.call(grid.arrange, c(plots_comp, ncol = 4,
top = "Decision Boundary Comparison: DT vs. RF vs. SVM vs. Naive Bayes (Iris Petal Features)"))

| Model | Test Accuracy (%) |
|---|---|
| Decision Tree | 93.3 |
| Random Forest | 95.6 |
| SVM (RBF) | 93.3 |
| Naive Bayes | 95.6 |
| Model | Boundary Shape | What Drives It |
|---|---|---|
| Decision Tree | Rectangular, axis-aligned — each split is a horizontal or vertical line | Greedy, one-feature-at-a-time splits |
| Random Forest | Smoothed rectangles — averaging many trees softens the edges | Averaging over hundreds of diverse trees |
| SVM (RBF) | Smooth, curved margins — the kernel trick creates flexible boundaries | Maximising the margin between support vectors |
| Naive Bayes (Gaussian) | Smooth, elliptical — boundaries follow the contours of Gaussian distributions | The intersection of class-conditional probability density curves |
One of Naive Bayes’s biggest advantages is speed.
set.seed(MASTER_SEED)
n_reps <- 50
times <- data.frame(
Model = character(),
Time_ms = numeric()
)
for (r in 1:n_reps) {
t_nb <- system.time(naiveBayes(Species ~ ., data = iris_train))["elapsed"]
t_dt <- system.time(rpart(Species ~ ., data = iris_train, method = "class"))["elapsed"]
t_rf <- system.time(randomForest(Species ~ ., data = iris_train, ntree = 200))["elapsed"]
t_svm <- system.time(svm(Species ~ ., data = iris_train, kernel = "radial"))["elapsed"]
times <- rbind(times, data.frame(
Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM (RBF)"),
Time_ms = c(t_nb, t_dt, t_rf, t_svm) * 1000
))
}
time_summary <- times %>%
group_by(Model) %>%
summarise(Mean_ms = mean(Time_ms), .groups = "drop") %>%
mutate(Model = factor(Model, levels = c("Naive Bayes", "Decision Tree", "SVM (RBF)", "Random Forest")))
ggplot(time_summary, aes(x = reorder(Model, Mean_ms), y = Mean_ms, fill = Model)) +
geom_col(width = 0.55) +
geom_text(aes(label = sprintf("%.2f ms", Mean_ms)),
hjust = -0.1, fontface = "bold", size = 4) +
coord_flip(xlim = NULL, ylim = c(0, max(time_summary$Mean_ms) * 1.3)) +
scale_fill_manual(values = c(
"Naive Bayes" = "#4CAF50",
"Decision Tree" = "#FF9800",
"SVM (RBF)" = "#2196F3",
"Random Forest" = "#9C27B0"
)) +
labs(
title = "Training Time Comparison (Iris Dataset, average of 50 runs)",
subtitle = "Naive Bayes is the fastest — it only computes means and standard deviations",
x = NULL, y = "Mean Training Time (milliseconds)"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")

Why is NB so fast? Training a Gaussian NB model requires only one pass through the data to compute the mean and standard deviation of each feature for each class. There is no iterative optimisation, no tree-building, no kernel computation. For \(n\) samples, \(p\) features, and \(K\) classes, the training complexity is \(O(n \times p \times K)\).
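That one-pass character is easy to see in base R: the entire Gaussian NB "fit" is a table of per-class means and standard deviations. A sketch on the full built-in iris data (rather than iris_train):

```r
# Per-class mean and sd of each feature: the whole Gaussian NB model
class_means <- sapply(iris[1:4], function(col) tapply(col, iris$Species, mean))
class_sds   <- sapply(iris[1:4], function(col) tapply(col, iris$Species, sd))

class_means # 3 classes x 4 features; the same statistics e1071 stores in $tables
```

Everything `predict()` later needs, besides the class priors, is contained in these two small matrices.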
| Property | Naive Bayes Assessment |
|---|---|
| Training speed | Excellent — one of the fastest classifiers to train (just compute means and counts) |
| Prediction speed | Excellent — prediction is a simple product of probabilities |
| Small training sets | Excellent — few parameters means less overfitting with limited data |
| High-dimensional data | Good — handles thousands of features (e.g., word vocabularies) without trouble |
| Text classification | Excellent — Multinomial NB is the classic baseline for spam filtering and document classification |
| Probability estimates | Native — NB directly outputs class probabilities (unlike SVM which needs Platt scaling) |
| Feature independence | Poor assumption — features are rarely independent; model still works but probabilities may be poorly calibrated |
| Continuous features | Moderate — Gaussian assumption may not hold; consider discretising or using kernel density estimation |
| Complex decision boundaries | Limited — boundaries are always smooth ellipses (Gaussian) or linear (Multinomial), cannot capture XOR-like patterns |
| Outlier robustness | Moderate — less robust than tree-based methods to extreme outliers |
| Feature scaling required | Not required — NB is not distance-based, so feature scaling does not affect results |
| Scenario | Recommendation |
|---|---|
| Small dataset, need fast results | Naive Bayes — few parameters, fast training, low overfitting risk |
| Text classification / spam filtering | Multinomial Naive Bayes — the standard baseline for bag-of-words tasks |
| Need probability estimates | Naive Bayes — produces probabilities natively (but may be poorly calibrated) |
| Features are highly correlated | Random Forest or SVM — NB’s independence assumption is most harmful here |
| Complex, non-linear boundary needed | SVM (RBF) or Random Forest — NB boundaries are too simple |
| Very large dataset (millions of rows) | Naive Bayes — scales linearly with data size; RF and SVM are much slower |
| Need interpretable model | Decision Tree — NB doesn’t explain which features matter for individual predictions |
| Baseline model for comparison | Naive Bayes — always try NB first as a fast, simple baseline |
| Real-time / streaming prediction | Naive Bayes — prediction is essentially instant |
Note: e1071's naiveBayes() does not apply Laplace smoothing by default; the laplace argument defaults to 0. Set laplace = 1 (or another positive value) to smooth the conditional probabilities of categorical features.

Naive Bayes is famous for performing well with small training sets. Let’s see how all four models behave as we vary the amount of training data.
set.seed(MASTER_SEED)
train_fracs <- seq(0.1, 0.9, by = 0.1)
n_reps_lc <- 30
lc_results <- lapply(train_fracs, function(frac) {
accs <- replicate(n_reps_lc, {
idx <- sample(1:nrow(iris), floor(frac * nrow(iris)))
tr <- iris[idx, ]
te <- iris[-idx, ]
nb_m <- naiveBayes(Species ~ ., data = tr)
dt_m <- rpart(Species ~ ., data = tr, method = "class")
rf_m <- randomForest(Species ~ ., data = tr, ntree = 100)
svm_m <- svm(Species ~ ., data = tr, kernel = "radial", cost = 5, gamma = 0.5)
c(
NB = mean(predict(nb_m, te) == te$Species),
DT = mean(predict(dt_m, te, type = "class") == te$Species),
RF = mean(predict(rf_m, te) == te$Species),
SVM = mean(predict(svm_m, te) == te$Species)
)
})
data.frame(
frac = frac,
Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
Accuracy = rowMeans(accs)
)
}) %>% do.call(rbind, .)
ggplot(lc_results, aes(x = frac, y = Accuracy, color = Model)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
scale_x_continuous(labels = scales::percent, breaks = train_fracs) +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(values = c(
"Naive Bayes" = "#4CAF50",
"Decision Tree" = "#FF9800",
"Random Forest" = "#9C27B0",
"SVM" = "#2196F3"
)) +
labs(
title = "Learning Curves: Test Accuracy vs. Training Set Size",
subtitle = "Naive Bayes reaches good accuracy with very little data; complex models need more",
x = "Fraction of Data Used for Training",
y = "Mean Test Accuracy (30 repetitions)",
color = NULL
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")

Key observation: With only 10-20% of the data, Naive Bayes often outperforms more complex models. As training data grows, Random Forest and SVM catch up and eventually surpass NB. This makes NB an excellent choice when data is scarce.
| Concept | Key Idea | R Implementation |
|---|---|---|
| Bayes’ Theorem | Update your beliefs as new evidence arrives: posterior ∝ likelihood × prior | Manual or naiveBayes() |
| Prior P(C) | How common is each class before seeing any features? (Class frequencies in training data) | Computed automatically from training data |
| Likelihood P(X|C) | How likely are these feature values if the class is C? (The bell curve evaluation) | Estimated from per-class feature statistics |
| Posterior P(C|X) | The final, updated probability of each class given the observed features | predict(nb_model, newdata, type = "raw") |
| Naive Assumption | Assume all features are independent given the class → simplifies joint probability to a product | Built into naiveBayes() — no option needed |
| Gaussian NB | Continuous features follow a normal distribution within each class → estimate mean and SD | naiveBayes() with continuous features (automatic) |
| Multinomial NB | Features are word counts or frequencies → used for text classification | quanteda or tm packages for text data |
| Bernoulli NB | Features are binary (present/absent) → explicitly models absence | Custom implementation (binary features) |
| Laplace Smoothing | Add α (usually 1) to all counts to prevent zero probabilities from killing the product | naiveBayes(…, laplace = 1) |
| Log-Probabilities | Use log(P) instead of P to avoid numerical underflow when multiplying many small numbers | Used internally; manual demo with log() |
| Model Comparison | NB creates smooth elliptical boundaries; fastest to train; best with small data or high dimensions | Compare with rpart(), randomForest(), svm() |
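The type = "raw" call listed in the table returns the posterior matrix directly. A minimal example on the built-in iris data, assuming e1071 is installed as elsewhere in this chapter:

```r
library(e1071)

nb <- naiveBayes(Species ~ ., data = iris)

# One row per observation, one column per class; each row sums to 1
posteriors <- predict(nb, newdata = iris[c(1, 51, 101), ], type = "raw")
round(posteriors, 4)
```

Omitting `type = "raw"` (or using `type = "class"`) returns only the winning class label; the raw posteriors are what you would feed into a custom decision threshold or a calibration step.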
The Big Picture: Naive Bayes takes a fundamentally different approach from all the classifiers we’ve seen so far. Instead of building trees, drawing boundaries, or finding margins, it simply asks: “Which class makes the observed data most probable?” The naive independence assumption makes this question tractable — instead of estimating a complex joint distribution, we estimate simple per-feature distributions and multiply them together. The result is the simplest, fastest, and most probabilistically principled classifier in our toolbox. It won’t win every competition, but as a fast baseline that works surprisingly well — especially with small data, high dimensions, or text — Naive Bayes earns its place alongside Decision Trees, Random Forests, and SVMs.
[1] scikit-learn developers, “1.9. Naive Bayes,” scikit-learn 1.8.0 documentation, 2026. [Online]. Available: https://scikit-learn.org/stable/modules/naive_bayes.html
[2] Wikipedia contributors, “Naive Bayes classifier,” Wikipedia, The Free Encyclopedia, 2026. [Online]. Available: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[3] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA: O’Reilly Media, 2019, ch. 3.
[4] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. Boca Raton, FL: CRC Press, 2020, ch. 15. [Online]. Available: https://bradleyboehmke.github.io/HOML/
[5] D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012, ch. 10.
[6] H. Zhang, “The Optimality of Naive Bayes,” Proc. FLAIRS Conference, 2004. [Online]. Available: https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf
[7] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, “e1071: Misc Functions of the Department of Statistics, Probability Theory Group,” R package, 2023. [Online]. Available: https://CRAN.R-project.org/package=e1071
[8] Z. Huang, “seedhash: Deterministic Seed Generation from String Inputs Using MD5 Hashing,” R package, 2026. [Online]. Available: https://github.com/melhzy/seedhash