Naive Bayes

Introduction to Naive Bayes

Naive Bayes is a probability-based machine learning algorithm. It is based on Bayes’ Theorem and assumes that all features are independent of each other. This independence assumption is why it is called “naive”. Although this assumption is often not fully true in real-world data, Naive Bayes still performs very well in many practical tasks.

Common applications include:

Text classification
Spam email detection
Sentiment analysis
Document classification
News classification

Naive Bayes is simple to implement, fast to train, and efficient for prediction.

Bayes’ Theorem

The core of Naive Bayes is Bayes’ Theorem:

$P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$

Where:

(P(Y|X)): posterior probability, the probability of class (Y) given feature (X).
(P(X|Y)): conditional probability, the probability of feature (X) appearing given class (Y).
(P(Y)): prior probability, the probability of class (Y).
(P(X)): evidence, the probability of feature (X).

Example: Spam Email Detection

Suppose we want to determine whether an email is spam.

Let:

(Y = spam)
(X =) the email contains the word “free”

Then:

(P(Y|X)): the probability that an email is spam given that it contains the word “free”.
(P(X|Y)): the probability that a spam email contains the word “free”.
(P(Y)): the probability that an email is spam.
(P(X)): the probability that an email contains the word “free”.

In classification, we want to calculate the posterior probability (P(Y|X)).

The class with the highest posterior probability becomes the predicted class.

The Naive Independence Assumption

In real-world data, features may depend on each other. However, Naive Bayes assumes that all features are conditionally independent given the class. This assumption greatly simplifies the calculation of conditional probabilities.
If the input features are:

$X = (X_1, X_2, \dots, X_n)$

Then:

$P(X_1, X_2, \dots, X_n | Y) = P(X_1|Y) \cdot P(X_2|Y) \cdots P(X_n|Y)$

This can also be written as:

$P(X_1, X_2, \dots, X_n | Y) = \prod_{j=1}^{n}P(X_j|Y)$

This is the key idea behind Naive Bayes.

Instead of estimating one complicated joint probability, we estimate many simple conditional probabilities.

Learning in Naive Bayes

In Naive Bayes, learning means estimating two types of probabilities:

Prior probability
Conditional probability

The prior probability tells us how likely each class is before seeing any features. The conditional probability tells us how likely a feature value is under a specific class. These probabilities can be estimated from the training data.

Maximum Likelihood Estimation

Maximum Likelihood Estimation, or MLE, is commonly used to estimate probabilities in Naive Bayes. Suppose there are (K) classes:

$C_1, C_2, \dots, C_K$

The prior probability of class (C_k) can be estimated as:

$P(Y=C_k) = \frac{\sum_{i=1}^{N}I(y_i=C_k)}{N}$

Where:

(N): total number of training samples.
(C_k): the (k)-th class.
(I(y_i=C_k)): indicator function.
(I(y_i=C_k)=1) if sample (i) belongs to class (C_k), otherwise it equals 0.

In simple words:

The prior probability of a class is the proportion of samples belonging to that class.

Conditional Probability Estimation

For a feature (Xj), suppose it can take value (a{jl}).
The conditional probability is:

$P(X_j=a_{jl}|Y=C_k)$

This means:

Given that the class is (Ck), what is the probability that feature (X_j) has value (a{jl})?

Using Maximum Likelihood Estimation:

$P(X_j=a_{jl}|Y=C_k) = \frac{ \sum_{i=1}^{N}I(x_{ji}=a_{jl}, y_i=C_k) }{ \sum_{i=1}^{N}I(y_i=C_k) }$

Where:

(j=1,2,\dots,n): feature index.
(l=1,2,\dots,L): possible value index of the feature.
(k=1,2,\dots,K): class index.
(I) is an indicator function.

In simple words:

The conditional probability is the proportion of samples in class (Ck) whose feature (X_j) equals (a{jl}).

The Zero Probability Problem

Maximum Likelihood Estimation may cause a serious problem. If a certain feature value never appears in a class during training, its estimated probability will be 0.

For example:

$P(X_j=a_{jl}|Y=C_k)=0$

Since Naive Bayes multiplies conditional probabilities together:

$P(Y=C_k)\prod_{j=1}^{n}P(X_j=x_j|Y=C_k)$

If any conditional probability is 0, the whole product becomes 0. This may incorrectly eliminate a class, even if other features strongly support it.
This is called the zero probability problem.

Bayesian Estimation and Laplace Smoothing

To solve the zero probability problem, we can use Bayesian estimation. The idea is to add a smoothing parameter (\lambda). The Bayesian estimate of the prior probability is:

$P_{\lambda}(Y=C_k) = \frac{ \sum_{i=1}^{N}I(y_i=C_k)+\lambda }{ N+K\lambda }$

The Bayesian estimate of the conditional probability is:

$P_{\lambda}(X_j=a_{jl}|Y=C_k) = \frac{ \sum_{i=1}^{N}I(x_{ji}=a_{jl}, y_i=C_k)+\lambda }{ \sum_{i=1}^{N}I(y_i=C_k)+L\lambda }$

Where:

(\lambda \geq 0)
(K) is the number of classes.
(L) is the number of possible values for feature (X_j).

When (\lambda=0), Bayesian estimation becomes Maximum Likelihood Estimation.

When (\lambda=1), it is called Laplace Smoothing.

Laplace Smoothing prevents probabilities from becoming 0.

Learning Process

During training, Naive Bayes calculates:

Prior probabilities

$P(Y=C_k), \quad k=1,2,\dots,K$

These represent the probability of each class.

Conditional probabilities

$P(X_j=a_{jl}|Y=C_k)$

Where:

$j=1,2,\dots,n$ $l=1,2,\dots,L$ $k=1,2,\dots,K$

These probabilities represent how likely each feature value appears under each class.

Classification Process

During prediction, given a new sample:

$x=(x_1,x_2,\dots,x_n)$

Naive Bayes calculates the posterior score for each class:

$P(Y=C_k)\prod_{j=1}^{n}P(X_j=x_j|Y=C_k)$

Then it chooses the class with the largest value:

$y = \arg\max_{C_k} P(Y=C_k)\prod_{j=1}^{n}P(X_j=x_j|Y=C_k)$

In simple words:

Calculate the probability score for every class, then choose the class with the highest score.

Why Can We Ignore (P(X))?

According to Bayes’ Theorem:

$P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$

During classification, (P(X)) is the same for all classes.

Therefore, it does not affect which class has the highest probability.

So we only need to compare:

$P(X|Y)P(Y)$

With the independence assumption:

$P(X|Y)P(Y) = P(Y)\prod_{j=1}^{n}P(X_j|Y)$

This is why the final Naive Bayes decision rule is:

$y = \arg\max_{C_k} P(Y=C_k)\prod_{j=1}^{n}P(X_j=x_j|Y=C_k)$

Simple Intuition

Naive Bayes works like this:

Look at how common each class is.
Look at how common each feature is within each class.
For a new sample, calculate which class is most likely.
Choose the class with the highest probability.

For example, in spam detection:

If words like “free”, “discount”, and “winner” often appear in spam emails,
and rarely appear in normal emails,
then an email containing these words may have a high probability of being spam.

Advantages of Naive Bayes

Naive Bayes has several advantages:

It is simple and easy to understand.
It is fast to train.
It is fast to predict.
It works well with high-dimensional data.
It is especially useful for text classification.
It performs well even with relatively small datasets.

Limitations of Naive Bayes

Naive Bayes also has some limitations:

The feature independence assumption is often too strong.
It may perform poorly when features are highly correlated.
Probability estimates may not always be accurate.
Without smoothing, the zero probability problem may occur.

However, despite these limitations, Naive Bayes is still very useful in many real-world applications.

Summary

Naive Bayes is a probabilistic classification algorithm based on Bayes’ Theorem. It assumes that features are conditionally independent given the class. The learning process estimates prior probabilities and conditional probabilities. Maximum Likelihood Estimation can be used to estimate these probabilities. However, MLE may cause zero probabilities. Bayesian estimation with Laplace Smoothing can solve this problem.

During classification, Naive Bayes chooses the class that maximizes:

$P(Y=C_k)\prod_{j=1}^{n}P(X_j=x_j|Y=C_k)$

Naive Bayes is simple, efficient, and widely used in tasks such as spam detection, sentiment analysis, and text classification.

Decision Tree

Introduction to Decision Trees

A Decision Tree is a machine learning algorithm based on a tree-like structure. It makes predictions by asking a sequence of questions about the input features. Each question splits the data into smaller groups until the model reaches a final prediction.

Decision Trees can be used for both:

Classification tasks
Regression tasks

A Decision Tree usually consists of four parts:

Root node: the starting node that contains all training data.
Internal nodes: decision points based on feature conditions.
Branches: paths created by different decision outcomes.
Leaf nodes: final prediction results.

Decision Trees are useful when we need a model that is easy to understand, rule-based, and fast to make decisions. They often perform well when the features are clear and the dataset size is moderate.
In more complex tasks, Decision Trees are commonly used as base models in ensemble methods, such as:

Random Forest
Gradient Boosting Decision Tree
XGBoost
LightGBM

How a Decision Tree Works

The learning process of a Decision Tree usually includes three steps:

Feature selection
Tree generation
Tree pruning

If there are many features, we may first select useful features before building the tree. The goal is to keep features that have strong classification ability. During training, the algorithm recursively selects the best feature to split the dataset.

The process starts from the root node.All training samples are placed in the root node first. Then, the algorithm selects the best feature and splits the dataset into several subsets according to that feature. If a subset can already be classified correctly, it becomes a leaf node. If a subset still cannot be classified well, the algorithm selects another feature and continues splitting.

This process continues until:

The samples can be classified correctly enough.
There are no useful features left.
A stopping condition is reached.

Finally, each sample group is assigned to a leaf node, and each leaf node gives a prediction. Each split in a Decision Tree can be understood as introducing a boundary in the feature space. For classification, the tree keeps splitting the feature space into smaller regions.

Overfitting in Decision Trees

A Decision Tree may classify the training data very well. However, it may perform poorly on unseen test data. This problem is called overfitting. Overfitting happens when the tree becomes too complex and learns too many details or noise from the training data. To reduce overfitting, we need pruning. Pruning simplifies the tree by removing unnecessary branches or leaf nodes. A simpler tree usually has better generalization ability. Decision Tree generation focuses mainly on local optimal choices. Decision Tree pruning considers the overall structure and tries to improve global performance.

Feature Selection and Tree Generation

Feature selection means choosing features that are useful for splitting the data. A good feature should separate different classes clearly. If splitting by a feature gives almost the same result as random guessing, then this feature has weak classification ability. Removing such features usually does not significantly reduce model accuracy.

Common feature selection criteria include:

Information Gain
Information Gain Ratio
Gini Index

Different Decision Tree algorithms use different criteria.
For example:

ID3 uses Information Gain.
C4.5 uses Information Gain Ratio.
CART uses Gini Index for classification trees.
CART uses squared error for regression trees.

Entropy

Entropy is a measure of uncertainty. The greater the entropy, the higher the uncertainty.

Suppose (X) is a discrete random variable with possible values:

$x_1, x_2, \dots, x_n$

Its probability distribution is:

$P(X=x_i)=p_i, \quad i=1,2,\dots,n$

The entropy of (X) is defined as:

$H(X)=-\sum_{i=1}^{n}p_i\log p_i$

If (p_i=0), we define:

$0\log0=0$

The logarithm is usually based on 2 or (e). Entropy only depends on the probability distribution of (X), not on the actual values of (X). In Decision Trees, entropy is used to measure how mixed or uncertain a dataset is. If all samples belong to the same class, entropy is 0.
If samples are evenly distributed among different classes, entropy is high.

Conditional Entropy

Conditional entropy measures the uncertainty of one variable when another variable is known.
Suppose we have two random variables (X) and (Y).
The conditional entropy of (Y) given (X) is:

$H(Y|X)=\sum_{i=1}^{n}P(X=x_i)H(Y|X=x_i)$

In Decision Trees, conditional entropy tells us:

After splitting the dataset by feature (X), how uncertain is the class label (Y)?

A good feature should reduce uncertainty as much as possible.

Information Gain and ID3

Information Gain measures how much uncertainty is reduced after splitting the dataset by a feature.
Given dataset (D) and feature (A):

(H(D)) represents the uncertainty before splitting.
(H(D|A)) represents the uncertainty after splitting by feature (A).

The Information Gain is:

$g(D,A)=H(D)-H(D|A)$

A larger Information Gain means feature (A) has stronger classification ability.

During feature selection, the algorithm calculates the Information Gain of each feature and chooses the feature with the largest value.

How to Calculate Information Gain

Suppose:

The training dataset (D) has (|D|) samples.
Feature (A) has (n) possible values:

$a_1,a_2,\dots,a_n$

Based on feature (A), dataset (D) is divided into subsets:

$D_1,D_2,\dots,D_n$

There are (K) classes:

$C_1,C_2,\dots,C_K$

(D_{ik}) represents the samples in subset (D_i) that belong to class (C_k).

First, calculate the entropy of dataset (D):

$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$

Then, calculate the conditional entropy of (D) given feature (A):

$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}H(D_i)$

That is:

$H(D|A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$

Finally, calculate Information Gain:

$g(D,A)=H(D)-H(D|A)$

ID3 Algorithm

The ID3 algorithm builds a Decision Tree using Information Gain. At each node, ID3 calculates the Information Gain of all available features. Then it selects the feature with the largest Information Gain as the splitting feature.
The process is recursive:

Start from the root node.
Calculate Information Gain for each feature.
Select the feature with the largest Information Gain.
Split the dataset by this feature.
Repeat the process for each child node.
Stop when all samples are classified well or no useful feature remains.

ID3 is simple and intuitive. However, it tends to prefer features with many possible values. For example, an ID-like feature may have very high Information Gain but may not generalize well.

Information Gain Ratio and C4.5

Information Gain may prefer features with many distinct values. To fix this problem, C4.5 uses Information Gain Ratio.

The Information Gain Ratio of feature (A) on dataset (D) is defined as:

$g_R(D,A)=\frac{g(D,A)}{H_A(D)}$

Where:

$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$

Here, (H_A(D)) measures the entropy of the dataset split by feature (A).

The Information Gain Ratio adjusts Information Gain by considering how many branches the feature creates. C4.5 is an improvement over ID3.

The main difference is:

ID3 uses Information Gain.
C4.5 uses Information Gain Ratio.

Gini Index and CART

The Gini Index is another measure of impurity. Suppose there are (K) classes, and the probability that a sample belongs to class (k) is (p_k).

The Gini Index is:

$Gini(p) = \sum_{k=1}^{K}p_k(1-p_k) = 1-\sum_{k=1}^{K}p_k^2$

For a given dataset (D), the Gini Index is:

$Gini(D) = 1-\sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$

Where:

(C_k) is the subset of samples belonging to class (k).
(K) is the number of classes.

A larger Gini Index means the dataset is more uncertain or more mixed. A smaller Gini Index means the dataset is purer.

Gini Index After Splitting

If dataset (D) is split into two subsets (D_1) and (D_2) according to feature (A), then the Gini Index after splitting is:

$Gini(D,A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$

In CART classification trees, the algorithm chooses the feature and split point that produce the smallest Gini Index. The smaller the Gini Index after splitting, the purer the child nodes are.

CART Classification Tree

CART stands for Classification and Regression Tree. A CART classification tree is a binary tree. This means each split divides the dataset into two parts. For each feature (A) and each possible split value (a), CART calculates the Gini Index after splitting. Then it selects the feature and split point with the smallest Gini Index.
The process continues recursively until one of the stopping conditions is met, such as:

The number of samples in a node is smaller than a threshold.
The Gini Index is already small enough.
There are no more useful features.
The maximum tree depth is reached.

CART is widely used because it is simple, efficient, and works for both classification and regression tasks.

CART Regression Tree

A regression tree is used when the target variable is continuous. A regression tree divides the input feature space into several regions.
Suppose the input space is divided into (M) regions:

$R_1,R_2,\dots,R_M$

Each region (R_m) has a fixed output value (c_m).

The regression tree model can be written as:

$f(x) = \sum_{m=1}^{M}c_mI(x\in R_m)$

Where:

(I(x\in R_m)) is an indicator function.
If (x) belongs to region (R_m), then (I(x\in R_m)=1).
Otherwise, it equals 0.

In a regression tree, the prediction of a leaf node is usually the average target value of samples in that leaf.

Splitting in CART Regression Trees

In CART regression trees, each split tries to minimize squared error. For a feature (j) and a split point (s), the input space is divided into two regions:

$R_1(j,s)=\{x|x_j\leq s\}$ $R_2(j,s)=\{x|x_j>s\}$

The goal is to find the best feature (j) and the best split point (s) that minimize:

$\min_{j,s} \left[ \min_{c_1} \sum_{x_i\in R_1(j,s)} (y_i-c_1)^2 + \min_{c_2} \sum_{x_i\in R_2(j,s)} (y_i-c_2)^2 \right]$

Once the best split is found, the dataset is divided into two regions. The same process is repeated recursively.

For each final leaf region (R_m), the output value is:

$\hat{c}_m = ave(y_i|x_i\in R_m)$

This means the prediction is the average target value of samples in that region. This type of tree is also called a least squares regression tree.

Decision Tree Pruning

Decision Trees can easily overfit the training data. Overfitting happens when the tree becomes too deep or too complex. To reduce overfitting, we use pruning. Pruning removes unnecessary branches or leaf nodes and makes the tree simpler. A simpler tree often generalizes better to unseen data.

There are two common types of pruning:

Pre-Pruning
Post-Pruning

Pre-Pruning

Pre-Pruning stops the tree from growing too much during the training process. It prevents excessive splitting by setting stopping conditions in advance.
Common pre-pruning strategies include:

Limiting the maximum tree depth.
Setting the minimum number of samples required to split a node.
Setting the minimum number of samples required in a leaf node.
Setting the minimum impurity decrease.
Limiting the maximum number of leaf nodes.

Pre-Pruning is efficient because it prevents the tree from becoming too large. However, if the stopping conditions are too strict, the tree may become too simple.

This may cause underfitting. Therefore, choosing good stopping thresholds is important.

Post-Pruning

Post-Pruning first allows the tree to grow fully. Then it removes unnecessary branches from the bottom up. The goal is to simplify the tree while keeping or improving validation performance.

Common post-pruning methods include:

Cost-Complexity Pruning, CCP
Reduced Error Pruning, REP

Post-Pruning usually gives better results than very aggressive pre-pruning because it evaluates the tree after it has already learned the data structure.

Cost-Complexity Pruning

Cost-Complexity Pruning balances training error and tree complexity. It defines the cost-complexity function as:

$C_{\alpha}(T)=C(T)+\alpha|T|$

Where:

(T) is a subtree.
(|T|) is the number of leaf nodes in the subtree.
(C(T)) is the prediction error on the training data.
$\alpha \geq 0$ controls the trade-off between model fit and model complexity.
$C_{\alpha}(T)$ is the overall cost of subtree $T$.

When $\alpha$ is small, the model focuses more on fitting the training data.

When $\alpha$ is large, the model prefers a simpler tree.

The pruning process starts from $\alpha = 0$ and gradually increases $\alpha$.

For each value of $\alpha$, the algorithm prunes the subtree that minimizes $C_{\alpha}(T)$.

This creates a sequence of pruned trees.

Finally, cross-validation can be used to select the best $\alpha$ and the best tree.

Reduced Error Pruning

Reduced Error Pruning uses a validation set to decide whether to prune.

The process is:

Train a full Decision Tree.
Traverse internal nodes from bottom to top.
Temporarily replace a subtree with a leaf node.
Compare validation error before and after replacement.
If validation error does not increase, prune the subtree.
Repeat until no more useful pruning can be done.

Reduced Error Pruning is easy to understand.

Its main idea is:

If removing a subtree does not make validation performance worse, then the subtree is unnecessary.

Comparison of ID3, C4.5, and CART

Algorithm	Splitting Criterion	Tree Type	Task Type
ID3	Information Gain	Multi-branch tree	Classification
C4.5	Information Gain Ratio	Multi-branch tree	Classification
CART	Gini Index or Squared Error	Binary tree	Classification and Regression

ID3 is simple but may prefer features with many values. C4.5 improves ID3 by using Information Gain Ratio. CART is widely used in modern machine learning because it supports both classification and regression.

Advantages of Decision Trees

Decision Trees have several advantages:

Easy to understand.
Easy to visualize.
Can handle both classification and regression.
Can handle nonlinear relationships.
Requires little data preprocessing.
Does not require feature scaling.
Makes fast predictions.
Can handle both numerical and categorical features.

These advantages make Decision Trees very popular in practical machine learning.

Limitations of Decision Trees

Decision Trees also have some limitations:

They can easily overfit.
Small changes in data may lead to a very different tree.
A single Decision Tree may have limited prediction accuracy.
Greedy splitting may not find the globally optimal tree.
Deep trees may become complex and hard to generalize.

To overcome these limitations, Decision Trees are often combined into ensemble models, such as Random Forest and Gradient Boosting.

Summary

A Decision Tree is a tree-based machine learning algorithm.

It makes predictions by recursively splitting the dataset according to feature conditions.

A Decision Tree consists of root nodes, internal nodes, branches, and leaf nodes.

The main learning process includes:

Feature selection
Tree generation
Tree pruning

Common feature selection criteria include:

Information Gain
Information Gain Ratio
Gini Index

ID3 uses Information Gain.

C4.5 uses Information Gain Ratio.

CART uses Gini Index for classification and squared error for regression.

Decision Trees are easy to interpret, fast, and useful for many tasks.

However, they can easily overfit, so pruning is important.

Pre-Pruning stops the tree early during training.

Post-Pruning simplifies the tree after it is fully grown.

Decision Trees are also the foundation of many powerful ensemble learning algorithms, such as Random Forest, XGBoost, and LightGBM.

Support Vector Machine

Introduction to Support Vector Machine

Support Vector Machine (SVM) is a binary classification model.
Its core objective is to find a hyperplane with the maximum margin to separate data points from different classes.

In different dimensions, the hyperplane can be understood as:

In two-dimensional space, it is a straight line.
In three-dimensional space, it is a plane.
In higher-dimensional space, it is a hyperplane.

The maximum margin means that the distance between the hyperplane and the nearest data points is maximized.
This gives SVM strong generalization ability.

SVM learning methods include three models from simple to complex:

Linearly Separable Support Vector Machine
Linear Support Vector Machine
Nonlinear Support Vector Machine

Linearly Separable Support Vector Machine: Hard Margin

Hard Margin

When the training samples are linearly separable, a linearly separable SVM can be learned by maximizing the hard margin.

The hard margin means that the hyperplane can completely separate samples from different classes.

The sample points closest to the hyperplane are called support vectors.
They directly determine the position and direction of the hyperplane.
As long as the support vectors remain unchanged, the hyperplane will not change.

Representation of the Hyperplane

In the sample space, the hyperplane can be represented as:

1	w^T x + b = 0

where:

w = (w_1, w_2, ..., w_n) is the normal vector, which determines the direction of the hyperplane.
b is the bias term, which determines the distance between the hyperplane and the origin.

Therefore, the hyperplane can be denoted as:

(w, b)

The corresponding classification function is called the linearly separable support vector machine:

1	f(x) = sign(w^T x + b)

Margin and Maximum Margin

Let x' be a point on the hyperplane. Then:

1	w^T x' + b = 0

The distance from any point x in the sample space to the hyperplane (w, b) is:

1	r = \|w^T(x - x')\| / \|\|w\|\|

Since:

1	w^T x' + b = 0

the distance can be written as:

1	r = \|w^T x + b\| / \|\|w\|\|

Functional Margin

Let the class label of each sample point x_i be y_i.
The functional margin of this sample point is:

1	γ̂_i = y_i(w^T x_i + b)

The functional margin represents the correctness and confidence of the classification prediction.

If the hyperplane (w, b) correctly classifies the samples, then:

1 2	w^T x_i + b > 0, y_i = +1 w^T x_i + b < 0, y_i = -1

Therefore:

1	y_i(w^T x_i + b) = \|w^T x_i + b\| > 0

The distance from sample point x_i to the hyperplane is:

1	r_i = \|w^T x_i + b\| / \|\|w\|\|

It can also be written as:

1	r_i = y_i(w^T x_i + b) / \|\|w\|\|

That is:

1	r_i = γ̂_i / \|\|w\|\|

Maximum Margin

Notice that scaling w and b:

1 2	w → λw b → λb

does not change the hyperplane, nor does it change the geometric distance r.

However, the functional margin:

1	γ̂_i = y_i(w^T x_i + b)

changes when w and b are scaled.

In other words, the functional margin γ̂_i can be arbitrarily scaled by scaling w and b, while the geometric distance r_i remains unchanged.

Therefore, we can set the functional margin of the support vectors to:

γ̂_i = 1

At this time, the distance from the support vector to the hyperplane is:

1	r_i = γ̂_i / \|\|w\|\| = 1 / \|\|w\|\|

The sum of the distances from two support vectors of different classes to the hyperplane is:

1	γ = 2 / \|\|w\|\|

Here, γ is called the margin.

Maximum Margin Optimization Problem

To find the hyperplane with the maximum margin, we need to maximize γ under the constraint:

max    2 / ||w||
w,b

s.t.   y_i(w^T x_i + b) ≥ 1

This is equivalent to:

min    ||w||^2 / 2
w,b

s.t.   y_i(w^T x_i + b) ≥ 1

Here, s.t. stands for Subject to, meaning the constraints that must be satisfied.

The above formula is the basic form of the support vector machine.

By applying the Lagrange multiplier method to the above problem, we can obtain its dual problem.
By solving the Lagrange multipliers in the dual problem, we can further obtain w and b, and finally obtain the maximum-margin hyperplane.

Linear Support Vector Machine: Soft Margin

Previously, we assumed that the training samples were linearly separable in the sample space.
However, in real-world situations, this is often not the case.

In other words, it may be impossible to find a suitable hyperplane that correctly separates all sample points.

Usually, there are some outliers in the training data.
If these outliers are removed, most of the remaining samples may become linearly separable.

In this case, we can relax the condition and allow some samples to be misclassified.
This leads to the concept of the soft margin.

Slack Variables

Linear inseparability means that some sample points (x_i, y_i) cannot satisfy the constraint:

1	y_i(w^T x_i + b) ≥ 1

To solve this problem, we introduce a slack variable for each sample point:

ξ_i ≥ 0

This allows the functional margin plus the slack variable to satisfy the required condition.

The constraint then becomes:

1	y_i(w^T x_i + b) ≥ 1 - ξ_i

Objective Function of Soft Margin

To maximize the margin while keeping the number of samples that violate the constraint as small as possible, a penalty term for misclassification is introduced into the objective function:

1	\|\|w\|\|^2 / 2 + C∑ξ_i

where:

C > 0 is the penalty coefficient.
A larger value of C means a larger penalty for misclassification.

The learning problem of the linear support vector machine for linearly inseparable data can be expressed as:

min    ||w||^2 / 2 + C∑ξ_i
w,b,ξ

s.t.   y_i(w^T x_i + b) ≥ 1 - ξ_i
       ξ_i ≥ 0

This is the soft-margin support vector machine.

Nonlinear Support Vector Machine: Kernel Function

Nonlinear Classification Problem

A nonlinear classification problem refers to a problem that can only be classified well by using a nonlinear model.

In this type of problem, the data cannot be directly classified using a hyperplane.

In this case, we can use a kernel function to map the data from the original space to a high-dimensional feature space.
This makes the data linearly separable in the high-dimensional feature space.

In this way, the original nonlinear problem is transformed into a linear problem.

Learning a nonlinear SVM using the kernel trick is equivalent to implicitly learning a linear SVM in a high-dimensional feature space.

Feature Mapping

Let:

ϕ(x)

represent the feature vector after mapping x.

In the feature space, the hyperplane can be represented as:

1	w^T ϕ(x) + b = 0

The corresponding optimization problem is:

min    ||w||^2 / 2
w,b

s.t.   y_i(w^T ϕ(x_i) + b) ≥ 1

Kernel Function

When converting the above problem into its dual form and solving it, we need to compute:

1	ϕ(x_i)^T ϕ(x_j)

This is the inner product of sample x_i and sample x_j after they are mapped into the feature space.

Since the feature space may have a very high dimension, directly computing:

1	ϕ(x_i)^T ϕ(x_j)

is usually very difficult.

To solve this problem, we introduce a function:

1	κ(x_i, x_j) = <ϕ(x_i), ϕ(x_j)>

That is:

1	κ(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)

Here, κ is called the kernel function.

Obviously, if the mapping function ϕ is known, the corresponding kernel function can be written.
However, in real tasks, we usually do not know the specific form of ϕ.

So how can we determine whether a given function κ is a valid kernel function?

In fact, as long as the kernel matrix corresponding to a symmetric function is positive semi-definite, it can be used as a kernel function.

This definition is useful when constructing kernel functions.
However, for a specific function κ, checking whether it is a positive definite kernel function is not easy.
Therefore, in practice, commonly used existing kernel functions are often selected.

Common Kernel Functions

The following are several commonly used kernel functions:

Kernel Function	Expression
Linear Kernel	`κ(x_i, x_j) = x_i^T x_j`
Polynomial Kernel	`κ(x_i, x_j) = (x_i^T x_j)^d`
Gaussian Kernel	`κ(x_i, x_j) = exp(-	x_i - x_j	^2 / 2σ^2)`
Laplacian Kernel	`κ(x_i, x_j) = exp(-	x_i - x_j	/ σ)`
Sigmoid Kernel	`κ(x_i, x_j) = tanh(βx_i^T x_j + θ)`

Summary

The core idea of Support Vector Machine is:

To find a hyperplane that maximizes the classification margin, thereby improving the generalization ability of the model.

According to whether the data is linearly separable, SVM can be divided into:

Hard-Margin SVM: suitable for linearly separable data.
Soft-Margin SVM: allows a small number of misclassified samples and is suitable for approximately linearly separable data.
Nonlinear SVM: uses kernel functions to map data into a high-dimensional space to solve nonlinear classification problems.