Logistic Regression

Logistic Regression is a binary classification technique for numeric input variables. It is sometimes described as assuming Gaussian-distributed inputs, but this is not a requirement of the model, and the algorithm can perform well when the data do not follow that pattern. Logistic regression estimates a coefficient for each input variable, combines the inputs linearly into a regression function, and applies a logistic transformation to the result. It is simple and fast, but its effectiveness depends on the characteristics of the dataset. For this dataset, many attributes are not Gaussian-distributed, which may limit how well logistic regression performs.

What Is Logistic Regression?

Logistic Regression is a statistical method used for classification problems.

It is especially common in binary classification tasks.

Although the name contains the word “Regression”, Logistic Regression is mainly used for classification, not regression.

The core idea is:

Use a linear model to calculate a score, then convert this score into a probability between 0 and 1.

The linear output is:

$z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n$

Logistic Regression then applies the sigmoid function to map this value into the open interval $(0, 1)$:

$sigmoid(z)=\frac{1}{1+e^{-z}}$

The derivative of the sigmoid function is:

$sigmoid'(z)=sigmoid(z)(1-sigmoid(z))$
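The two formulas above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `sigmoid` and `sigmoid_grad` and the step size `h` are illustrative choices, not part of any library.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Analytic derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Compare the analytic derivative with a central finite difference.
z = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # illustrative step size
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
max_error = np.max(np.abs(numeric - sigmoid_grad(z)))
print(max_error)
```

The maximum difference should be tiny, which supports the closed-form derivative.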

Therefore, the Logistic Regression prediction can be written as:

$P(y=1|x)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n)}}$

Where:

$P(y=1|x)$ represents the probability that the sample belongs to class 1.
$\beta_0$ is the intercept.
$\beta_1, \beta_2, \dots, \beta_n$ are the model coefficients.
$x_1, x_2, \dots, x_n$ are the input features.

After getting the probability, we compare it with a threshold.

Usually, the threshold is 0.5.

If the predicted probability is greater than 0.5, the model predicts class 1.

Otherwise, it predicts class 0.

Simple Example

Suppose we have the following feature matrix:

$X = \begin{bmatrix} 0.5 & 0 & 0.7 \\ 0.5 & 0.5 & 0.9 \\ 0.1 & 1 & 0.6 \\ 0.6 & 0.1 & 0 \end{bmatrix}$

And the parameter vector:

$\beta = \begin{bmatrix} -1 \\ 2 \\ 0.5 \end{bmatrix}$

First, we calculate the linear output:

$X\beta = \begin{bmatrix} -0.15 \\ 0.95 \\ 2.2 \\ -0.4 \end{bmatrix}$

Then we apply the sigmoid function:

$sigmoid(X\beta)= \begin{bmatrix} 0.4626 \\ 0.7211 \\ 0.9002 \\ 0.4013 \end{bmatrix}$

If the threshold is 0.5, the final prediction is:

$\hat{y}= \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

This means:

Sample 1 is predicted as class 0.
Sample 2 is predicted as class 1.
Sample 3 is predicted as class 1.
Sample 4 is predicted as class 0.
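The worked example can be reproduced in a few lines. This is a small sketch assuming NumPy; note that, as in the example itself, there is no intercept term because $X$ has no column of ones.

```python
import numpy as np

# Feature matrix and parameter vector from the example above.
X = np.array([
    [0.5, 0.0, 0.7],
    [0.5, 0.5, 0.9],
    [0.1, 1.0, 0.6],
    [0.6, 0.1, 0.0],
])
beta = np.array([-1.0, 2.0, 0.5])

z = X @ beta                      # linear output
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid, element-wise
y_hat = (p > 0.5).astype(int)     # threshold at 0.5

print(np.round(z, 2))
print(np.round(p, 4))
print(y_hat)
```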

Common Use Cases of Logistic Regression

Logistic Regression can be used in many real-world classification tasks.

Common examples include:

Credit scoring: predicting whether a customer will default.
Fraud detection: predicting whether a transaction is fraudulent.
Spam detection: predicting whether an email is spam.
Ad click prediction: predicting whether a user will click an advertisement.
Image classification: classifying images into categories such as cats, dogs, or birds.
Sentiment analysis: classifying text as positive, negative, or neutral.
Product quality classification: predicting whether a product is qualified.
Medical diagnosis: predicting whether a patient has a disease.
Protein function prediction: predicting protein function based on sequence and structural features.

Loss Function of Logistic Regression

Logistic Regression usually uses Log Loss, also known as Binary Cross-Entropy Loss.

It measures the difference between the predicted probability distribution and the true label.

The loss function of Logistic Regression is derived from Maximum Likelihood Estimation (MLE).

For a binary classification problem:

$P(y=1|x;\beta)=\frac{1}{1+e^{-\beta^Tx}}$

$P(y=0|x;\beta)=1-P(y=1|x;\beta)$

These two cases can be combined as:

$P(y|x;\beta)=P(y=1|x;\beta)^y(1-P(y=1|x;\beta))^{1-y}$

Substituting the sigmoid function:

$P(y|x;\beta) = \left(\frac{1}{1+e^{-\beta^Tx}}\right)^y \left(1-\frac{1}{1+e^{-\beta^Tx}}\right)^{1-y}$

Likelihood Function

For one sample, the likelihood function is:

$L(\beta)=P(y|x;\beta)$

For $n$ samples, the likelihood function is:

$L(\beta)=\prod_{i=1}^{n}P(y_i|x_i;\beta)$

That is:

$L(\beta)= \prod_{i=1}^{n} P(y_i=1|x_i;\beta)^{y_i} (1-P(y_i=1|x_i;\beta))^{1-y_i}$

Taking the logarithm, we get the log-likelihood function:

$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i\log P(y_i=1|x_i;\beta) + (1-y_i)\log(1-P(y_i=1|x_i;\beta)) \right]$

The training process tries to maximize the likelihood function.

For easier optimization, we usually minimize the negative average log-likelihood:

$Loss = -\frac{1}{n}\log L(\beta)$

Therefore:

$Loss = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i\log P(y_i=1|x_i;\beta) + (1-y_i)\log(1-P(y_i=1|x_i;\beta)) \right]$

This is the Binary Cross-Entropy Loss.
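As a rough illustration, the formula can be written directly in NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the labels `y` below are assumptions made for this sketch; the probabilities are the ones from the simple example.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Negative average log-likelihood for binary labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs; eps is an illustrative choice.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


y = np.array([0, 1, 1, 0])                       # hypothetical true labels
p = np.array([0.4626, 0.7211, 0.9002, 0.4013])   # predicted P(y = 1 | x)
loss = binary_cross_entropy(y, p)
print(loss)
```

In practice a library routine such as `sklearn.metrics.log_loss` would normally be used instead of a hand-written version.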

Intuition Behind Log Loss

Log Loss punishes confident wrong predictions heavily.

For example, if the true label is 1:

Predicting 0.9 gives a small loss.
Predicting 0.6 gives a larger loss.
Predicting 0.01 gives a very large loss.

This encourages the model to output high probabilities for correct classes and low probabilities for incorrect classes.
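To put numbers on the three cases above: when the true label is 1, the per-sample loss reduces to $-\log p$. A short sketch using only the standard library:

```python
import math

# Per-sample log loss when the true label is 1: -log(p).
losses = {p: -math.log(p) for p in (0.9, 0.6, 0.01)}

for p, loss in losses.items():
    print(f"p = {p:<4}  loss = {loss:.3f}")
```

The losses are roughly 0.105, 0.511, and 4.605, so the confident wrong prediction costs about 44 times as much as the confident correct one.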

Gradient of the Loss Function

The loss function can be written in vectorized form:

$Loss = -\frac{1}{n}\left[y^T\log p + (1-y)^T\log(1-p)\right]$

Where:

$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

and:

$p = \begin{bmatrix} P(y_1=1|x_1;\beta) \\ P(y_2=1|x_2;\beta) \\ \vdots \\ P(y_n=1|x_n;\beta) \end{bmatrix}$

Since the sigmoid is applied element-wise to $X\beta$:

$p = \frac{1}{1+e^{-X\beta}}$

the gradient of the loss with respect to $\beta$ is:

$\nabla Loss = \frac{1}{n}X^T(p-y)$

Or:

$\nabla Loss = \frac{1}{n}X^T \left( \frac{1}{1+e^{-X\beta}} - y \right)$
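
The step from the loss to this gradient follows from the chain rule and the sigmoid derivative given earlier. As a sketch, write $z_i = x_i^T\beta$ and $p_i = sigmoid(z_i)$, and let $\ell_i$ (notation introduced here only for this derivation) be the loss of sample $i$:

$\ell_i = -\left[y_i\log p_i + (1-y_i)\log(1-p_i)\right]$

Differentiating with respect to $z_i$ and using $sigmoid'(z_i)=p_i(1-p_i)$:

$\frac{\partial \ell_i}{\partial z_i} = \left(-\frac{y_i}{p_i} + \frac{1-y_i}{1-p_i}\right)p_i(1-p_i) = p_i - y_i$

Because $\frac{\partial z_i}{\partial \beta} = x_i$, averaging over all samples gives:

$\nabla Loss = \frac{1}{n}\sum_{i=1}^{n}(p_i-y_i)x_i = \frac{1}{n}X^T(p-y)$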

This gradient is used by optimization algorithms to update model parameters.
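A minimal sketch of how the gradient drives training, using plain batch gradient descent in NumPy. The function name `fit_logistic_gd`, the learning rate, the iteration count, and the toy data are all assumptions for illustration; scikit-learn's solvers use more sophisticated methods.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the binary cross-entropy loss."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / n   # gradient (1/n) X^T (p - y)
        beta -= lr * grad          # step against the gradient
    return beta


# Toy data: one feature, class 1 is more likely when the feature is large.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = (x1 + 0.3 * rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x1), x1])  # column of ones for the intercept

beta = fit_logistic_gd(X, y)
p = sigmoid(X @ beta)
accuracy = np.mean((p > 0.5) == y)
print(beta, accuracy)
```

The column of ones lets the first entry of `beta` play the role of the intercept $\beta_0$.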

Scikit-Learn API: LogisticRegression

In scikit-learn, the LogisticRegression class in sklearn.linear_model implements Logistic Regression.

It supports both binary classification and multi-class classification.

It also provides different optimization algorithms and regularization options.

Important parameters include:

solver: the optimization algorithm.
penalty: the regularization type.
C: inverse of regularization strength.
class_weight: class weight strategy.

Common solver options:

lbfgs: limited-memory quasi-Newton method, the default solver, supports L2 regularization.
newton-cg: Newton conjugate-gradient method, supports L2 regularization.
liblinear: coordinate descent method, suitable for small datasets, supports L1 and L2 regularization.
sag: Stochastic Average Gradient, suitable for large datasets, supports L2 regularization.
saga: a variant of SAG, suitable for large datasets, supports L1, L2, and ElasticNet regularization.

Example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    solver="lbfgs",
    penalty="l2",
    C=1,
    class_weight="balanced"
)

Explanation:

solver="lbfgs" means using the L-BFGS optimization algorithm.
penalty="l2" means using L2 regularization.
C=1 sets the inverse of the regularization strength.
A smaller C means stronger regularization.
class_weight="balanced" weights each class inversely to its frequency in the training data, which helps with imbalanced datasets.
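For context, here is one way such a model might be fitted and queried. This is a sketch on synthetic data from `make_classification`, not the dataset discussed in this article; the sample sizes and random seeds are arbitrary, and `penalty` is left at its default (L2).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 80% / 20%), used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(solver="lbfgs", C=1.0, class_weight="balanced")
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # P(y = 1 | x) for each test sample
pred = model.predict(X_test)                # class labels, thresholded at 0.5
print(model.coef_.shape, model.intercept_.shape)
```

`coef_` holds one coefficient per feature and `intercept_` holds $\beta_0$.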

Why Use StandardScaler?

Logistic Regression is optimized using iterative algorithms.

If numerical features have very different scales, optimization can become slow or unstable.

For example:

Age may range from 20 to 80.
Cholesterol may range from 100 to 400.
ST depression may range from 0 to 6.

StandardScaler transforms numerical features using:

$z = \frac{x-\mu}{\sigma}$

Where:

$\mu$ is the mean of the feature.
$\sigma$ is the standard deviation of the feature.

After standardization:

The mean is approximately 0.
The standard deviation is approximately 1.

This helps the optimization algorithm converge faster.
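A short sketch of the transformation, assuming scikit-learn and NumPy. The rows below are made-up values for age, cholesterol, and ST depression, chosen only to fall within the ranges mentioned above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, cholesterol, ST depression]
X = np.array([
    [29.0, 204.0, 0.0],
    [45.0, 264.0, 1.2],
    [61.0, 330.0, 2.8],
    [77.0, 394.0, 6.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand; StandardScaler uses the population std (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

In a real workflow the scaler is fitted on the training data only and then reused to transform the test data.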

Why Use OneHotEncoder?

Some features are categorical.

For example, chest pain type may have values such as 1, 2, 3, and 4.

These numbers represent categories, not numerical order.

If we directly use them as numbers, the model may incorrectly assume that category 4 is larger than category 1.

OneHotEncoder converts categorical values into separate binary columns.

Example:

Chest Pain Type
1
2
3
4

After one-hot encoding with drop="first":

ChestPain_2  ChestPain_3  ChestPain_4
0            0            0
1            0            0
0            1            0
0            0            1

The first category is dropped to reduce multicollinearity.

Why Use Passthrough for Binary Features?

Binary features already contain only two values, such as 0 and 1.

For example:

Sex: 0 or 1.
Fasting blood sugar: 0 or 1.
Exercise-induced angina: 0 or 1.

These features can usually be passed directly into the model.

That is why the code uses:

("binary", "passthrough", binary_features)
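The surrounding code is not shown here, so the following is only a guess at how such a tuple might sit inside a `ColumnTransformer`. The column layout, the index lists, and the transformer names other than "binary" are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column layout: [age, cholesterol, chest_pain_type, sex]
X = np.array([
    [29.0, 204.0, 1, 0],
    [45.0, 264.0, 2, 1],
    [61.0, 330.0, 3, 0],
    [77.0, 394.0, 4, 1],
])
numeric_features = [0, 1]      # scaled
categorical_features = [2]     # one-hot encoded
binary_features = [3]          # left unchanged

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), numeric_features),
        ("categorical", OneHotEncoder(drop="first"), categorical_features),
        ("binary", "passthrough", binary_features),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)  # 2 scaled + 3 one-hot + 1 binary columns
```

The last output column is identical to the original binary column, which is what "passthrough" means.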

Multi-Class Classification

Logistic Regression is commonly used for binary classification.

However, it can also be extended to multi-class classification.

Two common methods are:

One-vs-Rest, OvR.
Softmax Regression, also known as Multinomial Logistic Regression.

One-vs-Rest, OvR

In One-vs-Rest classification, if there are $C$ classes, the model trains $C$ binary classifiers.

Each classifier treats one class as the positive class and all other classes as the negative class.

During prediction, the model calculates the probability from each classifier and selects the class with the highest probability.

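For class $c$, the classifier has its own parameter vector $\beta_c$:

$P(y=c|x;\beta_c)=\frac{1}{1+e^{-\beta_c^Tx}}$

Its loss function is the binary cross-entropy with the indicator $I(y_i=c)$ as the label, which is 1 when sample $i$ belongs to class $c$ and 0 otherwise:

$Loss_c = -\frac{1}{n} \sum_{i=1}^{n} \left[ I(y_i=c)\log P(y_i=c|x_i;\beta_c) + (1-I(y_i=c))\log(1-P(y_i=c|x_i;\beta_c)) \right]$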

Advantages:

Simple and easy to implement.
Suitable when the number of classes is small.

Disadvantages:

It trains one classifier for each class.
Training time can become long when there are many classes.

API example:

from sklearn.linear_model import LogisticRegression

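# Note: multi_class is deprecated since scikit-learn 1.5; on recent versions prefer OneVsRestClassifier.
model = LogisticRegression(multi_class="ovr")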

Or:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

model = OneVsRestClassifier(LogisticRegression())
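
To see the "one classifier per class" behaviour, the wrapper can be fitted on a small three-class dataset. This sketch uses the iris data bundled with scikit-learn purely for illustration; `max_iter=1000` is an arbitrary choice to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

n_binary_models = len(ovr.estimators_)   # one fitted binary model per class
proba = ovr.predict_proba(X[:5])         # one column per class
pred = ovr.predict(X[:5])                # class with the highest score
print(n_binary_models, proba.shape, pred)
```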

Softmax Regression

Softmax Regression directly extends Logistic Regression to multi-class classification.

Instead of training multiple binary classifiers, it trains one model to output probabilities for all classes.

If there are $C$ classes, the model outputs $C$ scores.

The Softmax function converts these scores into probabilities.

For class $c$:

$P(y=c|x)= \frac{e^{\beta_c^Tx}} {\sum_{j=1}^{C}e^{\beta_j^Tx}}$

The loss function is:

$Loss = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} I(y_i=c)\log P(y_i=c|x_i)$

Where:

$I(y_i=c)$ is an indicator function.
If $y_i=c$, then $I(y_i=c)=1$.
Otherwise, $I(y_i=c)=0$.
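
Both formulas can be sketched in NumPy. The function names and the score matrix below are hypothetical; the scores stand in for the values $\beta_c^Tx$, and subtracting the row maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; each row holds the C scores of one sample."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def cross_entropy(y_true, proba):
    """Mean of -log P(y_i = true class); y_true holds integer class indices."""
    n = y_true.shape[0]
    # The indicator I(y_i = c) simply selects the true-class probability.
    return -np.mean(np.log(proba[np.arange(n), y_true]))


# Hypothetical scores for 2 samples and C = 3 classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
y_true = np.array([0, 2])

proba = softmax(scores)
loss = cross_entropy(y_true, proba)
print(proba.sum(axis=1), loss)
```

Each row of `proba` sums to 1, and equal scores give equal probabilities of 1/3.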

Advantages:

Only one model is trained.
The predicted probabilities sum to 1 by construction, which usually makes them more consistent across classes than separate One-vs-Rest outputs.

Disadvantages:

Softmax needs to calculate exponentials for all classes.
The computational cost can be high when there are many classes.

API example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class="multinomial")

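In recent versions of scikit-learn, LogisticRegression handles multi-class targets automatically: with the default lbfgs solver it fits a multinomial (softmax) model, and the multi_class parameter has been deprecated since version 1.5.

Therefore, the parameter can usually be omitted:

model = LogisticRegression()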
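
As a quick check of the default behaviour, the model can be fitted on a three-class dataset. This is a sketch assuming a recent scikit-learn release; the iris data and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

proba = model.predict_proba(X[:3])
print(model.coef_.shape)   # one row of coefficients per class
print(proba.sum(axis=1))   # each row of probabilities sums to 1
```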

Summary

Logistic Regression is a classification algorithm.

It uses a linear model to calculate a score.

Then it uses the sigmoid function to convert the score into a probability.

For binary classification, the model predicts class 1 if the probability is greater than the threshold.

The most common loss function is Binary Cross-Entropy Loss, also known as Log Loss.

The loss function comes from Maximum Likelihood Estimation.

Logistic Regression can be used in many practical tasks, such as fraud detection, spam detection, medical diagnosis, and ad click prediction.

For categorical features, OneHotEncoder is usually needed.

For numerical features, StandardScaler is often used to improve optimization.

For binary features, we can usually keep them unchanged.

Logistic Regression can also be extended to multi-class classification using One-vs-Rest or Softmax Regression.