Logistic Regression

Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.

Working Principle

The linear regression model is based on the assumption that there is a linear relationship between the dependent variable and the independent variables. The model is trained on a dataset of input-output pairs, where the input is a vector of independent variables and the output is the value of the dependent variable.

The goal of linear regression is to find the best-fitting line that minimizes the sum of the squared differences between the predicted and actual values. This line is called the regression line or the best-fit line.

Formula:

$y = b_0 + b_1x_1 + b_2x_2 + \cdots + b_nx_n$

where:

y is the dependent variable
x1, x2, ..., xn are the independent variables
b0 is the intercept, which represents the value of y when all independent variables are zero
b1 is the slope for x1, which represents the change in y for a one-unit increase in x1 with the other variables held fixed
b2 is the slope for x2, which represents the change in y for a one-unit increase in x2 with the other variables held fixed
and so on
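To make the role of each coefficient concrete, here is a minimal sketch that evaluates the formula for two features. The coefficient values and the `predict` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
b0 = 1.0                     # intercept
b = np.array([2.0, -0.5])    # slopes b1 and b2

def predict(x):
    """Return b0 + b1*x1 + b2*x2 for one feature vector x."""
    return b0 + b @ np.asarray(x, dtype=float)

base = predict([3.0, 4.0])      # 1 + 2*3 - 0.5*4 = 5.0
bumped = predict([4.0, 4.0])    # x1 increased by one unit, x2 unchanged

# The change in the prediction equals the slope b1.
print(base, bumped - base)
```

Increasing x1 by one unit while holding x2 fixed changes the prediction by exactly b1, which is what the slope definition above states.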

Key Parameters

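Intercept: The value of y when all independent variables are zero.
Slope: The change in y for a one-unit increase in one independent variable, with the others held fixed.
Weights: Another name for the slope coefficients. Their magnitudes reflect the influence of each independent variable only when the variables are on comparable scales.
Error: The difference between the predicted and actual values, also called the residual.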

Advantages and Disadvantages

Advantages:
- Simple and easy to understand; each coefficient has a direct interpretation.
- Fast to train, with a closed-form least squares solution.
- Scales to large datasets when trained with iterative methods such as gradient descent.
Disadvantages:
- Assumes a linear relationship, so it underfits data with strong nonlinear structure.
- Sensitive to outliers, because the squared error amplifies large deviations.
- Coefficients can become unstable when independent variables are highly correlated.
- The closed-form solution becomes computationally expensive when the number of features is large.

API

from sklearn.linear_model import LinearRegression

# Training data: one feature per sample
X = [[1], [2], [3], [4], [5]]
y = [2, 3, 5, 7, 11]

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X, y)

# Make predictions for new inputs
X_new = [[6], [7]]
y_pred = model.predict(X_new)

print(model.coef_)       # Coefficients (slopes)
print(model.intercept_)  # Intercept
print(y_pred)            # Predictions for X_new

Applications

Linear regression is widely used in various fields, including:

Finance: Linear regression is used to predict stock prices, interest rates, and other financial variables.
Economics: Linear regression is used to analyze the relationship between economic variables, such as GDP and unemployment rates.
Healthcare: Linear regression is used to predict patient outcomes based on various factors, such as age and medical history.

Loss Function

In linear regression, there is usually a difference between the predicted value and the true value. A loss function measures this error between model predictions and actual values. By minimizing the loss function, we can find the optimal model parameters: the coefficients at which the loss reaches its minimum are taken as the best solution.

Mean Squared Error, MSE

The most commonly used loss function in regression tasks is Mean Squared Error.

$MSE = \frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2$

Where:

$n$: number of samples
$y_i$: true value of the $i$-th sample
$f(x_i)$: predicted value of the $i$-th sample

MSE measures the average squared difference between predicted values and actual values. Minimizing MSE is also known as the Least Squares Method.

In linear regression, the least squares method tries to find a line, or a hyperplane, that minimizes the total squared error between all samples and the prediction line.
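The MSE formula above can be sketched in a few lines of NumPy. The `mse` helper and the sample values are illustrative assumptions, not part of any library.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are -1, 0 and 2, so the loss is (1 + 0 + 4) / 3.
loss = mse(y_true, y_pred)
print(loss)
```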

Characteristics of MSE

MSE is sensitive to large errors because the squared term amplifies bigger mistakes.
For example:

Error = 2 -> Squared error = 4
Error = 10 -> Squared error = 100

MSE is a convex function of the parameters in linear regression, so any local minimum is also a global minimum. The squared term also makes the loss function differentiable, which is convenient for optimization. When $X^T X$ is invertible, the analytical solution of least squares can be calculated directly using matrix operations:

$\beta = (X^T X)^{-1}X^T y$

If the error follows a normal distribution, minimizing MSE is equivalent to Maximum Likelihood Estimation.
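As a minimal sketch of the closed-form solution above, the following solves the normal equation on a small noiseless dataset. The data is made up for illustration, and a column of ones is added to X so that the first entry of the solution acts as the intercept.

```python
import numpy as np

# Illustrative noiseless data generated from y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y, which is the normal equation.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # intercept and slope, close to 1 and 2
```

Here `np.linalg.solve` is used instead of forming the inverse explicitly. This is a common choice for numerical stability and yields the same solution as the formula.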

Relationship Between MSE and Maximum Likelihood Estimation

Suppose the relationship between the dependent variable $y$ and the independent variable $x$ is:

$y_i = \beta^T x_i + \epsilon_i$

Where $\epsilon_i$ is the error term.

If the errors are independent, identically distributed, and follow a normal distribution:

$p(\epsilon_i)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right)$

Then:

$p(y_i|x_i;\beta)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\beta^T x_i)^2}{2\sigma^2}\right)$

The likelihood function is:

$L(\beta)=\prod_{i=1}^{n}p(y_i|x_i;\beta)$

The log-likelihood function is:

$\ln L(\beta) = -n\ln(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\beta^T x_i)^2$

To maximize the log-likelihood, we need to minimize:

$\sum_{i=1}^{n}(y_i-\beta^T x_i)^2$

This sum equals $n$ times the MSE, so both quantities are minimized by the same $\beta$.

Therefore, when the errors follow a normal distribution, maximizing the likelihood is equivalent to minimizing MSE.
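The equivalence can be checked numerically. The sketch below assumes synthetic data with Gaussian noise and a fixed $\sigma = 1$; the names `mse` and `log_likelihood` are illustrative helpers that follow the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x plus Gaussian noise (an assumption for this demo).
x = rng.uniform(0.0, 10.0, size=50)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, size=50)

def mse(beta):
    return np.mean((y - X @ beta) ** 2)

def log_likelihood(beta, sigma=1.0):
    """Gaussian log-likelihood, matching the expression for ln L(beta)."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return -n * np.log(np.sqrt(2.0 * np.pi) * sigma) - rss / (2.0 * sigma**2)

# Least squares estimate and an arbitrary perturbed alternative.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_other = beta_ols + np.array([0.5, -0.1])

print(mse(beta_ols) < mse(beta_other))                        # lower MSE
print(log_likelihood(beta_ols) > log_likelihood(beta_other))  # higher likelihood
```

The parameter vector with the lower MSE is also the one with the higher log-likelihood, because the two differ only by a constant and a negative scale factor.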

Mean Absolute Error, MAE

Another common regression loss function is Mean Absolute Error.

$MAE = \frac{1}{n}\sum_{i=1}^{n}|f(x_i)-y_i|$

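MAE is less sensitive to outliers than MSE because errors are penalized linearly rather than quadratically. However, the absolute value is not differentiable at zero, and its gradient has the same magnitude for small and large errors, which makes optimization less convenient. MAE is suitable when the dataset contains significant outliers, such as in financial risk prediction.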
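A small comparison illustrates the difference in outlier sensitivity. The values below are made up for illustration: the second set of predictions differs from the first only in one large error.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0])    # every error has size 1
y_pred_outlier = np.array([11.0, 11.0, 12.0, 33.0])  # last error has size 20

print(mse(y_true, y_pred_clean), mae(y_true, y_pred_clean))      # 1.0 1.0
print(mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))  # 100.75 5.75
```

A single large error raises MSE by a factor of about 100 in this example, while MAE grows by less than a factor of 6.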

Analytical Solution for Simple Linear Regression

For simple linear regression, the prediction function is:

$f(x_i)=\beta_0+\beta_1x_i$

The MSE is:

$MSE=\frac{1}{n}\sum_{i=1}^{n}(\beta_0+\beta_1x_i-y_i)^2$

To find the optimal parameters, we take partial derivatives with respect to $\beta_0$ and $\beta_1$.
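
Setting both partial derivatives to zero gives two conditions:

$\frac{\partial MSE}{\partial \beta_0}=\frac{2}{n}\sum_{i=1}^{n}(\beta_0+\beta_1x_i-y_i)=0$

$\frac{\partial MSE}{\partial \beta_1}=\frac{2}{n}\sum_{i=1}^{n}(\beta_0+\beta_1x_i-y_i)x_i=0$

The first condition rearranges to $\beta_0=\bar{y}-\beta_1\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means. Substituting this into the second condition and collecting the terms in $\beta_1$ gives the slope.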

The final solution is:

$\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$

$\beta_0 = \bar{y} - \beta_1\bar{x}$

Where:

$\bar{x}$: mean of $x$
$\bar{y}$: mean of $y$
$\beta_1$: slope
$\beta_0$: intercept
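
The closed-form expressions for $\beta_1$ and $\beta_0$ can be evaluated directly. This sketch applies them to the study-hours data used in the gradient descent example; the variable names are illustrative.

```python
import numpy as np

# Study-hours data (same values as in the gradient descent example).
x = np.array([5, 8, 10, 12, 15, 3, 7, 9, 14, 6], dtype=float)
y = np.array([55, 65, 70, 75, 85, 50, 60, 72, 80, 58], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance-style sum divided by the sum of squared deviations of x.
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through the point of means.
beta0 = y_mean - beta1 * x_mean

print(round(beta0, 4), round(beta1, 4))  # approximately 41.4507 and 2.8707
```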

Scikit-Learn API: LinearRegression

from sklearn.linear_model import LinearRegression

# fit_intercept: whether to calculate the intercept
model = LinearRegression(fit_intercept=True)

model.fit([[0, 3], [1, 2], [2, 1]], [0, 1, 2])

# coef_: coefficients
print(model.coef_)

# intercept_: intercept
print(model.intercept_)

Gradient Descent

Gradient Descent is an iterative optimization algorithm used to minimize an objective function. The main idea is to update the parameters step by step in the direction opposite to the gradient. The gradient points in the direction in which the function increases fastest, so the negative gradient is the direction in which the function decreases fastest.

The update rule is:

$\beta_{t+1} = \beta_t - \alpha \nabla J(\beta_t)$

Where:

$\beta_t$: current parameter vector
$\alpha$: learning rate
$\nabla J(\beta_t)$: gradient of the objective function
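
As a minimal sketch of the update rule, the loop below minimizes the one-dimensional function $J(b) = (b - 3)^2$, whose minimum is at $b = 3$. The function, starting point, and learning rate are assumptions chosen for illustration.

```python
def grad(b):
    """Derivative of J(b) = (b - 3)^2."""
    return 2.0 * (b - 3.0)

b = 0.0        # starting point
alpha = 0.1    # learning rate

# Repeatedly step in the direction opposite to the gradient.
for _ in range(100):
    b = b - alpha * grad(b)

print(b)  # very close to the minimizer 3
```

With this learning rate each step shrinks the distance to the minimum by a factor of 0.8, so the iterate converges quickly.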

Gradient for Linear Regression

For linear regression, the loss function can be written as:

$J(\beta)=\frac{1}{n}\lVert X\beta-y \rVert_2^2$

The gradient is:

$\nabla J(\beta)=\frac{2}{n}X^T(X\beta-y)$

Using Gradient Descent, we repeatedly update the parameters until the loss becomes very small or the maximum number of iterations is reached.
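
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation. The sketch below uses random data and a central difference with step `eps`; all names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)
n = X.shape[0]

def J(beta):
    """Loss (1/n) * ||X beta - y||^2."""
    return np.sum((X @ beta - y) ** 2) / n

def gradient(beta):
    """Analytic gradient (2/n) * X^T (X beta - y)."""
    return 2.0 / n * X.T @ (X @ beta - y)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (J(beta + eps * e) - J(beta - eps * e)) / (2.0 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(numeric - gradient(beta))))  # a very small number
```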

Gradient Descent Example

We use a small study-hours dataset:

import numpy as np

def J(beta):
    """Objective function"""
    return np.sum((X @ beta - y) ** 2, axis=0).reshape(-1, 1) / n

def gradient(beta):
    """Gradient"""
    return X.T @ (X @ beta - y) / n * 2

X = np.array([[5], [8], [10], [12], [15], [3], [7], [9], [14], [6]])
y = np.array([[55], [65], [70], [75], [85], [50], [60], [72], [80], [58]])

beta = np.array([[1], [1]])
n = X.shape[0]

# Add a column of ones for the intercept
X = np.hstack([np.ones((n, 1)), X])

alpha = 1e-2
epoch = 0

while (j := J(beta)) > 1e-10 and (epoch := epoch + 1) <= 10000:
    grad = gradient(beta)

    if epoch % 1000 == 0:
        print(f"beta={beta.reshape(-1)}\tJ={j.reshape(-1)}")

    beta = beta - alpha * grad

After enough iterations, the result approaches:

$\beta_0 \approx 41.4507$

$\beta_1 \approx 2.8707$

This is consistent with the analytical solution.

Choosing the Learning Rate

The learning rate controls the step size of each parameter update. If the learning rate is too large, the algorithm may skip the optimal solution or even diverge. If the learning rate is too small, convergence will be very slow. Adaptive optimizers such as Adam and Adagrad can dynamically adjust the learning rate to improve training efficiency.
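
The effect of the step size can be seen on the simple function $J(b) = (b - 3)^2$. The sketch below runs the same number of steps with three learning rates; the function and the specific rates are assumptions chosen for illustration.

```python
def run(alpha, steps=50, start=0.0):
    """Run gradient descent on J(b) = (b - 3)^2 and return the final b."""
    b = start
    for _ in range(steps):
        b = b - alpha * 2.0 * (b - 3.0)
    return b

too_small = run(0.001)   # barely moves away from the start
moderate = run(0.1)      # converges close to 3
too_large = run(1.1)     # overshoots more on every step and diverges

# Distance from the minimizer after 50 steps for each learning rate.
print(abs(too_small - 3.0), abs(moderate - 3.0), abs(too_large - 3.0))
```

For this function each step multiplies the distance to the minimum by $|1 - 2\alpha|$, so any learning rate above 1 makes the iterates diverge.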

Common Issues in Gradient Descent

Feature Scaling

Gradient Descent usually requires feature scaling, such as standardization or normalization. Feature scaling helps the algorithm converge faster and more smoothly.
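
One way to see why scaling helps is to look at the condition number of the Hessian of the MSE loss, $\frac{2}{n}X^T X$: the larger it is, the more elongated the loss surface and the slower plain gradient descent converges. The sketch below compares raw and standardized versions of the study-hours feature; `hessian_condition` is a hypothetical helper.

```python
import numpy as np

hours = np.array([5, 8, 10, 12, 15, 3, 7, 9, 14, 6], dtype=float)
n = hours.size

def hessian_condition(feature):
    """Condition number of (2/n) X^T X for a design matrix with an intercept column."""
    X = np.column_stack([np.ones(n), feature])
    H = 2.0 / n * X.T @ X
    return np.linalg.cond(H)

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (hours - hours.mean()) / hours.std()

cond_raw = hessian_condition(hours)
cond_scaled = hessian_condition(standardized)
print(cond_raw, cond_scaled)  # large for raw data, close to 1 after scaling
```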

Local Minima and Saddle Points

For non-convex functions, Gradient Descent may get stuck in local minima or saddle points.
Possible solutions include:

Momentum
Adaptive optimizers such as Adam
Second-order methods such as Newton’s Method

For ordinary linear regression with MSE, the loss function is convex, so there is a global minimum.

Scikit-Learn API: SGDRegressor

from sklearn.linear_model import SGDRegressor

model = SGDRegressor(
    loss="squared_error",      # Loss function, default is MSE
    fit_intercept=True,        # Whether to calculate the intercept
    learning_rate="constant",  # Learning rate strategy
    eta0=0.1,                  # Initial learning rate
    max_iter=1000,             # Maximum number of iterations
    tol=1e-8                   # Stop when the loss improvement is smaller than tol
)

model.fit([[0, 3], [1, 2], [2, 1]], [0, 1, 2])

# coef_: coefficients
print(model.coef_)

# intercept_: intercept
print(model.intercept_)

Summary

Linear regression aims to find the best parameters that minimize prediction error.

MSE is the most commonly used loss function for regression problems.

The least squares method minimizes MSE.

The Normal Equation provides an analytical solution:

$\beta = (X^T X)^{-1}X^T y$

Gradient Descent provides an iterative solution:

$\beta_{t+1} = \beta_t - \alpha \nabla J(\beta_t)$

The Normal Equation is suitable for small feature dimensions.

Gradient Descent is more suitable for large-scale datasets or high-dimensional feature spaces.

In practice, libraries such as Scikit-Learn provide convenient APIs like LinearRegression and SGDRegressor.

Understanding both the mathematical solution and the iterative optimization process helps us better understand how linear regression really works.