KNN Algorithms

K-Nearest Neighbors (KNN) is a simple, non-parametric, and lazy learning algorithm used for classification and regression tasks.

Working Principle

The KNN algorithm works by finding the K nearest neighbors to a given data point and making predictions based on the majority class (for classification) or the average value (for regression) of those neighbors.
KNN handles both kinds of prediction with the same neighbor search; only the final aggregation step differs.

Classification: Count how many of the K nearest neighbors belong to each class and assign the class with the highest count to the new data point.
Regression: Compute the average target value of the K nearest neighbors and assign that value to the new data point.

For example, let’s say we have a dataset of houses and we want to predict the price of a new house based on its features. We can use the KNN algorithm to find the K nearest neighbors to the new house and then use those neighbors’ prices to make a prediction.
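
The two prediction rules can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements KNN: it assumes a brute-force search with Euclidean distance, and the function name `knn_predict` and the house data (area in square meters, number of bedrooms, price in thousands) are made up for the example.

```python
import math
from collections import Counter
from statistics import fmean


def knn_predict(train_X, train_y, query, k, task="classification"):
    """Brute-force KNN: majority vote or mean over the k nearest points."""
    # Pair each training point's Euclidean distance to the query with its target.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only and keep the targets of the k closest points.
    nearest = [y for _, y in sorted(distances, key=lambda pair: pair[0])[:k]]
    if task == "classification":
        # most_common(1) returns [(label, count)] for the most frequent label.
        return Counter(nearest).most_common(1)[0][0]
    return fmean(nearest)


# Classification on a toy dataset: the 3 nearest neighbors of [1, 1] are all class 0.
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
y = [0, 0, 0, 1, 1, 1]
label = knn_predict(X, y, [1, 1], k=3)

# Regression on hypothetical houses: [area_sqm, bedrooms] -> price in thousands.
houses = [[50, 1], [60, 2], [80, 2], [100, 3], [120, 4]]
prices = [150, 180, 240, 300, 360]
price = knn_predict(houses, prices, [85, 2], k=2, task="regression")

print(label)  # 0
print(price)  # 270.0
```

In the house data the area column dominates the distance because its values are much larger than the bedroom counts, which is one reason KNN usually benefits from feature scaling.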

Key Parameters

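K: The number of nearest neighbors to consider for making predictions. A small K gives a flexible model that follows the training data closely but is sensitive to noise (low bias, high variance), while a large K gives a smoother decision boundary that can blur genuine class boundaries (high bias, low variance).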
Distance Metric: The method used to calculate the distance between data points. Common distance metrics include Euclidean, Manhattan, and Minkowski distances.
Weighting Scheme: The method used to calculate the weights of the K nearest neighbors. Common weighting schemes include uniform, distance-based, and kernel-based weighting.
Initialization: KNN has no initialization step. It is a lazy learner with no trainable weights, so fitting only stores the training data. Random or k-means initialization applies to algorithms such as K-Means clustering, not to KNN.
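
To make the distance metric and weighting scheme concrete, here is a small sketch under stated assumptions. The helper names `minkowski_distance` and `weighted_vote` are hypothetical, and inverse-distance weighting with a small epsilon is just one simple way to implement distance-based weights.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def weighted_vote(neighbors, weighting="uniform"):
    """Pick a class from (distance, label) pairs of the K nearest neighbors."""
    scores = {}
    for distance, label in neighbors:
        # Uniform: every neighbor counts once. Distance: closer neighbors count more.
        # The small epsilon avoids division by zero for an exact match (an assumption
        # of this sketch; libraries may handle zero distances differently).
        weight = 1.0 if weighting == "uniform" else 1.0 / (distance + 1e-9)
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)


a, b = (0, 0), (3, 4)
manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
print(manhattan)  # 7.0
print(euclidean)  # 5.0

# One very close neighbor of class "A" and two distant neighbors of class "B".
neighbors = [(0.1, "A"), (2.0, "B"), (2.5, "B")]
print(weighted_vote(neighbors, "uniform"))   # B
print(weighted_vote(neighbors, "distance"))  # A
```

With uniform weights the two distant "B" neighbors outvote the single close "A" neighbor; with distance weights the close neighbor wins.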

Advantages and Disadvantages

Advantages:
- Simple and easy to understand.
- Can handle both classification and regression tasks.
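- Has no explicit training phase; fitting only stores the training data.
- The same neighbor search can also be used in unsupervised settings, for example through scikit-learn's NearestNeighbors class.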
Disadvantages:
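- Prediction can be computationally expensive for large datasets, because each query is compared against the stored training points.
- May not perform well if the dataset is imbalanced, since the majority class tends to dominate the vote.
- May not perform well on high-dimensional data, where distances become less informative (the curse of dimensionality).
- Can be sensitive to the choice of K, the distance metric, and the scale of the features.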
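
Because results depend on K, one common way to choose it is to compare candidate values on held-out data. The sketch below uses leave-one-out evaluation on a made-up 1-D dataset; the function names and the data are assumptions for illustration, and in practice a cross-validation utility from a library would normally be used instead.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Predict each point from all the others and return the fraction correct."""
    correct = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += knn_classify(rest, point, k) == label
    return correct / len(data)


# Hypothetical 1-D data: two clusters plus one mislabeled point at 2.5.
data = [([1.0], 0), ([2.0], 0), ([3.0], 0), ([4.0], 0), ([2.5], 1),
        ([7.0], 1), ([8.0], 1), ([9.0], 1), ([10.0], 1)]

# Odd values of K avoid tied votes when there are two classes.
scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)  # 3
```

With K=1 the single mislabeled point misleads its neighbors, while K=3 and K=5 outvote it, so the larger values score higher on this toy data.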

API

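from sklearn.neighbors import KNeighborsClassifier

# Use the 3 nearest neighbors, each with an equal vote
knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')

# Toy training data: six 2-D points and their class labels
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
y = [0, 0, 0, 1, 1, 1]

# Fitting stores the training data for later neighbor searches
knn.fit(X, y)

# Predict the class of a new point; its 3 nearest neighbors are all class 0
print(knn.predict([[1, 1]]))  # [0]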

Normalization vs Standardization

Feature scaling is an important preprocessing step in machine learning. Two commonly used methods are normalization and standardization. Although they both aim to put features on a comparable scale, they are used in slightly different situations.

1. Normalization

Definition

Normalization rescales data into a fixed range, commonly [0, 1] or [-1, 1].
For scaling data into the range [0, 1], the formula is:

x' = (x - x_min) / (x_max - x_min)

For scaling data into the range [-1, 1], the formula is:

x' = 2 × (x - x_min) / (x_max - x_min) - 1

In both formulas:

x      = original value
x'     = normalized value
x_min  = minimum value in the feature
x_max  = maximum value in the feature
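
Both formulas are special cases of mapping the minimum to a lower bound and the maximum to an upper bound. The sketch below shows that general form on a single feature; the function name `min_max_scale` and its `low` and `high` parameters are hypothetical, and raising an error for a constant feature is a choice made for this example.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot scale a constant feature: x_max equals x_min")
    return [low + (high - low) * (x - x_min) / (x_max - x_min) for x in values]


feature = [2, 3, 1, 2]
unit_range = min_max_scale(feature)
symmetric_range = min_max_scale(feature, -1.0, 1.0)
print(unit_range)       # [0.5, 1.0, 0.0, 0.5]
print(symmetric_range)  # [0.0, 1.0, -1.0, 0.0]
```

With low = -1 and high = 1 the expression reduces to 2 × (x - x_min) / (x_max - x_min) - 1, matching the second formula.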

Purpose

Normalization is useful for the following reasons:

Remove scale differences: Different features may have very different units or ranges. For example, height may be measured in meters, while weight may be measured in kilograms. Normalization helps prevent features with larger ranges from dominating the model.
Speed up model convergence: For optimization algorithms such as gradient descent, features on a similar scale can make the optimization process smoother and faster.
Support scale-sensitive models: Some machine learning models are sensitive to the range of input data. Examples include:
- Neural Networks
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
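
The effect of scale on a distance-based model can be seen with the height and weight example. The sketch below uses four made-up people and hypothetical helper names; it compares which person is nearest to person A before and after min-max scaling.

```python
import math


def min_max_columns(rows):
    """Min-max scale each column of a list of rows into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]


def nearest_to_first(rows, names):
    """Return the name of the row closest (Euclidean) to the first row."""
    query = rows[0]
    distances = {name: math.dist(query, row) for name, row in zip(names[1:], rows[1:])}
    return min(distances, key=distances.get)


# Hypothetical people: [height in meters, weight in kilograms].
names = ["A", "B", "C", "D"]
people = [[1.60, 70], [1.95, 72], [1.62, 74], [1.80, 90]]

raw_nearest = nearest_to_first(people, names)
scaled_nearest = nearest_to_first(min_max_columns(people), names)
print(raw_nearest)     # B
print(scaled_nearest)  # C
```

On the raw values the weight column dominates, so B looks closest to A despite being 35 cm taller. After scaling, both features contribute comparably and C becomes the nearest neighbor.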

When to Use Normalization

Normalization does not change the shape of the original data distribution, but it is sensitive to outliers. It is usually a good choice when:

- The data has clear boundaries
- The model is sensitive to input ranges
- The feature values need to be restricted to a fixed range

from sklearn.preprocessing import MinMaxScaler

X = [[2, 1], [3, 1], [1, 4], [2, 6]]

# Rescale each feature (column) into the range [-1, 1]
X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_scaled)

2. Standardization

Definition

Standardization transforms data so that it has a mean of 0 and a standard deviation of 1.
The formula is:

x' = (x - μ) / σ

where:

x  = original value
x' = standardized value
μ  = mean of the feature
σ  = standard deviation of the feature

The mean and the (population) standard deviation over n samples are:

μ = (x1 + x2 + ... + xn) / n
σ = sqrt(((x1 - μ)^2 + (x2 - μ)^2 + ... + (xn - μ)^2) / n)
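
The formulas can be checked on a small feature with the standard library. This is a minimal sketch: the function name `standardize` is hypothetical, and it uses the population standard deviation (dividing by n) to match the formula above, which is also the convention scikit-learn's StandardScaler follows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (divide by n)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature: sigma is 0")
    return [(x - mu) / sigma for x in values]


feature = [2, 3, 1, 2]
z = standardize(feature)
print([round(v, 4) for v in z])  # [0.0, 1.4142, -1.4142, 0.0]
print(round(fmean(z), 4))        # 0.0
print(round(pstdev(z), 4))       # 1.0
```

Here μ = 2 and σ = sqrt(0.5) ≈ 0.7071, so the value 3 becomes (3 - 2) / 0.7071 ≈ 1.4142.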

Purpose

Standardization is useful for the following reasons:

Adapt to the data distribution: Standardization shifts each feature to mean 0 and rescales it to standard deviation 1. It does not change the shape of the distribution, so skewed data stays skewed, but it suits models that are sensitive to feature variance or that work best with centered, roughly normally distributed inputs.
Examples include:

- Linear Regression
- Logistic Regression
- Principal Component Analysis (PCA)
- Some clustering algorithms

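Stabilize model training: Standardization is usually less sensitive to outliers than normalization, because the mean and standard deviation depend on all values rather than only on the minimum and maximum. Outliers still shift both statistics, so the benefit applies mainly when the data contains mild outliers.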
Unify feature scales: Similar to normalization, standardization also removes scale differences between features. However, instead of forcing data into a fixed range, it focuses on the statistical distribution of the data.

When to Use Standardization

Standardization is generally more widely used in machine learning, especially when:

- The data distribution is unknown
- The data may contain mild outliers
- The model depends on feature scale
- The model assumes or benefits from normally distributed data

It is commonly used for:

- Linear models
- Logistic regression
- SVM
- PCA
- K-Means clustering
- Many traditional machine learning algorithms

from sklearn.preprocessing import StandardScaler

X = [[2, 1], [3, 1], [1, 4], [2, 6]]

# Standardize each feature (column) to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled)
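
Since KNN is scale-sensitive, scaling and the classifier are often combined so that the scaler's statistics are learned from the training data only. The sketch below shows one way to do this with scikit-learn's `make_pipeline`; the height and weight data and the labels are made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [height in meters, weight in kilograms] -> class label.
X_train = [[1.60, 50], [1.65, 55], [1.70, 60], [1.80, 80], [1.85, 85], [1.90, 90]]
y_train = [0, 0, 0, 1, 1, 1]

# The pipeline fits the scaler on the training data only, then reuses the same
# mean and standard deviation to transform any data passed to predict().
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

prediction = model.predict([[1.68, 58]])
print(prediction)  # [0]
```

Keeping the scaler inside the pipeline avoids computing the mean and standard deviation from data the model is later evaluated on.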