Outliers

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.notebook_repr_html", False) # disable "rich" output
plt.style.use("seaborn")  # plot style; on matplotlib >= 3.6 this alias is "seaborn-v0_8"

Unidimensional Data

x = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/blobs2.txt")
plt.subplot(121)
sns.boxplot(data=x, orient="h")
plt.subplot(122)
sns.histplot(x, binwidth=1)
plt.show()
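
The boxplot marks the observations falling outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] (Tukey's rule, seaborn's default whisker setting) as potential outliers. A minimal sketch of that rule applied directly to x:

q1, q3 = np.percentile(x, [25, 75])          # lower and upper quartiles
iqr = q3 - q1                                # interquartile range
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences
x[(x < low) | (x > high)]                    # points flagged by the 1.5*IQR rule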

Multidimensional Data

x = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/blobs1.txt", delimiter=",")
plt.subplot(221)
sns.boxplot(data=x[:, 0], orient="h")
plt.subplot(222)
sns.histplot(x[:, 0], bins=20)
plt.title("x[:, 0]")
plt.subplot(223)
sns.boxplot(data=x[:, 1], orient="h")
plt.subplot(224)
sns.histplot(x[:, 1], bins=20)
plt.title("x[:, 1]")
plt.show()

plt.scatter(x[:, 0], x[:, 1])
plt.axis("equal")
plt.show()

x[-8:, :]  # the last eight observations
import scipy.spatial
t = scipy.spatial.KDTree(x)
n = t.query_ball_tree(t, 0.2)  # for each point, the indices of all points within radius epsilon=0.2
c = np.array([len(e) for e in n])  # neighbourhood sizes (each point counts itself)
c[[0, 1, -2, -1]]  # preview
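
For a dataset of this size, the same counts can be double-checked against the full pairwise distance matrix (a brute-force equivalent of the KDTree query; inefficient, but easy to verify):

import scipy.spatial.distance
D = scipy.spatial.distance.cdist(x, x)
c2 = (D <= 0.2).sum(axis=1)   # neighbours within the radius, each point counting itself
np.all(c2 == c)               # should agree with the KDTree-based counts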
plt.scatter(x[c > 1, 0], x[c > 1, 1], label="normal point")
plt.scatter(x[c == 1, 0], x[c == 1, 1], marker="v", label="outlier")
plt.axis("equal")
plt.legend()
plt.show()

Distance-based outlier detection

K-nearest neighbour (KNN)

[1] Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392–403, New York, USA, September 1998.
Read Section 3.1 of the paper

This paper proposes an intuitive, general distance-based definition of outliers, in which a data point's degree of being an outlier is measured by its distance to its k-th nearest neighbour, and the n points with the largest such distances are flagged as outliers. To cope with the high computational cost on large datasets, the authors design an efficient partition-based pruning algorithm that discards, in advance, partitions that cannot contain outliers, which substantially reduces the computation. The reported experiments show that the method performs well and scales to large, high-dimensional datasets, making it suitable for applications such as database cleaning, fraud detection, network intrusion detection, and IoT sensor data analysis.

However, the method still depends on the choice of the parameters n and k, may suffer from the concentration of distances in high-dimensional spaces, and is sensitive to data standardization and to the chosen distance metric. Overall, the paper is a pioneering contribution to outlier detection, offering both theoretical significance and practical applicability.
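
The kNN-distance scoring idea described above can be sketched in a few lines with scikit-learn's NearestNeighbors, applied to the blobs1 data loaded earlier. This is a brute-force illustration of the scoring rule only, not the pruning algorithm from the paper; k and n are illustrative choices, not values taken from it.

import sklearn.neighbors
k, n = 5, 8                                        # illustrative parameter choices
nn = sklearn.neighbors.NearestNeighbors(n_neighbors=k + 1).fit(x)
dist, _ = nn.kneighbors(x)                         # column 0 is each point's (zero) distance to itself
score = dist[:, -1]                                # distance to the k-th nearest neighbour
outliers = x[np.argsort(score)[-n:], :]            # the n points with the largest scores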

Density-based outlier detection

Local outlier factor (LOF) algorithm

Read Definition 3 of the paper

The LOF (Local Outlier Factor) algorithm is a density-based method for detecting local outliers. It measures how anomalous each data point is by comparing its local density with that of its neighbors. Specifically, it computes the local reachability density (lrd) of each point and defines the LOF value as the ratio of the average lrd of the point's neighbors to the point's own lrd. An LOF value close to 1 indicates that the point has a density similar to its neighbors and is considered normal, while a value substantially larger than 1 indicates that the point lies in a locally sparse region and is a potential outlier. The algorithm is particularly effective for detecting local outliers in datasets with varying density or irregularly shaped clusters.

The advantages of LOF include its independence from predefined clusters or distributional assumptions, its continuous outlier scores, and its adaptability to complex data structures. However, it is sensitive to the choice of the neighborhood parameter k, requires an empirically chosen threshold on the LOF values, and has relatively high computational complexity, which makes it less efficient for large-scale, high-dimensional, or streaming data. LOF is widely applied in anomaly detection, data cleaning, IoT sensor data analysis, and network traffic monitoring.


Read Page 6 of the paper

Method Category | Core Idea | Main Algorithms / Representative Methods | Advantages | Limitations / Notes | Suitable Scenarios
Statistical-based | Assume data follows a certain distribution; detect outliers based on probability or deviation | Z-score, Grubbs Test, Mixture Models | Simple and intuitive; works well for low-dimensional, known-distribution data | Depends on distribution assumption; less effective in high dimensions | Data quality checking, low-dimensional sensor data
Distance / Nearest Neighbor | Points far from others are likely outliers | k-NN Outlier, Distance-based Outlier | Distribution-free; intuitive | High computational cost in high-dimensional or large datasets; needs parameter k | IoT, RFID, trajectory anomaly detection
Density-based | Compare local density of a point with its neighbors | LOF (Local Outlier Factor), DBSCAN Outlier | Detects local outliers; handles varying density | Sensitive to parameter k; computationally intensive | Local outlier detection, datasets with varying density
Clustering-based | Outliers do not belong to major clusters or belong to small clusters | K-means Outlier, DBSCAN Outlier | Can detect isolated points outside clusters | Sensitive to clustering quality; hard with noise or high-dimensional data | Data exploration, cluster-based anomaly detection
Learning / ML-based | Use supervised, semi-supervised, or unsupervised models to predict outliers | One-class SVM, Isolation Forest, Autoencoder | Handles high-dimensional and complex data | Requires training; parameter tuning can be complex | Network intrusion detection, anomaly pattern recognition
Hybrid / Ensemble | Combine multiple methods to improve robustness | Feature Bagging + LOF, Ensemble k-NN | Improves detection accuracy; reduces false positives | High complexity; high computational cost | High-dimensional, complex, or large-scale data


Python's scikit-learn (aka sklearn) library has a page about LOF.
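
A minimal sketch using that scikit-learn implementation on the blobs1 data from above; the neighbourhood size n_neighbors is an illustrative choice, not a recommended value.

import sklearn.neighbors
lof = sklearn.neighbors.LocalOutlierFactor(n_neighbors=20)  # illustrative neighbourhood size
labels = lof.fit_predict(x)               # -1 marks the flagged points, 1 the inliers
scores = -lof.negative_outlier_factor_    # LOF values; much greater than 1 suggests a local outlier
x[labels == -1, :]                        # the flagged observations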

Clustering-based outlier detection

Modified k-means and minimum spanning tree

Read Algorithms 3.1 and 3.2 of the paper

This paper proposes an outlier detection method based on a two-phase clustering process combined with a minimum spanning tree (MST). In the first phase, a modified k-means clustering is applied in which points far from the existing cluster centers are promoted to new cluster centers, so that isolated or sparse outliers can form independent clusters. In the second phase, an MST is constructed over the cluster centers and its longest edges are removed, splitting the tree into subtrees; points in small or isolated subtrees are identified as outliers. The approach leverages both clustering and graph structure, detects small numbers of sparsely distributed outliers effectively, and offers good interpretability.

The method’s advantages include independence from data distribution assumptions and the ability to detect isolated/sparse outliers, making it suitable for static and multi-modal datasets. Its limitations are sensitivity to parameters (such as initial k and distance thresholds), higher computational cost for high-dimensional or large-scale data, and reduced effectiveness for dense anomalous clusters or streaming data. The method is particularly suitable for scenarios like IoT, RFID, and sensor trajectory data where outliers are sparse and batch processing is feasible.
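
A rough sketch of the two-phase idea (cluster, then cut the longest MST edge between cluster centers), applied to the blobs1 data loaded earlier. The number of clusters, the single-edge cut, and the "small subtree" threshold are illustrative assumptions, not the paper's rules.

import scipy.cluster.vq
import scipy.sparse.csgraph
import scipy.spatial.distance

# phase 1: k-means with a deliberately generous number of clusters, so that
# isolated points can end up in their own small clusters (illustrative k)
k = 10
C, l = scipy.cluster.vq.kmeans2(x, k)

# phase 2: minimum spanning tree over the cluster centers
D = scipy.spatial.distance.cdist(C, C)
mst = scipy.sparse.csgraph.minimum_spanning_tree(D).toarray()

# remove the longest MST edge (illustrative: a single cut) to split the tree
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0

# centers in small detached subtrees -- and the points assigned to them -- are treated as outlying
n_comp, comp = scipy.sparse.csgraph.connected_components(mst, directed=False)
small = np.flatnonzero(np.bincount(comp) <= 1)   # subtrees containing a single center
outliers = x[np.isin(comp[l], small), :]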


Density-based spatial clustering of applications with noise (DBSCAN)

Read Section 7.1 of the paper

Method Category | Core Idea | Representative Methods / Algorithms | Advantages | Limitations / Notes | Suitable Scenarios
Statistical-based | Assume data follows a certain distribution; detect outliers based on deviation or probability | Z-score, Grubbs Test, Mixture Models | Simple, intuitive, works well for low-dimensional, known-distribution data | Assumes data distribution; less effective in high dimensions or unknown distribution | Low-dimensional data, data cleaning, error detection
Distance / Nearest-Neighbor-based | Points far from most others are potential outliers | k-NN Outlier, Distance-based Outlier | Distribution-free; intuitive; works for moderate dimensions | Computationally expensive for high-dimensional/large data; sensitive to distance metric and parameter k | IoT, RFID, trajectory anomaly detection, moderate-size datasets
Density-based | Compare local density of a point with its neighbors; low-density points are outliers | LOF (Local Outlier Factor), DBSCAN Outlier | Detects local outliers; handles clusters with varying density or shape | Sensitive to parameters (neighborhood size, density threshold); computationally intensive; not ideal for streaming/high-dimensional data | Local anomaly detection, multi-modal or irregular cluster data
Clustering-based | Outliers do not belong to main clusters or belong to small/sparse clusters | K-means Outlier, DBSCAN Outlier | Easy to interpret; effective if cluster structure exists | Sensitive to clustering quality; affected by noise; less effective in high-dimensional or complex data | Data with clear cluster structure; batch/static data; exploratory analysis
Learning-based / Model-based | Use supervised, semi-supervised, or unsupervised learning to identify outliers | One-class SVM, Isolation Forest, Autoencoder | Handles high-dimensional, complex, and unstructured data; scalable for large datasets | Requires training; parameter tuning; interpretability may be low; sensitive to training data | Network intrusion detection, financial fraud, complex sensor/IoT data
Ensemble / Hybrid Methods | Combine multiple methods to improve robustness and accuracy | Feature Bagging + LOF, Ensemble k-NN | Leverages strengths of multiple methods; reduces false positives | Complex to implement; computationally expensive; interpretation may be harder | High-dimensional, large-scale, or complex datasets; critical systems requiring high reliability

Wikipedia has a page about DBSCAN
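
DBSCAN labels points that belong to no cluster as noise, which can serve directly as an outlier flag. A minimal sketch using scikit-learn on the blobs1 data from above; eps and min_samples are illustrative choices.

import sklearn.cluster
db = sklearn.cluster.DBSCAN(eps=0.5, min_samples=5).fit(x)  # illustrative parameters
x[db.labels_ == -1, :]   # points labelled -1 belong to no cluster (noise), i.e., potential outliers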

Applications: Cluster Analysis (**)

import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("seaborn")  # plot style; on matplotlib >= 3.6 this alias is "seaborn-v0_8"
pd.set_option("display.notebook_repr_html", False) # disable "rich" output
np.random.seed(123) # initialise the pseudorandom number generator

Euclidean Distance

X = np.array([
    [0, 0],
    [1, 0],
    [-1.5, 1],
    [1, 1]
])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, X)
D
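
Each entry of D is the Euclidean distance between the corresponding pair of rows of X; for example, the entry for the first and third point can be recomputed directly from the definition:

np.sqrt(np.sum((X[0, :] - X[2, :])**2))   # should equal D[0, 2]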
plt.plot(X[:, 0], X[:, 1], "ko")
for i in range(X.shape[0]-1):
    for j in range(i+1, X.shape[0]):
        plt.plot(X[[i, j], 0], X[[i, j], 1], "k-", alpha=0.2)
        plt.text(np.mean(X[[i, j], 0]), np.mean(X[[i, j], 1]), np.round(D[i, j], 2))
plt.show()

Centroids

c = np.mean(X, axis=0)
c
np.sqrt(np.mean(scipy.spatial.distance.cdist(X, c.reshape(1, -1))**2))  # root-mean-square distance to the centroid
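
The same quantity, the root-mean-square distance of the points to their centroid, can also be computed without cdist, directly from the definition:

np.sqrt(np.mean(np.sum((X - c)**2, axis=1)))   # should match the value above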

Clustering with k-Means

X = np.loadtxt("https://raw.githubusercontent.com/gagolews/teaching-data/master/marek/blobs1.txt", delimiter=",")

import scipy.cluster.vq
C, l = scipy.cluster.vq.kmeans2(X, 2)

my_colours = np.array(["lightcoral", "cornflowerblue"])
plt.scatter(X[:, 0], X[:, 1], c=my_colours[l])
plt.plot(C[:, 0], C[:, 1], "kX")
plt.axis("equal")
plt.show()
np.bincount(l)  # or, e.g., pd.Series(l).value_counts()

Xl = pd.DataFrame(dict(X1=X[:, 0], X2=X[:, 1], l=l))
Xl.sample(5, random_state=42) # some randomly chosen rows

Xl.groupby("l").mean()  # per-cluster means (compare with the centers C)

np.argmin(scipy.spatial.distance.cdist(X, C), axis=1)  # index of the nearest center for each point
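
Tying this back to the Centroids section, the dispersion of each cluster around its own center can be computed in the same way as before:

for j in range(C.shape[0]):
    Xj = X[l == j, :]   # the points assigned to cluster j
    print(j, np.sqrt(np.mean(np.sum((Xj - C[j, :])**2, axis=1))))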