Outliers

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.notebook_repr_html", False) # disable "rich" output
plt.style.use("seaborn")  # plot style; on matplotlib >= 3.6 this alias is "seaborn-v0_8"

Unidimensional Data

x = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/blobs2.txt")
plt.subplot(121)
sns.boxplot(data=x, orient="h")
plt.subplot(122)
sns.histplot(x, binwidth=1)
plt.show()
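
The boxplot marks the observations falling outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] (Tukey's rule, seaborn's default whisker setting) as potential outliers. A minimal sketch of that rule applied directly to x:

q1, q3 = np.percentile(x, [25, 75])          # lower and upper quartiles
iqr = q3 - q1                                # interquartile range
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences
x[(x < low) | (x > high)]                    # points flagged by the 1.5*IQR rule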

Multidimensional Data

x = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/blobs1.txt", delimiter=",")
plt.subplot(221)
sns.boxplot(data=x[:, 0], orient="h")
plt.subplot(222)
sns.histplot(x[:, 0], bins=20)
plt.title("x[:, 0]")
plt.subplot(223)
sns.boxplot(data=x[:, 1], orient="h")
plt.subplot(224)
sns.histplot(x[:, 1], bins=20)
plt.title("x[:, 1]")
plt.show()

plt.scatter(x[:, 0], x[:, 1])
plt.axis("equal")
plt.show()

x[-8:, :]  # the last eight observations
import scipy.spatial
t = scipy.spatial.KDTree(x)
n = t.query_ball_tree(t, 0.2)  # for each point, the indices of all points within radius epsilon=0.2
c = np.array([len(e) for e in n])  # neighbourhood sizes (each point counts itself)
c[[0, 1, -2, -1]]  # preview
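
For a dataset of this size, the same counts can be double-checked against the full pairwise distance matrix (a brute-force equivalent of the KDTree query; inefficient, but easy to verify):

import scipy.spatial.distance
D = scipy.spatial.distance.cdist(x, x)
c2 = (D <= 0.2).sum(axis=1)   # neighbours within the radius, each point counting itself
np.all(c2 == c)               # should agree with the KDTree-based counts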
plt.scatter(x[c > 1, 0], x[c > 1, 1], label="normal point")
plt.scatter(x[c == 1, 0], x[c == 1, 1], marker="v", label="outlier")
plt.axis("equal")
plt.legend()
plt.show()

Distance-based outlier detection

K-nearest neighbour (KNN)

[1] Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392–403, New York, USA, September 1998.
Read Section 3.1 of the paper

This paper proposes an intuitive, general distance-based definition of outliers, in which a data point's degree of being an outlier is measured by its distance to its k-th nearest neighbour, and the n points with the largest such distances are flagged as outliers. To cope with the high computational cost on large datasets, the authors design an efficient partition-based pruning algorithm that discards, in advance, partitions that cannot contain outliers, which substantially reduces the computation. The reported experiments show that the method performs well and scales to large, high-dimensional datasets, making it suitable for applications such as database cleaning, fraud detection, network intrusion detection, and IoT sensor data analysis.

However, the method still depends on the choice of the parameters n and k, may suffer from the concentration of distances in high-dimensional spaces, and is sensitive to data standardization and to the chosen distance metric. Overall, the paper is a pioneering contribution to outlier detection, offering both theoretical significance and practical applicability.
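
The kNN-distance scoring idea described above can be sketched in a few lines with scikit-learn's NearestNeighbors, applied to the blobs1 data loaded earlier. This is a brute-force illustration of the scoring rule only, not the pruning algorithm from the paper; k and n are illustrative choices, not values taken from it.

import sklearn.neighbors
k, n = 5, 8                                        # illustrative parameter choices
nn = sklearn.neighbors.NearestNeighbors(n_neighbors=k + 1).fit(x)
dist, _ = nn.kneighbors(x)                         # column 0 is each point's (zero) distance to itself
score = dist[:, -1]                                # distance to the k-th nearest neighbour
outliers = x[np.argsort(score)[-n:], :]            # the n points with the largest scores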

Density-based outlier detection

Local outlier factor (LOF) algorithm

Read Definition 3 of the paper

The LOF (Local Outlier Factor) algorithm is a density-based method for detecting local outliers. It measures how anomalous each data point is by comparing its local density with that of its neighbors. Specifically, it computes the local reachability density (lrd) of each point and defines the LOF value as the ratio of the average lrd of the point's neighbors to the point's own lrd. An LOF value close to 1 indicates that the point has a density similar to its neighbors and is considered normal, while a value substantially larger than 1 indicates that the point lies in a locally sparse region and is a potential outlier. The algorithm is particularly effective for detecting local outliers in datasets with varying density or irregularly shaped clusters.

The advantages of LOF include its independence from predefined clusters or distributional assumptions, its continuous outlier scores, and its adaptability to complex data structures. However, it is sensitive to the choice of the neighborhood parameter k, requires an empirically chosen threshold on the LOF values, and has relatively high computational complexity, which makes it less efficient for large-scale, high-dimensional, or streaming data. LOF is widely applied in anomaly detection, data cleaning, IoT sensor data analysis, and network traffic monitoring.


Read Page 6 of the paper

Method Category | Core Idea | Main Algorithms / Representative Methods | Advantages | Limitations / Notes | Suitable Scenarios
Statistical-based | Assume data follows a certain distribution; detect outliers based on probability or deviation | Z-score, Grubbs Test, Mixture Models | Simple and intuitive; works well for low-dimensional, known-distribution data | Depends on distribution assumption; less effective in high dimensions | Data quality checking, low-dimensional sensor data
Distance / Nearest Neighbor | Points far from others are likely outliers | k-NN Outlier, Distance-based Outlier | Distribution-free; intuitive | High computational cost in high-dimensional or large datasets; needs parameter k | IoT, RFID, trajectory anomaly detection
Density-based | Compare local density of a point with its neighbors | LOF (Local Outlier Factor), DBSCAN Outlier | Detects local outliers; handles varying density | Sensitive to parameter k; computationally intensive | Local outlier detection, datasets with varying density
Clustering-based | Outliers do not belong to major clusters or belong to small clusters | K-means Outlier, DBSCAN Outlier | Can detect isolated points outside clusters | Sensitive to clustering quality; hard with noise or high-dimensional data | Data exploration, cluster-based anomaly detection
Learning / ML-based | Use supervised, semi-supervised, or unsupervised models to predict outliers | One-class SVM, Isolation Forest, Autoencoder | Handles high-dimensional and complex data | Requires training; parameter tuning can be complex | Network intrusion detection, anomaly pattern recognition
Hybrid / Ensemble | Combine multiple methods to improve robustness | Feature Bagging + LOF, Ensemble k-NN | Improves detection accuracy; reduces false positives | High complexity; high computational cost | High-dimensional, complex, or large-scale data


Python's scikit-learn (aka sklearn) library has a page about LOF.
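
A minimal sketch using that scikit-learn implementation on the blobs1 data from above; the neighbourhood size n_neighbors is an illustrative choice, not a recommended value.

import sklearn.neighbors
lof = sklearn.neighbors.LocalOutlierFactor(n_neighbors=20)  # illustrative neighbourhood size
labels = lof.fit_predict(x)               # -1 marks the flagged points, 1 the inliers
scores = -lof.negative_outlier_factor_    # LOF values; much greater than 1 suggests a local outlier
x[labels == -1, :]                        # the flagged observations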

Clustering-based outlier detection

Modified k-means and minimum spanning tree

Read Algorithms 3.1 and 3.2 of the paper

This paper proposes an outlier detection method based on a two-phase clustering process combined with a minimum spanning tree (MST). In the first phase, a modified k-means clustering is applied in which points far from the existing cluster centers are promoted to new cluster centers, so that isolated or sparse outliers can form independent clusters. In the second phase, an MST is constructed over the cluster centers and its longest edges are removed, splitting the tree into subtrees; points in small or isolated subtrees are identified as outliers. The approach leverages both clustering and graph structure, detects small numbers of sparsely distributed outliers effectively, and offers good interpretability.

The method’s advantages include independence from data distribution assumptions and the ability to detect isolated/sparse outliers, making it suitable for static and multi-modal datasets. Its limitations are sensitivity to parameters (such as initial k and distance thresholds), higher computational cost for high-dimensional or large-scale data, and reduced effectiveness for dense anomalous clusters or streaming data. The method is particularly suitable for scenarios like IoT, RFID, and sensor trajectory data where outliers are sparse and batch processing is feasible.
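
A rough sketch of the two-phase idea (cluster, then cut the longest MST edge between cluster centers), applied to the blobs1 data loaded earlier. The number of clusters, the single-edge cut, and the "small subtree" threshold are illustrative assumptions, not the paper's rules.

import scipy.cluster.vq
import scipy.sparse.csgraph
import scipy.spatial.distance

# phase 1: k-means with a deliberately generous number of clusters, so that
# isolated points can end up in their own small clusters (illustrative k)
k = 10
C, l = scipy.cluster.vq.kmeans2(x, k)

# phase 2: minimum spanning tree over the cluster centers
D = scipy.spatial.distance.cdist(C, C)
mst = scipy.sparse.csgraph.minimum_spanning_tree(D).toarray()

# remove the longest MST edge (illustrative: a single cut) to split the tree
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0

# centers in small detached subtrees -- and the points assigned to them -- are treated as outlying
n_comp, comp = scipy.sparse.csgraph.connected_components(mst, directed=False)
small = np.flatnonzero(np.bincount(comp) <= 1)   # subtrees containing a single center
outliers = x[np.isin(comp[l], small), :]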


Density-based spatial clustering of applications with noise (DBSCAN)

Read Section 7.1 of the paper

Method Category | Core Idea | Representative Methods / Algorithms | Advantages | Limitations / Notes | Suitable Scenarios
Statistical-based | Assume data follows a certain distribution; detect outliers based on deviation or probability | Z-score, Grubbs Test, Mixture Models | Simple, intuitive, works well for low-dimensional, known-distribution data | Assumes data distribution; less effective in high dimensions or unknown distribution | Low-dimensional data, data cleaning, error detection
Distance / Nearest-Neighbor-based | Points far from most others are potential outliers | k-NN Outlier, Distance-based Outlier | Distribution-free; intuitive; works for moderate dimensions | Computationally expensive for high-dimensional/large data; sensitive to distance metric and parameter k | IoT, RFID, trajectory anomaly detection, moderate-size datasets
Density-based | Compare local density of a point with its neighbors; low-density points are outliers | LOF (Local Outlier Factor), DBSCAN Outlier | Detects local outliers; handles clusters with varying density or shape | Sensitive to parameters (neighborhood size, density threshold); computationally intensive; not ideal for streaming/high-dimensional data | Local anomaly detection, multi-modal or irregular cluster data
Clustering-based | Outliers do not belong to main clusters or belong to small/sparse clusters | K-means Outlier, DBSCAN Outlier | Easy to interpret; effective if cluster structure exists | Sensitive to clustering quality; affected by noise; less effective in high-dimensional or complex data | Data with clear cluster structure; batch/static data; exploratory analysis
Learning-based / Model-based | Use supervised, semi-supervised, or unsupervised learning to identify outliers | One-class SVM, Isolation Forest, Autoencoder | Handles high-dimensional, complex, and unstructured data; scalable for large datasets | Requires training; parameter tuning; interpretability may be low; sensitive to training data | Network intrusion detection, financial fraud, complex sensor/IoT data
Ensemble / Hybrid Methods | Combine multiple methods to improve robustness and accuracy | Feature Bagging + LOF, Ensemble k-NN | Leverages strengths of multiple methods; reduces false positives | Complex to implement; computationally expensive; interpretation may be harder | High-dimensional, large-scale, or complex datasets; critical systems requiring high reliability

Wikipedia has a page about DBSCAN
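
DBSCAN labels points that belong to no cluster as noise, which can serve directly as an outlier flag. A minimal sketch using scikit-learn on the blobs1 data from above; eps and min_samples are illustrative choices.

import sklearn.cluster
db = sklearn.cluster.DBSCAN(eps=0.5, min_samples=5).fit(x)  # illustrative parameters
x[db.labels_ == -1, :]   # points labelled -1 belong to no cluster (noise), i.e., potential outliers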

Applications: Cluster Analysis (**)

import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("seaborn")  # plot style; on matplotlib >= 3.6 this alias is "seaborn-v0_8"
pd.set_option("display.notebook_repr_html", False) # disable "rich" output
np.random.seed(123) # initialise the pseudorandom number generator

Euclidean Distance

X = np.array([
    [0, 0],
    [1, 0],
    [-1.5, 1],
    [1, 1]
])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, X)
D
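
Each entry of D is the Euclidean distance between the corresponding pair of rows of X; for example, the entry for the first and third point can be recomputed directly from the definition:

np.sqrt(np.sum((X[0, :] - X[2, :])**2))   # should equal D[0, 2]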
plt.plot(X[:, 0], X[:, 1], "ko")
for i in range(X.shape[0]-1):
    for j in range(i+1, X.shape[0]):
        plt.plot(X[[i, j], 0], X[[i, j], 1], "k-", alpha=0.2)
        plt.text(np.mean(X[[i, j], 0]), np.mean(X[[i, j], 1]), np.round(D[i, j], 2))
plt.show()

Centroids

c = np.mean(X, axis=0)
c
np.sqrt(np.mean(scipy.spatial.distance.cdist(X, c.reshape(1, -1))**2))  # root-mean-square distance to the centroid
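
The same quantity, the root-mean-square distance of the points to their centroid, can also be computed without cdist, directly from the definition:

np.sqrt(np.mean(np.sum((X - c)**2, axis=1)))   # should match the value above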

Clustering with k-Means

X = np.loadtxt("https://raw.githubusercontent.com/gagolews/teaching-data/master/marek/blobs1.txt", delimiter=",")

import scipy.cluster.vq
C, l = scipy.cluster.vq.kmeans2(X, 2)

my_colours = np.array(["lightcoral", "cornflowerblue"])
plt.scatter(X[:, 0], X[:, 1], c=my_colours[l])
plt.plot(C[:, 0], C[:, 1], "kX")
plt.axis("equal")
plt.show()
np.bincount(l)  # or, e.g., pd.Series(l).value_counts()

Xl = pd.DataFrame(dict(X1=X[:, 0], X2=X[:, 1], l=l))
Xl.sample(5, random_state=42) # some randomly chosen rows

Xl.groupby("l").mean()  # per-cluster means (compare with the centers C)

np.argmin(scipy.spatial.distance.cdist(X, C), axis=1)  # index of the nearest center for each point
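
Tying this back to the Centroids section, the dispersion of each cluster around its own center can be computed in the same way as before:

for j in range(C.shape[0]):
    Xj = X[l == j, :]   # the points assigned to cluster j
    print(j, np.sqrt(np.mean(np.sum((Xj - C[j, :])**2, axis=1))))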