import numpy as np
import scipy.spatial

t = scipy.spatial.KDTree(x)              # x: the data matrix, defined earlier
n = t.query_ball_tree(t, 0.2)            # neighbors within radius epsilon = 0.2
c = np.array([len(e) for e in n])        # neighborhood size for each point
c[[0, 1, -2, -1]]                        # preview: first two and last two counts
[1] Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392–403, New York, USA, September 1998. Read Section 3.1 of the paper.
This paper proposes an intuitive and general distance-based definition of outliers, in which a point's "degree of being an outlier" is measured by its distance to its k-th nearest neighbor, and the n points with the largest such distances are reported as outliers. To address the high computational cost on large datasets, the authors design an efficient partition-based pruning algorithm that excludes, in advance, partitions that cannot contain outliers, substantially reducing the computation. Experimental results show that the method performs well and scales effectively to large and high-dimensional datasets, making it suitable for applications such as database cleaning, fraud detection, network intrusion detection, and IoT sensor data analysis.
However, the method still depends on the choice of the parameters n and k, may suffer from distance concentration in high-dimensional spaces, and is sensitive to data standardization and the distance metric. Overall, this is pioneering work in outlier detection, with both theoretical significance and practical applicability.
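To make the k-th-nearest-neighbor score concrete, here is a minimal sketch of the basic (unpruned) computation, assuming x is the data matrix from above; the values k = 5 and n_top = 10 are illustrative, and the paper's partition-based pruning is omitted.

import numpy as np
from scipy.spatial import KDTree

k, n_top = 5, 10                       # illustrative choices for k and n
tree = KDTree(x)
dist, _ = tree.query(x, k=k + 1)       # k+1 because each point is its own nearest neighbor
dk = dist[:, -1]                       # distance to the k-th nearest neighbor
outliers = np.argsort(dk)[-n_top:]     # the n points with the largest distances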
The LOF (Local Outlier Factor) algorithm is a density-based local outlier detection method. It measures how anomalous each data point is by comparing its local density with that of its neighbors. Specifically, it computes the local reachability density (lrd) of each point and then defines the LOF value as the ratio of the average density of the point's neighbors to its own density. An LOF value close to 1 indicates that the point has a density similar to its neighbors and is considered normal, while a value significantly larger than 1 indicates local sparsity and a potential outlier. The algorithm is particularly effective at detecting local outliers in datasets with varying density or irregular cluster shapes.
The advantages of LOF include its independence from predefined clusters or distribution assumptions, its continuous outlier scores, and its strong adaptability to complex data structures. However, it is sensitive to the choice of the neighborhood parameter k, requires an empirically chosen threshold on the LOF values, and has relatively high computational complexity, making it less efficient for large-scale, high-dimensional, or streaming data. LOF is widely applied in anomaly detection, data cleaning, IoT sensor data analysis, and network traffic monitoring.
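The lrd/LOF ratio described above can be sketched directly with the same KDTree machinery; this is a minimal illustration, assuming x is the data matrix and k = 10 is an arbitrary neighborhood size rather than a tuned value.

import numpy as np
from scipy.spatial import KDTree

k = 10                                       # illustrative neighborhood size
tree = KDTree(x)
dist, idx = tree.query(x, k=k + 1)           # column 0 is each point itself
dist, idx = dist[:, 1:], idx[:, 1:]          # keep the k true neighbors
k_dist = dist[:, -1]                         # k-distance of every point
reach = np.maximum(k_dist[idx], dist)        # reachability distance to each neighbor
lrd = 1.0 / reach.mean(axis=1)               # local reachability density
lof = lrd[idx].mean(axis=1) / lrd            # neighbors' average density over own density

Points whose lof value sits well above 1 are the locally sparse candidates described above; values near 1 mark inliers.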
This paper proposes an outlier detection method based on a two-phase clustering process combined with a minimum spanning tree (MST). In the first phase, a modified k-means clustering is applied in which points far from all existing cluster centers become new cluster centers, so that isolated or sparse outliers form clusters of their own. In the second phase, an MST is constructed over the cluster centers and its longest edges are removed to split the tree into subtrees; points belonging to small or isolated subtrees are identified as outliers. The approach combines clustering with graph structure, effectively detecting small numbers of sparsely distributed outliers while remaining well interpretable.
The method's advantages include independence from data distribution assumptions and the ability to detect isolated or sparse outliers, making it suitable for static and multi-modal datasets. Its limitations are sensitivity to parameters (such as the initial k and the distance thresholds), higher computational cost on high-dimensional or large-scale data, and reduced effectiveness for dense anomalous clusters or streaming data. It is particularly suitable for scenarios such as IoT, RFID, and sensor trajectory data, where outliers are sparse and batch processing is feasible.
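The two-phase idea can be sketched as follows; this is an illustration only, assuming x is the data matrix, with plain k-means (scipy's kmeans2) standing in for the paper's modified first phase, and with k = 20, n_cut = 3, and min_size = 3 chosen arbitrarily.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import distance_matrix
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

k, n_cut, min_size = 20, 3, 3                # illustrative parameters
# Phase 1: cluster the data (plain k-means stands in for the paper's modified variant)
centers, labels = kmeans2(x, k, minit='++', seed=0)
# Phase 2: build an MST over the cluster centers and cut its n_cut longest edges
mst = minimum_spanning_tree(distance_matrix(centers, centers)).toarray()
edges = np.argwhere(mst > 0)
longest = edges[np.argsort(mst[tuple(edges.T)])[-n_cut:]]
for i, j in longest:
    mst[i, j] = 0                            # removing an edge splits the tree
_, comp = connected_components(csr_matrix(mst), directed=False)
small = np.where(np.bincount(comp)[comp] < min_size)[0]   # centers in small subtrees
outlier_mask = np.isin(labels, small)        # points assigned to those centers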
Density-based spatial clustering of applications with noise (DBSCAN)