
Data Matching

Created 2025-11-12 | Updated 2026-06-15 | da



After reading `Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection`, I gained a clearer understanding of the challenges involved in working with real-world data. The book highlights that data is often incomplete, inconsistent, and noisy, and that data matching aims to identify records that refer to the same real-world entity under these imperfect conditions. Rather than focusing on a single algorithm, the book presents data matching as a structured process with five stages: data preprocessing, indexing, similarity comparison, match classification, and evaluation.
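To make the five stages concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the book: the toy records, the ground-truth pairs, the blocking key (city plus first letter of the name), the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (record id, name, city).
RECORDS = [
    (1, "John  Smith", "Sydney"),
    (2, "john smith", "sydney"),
    (3, "Jon Smith", "Sydney"),
    (4, "Mary Jones", "Perth"),
]
# Hypothetical ground truth: pairs of ids that refer to the same person.
TRUE_MATCHES = {(1, 2), (1, 3), (2, 3)}


def preprocess(record):
    """Stage 1: normalise case and whitespace."""
    rid, name, city = record
    return rid, " ".join(name.lower().split()), city.lower().strip()


def block_key(record):
    """Stage 2 (indexing): only records sharing this key are compared."""
    _, name, city = record
    return city, name[0]


def similarity(a, b):
    """Stage 3: string similarity of the two names, in [0, 1]."""
    return SequenceMatcher(None, a[1], b[1]).ratio()


def match(records, threshold=0.8):
    """Stages 1 to 4: return the set of id pairs classified as matches."""
    blocks = defaultdict(list)
    for record in map(preprocess, records):
        blocks[block_key(record)].append(record)

    predicted = set()
    for group in blocks.values():
        for a, b in combinations(group, 2):
            # Stage 4: a simple threshold rule as the classifier.
            if similarity(a, b) >= threshold:
                predicted.add((min(a[0], b[0]), max(a[0], b[0])))
    return predicted


def evaluate(predicted, truth):
    """Stage 5: precision and recall over matched pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


predicted = match(RECORDS)
precision, recall = evaluate(predicted, TRUE_MATCHES)
print(sorted(predicted), precision, recall)
```

Blocking is what keeps the comparison step tractable: without it, every record would be compared with every other record, which grows quadratically with the number of records.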

Entity Resolution

“(Almost) All of Entity Resolution” provides a comprehensive literature review of the entity resolution problem, which aims to systematically identify and merge multiple records that refer to the same real-world entity across noisy and heterogeneous data sources. The paper traces the history of the field from early probabilistic linkage methods in the mid-20th century to modern probabilistic and supervised machine learning approaches. It discusses deterministic and similarity-based techniques, extensions of classical probabilistic frameworks, clustering-based resolution methods, and recent advances in uncertainty quantification. Through practical examples spanning census data, human rights records, citation networks, and medical data, it highlights both the theoretical and practical challenges of entity resolution.
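One simple way to picture the step from pairwise matching to resolved entities is to treat matched pairs as edges of a graph and take its connected components, so that if A matches B and B matches C, all three end up in one entity. The sketch below does this with a small union-find structure. It is an illustrative assumption, not necessarily one of the clustering methods the paper reviews, and the function name `cluster_matches` and the example ids are hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record ids into entities via transitive closure of matches."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    # Sort clusters by their smallest id for a stable result.
    return sorted(clusters.values(), key=min)


# Hypothetical pairwise decisions: 1~2, 2~3, 4~5.
entities = cluster_matches([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(entities)
```

Transitive closure is easy to implement but can chain unrelated records together through a single false match, which is one reason more careful clustering approaches exist.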

Author: Chris Wen

Link: https://wenyupeng.github.io/2025/11/11/big_data/da/09-des-statistics/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
