BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R blocking package.
When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.
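To make the scale of the problem concrete, the short sketch below counts how many record pairs an exhaustive comparison would require. The function names are illustrative only and are not part of BlockingPy.

```python
from math import comb


def dedup_pairs(n: int) -> int:
    """Number of unordered pairs among n records (deduplication)."""
    return comb(n, 2)


def linkage_pairs(n_x: int, n_y: int) -> int:
    """Number of cross pairs between two datasets (record linkage)."""
    return n_x * n_y


# The pair count grows quadratically with the number of records.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {dedup_pairs(n):,} pairs to compare")
```

Blocking aims to keep only a small fraction of these pairs as candidates.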
BlockingPy requires Python 3.10 or later. Install it with pip:
pip install blockingpy
or, for example, with Poetry:
poetry add blockingpy
You may need to run the following beforehand:
sudo apt-get install -y libmlpack-dev # on Linux
brew install mlpack # on macOS
from blockingpy.blocker import Blocker
import pandas as pd
# Example data for record linkage
x = pd.DataFrame({
"txt": [
"johnsmith",
"smithjohn",
"smiithhjohn",
"smithjohnny",
"montypython",
"pythonmonty",
"errmontypython",
"monty",
]})
y = pd.DataFrame({
"txt": [
"montypython",
"smithjohn",
"other",
]})
# Initialize blocker instance
blocker = Blocker()
# Perform blocking with the default ANN algorithm (FAISS)
block_result = blocker.block(x = x['txt'], y = y['txt'])
Printing block_result shows:
- The method used (faiss, which refers to Facebook AI Similarity Search)
- Number of blocks created (3 in this case)
- Number of columns (features) used for blocking (intersecting n-grams generated from both datasets, 17 in this example)
- Reduction ratio, i.e. how much the number of comparison pairs is reduced (here 0.8750, which means blocking reduces the number of comparisons by 87.5%).
print(block_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns used for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 1 | 3
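The column count can be checked by hand. The sketch below assumes the features are character bigrams (n-grams with n = 2); that value of n is an assumption, not something stated above, but it reproduces the 17 shared columns reported for this example. The helper names are hypothetical and not part of BlockingPy.

```python
def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def vocabulary(records: list[str], n: int = 2) -> set[str]:
    """Union of the n-grams of all records in one dataset."""
    vocab: set[str] = set()
    for record in records:
        vocab |= char_ngrams(record, n)
    return vocab


x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]

# Record linkage: only n-grams that occur in both datasets are used.
shared = vocabulary(x) & vocabulary(y)
print(len(shared))         # 17
# A single dataset on its own: all of its n-grams are used.
print(len(vocabulary(x)))  # 25
```

Under the same assumption, x on its own has 25 distinct bigrams, which matches the column count in the deduplication output further down.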
Printing block_result.result shows the results table, which contains:
- row numbers from the original data,
- block number (integers),
- distance (from the ANN algorithm).
print(block_result.result)
# x y block dist
# 0 4 0 0 0.0
# 1 1 1 1 0.0
# 2 7 2 2 6.0
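To read the table, it helps to map the row numbers back to the original strings. The sketch below does that with plain Python and also recomputes dist as a squared Euclidean distance between bigram count vectors restricted to the shared columns. That reading of dist is an assumption rather than documented behaviour, although it reproduces the values printed above. The rows are copied from the printed table, and the helper names are hypothetical.

```python
from collections import Counter

x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]

# Rows copied from the printed table: (x index, y index, block, dist)
rows = [(4, 0, 0, 0.0), (1, 1, 1, 0.0), (7, 2, 2, 6.0)]


def bigram_counts(text: str) -> Counter:
    """Count the overlapping character bigrams in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


# Assumed feature columns: bigrams that occur in both datasets.
cols_x = set().union(*(bigram_counts(s) for s in x))
cols_y = set().union(*(bigram_counts(s) for s in y))
shared = sorted(cols_x & cols_y)


def squared_euclidean(a: str, b: str) -> float:
    """Squared Euclidean distance over the shared bigram columns."""
    ca, cb = bigram_counts(a), bigram_counts(b)
    return float(sum((ca[g] - cb[g]) ** 2 for g in shared))


for xi, yi, block, dist in rows:
    recomputed = squared_euclidean(x[xi], y[yi])
    print(f"block {block}: {x[xi]!r} ~ {y[yi]!r}, "
          f"reported {dist}, recomputed {recomputed}")
```

The third pair ('monty' and 'other') shares a block only because 'other' has no closer neighbour in x. A distance threshold applied after blocking would drop such pairs, at the cost of choosing a cut-off.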
We can perform deduplication by passing only the previously created column x['txt'] to the block() method.
dedup_result = blocker.block(x = x['txt'])
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns used for blocking: 25
# Reduction ratio: 0.5714
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 3 | 2
print(dedup_result.result)
# x y block dist
# 0 0 1 0 2.0
# 1 1 2 0 2.0
# 2 1 3 0 2.0
# 3 4 5 1 2.0
# 4 4 6 1 3.0
# 5 4 7 1 6.0
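Both reduction ratios reported above can be reproduced with the usual definition: one minus the share of pairs that remain after blocking. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and not part of BlockingPy.

```python
from math import comb


def reduction_ratio_linkage(block_sizes, n_x, n_y):
    """block_sizes: (records from x, records from y) for each block."""
    kept = sum(bx * by for bx, by in block_sizes)
    return 1 - kept / (n_x * n_y)


def reduction_ratio_dedup(block_sizes, n):
    """block_sizes: number of records in each block."""
    kept = sum(comb(b, 2) for b in block_sizes)
    return 1 - kept / comb(n, 2)


# Record linkage: three blocks, each holding one record from x and one from y,
# out of 8 * 3 = 24 possible cross pairs.
rr_linkage = reduction_ratio_linkage([(1, 1)] * 3, n_x=8, n_y=3)
print(f"{rr_linkage:.4f}")  # 0.8750

# Deduplication: blocks {0, 1, 2, 3} and {4, 5, 6, 7},
# out of comb(8, 2) = 28 possible pairs.
rr_dedup = reduction_ratio_dedup([4, 4], n=8)
print(f"{rr_dedup:.4f}")  # 0.5714
```

In the deduplication case each block holds four records, so 6 + 6 = 12 of the 28 pairs remain.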
- Multiple ANN algorithms available
- Multiple distance metrics such as:
  - Euclidean
  - Cosine
  - Inner Product
  and more...
- Comprehensive algorithm parameters customization with control_ann and control_txt
- Support for already created Document-Term-Matrices (as np.ndarray or csr_matrix)
- Support for both record linkage and deduplication
- Evaluation metrics when true blocks are known
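For reference, the three distance metrics named in the list can be written out in a few lines of plain Python. This is a generic illustration of the definitions, not BlockingPy's implementation, and the vectors u and v are made-up n-gram count vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def inner_product(a, b):
    """Dot product; larger values mean more similar vectors."""
    return sum(p * q for p, q in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1 - inner_product(a, b) / norm


# Hypothetical count vectors over four n-gram columns
u = [1, 2, 0, 1]
v = [1, 0, 1, 1]

print(euclidean(u, v))
print(inner_product(u, v))  # 2
print(cosine_distance(u, v))
```

Note that the inner product is a similarity rather than a distance, so nearest neighbours under that metric are the vectors with the largest value.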
You can find detailed information about BlockingPy in the documentation.
BlockingPy is still under development; the API and features may change, and bugs or errors may occur.
BlockingPy is released under the MIT license.
BlockingPy benefits from many open-source packages such as Faiss or Annoy. For detailed information, see the third-party notice.
Please see CONTRIBUTING.md for more information.
This package is based on the R blocking package developed by BERENZ. Special thanks to the original author for his foundational work in this area.