TSPFeatureSelector implements the Top-Scoring Pairs (TSP) algorithm for feature selection in gene expression data, with efficient computation using Numba.
For more details on the TSP algorithm, see:
- Top-scoring pairs for feature selection
- Applications to cancer outcome prediction
- A Pairwise Feature Selection Method For Gene Data Using Information Gain
- A Robust and Efficient Feature Selection Algorithm for Microarray Data
- The Robust Classification Model Based on Combinatorial Features
This implementation is under development and requires validation.
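As a rough illustration of what a TSP delta score measures, the sketch below computes it for a single gene pair with plain NumPy. It assumes the standard definition, the absolute difference between the two classes in the frequency of the ordering x_i < x_j. The function name `tsp_pair_score` and the toy data are hypothetical and are not part of this package, whose Numba implementation may differ in details such as tie handling.

```python
import numpy as np


def tsp_pair_score(x_i, x_j, labels):
    """Delta score for one gene pair (hypothetical helper, not the package API).

    Returns |P(x_i < x_j | label True) - P(x_i < x_j | label False)|.
    """
    labels = np.asarray(labels, dtype=bool)
    # Per-sample indicator of the ordering x_i < x_j
    less = np.asarray(x_i) < np.asarray(x_j)
    p_true = less[labels].mean()    # ordering frequency in class True
    p_false = less[~labels].mean()  # ordering frequency in class False
    return abs(p_true - p_false)


# Toy data: six samples, the first three belong to class True
labels = np.array([True, True, True, False, False, False])
gene_a = np.array([1.0, 2.0, 1.5, 5.0, 6.0, 0.5])
gene_b = np.array([3.0, 4.0, 2.5, 1.0, 2.0, 1.0])

score = tsp_pair_score(gene_a, gene_b, labels)
print(score)  # about 0.667: the ordering holds in 3/3 True samples and 1/3 False samples
```

A score near 1 means the relative ordering of the two genes almost always flips between classes, while a score near 0 means the ordering carries little class information.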
Clone the repository and install with Poetry:
git clone https://github.com/odinokov/tsp-feature-selector.git
cd tsp-feature-selector
poetry install
from tspfs import TSPFeatureSelector
import numpy as np
import pandas as pd
# Simulate gene expression data
num_genes = 2000
num_samples = 50
np.random.seed(42)
gene_expression = np.random.rand(num_genes, num_samples)
class_labels = np.array([True] * 25 + [False] * 25)
assert np.all(np.isin(class_labels, [True, False]))
# Simulate up/down-regulated genes for both classes
gene_expression[1, class_labels] *= 10   # Gene 1: upregulated in class 1 (True)
gene_expression[2, class_labels] /= 10   # Gene 2: downregulated in class 1 (True)
gene_expression[3, ~class_labels] /= 10  # Gene 3: downregulated in class 0 (False)
gene_expression[4, ~class_labels] *= 10  # Gene 4: upregulated in class 0 (False)
# Select top 100 gene pairs
selector = TSPFeatureSelector(top_n=100)
selector.fit(gene_expression, class_labels)
# Retrieve the top selected pairs and their scores
top_pairs = selector.get_top_pairs()
top_scores = selector.get_top_scores()
# Calculate ANOVA-based scores
anova_scores = selector.calculate_anova_scores(gene_expression, class_labels)
# Create a DataFrame to combine the data
df = pd.DataFrame({
"Top Gene Pairs": [f"({i[0]}, {i[1]})" for i in top_pairs],
"Top Delta Scores": top_scores,
"ANOVA-based Scores": anova_scores
})
print(df.head(21))
#     Top Gene Pairs  Top Delta Scores  ANOVA-based Scores
#  0          (1, 4)              0.88            2.576376
#  1          (2, 3)              0.84            0.074278
#  2        (4, 484)              0.76            1.259738
#  3       (1, 1462)              0.76            0.048990
#  4        (4, 488)              0.72            0.698892
#  5       (4, 1310)              0.72            1.191759
#  6        (4, 463)              0.72            1.415432
#  7       (4, 1149)              0.72            1.559436
#  8       (4, 1539)              0.72            1.726177
#  9        (4, 447)              0.72            1.871513
# 10       (1, 1935)              0.72            1.506169
# 11       (4, 1566)              0.72            1.670635
# 12       (4, 1948)              0.72            1.777047
# 13        (2, 962)              0.72            1.996294
# 14       (4, 1522)              0.72            2.089045
# 15        (4, 972)              0.72            2.168370
# 16       (4, 1449)              0.72            2.250204
# 17       (2, 1389)              0.72            2.407223
# 18         (4, 57)              0.72            2.478221
# 19       (4, 1203)              0.72            2.517881
# 20        (4, 184)              0.72            2.551559
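A selected pair can also serve as a simple decision rule: predict the class from which of the two genes is more highly expressed in a sample. The sketch below shows one way this might look with plain NumPy. The helpers `fit_pair_rule` and `predict_pair_rule` and the toy data are hypothetical illustrations and are not provided by this package.

```python
import numpy as np


def fit_pair_rule(x_i, x_j, labels):
    """Learn the rule direction for one pair (hypothetical helper).

    Returns True if the ordering x_i < x_j is at least as frequent in
    class True as in class False, so that ordering votes for class True.
    """
    labels = np.asarray(labels, dtype=bool)
    less = np.asarray(x_i) < np.asarray(x_j)
    return bool(less[labels].mean() >= less[~labels].mean())


def predict_pair_rule(x_i, x_j, less_means_true):
    """Predict boolean class labels for new samples from the learned direction."""
    less = np.asarray(x_i) < np.asarray(x_j)
    return less if less_means_true else ~less


# Toy training data: six samples, the first three belong to class True
labels = np.array([True, True, True, False, False, False])
gene_a = np.array([1.0, 2.0, 1.5, 5.0, 6.0, 0.5])
gene_b = np.array([3.0, 4.0, 2.5, 1.0, 2.0, 1.0])
direction = fit_pair_rule(gene_a, gene_b, labels)

# Two new samples: gene_a < gene_b in the first, gene_a > gene_b in the second
preds = predict_pair_rule(np.array([0.2, 9.0]), np.array([1.0, 3.0]), direction)
print(preds)
```

Because the rule depends only on the relative ordering within each sample, it is unchanged by any monotone per-sample rescaling of the expression values.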
MIT License.