A Python package implementing the Top-Scoring Pairs (TSP) algorithm for feature selection (demo).

odinokov/tsp-feature-selector

TSP Feature Selector

TSPFeatureSelector implements the Top-Scoring Pairs (TSP) algorithm for feature selection in gene expression data, with efficient computation using Numba.
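As a rough illustration of what a TSP score measures, the sketch below computes the standard pairwise score, delta(i, j) = |P(X_i < X_j | class 1) - P(X_i < X_j | class 0)|, with plain NumPy. It is a brute-force reference under stated assumptions, not this package's Numba implementation: the function names `tsp_delta_scores` and `top_pairs` are hypothetical, and the input layout (genes in rows, samples in columns, boolean labels) is taken from the usage example in this README.

```python
import numpy as np


def tsp_delta_scores(X, y):
    """Return an (n_genes, n_genes) matrix of TSP delta scores.

    X: array of shape (n_genes, n_samples); y: boolean array of shape (n_samples,).
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    # less[i, j, s] is True when gene i is expressed below gene j in sample s.
    less = X[:, None, :] < X[None, :, :]
    p_class1 = less[:, :, y].mean(axis=2)   # P(X_i < X_j | label True)
    p_class0 = less[:, :, ~y].mean(axis=2)  # P(X_i < X_j | label False)
    return np.abs(p_class1 - p_class0)


def top_pairs(X, y, top_n=5):
    """Return the top_n (i, j, score) triples with i < j, highest score first."""
    delta = tsp_delta_scores(X, y)
    i_idx, j_idx = np.triu_indices(delta.shape[0], k=1)
    scores = delta[i_idx, j_idx]
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [(int(i_idx[k]), int(j_idx[k]), float(scores[k])) for k in order]


# Toy data: gene 0 is below gene 1 in every class-1 sample and above it in
# every class-0 sample, so the pair (0, 1) gets the maximum score of 1.0.
X_toy = np.array([
    [1, 1, 1, 9, 9, 9],
    [5, 5, 5, 5, 5, 5],
    [20, 20, 20, 20, 20, 20],
    [30, 30, 30, 30, 30, 30],
], dtype=float)
y_toy = np.array([True, True, True, False, False, False])

print(top_pairs(X_toy, y_toy, top_n=1))  # [(0, 1, 1.0)]
```

The intermediate boolean array has shape (n_genes, n_genes, n_samples), so this version only suits small inputs; avoiding that cost is presumably what the Numba-based computation is for.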

For more details on the TSP algorithm:

Status: Work in Progress

This implementation is under development and requires validation.

Installation

Clone the repository and install with Poetry:

git clone https://github.com/odinokov/tsp-feature-selector.git
cd tsp-feature-selector
poetry install

Usage

from tspfs import TSPFeatureSelector
import numpy as np
import pandas as pd

# Simulate gene expression data
num_genes = 2000
num_samples = 50
np.random.seed(42)

gene_expression = np.random.rand(num_genes, num_samples)
class_labels = np.array([True] * 25 + [False] * 25)

assert np.all(np.isin(class_labels, [True, False]))

# Simulate up/down-regulated genes for both classes
# (class 1 = samples labelled True, class 0 = samples labelled False)
gene_expression[1, class_labels] *= 10   # Upregulated in class 1
gene_expression[2, class_labels] /= 10   # Downregulated in class 1
gene_expression[3, ~class_labels] /= 10  # Downregulated in class 0
gene_expression[4, ~class_labels] *= 10  # Upregulated in class 0

# Select top 100 gene pairs
selector = TSPFeatureSelector(top_n=100)
selector.fit(gene_expression, class_labels)

# Retrieve and print the top selected pairs and their scores
top_pairs = selector.get_top_pairs()
top_scores = selector.get_top_scores()
# Calculate ANOVA-based scores
anova_scores = selector.calculate_anova_scores(gene_expression, class_labels)

# Create a DataFrame to combine the data
df = pd.DataFrame({
    "Top Gene Pairs": [f"({i[0]}, {i[1]})" for i in top_pairs],
    "Top Delta Scores": top_scores,
    "ANOVA-based Scores": anova_scores
})

print(df.head(21))

# Top Gene Pairs  Top Delta Scores  ANOVA-based Scores
# 0          (1, 4)              0.88            2.576376
# 1          (2, 3)              0.84            0.074278
# 2        (4, 484)              0.76            1.259738
# 3       (1, 1462)              0.76            0.048990
# 4        (4, 488)              0.72            0.698892
# 5       (4, 1310)              0.72            1.191759
# 6        (4, 463)              0.72            1.415432
# 7       (4, 1149)              0.72            1.559436
# 8       (4, 1539)              0.72            1.726177
# 9        (4, 447)              0.72            1.871513
# 10      (1, 1935)              0.72            1.506169
# 11      (4, 1566)              0.72            1.670635
# 12      (4, 1948)              0.72            1.777047
# 13       (2, 962)              0.72            1.996294
# 14      (4, 1522)              0.72            2.089045
# 15       (4, 972)              0.72            2.168370
# 16      (4, 1449)              0.72            2.250204
# 17      (2, 1389)              0.72            2.407223
# 18        (4, 57)              0.72            2.478221
# 19      (4, 1203)              0.72            2.517881
# 20       (4, 184)              0.72            2.551559
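One common way to use selected pairs downstream is to turn each pair into a binary feature that records whether the first gene is expressed below the second in a given sample. The sketch below assumes the pairs are `(i, j)` row indices into a genes-by-samples matrix, as the output above suggests; `pairs_to_features` is a hypothetical helper and not part of this package.

```python
import numpy as np


def pairs_to_features(X, pairs):
    """Build a (n_samples, n_pairs) 0/1 matrix from gene index pairs.

    X: array of shape (n_genes, n_samples); pairs: iterable of (i, j) indices.
    Entry [s, k] is 1 when gene i_k is expressed below gene j_k in sample s.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X)
    pairs = np.asarray(list(pairs), dtype=int).reshape(-1, 2)
    first, second = pairs[:, 0], pairs[:, 1]
    return (X[first, :] < X[second, :]).T.astype(np.int8)


# Toy data: 3 genes, 2 samples.
X_small = np.array([
    [1.0, 9.0],
    [5.0, 5.0],
    [7.0, 2.0],
])
features = pairs_to_features(X_small, [(0, 1), (1, 2)])
print(features.shape)  # (2, 2)
```

With the objects from the usage example, a call such as `pairs_to_features(gene_expression, selector.get_top_pairs())` would give one row per sample, assuming `get_top_pairs()` yields index pairs in that form.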

License

MIT License.
