For a more formal analysis of the discussed modified MDS algorithm, please refer to the `statistical_analysis_of_distribution_aware_MDS.pdf` file in the repository.
This implementation addresses a fundamental challenge in applying Multidimensional Scaling (MDS) to large datasets by introducing a distribution-aware sampling approach. The key innovation lies in maintaining statistical fidelity while reducing computational complexity through intelligent sampling.
The methodology follows a three-tier approach (a conceptual sketch follows the list):
- Distribution-aware sampling with uniform stratification
- Preconditioned mini-batch processing
- Bin-discretized stratification with distribution matching
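At a high level, the three tiers could be wired together as in the following sketch. The `distribution_aware_mds` function and the `fit_sample` method name are illustrative placeholders rather than the repository's actual API:

```python
from sklearn.manifold import MDS

def distribution_aware_mds(X, sampler, n_samples=5000, n_components=2):
    # Tiers 1 and 3: bin-stratified, distribution-matched sampling down to a tractable size
    # (`fit_sample` is a hypothetical method name standing in for the sampler's API)
    X_sample = sampler.fit_sample(X, n_samples=n_samples)
    # Tier 2: run MDS (optionally in preconditioned mini-batches) on the reduced subset
    mds = MDS(n_components=n_components, n_jobs=-1, max_iter=300, eps=1e-3)
    return mds.fit_transform(X_sample)
```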
My implementation uses the `DistributionAwareStratifiedSampler` class, which has several notable features:
```python
from sklearn.preprocessing import KBinsDiscretizer

class DistributionAwareStratifiedSampler:
    def __init__(self, n_bins=5):
        # Quantile binning yields roughly equally populated strata
        self.kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
```
The sampler employs:
- Quantile-based binning for stratification
- Distribution fitting per bin using multiple candidate distributions
- Kolmogorov-Smirnov testing for distribution validation
The implementation tests multiple distributions (Gamma, Log-normal, Normal, Beta) per bin (a sketch of this fitting loop follows the list):
- Uses KS statistics for goodness-of-fit testing
- Implements fallback mechanisms for bins with poor fits
- Maintains original data characteristics through oversample-and-match strategy
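A hedged sketch of how the per-bin fitting, KS-based selection, fallback, and oversample-and-match steps could look using `scipy.stats`. The `fit_bin`/`sample_bin` helper names, the KS-statistic threshold, and the 3x oversampling factor are illustrative assumptions rather than the repository's exact code; `values` is assumed to be a 1-D NumPy array holding one bin's feature values:

```python
import numpy as np
from scipy import stats

# Candidate families tried for every bin (names match scipy.stats distributions)
CANDIDATES = {'gamma': stats.gamma, 'lognorm': stats.lognorm,
              'norm': stats.norm, 'beta': stats.beta}

def fit_bin(values, ks_threshold=0.1):
    """Fit each candidate to one bin's values and keep the best fit by KS statistic."""
    best = (None, None, np.inf)
    for name, dist in CANDIDATES.items():
        try:
            params = dist.fit(values)
            ks_stat, _ = stats.kstest(values, name, args=params)
        except Exception:
            continue  # e.g. beta.fit on data far outside (0, 1)
        if ks_stat < best[2]:
            best = (name, params, ks_stat)
    name, params, ks_stat = best
    if name is None or ks_stat > ks_threshold:
        return None  # poor fit: caller falls back to directly subsampling the bin
    return name, params

def sample_bin(values, n, fit, rng=None):
    """Oversample from the fitted distribution, then match back to original points."""
    rng = rng or np.random.default_rng(0)
    if fit is None:
        return rng.choice(values, size=n, replace=True)    # fallback path
    name, params = fit
    draws = CANDIDATES[name].rvs(*params, size=3 * n)      # oversample 3x
    nearest = np.abs(values[None, :] - draws[:, None]).argmin(axis=1)
    return values[nearest[:n]]                             # matched original values
```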
Several optimization strategies are employed (an example configuration follows the list):
- Multi-core execution through `n_jobs=-1`
- Memory-efficient sampling
- Early stopping in MDS iterations (`max_iter=300`)
- Epsilon-based convergence (`eps=1e-3`)
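For reference, these knobs correspond directly to parameters of `sklearn.manifold.MDS`; the snippet below is one plausible configuration, with `X_sample` standing in for the stratified subset:

```python
from sklearn.manifold import MDS

# MDS configured with the parameters listed above: parallel initialisations,
# a capped iteration count, and an epsilon-based convergence tolerance.
mds = MDS(
    n_components=2,
    n_jobs=-1,      # use all available cores for the parallel initialisations
    max_iter=300,   # stop early rather than running SMACOF to exhaustion
    eps=1e-3,       # relative convergence tolerance on the stress value
    random_state=42,
)
embedding = mds.fit_transform(X_sample)  # X_sample: the stratified subset
```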
The histogram comparison provided in `full_vs_sampled_dataset.png` shows (a plotting sketch follows the list):
- Strong preservation of overall distribution shape
- Slight variations in tail regions
- Maintained multi-modal characteristics
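A minimal sketch of how such a comparison figure could be produced; `X_full` and `X_sample` are assumed to be 1-D arrays (or a single feature column) from the full dataset and the stratified subset respectively:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
# Density-normalised histograms make the two differently sized sets comparable
ax.hist(X_full, bins=50, density=True, alpha=0.5, label='Full dataset')
ax.hist(X_sample, bins=50, density=True, alpha=0.5, label='Stratified sample (5,000)')
ax.set_xlabel('Feature value')
ax.set_ylabel('Density')
ax.legend()
fig.savefig('full_vs_sampled_dataset.png', dpi=150)
```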
The implementation achieves efficiency through:
- Reduced sample size (5,000 samples)
- Stratified processing
- Optimized MDS parameters
- **Statistical Robustness**
- Distribution-aware sampling preserves data characteristics
- Multiple distribution families accommodate various data types
- KS testing provides statistical validation
- **Computational Efficiency**
- Reduced memory footprint through intelligent sampling
- Parallelized processing where possible
- Optimized MDS parameters for faster convergence
- **Implementation Flexibility**
- Modular design allows for easy modification
- Fallback mechanisms ensure robustness
- Configurable parameters for different use cases
- **Limitations**
- Fixed number of bins might not be optimal for all datasets
- Limited to specific distribution families
- Potential loss of local structure in sampling
- **Alternative Manifold Learning Methods** (sketch below)
- Local Linear Embedding (LLE) for preserving local structure
- t-SNE for better visualization of clusters
- UMAP for faster processing and better global structure preservation
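These alternatives can be dropped in on the same stratified subset with very little code. The sketch below uses scikit-learn's LLE and t-SNE plus the third-party `umap-learn` package; none of this is part of the current implementation, and `X_sample` again stands in for the stratified subset:

```python
from sklearn.manifold import TSNE, LocallyLinearEmbedding

# LLE preserves local neighbourhood structure
lle_embedding = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X_sample)

# t-SNE tends to give clearer cluster separation in the 2-D view
tsne_embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_sample)

# UMAP (requires `pip install umap-learn`) is typically faster and
# preserves more global structure than t-SNE
import umap
umap_embedding = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X_sample)
```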
- **Enhanced Distribution Metrics** (sketch below)
- Wasserstein distance for distribution comparison
- KL divergence for information-theoretic analysis
- Enhanced KS testing with multiple statistics
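Both metrics are readily available in SciPy. A sketch using `scipy.stats.wasserstein_distance` and a histogram-based KL divergence via `scipy.stats.entropy`; the shared binning and the 1-D `X_full`/`X_sample` arrays are assumptions for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance, entropy

# Wasserstein (earth mover's) distance works directly on the raw samples
w_dist = wasserstein_distance(X_full, X_sample)

# KL divergence needs a common discretisation; compare histogram estimates
bins = np.histogram_bin_edges(X_full, bins=50)
p, _ = np.histogram(X_full, bins=bins, density=True)
q, _ = np.histogram(X_sample, bins=bins, density=True)
eps = 1e-12                       # avoid division by zero in empty bins
kl = entropy(p + eps, q + eps)    # KL(P || Q)
```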
- **Computational Optimizations** (adaptive bin sizing sketched below)
- Adaptive bin sizing based on data characteristics
- GPU acceleration for distance calculations
- Incremental updating for streaming data
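Adaptive bin sizing, for example, could replace the fixed `n_bins=5` with a data-driven rule. A sketch using the Freedman-Diaconis rule via NumPy; the clamping range and the `X_full` variable are assumptions:

```python
import numpy as np

def adaptive_n_bins(values, min_bins=3, max_bins=20):
    """Choose a bin count from the Freedman-Diaconis rule, clamped to a sane range."""
    edges = np.histogram_bin_edges(values, bins='fd')   # FD rule: width ~ 2*IQR*n^(-1/3)
    return int(np.clip(len(edges) - 1, min_bins, max_bins))

sampler = DistributionAwareStratifiedSampler(n_bins=adaptive_n_bins(X_full))
```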
- **Statistical Enhancements** (mixture-model sketch below)
- Mixture model fitting for complex distributions
- Adaptive distribution family selection
- Enhanced validation metrics
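Mixture model fitting could be slotted in as another candidate family per bin. A sketch using scikit-learn's `GaussianMixture` with BIC-based component selection; the component range and the `bin_values` variable are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture(values, max_components=5):
    """Fit Gaussian mixtures of increasing size and keep the best model by BIC."""
    X = np.asarray(values).reshape(-1, 1)
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, max_components + 1)]
    return min(models, key=lambda m: m.bic(X))

gmm = fit_mixture(bin_values)          # bin_values: one stratum's feature values
synthetic, _ = gmm.sample(1000)        # draw synthetic points for oversample-and-match
```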
- **Immediate Improvements**
- Implement adaptive bin sizing
- Add Wasserstein distance metrics
- Include more distribution families
- **Long-term Development**
- Explore UMAP for large-scale applications
- Implement GPU acceleration
- Develop streaming data capabilities
This development represents a significant advancement in applying MDS to large datasets while maintaining distribution awareness. The combination of intelligent sampling, distribution matching, and computational optimization creates a robust framework for dimensionality reduction. The suggested future directions, particularly the exploration of alternative manifold learning methods and enhanced distribution metrics, could further improve the methodology's effectiveness and applicability.
The formulation of this problem and the inspiration for solving it come from Assignment #1 in the Optimization Methods course at the Department of Computer Science, University of Crete. I would like to thank Prof. Tsagkatakis for providing the foundational exercises and challenges that shaped the approach taken in this project.
You can find more of Prof. Tsagkatakis' work and course resources on his GitHub repository:
Prof. Tsagkatakis - Optimization Methods 2024