Fast PCA for Large Dimension and Sample

Introduction

This repository contains the R code for FADI (FAst DIstributed PCA), a method for federated data in which both the dimension $d$ and the sample size $n$ are ultra-large. FADI handles both at once by performing parallel computing along $d$ and distributed computing along $n$.

Our paper is here:

Shen, S., Lu, J. and Lin, X., 2023. FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data. arXiv preprint arXiv:2306.06857.

FADI Method Workflow

(Figure: FADI_workflow.png)

Tutorial

The R codes folder contains the R scripts for the simulation studies and for the application of FADI to the 1000 Genomes data (estimation of the principal eigenspace and inferential analysis under the degree-corrected mixed membership model).

The R scripts example_spiked_covariance.R, example_GMM.R, example_DCMM.R, and example_missing_matrix.R contain the simulation code for FADI under the spiked covariance model, the Gaussian mixture model (GMM), the degree-corrected mixed membership (DCMM) model, and the incomplete matrix inference model, respectively. The input parameters are d (dimension of the data), mc (index of the independent Monte Carlo simulation), and rt (the ratio $Lp/d$).
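As a rough illustration of how these inputs relate to each other, the sketch below sets up one replicate under a spiked covariance model. It is not taken from example_spiked_covariance.R: the values of p, n, K, sigma and lambda, the use of mc as a random seed, and the rounding of $L$ are all assumptions made for the example.

```r
# Hypothetical parameter values; the actual scripts may read them differently.
d  <- 400   # dimension of the data
mc <- 1     # index of the Monte Carlo replicate
rt <- 0.5   # ratio L * p / d

# Assumed for this sketch only (not inputs listed above).
p     <- 10    # sketching dimension
n     <- 200   # sample size
K     <- 3     # number of spikes
sigma <- 1     # noise standard deviation

set.seed(mc)               # assumption: the replicate index doubles as the RNG seed
L <- ceiling(rt * d / p)   # number of parallel sketches implied by rt = L * p / d

# Spiked covariance: Sigma = V diag(lambda) V^T + sigma^2 I
V      <- qr.Q(qr(matrix(rnorm(d * K), d, K)))   # orthonormal d x K basis
lambda <- c(30, 20, 10)                          # spike strengths (hypothetical)
Z      <- matrix(rnorm(n * K), n, K)
X      <- Z %*% diag(sqrt(lambda)) %*% t(V) + sigma * matrix(rnorm(n * d), n, d)
```

With these values, rt = 0.5 corresponds to L = 20 parallel sketches of dimension p = 10.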

The R scripts 1000g_estimation_layer_1.R and 1000g_estimation_layer_2.R contain the code for applying FADI to estimate the principal eigenspace of the 1000 Genomes data. 1000g_estimation_layer_1.R implements Step 1 of FADI, with input parameters i (index of the distributed data split), l (index of the parallel sketch), and p (dimension of the fast sketch). 1000g_estimation_layer_2.R implements Step 2 of FADI, with input parameters l (index of the parallel sketch) and p (dimension of the fast sketch).
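The two layers can be pictured with a small toy example. The code below is a minimal sketch, not the content of the 1000g_estimation scripts: it assumes that Step 1 multiplies each split's contribution to the sample covariance by a Gaussian test matrix shared across splits, and that Step 2 sums these pieces and takes the top K left singular vectors. The function names make_omega and sketch_split, the seeding scheme, and the toy data are hypothetical.

```r
set.seed(1)
n <- 600; d <- 50; K <- 3   # toy sample size, dimension, and rank
m <- 4                      # number of distributed data splits
p <- 8                      # sketching dimension
l <- 1                      # index of the parallel sketch

X <- matrix(rnorm(n * d), n, d)
X[, 1:K] <- 5 * X[, 1:K]    # give the first K coordinates a strong signal
split_id <- rep(seq_len(m), length.out = n)

# Every split must use the same d x p test matrix within sketch l,
# so the matrix is regenerated from a seed that depends only on l.
make_omega <- function(l, d, p) {
  set.seed(1000 + l)
  matrix(rnorm(d * p), d, p)
}

# Step 1 (per split i, per sketch l): sketch the split's covariance piece.
sketch_split <- function(X_i, Omega, n) crossprod(X_i, X_i %*% Omega) / n

Omega   <- make_omega(l, d, p)
Y_parts <- lapply(seq_len(m), function(i) {
  sketch_split(X[split_id == i, , drop = FALSE], Omega, n)
})

# Step 2 (per sketch l): sum over splits, then take the top K left singular vectors.
Y_l <- Reduce(`+`, Y_parts)
V_l <- svd(Y_l, nu = K, nv = 0)$u
```

Because the sketch is linear in the data, the sum of the per-split pieces equals the sketch of the full sample covariance, which is why only d x p matrices need to be communicated in this toy setup.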

The R script inference_1000g_SBM.R implements Step 1 and Step 2 of FADI to compute the top PCs of the undirected graph generated from the 1000 Genomes data, with input parameter l (index of the parallel sketch). The R Markdown file multiple_testing_1000g.Rmd performs multiple testing to infer the population membership of the 1000 Genomes subjects. It first implements Step 3 of FADI, aggregating the parallel sketching results to output the FADI estimator of the top PCs, and then carries out the inferential procedure for membership testing using the FADI PC estimator, as detailed in Supplement D of the paper "Dimension Reduction for Large-Scale Federated Data: Statistical Rate and Asymptotic Inference".
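The aggregation in Step 3 can be illustrated as follows. This is a minimal sketch under the assumption that the parallel results are combined by averaging their projection matrices and taking the top K eigenvectors of the average; the function names aggregate_sketches and proj_error and the simulated per-sketch estimates are hypothetical and do not come from multiple_testing_1000g.Rmd. The membership testing procedure itself is not shown.

```r
set.seed(2)
d <- 30; K <- 3; L <- 20   # toy dimension, rank, and number of parallel sketches

# Stand-ins for the per-sketch outputs of Step 2: noisy orthonormal bases
# around a common d x K truth.
V_true <- qr.Q(qr(matrix(rnorm(d * K), d, K)))
V_list <- lapply(seq_len(L), function(l) {
  qr.Q(qr(V_true + 0.02 * matrix(rnorm(d * K), d, K)))
})

# Step 3 (assumed form): average the projections V_l V_l^T over the L sketches
# and keep the top K eigenvectors of the average.
aggregate_sketches <- function(V_list, K) {
  P_bar <- Reduce(`+`, lapply(V_list, tcrossprod)) / length(V_list)
  eigen(P_bar, symmetric = TRUE)$vectors[, seq_len(K), drop = FALSE]
}

V_fadi <- aggregate_sketches(V_list, K)

# Frobenius distance between the projections onto two subspaces.
proj_error <- function(A, B) norm(tcrossprod(A) - tcrossprod(B), type = "F")
err <- proj_error(V_fadi, V_true)
```

Averaging projection matrices rather than the bases themselves avoids having to align the sign or rotation of each per-sketch basis before combining them.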

Data

The Data folder contains supplementary data for the inferential application of FADI to the 1000 Genomes data. The file 1000g_sbm95.RData contains an undirected graph generated from the 1000 Genomes data, and the file 1KG_TRACE_pca.txt contains the population information of the 1000 Genomes subjects.
