Gene Identity Score of Mammalian Orthologs
- Generating GISMO-mis and GISMO metrics.
- consensus_v2_missense-counts.gz and consensus_v2_synonymous-counts.gz output from scripts in GISMO-mis
- unmerged-species_combined_matrix_2023-07-26.tsv output from scripts in one2_matrices
- Scripts to generate input files for GISMO-mis
- Instead of comparing to human (">REFERENCE" in fasta), generate a consensus reference sequence by codon for each gene. Note that this permits multiple reference codons at a position (in the event of ties)
- If there is a deletion ("-") anywhere in either reference or query sequence for a gene, skip over it (not counted towards mis or syn totals)
- For any "N" bp, consider all possible substitutions. To be the most conservative, if the substitution can be synonymous, mark it as synonymous. In other words, only classify it as missense if all possibilities are missense.
- For synonymous and missense scoring in the case of multiple reference codons, again take the most conservative route: call the queried codon missense only if it is missense against every reference codon.
- All used ENST[x].[gene].fasta.gz files hosted at: https://genome.senckenberg.de/download/TOGA/human_hg38_reference/MultipleCodonAlignments/
- Primary outputs moved to: GenerateScore/inputs/consensus_v2_missense-counts.gz and GenerateScore/inputs/consensus_v2_synonymous-counts.gz
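
The consensus and scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in GISMO-mis: the function names (`expand`, `consensus_codons`, `classify`) are hypothetical, input is assumed to be uppercase A/C/G/T/N/- codons, and several details are assumptions made here for the example (codons containing "N" or "-" do not vote for the consensus, an unchanged codon is reported as "identical", and a change to or from a stop codon is treated like any other amino-acid change).

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expand(codon):
    """Return every concrete codon that a codon containing 'N' could represent."""
    return ["".join(p) for p in product(*(BASES if b == "N" else b for b in codon))]


def consensus_codons(column):
    """Most frequent codon(s) in one alignment column; ties keep every top codon.

    Assumption for this sketch: codons with '-' or 'N' do not vote.
    """
    counts = Counter(c for c in column if "-" not in c and "N" not in c)
    if not counts:
        return set()
    top = max(counts.values())
    return {c for c, n in counts.items() if n == top}


def classify(ref_codons, query):
    """Return 'skip', 'identical', 'synonymous', or 'missense' for one codon."""
    # Deletions in either the reference or the query are not counted.
    if not ref_codons or "-" in query or any("-" in r for r in ref_codons):
        return "skip"
    # Every (reference, query) pairing, after expanding any 'N' bases.
    pairs = [(r, q) for ref in ref_codons for r in expand(ref) for q in expand(query)]
    # Conservative rule: missense only if every pairing changes the amino acid.
    if all(CODON_TABLE[r] != CODON_TABLE[q] for r, q in pairs):
        return "missense"
    # Otherwise synonymous if some pairing changes the codon but not the amino acid.
    if any(r != q and CODON_TABLE[r] == CODON_TABLE[q] for r, q in pairs):
        return "synonymous"
    return "identical"


column = ["AAA", "AAA", "AAG", "AAG", "---"]
refs = consensus_codons(column)
print(sorted(refs))           # ['AAA', 'AAG'] (the tie keeps both codons)
print(classify(refs, "AAN"))  # synonymous
print(classify(refs, "GAN"))  # missense
```

Summing the "missense" and "synonymous" calls per gene and species would give counts of the kind stored in the two consensus_v2 files, although the exact layout of those files is not described here.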
- Generating GISMO one2 matrix (primary output: GenerateScore/inputs/unmerged-species_combined_matrix_2023-07-26.tsv)
- Input files hosted at: https://genome.senckenberg.de/download/TOGA/human_hg38_reference/
- Scripts to run partitioned heritability
- Scripts used to run partitioned heritability
- Most of the input files needed to run partitioned heritability
- SNP list: list.txt
- Gene coordinate file: ENSG_coord.txt
- Dependent files that are not included here:
- 1000G_EUR_Phase3_plink/ hosted at: gs://broad-alkesgroup-public-requester-pays/LDSCORE/1000G_Phase3_plinkfiles.tgz
- hapmap3_snps/ hosted at: gs://broad-alkesgroup-public-requester-pays/LDSCORE/hapmap3_snps.tgz. w_hm3.snplist is used instead.
- Baseline model hosted at: gs://broad-alkesgroup-public-requester-pays/LDSCORE/1000G_Phase3_baselineLD_v2.2_ldscores.tgz
- Weights hosted at: gs://broad-alkesgroup-public-requester-pays/LDSCORE/1000G_Phase3_weights_hm3_no_MHC.tgz
- frqfiles hosted at: gs://broad-alkesgroup-public-requester-pays/LDSCORE/1000G_Phase3_frq.tgz
- GWAS sumstats can be downloaded according to details in: Manifest_201807.csv
- decile, decile_rand, and decile-list used as inputs to partitioned heritability scripts (all generated by GISMO-mis_gencode-v44_annotate-ensembl.R)
- gencode.v44.chr_patch_hapl_scaff.basic.annotation.gff3.genes contains gene names from GENCODE v44 (hosted at: https://www.gencodegenes.org/human/). Generated by v44_ensembl-from-gencode.py.