
2017.12.06

DMTR edited this page Dec 6, 2017 · 1 revision

Today, the group had another meeting to discuss our progress (or more appropriately, the lack thereof -- at least that's how I feel about my contribution).
Regarding the to-do list from the last meeting:

  1. [Completed by Arif] Combine metadata and raw counts → patient code vs gene codes.
  2. [Resolved] Look up whether to sum the counts or take their median/mean for the samples that were sequenced multiple times (for deeper coverage).
    • These are technical replicates; cf. the DESeq2 package tutorial.
    • Conclusion: just sum them :)
  3. [Pending!] Get information on the upstream workflow, up to the point where the raw counts are produced.
  4. [Pending!] Upload reference regarding TPM value cut-off.
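
For item 2, summing technical replicates could look roughly like the sketch below. It is only an illustration in plain Python: the function name, the run IDs and the sample IDs are made up, and the counts are assumed to be integers. DESeq2 itself provides collapseReplicates for the same purpose on the R side.

```python
from collections import defaultdict


def collapse_replicates(counts, run_to_sample):
    """Sum the raw counts of all runs that belong to the same sample.

    counts:        {feature_id: {run_id: count}}  (hypothetical layout)
    run_to_sample: {run_id: sample_id}
    """
    collapsed = {}
    for feature_id, per_run in counts.items():
        totals = defaultdict(int)
        for run_id, count in per_run.items():
            # Technical replicates of one sample are simply added together.
            totals[run_to_sample[run_id]] += count
        collapsed[feature_id] = dict(totals)
    return collapsed


# Made-up example: run1 and run2 are two sequencing runs of the same sample.
counts = {
    "ENST_A": {"run1": 10, "run2": 5, "run3": 7},
    "ENST_B": {"run1": 0, "run2": 2, "run3": 1},
}
run_to_sample = {"run1": "NASH_01", "run2": "NASH_01", "run3": "NASH_02"}
collapsed = collapse_replicates(counts, run_to_sample)
```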

Regarding §1:
I am also at this stage, trying to reproduce it. I am about halfway there and fairly sure it will work, although I will most likely do it in a quite inefficient manner. Cf. SampleID_mapping.py, ensembl_reference_dict.py and sum_transcripts_to_genes.py.
The workflow:

  • Created a dictionary from the metadata to map each sample ID to its NASH identifier (SampleID_mapping.py). The raw_count file is then regenerated with the NASH identifiers.
  • From Ensembl, I downloaded the latest Homo sapiens GTF file (build: GRCh38.p10, version: GRCh38, build accession: NCBI:GCA_000001405.25, last update: 2017-07).
  • Using the following command, I filtered the file down to the transcript records only (they carry the transcript ID, which is what we need anyway). Note the comparison operator ==; a single = would assign to the third column and match every line.
awk -F'\t' '$3 == "transcript" { print $0 }' input_gtf_file > transcript_gtf_file
  • A dictionary-like file was generated with ensembl_reference_dict.py. It contains only the information that (I think) we require for downstream processing: gene_ID (ENSG...), gene_symbol, and transcript_ID (ENST...).
  • Moving on to the part where I am currently at (sum_transcripts_to_genes.py):
    • Done! - Load the raw_count_NASH_id file.
    • Done! - Translate every transcript_ID in column 1 into its associated gene_ID.
    • Stuck :) - Combining the rows that share the same gene_ID and summing their raw_count values. I feel it would be too complicated, tedious or memory-inefficient to turn everything into a nested list, then group the rows one by one and sum the respective values. I am thinking of doing this either by converting the table into a pandas data frame or by doing it in R.
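
The last two steps of the workflow above can be sketched without a nested list by keeping one running total per gene. This is a minimal sketch, not the content of ensembl_reference_dict.py or sum_transcripts_to_genes.py: the function names, the IDs and the table layout (transcript ID in the first column, one integer count column per sample) are assumptions made for illustration.

```python
import csv
import io
import re

# Matches GTF attribute pairs such as: gene_id "ENSG...";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def transcript_to_gene(gtf_lines):
    """Map transcript_id -> (gene_id, gene_name) from GTF transcript records."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # keep only transcript records
        attrs = dict(ATTRIBUTE_RE.findall(fields[8]))
        mapping[attrs["transcript_id"]] = (attrs["gene_id"], attrs.get("gene_name", ""))
    return mapping


def sum_counts_per_gene(count_rows, mapping):
    """Add up the per-sample counts of all transcripts of the same gene."""
    gene_totals = {}
    for row in count_rows:
        gene_id = mapping[row[0]][0]
        values = [int(v) for v in row[1:]]  # assumes integer raw counts
        if gene_id in gene_totals:
            totals = gene_totals[gene_id]
            for i, value in enumerate(values):
                totals[i] += value
        else:
            gene_totals[gene_id] = values
    return gene_totals


# Made-up GTF records and count table, for illustration only.
gtf = [
    "#!genome-build GRCh38.p10\n",
    '1\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG_A"; transcript_id "ENST_A1"; gene_name "GENE_A";\n',
    '1\thavana\ttranscript\t150\t300\t.\t+\t.\tgene_id "ENSG_A"; transcript_id "ENST_A2"; gene_name "GENE_A";\n',
    '2\thavana\ttranscript\t500\t900\t.\t-\t.\tgene_id "ENSG_B"; transcript_id "ENST_B1"; gene_name "GENE_B";\n',
]
table = io.StringIO(
    "transcript_id\tNASH_01\tNASH_02\n"
    "ENST_A1\t10\t0\n"
    "ENST_A2\t5\t3\n"
    "ENST_B1\t7\t1\n"
)

mapping = transcript_to_gene(gtf)
reader = csv.reader(table, delimiter="\t")
header = next(reader)  # first row holds the sample identifiers
gene_counts = sum_counts_per_gene(reader, mapping)
```

Only one list of totals per gene is held in memory, and the count file is read row by row. With the table loaded into a pandas data frame that has a gene ID column (here hypothetically named gene_id), the same grouping should reduce to something like df.groupby("gene_id").sum().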

The (summary of the) outcomes from today's meeting are as follows:

  • Still need to find out about:
    • Upstream processing until raw_counts,
    • TPM reference,
    • More information regarding the phenotypes of the patients (age, weight, sex, etc.) - as Lukas (who happened to pop in halfway through the meeting - thanks for your advice!) has suggested, though these may increase the complexity of the analysis.
  • As Arif has already done this several times, it was decided that Xueqing, Vasilis and I should do the following: using some of the common packages/tools, conduct the differential expression analysis with DESeq2 and feed the results into an R package called Piano, alongside different pathway data (e.g. KEGG, GO - from MSigDB).
  • If we have time, it might be worth conducting further analysis with other tools, e.g. edgeR, SLE, or doing an enrichment analysis using DAVID.
  • In the meantime, Arif will try to do what has been suggested: ANOVA for each gene across the 4 disease states. We were informed that doing this would increase the statistical resolution (CMIIW).
  • Think about what to put in the presentation!
  • The semi-quasi-official deadline for these analyses is Sunday, 10 Dec. Hopefully, I can achieve this!
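
For reference, the per-gene ANOVA mentioned above boils down to one F statistic per gene, computed across the groups of samples. The sketch below is not Arif's implementation; it only shows the calculation in plain Python, with made-up values, and assumes the expression values have already been normalised. In practice a library routine such as scipy.stats.f_oneway would also return the p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation between the group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up expression values of one gene in four hypothetical disease states.
gene_values = [
    [1.0, 2.0, 3.0],  # state 1
    [2.0, 3.0, 4.0],  # state 2
    [5.0, 6.0, 7.0],  # state 3
    [4.0, 5.0, 6.0],  # state 4
]
f_statistic = one_way_anova_f(gene_values)
```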