2017.12.06

Today, the group had another meeting to discuss our progress (or more appropriately, the lack thereof -- at least that's how I feel about my contribution).
Regarding the to-do list from the last meeting:

[Completed by Arif] Combine metadata and raw counts → patient code vs gene codes.
[Resolved] Look up whether to sum up or take median/mean for the samples that were sequenced multiple times (deeper coverage).
- These are technical replicates; cf. the DESeq2 package tutorial.
- Conclusion: just sum them :)
[Pending!] Get information regarding the workflow so far up until we get the raw count.
[Pending!] Upload reference regarding TPM value cut-off.
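For the resolved replicate question above, summing the technical replicates amounts to adding up the count columns that belong to the same patient. A minimal pandas sketch, assuming a table with transcripts as rows and sequencing runs as columns; the run names, patient codes, and counts are made up for illustration (in R, DESeq2's collapseReplicates does the same kind of summing):

```python
import pandas as pd

# Toy raw-count table: rows are transcripts, columns are sequencing runs.
# All IDs and values are invented for illustration.
counts = pd.DataFrame(
    {"run_A1": [10, 0, 5], "run_A2": [12, 1, 7], "run_B1": [3, 8, 0]},
    index=["ENST_x", "ENST_y", "ENST_z"],
)

# Hypothetical mapping from each run to its patient;
# patient "P01" was sequenced twice (technical replicates).
run_to_patient = {"run_A1": "P01", "run_A2": "P01", "run_B1": "P02"}

# Transpose so runs become rows, group the rows by patient, sum, transpose back.
collapsed = counts.T.groupby(run_to_patient).sum().T
print(collapsed)
```

Summing (rather than averaging) keeps the values as integer counts, which is what count-based tools such as DESeq2 expect as input.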

Regarding §1:
I am at this stage too, trying to reproduce what Arif did. I am fairly sure I will get there, as I am already about halfway, but most likely in a rather inefficient way. cf. SampleID_mapping.py, ensembl_reference_dict.py & sum_transcripts_to_genes.py.
The workflow:

Created a dictionary from the metadata to map each sample ID to its NASH identifier (SampleID_mapping.py). The raw_count file was then regenerated with the NASH identifiers.
From Ensembl, I downloaded the latest Homo sapiens GTF file (build: GRCh38.p10, version: GRCh38, build accession: NCBI:GCA_000001405.25, last update: 2017-07).
Using the following command, I filtered the file down to the rows whose feature type (column 3) is "transcript", which is all we need anyway. Note the comparison operator ==; a single = would assign "transcript" to column 3 of every row and print all of them.

awk '$3 == "transcript" { print $0 }' input_gtf_file > transcript_gtf_file

A ~~dictionary~~ dictionary-like file was generated with ensembl_reference_dict.py. It essentially only contains the information that (I thought) we require for downstream processing: gene_ID (ENSG...), gene_symbol, and transcript_ID (ENST...).
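A rough sketch of how such a transcript-to-gene lookup can be pulled out of the GTF attribute column. This is not the actual ensembl_reference_dict.py. It assumes the standard GTF layout (tab-separated, attributes in column 9 as key "value"; pairs), and the two example lines use placeholder IDs, not real Ensembl records:

```python
import re

# Two lines in GTF layout (tab-separated, attributes in column 9).
# The IDs below are placeholders for illustration only.
gtf_lines = [
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "ENSG_demo1"; transcript_id "ENST_demo1"; gene_name "GENE1";',
    '1\thavana\texon\t100\t200\t.\t+\t.\t'
    'gene_id "ENSG_demo1"; transcript_id "ENST_demo1"; gene_name "GENE1";',
]

# Matches one attribute of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(lines):
    """Map transcript_id -> (gene_id, gene_name) using 'transcript' rows only."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):  # skip header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        mapping[attrs["transcript_id"]] = (
            attrs["gene_id"],
            attrs.get("gene_name", ""),  # gene symbol may be absent
        )
    return mapping


reference = transcript_to_gene(gtf_lines)
print(reference)
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.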
Moving on to the part where I am currently at (sum_transcripts_to_genes.py):
- Done! - Load the raw_count_NASH_id file.
- Done! - Translate all the transcript_ID in column1 into the associated gene_ID.
- Stuck :) - Combining the rows that share the same gene_ID and summing their raw_count values. I feel it would be too complicated, tedious, or memory-inefficient to turn everything into a nested list, group it one by one, and sum the respective values. I am thinking of doing this either by converting it into a pandas data frame or by doing it in R.
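For the step I am stuck on, the pandas route could look roughly like the sketch below: group the rows by gene_ID and sum every sample column. The column names and counts are toy values, and the real table layout in sum_transcripts_to_genes.py may differ:

```python
import pandas as pd

# Toy stand-in for the raw-count table after the transcript -> gene translation.
# Column names and values are invented for illustration.
counts = pd.DataFrame(
    {
        "gene_id": ["ENSG_a", "ENSG_a", "ENSG_b"],
        "NASH_01": [10, 5, 7],
        "NASH_02": [0, 3, 2],
    }
)

# One row per gene: counts of all its transcripts are summed per sample.
gene_counts = counts.groupby("gene_id", sort=True).sum()
print(gene_counts)
```

This avoids the nested-list bookkeeping entirely; for the real data the frame would presumably come from something like pd.read_csv on the raw_count_NASH_id file, with the separator depending on how that file was written.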

The (summary of the) outcomes from today's meeting are as follows:

Still need to find out about:
- Upstream processing until raw_counts,
- TPM reference,
- More information regarding the phenotypes of the patients (age, weight, sex, etc.) - as Lukas (who happened to pop in halfway through the meeting - thanks for your advice!) has suggested, though these may increase the complexity of the analysis.
Since Arif has already done this several times, it was decided that Xueqing, Vasilis and I should do it this time: using some of the common packages/tools, run the differential expression analysis with DESeq2 and feed the results into an R package called Piano, alongside different pathway data (e.g. KEGG, GO - from MSigDB).
If we have time, it might be worth it to conduct further analysis with other tools, e.g. edgeR, SLE, or do an enrichment analysis using DAVID.
In the meantime, Arif will try to do what has been suggested: ANOVA for each gene across the 4 disease states. We were informed that doing this would increase the statistical resolution (CMIIW).
Think about what to put in the presentation!
The semi-quasi-official deadline for these analyses is Sunday, 10 Dec. Hopefully, I can achieve this!
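As a note to self on the per-gene ANOVA that Arif will try: for one gene, the one-way ANOVA F statistic compares the variation between the disease-state groups with the variation within them. A minimal pure-Python sketch, assuming the expression values are already normalised and transformed (raw counts would not meet the usual ANOVA assumptions); the group values are made up, and this is not necessarily how Arif will implement it:

```python
from statistics import mean


def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)                       # number of groups (disease states)
    n = sum(len(g) for g in groups)       # total number of samples
    grand = mean(v for g in groups for v in g)
    # Between-group sum of squares: group means vs. grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: values vs. their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical expression values of one gene in 4 disease states, 3 patients each.
gene_values = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [8, 9, 10]]
f_stat = one_way_f(gene_values)
print(f_stat)
```

Turning F into a p-value needs the F distribution with (k - 1, n - k) degrees of freedom; scipy.stats.f_oneway returns both in one call, and with one test per gene the p-values would still need multiple-testing correction.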
