This repository has been archived by the owner on Aug 25, 2019. It is now read-only.
2017.12.06
DMTR edited this page Dec 6, 2017
Today, the group had another meeting to discuss our progress (or more appropriately, the lack thereof -- at least that's how I feel about my contribution).
Regarding the to-do list from the last meeting:
- [Completed by Arif] Combine metadata and raw counts → patient code vs gene codes.
- [Resolved] Look up whether to sum up or take median/mean for the samples that were sequenced multiple times (deeper coverage).
- Because we have technical replicates: cf. (DESeq2 package tutorial)
- Conclusion: just sum them :)
- [Pending!] Get information regarding the workflow so far up until we get the raw count.
- [Pending!] Upload reference regarding TPM value cut-off.
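The "just sum them" conclusion for technical replicates can be sketched in a few lines. This is only a minimal illustration, assuming the counts are held as nested dictionaries; the run IDs, patient codes, and the function name `sum_technical_replicates` are hypothetical and not taken from our scripts. (In R, DESeq2 offers `collapseReplicates` for the same purpose.)

```python
from collections import defaultdict


def sum_technical_replicates(counts, run_to_patient):
    """Collapse per-run raw counts into per-patient raw counts by summing.

    counts:         {run_id: {feature_id: raw_count}}
    run_to_patient: {run_id: patient_code}
    """
    summed = defaultdict(lambda: defaultdict(int))
    for run_id, features in counts.items():
        patient = run_to_patient[run_id]
        for feature_id, raw_count in features.items():
            # Technical replicates of the same patient are simply added up.
            summed[patient][feature_id] += raw_count
    return {patient: dict(features) for patient, features in summed.items()}


# Hypothetical example: patient P01 was sequenced twice (run_A and run_B).
counts = {
    "run_A": {"ENST_x": 10, "ENST_y": 0},
    "run_B": {"ENST_x": 5, "ENST_y": 2},
    "run_C": {"ENST_x": 7, "ENST_y": 1},
}
run_to_patient = {"run_A": "P01", "run_B": "P01", "run_C": "P02"}
collapsed = sum_technical_replicates(counts, run_to_patient)
```

Summing (rather than averaging) keeps the values as raw counts, which is what count-based tools such as DESeq2 expect as input.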
Regarding §1:
I am also at this stage, trying to emulate this. I am fairly sure I will get there (already halfway), though most likely in a rather inefficient manner. Cf. SampleID_mapping.py, ensembl_reference_dict.py & sum_transcripts_to_genes.py.
The workflow:
- Created a dictionary from the metadata to map each sample ID to its NASH identifier (SampleID_mapping.py). Then, the raw_count file is regenerated with the NASH identifiers.
- From Ensembl, I downloaded the latest Homo sapiens GTF file (build: GRCh38.p10, version: GRCh38, build accession: NCBI:GCA_000001405.25, last update: 2017-07).
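- Using the following command, I filtered the file down to the transcript records only, i.e. the lines that carry a transcript ID (which is what we needed anyway). Note the `==`: a single `=` would assign to column 3 instead of comparing it, and every line would be printed.
awk '$3 == "transcript" { print $0 }' input_gtf_file > transcript_gtf_file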
- A dictionary-like file was generated with ensembl_reference_dict.py. It essentially only contains the information that (I thought) we require for downstream processing: gene_ID (ENSG...), gene_symbol, and transcript_ID (ENST...).
- Moving on to the part where I am currently at (sum_transcripts_to_genes.py):
- Done! - Load the raw_count_NASH_id file.
- Done! - Translate all the transcript_ID in column1 into the associated gene_ID.
- Stuck :) - Combining the rows that share the same gene_ID and summing their raw_count values. I feel this would be too complicated, tedious, or memory-inefficient if I just turn everything into a nested list, then group the rows one by one and sum the respective values. I am thinking of doing this either by converting it into a pandas data frame or doing it in R.
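One way around the stuck step, without nested lists, is to stream the count file row by row and keep only one running total per gene. The sketch below is not the content of sum_transcripts_to_genes.py; it assumes a tab-separated count file with the transcript ID in the first column and integer counts in the remaining ones, and all IDs in the example are made up. The function names are hypothetical as well.

```python
import csv
import io
import re

# GTF attributes (column 9) look like: key "value"; key "value"; ...
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene_map(gtf_lines):
    """Build {transcript_id: gene_id} from the transcript records of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        mapping[attributes["transcript_id"]] = attributes["gene_id"]
    return mapping


def sum_transcripts_to_genes(count_rows, mapping):
    """Sum per-transcript rows into per-gene rows.

    count_rows yields [transcript_id, count_1, count_2, ...].
    Transcripts missing from the mapping are skipped in this sketch.
    """
    gene_totals = {}
    for row in count_rows:
        gene_id = mapping.get(row[0])
        if gene_id is None:
            continue
        values = [int(v) for v in row[1:]]  # assumes integer raw counts
        if gene_id in gene_totals:
            totals = gene_totals[gene_id]
            for i, value in enumerate(values):
                totals[i] += value
        else:
            gene_totals[gene_id] = values
    return gene_totals


# Made-up data: ENST_1 and ENST_2 belong to ENSG_A, ENST_3 to ENSG_B.
gtf_example = [
    "#!genome-build example\n",
    '1\tsrc\ttranscript\t1\t9\t.\t+\t.\tgene_id "ENSG_A"; transcript_id "ENST_1";\n',
    '1\tsrc\ttranscript\t1\t9\t.\t+\t.\tgene_id "ENSG_A"; transcript_id "ENST_2";\n',
    '2\tsrc\ttranscript\t1\t9\t.\t-\t.\tgene_id "ENSG_B"; transcript_id "ENST_3";\n',
]
counts_example = io.StringIO(
    "transcript_id\tNASH_01\tNASH_02\n"
    "ENST_1\t10\t0\n"
    "ENST_2\t5\t3\n"
    "ENST_3\t7\t1\n"
)

mapping = transcript_to_gene_map(gtf_example)
reader = csv.reader(counts_example, delimiter="\t")
header = next(reader)  # keep the sample names for writing the output later
gene_counts = sum_transcripts_to_genes(reader, mapping)
```

Memory use grows with the number of genes rather than the number of transcript rows. If the table fits in memory anyway, the pandas route mentioned above should amount to adding a gene_ID column and calling `groupby` on it followed by `sum()`.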
The (summary of the) outcomes from today's meeting are as follows:
- Still need to find out about:
- Upstream processing until raw_counts,
- TPM reference,
- More information regarding the phenotypes of the patients (age, weight, sex, etc.) - as Lukas (who happened to pop in halfway through the meeting - thanks for your advice!) has suggested, though these may increase the complexity of the analysis.
- As Arif has already done this several times, it was decided that Xueqing, Vasilis and I should do the following: using some of the common packages/tools, conduct the differential expression analysis with DESeq2 and feed the results into an R package called Piano alongside different pathway data (e.g. KEGG, GO - from MSigDB).
- If we have time, it might be worth it to conduct further analysis with other tools, e.g. edgeR, SLE, or do an enrichment analysis using DAVID.
- In the meantime, Arif will try to do what has been suggested: ANOVA for each gene across the 4 disease states. We were informed that doing this would increase the statistical resolution (CMIIW).
- Think about what to put in the presentation!
- Semi-quasi-official deadline to do these analyses is Sunday, 10 Dec. Hopefully, I can achieve this!
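For the per-gene ANOVA across the 4 disease states mentioned above, the core quantity is the one-way F statistic: between-group variance divided by within-group variance. The sketch below only illustrates that calculation with the standard library; it is not Arif's actual procedure, the expression values are invented, and it assumes the values have already been normalised appropriately. It stops at the F statistic; a p-value needs the F distribution (e.g. `scipy.stats.f_oneway` returns both), and testing every gene calls for multiple-testing correction.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (here: disease states)
    n = len(all_values)    # total number of samples
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented expression values for one gene in four hypothetical disease states.
expression_by_state = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0],
]
f_statistic = one_way_anova_f(expression_by_state)
```

Running this once per gene, with one list of values per disease state, gives one F statistic per gene.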
This makes me hungry. (C)2017-2018 • DMTR13