Updated via the Open Science Framework

terraref · Nov 20, 2015 · b7ed7a6
1 parent 85f98eb
commit b7ed7a6
Showing 1 changed file with 52 additions and 28 deletions.
diff --git a/meeting-notes/2015-11-20_computational_pipeline.md b/meeting-notes/2015-11-20_computational_pipeline.md
@@ -5,63 +5,87 @@
 
 ## Participants:
 
-* David Lebauer (TERRAref & TERRA cat 1, Illinois)
+* David LeBauer (TERRAref & TERRA cat 1, Illinois)
 * Victor Jongeneel (HPCBio, Illinois)
 * Matt Hudson (Illinois)
 * Chris Fields (HPCBio, Illinois)
 * Noah Fahlgren (Danforth)
 * Rob Alba (Danforth)
 
-## Tenative:
+## Agenda
 
-* Elodie Gazave (Cornell)
-* Pat Brown (TERRA cat 1, Illinois)
+1. Review proposal terraref/reference-data#19
+2. Determine roles / interests / contributions of each participating group
+  *  TERRA REF
+  *  TERRA MEPP
+  *  HPCBio
+3. Estimate data volumes (order of magnitude?) for each step
+4. Define Use cases
 
-### Regrets:
+## Teams
 
-* Mike Gore (Cornell)
-* Todd Mockler (Danforth)
+### TERRA REF
 
-### Agenda
+Will provide central repository and computing resources for genomics pipeline.
 
-* Review proposal terraref/reference-data#19
-* Determine roles / interests / contributions of each participating group
-
-
-## TERRA REF
+Core data set; sequencing to be done by Jeremy Schmutz at HudsonAlpha 
 
 * 75 shared lines
 * 40 de novo sequences, 
-* 15 lines by PacBio. 
+ * 15 by PacBio (? I infer this is a sort of 'gold standard'?)
 * 400 resequences (with enough coverage to do assembly)
 
+Estimated completion date: Mid 2016
 
-Question: where to partition 15, 40, 400
 
-Coordinated germplasm for G x E analysis
+### TERRA MEPP 
 
-## TERRA MEPP
 
-Sequencing? Yet to be determined. Pending discussions like this.
+Amount of sequencing is to be determined; it is unclear at the programmatic level what will be required. Pending discussions like this one.
 
-1. against existing reference genome
-2. re-analyze data with the PacBio lines (with a separate pipeline)
 
+### HPC Bio
 
-## HPC Bio
+Can support this project; they need clear specifications and funding 
 
-* ED: should spread 15 PacBiom through diversity of germplam + more coverage on more likely to be commercial lines
-* What can TERRA Ref do for the program? (and what are storage needs?)
-* What can HPCbio provide?
+## Data volume and compute needs
+
+### Sequence alignments
 
+* BAM can be very large (TBs); can be visualized in JBrowse.
+  * Sorted BAM is smaller (compressed). 
+  * Only need to keep sorted BAM. Can drop raw SAM and BAM intermediate files. 
+  * For 1000 genomes at 30x each: 4 TB of BAMs + 4 TB of FASTQ; no more than 20 TB.
 
+* CPU time: 
+ * alignment depends on aligner efficiency; runs 20x faster than BWA. 10k - 100k CPU-h for re-aligning 1000 genomes
+ * de novo assembly: depends on context
 
-  * TERRA MEPP
-  * HPCBio
-* Estimate data volumes (order of magnitude?) for each step
+### Genome Assembly
+
+* Assembling the cassava genome: 5k CPU-h per genome. This will take more time than the re-alignment.
 * Define Use cases
 
-Need to figure out data flows for the entire project. 
+## Open Questions to be addressed by larger TERRA program
+
+### Shared germplasm
+
+* where to partition 15, 40, 400 lines in genomics diversity
+* What will be the coordinated germplasm used for cross-site G x E analysis?
+* ED: should spread 15 PacBio lines through diversity of germplasm + more coverage on lines more likely to be commercial
+
+
+Need to figure out data flows for the entire project
 
 * **Victor**: makes sense to analyze data at same place it is generated. Then deposit both raw data and derived products in common repository.
 * **Chris Fields**: browsing and visualization 
+
+
+* What can teams collaborate on?
+  * TODO Survey to be coordinated by Rob, Rachel, and David
+    * what people are required to do according to milestones, and what are the overlaps?
+    * how can the reference team help? to what degree are teams interested in using a centralized pipeline, and in sharing data at different points in the pipeline?
+
+
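The figures in the "Data volume and compute needs" section of this change can be sanity-checked with a few lines of arithmetic. The Python sketch below only redoes that arithmetic; it is not part of the commit. The constants are the rough values quoted in the notes, and the function names (`storage_summary`, `compute_summary`) are hypothetical. One loud assumption: the notes give 5k CPU-h per genome for cassava, and the sketch applies that same cost to each of the 40 de novo sequences, which the notes do not state.

```python
# Rough, order-of-magnitude figures quoted in the meeting notes.
N_GENOMES = 1000                    # genomes to re-align, 30x each
BAM_TB = 4.0                        # sorted BAM storage, in TB
FASTQ_TB = 4.0                      # FASTQ storage, in TB
STORAGE_CEILING_TB = 20.0           # "no more than 20 TB"
REALIGN_CPU_H = (10_000, 100_000)   # CPU-h range for re-aligning 1000 genomes
ASSEMBLY_CPU_H_PER_GENOME = 5_000   # cassava figure from the notes

# ASSUMPTION (not in the notes): the cassava assembly cost applies
# to each of the 40 de novo sequences listed under TERRA REF.
N_DE_NOVO = 40


def storage_summary():
    """Return (total TB, GB per genome, TB of headroom under the ceiling)."""
    total_tb = BAM_TB + FASTQ_TB
    per_genome_gb = total_tb * 1000 / N_GENOMES
    headroom_tb = STORAGE_CEILING_TB - total_tb
    return total_tb, per_genome_gb, headroom_tb


def compute_summary():
    """Return ((low, high) re-alignment CPU-h per genome, total assembly CPU-h)."""
    per_genome = tuple(hours / N_GENOMES for hours in REALIGN_CPU_H)
    assembly_total = ASSEMBLY_CPU_H_PER_GENOME * N_DE_NOVO
    return per_genome, assembly_total


if __name__ == "__main__":
    total_tb, per_genome_gb, headroom_tb = storage_summary()
    print(f"storage: {total_tb} TB total, {per_genome_gb} GB per genome, "
          f"{headroom_tb} TB under the ceiling")
    (low, high), assembly_total = compute_summary()
    print(f"re-alignment: {low} to {high} CPU-h per genome")
    print(f"de novo assembly (assumed): {assembly_total} CPU-h in total")
```

Under that assumption, assembly comes to 200,000 CPU-h, above the upper end of the re-alignment range, which is consistent with the note that assembly will take more time than re-alignment.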
+
+