Update final-workflow.cwl to add functional annotations to gff file #141

Closed
mpoelchau opened this issue Apr 26, 2022 · 15 comments

@mpoelchau
Contributor

We need to begin adding functional annotation information to the genome browsers. The most straightforward way to do this is via the annotation gff3 file, prior to creating the apollo/jbrowse files. That means we will change some of the first steps of final-workflow.cwl.

  1. Download the following files:
     • An NCBI table file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_feature_table.txt.gz; add URL to yml file?)
     • An NCBI GFF file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_genomic.gff.gz; add URL to yml file?)
  2. Process the downloaded gff file with the following script: https://gitlab.com/i5k_Workspace/monicas-data-processing-scripts/-/blob/master/add_GO-KEGG_to_RefSeq-gff.pl (this is new, needs to be pulled into existing monicas-data-processing-scripts repo on your local)
     • Script inputs:
       • GO file (add input path to yml file)
       • KEGG file (add input path to yml file)
       • the downloaded GFF
       • the downloaded table file
     • Script is used as follows: perl add_GO-KEGG_to_RefSeq-gff.pl GO-file Kegg-file GFF table-file > output.gff
     • Script output: processed gff file, file name should be GFF.annotated.gff (e.g. GCF_001298625.1_SEUB3.0_genomic.annotated.gff).
  3. change input for https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_apollo2_data_processing/processing/workflow.cwl: in_gff is now the processed Gff file
  4. Processed gff file should also be distributed in flow_dispatch workflow
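The naming rule and the script call in the steps above can be sketched in Python as follows. This is only an illustration, not the CWL that ends up in final-workflow.cwl: the helper names `annotated_gff_name`, `build_command` and `run_annotation` are hypothetical, and the sketch assumes the script sits in the current working directory.

```python
import subprocess
from pathlib import PurePosixPath
from urllib.parse import urlparse


def annotated_gff_name(gff_url: str) -> str:
    """Derive the processed file name from the NCBI GFF URL (or a bare file name)."""
    name = PurePosixPath(urlparse(gff_url).path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]  # the download is gzipped
    if not name.endswith(".gff"):
        raise ValueError(f"expected a .gff file, got {name!r}")
    return name[: -len(".gff")] + ".annotated.gff"


def build_command(go_file: str, kegg_file: str, gff: str, table_file: str) -> list:
    """Argument order follows the usage line: GO-file Kegg-file GFF table-file."""
    return ["perl", "add_GO-KEGG_to_RefSeq-gff.pl", go_file, kegg_file, gff, table_file]


def run_annotation(go_file: str, kegg_file: str, gff: str, table_file: str) -> str:
    """Run the script and capture stdout in the .annotated.gff file (hypothetical wrapper)."""
    out_path = annotated_gff_name(gff)
    with open(out_path, "w") as out:
        subprocess.run(build_command(go_file, kegg_file, gff, table_file), stdout=out, check=True)
    return out_path
```

For the example URL, `annotated_gff_name` yields GCF_001298625.1_SEUB3.0_genomic.annotated.gff, which matches the file name requested in the script output step.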
@mpoelchau
Contributor Author

@amcooksey - could you run the functional annotation pipeline for Saccharomyces eubayanus GCF_001298625.1, so we have some smaller test data for Tina to work with?

@mpoelchau
Contributor Author

@ZhiXuanLai here are paths to an example GO and KEGG file on Ceres:
GO: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/GCF_021155775.1_complete.gaf.tsv
KEGG: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/KOBAS/GCF_021155775.1_KOBAS_acc_pathways.tsv

@amcooksey
Contributor

functional annotation for Saccharomyces eubayanus on CERES:
/project/nal_genomics/amanda.cooksey/protein_sets/Saccharomyces_eubayanus/NCBI Annotation Release 100 functional annotation

@mpoelchau
Contributor Author

Update writeLastLine-genePred.cwl:

  • add the original and processed gff file names as inputs
  • change line 21 from
    valueFrom: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"
    to the following (replace the file names in brackets with the original and processed gff file name inputs):
valueFrom: "echo -e '\nThe file [original-file-name] was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: [processed-file-name]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

@mpoelchau
Contributor Author

@ZhiXuanLai can we include both the original gff file and the processed gff file in the dispatch output?

@mpoelchau
Contributor Author

@ZhiXuanLai when I run the workflow using NA for the url_table_file parameter, I get the following error:

INFO [workflow md5checksums] starting step gunzip_table
INFO [step gunzip_table] start
ERROR Exception on step 'gunzip_table'
ERROR [step gunzip_table] Cannot make job: Invalid job input record:
pipeline/flow_md5checksums/gunzip_single.cwl:21:3: Missing required input parameter 'in_gz'
INFO [workflow md5checksums] completed permanentFail
WARNING [step md5checksums] completed permanentFail
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
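The log suggests that the NA placeholder reaches gunzip_table as if no file had been supplied. A minimal Python sketch of the check a workflow would need before running the table-file steps is shown here. The helper names `is_missing` and `table_steps_enabled` are hypothetical, and treating the literal string NA as "no value" is an assumption based on how the parameter is used in this thread.

```python
def is_missing(value) -> bool:
    """True when a yml value is absent, empty, or only the placeholder 'NA'."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() in ("", "NA")
    if isinstance(value, (list, tuple)):
        # An empty list, or a list holding only placeholders, counts as missing.
        return all(is_missing(item) for item in value)
    return False


def table_steps_enabled(url_table_file) -> bool:
    """Download, gunzip and annotation steps should run only with a real table URL."""
    return not is_missing(url_table_file)
```

With this check, `url_table_file: [NA]` disables the table steps instead of passing an empty input on to gunzip.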

@ZhiXuanLai
Contributor

Hi @mpoelchau
(1) I updated the filenames in writeLastLine-genePred.cwl. I wonder if the filenames in writeLastLine.cwl need to be changed too. The current content is: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

(2) Sure! I added the two gff files to dispatch output.

(3) My bad! I fixed the error now.

@mpoelchau
Contributor Author

@ZhiXuanLai thanks for the updates! I get the following error when I try to run the pipeline with

url_table_file: [
NA
]
INFO [step gaps-or-not] start
INFO [job gaps-or-not] /tmp/83xizhb_$ perl \
    -ne \
    'print if /N/' \
    id_deleted_file.txt > /tmp/83xizhb_/lines-contain-N.txt
INFO [job gaps-or-not] completed success
INFO [step gaps-or-not] completed success
INFO [workflow gaps_or_not] completed success
INFO [step gaps_or_not] completed success
INFO [workflow ] starting step add_annotation
INFO [step add_annotation] will be skipped
INFO [step add_annotation] completed skipped
INFO [workflow ] starting step apollo2_data_processing
INFO [step apollo2_data_processing] start
ERROR Exception on step 'apollo2_data_processing'
ERROR [step apollo2_data_processing] Cannot make job: Invalid job input record:
pipeline/flow_apollo2_data_processing/processing/workflow.cwl:15:3: Missing required input parameter 'in_gff'
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
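Here add_annotation is skipped, so it produces no processed gff, and apollo2_data_processing is left without in_gff. One way to reason about the fix is a fallback: use the processed gff when it exists, otherwise the original download. The Python sketch below only illustrates that selection logic; `first_non_null` and `select_in_gff` are hypothetical names. If the workflow is written in CWL v1.2, `pickValue: first_non_null` on an input with several sources may express the same idea, but that has not been checked against this repository.

```python
from typing import Optional, Sequence


def first_non_null(candidates: Sequence[Optional[str]]) -> str:
    """Return the first candidate that is not None."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    raise ValueError("all candidates are null")


def select_in_gff(processed_gff: Optional[str], original_gff: Optional[str]) -> str:
    """Prefer the annotated gff; fall back to the original when annotation was skipped."""
    return first_non_null([processed_gff, original_gff])
```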

@mpoelchau
Contributor Author

@ZhiXuanLai when I run the program with the table file URL, it completes successfully. The readme file looks good. However, I don't see the unprocessed gff in the analyses directory:

apollo@apollo:~$ ls /app/data/other_species/saceub/SEUB3.0/scaffold/analyses/Saccharomyces_eubayanus_Annotation_Release_100/
GCF_001298625.1_SEUB3.0_cds_from_genomic.fna   GCF_001298625.1_SEUB3.0_rna_from_genomic.fna  readme.txt
GCF_001298625.1_SEUB3.0_genomic.annotated.gff  GCF_001298625.1_SEUB3.0_translated_cds.faa
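A quick way to confirm the dispatch output is to compare the directory listing against the two gff names that are expected. The sketch below is illustrative only; `missing_from_dispatch` is a hypothetical helper, and it assumes the processed file is named by replacing .gff with .annotated.gff.

```python
def missing_from_dispatch(listing, original_gff):
    """Return the expected gff files that are absent from the listing."""
    if not original_gff.endswith(".gff"):
        raise ValueError(f"expected a .gff file name, got {original_gff!r}")
    stem = original_gff[: -len(".gff")]
    expected = [original_gff, stem + ".annotated.gff"]
    present = set(listing)
    return [name for name in expected if name not in present]


# File names copied from the ls output above.
listing = [
    "GCF_001298625.1_SEUB3.0_cds_from_genomic.fna",
    "GCF_001298625.1_SEUB3.0_rna_from_genomic.fna",
    "readme.txt",
    "GCF_001298625.1_SEUB3.0_genomic.annotated.gff",
    "GCF_001298625.1_SEUB3.0_translated_cds.faa",
]
missing = missing_from_dispatch(listing, "GCF_001298625.1_SEUB3.0_genomic.gff")
```

For this listing, `missing` contains only the unprocessed GCF_001298625.1_SEUB3.0_genomic.gff.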

@mpoelchau
Contributor Author

For the readme update, we won't need to change writeLastLine.cwl - that only pertains to the assembly readme, and that file remains unchanged. Good question though!

@ZhiXuanLai
Contributor

@mpoelchau Sorry for not fixing the error. I must have run the pipeline without saving the change in the yaml file.
I have a question about the filename in writeLastLine.cwl: what should go in the [processed-file-name] field when no table file is provided (and so there is no processed gff file)?

@mpoelchau
Contributor Author

Good question! Is it possible to leave that line unchanged?

@mpoelchau
Contributor Author

Update on how to handle writeLastLine-genePred.cwl.

  • If functional annotation is used at the beginning of the workflow, the last line should be: The file $(inputs.original_gff.basename) was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: $(inputs.processed_gff.basename). This file was used for all operations within the i5k Workspace
  • If functional annotation is not used at the beginning of the workflow, the last line should be: The file was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for the JBrowse genome browser and the Apollo manual curation tool.
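The two cases above amount to a simple branch on whether a processed gff exists. The Python sketch below shows that branch using the wording from this comment. It is not the CWL expression itself, and `readme_last_line` is a hypothetical name.

```python
from typing import Optional

ANNOTATED_LINE = (
    "The file {original} was post-processed to add functional annotations from the "
    "AgBase functional annotation pipeline (https://github.com/agbase). "
    "The resulting file is: {processed}. "
    "This file was used for all operations within the i5k Workspace."
)

# Left as a template for manual editing when no functional annotation was added.
TEMPLATE_LINE = (
    "The file was post-processed to [describe post-processing, if any]. "
    "The resulting file is: [Filename]. "
    "This file was used for the JBrowse genome browser and the Apollo manual curation tool."
)


def readme_last_line(original_gff: str, processed_gff: Optional[str] = None) -> str:
    """Pick the readme line depending on whether functional annotation was run."""
    if processed_gff is None:
        return TEMPLATE_LINE
    return ANNOTATED_LINE.format(original=original_gff, processed=processed_gff)
```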

@mpoelchau
Contributor Author

We need to add another process that I forgot about when I described this issue. The functional annotation directory (name is now in the tree variable array) needs to be moved into the analyses directory during the dispatch workflow. We could add a sub-workflow similar to https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_dispatch/2other_species/cp_dir.cwl.
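What the new dispatch step needs to do can be sketched in Python as a recursive copy that keeps the directory name. This is only an illustration of the behaviour, not a replacement for a cp_dir.cwl-style sub-workflow; `copy_annotation_dir` is a hypothetical name, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def copy_annotation_dir(src_dir: str, analyses_dir: str) -> Path:
    """Copy the functional annotation directory into the analyses directory."""
    src = Path(src_dir)
    if not src.is_dir():
        raise NotADirectoryError(src_dir)
    dest = Path(analyses_dir) / src.name  # keep the original directory name
    # Missing parent directories are created; existing files are overwritten.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

Because the destination is built with `Path`, directory names containing spaces (such as "NCBI Annotation Release 100 functional annotation") need no escaping.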

@mpoelchau
Contributor Author

@childers could you take a look at the last comment/update?

@mpoelchau mpoelchau assigned childers and unassigned ZhiXuanLai Jul 15, 2022
@mpoelchau mpoelchau assigned mpoelchau and unassigned childers Oct 14, 2022
4 participants