Update final-workflow.cwl to add functional annotations to gff file #141

mpoelchau · 2022-04-26T19:40:21Z

We need to begin adding functional annotation information to the genome browsers. The most straightforward way to do this is via the annotation gff3 file, prior to creating the apollo/jbrowse files. That means we will change some of the first steps of final-workflow.cwl.

- Download the following files:
  - An NCBI table file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_feature_table.txt.gz; add URL to yml file?)
  - An NCBI GFF file (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/298/625/GCF_001298625.1_SEUB3.0/GCF_001298625.1_SEUB3.0_genomic.gff.gz; add URL to yml file?)
- Process the downloaded GFF file with the following script: https://gitlab.com/i5k_Workspace/monicas-data-processing-scripts/-/blob/master/add_GO-KEGG_to_RefSeq-gff.pl (this script is new and needs to be pulled into your local copy of the existing monicas-data-processing-scripts repo)
  - Script inputs:
    - GO file (add input path to yml file)
    - KEGG file (add input path to yml file)
    - the downloaded GFF
    - the downloaded table file
  - The script is used as follows: perl add_GO-KEGG_to_RefSeq-gff.pl GO-file Kegg-file GFF table-file > output.gff
  - Script output: the processed GFF file. The file name should be the input GFF name with .annotated.gff in place of .gff (e.g. GCF_001298625.1_SEUB3.0_genomic.annotated.gff).
- Change the input for https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_apollo2_data_processing/processing/workflow.cwl: in_gff is now the processed GFF file.
- The processed GFF file should also be distributed in the flow_dispatch workflow.
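
The processing step described above can be sketched in Python. This is a minimal illustration, not the workflow's actual implementation: the helper names `annotated_gff_name` and `build_annotation_command` are hypothetical, and the sketch assumes the script sits in the working directory and the GFF has already been decompressed. Only the argument order and the `.annotated.gff` naming are taken from the description.

```python
from pathlib import Path


def annotated_gff_name(gff_name: str) -> str:
    """Derive the processed file name: X_genomic.gff(.gz) -> X_genomic.annotated.gff."""
    name = Path(gff_name).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    if not name.endswith(".gff"):
        raise ValueError(f"expected a .gff or .gff.gz file, got {gff_name!r}")
    return name[: -len(".gff")] + ".annotated.gff"


def build_annotation_command(go_file, kegg_file, gff_file, table_file,
                             script="add_GO-KEGG_to_RefSeq-gff.pl"):
    """Return the argv list; the argument order follows the usage line in the issue.

    The script writes to stdout, so the caller must redirect stdout
    to the annotated output file.
    """
    return ["perl", script, str(go_file), str(kegg_file),
            str(gff_file), str(table_file)]
```

The command could then be run with `subprocess.run(cmd, stdout=out_handle, check=True)`, where `out_handle` is a file opened for writing under the name returned by `annotated_gff_name`.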

mpoelchau · 2022-04-27T15:53:34Z

@amcooksey - could you run the functional annotation pipeline for Saccharomyces eubayanus GCF_001298625.1, so we have some smaller test data for Tina to work with?

mpoelchau · 2022-04-27T15:55:13Z

@ZhiXuanLai here are paths to an example GO and KEGG file on Ceres:
GO: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/GCF_021155775.1_complete.gaf.tsv
KEGG: /project/nal_genomics/amanda.cooksey/protein_sets/Neodiprion_pinetum/NCBI\ Annotation\ Release\ 100\ functional\ annotation/KOBAS/GCF_021155775.1_KOBAS_acc_pathways.tsv

amcooksey · 2022-04-28T15:38:20Z

The functional annotation for Saccharomyces eubayanus is on Ceres at:
/project/nal_genomics/amanda.cooksey/protein_sets/Saccharomyces_eubayanus/NCBI Annotation Release 100 functional annotation

mpoelchau · 2022-05-11T17:30:27Z

Update writeLastLine-genePred.cwl:

Add the original and processed gff file names as inputs.
Change line 21 from:
valueFrom: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

to the following (replace the file names in brackets with the original and processed gff file name inputs):
valueFrom: "echo -e '\nThe file [original-file-name] was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: [processed-file-name]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

mpoelchau · 2022-05-11T18:00:17Z

@ZhiXuanLai can we include both the original gff file and the processed gff file in the dispatch output?

mpoelchau · 2022-05-11T20:50:53Z

@ZhiXuanLai when I run the workflow using NA for the url_table_file parameter, I get the following error:

INFO [workflow md5checksums] starting step gunzip_table
INFO [step gunzip_table] start
ERROR Exception on step 'gunzip_table'
ERROR [step gunzip_table] Cannot make job: Invalid job input record:
pipeline/flow_md5checksums/gunzip_single.cwl:21:3: Missing required input parameter 'in_gz'
INFO [workflow md5checksums] completed permanentFail
WARNING [step md5checksums] completed permanentFail
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
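
The log above is consistent with the `gunzip_table` step being started even though there is no table file to decompress. A minimal Python sketch of the intended behaviour follows, assuming `NA` is the placeholder for "no table file" as in the run described above. The function names are hypothetical and are not part of the workflow.

```python
def table_urls(url_table_file):
    """Return the real URLs from a url_table_file parameter, dropping 'NA' placeholders."""
    if url_table_file is None:
        return []
    if isinstance(url_table_file, str):
        url_table_file = [url_table_file]
    return [u for u in url_table_file if u and u.strip() != "NA"]


def should_run_gunzip_table(url_table_file):
    """Download and gunzip the table only when at least one real URL was given."""
    return bool(table_urls(url_table_file))
```

With this check, a parameter of `["NA"]` would skip the download and gunzip steps instead of passing an empty `in_gz` downstream.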

ZhiXuanLai · 2022-05-12T09:16:53Z

Hi @mpoelchau
(1) I updated the filenames in writeLastLine-genePred.cwl. I wonder if the filenames in writeLastLine.cwl need to be changed too. The current content is: "echo -e '\nThe file [file] was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for all operations within the i5k Workspace.' >> readme.txt"

(2) Sure! I added the two gff files to dispatch output.

(3) My bad! I fixed the error now.

mpoelchau · 2022-05-12T13:14:04Z

@ZhiXuanLai thanks for the updates! I get the following error when I try to run the pipeline with

url_table_file: [
NA
]

INFO [step gaps-or-not] start
INFO [job gaps-or-not] /tmp/83xizhb_$ perl \
    -ne \
    'print if /N/' \
    id_deleted_file.txt > /tmp/83xizhb_/lines-contain-N.txt
INFO [job gaps-or-not] completed success
INFO [step gaps-or-not] completed success
INFO [workflow gaps_or_not] completed success
INFO [step gaps_or_not] completed success
INFO [workflow ] starting step add_annotation
INFO [step add_annotation] will be skipped
INFO [step add_annotation] completed skipped
INFO [workflow ] starting step apollo2_data_processing
INFO [step apollo2_data_processing] start
ERROR Exception on step 'apollo2_data_processing'
ERROR [step apollo2_data_processing] Cannot make job: Invalid job input record:
pipeline/flow_apollo2_data_processing/processing/workflow.cwl:15:3: Missing required input parameter 'in_gff'
INFO [workflow ] completed permanentFail
{}
WARNING Final process status is permanentFail
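
In this log `add_annotation` is skipped, so no processed GFF is produced and `apollo2_data_processing` is left without its required `in_gff`. One way to express the intended fallback, sketched in Python, is to take the processed GFF when it exists and the original GFF otherwise. The function names are illustrative assumptions. CWL v1.2 offers the same idea as `pickValue: first_non_null` over multiple sources, though the thread does not say which fix was applied.

```python
from typing import Optional, Sequence


def first_non_null(candidates: Sequence[Optional[str]]) -> str:
    """Return the first candidate that is not None; fail loudly if all are None."""
    for value in candidates:
        if value is not None:
            return value
    raise ValueError("no GFF available: all candidates are null")


def choose_in_gff(processed_gff: Optional[str], original_gff: Optional[str]) -> str:
    """Prefer the annotated GFF; fall back to the original when annotation was skipped."""
    return first_non_null([processed_gff, original_gff])
```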

mpoelchau · 2022-05-12T13:20:48Z

@ZhiXuanLai when I run the program with the table file URL, it completes successfully. The readme file looks good. However, I don't see the unprocessed gff in the analyses directory:

apollo@apollo:~$ ls /app/data/other_species/saceub/SEUB3.0/scaffold/analyses/Saccharomyces_eubayanus_Annotation_Release_100/
GCF_001298625.1_SEUB3.0_cds_from_genomic.fna   GCF_001298625.1_SEUB3.0_rna_from_genomic.fna  readme.txt
GCF_001298625.1_SEUB3.0_genomic.annotated.gff  GCF_001298625.1_SEUB3.0_translated_cds.faa
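
A quick check for this could compare the directory listing against the files expected in the dispatch output. The sketch below is illustrative only: `missing_from_dispatch` is a hypothetical helper, and the unprocessed file name is assumed to be the NCBI GFF name with `.gz` removed.

```python
def missing_from_dispatch(listing, expected):
    """Return the expected file names that are absent from a directory listing."""
    present = set(listing)
    return [name for name in expected if name not in present]


# File names as shown in the `ls` output above.
listing = [
    "GCF_001298625.1_SEUB3.0_cds_from_genomic.fna",
    "GCF_001298625.1_SEUB3.0_rna_from_genomic.fna",
    "readme.txt",
    "GCF_001298625.1_SEUB3.0_genomic.annotated.gff",
    "GCF_001298625.1_SEUB3.0_translated_cds.faa",
]
expected = [
    "GCF_001298625.1_SEUB3.0_genomic.gff",  # assumed name of the unprocessed GFF
    "GCF_001298625.1_SEUB3.0_genomic.annotated.gff",
]
print(missing_from_dispatch(listing, expected))
```

For this listing the only missing entry is the unprocessed GFF. On a real system the listing could come from `os.listdir` on the analyses directory.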

mpoelchau · 2022-05-12T13:21:36Z

For the readme update, we won't need to change writeLastLine.cwl - that only pertains to the assembly readme, and that file remains unchanged. Good question though!

ZhiXuanLai · 2022-05-13T00:36:36Z

@mpoelchau Sorry for not fixing the error. I must have run the pipeline without saving the change in the yaml file.
I have a question about the filename in writeLastLine.cwl: what should we fill in for the [processed-file-name] field when no table file is provided (and so there is no processed gff file)?

mpoelchau · 2022-05-13T12:58:18Z

Good question! Is it possible to leave that line unchanged?

mpoelchau · 2022-05-23T15:50:49Z

Update on how to handle writeLastLine-genePred.cwl.

If functional annotation is used at the beginning of the workflow, the last line should be: The file $(inputs.original_gff.basename) was post-processed to add functional annotations from the AgBase functional annotation pipeline (https://github.com/agbase). The resulting file is: $(inputs.processed_gff.basename). This file was used for all operations within the i5k Workspace
If functional annotation is not used at the beginning of the workflow, the last line should be: The file was post-processed to [describe post-processing, if any]. The resulting file is: [Filename]. This file was used for the JBrowse genome browser and the Apollo manual curation tool.
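
The two cases above amount to a simple conditional. The Python sketch below illustrates the logic only; in the workflow itself this lives in the `valueFrom` expression of writeLastLine-genePred.cwl. The function and constant names are hypothetical, and the trailing period on the first template follows the earlier `valueFrom` line in this thread.

```python
ANNOTATED_TEMPLATE = (
    "The file {original} was post-processed to add functional annotations from "
    "the AgBase functional annotation pipeline (https://github.com/agbase). "
    "The resulting file is: {processed}. "
    "This file was used for all operations within the i5k Workspace."
)

# Placeholder text quoted from the comment above, left for manual editing.
DEFAULT_LINE = (
    "The file was post-processed to [describe post-processing, if any]. "
    "The resulting file is: [Filename]. "
    "This file was used for the JBrowse genome browser and the Apollo manual curation tool."
)


def readme_last_line(original_gff=None, processed_gff=None):
    """Return the readme line for the annotated case, else the unchanged placeholder."""
    if original_gff and processed_gff:
        return ANNOTATED_TEMPLATE.format(original=original_gff, processed=processed_gff)
    return DEFAULT_LINE
```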

mpoelchau · 2022-06-08T16:55:19Z

We need to add another process that I forgot about when I described this issue. The functional annotation directory (name is now in the tree variable array) needs to be moved into the analyses directory during the dispatch workflow. We could add a sub-workflow similar to https://github.com/NAL-i5K/Organism_Onboarding/blob/master/flow_dispatch/2other_species/cp_dir.cwl.
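
As a rough illustration of what such a sub-workflow would do, here is a Python sketch. It copies rather than moves the directory, by analogy with cp_dir.cwl; that choice, the function name `copy_annotation_dir`, and the refusal to overwrite an existing copy are all assumptions, not details from the repository.

```python
import shutil
from pathlib import Path


def copy_annotation_dir(annotation_dir, analyses_dir):
    """Copy a functional annotation directory into the analyses directory.

    Returns the destination path. Refuses to overwrite an existing copy.
    """
    src = Path(annotation_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"not a directory: {src}")
    dest = Path(analyses_dir) / src.name
    if dest.exists():
        raise FileExistsError(f"already dispatched: {dest}")
    Path(analyses_dir).mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest)
    return dest
```

The directory keeps its own name under the analyses directory, so a name containing spaces (as in the Ceres paths earlier in this thread) is carried over unchanged.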

mpoelchau · 2022-07-15T15:17:20Z

@childers could you take a look at the last comment/update?

mpoelchau assigned ZhiXuanLai Apr 26, 2022

mpoelchau assigned childers and unassigned ZhiXuanLai Jul 15, 2022

mpoelchau assigned mpoelchau and unassigned childers Oct 14, 2022

mpoelchau closed this as completed Nov 2, 2023
