Skip to content

Commit 85aaf86

Browse files
committed
VcfCalculatePRS: improved documentation and error message.
1 parent a54a281 commit 85aaf86

File tree

3 files changed

+44
-36
lines changed

3 files changed

+44
-36
lines changed

README.md

+16-16
Original file line numberDiff line numberDiff line change
@@ -16,24 +16,21 @@ Binaries of *ngs-bits* are available via Bioconda. Alternatively, *ngs-bits* can
1616

1717
Changes already implemented in GIT master for next release:
1818

19-
* GSvar: improved cfDNA sample handling (Batchimport of cfDNA panels, queue anlysis, ...)
19+
* none
20+
21+
22+
Changes in release 2021_09:
23+
24+
* GenePrioritization: Performs gene prioritization based on list of known disease genes and a PPI graph.
25+
* GraphStringDb: Creates simple representation of String-DB interaction graph.
26+
* VcfCheck: improved checking of empty INFO column
2027
* VcfAnnotateFromBed: added multithread support
21-
* MappingQC: added support for cfDNA samples
22-
23-
Changes in release 2021_06:
24-
25-
* General: Improved GRCh38 support in several tools.
26-
* General: Using BGZIP for compressed VCFs now to allow indexing them with tabix.
27-
* VcfAnnotateFromBed: Made separator configurable; Added check for separator in source BED file; Fixed broken output VCF if input has no FORMAT column.
28-
* VcfAnnotateFromVcf: Fixed crash in VCF header parser.
29-
* NGSDExportSamples: Added ancestry column.
30-
* SampleAncestry: Improved runtime and memory use.
31-
* SampleGender: Improved runtime for algorithm 'hetx'.
32-
* SomaticQC: Added support for mutect2.
28+
* VcfAnnotateFromVcf: improved memory useage
29+
* MappingQC: added support for cfDNA samples, improved support for RNA
3330
* NGSD:
34-
* Added disease status 'Unclear' to table 'sample'.
35-
* Added table 'processed_sample_ancestry'.
36-
* Added percent occupied to 'runqc_lane' (for Illumina NovaSeq).
31+
* removed 'gene' and 'variant\_type' columns from 'variant' table
32+
* added 'germline\_het' and 'germline\_hom' columns to 'variant' table
33+
* added method 'shallow WGS' to 'variant\_validation' table
3734

3835
For older releases see the [releases page](https://github.com/imgag/ngs-bits/releases).
3936

@@ -134,6 +131,7 @@ The default output format of the quality control tools is [qcML](https://pubmed.
134131
* [VcfAnnotateFromBed](doc/tools/VcfAnnotateFromBed.md) - Annotates the INFO column of a VCF with data from a BED file.
135132
* [VcfAnnotateFromVcf](doc/tools/VcfAnnotateFromVcf.md) - Annotates the INFO column of a VCF with data from another VCF file (or multiple VCF files if config file is provided)
136133
* [VcfBreakMulti](doc/tools/VcfBreakMulti.md) - Breaks multi-allelic variants into several lines, making sure that allele-specific INFO/SAMPLE fields are still valid.
134+
* [VcfCalculatePRS](doc/tools/VcfCalculatePRS.md) - Calculates the Polgenic Risk Score(s) for a sample.
137135
* [VcfCheck](doc/tools/VcfCheck.md) - Checks a VCF file for errors.
138136
* [VcfExtractSamples](doc/tools/VcfExtractSamples.md) - Extract one or several samples from a VCF file.
139137
* [VcfFilter](doc/tools/VcfFilter.md) - Filters a VCF based on the given criteria.
@@ -154,6 +152,8 @@ The default output format of the quality control tools is [qcML](https://pubmed.
154152

155153
### Gene handling tools
156154

155+
* [GenePrioritization](doc/tools/GenePrioritization.md): Performs gene prioritization based on list of known disease genes and a PPI graph (see also GraphStringDb).
156+
* [GraphStringDb](doc/tools/GraphStringDb.md): Creates simple representation of String-DB interaction graph.
157157
* [GenesToApproved](doc/tools/GenesToApproved.md) - Replaces gene symbols by approved symbols using the HGNC database (needs [NGSD](doc/install_ngsd.md)).
158158
* [GenesToBed](doc/tools/GenesToBed.md) - Converts a text file with gene names to a BED file (needs [NGSD](doc/install_ngsd.md)).
159159
* [NGSDExportGenes](doc/tools/NGSDExportGenes.md) - Lists genes from NGSD (needs [NGSD](doc/install_ngsd.md)).

doc/tools/VcfCalculatePRS.md

+8-4
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,14 @@
11
### VcfCalculatePRS tool help
2-
VcfCalculatePRS (2020_06-40-g7bb4fdf2)
2+
VcfCalculatePRS (2021_06-89-gbbd16264)
33

4-
Calculates the Polgenic Risk Score for a given set of PRS VCFs
4+
Calculates the Polgenic Risk Score(s) for a sample.
5+
6+
The PRS VCF files have to contain a WEIGHT entry in the INFO column.
7+
Additionally some information about the PRS score is required in the VCF header.
8+
An example VCF file can be found at https://github.com/imgag/ngs-bits/blob/master/src/tools-TEST/data_in/VcfCalculatePRS_prs2.vcf
59

610
Mandatory parameters:
7-
-in <file> Tabix indexed VCF.GZ file of the sample.
11+
-in <file> Tabix indexed VCF.GZ file of a sample.
812
-prs <filelist> List of PRS VCFs.
913
-out <file> Output TSV file containing Scores and PRS details
1014

@@ -15,7 +19,7 @@
1519
--tdx Writes a Tool Definition Xml file. The file name is the application name with the suffix '.tdx'.
1620

1721
### VcfCalculatePRS changelog
18-
VcfCalculatePRS 2020_06-40-g7bb4fdf2
22+
VcfCalculatePRS 2021_06-89-gbbd16264
1923

2024
2020-07-22 Initial version of this tool.
2125
[back to ngs-bits](https://github.com/imgag/ngs-bits)

src/VcfCalculatePRS/main.cpp

+20-16
Original file line numberDiff line numberDiff line change
@@ -20,8 +20,10 @@ class ConcreteTool
2020

2121
virtual void setup()
2222
{
23-
setDescription("Calculates the Polgenic Risk Score for a given set of PRS VCFs");
24-
addInfile("in", "Tabix indexed VCF.GZ file of the sample.", false);
23+
setDescription("Calculates the Polgenic Risk Score(s) for a sample.");
24+
setExtendedDescription(QStringList() << "The PRS VCF files have to contain a WEIGHT entry in the INFO column." << "Additionally some information about the PRS score is required in the VCF header." << "An example VCF file can be found at https://github.com/imgag/ngs-bits/blob/master/src/tools-TEST/data_in/VcfCalculatePRS_prs2.vcf");
25+
26+
addInfile("in", "Tabix indexed VCF.GZ file of a sample.", false);
2527
addInfileList("prs", "List of PRS VCFs.", false);
2628
addOutfile("out", "Output TSV file containing Scores and PRS details", false);
2729

@@ -52,15 +54,14 @@ class ConcreteTool
5254
//does not support multi sample
5355
if(prs_variant_list.sampleIDs().count() > 1)
5456
{
55-
THROW(FileParseException, "VCF file can only contain one sample for Polgenic Risk Score calculation.");
57+
THROW(FileParseException, "PRS VCF file must not contain more than one sample: " + prs_file_path);
5658
}
5759

5860
//define PRS
5961
double prs = 0;
6062
QVector<double> percentiles;
6163
QHash<QByteArray,QByteArray> column_entries;
6264

63-
6465
//parse comment lines
6566
foreach(const VcfHeaderLine& comment_line, prs_variant_list.vcfHeader().comments())
6667
{
@@ -71,7 +72,7 @@ class ConcreteTool
7172
{
7273
if(column_entries.contains(column_name))
7374
{
74-
THROW(FileParseException, "Comment section of PRS VCF '" + prs_file_path + "' contains more than one entry for '" +column_name + "'!");
75+
THROW(FileParseException, "Comment section of PRS VCF file contains more than one entry for '" +column_name + "': " + prs_file_path);
7576
}
7677
column_entries[column_name] = comment_line.value.trimmed();
7778
break;
@@ -80,10 +81,9 @@ class ConcreteTool
8081

8182
if(comment_line.key.startsWith("percentiles"))
8283
{
83-
if (percentiles.size() != 0) THROW(FileParseException, "Percentiles for PRS VCF '" + prs_file_path + "' are given twice!");
84+
if (percentiles.size() != 0) THROW(FileParseException, "Percentiles in PRS VCF file given twice: " + prs_file_path);
8485
QByteArrayList percentile_string = comment_line.value.trimmed().split(',');
85-
if (percentile_string.size() != 100) THROW(FileParseException, "Invalid number of percentiles given (required: 100, given: "
86-
+ QByteArray::number(percentile_string.size()) + "!");
86+
if (percentile_string.size() != 100) THROW(FileParseException, "Invalid number of percentiles given (required: 100, given: " + QByteArray::number(percentile_string.size()) + ": " + prs_file_path);
8787
foreach (const QByteArray& value_string, percentile_string)
8888
{
8989
percentiles.append(Helper::toDouble(value_string, "Percentile"));
@@ -98,7 +98,7 @@ class ConcreteTool
9898
if(col_entries_not_in_header.contains(key)) continue;
9999
if(!column_entries.contains(key))
100100
{
101-
THROW(FileParseException, "Comment section of PRS VCF '" + prs_file_path + "' misses the entry for '" + key + "'!");
101+
THROW(FileParseException, "Comment section of PRS VCFs does not contsin an entry for '" + key + "': " + prs_file_path);
102102
}
103103
}
104104

@@ -110,7 +110,7 @@ class ConcreteTool
110110
//does not support multi-allelic variants
111111
if(prs_variant.isMultiAllelic())
112112
{
113-
THROW(FileParseException, "Does not support multi-allelic variants for Polgenic Risk Score calculation.");
113+
THROW(FileParseException, "Multi-allelic variants in PRS VCF files are not supported: " + prs_variant.variantToString());
114114
}
115115

116116
int allele_count = 0;
@@ -119,13 +119,17 @@ class ConcreteTool
119119
QByteArrayList matching_variants;
120120
foreach(const QByteArray& line, matching_lines)
121121
{
122-
// check if variant has same ref/alternative base(s)
123-
if((Sequence(line.split('\t')[3]) == prs_variant.ref()) && (Sequence(line.split('\t')[4]) == prs_variant.alt(0))) matching_variants.append(line);
122+
// check if overlapping variant is actually the one we are looking for
123+
QByteArrayList parts = line.split('\t');
124+
if(parts[1].toInt()==prs_variant.start() && parts[3]==prs_variant.ref() && parts[4]==prs_variant.alt(0))
125+
{
126+
matching_variants.append(line);
127+
}
124128
}
125129

126130
if(matching_variants.size() > 1)
127131
{
128-
THROW(FileParseException, "Variant '" + prs_variant.variantToString() + "' occures multiple times in sample VCF!");
132+
THROW(FileParseException, "Variant occures multiple times in sample VCF: " + prs_variant.variantToString());
129133
}
130134

131135
if(matching_variants.size() == 1)
@@ -135,12 +139,12 @@ class ConcreteTool
135139
QByteArrayList format_header_items = split_line[8].split(':');
136140
QByteArrayList format_value_items = split_line[9].split(':');
137141
int genotype_idx = format_header_items.indexOf("GT");
138-
if(genotype_idx < 0) THROW(FileParseException, "Genotype information is missing for variant '" + prs_variant.variantToString() + "'!");
139-
QByteArray genotype = format_value_items[genotype_idx].trimmed();
142+
if(genotype_idx < 0) THROW(FileParseException, "Genotype information is missing for sample variant: " + matching_variants[0]);
143+
QByteArray genotype = format_value_items[genotype_idx].trimmed().replace("|", "/").replace(".", "0");
140144

141145
if(genotype == "0/1") allele_count = 1;
142146
else if(genotype == "1/1") allele_count = 2;
143-
else THROW(FileParseException, "Invalid genotype '" + genotype + "' in variant " + prs_variant.variantToString() + "'!");
147+
else THROW(FileParseException, "Invalid genotype '" + genotype + "' in sample variant: " + matching_variants[0]);
144148

145149
//calculate PRS part
146150
double weight = Helper::toDouble(prs_variant.info("WEIGHT"), "PRS weight");

0 commit comments

Comments
 (0)