forked from broadinstitute/picard
Bug fix for tiny edge-case in MarkDuplicates
* In the extremely improbable case that the two mates of a pair share the same unclipped 5' end and are on opposite strands, the order of the reads in the file determines the orientation ("innie" vs. "outtie"). If the relative order of the mates is not deterministic, two such pairs at the same position may or may not be marked as duplicates. This is virtually impossible in coordinate-sorted input (all but one base of the back-facing read would have to be clipped), but it has happened in the wild in queryname-sorted input.
- This PR corrects the behavior by declaring such mates to be "innies" regardless of their order in the file.
- Tests covering both cases (queryname-sorted and coordinate-sorted) have been added.
- Variable names were changed to make the code more readable.
- A small saving was found that halves the number of times the read names are parsed.
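
The tie-breaking rule described above can be sketched as follows. This is a minimal illustration, not the code changed by this commit: the class `OrientationSketch`, the `End` holder, and the `orientation` method are hypothetical names, and the orientation labels are simplified (FR stands for "innie", RF for "outtie"). The only point is that when both mates have the same unclipped 5' coordinate and opposite strands, the forward-strand mate is treated as the first end, so the result does not depend on the order of the arguments.

```java
// Hypothetical sketch; names and structure are assumptions, not Picard's API.
class OrientationSketch {
    enum Orientation { FR, RF, FF, RR }

    /** Stand-in for one end of a pair: unclipped 5' coordinate and strand. */
    static final class End {
        final int unclipped5Prime;
        final boolean reverseStrand;

        End(final int unclipped5Prime, final boolean reverseStrand) {
            this.unclipped5Prime = unclipped5Prime;
            this.reverseStrand = reverseStrand;
        }
    }

    /** Pair orientation that is independent of the order in which the mates are passed. */
    static Orientation orientation(final End a, final End b) {
        End first = a;
        End second = b;
        // Order by coordinate. On a tie with opposite strands, put the
        // forward-strand mate first so that the pair is always an "innie" (FR).
        final boolean tie = a.unclipped5Prime == b.unclipped5Prime;
        if (b.unclipped5Prime < a.unclipped5Prime
                || (tie && a.reverseStrand && !b.reverseStrand)) {
            first = b;
            second = a;
        }
        if (first.reverseStrand == second.reverseStrand) {
            return first.reverseStrand ? Orientation.RR : Orientation.FF;
        }
        return first.reverseStrand ? Orientation.RF : Orientation.FR;
    }

    public static void main(final String[] args) {
        final End forward = new End(43092875, false);
        final End reverse = new End(43092875, true);
        // Both argument orders yield the same orientation.
        System.out.println(orientation(forward, reverse));
        System.out.println(orientation(reverse, forward));
    }
}
```

Without the tie-break clause, the second call would leave the reverse-strand mate first and report RF, which is the order dependence the commit message describes.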
Yossi Farjoun committed May 26, 2016
1 parent 04ed147, commit 85823d2
Showing 14 changed files with 300 additions and 27 deletions.
68 changes: 68 additions & 0 deletions
src/tests/java/picard/sam/markduplicates/AsIsMarkDuplicatesTester.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
/* | ||
* The MIT License | ||
* | ||
* Copyright (c) 2016 The Broad Institute | ||
* | ||
* Permission is hereby granted, free of charge, to any person obtaining a copy | ||
* of this software and associated documentation files (the "Software"), to deal | ||
* in the Software without restriction, including without limitation the rights | ||
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
* copies of the Software, and to permit persons to whom the Software is | ||
* furnished to do so, subject to the following conditions: | ||
* | ||
* The above copyright notice and this permission notice shall be included in | ||
* all copies or substantial portions of the Software. | ||
* | ||
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
* THE SOFTWARE. | ||
*/ | ||
package picard.sam.markduplicates; | ||
|
||
import htsjdk.samtools.SamReader; | ||
import htsjdk.samtools.SamReaderFactory; | ||
import htsjdk.samtools.util.CloserUtil; | ||
import org.testng.annotations.DataProvider; | ||
import org.testng.annotations.Test; | ||
|
||
import java.io.File; | ||
|
||
/** | ||
* Tests a few hand-built SAM files as they are. | ||
*/ | ||
public class AsIsMarkDuplicatesTester { | ||
|
||
@DataProvider | ||
public Object[][] testSameUnclipped5PrimeOppositeStrandData() { | ||
final File TEST_DIR = new File("testdata/picard/sam/MarkDuplicates"); | ||
return new Object[][]{ | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndv1.sam")}, | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndv2.sam")}, | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndCoordinateSortedv1.sam")}, | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndCoordinateSortedv2.sam")}, | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndCoordinateSortedv3.sam")}, | ||
new Object[]{new File(TEST_DIR, "sameUnclipped5primeEndCoordinateSortedv4.sam")} | ||
}; | ||
} | ||
|
||
@Test(dataProvider = "testSameUnclipped5PrimeOppositeStrandData") | ||
public void testSameUnclipped5PrimeOppositeStrand(final File input) { | ||
|
||
final AbstractMarkDuplicatesCommandLineProgramTester tester = new BySumOfBaseQAndInOriginalOrderMDTester(); | ||
|
||
final SamReader reader = SamReaderFactory.makeDefault().open(input); | ||
|
||
tester.setHeader(reader.getFileHeader()); | ||
reader.iterator().stream().forEach(tester::addRecord); | ||
|
||
CloserUtil.close(reader); | ||
tester.setExpectedOpticalDuplicate(0); | ||
tester.runTest(); | ||
} | ||
} | ||
|
||
|
28 changes: 28 additions & 0 deletions
testdata/picard/sam/MarkDuplicates/sameUnclipped5primeEndCoordinateSortedv1.sam
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
@HD VN:1.5 SO:coordinate | ||
@SQ SN:1 LN:197195432 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:f05d753079c455c0e57af88eeda24493 SP:Mus musculus | ||
@SQ SN:2 LN:181748087 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:9b9d64dc89ecc73d3288eb38af3f94bd SP:Mus musculus | ||
@SQ SN:3 LN:159599783 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:0a692666a1b8526e1d1e799beb71b6d0 SP:Mus musculus | ||
@SQ SN:4 LN:155630120 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:f5993a04396a06ed6b28fa42b2429be0 SP:Mus musculus | ||
@SQ SN:5 LN:152537259 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:f90804fb8fe9cb06076d51a710fb4563 SP:Mus musculus | ||
@SQ SN:6 LN:149517037 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:258a37e20815bb7e3f2e974b9d4dd295 SP:Mus musculus | ||
@SQ SN:7 LN:152524553 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:e0d6cea6f72cb4d9f8d0efc1d29dd180 SP:Mus musculus | ||
@SQ SN:8 LN:131738871 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:5f217cb8a9685b9879add3ae110cabd7 SP:Mus musculus | ||
@SQ SN:9 LN:124076172 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:dde08574942fc18050195618cc3f35af SP:Mus musculus | ||
@SQ SN:10 LN:129993255 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:be7e6a13cc6b9da7c1da7b7fc32c5506 SP:Mus musculus | ||
@SQ SN:11 LN:121843856 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:e0099550b3d3943fb9bb7af6fa6952c1 SP:Mus musculus | ||
@SQ SN:12 LN:121257530 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:1f9c11dc6f288f93e9fab56772a36e85 SP:Mus musculus | ||
@SQ SN:13 LN:120284312 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:a7b4bb418aa21e0ec59d9e2a1fe1810b SP:Mus musculus | ||
@SQ SN:14 LN:125194864 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:09d1c8449706a17d40934302a0a3b671 SP:Mus musculus | ||
@SQ SN:15 LN:103494974 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:e41c8b42b0921378b1fdd5172f6be067 SP:Mus musculus | ||
@SQ SN:16 LN:98319150 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:e051b3930c2557ade21d67db41f3a518 SP:Mus musculus | ||
@SQ SN:17 LN:95272651 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:47eede15e5761fb9c2267627f18211e7 SP:Mus musculus | ||
@SQ SN:18 LN:90772031 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:9f9d41cfdb9d91b62b928a3eb4eb6928 SP:Mus musculus | ||
@SQ SN:19 LN:61342430 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:591f8486f82c22442bb8463595a18e0a SP:Mus musculus | ||
@SQ SN:X LN:166650296 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:3d0d9df898d2c830b858f91255d8a1eb SP:Mus musculus | ||
@SQ SN:Y LN:15902555 UR:ftp://hgdownload.cse.ucsc.edu/goldenPath/Mus_musculus_assembly9/bigZips/chromFa.tar.gz AS:mm9 M5:5ff564f9fbc8cb87bcad6cfa6874902b SP:Mus musculus | ||
@RG ID:00001.3 PL:illumina PU:00001ABXX101026.3.TTGAGCCT LB:Solexa-45924 DT:2010-10-26T00:00:00-0400 SM:stat2_120 CN:BI | ||
@RG ID:00001.3 PL:illumina PU:00001ABXX101026.3.TTGAGCCT LB:Solexa-45924 DT:2010-10-26T00:00:00-0400 SM:stat2_120 CN:BI | ||
ST-E00297:149016593:H3GVWCCXX:3:1218:20812:27591 145 8 43092875 0 150S1M = 43092875 49 AAAAAAAAAA BBBBBBBBBB MC:Z:151M RG:Z:00001.3 | ||
ST-E00297:149016593:H3GVWCCXX:3:1218:20812:27591 97 8 43092875 0 151M = 43092875 -49 AAAAAAAAAA BBBBBBBBBB MC:Z:150S1M RG:Z:00001.3 | ||
ST-E00297:149016593:H3GVWCCXX:5:2214:10145:57038 1169 8 43092875 0 150S1M = 43092875 44 AAAAAAAAAA AAAAAAAAAA MC:Z:151M RG:Z:00001.3 | ||
ST-E00297:149016593:H3GVWCCXX:5:2214:10145:57038 1121 8 43092875 0 151M = 43092875 -44 AAAAAAAAAA AAAAAAAAAA MC:Z:150S1M RG:Z:00001.3 |
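
To see why the records above exercise the edge case, the unclipped 5' end of each read can be derived from its POS, CIGAR, and FLAG fields. The sketch below is an illustration only: `Unclipped5PrimeSketch`, `isReverse`, and `unclipped5Prime` are hypothetical helpers written for this example and are not part of Picard or htsjdk. It assumes the SAM convention that flag bit 0x10 marks a reverse-strand read, and takes the 5' end to be the unclipped start for forward reads and the unclipped end for reverse reads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the Picard or htsjdk implementation.
class Unclipped5PrimeSketch {
    private static final Pattern CIGAR_OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /** SAM flag bit 0x10: the read is aligned to the reverse strand. */
    static boolean isReverse(final int flag) {
        return (flag & 0x10) != 0;
    }

    /** Unclipped 5' coordinate: unclipped start if forward, unclipped end if reverse. */
    static int unclipped5Prime(final int alignmentStart, final String cigar, final boolean reverse) {
        int leadingClip = 0;
        int trailingClip = 0;
        int referenceLength = 0;
        boolean seenNonClip = false;
        final Matcher m = CIGAR_OP.matcher(cigar);
        while (m.find()) {
            final int length = Integer.parseInt(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'S':
                case 'H':
                    // Clips before the first non-clip operator are leading, the rest trailing.
                    if (seenNonClip) trailingClip += length; else leadingClip += length;
                    break;
                case 'M':
                case 'D':
                case 'N':
                case '=':
                case 'X':
                    // Operators that consume reference bases.
                    referenceLength += length;
                    seenNonClip = true;
                    break;
                default:
                    // I and P consume no reference bases.
                    seenNonClip = true;
                    break;
            }
        }
        final int unclippedStart = alignmentStart - leadingClip;
        final int unclippedEnd = alignmentStart + referenceLength - 1 + trailingClip;
        return reverse ? unclippedEnd : unclippedStart;
    }

    public static void main(final String[] args) {
        // First pair above: flag 145 with CIGAR 150S1M, and flag 97 with CIGAR 151M.
        System.out.println(unclipped5Prime(43092875, "150S1M", isReverse(145)));
        System.out.println(unclipped5Prime(43092875, "151M", isReverse(97)));
    }
}
```

Under these assumptions, the reverse-strand read (150S1M) and the forward-strand read (151M) both resolve to coordinate 43092875, which is the shared unclipped 5' end that the commit message describes.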