
Fix read length plotting in aggregate mode #928

Merged · 1 commit · Jan 30, 2020

Conversation

@fanli-gcb (Contributor)

Small fix to the calculation of the read counts that pass each length cutoff when plotting in aggregate mode. Cum=10*as.vector(cums)/rc references the read count from the last sample processed; it should instead use the sum of the actual read counts used in the tabulation.

@benjjneb (Owner)

Can you provide a minimal example that demonstrates the bug in the current behavior? A quick inspection of the results with 3 random fastqs of different lengths, in both normal and aggregated mode, turned out as I expected.

@fanli-gcb (Contributor, Author)

Sure, see the attached example of 3 fastq files with 10k, 10k, and 100 reads.
fastqs.zip

Then:

fnFs <- c("S1.fastq.gz", "S2.fastq.gz", "small.fastq.gz")
plotQualityProfile(fnFs, aggregate=TRUE)

will yield the attached plot (Rplots).

The key is that the last element of fnFs is substantially smaller than the others: the read count from the last file processed gets used in the cumulative calculation, so the aggregated fractions end up scaled by the wrong denominator.

@benjjneb (Owner)

Ah, OK, I see it now: effectively the last rc was being multiplied by the number of files, which isn't the correct way to handle this, since rc*nrow(anndf) doesn't have to equal the total number of records across all the files.
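To make the arithmetic concrete, here is a small numeric sketch of why that normalization fails (illustrative only, written in Python rather than the actual dada2 R code; the variable names mirror the discussion but the setup is an assumption):

```python
# Illustrative sketch of the bug discussed above (not dada2 code).
# Three files with 10k, 10k, and 100 reads, as in the attached example.
read_counts = [10_000, 10_000, 100]

# Suppose every read reaches a given length cutoff, so the aggregated
# cumulative count at that cutoff is the total number of reads.
cums = sum(read_counts)        # 20100

rc_last = read_counts[-1]      # rc from the last file processed (100)
n_files = len(read_counts)     # plays the role of nrow(anndf)

# Buggy normalization: the denominator is effectively rc * n_files,
# which equals the true total only if all files have the same size.
buggy = cums / (rc_last * n_files)   # 20100 / 300 = 67.0 (inflated)

# Fixed normalization: divide by the actual total read count.
fixed = cums / sum(read_counts)      # 20100 / 20100 = 1.0 (100% pass)

print(buggy, fixed)
```

Note that the size of the error depends entirely on which file happens to be processed last, which is why reorderings of fnFs produce different distortions.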

@benjjneb (Owner)

LGTM, thanks!

@benjjneb benjjneb merged commit a9d25c7 into benjjneb:master Jan 30, 2020