Question about the filter steps between the main report and the matrix in DIANN 1.9 #1056

Open
momo-0521 opened this issue Jun 20, 2024 · 12 comments

Comments

@momo-0521

Hi Vadim

Thanks for your work in DiaNN 1.9.
When analyzing the results from version 1.9, I've observed discrepancies between the number of Protein.Group entries filtered by R and those reported in report.pg_matrix. Are there additional filtering steps being applied? I suspect that the "Additional 5% run-specific protein-level FDR filter applied to the protein matrices, use --matrix-spec-q to adjust it" might be impacting the results. However, I'm unsure how to address this issue.

library(diann)  # provides diann_load()
library(arrow)  # provides read_parquet()
report_pg <- diann_load("report.pg_matrix.tsv")
length(unique(report_pg$Protein.Group))
[1] 13121
df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01]))
[1] 14126

Thank you in advance

@vdemichev
Owner

Hi,

Please try:
df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05]))

Best,
Vadim

@momo-0521
Author

Thank you for your advice.

I have tried this, but it does not work. It affected the number of precursors but had no effect on the number of Protein.Group entries.

df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05]))
[1] 14126

Thank you again!
T

@vdemichev
Owner

Is this MBR output?

@momo-0521
Author

Yes, it is MBR output.

@vdemichev
Owner

Can you please share both the .parquet and pg_matrix?
A quick check: do the timestamps (date modified) on those files match?

Best,
Vadim
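
The timestamp check suggested above can be done from R. This is only a sketch: temporary files stand in for the real outputs, the directory name is hypothetical, and the 10-minute threshold is an arbitrary assumption rather than anything DIA-NN defines.

```r
# Temporary stand-ins for the two output files; point `files` at your
# real report.parquet and report.pg_matrix.tsv instead.
out_dir <- tempfile("diann_out_")  # hypothetical output directory
dir.create(out_dir)
files <- file.path(out_dir, c("report.parquet", "report.pg_matrix.tsv"))
created <- file.create(files)

# Last-modified times of both files and the gap between them in seconds
mtimes <- file.mtime(files)
gap_secs <- abs(as.numeric(difftime(mtimes[1], mtimes[2], units = "secs")))

print(data.frame(file = basename(files), modified = mtimes))

# Assumption: outputs of the same run are written within a few minutes
same_run_likely <- gap_secs < 600
print(same_run_likely)
```

A large gap would suggest that one of the two files was left over from an earlier run.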

@momo-0521
Author

@vdemichev
Owner

length(unique(df$Protein.Group[df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05 & df$PG.MaxLFQ > 0]))
[1] 13121

It works if you also filter for non-zero quantities :)
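
To illustrate why the quantity condition changes the count, here is a minimal sketch on a made-up toy table. Only the column names come from this thread; the runs, protein groups and values are invented. A protein group can pass the q-value filters in some run and still have no positive PG.MaxLFQ there, so it is counted by the q-value filter alone but dropped once the quantity condition is added.

```r
# Toy stand-in for report.parquet; all values are invented for illustration.
df <- data.frame(
  Run            = c("run1", "run1", "run2", "run2"),
  Protein.Group  = c("P1",   "P2",   "P1",   "P3"),
  Lib.PG.Q.Value = c(0.001,  0.002,  0.001,  0.004),
  PG.Q.Value     = c(0.01,   0.03,   0.02,   0.20),
  PG.MaxLFQ      = c(1500,   0,      1800,   900)
)

# Q-value filters only: P1 and P2 pass, P3 fails the 5% run-specific filter
q_ok <- df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05
n_q_only <- length(unique(df$Protein.Group[q_ok]))

# Same filters plus a non-zero quantity: P2 drops out (PG.MaxLFQ is 0)
quant_ok <- q_ok & df$PG.MaxLFQ > 0
n_with_quantity <- length(unique(df$Protein.Group[quant_ok]))

print(c(q_only = n_q_only, with_quantity = n_with_quantity))
```

On a real report, the difference between the two counts corresponds to protein groups that pass the q-value thresholds but never have a positive quantity.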

@momo-0521
Author

Thank you very much for your great help.

Best wishes!

@momo-0521
Author

Hi, Vadim

Thanks for your help yesterday. I have encountered a new question. When I utilized ‘diann_maxlfq’ to estimate protein group quantities, the results appear to differ significantly from those obtained from 'pg_matrix' as well as the 'PG.MaxLFQ' column. Below is the code I employed, which functioned correctly in DIANN 1.8 but has raised some concerns in DIANN 1.9. Do you have any suggestions or advice on this issue?
protein.groups <- diann_maxlfq(
  df[df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05 & df$PG.MaxLFQ > 0, ],
  sample.header = "Run",
  group.header = "Protein.Group",
  id.header = "Precursor.Id",
  quantity.header = "Precursor.Normalised"
)

Thank you in advance!

@vdemichev
Owner

diann_maxlfq implements a simple MaxLFQ algorithm, different from what DIA-NN uses internally. The results will therefore always differ.

@momo-0521
Author

Thank you. I understand.

Another question is about species-specific precursors. Our samples contain a mixture of human and mouse proteins. When running DIANN 1.9, we used both human and mouse FASTA files and added the options '--species-genes' and '--species-ids'. We would like to exclude precursors that are specific to mouse or shared between the two species, and quantify proteins from human-specific precursors only. Under these parameter settings, is the 'PG.MaxLFQ' value calculated from both human-specific and mouse-specific precursors?

Best wishes!

@vdemichev
Owner

It's calculated using all precursors matched to the protein group (Protein.Group column). So in this case you'd want to just discard all entries in the .parquet report with Protein.Ids column string containing 'MOUSE'.
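
The suggested filtering could look roughly like the sketch below. The toy table and its ID strings are hypothetical; what the Protein.Ids column actually contains depends on the FASTA files and options used, so inspect a few real values before relying on the 'MOUSE' pattern.

```r
# Toy stand-in for report.parquet; IDs and quantities are invented.
df <- data.frame(
  Precursor.Id         = c("PEPTIDEA2", "PEPTIDEB2", "PEPTIDEC3"),
  Protein.Ids          = c("A_HUMAN", "B_MOUSE", "A_HUMAN;B_MOUSE"),
  Precursor.Normalised = c(1200, 800, 950)
)

# Flag rows whose Protein.Ids string mentions MOUSE anywhere; this covers
# both mouse-specific entries and entries shared between the two species.
has_mouse <- grepl("MOUSE", df$Protein.Ids, fixed = TRUE)

# Keep only the rows with no mouse match
human_only <- df[!has_mouse, ]
print(human_only)
```

Note that PG.MaxLFQ values already present in the report were computed before this filtering, so protein quantities would have to be re-estimated from the remaining precursors, for example with diann_maxlfq, which as noted above will not match DIA-NN's internal algorithm exactly.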
