FASTA format #1029

Open
TANIAKMONS opened this issue Jun 5, 2024 · 15 comments

@TANIAKMONS

Hello,

I have an issue with the FASTA format. Our FASTA was built from Illumina sequencing data and annotated with KREGG. We first tried without Uniprot annotation and it did not work.
Will it work if the FASTA combines different annotations, including the Uniprot one? It seems that we can't just have the Uniprot FASTA format.

Thanks in advance
TK

@vdemichev
Owner

Hi TK,

Protein sequence IDs should be read correctly from any FASTA. All other information you can always pull out of the FASTA using some FASTA-reading R package, to annotate DIA-NN's output report.

"We have tried a first time without Uniprot annotation and it did not."

How did it manifest?

Best,
Vadim

@saradufour

Hi,

I'm having the same issue in the library-free search. The FASTA header, for example, looks like this:

>P62874,Q3TQ70|TX=10090 OS=Mouse GN=ENSMUSG00000029064.16,Gnb1 TA=NM_001160016.1,ENSMUST00000105616.10,XM_017319977.2,NM_001160017.1,ENSMUST00000030940.14,ENSMUST00000176637.2,ENSMUST00000165335.8,NM_008142.4 PA=ENSMUSP00000030940.8,NP_032168.1,ENSMUSP00000135091.2,XP_017175466.1,ENSMUSP00000101241.4,NP_001153488.1,ENSMUSP00000130123.2,NP_001153489.1,P62874,Q3TQ70
(FASTA file from OpenProt (microprotein identification) with > 500000 entries)
and the output in the log is the following:

[0:48] Processing FASTA
[1:35] Assembling elution groups
[2:47] 23495123 precursors generated
[2:47] Gene names missing for some isoforms
[2:47] Library contains 1 proteins, and 1 genes
[2:51] Encoding peptides for spectra and RTs prediction

Any idea how to fix this issue?

Thanks !
Best,
Sara

@vdemichev
Owner

Hi Sara,

DIA-NN will not correctly extract protein names from this. It should get the IDs OK though, i.e. you can annotate DIA-NN output using some FASTA-reading R package.

Best,
Vadim
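
The suggestion to annotate DIA-NN's output from the FASTA can be sketched as follows. This is a minimal Python sketch (the thread mentions R; Python is used here only for illustration), assuming headers shaped like the example in this thread: a comma-separated ID list before the first `|`, followed by space-separated `KEY=value` fields whose values contain no spaces. The function names are hypothetical, and which token DIA-NN actually reports as the protein ID for such headers is not stated here, so it should be checked against the report before joining.

```python
import re

def parse_header(header):
    """Split an OpenProt-style header into (id_string, {KEY: value})."""
    body = header.lstrip(">").strip()
    ids, _, rest = body.partition("|")
    # Fields look like TX=10090 or GN=a,b; values are assumed to contain no spaces.
    fields = dict(re.findall(r"(\w+)=(\S+)", rest))
    return ids, fields

def build_annotation_table(fasta_lines):
    """Map each header's ID string to its parsed fields; sequence lines are skipped."""
    table = {}
    for line in fasta_lines:
        if line.startswith(">"):
            ids, fields = parse_header(line)
            table[ids] = fields
    return table

# Header shortened from the example in this thread; the sequence line is a placeholder.
fasta_lines = [
    ">P62874,Q3TQ70|TX=10090 OS=Mouse GN=ENSMUSG00000029064.16,Gnb1 PA=NP_032168.1,P62874,Q3TQ70",
    "PLACEHOLDERSEQUENCE",
]
annotations = build_annotation_table(fasta_lines)
gene_names = annotations["P62874,Q3TQ70"]["GN"].split(",")
print(gene_names)
```

The resulting table can then be joined to the protein ID column of the output report to recover gene names and other fields that DIA-NN did not extract itself.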

@TANIAKMONS
Author

Hi Vadim,

I had the same result as Sara (Library contains 1 proteins, and 1 genes).
We have written a script to incorporate Uniprot annotations into the FASTA and now we use DIA-NN 1.9.
This is the result we have:

10 files will be processed
[0:00] Loading FASTA C:\Tania\output_proteinpilot2.fasta
[2:07] Processing FASTA
[4:11] Assembling elution groups
[6:57] 59894740 precursors generated
[6:58] Gene names missing for some isoforms
[6:58] Library contains 717220 proteins, and 1 genes
[7:09] Encoding peptides for spectra and RTs prediction
[9:53] Predicting spectra and IMs
[370:52] Predicting RTs
[409:47] Decoding predicted spectra and IMs
[411:19] Decoding RTs
[412:01] Saving the library to C:\Tania\DIA-NN\1.9\report.predicted.speclib
[415:57] Initialising library

First pass: generating a spectral library from DIA data

[418:51] File #1/10
[418:51] Loading run C:\Tania\PSF21h.wiff
[421:59] 59872940 library precursors are potentially detectable
[423:20] Processing.

Since processing takes very long, we will run it on a more powerful server, which runs Linux. Is the command line the same as with DIA-NN 1.8?

Thanks,
Kind Regards,

TK
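
The symptom reported in the logs above (a protein or gene count of 1 despite a FASTA with hundreds of thousands of entries) can be checked automatically before a long search is started. The sketch below is only an illustration: it assumes the log wording stays exactly as quoted in this thread, and the function names are hypothetical.

```python
import re

# Matches the log line quoted in this thread, e.g.
# "Library contains 717220 proteins, and 1 genes"
LIBRARY_LINE = re.compile(r"Library contains (\d+) proteins, and (\d+) genes")

def library_counts(log_text):
    """Return (proteins, genes) from the first matching log line, or None."""
    match = LIBRARY_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def header_parsing_warnings(log_text):
    """List the symptoms discussed in this thread that appear in the log."""
    warnings = []
    counts = library_counts(log_text)
    if counts is not None:
        proteins, genes = counts
        if proteins == 1:
            warnings.append("only 1 protein recognised")
        if genes == 1:
            warnings.append("only 1 gene recognised")
    if "Gene names missing for some isoforms" in log_text:
        warnings.append("gene names missing for some isoforms")
    return warnings

# Two lines copied from the log posted above.
log_text = """[6:58] Gene names missing for some isoforms
[6:58] Library contains 717220 proteins, and 1 genes"""
print(library_counts(log_text))
print(header_parsing_warnings(log_text))
```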

@vdemichev
Owner

Hi TK,

I would suggest trying the recommended settings first, which should result in a much smaller predicted library and search space.

No, I don't recommend using 1.8.1. If you do, please make sure to use the predicted library generated by 1.9.

Best,
Vadim

@TANIAKMONS
Author

Hi Sara,

We both seem to have a large number of precursors. Can you tell me what kind of computer or server you use for your analysis once the library is generated?
Our computer is able to generate a library as a first step but doesn't seem to progress much in the second step with the raw data.

Best,
Tania

@vdemichev
Owner

Hi Tania,

What is the amount of RAM? If you wish, I can take a look at the log.

Best,
Vadim

@TANIAKMONS
Author

This was the log of the first step, the library generation:
report.log.txt

@TANIAKMONS
Author

This is the 2nd step:

[Attached screenshot: IMG_20240715_085005]

@vdemichev
Owner

Metaproteomics, I guess? Yes, it can take a very long time. I would also suggest using Peptidoform scoring in this case.
You can check in Task Manager whether there's enough free physical RAM.
Mass accuracies are better fixed to 20 ppm (MS2) and 12 ppm (MS1). You can also use --mass-acc-cal 20 if the instrument is properly calibrated. All this will speed things up a bit. You can also try running with Speed and RAM usage set to Ultra-fast mode at first (several-fold faster but noticeably fewer IDs), to get a preliminary feeling for the data.
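
For a command-line run (for example on the Linux server mentioned earlier), the fixed mass accuracies could be passed roughly as sketched below. Only `--mass-acc-cal 20` is quoted in this thread. The flags `--f`, `--lib`, `--out`, `--mass-acc` and `--mass-acc-ms1` reflect my understanding of DIA-NN's command-line options and should be verified against the documentation of the installed version. The executable name `diann`, the function name and the output path are hypothetical. The script only assembles the argument list and does not launch anything.

```python
def build_diann_args(exe, raw_files, library, out_report, calibrated=False):
    """Assemble a DIA-NN argument list with fixed mass accuracies.

    Flag names other than --mass-acc-cal are assumptions to check against
    the DIA-NN documentation for the version in use.
    """
    args = [exe]
    for path in raw_files:
        args += ["--f", path]            # one --f per raw file
    args += ["--lib", library, "--out", out_report]
    args += ["--mass-acc", "20"]         # MS2 mass accuracy, ppm
    args += ["--mass-acc-ms1", "12"]     # MS1 mass accuracy, ppm
    if calibrated:
        # Suggested in the thread only if the instrument is properly calibrated.
        args += ["--mass-acc-cal", "20"]
    return args

args = build_diann_args(
    exe="diann",                                             # hypothetical executable name
    raw_files=[r"C:\Tania\PSF21h.wiff"],                     # path from the log in this thread
    library=r"C:\Tania\DIA-NN\1.9\report.predicted.speclib", # path from the log in this thread
    out_report=r"C:\Tania\report.tsv",                       # hypothetical output path
    calibrated=True,
)
print(" ".join(args))
```

Peptidoform scoring and ultra-fast mode are left out on purpose, since the thread does not give their command-line equivalents.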

@TANIAKMONS
Author

Yes, Metaproteomics indeed.

[Attached screenshots: Capture2, Capture]

@vdemichev
Owner

Seems fine. I would first run with the settings I suggested and ultra-fast mode. In fact, I would also regenerate the library with precursor charges restricted to 2-3, and then run ultra-fast mode. After this works, you can explore slower (and potentially more thorough) analysis methods.
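
Why restricting precursor charges shrinks the library can be seen from a toy enumeration. This is not DIA-NN's library generator: the peptide strings are placeholders, the function name is hypothetical, and the starting charge range of 1-4 is an assumption. A real generator also applies filters such as a precursor m/z range, so the actual reduction will differ from the exact halving shown here.

```python
def enumerate_precursors(peptides, min_charge, max_charge):
    """Pair every peptide with every charge state in [min_charge, max_charge]."""
    return [
        (peptide, charge)
        for peptide in peptides
        for charge in range(min_charge, max_charge + 1)
    ]

peptides = ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC"]    # placeholder sequences

wide = enumerate_precursors(peptides, 1, 4)    # assumed starting range
narrow = enumerate_precursors(peptides, 2, 3)  # range suggested in the thread

print(len(wide), len(narrow))
```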

@TANIAKMONS
Author

Thank you Vadim, I will definitely try that.

Best,
Tania

@TANIAKMONS
Author

Hi,

Would it also speed up the process if I convert all the Sciex .wiff files to .dia prior to the analysis?

Thank you again for your help

Tania

@vdemichev
Owner

Hi Tania,

It would save ~2 min/file, based on the screenshot, so it is not worth it in this case.

Best,
Vadim
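
As a rough back-of-the-envelope check of that estimate, using only numbers posted in this thread: 10 files at ~2 min each, compared with the 415:57 that had elapsed when the library was initialised. The sketch assumes the saving applies uniformly to all files and that the log timestamp reads as minutes:seconds.

```python
files = 10                    # "10 files will be processed" in the log
saved_per_file_min = 2        # "~2min/file" estimate from the reply above
library_stage_min = 415 + 57 / 60   # "[415:57] Initialising library", read as min:sec

saved_total_min = files * saved_per_file_min
fraction = saved_total_min / library_stage_min

print(f"{saved_total_min} min saved, about {fraction:.1%} of the library stage alone")
```

The search itself comes on top of the library stage, so the relative saving over the whole run would be smaller still.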
