
Question: Calculating Coherence. What words are expected as Targets? #121

Open
hhagedorn opened this issue May 18, 2021 · 8 comments
Labels
documentation (need to improve documentation), question (Further information is requested)

Comments

@hhagedorn

Hello @bab2min,

I am trying to use your implementation of the C_v coherence measure to evaluate both topic models that are included in tomotopy and some that are not. Therefore I generated a tomotopy.utils.Corpus to initialise the Coherence class.

But I am a little confused by the targets parameter. Does it expect the whole vocabulary of the corpus (or at least the vocabulary that is relevant for coherence, e.g. all words from LDAModel.used_vocabs), or only the set of words whose coherence I want to check later (e.g. all words in the topics to be evaluated)?

I am not exactly sure how to understand the sentence "Only words that are provided as targets are included in probability estimation."

Thank you in advance!

@bab2min bab2min added the question Further information is requested label May 23, 2021
@bab2min
Owner

bab2min commented May 24, 2021

Hi @hhagedorn,
Sorry for the confusion due to the unclear documentation.
For targets, the latter is correct. In other words, you just pass the set of words from the topics to be evaluated as targets.

The reason targets is required is computational efficiency. Calculating the co-occurrence of all words from LDAModel.used_vocabs consumes a lot of time and memory. If you already know which words will be evaluated for coherence, only their co-occurrences need to be calculated instead of all of them. For this purpose, Coherence provides the targets argument.
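To make the cost argument concrete, here is a small back-of-the-envelope sketch (the vocabulary and topic sizes below are hypothetical, not taken from this thread): the number of word pairs whose co-occurrence must be tracked grows roughly quadratically with the number of tracked words, so restricting the computation to targets shrinks the pair table by orders of magnitude.

```python
# Rough cost illustration (hypothetical sizes): co-occurrence tracking
# grows quadratically with the number of tracked words.
vocab_size = 20000        # e.g. everything in LDAModel.used_vocabs
targets_size = 10 * 20    # e.g. top-10 words for each of 20 topics

pairs_full = vocab_size * (vocab_size - 1) // 2
pairs_targets = targets_size * (targets_size - 1) // 2

print(pairs_full)     # 199990000 pairs over the full vocabulary
print(pairs_targets)  # 19900 pairs over the targets only
```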

I'll add this explanation to the documentation in the next update.
Thank you for your good question!
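A minimal sketch of what this means in practice: collect the top words of the topics you want to evaluate, and pass their union as targets. The topic-word lists below are hypothetical stand-ins; the commented-out part shows roughly how it would look with a real tomotopy model (untested sketch, not an official example).

```python
# Hypothetical top words per topic, standing in for what
# LDAModel.get_topic_words(k, top_n=...) would return.
top_words_per_topic = [
    ["cat", "dog", "pet", "vet"],        # e.g. topic 0
    ["car", "engine", "wheel", "road"],  # e.g. topic 1
]

# The union of all top words is what `targets` expects.
targets = set(w for topic in top_words_per_topic for w in topic)

# With a real model it would look roughly like this (untested sketch):
# import tomotopy as tp
# mdl = tp.LDAModel.load("saved_model.bin")
# top_words = [[w for w, _ in mdl.get_topic_words(k, top_n=10)]
#              for k in range(mdl.k)]
# coh = tp.coherence.Coherence(mdl, coherence='c_v',
#                              targets={w for ws in top_words for w in ws})

print(sorted(targets))
```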

@bab2min bab2min added the documentation need to improve documentation label May 24, 2021
@benreaves

benreaves commented Jan 27, 2022

Hello @bab2min - thank you for the time you put into maintaining tomotopy!

I'm having some trouble that might be similar to @hhagedorn's: I'm calculating the c_v coherence on a model that had earlier been trained and saved to disk, like this:

import tomotopy

mdl = tomotopy.LDAModel.load("saved_model.bin")
coh = tomotopy.coherence.Coherence(mdl, coherence='c_v')

On the second line I'm not specifying a targets value, only the model. I understand it might be slow because of the large number of targets (about 20,000 unique tokens), but my concern is that it sometimes crashes or hangs, even with the same model on the same machine. If I specify u_mass, the coherence is calculated within a few minutes, but c_v stalls for hours. Sometimes it crashes with just "Killed" and sometimes I see bad_alloc, so I suppose the problem is deep inside the coherence code. I ran it under mprof (a memory profiler) and it uses only about 1.1 GB, nowhere near the memory limit. I get different behavior at different times with the same model on the same machine.

tomotopy.isa returns 'avx2', and I am using an Intel i7-11800H with Python 3.8.10 on Ubuntu 20.04 under WSL2 on Windows 11. I see similar behavior when running on GCP or AWS. What would you recommend here?

Thank you!

@bab2min
Owner

bab2min commented Feb 3, 2022

Hi @benreaves
There appear to be some bugs in the current implementation of tomotopy.coherence.
However, I could not reproduce a similar situation with my test set, so it is difficult to analyze the details.
If possible, could you please share the saved_model.bin file that causes the crashes? It would be of great help in figuring out the cause of the bug.

@benreaves

benreaves commented Feb 3, 2022 via email

@benreaves

benreaves commented Feb 3, 2022 via email

@benreaves

benreaves commented Feb 3, 2022 via email

@bab2min
Owner

bab2min commented Feb 4, 2022

@benreaves Thank you for sharing the files and details. I'll look into them!

@benreaves

benreaves commented Feb 9, 2022

This issue is no longer important. Reasons:

  1. c_npmi seems to work fine, so I can use that instead of c_v
  2. c_v should be avoided, according to this serious issue from 2018: Not being able to replicate coherence scores from paper dice-group/Palmetto#13
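For anyone landing here later: c_npmi scores each word pair by normalized pointwise mutual information. A minimal pure-Python sketch of the pairwise NPMI formula (the probabilities below are hypothetical example values, not estimates from any real corpus):

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalized PMI: log(p_ij / (p_i * p_j)) / -log(p_ij).
    Ranges from -1 (words never co-occur) to +1 (always co-occur)."""
    p_ij = max(p_ij, eps)  # guard against log(0)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

# Hypothetical estimated probabilities for two words:
print(round(npmi(0.1, 0.1, 0.05), 3))  # 0.537
```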

However, I am still having a numerical problem in add_doc(), but it belongs in a new thread: #159
