
Question: Calculating Coherence. What words are expected as Targets? #121

Open
hhagedorn opened this issue May 18, 2021 · 8 comments
Labels
documentation (need to improve documentation), question (Further information is requested)

Comments

@hhagedorn

Hello @bab2min,

I am trying to use your implementation of the C_v coherence measure to evaluate both topic models that are included in tomotopy and some that are not. Therefore I generated a tomotopy.utils.Corpus to initialise the Coherence class.

But I am a little confused by the targets parameter. Does it expect the whole vocabulary of the corpus (or at least the vocabulary that is relevant for coherence, e.g. all words from LDAModel.used_vocabs), or only the set of words whose coherence I want to check later (e.g. all words in the topics to be evaluated)?

I am not exactly sure how to understand the sentence "Only words that are provided as targets are included in probability estimation."

Thank you in advance!

@bab2min bab2min added the question Further information is requested label May 23, 2021
@bab2min
Owner

bab2min commented May 24, 2021

Hi @hhagedorn,
Sorry for the confusion due to the unclear documentation.
For targets, the latter is correct. In other words, you just pass the set of words from the topics to be evaluated as targets.

The reason targets is required is computational efficiency. Calculating the co-occurrence of all words from LDAModel.used_vocabs consumes a lot of time and memory. If you already know which words will be evaluated for coherence, only their co-occurrences need to be calculated instead of all of them. For this purpose, Coherence provides the targets argument.
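To make the cost argument concrete, here is a small back-of-the-envelope sketch (the vocabulary and topic sizes below are hypothetical, not taken from this thread): the number of word pairs whose co-occurrence must be tracked grows roughly quadratically with the number of tracked words, so restricting the computation to targets shrinks the pair table by orders of magnitude.

```python
# Rough cost illustration (hypothetical sizes): co-occurrence tracking
# grows quadratically with the number of tracked words.
vocab_size = 20000        # e.g. everything in LDAModel.used_vocabs
targets_size = 10 * 20    # e.g. top-10 words for each of 20 topics

pairs_full = vocab_size * (vocab_size - 1) // 2
pairs_targets = targets_size * (targets_size - 1) // 2

print(pairs_full)     # 199990000 pairs over the full vocabulary
print(pairs_targets)  # 19900 pairs over the targets only
```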

I'll add this explanation to the documentation in the next update.
Thank you for your good question!
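A minimal sketch of what this means in practice: collect the top words of the topics you want to evaluate, and pass their union as targets. The topic-word lists below are hypothetical stand-ins; the commented-out part shows roughly how it would look with a real tomotopy model (untested sketch, not an official example).

```python
# Hypothetical top words per topic, standing in for what
# LDAModel.get_topic_words(k, top_n=...) would return.
top_words_per_topic = [
    ["cat", "dog", "pet", "vet"],        # e.g. topic 0
    ["car", "engine", "wheel", "road"],  # e.g. topic 1
]

# The union of all top words is what `targets` expects.
targets = set(w for topic in top_words_per_topic for w in topic)

# With a real model it would look roughly like this (untested sketch):
# import tomotopy as tp
# mdl = tp.LDAModel.load("saved_model.bin")
# top_words = [[w for w, _ in mdl.get_topic_words(k, top_n=10)]
#              for k in range(mdl.k)]
# coh = tp.coherence.Coherence(mdl, coherence='c_v',
#                              targets={w for ws in top_words for w in ws})

print(sorted(targets))
```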

@bab2min bab2min added the documentation need to improve documentation label May 24, 2021
@benreaves

benreaves commented Jan 27, 2022

Hello @bab2min - thank you for the time you put into maintaining tomotopy!

I'm having some trouble that might be similar to @hhagedorn's: I'm calculating the c_v coherence on a model that had earlier been trained and saved to disk, like this:

import tomotopy

mdl = tomotopy.LDAModel.load("saved_model.bin")
coh = tomotopy.coherence.Coherence(mdl, coherence='c_v')

On the second line I'm not specifying a targets value, only the model. I understand it might be slow because of the large number of targets (about 20,000 unique tokens), but my concern is that it sometimes crashes or hangs, even with the same model on the same machine. If I specify u_mass, the coherence is calculated within a few minutes, but c_v stalls for hours. Sometimes it crashes with just "Killed" and sometimes I see bad_alloc, so I suppose the problem is deep inside the coherence code. I ran it under mprof (a memory profiler) and it uses only about 1.1 GB, nowhere near the memory limit. I get different behavior at different times with the same model on the same machine.

tomotopy.isa returns 'avx2', and I am using an Intel i7-11800H with Python 3.8.10 on Ubuntu 20.04 under WSL2 on Windows 11. I see similar behavior when running on GCP or AWS. What would you recommend here?

Thank you!

@bab2min
Owner

bab2min commented Feb 3, 2022

Hi @benreaves
There appear to be some bugs in the current implementation of tomotopy.coherence.
However, I could not reproduce a similar situation with my test set, so it is difficult to analyze the details.
If possible, could you please share the saved_model.bin file that causes the crashes? It would be of great help in figuring out the cause of the bug.

@benreaves

benreaves commented Feb 3, 2022 via email

@benreaves

benreaves commented Feb 3, 2022 via email

@benreaves

benreaves commented Feb 3, 2022 via email

@bab2min
Owner

bab2min commented Feb 4, 2022

@benreaves Thank you for sharing the files and details. I'll look into them!

@benreaves

benreaves commented Feb 9, 2022

This issue is no longer important. Reasons:

  1. c_npmi seems to work fine, so I can use that instead of c_v
  2. c_v should be avoided, according to this serious issue from 2018: Not being able to replicate coherence scores from paper dice-group/Palmetto#13
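For anyone landing here later: c_npmi scores each word pair by normalized pointwise mutual information. A minimal pure-Python sketch of the pairwise NPMI formula (the probabilities below are hypothetical example values, not estimates from any real corpus):

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalized PMI: log(p_ij / (p_i * p_j)) / -log(p_ij).
    Ranges from -1 (words never co-occur) to +1 (always co-occur)."""
    p_ij = max(p_ij, eps)  # guard against log(0)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

# Hypothetical estimated probabilities for two words:
print(round(npmi(0.1, 0.1, 0.05), 3))  # 0.537
```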

However, I am still having a numerical problem in add_doc(), but it belongs in a new thread: #159
