Question: Calculating Coherence. What words are expected as Targets? #121
Hi @hhagedorn, The reason why I'll supplement this explanation to the documentation in the next update. |
Hello @bab2min - thank you for the time you put into maintaining tomotopy! I'm having some trouble that might be similar to @hhagedorn : I'm calculating the c_v coherence on a model that had earlier been trained and saved to disk, like this:
On the second line, I'm not specifying a targets value, only the model. I understand it might be slow because of the large number of targets (about 20000 unique tokens), but my concern is that it sometimes crashes or hangs, even with the same model on the same machine. If I specify u_mass, it calculates the coherence within a few minutes, but c_v stalls for hours. Sometimes it crashes with just "Killed" and sometimes I see bad_alloc, so I suppose the problem is deep inside the coherence calculation. I ran it under mprof (memory profiler) and it uses only about 1.1GB, nowhere near the memory limit. tomotopy.isa returns 'avx2' and I am using an Intel i7-11800H, Python 3.8.10, Ubuntu 20.04 on WSL2 under Windows 11. I get similar behavior when running on GCP or AWS. What would you recommend here? Thank you! |
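If the cost really does grow with the number of target words, one thing to try is to pass only the words that will actually be scored. The following is a minimal pure-Python sketch of building such a target set; the `collect_targets` helper and the topic lists are hypothetical, and whether this avoids the crash in tomotopy is an assumption, not something confirmed in this thread.

```python
from typing import Iterable, List, Set


def collect_targets(topics: Iterable[List[str]], top_n: int = 10) -> Set[str]:
    """Return the union of the first top_n words of every topic.

    Each topic is assumed to be a list of words sorted by weight,
    most probable word first.
    """
    targets: Set[str] = set()
    for words in topics:
        targets.update(words[:top_n])
    return targets


# Hypothetical topics, used only for illustration.
topics = [
    ["model", "topic", "word", "corpus"],
    ["memory", "crash", "alloc", "model"],
]

targets = collect_targets(topics, top_n=3)
print(len(targets), sorted(targets))
```

The resulting set could then be supplied as the targets argument, so that only these words take part in the estimation instead of the full vocabulary.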
Hi @benreaves |
Yes I will send it later today. Thank you for investigating!
On Thu, Feb 3, 2022, 08:06 Minchul Lee wrote:
> Hi @benreaves
> There appear to be some bugs in the current implementation of tomotopy.coherence. However, a similar situation was not reproduced in my test set, so it is difficult to analyze details.
> If possible, can you please share the saved_model.bin file that causes crashes? It will be of great help in figuring out the cause of the bug.
|
Yes, here it is! [1]
The zip file contains
- the model file (in folder 20220126065224i0)
- coherence_later.py, which I use for calculating the coherence on saved
models (did I do it correctly?)
- results.csv containing a list of saved models. Only the first one is
included in this zipfile (but then that one does cause the hang).
[1]
https://drive.google.com/file/d/1s9WBQ_dxHV55qpy-mzSyB1tGpPuX7mhG/view?usp=sharing
Ben Reaves
|
BTW, it doesn't always give the same error: sometimes it's "bad_alloc", sometimes it just says "Killed" and exits with no traceback, and sometimes it just hangs for at least 8 hours. I really appreciate your looking into this!
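Since the failure mode varies between runs, it may help to run the calculation in a child process with a time limit so that each run is classified automatically. This is a minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper, and the code strings below are placeholders standing in for the actual coherence script.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float) -> str:
    """Run Python source in a child process and report how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child was still running at the deadline and has been killed.
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        return f"killed by signal {-proc.returncode}"
    return f"exit code {proc.returncode}"


# Placeholder workloads; replace the code string with the real script.
print(run_with_timeout("print('done')", timeout_s=10))
print(run_with_timeout("import time; time.sleep(30)", timeout_s=0.5))
```

On Linux, a process that prints "Killed" has usually received SIGKILL, which this helper would report as signal 9, while an uncaught exception such as bad_alloc surfacing in Python would show up as a non-zero exit code.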
|
@benreaves Thank you for sharing the files and details. I'll look into them! |
This issue is no longer important. Reasons:
However, I am still having a numerical problem in add_doc() but it belongs in a new thread: #159 |
Hello @bab2min,
I am trying to use your implementation of the C_v coherence measure to evaluate both topic models that are included in tomotopy and some that are not. Therefore I generated a tomotopy.utils.Corpus to initialise the Coherence class.
But I am a little confused by the targets parameter. Does it expect the whole vocabulary of the Corpus (or at least the vocabulary that is relevant for the coherence, e.g. all words from LDAModel.used_vocabs) or only a set of words that I want to later check for coherence (e.g. all words in my to-be evaluated topics)?
I am not exactly sure how to understand the sentence "Only words that are provided as targets are included in probability estimation."
Thank you already in advance!
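One possible reading of the quoted sentence is that occurrence counts are collected only for the target words, so any word to be scored later has to be among them. The toy sketch below illustrates that reading in plain Python with a simplified u_mass-style score; it is not tomotopy's implementation, and the function names and sample documents are made up for illustration.

```python
import math
from itertools import combinations


def count_targets(docs, targets):
    """Count document frequencies for target words and target pairs only."""
    single = {}
    pair = {}
    for doc in docs:
        # Words outside the target set are dropped before any counting.
        present = sorted(set(doc) & set(targets))
        for w in present:
            single[w] = single.get(w, 0) + 1
        for a, b in combinations(present, 2):
            key = frozenset((a, b))
            pair[key] = pair.get(key, 0) + 1
    return single, pair


def u_mass(words, single, pair, eps=1e-12):
    """Average log((D(wi, wj) + eps) / D(wj)) over all pairs with j < i."""
    scores = []
    for i in range(1, len(words)):
        for j in range(i):
            joint = pair.get(frozenset((words[i], words[j])), 0)
            # Raises KeyError when words[j] was never counted as a target.
            scores.append(math.log((joint + eps) / single[words[j]]))
    return sum(scores) / len(scores)


# Made-up documents; "corpus" is deliberately left out of the targets.
docs = [
    ["topic", "model", "word"],
    ["topic", "model"],
    ["word", "corpus"],
    ["model", "corpus", "topic"],
]
single, pair = count_targets(docs, targets={"topic", "model", "word"})
print(single)
print(u_mass(["topic", "model"], single, pair))
```

In this toy version, "corpus" has no counts at all, so a topic containing it cannot be scored. Under this reading, the targets would need to cover at least every word in the topics to be evaluated, and the full vocabulary would only be needed if arbitrary words might be scored later.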