Opens up a way to normalize the model in place #19
Conversation
Ping?
Right, ah ... wouldn't it be easier to have a Map<String, Double> which stores the denominator used for normalization (the Euclidean norm) and have getVector do the copy + normalize operation on the fly? I'm slightly worried about re-normalizing the same target strings over and over again, but we could add a CachedSearcherImpl which memoizes the normalized vectors via the decorator pattern, as sketched below.
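A minimal sketch of that decorator; the Searcher interface and method names here are assumed for illustration, not taken from the actual codebase:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape of the searcher interface discussed above.
interface Searcher {
  double[] getVector(String word);
}

// Decorator that memoizes normalized vectors, so the copy + normalize in
// getVector only happens once per distinct target string.
class CachedSearcherImpl implements Searcher {
  private final Searcher delegate;
  private final Map<String, double[]> cache = new ConcurrentHashMap<>();

  CachedSearcherImpl(Searcher delegate) {
    this.delegate = delegate;
  }

  @Override
  public double[] getVector(String word) {
    return cache.computeIfAbsent(word, delegate::getVector);
  }
}
```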
In our use case, we find the 100 nearest neighbors in line with the query, i.e., the UI doesn't update until this is done. For some reason, this is fast enough right now (thank you!). I'm worried about doing the normalization on the fly, though. To find nearest neighbors, we'd basically have to normalize the whole model for every query. If we cache it instead, even one nearest-neighbors search would duplicate the model in RAM, but less efficiently, because of the extra maps. There is almost nothing that beats the efficiency of a single big array.
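For a sense of why the flat array wins, here is a sketch of the layout (field names assumed):

```java
// All vectors live back to back in one array: vector i occupies
// [i * dimensions, (i + 1) * dimensions). A lookup is pure index arithmetic,
// with none of the boxing or per-entry objects a Map<String, double[]> adds.
class FlatVectors {
  final double[] vectors;   // length = vocabSize * dimensions
  final int dimensions;

  FlatVectors(double[] vectors, int dimensions) {
    this.vectors = vectors;
    this.dimensions = dimensions;
  }

  double component(int wordIndex, int d) {
    return vectors[wordIndex * dimensions + d];
  }
}
```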
Ah, right, I forgot that the nearest-neighbor search is O(n) ... Can you drop the reference to the original model object so the non-normalized values can be GC'd?
I'm happy to, but I'm not sure which reference to the original model you mean.
Ah, I meant in the existing code, before the changes you've introduced. If you write the code along these lines (a sketch; the constructor signature is assumed):
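```java
// Plausible reconstruction of the snippet being discussed (signatures assumed):
Word2VecModel model = Word2VecModel.fromThrift(thriftModel);
Searcher searcher = new SearcherImpl(model);  // SearcherImpl holds on to `model`
```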
We'd be unable to GC the intermediate Word2VecModel object created, since it's passed in the constructor to the SearcherImpl class; but if we pass in a reference to the model's vocab and vectors, then normalize and store the vectors, the model (and its vectors) should be free for GC. In general, I would prefer that Word2VecModel remain immutable if at all possible. This prevents difficulties if we later add functionality that wants to use the raw, un-normalized vectors.
The original model is already free for GC. I agree, the models should be immutable. The problem is scale.

And I guess there is a question in here for you about the overall direction of this project. I have other changes queued up that improve scaling at the expense of readability and usability. If you think of this as more of a teaching tool to learn about word embeddings, then my changes might not be a good contribution. If you're thinking of this as a Java replacement for the C version of word2vec, there is more work to be done.

There are other ways, too. For example, if the model were stored in a suitable format in a memory-mapped file, you could load models bigger than main memory, as long as your access patterns are acceptable. The models would be effectively immutable, and they would scale, but the readability of the code would suffer. I would actually prefer a solution like that, but as much as I might want to, I can't spend all my time re-implementing the …

Either way, if you think this change is too hacky, I understand.
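A minimal sketch of that memory-mapped idea, assuming the vectors sit in a file as raw doubles in native byte order (the file format and names are invented here):

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedVectors {
  // The OS pages vector data in and out on demand, so the model can be larger
  // than the heap (or even main memory) if access patterns are local enough.
  static DoubleBuffer map(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // The mapping stays valid after the channel is closed.
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
               .order(ByteOrder.nativeOrder())
               .asDoubleBuffer();
    }
  }
}
```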
Hi! Sorry for the delay again, got a bit busy ...

On master, the original model object itself can't be GC'd, since SearcherImpl retains a reference to it.

Regarding the overall direction of the project: we do intend to use this in production, but there are always tradeoffs between performance and readability/extensibility/maintainability/debuggability. I'd love to hear what other types of improvements you have in mind, though! :)

With regards to this specific problem, I do believe the current approach could be a bit more elegant. How about you subclass Word2VecModel (call it NormalizedWord2VecModel) and override the appropriate methods (forSearch, etc.)? I would have a separate static fromThrift method which does the normalization on creation. One workflow would then be (sketched below with assumed signatures):
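```java
// Sketch only; exact signatures assumed.
NormalizedWord2VecModel model = NormalizedWord2VecModel.fromThrift(thriftModel);
Searcher searcher = model.forSearch();  // searches over already-normalized vectors
```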
Thoughts?
I took your suggestion and ran with it until it turned into #22. By moving the vectors off the heap, I don't need to serialize to Thrift and de-serialize afterwards. For the brief period of time where we have both the normalized and the un-normalized model in memory, the OS will figure out whether there is space for both and, if there is not, use the swap file to buffer it. It's super fast, has the same semantics as before, and the code isn't that much more complex for it. I think I will use this trick more often from now on.
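The off-heap allocation, roughly; this is a sketch, not the actual code from #22, and the model size is an assumed example:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Vector storage outside the Java heap: the GC never scans or copies it, and
// the OS can page it to swap while the normalized and un-normalized copies
// briefly coexist.
int vocabSize = 200_000, dimensions = 300;  // assumed model size
DoubleBuffer vectors = ByteBuffer
    .allocateDirect(vocabSize * dimensions * Double.BYTES)  // one direct buffer is capped at 2 GB
    .order(ByteOrder.nativeOrder())
    .asDoubleBuffer();
```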
I'm inclined to close this, since #22 is much better.
Normalizing the model for search creates a second copy of it. This is a problem for big models that fit into memory only once. This gives you a way to normalize the model in place, so you can store it only once.
Admittedly, the API is a little clunky, but this preserves complete backwards compatibility.
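In sketch form, the in-place normalization amounts to something like this (array layout and names assumed; not the actual diff):

```java
// In-place normalization: divide each vector by its Euclidean norm without
// allocating a second copy of the model. `vectors` holds all vectors back to
// back, one every `dimensions` entries.
static void normalizeInPlace(double[] vectors, int dimensions) {
  for (int start = 0; start < vectors.length; start += dimensions) {
    double sumSq = 0;
    for (int d = 0; d < dimensions; d++) {
      sumSq += vectors[start + d] * vectors[start + d];
    }
    double norm = Math.sqrt(sumSq);
    if (norm == 0) {
      continue;  // leave an all-zero vector untouched
    }
    for (int d = 0; d < dimensions; d++) {
      vectors[start + d] /= norm;
    }
  }
}
```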