Opens up a way to normalize the model in place #19
Conversation
Ping?
Right, ah ... wouldn't it be easier to have a Map<String, Double> which stores the denominator used for normalization (the Euclidean norm) and have getVector do the copy + normalize operation on the fly? I'm slightly worried about re-normalizing the same target strings over and over again, but we could add a CachedSearcherImpl which memoizes the normalized vectors via the decorator pattern, as sketched below.
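A minimal sketch of that decorator; the Searcher interface and method names here are assumed for illustration, not taken from the actual codebase:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed shape of the searcher interface discussed above.
interface Searcher {
  double[] getVector(String word);
}

// Decorator that memoizes normalized vectors, so the copy + normalize in
// getVector only happens once per distinct target string.
class CachedSearcherImpl implements Searcher {
  private final Searcher delegate;
  private final Map<String, double[]> cache = new ConcurrentHashMap<>();

  CachedSearcherImpl(Searcher delegate) {
    this.delegate = delegate;
  }

  @Override
  public double[] getVector(String word) {
    return cache.computeIfAbsent(word, delegate::getVector);
  }
}
```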
In our use case, we find the 100 nearest neighbors in line with the query, i.e., the UI doesn't update until this is done. For some reason, this is fast enough right now (thank you!). I'm worried about doing the normalization on the fly, though. To find nearest neighbors, we'd basically have to normalize the whole model for every query. If we cache it instead, even one nearest-neighbors search would duplicate the model in RAM, but less efficiently, because of the extra maps. There is almost nothing that beats the efficiency of a single big array.
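For a sense of why the flat array wins, here is a sketch of the layout (field names assumed):

```java
// All vectors live back to back in one array: vector i occupies
// [i * dimensions, (i + 1) * dimensions). A lookup is pure index arithmetic,
// with none of the boxing or per-entry objects a Map<String, double[]> adds.
class FlatVectors {
  final double[] vectors;   // length = vocabSize * dimensions
  final int dimensions;

  FlatVectors(double[] vectors, int dimensions) {
    this.vectors = vectors;
    this.dimensions = dimensions;
  }

  double component(int wordIndex, int d) {
    return vectors[wordIndex * dimensions + d];
  }
}
```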
Ah, right, I forgot that the nearest-neighbor search is O(n) ... Can you drop the reference to the original model object so the non-normalized values can be GC'd?
I'm happy to, but I'm not sure which reference to the original model you mean.
Ah, I meant in the existing code, before the changes you've introduced. If you write the code along these lines (a sketch; the constructor signature is assumed):
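```java
// Plausible reconstruction of the snippet being discussed (signatures assumed):
Word2VecModel model = Word2VecModel.fromThrift(thriftModel);
Searcher searcher = new SearcherImpl(model);  // SearcherImpl holds on to `model`
```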
We'd be unable to GC the intermediate Word2VecModel object created, since it's passed in the constructor to the SearcherImpl class; but if we pass in a reference to the model's vocab and vectors, then normalize and store the vectors, the model (and its vectors) should be free for GC. In general, I would prefer that Word2VecModel remain immutable if at all possible. This prevents difficulties if we later add functionality that wants to use the raw, un-normalized vectors.
The original model is already free for GC. I agree, the models should be immutable. The problem is scale.

And I guess there is a question in here for you about the overall direction of this project. I have other changes queued up that improve scaling at the expense of readability and usability. If you think of this as more of a teaching tool to learn about word embeddings, then my changes might not be a good contribution. If you're thinking of this as a Java replacement for the C version of word2vec, there is more work to be done.

There are other ways, too. For example, if the model were stored in a suitable format in a memory-mapped file, you could load models bigger than main memory, as long as your access patterns are acceptable. The models would be effectively immutable, and they would scale, but the readability of the code would suffer. I would actually prefer a solution like that, but as much as I might want to, I can't spend all my time re-implementing the …

Either way, if you think this change is too hacky, I understand.
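A minimal sketch of that memory-mapped idea, assuming the vectors sit in a file as raw doubles in native byte order (the file format and names are invented here):

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedVectors {
  // The OS pages vector data in and out on demand, so the model can be larger
  // than the heap (or even main memory) if access patterns are local enough.
  static DoubleBuffer map(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // The mapping stays valid after the channel is closed.
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
               .order(ByteOrder.nativeOrder())
               .asDoubleBuffer();
    }
  }
}
```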
Hi! Sorry for the delay again, got a bit busy ...

On master, the original model object itself can't be GC'd, since SearcherImpl retains a reference to it.

Regarding the overall direction of the project: we do intend to use this in production, but there are always tradeoffs between performance and readability/extensibility/maintainability/debuggability. I'd love to hear what other types of improvements you have in mind, though! :)

With regards to this specific problem, I do believe the current approach could be a bit more elegant. How about you subclass Word2VecModel (call it NormalizedWord2VecModel) and override the appropriate methods (forSearch, etc.)? I would have a separate static fromThrift method which does the normalization on creation. One workflow would then be (sketched below with assumed signatures):
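```java
// Sketch only; exact signatures assumed.
NormalizedWord2VecModel model = NormalizedWord2VecModel.fromThrift(thriftModel);
Searcher searcher = model.forSearch();  // searches over already-normalized vectors
```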
Thoughts?
I took your suggestion and ran with it until it turned into #22. By moving the vectors off the heap, I don't need to serialize to Thrift and de-serialize afterwards. For the brief period of time where we have both the normalized and the un-normalized model in memory, the OS will figure out whether there is space for both and, if there is not, use the swap file to buffer it. It's super fast, has the same semantics as before, and the code isn't that much more complex for it. I think I will use this trick more often from now on.
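The off-heap allocation, roughly; this is a sketch, not the actual code from #22, and the model size is an assumed example:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Vector storage outside the Java heap: the GC never scans or copies it, and
// the OS can page it to swap while the normalized and un-normalized copies
// briefly coexist.
int vocabSize = 200_000, dimensions = 300;  // assumed model size
DoubleBuffer vectors = ByteBuffer
    .allocateDirect(vocabSize * dimensions * Double.BYTES)  // one direct buffer is capped at 2 GB
    .order(ByteOrder.nativeOrder())
    .asDoubleBuffer();
```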
I'm inclined to close this, since #22 is much better.
Normalizing the model for search creates a second copy of it. This is a problem for big models that fit into memory only once. This gives you a way to normalize the model in place, so you can store it only once.
Admittedly, the API is a little clunky, but this preserves complete backwards compatibility.
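In sketch form, the in-place normalization amounts to something like this (array layout and names assumed; not the actual diff):

```java
// In-place normalization: divide each vector by its Euclidean norm without
// allocating a second copy of the model. `vectors` holds all vectors back to
// back, one every `dimensions` entries.
static void normalizeInPlace(double[] vectors, int dimensions) {
  for (int start = 0; start < vectors.length; start += dimensions) {
    double sumSq = 0;
    for (int d = 0; d < dimensions; d++) {
      sumSq += vectors[start + d] * vectors[start + d];
    }
    double norm = Math.sqrt(sumSq);
    if (norm == 0) {
      continue;  // leave an all-zero vector untouched
    }
    for (int d = 0; d < dimensions; d++) {
      vectors[start + d] /= norm;
    }
  }
}
```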