Skip to content

Commit

Permalink
[SPARK-50673][ML] Avoid traversing model coefficients twice in `Word2…
Browse files Browse the repository at this point in the history
…VecModel` constructor

### What changes were proposed in this pull request?
Avoid traversing model twice in `Word2VecModel` constructor

### Why are the changes needed?
In public constructor `def this(model: Map[String, Array[Float]])`,
the implementation traverses the model coefficients (which is a Map) twice to build the `Word2VecModel`, for `wordIndex` and `wordVectors`, respectively.

I am not sure whether it might be problematic, since the two traversals needs the same ordering.

Generating the result with single pass can guarantee the correctness, anyway.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#49298 from zhengruifeng/ml_w2v_build.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
  • Loading branch information
zhengruifeng committed Dec 27, 2024
1 parent c920210 commit 94f9bb0
Showing 1 changed file with 19 additions and 12 deletions.
31 changes: 19 additions & 12 deletions mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
Original file line number Diff line number Diff line change
Expand Up @@ -520,9 +520,15 @@ class Word2VecModel private[spark] (
}
}

// Auxiliary constructor must begin with call to 'this'.
// Helper constructor for `def this(model: Map[String, Array[Float]])`.
private def this(model: (Map[String, Int], Array[Float])) = {
this(model._1, model._2)
}

@Since("1.5.0")
def this(model: Map[String, Array[Float]]) = {
this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
this(Word2VecModel.buildFromVecMap(model))
}

@Since("1.4.0")
Expand Down Expand Up @@ -642,21 +648,22 @@ class Word2VecModel private[spark] (
@Since("1.4.0")
object Word2VecModel extends Loader[Word2VecModel] {

private def buildWordIndex(model: Map[String, Array[Float]]): Map[String, Int] = {
CUtils.toMapWithIndex(model.keys)
}

private def buildWordVectors(model: Map[String, Array[Float]]): Array[Float] = {
private def buildFromVecMap(
model: Map[String, Array[Float]]): (Map[String, Int], Array[Float]) = {
require(model.nonEmpty, "Word2VecMap should be non-empty")

val (vectorSize, numWords) = (model.head._2.length, model.size)
val wordList = model.keys.toArray
val wordVectors = new Array[Float](vectorSize * numWords)
var i = 0
while (i < numWords) {
Array.copy(model(wordList(i)), 0, wordVectors, i * vectorSize, vectorSize)
i += 1

val wordIndex = collection.immutable.Map.newBuilder[String, Int]
wordIndex.sizeHint(numWords)

model.iterator.zipWithIndex.foreach {
case ((word, vector), i) =>
wordIndex += ((word, i))
Array.copy(vector, 0, wordVectors, i * vectorSize, vectorSize)
}
wordVectors
(wordIndex.result(), wordVectors)
}

private object SaveLoadV1_0 {
Expand Down

0 comments on commit 94f9bb0

Please sign in to comment.