
prepare_lm_training_data.py - how to use dataset with more than 2147483647 words for RNN LM training #1322

Closed
TrMaXXX opened this issue Jan 26, 2025 · 1 comment

Comments

TrMaXXX commented Jan 26, 2025

I'm trying to train an RNN LM and have run into a problem: the data preparation script only supports datasets with fewer than 2147483647 words. The limitation comes from the fact that it uses Array1 when creating the RaggedTensor for sentences, and Array1 has a maximum length of int32_t. Is there a way to work around this limitation?

The error message is attached below:

  • prepare_lm_training_data.py:130 sentences = k2.ragged.RaggedTensor(sentences)
  • [F] /var/www/k2/csrc/array.h:501:void k2::Array1<T>::Init(k2::ContextPtr, int32_t, k2::Dtype) [with T = int; k2::ContextPtr = std::shared_ptr<k2::Context>; int32_t = int] Check failed: size >= 0 (-2047483664 vs. 0) Array size MUST be greater than or equal to 0, given :-2047483664
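For reference, the negative size in the error is consistent with a 32-bit signed overflow: a total element count of roughly 2,247,483,632 exceeds INT32_MAX (2,147,483,647) and wraps around to -2,047,483,664. A minimal sketch of that arithmetic (the total element count is an assumption inferred from the reported value, not measured from the dataset):

```python
# Illustration only: show how a count above INT32_MAX wraps to the negative
# size reported by k2. The total below is inferred from the error message.
INT32_MAX = 2**31 - 1
total_elements = 2_247_483_632                         # assumed corpus size (> INT32_MAX)
wrapped = ((total_elements + 2**31) % 2**32) - 2**31   # two's-complement int32 wraparound
print(wrapped)                                         # -2047483664, matching "Check failed: size >= 0"
```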
@csukuangfj
Collaborator

Please split your dataset into smaller parts, where each part contains fewer than 2147483647 words.
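A rough sketch of one way to do that split, assuming the corpus is a plain-text file with one sentence per line; the file names, the helper function, and the safety margin below the int32 limit are illustrative, not part of prepare_lm_training_data.py:

```python
# Hypothetical helper: split a one-sentence-per-line corpus into parts,
# each containing fewer than 2147483647 words, so that
# prepare_lm_training_data.py can be run on each part separately.
MAX_WORDS = 2_000_000_000  # stay comfortably below the int32_t limit

def split_corpus(path: str, prefix: str = "part") -> None:
    part, words_in_part = 0, 0
    out = open(f"{prefix}-0.txt", "w")
    with open(path) as f:
        for line in f:
            n = len(line.split())
            # Start a new output file before exceeding the word budget.
            if words_in_part + n > MAX_WORDS:
                out.close()
                part += 1
                words_in_part = 0
                out = open(f"{prefix}-{part}.txt", "w")
            out.write(line)
            words_in_part += n
    out.close()

split_corpus("corpus.txt")  # produces part-0.txt, part-1.txt, ...
```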
