
prepare_lm_training_data.py - how to use dataset with more than 2147483647 words for RNN LM training #1322

Closed
TrMaXXX opened this issue Jan 26, 2025 · 1 comment

Comments

TrMaXXX commented Jan 26, 2025

I'm trying to train an RNN LM and have run into a problem: the data preparation script only supports datasets with fewer than 2147483647 words. The limitation comes from the fact that it uses Array1 when creating the RaggedTensor for sentences, and Array1 has a maximum length of int32_t. Is there a way to work around this limitation?

The error message is attached below:

  • prepare_lm_training_data.py:130 sentences = k2.ragged.RaggedTensor(sentences)
  • [F] /var/www/k2/csrc/array.h:501:void k2::Array1<T>::Init(k2::ContextPtr, int32_t, k2::Dtype) [with T = int; k2::ContextPtr = std::shared_ptr<k2::Context>; int32_t = int] Check failed: size >= 0 (-2047483664 vs. 0) Array size MUST be greater than or equal to 0, given :-2047483664
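For reference, the negative size in the error is consistent with a 32-bit signed overflow: a total element count of roughly 2,247,483,632 exceeds INT32_MAX (2,147,483,647) and wraps around to -2,047,483,664. A minimal sketch of that arithmetic (the total element count is an assumption inferred from the reported value, not measured from the dataset):

```python
# Illustration only: show how a count above INT32_MAX wraps to the negative
# size reported by k2. The total below is inferred from the error message.
INT32_MAX = 2**31 - 1
total_elements = 2_247_483_632                         # assumed corpus size (> INT32_MAX)
wrapped = ((total_elements + 2**31) % 2**32) - 2**31   # two's-complement int32 wraparound
print(wrapped)                                         # -2047483664, matching "Check failed: size >= 0"
```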
@csukuangfj
Collaborator

Please split your dataset into smaller parts, where each part contains fewer than 2147483647 words.
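A rough sketch of one way to do that split, assuming the corpus is a plain-text file with one sentence per line; the file names, the helper function, and the safety margin below the int32 limit are illustrative, not part of prepare_lm_training_data.py:

```python
# Hypothetical helper: split a one-sentence-per-line corpus into parts,
# each containing fewer than 2147483647 words, so that
# prepare_lm_training_data.py can be run on each part separately.
MAX_WORDS = 2_000_000_000  # stay comfortably below the int32_t limit

def split_corpus(path: str, prefix: str = "part") -> None:
    part, words_in_part = 0, 0
    out = open(f"{prefix}-0.txt", "w")
    with open(path) as f:
        for line in f:
            n = len(line.split())
            # Start a new output file before exceeding the word budget.
            if words_in_part + n > MAX_WORDS:
                out.close()
                part += 1
                words_in_part = 0
                out = open(f"{prefix}-{part}.txt", "w")
            out.write(line)
            words_in_part += n
    out.close()

split_corpus("corpus.txt")  # produces part-0.txt, part-1.txt, ...
```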
