Add BERTIN blog post #166
base: main
Conversation
Co-authored-by: Javier de la Rosa <[email protected]>
…ragraph before methodology.
… discussion and improved discusion section
BERTIN blog post
Thanks for this! I did an initial pass with some minor suggestions. I'll need to do a more careful reading.
My overall feedback is:
- The article is great and very informative
- Remember this is a blog post, not a paper, so use things such as lists, interactive stuff (you can embed html), etc.
- Read things from the perspective of a reader, not from your own. Is X something a reader would care about? Is this too redundant with the model card? Is this too much detail?
FYI @lvwerra this blog post might be interesting to you
bertin.md (outdated)

> Each BERTIN model was trained in under a week on a Google Cloud TPUv3-8 using publicly available data. Our results show state-of-the-art performance in multiple downstream tasks, and overall figures are comparable to models trained on supercomputers using large private datasets.
>
> Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
I would link "perplexity-based sampling strategies" to a useful resource where readers can learn more about it.
Since you already mentioned the time constraint earlier, I don't think we need to mention it again 😊

Suggested change:

- Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
+ We explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
@osanseviero we could link to this paper, which is where we got the idea from, but we already link it further down in the perplexity sampling section.
Or we could add a link to wikipedia maybe?
Let me know what you think
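For readers unfamiliar with the idea, perplexity-based sampling can be sketched roughly as follows. This is an illustrative toy, not the exact function used for BERTIN: the Gaussian weighting and its `mu`/`sigma` parameters are hypothetical, and in the real pipeline the perplexity scores would come from a language model scored over each mC4-es document.

```python
import math
import random

def gaussian_keep_prob(perplexity, mu=350.0, sigma=150.0):
    """Illustrative Gaussian weighting: favour documents whose perplexity
    is close to a target value mu, down-weighting both extremes.
    The parameters here are made up for the example."""
    return math.exp(-((perplexity - mu) ** 2) / (2 * sigma ** 2))

def sample_by_perplexity(docs, perplexities, keep_prob=gaussian_keep_prob):
    """Keep each document with a probability derived from its perplexity."""
    return [d for d, p in zip(docs, perplexities) if random.random() < keep_prob(p)]
```

In this sketch, a document with perplexity near the target is almost always kept, while one with an extreme score is almost always dropped, which is the basic mechanism that lets a smaller, better-distributed subset be selected from the full corpus.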
bertin.md

> <caption>Figure 6. Experimental perplexity distribution of the sampled mC4-es after applying Random sampling.</caption>
> </figure>
>
> In order to rule out the possibility of perplexity sampling filtering out relevant subsets of the dataset, such as documents relating to certain topics, we visually explored potential correlations between semantics and perplexity. The interactive visualization was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 mC4-es examples, then t-SNE was used for dimensionality reduction to a 2D space. The visualization showed a seemingly uniform distribution of perplexity across the different semantic clusters (each example is colored based on its perplexity). This is important since, in principle, perplexity sampling could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The visualization can be found [here](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html).
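As a rough sketch of the visualization pipeline described in that paragraph: random vectors stand in here for the real sentence embeddings from distiluse-base-multilingual-cased-v1, and the sample is far smaller than the 20,000 documents used in the post, but the reduction-and-colour step is the same shape.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for real embeddings: the blog post embeds 20,000 mC4-es
# documents with a multilingual sentence encoder; we fake a tiny sample.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))
perplexities = rng.uniform(20, 1000, size=100)  # one score per document

# Reduce to 2D; each point would then be coloured by its perplexity
# to check visually whether perplexity clusters by topic.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

If perplexity were correlated with topic, the colours would concentrate in particular clusters of the 2D map; a uniform spread of colour across clusters is the reassuring outcome the post reports.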
FYI we could have the embedding in a Space if you want. You can use static files if you don't want to use streamlit/gradio
thanks Omar, I'll give it a go! I tried embedding it in the markdown without much success 😢 but a Space sounds like a good alternative
Wow, great job on such a thorough blog post! I love that you are able to get really good results while doing it with less time and computing resources. 🎉
I agree with @osanseviero, and I think it reads like a paper in some places, especially the Methodology and Results sections. This may intimidate some readers who were expecting a brief and casual read. Try to include only the most relevant/impactful details as they relate to the results of your project.
Co-authored-by: Omar Sanseviero <[email protected]>
… Spanish Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Hi @edugp. Is this article ready for a second review, or were there some missing pieces?
Hey @osanseviero! Sorry, I've left this a bit neglected - I still need to finish a couple of things. I've got a long flight on Saturday, so I'm hoping I can wrap it up then! I'll ping you when it's ready.
This article looks very interesting. I just have some high-level feedback at this point. Use what you think makes sense and ignore the rest :)
- Where does the name BERTIN come from? Is it obvious and I missed it? :P
- Instead of giving a historical introduction (how the project evolved) at the beginning, what do you think about pitching the highlights (what the project achieved)? Along the lines of "During the Flax event we trained a model on X that achieved Y with Z resources. In this article we show how we did this." I think this could help catch the interest of the reader to continue reading the rest.
- Two models I'd consider adding to the comparison are XLM-RoBERTa and RemBERT. I think for multilingual applications XLM-R has been the standard for a while and RemBERT is a recent addition that seems to surpass it. I know this is extra work but maybe just mentioning those would be good.
- You mention the available resources at the very end but I think it would be worth mentioning at the beginning. E.g. when you discuss the training procedures it would be useful for a reader to know how long/expensive one training experiment is.
- I would remove the organisational challenges at the beginning of lessons and next steps - it sounds a bit apologetic to me.
Add a new blog post for BERTIN, a series of Spanish RoBERTa models pre-trained during the JAX/Flax community event.