Add BERTIN blog post #166

Open · wants to merge 56 commits into main
Conversation

@edugp commented Oct 19, 2021

Add a new blog post for BERTIN, a series of Spanish RoBERTa models pre-trained during the JAX/Flax community event.

edugp and others added 30 commits August 15, 2021 18:19
Co-authored-by: Javier de la Rosa <[email protected]>
@osanseviero requested a review from stevhliu October 19, 2021 20:04
@osanseviero (Contributor) left a comment

Thanks for this! I did an initial pass with some minor suggestions. I'll need to do a more careful reading.

My overall feedback is:

  • The article is great and very informative
  • Remember this is a blog post, not a paper, so use things such as lists, interactive elements (you can embed HTML), etc.
  • Read it from a reader's perspective, not your own. Is X something a reader would care about? Is this too redundant with the model card? Is this too much detail?

FYI @lvwerra this blog post might be interesting to you

bertin.md Outdated

Each BERTIN model was trained in under a week on a Google Cloud TPUv3-8 using publicly available data. Our results show state-of-the-art performance in multiple downstream tasks, and overall figures are comparable to models trained on supercomputers using large private datasets.

Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
Contributor

I would link "perplexity-based sampling strategies" to a useful resource where readers can learn more about it.

Member

Since you already mentioned the time constraint earlier, I don't think we need to mention it again 😊

Suggested change
Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
We explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.

Author

@osanseviero we could link to this paper, which is where we got the idea from, but we already link it further down in the perplexity sampling section.
Or we could add a link to wikipedia maybe?
Let me know what you think
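For readers landing on this thread without the paper, here is a minimal sketch of one way perplexity-based sampling can work. This is illustrative, not the exact BERTIN implementation: the Gaussian weighting function, its parameters, and the toy corpus below are all assumptions chosen to show the idea of keeping mid-perplexity documents and downsampling outliers.

```python
import numpy as np

def gaussian_keep_prob(perplexity, center, width, factor=1.0):
    """Weight documents by how close their perplexity is to a target value.

    Documents whose perplexity is near `center` are kept with probability
    close to `factor`; outliers (e.g. boilerplate or noisy text) get a
    keep-probability close to zero.
    """
    return factor * np.exp(-((perplexity - center) ** 2) / (2 * width ** 2))

def perplexity_sample(docs, perplexities, center, width, rng):
    """Subsample `docs` according to the Gaussian weighting above."""
    probs = gaussian_keep_prob(np.asarray(perplexities, dtype=float), center, width)
    keep = rng.random(len(docs)) < probs
    return [d for d, k in zip(docs, keep) if k]

rng = np.random.default_rng(0)
# Toy corpus: most perplexities cluster around 100, plus extreme outliers.
ppl = np.concatenate([rng.normal(100, 20, 1000), rng.uniform(500, 1000, 100)])
docs = [f"doc-{i}" for i in range(len(ppl))]
sampled = perplexity_sample(docs, ppl, center=100, width=30, rng=rng)
```

In a real pipeline the perplexities would come from a language model scored over each document, and the weighting function and its parameters are exactly the design choices the blog post explores.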

<caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
</figure>

In order to rule out the possibility of perplexity sampling filtering out relevant subsets of the dataset, such as documents relating to certain topics, we visually explored potential correlations between semantics and perplexity. The interactive visualization was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 mC4-es examples, then t-SNE was used for dimensionality reduction to a 2D space. The visualization showed a seemingly uniform distribution of perplexity across the different semantic clusters (each example is colored based on its perplexity). This is important since, in principle, perplexity sampling could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The visualization can be found [here](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html).
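As a rough sketch of that visualization pipeline: embed the examples, reduce to 2D with t-SNE, then color each point by its perplexity. To keep this runnable without downloading the sentence-transformers model, random vectors stand in for the real distiluse embeddings (512-dimensional, like the actual model's output); everything else is the same shape of computation the paragraph describes.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Stand-in for sentence embeddings: in the post these come from
# sentence-transformers' distiluse-base-multilingual-cased-v1.
# Random vectors are used here only to keep the sketch self-contained.
embeddings = rng.normal(size=(200, 512))
perplexities = rng.lognormal(mean=4.5, sigma=0.5, size=200)

# Reduce to 2D for plotting; each point would then be colored by its
# perplexity to check for correlations between semantics and perplexity.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
```

From here, a scatter plot of `coords` colored by `perplexities` (e.g. with matplotlib) reproduces the kind of figure linked above.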
Contributor

FYI, we could host the embedding in a Space if you want. You can use static files if you don't want to use Streamlit/Gradio.

Author

Thanks Omar, I'll give it a go! I tried embedding it in the Markdown without much success 😢 but a Space sounds like a good alternative.

@stevhliu (Member) left a comment

Wow, great job on such a thorough blog post! I love that you are able to get really good results while doing it with less time and computing resources. 🎉

I agree with @osanseviero, and I think it reads like a paper in some places, especially the Methodology and Results section. This may intimidate some readers who were expecting a brief and casual read. Try to include only the most relevant/impactful details as it relates to the results of your project.


edugp and others added 24 commits November 1, 2021 22:22
Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
@osanseviero (Contributor)

Hi @edugp. Is this article ready for a second review, or are there still some missing pieces?

@edugp (Author) commented Nov 9, 2021

Hey @osanseviero! Sorry, I've left this a bit abandoned; I still need to finish a couple of things. I have a long flight on Saturday, so I'm hoping I can wrap it up then! I'll ping you when it's ready.

@lvwerra (Member) left a comment

This article looks very interesting. I just have some high-level feedback at this point. Use what you think makes sense and ignore the rest :)

  • Where does the name BERTIN come from? Is it obvious and I missed it? :P
  • Instead of giving a historical introduction (this is how the project evolved) at the beginning, what do you think about pitching the highlights (what the project achieved)? Along the lines of "During the Flax event we trained a model on X that achieved Y with Z resources. In this article we show how we did it." I think this could help catch the reader's interest in reading the rest.
  • Two models I'd consider adding to the comparison are XLM-RoBERTa and RemBERT. I think for multilingual applications XLM-R has been the standard for a while and RemBERT is a recent addition that seems to surpass it. I know this is extra work but maybe just mentioning those would be good.
  • You mention the available resources at the very end but I think it would be worth mentioning at the beginning. E.g. when you discuss the training procedures it would be useful for a reader to know how long/expensive one training experiment is.
  • I would remove the organisational challenges at the beginning of lessons and next steps - it sounds a bit apologetic to me.
