Add BERTIN blog post #166
base: main
Conversation
Co-authored-by: Javier de la Rosa <[email protected]>
…ragraph before methodology.
… discussion and improved discusion section
BERTIN blog post
Thanks for this! I did an initial pass with some minor suggestions. I'll need to do a more careful reading.
My overall feedback is:
- The article is great and very informative
- Remember this is a blog post, not a paper, so use things such as lists, interactive stuff (you can embed html), etc.
- Read things from the perspective of a reader, not from your own. Is X something a reader would care about? Is this too redundant with the model card? Is this too much detail?
FYI @lvwerra this blog post might be interesting to you
bertin.md (outdated)

> Each BERTIN model was trained in under a week on a Google Cloud TPUv3-8 using publicly available data. Our results show state-of-the-art performance in multiple downstream tasks, and overall figures are comparable to models trained on supercomputers using large private datasets.
>
> Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
I would link "perplexity-based sampling strategies" to a useful resource where readers can learn more about it.
Since you already mentioned the time constraint earlier, I don't think we need to mention it again 😊

Suggested change:

- Training a competitive large language model within the time frame of the event proved to be a challenging task since the early stage of the project. For this reason, we explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
+ We explored several perplexity-based sampling strategies that allowed us to train models efficiently under limited compute and time resources. We are very excited to introduce our methodology and learnings with the hope to empower small teams to train competitive language models on a budget.
@osanseviero we could link to this paper, which is where we got the idea from, but we already link it further down in the perplexity sampling section.
Or we could add a link to wikipedia maybe?
Let me know what you think
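For readers unfamiliar with the idea, perplexity-based sampling can be sketched roughly as follows. This is an illustrative toy, not the exact function used for BERTIN: the Gaussian weighting and its `mu`/`sigma` parameters are hypothetical, and in the real pipeline the perplexity scores would come from a language model scored over each mC4-es document.

```python
import math
import random

def gaussian_keep_prob(perplexity, mu=350.0, sigma=150.0):
    """Illustrative Gaussian weighting: favour documents whose perplexity
    is close to a target value mu, down-weighting both extremes.
    The parameters here are made up for the example."""
    return math.exp(-((perplexity - mu) ** 2) / (2 * sigma ** 2))

def sample_by_perplexity(docs, perplexities, keep_prob=gaussian_keep_prob):
    """Keep each document with a probability derived from its perplexity."""
    return [d for d, p in zip(docs, perplexities) if random.random() < keep_prob(p)]
```

In this sketch, a document with perplexity near the target is almost always kept, while one with an extreme score is almost always dropped, which is the basic mechanism that lets a smaller, better-distributed subset be selected from the full corpus.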
bertin.md

> <caption>Figure 6. Experimental perplexity distribution of the sampled mC4-es after applying Random sampling.</caption>
> </figure>
>
> In order to rule out the possibility of perplexity sampling filtering out relevant subsets of the dataset, such as documents relating to certain topics, we visually explored potential correlations between semantics and perplexity. The interactive visualization was generated using [a distilled version of multilingual USE](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) to embed a random subset of 20,000 mC4-es examples, then t-SNE was used for dimensionality reduction to a 2D space. The visualization showed a seemingly uniform distribution of perplexity across the different semantic clusters (each example is colored based on its perplexity). This is important since, in principle, perplexity sampling could introduce undesired biases if perplexity happens to be correlated to some other quality of our data. The visualization can be found [here](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/raw/main/images/perplexity_colored_embeddings.html).
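As a rough sketch of the visualization pipeline described in that paragraph: random vectors stand in here for the real sentence embeddings from distiluse-base-multilingual-cased-v1, and the sample is far smaller than the 20,000 documents used in the post, but the reduction-and-colour step is the same shape.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for real embeddings: the blog post embeds 20,000 mC4-es
# documents with a multilingual sentence encoder; we fake a tiny sample.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))
perplexities = rng.uniform(20, 1000, size=100)  # one score per document

# Reduce to 2D; each point would then be coloured by its perplexity
# to check visually whether perplexity clusters by topic.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

If perplexity were correlated with topic, the colours would concentrate in particular clusters of the 2D map; a uniform spread of colour across clusters is the reassuring outcome the post reports.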
FYI we could have the embedding in a Space if you want. You can use static files if you don't want to use streamlit/gradio
thanks Omar, I'll give it a go! I tried embedding it in the markdown without much success 😢 but a Space sounds like a good alternative
Wow, great job on such a thorough blog post! I love that you are able to get really good results while doing it with less time and computing resources. 🎉
I agree with @osanseviero, and I think it reads like a paper in some places, especially the Methodology and Results sections. This may intimidate some readers who were expecting a brief and casual read. Try to include only the most relevant/impactful details as they relate to the results of your project.
Co-authored-by: Omar Sanseviero <[email protected]>
… Spanish Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Hi @edugp. Is this article ready for a second review, or were there some missing pieces?
Hey @osanseviero! Sorry, I've left this a bit neglected - I still need to finish a couple of things. I've got a long flight on Saturday, so I'm hoping I can wrap it up then! I'll ping you when it's ready.
This article looks very interesting. I just have some high-level feedback at this point. Use what you think makes sense and ignore the rest :)
- Where does the name BERTIN come from? Is it obvious and I missed it? :P
- Instead of giving a historical introduction (how the project evolved) at the beginning, what do you think about pitching the highlights (what the project achieved)? Along the lines of "During the Flax event we trained a model on X that achieved Y with Z resources. In this article we show how we did this." I think this could help catch the interest of the reader to continue reading the rest.
- Two models I'd consider adding to the comparison are XLM-RoBERTa and RemBERT. I think for multilingual applications XLM-R has been the standard for a while and RemBERT is a recent addition that seems to surpass it. I know this is extra work but maybe just mentioning those would be good.
- You mention the available resources at the very end but I think it would be worth mentioning at the beginning. E.g. when you discuss the training procedures it would be useful for a reader to know how long/expensive one training experiment is.
- I would remove the organisational challenges at the beginning of lessons and next steps - it sounds a bit apologetic to me.
Add a new blog post for BERTIN, a series of Spanish RoBERTa models pre-trained during the JAX/Flax community event.