forked from rspeer/wordfreq
Showing 2 changed files with 93 additions and 0 deletions.
# Why wordfreq will not be updated

The wordfreq data is a snapshot of language that could be found in various
online sources up through 2021. There are several reasons why it will not be
updated anymore.


## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language usage by
humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
large is full of slop generated by large language models, written by no one to
communicate nothing. Including this slop in the data skews the word
frequencies.

Sure, there was spam in the wordfreq data sources, but it was manageable and
often identifiable. Large language models generate text that masquerades as
real language with intention behind it, even though there is none, and their
output crops up everywhere.

As one example, [Philip Shapira
reports](https://pshapira.net/2024/03/31/delving-into-delve/) that ChatGPT
(OpenAI's popular brand of language model circa 2024) is obsessed with the word
"delve" in a way that people never have been, and has caused its overall
frequency to increase by an order of magnitude.
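
To make "an order of magnitude" concrete, here is a minimal sketch with
invented counts. It is not wordfreq's pipeline: the two toy corpora, the
`relative_frequency` and `zipf` helper names, and the equal mixing ratio are
all assumptions made for illustration. The only borrowed idea is the Zipf
scale, the base-10 logarithm of a word's occurrences per billion words, on
which a tenfold change in frequency is a shift of 1.

```python
import math
from collections import Counter


def relative_frequency(tokens, word):
    """Return the share of all tokens that are `word` (hypothetical helper)."""
    return Counter(tokens)[word] / len(tokens)


def zipf(frequency):
    """Convert a relative frequency to the Zipf scale (log10 per billion words)."""
    return math.log10(frequency * 1e9)


# Invented toy corpora; the counts are assumptions, not measurements.
human_text = ["delve"] * 1 + ["other"] * 9_999        # "delve" is 1 token in 10,000
generated_text = ["delve"] * 19 + ["other"] * 9_981   # "delve" is 19 tokens in 10,000

before = relative_frequency(human_text, "delve")
# Assume the generated text is mixed into the corpus at equal volume.
after = relative_frequency(human_text + generated_text, "delve")

print(f"ratio: {after / before:.1f}")
print(f"Zipf shift: {zipf(after) - zipf(before):.2f}")
```

With these made-up numbers, a word that people use once in 10,000 tokens ends
up about ten times as frequent in the combined corpus, one full step up the
Zipf scale, even though no person used it any more often.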


## Information that used to be free became expensive

wordfreq is not just concerned with formal printed words. It collected more
conversational language usage from two sources in particular: Twitter and
Reddit.

The Twitter data was always built on sand. Even when Twitter allowed free
access to a portion of their "firehose", the terms of use did not allow me to
distribute that data outside of the company where I collected it (Luminoso).
wordfreq has the frequencies that were built with that data as input, but the
collected data didn't belong to me and I don't have it anymore.

Now Twitter is gone anyway, its public APIs have shut down, and the site has
been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
called X. Even if X made its raw data feed available (which it doesn't), there
would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell their
archives at a price that only OpenAI will pay.

And given what's happening to the field, I don't blame them.


## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing corpus
linguistics in a way that could also benefit natural language processing tools.

The field I know as "natural language processing" is hard to find these days.
It's all being devoured by generative AI. Other techniques still exist but
generative AI sucks up all the air in the room and gets all the money. It's
rare to see NLP research that doesn't have a dependency on closed data
controlled by OpenAI and Google, two companies that I already despise.

I don't want to work on anything that could be confused with generative AI,
or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a
very high price for it, and I hope they're constantly cursing the mess that
they made themselves.

— Robyn Speer