If Google execs believe that AIs trained on the public Web are the future of Google, this paper basically argues that those AIs, and by extension Google's future, are unethical and probably can't be fixed at any reasonable cost.
See also "The Slodderwetenschap (Sloppy Science) of Stochastic Parrots – A Plea for Science to NOT take the Route Advocated by Gebru and Bender" by Michael Lissack.

https://arxiv.org/ftp/arxiv/papers/2101/2101.10098.pdf

I found this a reasonable critique of the original, despite apparent TOS violations by Lissack leading to his Twitter account being locked.
The paper mentions "... similar to the ones used in GPT-2's training data, i.e. documents linked to from Reddit [25], plus Wikipedia and a collection of books". Does anyone know what collection of books they are talking about?

I tried following the chain of references but ended up at a paywalled source. Is it based on Project Gutenberg? Also, does Google train its models on the contents of all the books it scanned for Google Books, or is it not allowed to because of copyright issues?
From the authors:

    Shmargaret Shmitchell
    shmargaret.shmitchell@gmail.com
    The Aether

Is this some meta joke or a reference to anything?
Apart from the external dangers described (social, environmental), which I'm sure many will disagree with on multiple grounds, the article in general raises some very good points about the internal dangers these models pose to the field of NLP itself:

> The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model). Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.

> However, from the perspective of work on language technology, it is far from clear that all of the effort being put into using large LMs to 'beat' tasks designed to test natural language understanding, and all of the effort to create new such tasks, once the existing ones have been bulldozed by the LMs, brings us any closer to long-term goals of general language understanding systems. If a large LM, endowed with hundreds of billions of parameters and trained on a very large dataset, can manipulate linguistic form well enough to cheat its way through tests meant to require language understanding, have we learned anything of value about how to build machine language understanding or have we been led down the garden path?
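To make the "stochastic parrot" framing concrete, here's a tiny sketch of form-only generation (my own illustration, not anything from the paper): a bigram model that stitches tokens together using nothing but co-occurrence counts. The corpus, the parrot() function, and its parameters are all made up for the example; real LMs condition on far longer contexts with billions of neural parameters, but the form-vs-meaning point is the same.

    import random
    from collections import defaultdict, Counter

    # Toy corpus standing in for "vast training data" -- purely illustrative.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Surface statistics only: count which token follows which.
    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def parrot(start="the", length=8, seed=0):
        """Stitch tokens together by sampling each next token in
        proportion to how often it followed the previous one."""
        rng = random.Random(seed)
        out = [start]
        for _ in range(length - 1):
            followers = bigram_counts.get(out[-1])
            if not followers:
                break
            tokens, counts = zip(*followers.items())
            out.append(rng.choices(tokens, weights=counts, k=1)[0])
        return " ".join(out)

    print(parrot())  # locally fluent-looking output, but no meaning behind it

Nothing in that loop represents what "cat" or "mat" refers to; it only reproduces probabilistic information about how forms combine, which is exactly the authors' point about where the appearance of understanding comes from.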
This is the paper surrounding Timnit's "departure" from Google.

If you're on Timnit's side, "departure" means "firing", and the paper is the reason she was fired.

If you're on Google's side, "departure" means "mutually agreeable resignation", prompted by Timnit's melodramatic and unprofessional response to normal feedback.

Personally, I don't see anything in this paper that implicates Google or that it would be reasonable for Google to try to suppress, so I'm falling into the camp of trusting Google's side of the story. But who knows?
Reading it, I thought: if language models can be too big, that could be a problem for Google, given that at least one of their major competitive advantages is being able to have the biggest language models there are.

Although I don't really know whether that's so (about the competitive advantage), it certainly seems like something Google might think, judging from what I remember of earlier Google arguments about automated translation.