Posted: 2/20/2020
I have blogged here repeatedly about how ridiculous international copyright laws are in granting protection that lasts at least 50 years after the death of the author. Some national copyright laws go significantly beyond 50 years.
Google recently created a gigantic new text dataset for natural language processing called PG-19 (named after Project Gutenberg and 1919). The source for this book-level dataset is Project Gutenberg.
However, Google deemed it necessary to include only books published before 1919 in this dataset to avoid conflicts with international copyright! This is outrageous!
“We select Project Gutenberg books which were published over 100 years old, i.e. before 1919 (hence the name PG-19) to avoid complications with international copyright, …” (S1)
Thus, Google was limited to including only “28,752 books” (S1); the language in these selected books is at least 100 years old and may be antiquated.
This restriction presents a serious impediment to AI research and innovation!
P.S. Google also opted to apply politically correct, Orwellian censorship to this dataset, which I covered in a previous blog post.
Sources (S):