Correction (2/17/2020): Mea culpa! It was Google DeepMind, not Microsoft as stated in my original post!
Google recently released a huge text dataset for computational linguistics/machine learning called PG19. It contains the text of almost 29,000 books published before 1919 (presumably to avoid copyright issues).
Google decided to cleanse the content of those old books by "mapping of offensive discriminatory words as specified by Ofcom [4] to placeholder tokens."
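The described cleansing step amounts to a simple text substitution. Here is a minimal sketch of what such a preprocessing pass might look like; the word list and the placeholder token below are purely hypothetical illustrations, not the actual Ofcom list or DeepMind's implementation.

```python
import re

# Hypothetical word list -- the real dataset reportedly uses Ofcom's list,
# which is not reproduced here.
OFFENSIVE_WORDS = {"damn", "blast"}
PLACEHOLDER = "<unk_offensive>"  # hypothetical placeholder token


def cleanse(text: str) -> str:
    """Replace each whole-word occurrence of a listed word with the placeholder."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in OFFENSIVE_WORDS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(PLACEHOLDER, text)


print(cleanse("Well, damn it all!"))
# -> Well, <unk_offensive> it all!
```

Note that even this toy version changes the token statistics of the corpus, which is exactly the concern raised below.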
I think it is a very dubious practice to censor such old books of offensive words/phrases that were common 100 or more years ago!
It is probably also very hard to defend this particular censorship practice on the grounds that these words are irrelevant for current language modeling. Why then choose books published 100 or more years ago at all? These books probably contain lots of outdated language!
By the way, the Ofcom reference [4] is "Attitudes to potentially offensive language and gestures on TV and radio: Quick Reference Guide". Ofcom describes itself as "the regulator for the communications services that we use and rely on each day." Ofcom could very well be an agency of Orwellian Newspeak!
GitHub: deepmind/pg19