OpenAI does not disclose the data used to train ChatGPT’s models, which makes it hard to unravel the mystery of the answers it provides and to understand how such AI models are built. To give some idea, the Washington Post analyzed Google’s dataset, dubbed C4 for “Colossal Clean Crawled Corpus.”
C4 is a huge database of 15 million websites that has been used to train certain AIs, such as Google’s T5 and Facebook’s LLaMA.
Note that GPT-3, the model behind ChatGPT, was trained on roughly 40 times as much data as C4. The GPT-3 data also includes the full English Wikipedia, a collection of free novels by unpublished authors, and many links shared on Reddit.
To carry out its analysis, the Washington Post worked with researchers at the Allen Institute for AI and ranked the websites using data from Similarweb, a web analytics company. They then categorized the sites by topic and identified the most used ones.
Frequently used far-right sites
The most recurring themes are business and industry, followed by technology and then the media. The top three sites, all topics combined, are Google Patents (a patent search engine), Wikipedia, and scribd.com (an online document-sharing site). Half of the top 10 sites are also news outlets, among them the New York Times and The Guardian.
What is worrisome, however, are the sites that appear a little lower in the ranking, yet still high enough to stand out. They include Russia Today, affiliated with the Russian state; Breitbart.com, known for spreading false information and close to the far right; and Vdare, an anti-immigration site associated with white-supremacist ideology.
4chan, known for its links to the far right, as well as sites close to QAnon and conspiracy sites, also appear in the ranking.
Using these sites to train AI models could thus lead them to spread misinformation and conspiracy theories, without users being able to trace the source of the information, especially given ChatGPT’s opacity.
Filters that need improvement
Religious sites also appear in the ranking. Of the top 20 religious sites, 14 are Christian, including Christianity Today, which, the Washington Post notes, recently published an article advising women to continue to submit to abusive fathers and husbands and to avoid reporting them to the authorities.
To prevent AI models from producing responses filled with obscene, racist, or insulting remarks, Big Tech companies design filters meant to improve the quality of answers. Google, for example, filters this type of content out of C4. But these filters have limits: C4’s cleanup also removes some LGBTQ content that contains nothing offensive. The filtering criteria still need to be refined.
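To see why such filters over-remove benign content, consider a minimal sketch of blocklist-based filtering in the spirit of C4-style cleanup. The word list and documents below are hypothetical, chosen only to illustrate the mechanism: when a single flagged token discards an entire page, and the blocklist mixes identity terms in with slurs, harmless pages disappear along with abusive ones.

```python
def keep_document(text: str, blocklist: set[str]) -> bool:
    """Keep a document only if NO word appears in the blocklist.

    This mirrors the coarse approach often used to clean large web
    corpora: one flagged token is enough to drop the whole page.
    """
    words = {w.strip(".,!?").lower() for w in text.split()}
    return blocklist.isdisjoint(words)

# Hypothetical blocklist mixing a slur placeholder with identity terms:
blocklist = {"<slur>", "lesbian", "gay"}

docs = [
    "A support forum for gay teenagers and their families.",  # benign
    "An abusive rant containing a <slur> aimed at a user.",   # abusive
    "A recipe blog about sourdough bread.",                   # unrelated
]

kept = [d for d in docs if keep_document(d, blocklist)]
# Only the recipe survives: the benign LGBTQ page is removed
# along with the abusive one.
```

The over-removal is structural: the filter cannot distinguish a word used as an insult from the same word used as a self-description, which is exactly the limitation the Washington Post highlights.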
Another issue raised is data privacy. Technology is the second most recurring category, and social networks like Facebook and Twitter remain unclear about how users’ personal information may be used to train AI models.
Data and copyright
On the commercial side, two sites raise doubts about respect for copyright: Kickstarter, a crowdfunding site, and patreon.com, which helps creators earn money from their fans.
And that is the problem. Kickstarter and Patreon can give AI access to ideas submitted by entrepreneurs and creators on these platforms, who currently receive no compensation when their work is used as training material.
The copyright issue also arises for image generators like Stable Diffusion or MidJourney. Some news organizations have likewise singled out tech companies for using their content without permission.
Reddit, a community site, voiced its own dissatisfaction on April 18. The platform is a gold mine for AI models, which rely heavily on its conversations: businesses that want to use them to train their AI systems will now have to pay to access Reddit’s APIs.
Source: BFM TV
