Stolen information? The mystery over how tech giants train their AI chatbots
By Tim Biggs
As Google, Microsoft and OpenAI continue refining and promoting their AI-powered chatbots, claims that they are built on stolen — and in some cases, private — information are to be tested in US courts.
Bots including Bard, Bing and ChatGPT have become widely used for drafting creative and professional work, answering questions and even writing code: talents they have learnt by analysing massive amounts of human-created content. But how exactly did the companies source that content? And did the people who created it consent to their words being used to train bots?
Generative AI and large language models, the two schools of technology behind these chatbots, have a long history, and the methods of their creation are well documented. But as their development has become an arms race between tech giants, the reality of how gains are achieved has become obfuscated.
This year, when OpenAI introduced its latest language model, GPT-4 – which both ChatGPT and Microsoft’s Bing run on – its accompanying paper explicitly said the company would give no insight into its data collection.
“Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture [including model size], hardware, training compute, dataset construction, training method, or similar,” the paper said.
Google has similarly offered only vague assurances about its own language models: that it collects data only from open or public sources, and that its methods are legal.
But analyses have spotted AI’s prying eyes peeking into everything from law school test exams and code repositories to fan fiction collections and news articles. Some experts and analysts believe the companies’ technologies trawl the furthest extents of the entire internet, rapidly jumping from link to link and sorting through databases to suck up every last word, indifferent to whether sites have settings or terms of service that indicate to users that information there is private.
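For a sense of how simple the underlying mechanism is, here is a rough sketch in Python of the kind of link-following crawler described above. It is purely illustrative: the seed address is a placeholder, and it says nothing about how Google, Microsoft or OpenAI actually built their datasets. But the core loop of fetching a page, harvesting its links and following them is common to every web crawler.

```python
# A minimal, illustrative web crawler: fetch a page, harvest its links,
# and follow them breadth-first. Real training-data pipelines are vastly
# larger, but the core loop - fetch, extract links, repeat - is the same.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url, yielding page text."""
    queue, seen, fetched = deque([seed_url]), {seed_url}, 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages and keep crawling
        fetched += 1
        yield url, html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

# Placeholder seed address; a real crawl would start from many seeds.
for url, page in crawl("https://example.com"):
    print(url, len(page), "characters collected")
```

A scraper built for training data would simply store the text it collects at each step. Nothing in the loop asks whose words they are, or whether anyone agreed to hand them over.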
Again, bots trawling the web are nothing new. LinkedIn has been fighting legal battles for many years to keep recruiters and data merchants from scraping its site.
But here we’re talking about massive tech corporations. A cynic would point out that Google and Microsoft developed many of the technologies designed to keep bots out of places they shouldn’t be. Most CAPTCHA puzzles, for example – those boxes full of images you need to click to prove you’re human – are run by Google. What would stop them from taking everything?
A pair of class action lawsuits recently filed in the US allege that’s exactly what the tech giants have been doing.
Brought by Clarkson Law Firm, the cases allege that OpenAI, Microsoft and Google have been harvesting as much information from the internet as possible — including, illegally, private information and copyrighted works — to build their AI chatbots.
The lawsuits also allege that these actions have caused or will cause harm: that the companies ignored an existing, established market for purchasing information; that models trained on copyrighted works will be used to compete against traditional media; and that people will be sold products created from data that was stolen from them in the first place.
“[Bing and ChatGPT] use stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge,” the first lawsuit alleges, in part.
“Furthermore, defendants continue to unlawfully collect and feed additional personal data from millions of unsuspecting consumers worldwide, far in excess of any reasonably authorised use, in order to continue developing and training the products.”
A Microsoft spokesperson declined to comment.
Clarkson makes similar claims in its lawsuit against Google.
“Google illegally accessed restricted, subscription-based websites to take the content of millions without permission and infringed at least 200 million materials explicitly protected by copyright, including previously stolen property from websites known for pirated collections of books and other creative works,” that lawsuit says in part.
“Without this mass theft of private and copyrighted information belonging to real people, communicated to unique communities for specific purposes, and targeting specific audiences, many of Google’s AI products, including Bard, would not exist.”
In response to the allegations, Google reiterated that all its data collection practices were legal.
“We’ve been clear for years that we use data from public sources – like information published to the open web and public datasets – to train the AI models behind services like Google Translate, responsibly and in line with our AI principles,” said Google general counsel Halimah DeLaine Prado.
“American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”
This masthead does not suggest that Google, OpenAI or Microsoft have done anything illegal.
The question of how chatbots are trained and improved, and whether our own personal data is used, brings to mind a familiar problem we’ve seen primarily in social media.
The fundamental problem is that the inner workings of this technology – which seems bound to play a big part in our lives – are completely opaque, making it impossible to answer burning questions or truly understand what’s happening with the information we place online. It’s not even clear whether those operating the bots fully control or understand the scope of their data collection.
And as with social media, privacy policies and official FAQs don’t offer much illumination. For example, Google’s privacy policy makes it clear that it collects information about every email you send, photo you upload, and search you make, and that it’s used to develop, maintain and improve services. So does that mean an original novel that you have sitting in your Google Drive is helping train language models?
A Google spokesperson said no: personal data from Gmail, Photos and Google’s Workspace services is not used to train AI models, including Bard.
But the same might not be true if you happen to post your writing on a forum. A recent change to Google’s policy explicitly states it will use information that’s “publicly available or from other public sources” to train AI models and build products.
Discussion site Reddit recently appeared to realise that its service was being constantly crawled by bots, and changed its backend so that nobody could get unfettered access to its content without paying a hefty subscription fee. The move caused a site-wide protest in support of developers who needed that access to create accessible and custom-made versions of the site. But Reddit chief executive Steve Huffman said at the time he didn’t “need to give all of that value to some of the largest companies in the world for free”.
Data scraping was also supposedly behind recent changes at Twitter, where limits were imposed on how many tweets a given user could see per day. Owner Elon Musk said it was to “address extreme levels of data scraping [and] system manipulation”.
Elsewhere, several media companies have expressed concerns that their articles and journalism could be scraped to train bots, which eventually would be used to produce work that would compete with traditional media. Arguably, most news sites are “public sources”. Similar issues have been raised by writers as part of the film industry strikes in the US.
As with social media, the full weight of regulation and litigation will be a few years behind, it seems, as practically all rules that govern what data crawlers can do – or what information can be collected for which purpose – were written well before the advent of generative AI. In the meantime, it’s close to impossible to tell exactly what’s going on behind the scenes.
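Those older rules mostly amount to the robots.txt convention: a file sites publish to tell crawlers which pages they may fetch. Crucially, compliance is voluntary. A well-behaved bot checks the file before each request, as in this illustrative Python sketch (the site address is a placeholder and the bot names are merely examples), while a badly behaved one simply skips the check.

```python
# Checking a site's robots.txt: the decades-old, purely voluntary
# convention telling crawlers which paths they may fetch. A compliant
# bot consults it before every request; nothing forces a bot to do so.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder site
robots.read()  # fetch and parse the site's rules

# Each crawler identifies itself with a user-agent string and asks
# whether that agent is allowed to fetch a given path.
for agent in ("Googlebot", "SomeAIBot"):  # example bot names only
    allowed = robots.can_fetch(agent, "https://example.com/private/")
    print(agent, "may fetch /private/:", allowed)
```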