Google says public data is fair game for training its AI • The Register

Google has updated its privacy policy to confirm that it scrapes public data from the internet to train its AI models and services — including chatbot Bard and its search engine that now provides quick answers to queries.

The fine print under research and development now reads: “Google uses information to improve our services and to develop new products, features, and technologies that benefit our users and the public. For example, we use publicly available information to help train Google AI models and build products and features such as Google Translate, Bard, and Cloud AI.”


Interestingly, Reg staff outside the United States could not see the quoted text at the link above. But the PDF version of Google’s policy states: “We may collect information that is publicly available online or from other public sources to help train Google AI models and build products and features, such as Google Translate capabilities, Bard, and Cloud AI.”

The changes define the scope of Google’s AI training. Previously, the policy referred only to “language models” and mentioned Google Translate. The wording now covers “AI models” and names Bard and other systems built as applications on its cloud platform.

A Google spokesperson told The Register that the update didn’t fundamentally change the way it trains its AI models.

“Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply states that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles,” the spokesperson said in a statement.

Developers have been scraping the Internet, photo albums, books, social networks, source code, music, articles, and more, to collect training data for AI systems for years. However, this process is controversial, since the material is usually protected by copyright, terms of use, and licenses, and the whole thing has resulted in lawsuits.

Some people are unhappy that not only is their content being used to build machine learning systems that duplicate their work, and thus potentially jeopardize their livelihoods, but that the models’ output comes very close to copyright or license infringement by regurgitating that training data unaltered.

AI developers may argue that their efforts fall under fair use, and that the models’ output is a new form of work and not actually a copy of the original training data. It is a hotly debated problem.

Stability AI, for example, has been sued by Getty Images for harvesting and misusing millions of images from its stock image website to train text-to-image tools. Meanwhile, OpenAI and its backer Microsoft have been hit with multiple lawsuits accusing them of improperly scraping “300 billion words from the Internet,” including “books, articles, websites and publications, including personal information obtained without consent,” and of scraping source code from public repositories to create the AI pair-programming tool GitHub Copilot.

A Google representative declined to say whether the advertising and search giant would scrape public copyrighted data, licensed data, or social media posts to train its systems.

Now that people are more aware of how AI models are trained, some internet companies have started charging developers for access to their data. Stack Overflow, Reddit, and Twitter, for example, this year introduced new fees or rules for accessing their content through APIs. Other sites like Shutterstock and Getty have chosen to license their images to creators of AI models, and have partnered with the likes of Meta and Nvidia. ®
