Google-Extended: A New Tool for Website Publishers to Control AI Training Data

28 Sep 2023



Google has unveiled Google-Extended, a new tool that lets website publishers decide whether their data is used to improve Google's AI models. With it, a site can still be crawled and indexed by web crawlers such as Googlebot while preventing that data from being used to refine AI models.

Google-Extended lets publishers manage whether their websites help improve Bard and the Vertex AI generative APIs, two of Google's AI offerings. According to Google, publishers can use it to "control access to content on a site." In July, Google confirmed that it uses publicly available data from the internet to train its AI chatbot, Bard.

Google-Extended is available through robots.txt, the text file that tells web crawlers which parts of a site they may access, which makes it convenient for publishers to adopt. This initiative signals Google’s commitment to privacy and underscores the control publishers retain over their content, even as they contribute to the vast pool of internet data.
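As a rough illustration of how such a rule might look, the sketch below embeds a sample robots.txt that disallows the Google-Extended token while leaving other crawlers unrestricted, then checks it with Python's standard-library parser. The sample rules, the example.com URL, and the helper name is_allowed are assumptions made for illustration; they are not taken from Google's announcement, and Python's parser only approximates how Google itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: opt out of Google-Extended
# while leaving every other crawler (including Googlebot) unrestricted.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "Google-Extended", page))  # False
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))        # True
```

Under these assumed rules, the site stays eligible for search crawling by Googlebot while signaling that its content should not be used for the AI products covered by Google-Extended.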

As AI applications expand, Google says it will explore "additional machine-readable approaches to choice and control for web publishers." This suggests that Google-Extended is only a first step, with further efforts to empower publishers still to come.

Many websites, including prominent names such as The New York Times, CNN, Reuters, and Medium, have faced a dilemma: how to keep their content out of AI training without giving up their search visibility. These sites have moved to block the web crawler OpenAI uses to gather data for improving ChatGPT. Blocking Google’s crawlers entirely, however, could remove them from search results, an unacceptable compromise for most publishers. Some sites, like The New York Times, have instead taken a legal approach, modifying their terms of service to prohibit companies from using their content for AI training purposes. This strategy reflects a delicate balance between maintaining search visibility, asserting control over content, and limiting its use in AI development. Google-Extended could offer a much-needed solution to this increasingly common concern.