How to Stop OpenAI’s Web Crawler From Training New ChatGPT Models On Your Site
The Messenger, 2023-08-08

You can disallow GPTBot from accessing your site with a simple command

Sherin Shibu
Sam Altman, CEO of OpenAI, speaks to the media as he arrives at the Sun Valley Lodge for the Allen & Company Sun Valley Conference on July 11, 2023 in Sun Valley, Idaho. (Kevin Dietsch/Getty Images)

Yesterday, OpenAI launched GPTBot, a program that collects publicly available data for use in training future ChatGPT models. Such bots are commonly known as web crawlers or web spiders. Website owners can opt out of being crawled, but to do so they’ll need to either block the bot’s IP address or add a disallow rule to their site’s robots.txt file. OpenAI has provided instructions on how to do the latter.

To block the crawler, add an entry for GPTBot to your site’s robots.txt file and disallow it:

User-agent: GPTBot
Disallow: /
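
Before publishing the rule, it can be worth checking that it behaves as intended. The following is a minimal sketch using Python's standard `urllib.robotparser` module; the helper name `is_allowed`, the `example.com` URL, and the second bot name are illustrative assumptions, not part of OpenAI's instructions.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the article, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and "SomeOtherBot" are placeholders for illustration.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/any/page"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/any/page"))
```

Under these rules the check should report that GPTBot is blocked everywhere while crawlers that are not named in the file remain unaffected. Note that robots.txt is a request that well-behaved crawlers honor, not an enforcement mechanism.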

GPTBot identifies itself with the following user agent token and string:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
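
Site owners who want to spot GPTBot requests in their own server logs or application code could match on the token above. This is a rough sketch, assuming a case-insensitive substring match on the `User-Agent` header is good enough; the function name `is_gptbot` is hypothetical, and because the header is supplied by the client it can be spoofed.

```python
from typing import Optional

# Token and full string as quoted in the article.
GPTBOT_TOKEN = "GPTBot"
GPTBOT_FULL_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)


def is_gptbot(user_agent_header: Optional[str]) -> bool:
    """Return True if the User-Agent header contains the GPTBot token.

    is_gptbot is a hypothetical helper; the match is case-insensitive
    and does not verify that the request really comes from OpenAI.
    """
    if not user_agent_header:
        return False
    return GPTBOT_TOKEN.lower() in user_agent_header.lower()


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_FULL_UA))
    print(is_gptbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```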

You can also customize access by using the GPTBot token to allow or disallow specific parts of your site in its robots.txt file. Here’s an example:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
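
The effect of the mixed rules can be checked the same way. The sketch below again relies on Python's standard `urllib.robotparser`; the helper `gptbot_can_fetch`, the `example.com` host, and the page paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The customized rules from the article's example.
CUSTOM_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_can_fetch(robots_txt: str, path: str) -> bool:
    """Return True if GPTBot may fetch the given path under robots_txt.

    gptbot_can_fetch is a hypothetical helper; example.com is a placeholder.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", "https://example.com" + path)


if __name__ == "__main__":
    for path in ("/directory-1/page.html", "/directory-2/page.html", "/elsewhere/"):
        print(path, gptbot_can_fetch(CUSTOM_RULES, path))
```

One point this makes visible: paths that match neither rule stay open to the crawler by default, so anything that should be off limits needs its own Disallow line.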

Web pages crawled by GPTBot “are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI stated in an effort to assuage privacy concerns.

GPTBot’s launch comes three weeks after OpenAI filed a trademark application for “GPT-5” software, which could be the next generation of the GPT-4 model that ChatGPT currently uses. The trademark registration is under review and covers the artificial production of human speech and text, as well as natural language processing and translation of text or speech.

OpenAI has previously admitted that ChatGPT updates might not be perfect. OpenAI faces multiple lawsuits over ChatGPT, including one over libel, one over copyright infringement and a class-action lawsuit over data confidentiality.

According to the last lawsuit, Microsoft and OpenAI allegedly gathered “private information and private conversations, medical data, information about children — essentially every piece of data exchanged on the internet it could take — without notice to the owners or users of such data, much less with anyone's permission.”

Because GPTBot is a recent addition to the company’s tools, the exact methods the company used to train prior ChatGPT models remain unclear to the public.

The Federal Trade Commission launched an investigation of OpenAI in July for potential breaches of consumer protection law. The agency intends to look into OpenAI’s efforts to correct false and personal information revealed through ChatGPT. 

OpenAI did not respond to The Messenger’s request for comment about any new safeguards in place for GPTBot.

