July 20, 2023

ChatGPT’s behavior is changing, and the quality of its responses has declined sharply over the last few months, a study by Stanford University researchers shows.

Titled “How Is ChatGPT’s Behavior Changing over Time?,” the study was posted earlier this week on the preprint server arXiv by researchers Lingjiao Chen, Matei Zaharia and James Zou.

The study is not peer reviewed, but the preliminary findings directly challenge OpenAI’s claim that its large language models are getting more efficient and accurate at answering prompts. Instead, the answers changed over time — and generally got less accurate and useful.

“We don’t fully understand what causes these changes in ChatGPT’s responses because these models are opaque,” one of the researchers, James Zou, an assistant professor at Stanford, told The Messenger.

The study compared the responses of OpenAI’s GPT-3.5 and GPT-4 in March and in June, measuring how the accuracy of their answers changed.

Importantly, it is hard to tell when these GPT versions receive new updates. OpenAI fine-tunes these models behind the scenes and it’s unclear what aspects of the technology received the most changes. So to test their quality over time, the researchers evaluated both GPTs in four ways: solving math problems; answering sensitive or dangerous questions; generating code; and visual reasoning.

The results show that in March, GPT-4 identified prime numbers, one of the math tasks, with 97.6% accuracy. By June, its accuracy on the same questions had fallen to 2.4%, a steep decline. GPT-3.5, meanwhile, improved dramatically on the same task: its accuracy rose from 7.4% in March to 86.8% in June.

This figure from the paper shows how the models performed over time.

Both models also got significantly worse at code generation, and neither improved significantly at answering sensitive questions or at visual reasoning.

“It is possible that tuning the model to improve its performance in some domains can have unexpected side effects of making it worse on other tasks,” Zou speculated.

“Because we don’t understand what is responsible for the LLM drift, it’s even more important to continuously monitor the model’s behaviors over time,” he added.
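One way to picture the kind of monitoring Zou describes is to score each model snapshot against the same fixed set of questions and compare the results over time. The Python sketch below is a minimal illustration under stated assumptions, not the researchers' code: `snapshot_a` and `snapshot_b` are hypothetical stand-ins for two versions of a model, the tiny benchmark is invented for the example, and a real setup would query an actual model rather than call a local function.

```python
from typing import Callable, Dict, List


def is_prime(n: int) -> bool:
    """Ground truth, computed by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(model: Callable[[int], bool], numbers: List[int]) -> float:
    """Fraction of numbers on which the model's answer matches the truth."""
    correct = sum(model(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


def drift_report(
    snapshots: Dict[str, Callable[[int], bool]], numbers: List[int]
) -> Dict[str, float]:
    """Score every snapshot on the same benchmark so results are comparable."""
    return {name: accuracy(model, numbers) for name, model in snapshots.items()}


# Hypothetical stand-ins for two snapshots of a model (assumptions, not real APIs).
def snapshot_a(n: int) -> bool:
    return is_prime(n)  # answers every question correctly


def snapshot_b(n: int) -> bool:
    return False  # always answers "not prime"


# Invented benchmark: six primes and two composites (21 and 25).
benchmark = [7, 11, 13, 17, 19, 21, 23, 25]
report = drift_report({"snapshot_a": snapshot_a, "snapshot_b": snapshot_b}, benchmark)
print(report)
```

In this toy setup the second stand-in scores 0.25, because always answering "not prime" is right only on the two composite numbers. The point is that a fixed benchmark makes a change in behavior visible as a change in score.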

The study is evidence backing up the many anecdotes from ChatGPT users suggesting the bot has gotten worse at answering questions over time.

But other AI experts are not totally convinced by that conclusion.

“It is very much unclear what could cause such performance differences,” Ilia Shumailov, a research fellow studying AI at Oxford University, told The Messenger.

“It is really hard to evaluate such systems,” he added. “Some model performance differences can come from technical details, yet at the same time, they can also come from a changing business model or even a legal landscape.”

Shumailov and a team of other UK-based researchers previously found that relying too heavily on AI-generated training data “causes irreversible defects” in AI models’ real-world outputs. This suggests that the quality of any generative AI model will degrade if it is trained largely on synthetic data. That kind of cannibalistic data pipeline leads to “model collapse,” a degenerative process in which models gradually forget the true underlying data distribution, the research found.
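The idea of a model forgetting the underlying data distribution can be illustrated with a toy simulation. The Python sketch below is an assumption-laden illustration, not the experiment from the research: the "model" is just a bell curve described by a mean and a spread, and each generation is fitted only to a small sample drawn from the previous generation. The function name, sample size and number of generations are all chosen for the example.

```python
import random
import statistics
from typing import List


def simulate_collapse(
    generations: int = 300, sample_size: int = 10, seed: int = 0
) -> List[float]:
    """Return the fitted spread (standard deviation) at each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "true" data distribution at generation 0
    history = [sigma]
    for _ in range(generations):
        # Each generation sees only synthetic data sampled from the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Training" the next model means re-estimating the mean and spread.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"generation 0 spread: {history[0]:.3f}")
print(f"generation {len(history) - 1} spread: {history[-1]:.3e}")
```

Because each small sample tends to underestimate the spread of the distribution it came from, the errors compound from one generation to the next, and the fitted spread drifts toward zero. The final "model" covers only a sliver of the range the original data did.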

OpenAI did not respond to The Messenger’s request for comment. But OpenAI employee Logan Kilpatrick responded to the reports on Twitter: “Everyone @OpenAI wants the best models that help people do more of what they are excited about. We are actively looking into the reports people shared.”

“Model collapse will definitely influence the learning process and should be taken explicitly into account to avoid degradation,” Shumailov said.