Google’s RT-2 Robot Gives ChatGPT-Style AI a Body

Google’s latest robot can learn, infer and act beyond the scope of its training, according to a July 28 post on the Google DeepMind website. Using the Robotic Transformer 2 (RT-2) AI model, it can determine the best improvised hammer from a series of objects laid out in front of it (a rock), or which drink a tired person should consume (an energy drink), even if it hasn’t been explicitly given the answers. Perhaps most impressively, it can carry out basic tasks given to it in plain language, even when the instructions reference objects it’s unfamiliar with (“push the ketchup to the blue cube”). It does all this by taking in web and robotics data to visually understand, interpret and carry out commands.

The RT-2 model extends well beyond its predecessor, RT-1. While robots running the RT-1 model could learn combinations of tasks and objects, RT-2 robots take things a step further by understanding more visual cues and performing basic, chain-of-thought reasoning. They connect physical skills learned from training data to their surroundings, making complex inferences when given commands and applying those inferences to execute instructions. Even if the answer to a command is not explicitly present in RT-2’s training data, an RT-2 robot can still figure out object categories and use high-level descriptions to adequately respond to a user’s request.

Google’s DeepMind team noted in a paper that it built the RT-2 model on a dataset similar to RT-1’s, but broadened the model “to use a large vision-language backbone,” allowing it to make inferences based on what it sees. The team trained vision-language models to potentially solve “Internet-scale vision-language tasks” by combining robotics data (i.e., the movement of the robot) with web-based interpretations of images and text. The team then fine-tuned the model while balancing the ratio of robot data to web data, calling that balance “a key technical detail of the training recipe.”
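To make that recipe concrete, here is a minimal sketch, assuming a batch-sampling setup the paper does not spell out, of what balancing robot data against web data during fine-tuning could look like. The dataset entries, the 50/50 split and the function names are illustrative placeholders, not details from DeepMind’s paper.

```python
import random

# Illustrative sketch of mixing web vision-language examples with robot
# trajectory examples at a fixed ratio when sampling a training batch.
# The 50/50 split and the example records below are assumptions; DeepMind
# only states that balancing this ratio was a key part of the recipe.

def mixed_batch(web_examples, robot_examples, batch_size=32, robot_fraction=0.5):
    """Sample one training batch containing a fixed fraction of robot data."""
    n_robot = int(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    batch = random.sample(robot_examples, n_robot) + random.sample(web_examples, n_web)
    random.shuffle(batch)
    return batch

# Each example pairs an image and a text prompt with a text target:
# a caption or answer for web data, a tokenized action string for robot data.
web_examples = [{"image": "photo.jpg", "prompt": "What is in the image?", "target": "a red apple"}] * 100
robot_examples = [{"image": "camera.jpg", "prompt": "pick up the apple", "target": "0 132 90 241 5 101 127 200"}] * 100

batch = mixed_batch(web_examples, robot_examples)
```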

The resulting RT-2 model can not only perform low-level robotic actions but also answer open-vocabulary questions based on camera input. Although its physical repertoire is limited to what appears in the robot data, the model can combine these functions to apply its trained robotic skills in new contexts beyond what its data has outlined. It can also take in images and plain-language commands and act on them physically with the help of knowledge obtained from the web.
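Google has not published RT-2’s code, but one way to picture a model that both answers questions and drives a robot is to have it emit actions as short strings of discretized tokens that a controller decodes into motion. The token layout, bin count and scaling in the sketch below are assumptions made for illustration, not Google’s specification.

```python
# Hypothetical decoder for a vision-language-action model that outputs a
# robot command as a string of integer tokens. The field order (terminate,
# position deltas, rotation deltas, gripper) and the 0-255 binning are
# illustrative assumptions, not RT-2's published format.

def decode_action(output_text: str):
    """Map a string of discretized action tokens to a continuous command."""
    tokens = [int(t) for t in output_text.split()]
    terminate, dx, dy, dz, droll, dpitch, dyaw, gripper = tokens

    def scale(t):
        # Map a 0-255 bin back to a normalized value in [-1, 1].
        return (t / 255.0) * 2.0 - 1.0

    return {
        "terminate": bool(terminate),
        "delta_position": [scale(dx), scale(dy), scale(dz)],
        "delta_rotation": [scale(droll), scale(dpitch), scale(dyaw)],
        "gripper": scale(gripper),
    }

print(decode_action("0 132 90 241 5 101 127 200"))
```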

A promotional image of Google’s RT-2 robotics training model in action. (Image: Google)

In this way, RT-2 is loosely analogous to AI bots like ChatGPT: It takes in a command and uses the wealth of knowledge on the internet to shape its response. But unlike ChatGPT, RT-2’s response comes as robotic movement rather than text. RT-2 also generalizes, meaning the model operates on general rules instead of on billions of data points. This approach advances robot training because it lets complex reasoning rest on a smaller amount of data. A fully formed general-purpose robot has yet to be realized: RT-2’s thought processes are more about interpreting data and connecting dots than achieving sentience or a human-like thought process.

The team tested over 200 tasks to evaluate how RT-2 stacks up against other models. Compared with RT-1, RT-2 performed about the same on tasks it had seen directly in its training data, but it showed significant improvement at inferring what to do on unseen tasks.
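As a rough illustration of that seen-versus-unseen comparison, a success-rate tally might look like the sketch below; the tasks and outcomes are invented placeholders rather than Google’s measurements.

```python
# Toy tally of task success split by whether the task appeared in training.
# Every entry here is a made-up placeholder used only to show the bookkeeping.

def success_rate(results):
    return sum(results) / len(results) if results else 0.0

trials = [
    {"task": "pick up the can", "seen": True, "success": True},
    {"task": "move apple to cup", "seen": True, "success": True},
    {"task": "push ketchup to the blue cube", "seen": False, "success": True},
    {"task": "pick the improvised hammer", "seen": False, "success": False},
]

seen = [t["success"] for t in trials if t["seen"]]
unseen = [t["success"] for t in trials if not t["seen"]]
print(f"seen: {success_rate(seen):.0%}, unseen: {success_rate(unseen):.0%}")
```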

The team also investigated RT-2’s emergent capabilities, meaning capabilities that arise through Internet-scale training. The researchers looked at whether relationships between objects would transfer to robotic action, even when the model had not been exposed to robot data that would spur that action. The findings showed that RT-2 “inherits novel capabilities in terms of semantic understanding and basic reasoning,” and further that “all the interactions tested in these scenarios have never been seen in the robot data.” The novelty of RT-2 lies in this apparent transfer of semantic knowledge through vision-language data.

Despite its novelty, RT-2 has its limits. The robot can’t physically do anything new outside of the skills it learned through the robot data. Even though RT-2 can understand the world around it more organically than its predecessor, it is still restricted to the actions it has learned.
