Transcript
In “The Jetsons,” a classic cartoon series about the 21st century, “Rosie the robotic maid” seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training such a general-purpose robot remains an enormous challenge. Typically, engineers collect data that are specific to a certain robot and a certain task, then use these data to train the robot in a controlled environment. However, gathering these data is costly and time-consuming, and the robot is still likely to struggle to adapt to environments or tasks it hasn’t seen before.
To train better general-purpose robots, MIT researchers have developed a versatile technique that combines a vast amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks. This method involves aligning data from varied domains, like simulations and real robots, as well as multiple modalities, including vision sensors and robotic arm position encoders, into a shared “language” that a generative AI model can process.
The idea is that by combining such an enormous amount of data, this approach can be used to train any robot to perform a variety of tasks without the need to start training it from scratch each time. In theory, this method will be faster and less expensive than traditional techniques because it requires far less task-specific data. In fact, it has outperformed training from scratch by more than 20 percent in simulations and real-world experiments performed to date.
The research was presented recently at the Conference on Neural Information Processing Systems. In robotics, researchers often claim that we don’t have enough training data. However, the real problem is that the data come from myriad domains, modalities, and robot hardware types, leading to de facto incompatibility.
This new research shows how it’s possible to generate so-called robotic training “policies” by using all of this data as a foundation. Such robotic “policies” take in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tell the robot how and where to move next.
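As a rough sketch (not the researchers’ code), a policy can be thought of as a function that takes the current sensor readings and returns the next motor command; the shapes and the seven-joint arm below are assumptions for illustration.

```python
import numpy as np

def policy(camera_image: np.ndarray, joint_state: np.ndarray) -> np.ndarray:
    """Map sensor observations to the robot's next action (illustrative only).

    camera_image: an RGB frame from a vision sensor, shape (H, W, 3).
    joint_state:  proprioceptive readings such as joint angles and velocities, shape (D,).
    Returns a motor command, here assumed to be velocities for a 7-joint arm.
    """
    # A learned model would go here; this placeholder simply commands no motion.
    return np.zeros(7)
```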
Today, such policies are typically trained using “imitation learning,” where a human either demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method involves only a small amount of task-specific data, robots often fail when their environment or task changes, even slightly. To develop a better approach, the MIT researchers drew inspiration from large language models like GPT-4. These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data.
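A minimal sketch of imitation learning, assuming a small multilayer-perceptron policy trained with a mean-squared-error loss on demonstrated actions (the dimensions and network are illustrative, not the researchers’ setup):

```python
import torch
import torch.nn as nn

# Illustrative behavioral cloning: regress the policy's actions onto expert actions.
obs_dim, act_dim = 32, 7  # assumed observation and action sizes
policy_net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def imitation_step(obs_batch: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One imitation-learning update on a batch of demonstration data."""
    pred_actions = policy_net(obs_batch)
    loss = nn.functional.mse_loss(pred_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```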
Pretraining on so much data helps the models adapt to perform well on a variety of tasks. For an LLM, the data are all just sentences. But in robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, you need a different architecture. Robotic data take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where data are collected vary widely.
The new architecture developed by the MIT researchers is called Heterogeneous Pretrained Transformers (HPT). HPT unifies data from varied modalities and domains. The MIT team put a machine-learning model known as a transformer in the middle of their architecture, where it processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.
The researchers also aligned data from vision and proprioception into the type of input, called tokens, that the transformer can process, and each input is represented with the same fixed number of tokens. The transformer then maps all inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it will perform.
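A minimal sketch of that idea, assuming each modality arrives as a fixed number of feature tokens and using hypothetical dimensions (512-dimensional image features, a 14-dimensional joint state, 16 tokens per modality); the real HPT design differs in its details:

```python
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Sketch of an HPT-style shared trunk: modality-specific stems project
    vision and proprioception into a common token space, and one transformer
    processes the combined sequence. Dimensions are assumptions."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.vision_stem = nn.Linear(512, embed_dim)   # assumed 512-d image features
        self.proprio_stem = nn.Linear(14, embed_dim)   # assumed 14-d joint state
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, vision_tokens, proprio_tokens):
        # Both modalities contribute the same fixed number of tokens.
        v = self.vision_stem(vision_tokens)    # (B, 16, embed_dim)
        p = self.proprio_stem(proprio_tokens)  # (B, 16, embed_dim)
        return self.trunk(torch.cat([v, p], dim=1))  # one shared token sequence

trunk = SharedTrunk()
features = trunk(torch.randn(2, 16, 512), torch.randn(2, 16, 14))  # (2, 32, 256)
```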
A user only needs to feed HPT a small amount of data on their robot’s design, setup, and the task they want it to perform. Then HPT transfers the knowledge the transformer gained during pretraining to learn the new task. One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer. This included 52 datasets with more than 200,000 robot trajectories in four categories, including human demo videos and simulations.
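A hedged sketch of what that step might look like, assuming the pretrained trunk is kept frozen and only a small robot-specific action head is trained on the user’s data (whether the trunk is frozen or also fine-tuned is a design choice, not something stated here):

```python
import torch
import torch.nn as nn

embed_dim, act_dim = 256, 7  # assumed sizes
layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
pretrained_trunk = nn.TransformerEncoder(layer, num_layers=4)  # stands in for a trunk loaded from pretraining
action_head = nn.Linear(embed_dim, act_dim)  # new head for this robot's action space

for param in pretrained_trunk.parameters():  # reuse pretrained knowledge as-is
    param.requires_grad = False
optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)

def finetune_step(obs_tokens: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """obs_tokens: (B, T, embed_dim) tokenized observations; expert_actions: (B, act_dim)."""
    features = pretrained_trunk(obs_tokens)       # shared representation from pretraining
    pred = action_head(features.mean(dim=1))      # pool tokens, predict this robot's action
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```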
The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle. Proprioception is key to enabling many dexterous motions. Because the architecture always uses the same number of tokens for each input, it places equal importance on proprioception and vision.
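One plausible way to do that, shown only as an assumption-laden sketch: project the raw joint readings into a fixed number of tokens so proprioception occupies as many slots in the sequence as vision does.

```python
import torch
import torch.nn as nn

class ProprioTokenizer(nn.Module):
    """Hypothetical proprioception stem: turn one raw sensor vector (joint
    angles, velocities, gripper state) into a fixed number of tokens."""

    def __init__(self, proprio_dim: int = 14, num_tokens: int = 16, embed_dim: int = 256):
        super().__init__()
        self.num_tokens, self.embed_dim = num_tokens, embed_dim
        self.proj = nn.Linear(proprio_dim, num_tokens * embed_dim)

    def forward(self, proprio: torch.Tensor) -> torch.Tensor:
        # proprio: (B, proprio_dim) -> tokens: (B, num_tokens, embed_dim)
        return self.proj(proprio).view(-1, self.num_tokens, self.embed_dim)

tokens = ProprioTokenizer()(torch.randn(8, 14))  # (8, 16, 256)
```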
When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance. The long-term objective of HPT is a universal robot brain that anyone can download and use for their robot without any training at all.