The Biggest Barrier to a Robotics Revolution
Experts believe robots will be able to cost-effectively perform a large share of physical tasks which, until now, have been performed only by humans.
Technology Briefing

Transcript


Except for a few advocates of a “degrowth dystopia,” humans demand stable or growing affluence. Unfortunately, that will become harder to achieve in a world with a rapidly aging and eventually shrinking population. As explained in prior issues, AI and robotics appear to offer the most cost-effective solution to this demographic crisis.

Experts now believe that robots of the 2030s and beyond will be able to cost-effectively perform a large share of physical tasks which, until now, have been performed only by humans. That’s because robots will benefit from all of the traditional economic dynamics of prior automation waves. These include learning-curve effects, economies of scale, quality controls, and economies of scope. Therefore, the enormous cost of designing and training the first high-functioning robot will quickly be amortized over millions of copies, bringing unit costs down to $20,000 or less within 20 years.
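
To make that amortization arithmetic concrete, here is a minimal sketch in Python; the development cost, fleet size, and per-unit manufacturing cost are purely illustrative assumptions, not figures from this briefing.

```python
# Illustrative amortization arithmetic with hypothetical numbers:
# even a very large one-time development cost adds little per unit
# once it is spread across millions of copies.

development_cost = 2_000_000_000      # assumed one-time design/training cost, USD
units_produced = 1_000_000            # assumed cumulative fleet size
manufacturing_cost_per_unit = 18_000  # assumed per-unit hardware cost, USD

amortized_development = development_cost / units_produced
total_unit_cost = manufacturing_cost_per_unit + amortized_development

print(f"Development cost per unit: ${amortized_development:,.0f}")
print(f"Approximate all-in unit cost: ${total_unit_cost:,.0f}")
# -> $2,000 of development cost per unit, roughly $20,000 all-in
```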

At least initially, these robots will need a lot more human supervision than human workers doing the same jobs. Yet they will be different from today’s automation, which must be programmed by the factory or the user to anticipate every possibility. Unfortunately, “robots” that can actually do more than execute traditional programs only exist today as research projects and marketing tools. Meanwhile, the top minds in robotics are struggling to achieve breakthroughs which will transform these academic projects into commercial realities.

In prior Trends issues, our discussions of robot R&D have focused on hardware integration including processing power, actuators, sensors, and power supplies, with relatively little emphasis on the monumental challenge of robot training. Unfortunately, hardware and networks mean little without a new kind of software that can sense, interpret, and adapt to the ever-changing context.

Developing such software platforms represents by far the biggest barrier to making robots commonplace in factories, warehouses, hospitals, and homes. Building a general-purpose robot that can competently and robustly execute a wide variety of human-like tasks in any home or office environment has been the ultimate goal of robotics since the inception of the field. And given the recent progress of foundational AI models, usually referred to as LLMs, there has been a growing consensus outside the leading robotics labs that scaling existing network architectures by training them on very large datasets is likely the key to that objective. However, those who work in the field every day recognize that this approach is resource-intensive, and that while it holds great promise, success is not guaranteed.

To appreciate where this trend is taking us, we need to understand what the best and brightest minds in the field have to say about its limitations, costs, and timing. Let’s first consider the main arguments in favor of training foundational AI models on hyper-scale data as a means of enabling general-purpose robotics. Perhaps the most common argument for this approach is that “it has worked well for Computer Vision and Natural Language Processing, so why not robotics?” Such arguments readily cite the huge leaps embodied in foundation models such as GPT-4V and SAM.

That is, training a large model on an extremely large body of data has recently led to astounding progress on problems thought to be intractable just 3 or 4 years ago. Moreover, doing so has led to a number of emergent capabilities, where trained models are able to perform well at a number of tasks they weren’t explicitly trained for. More importantly, the fundamental methods for training a large model on a very large amount of data are general and not unique to Computer Vision and Natural Language Processing. Thus, there seems to be no reason why we shouldn’t soon see the same incredible “performance leap” in robotics tasks. In fact, we’re already starting to see some evidence that this might work well.

Several noted experts point to the recent RT-X and RT-2 papers from Google DeepMind as evidence that training a single model on large amounts of robotics data yields promising generalization capabilities. Russ Tedrake of MIT pointed to the recent Diffusion Policies paper as showing a similar surprising capability. And Sergey Levine of UC Berkeley highlighted recent efforts and successes from his group in building and deploying a “robot-agnostic” foundation model for navigation.

All of these works are preliminary in that they involve training a relatively small model with a paltry amount of data compared to the full dataset that GPT-4V was trained on for computer vision. However, they certainly imply that scaling up these models and datasets could yield impressive results in robotics. Other leading researchers argue that the inevitable progress in data collection, computing, and foundational models represents a wave that roboticists should ride.

The history of AI research has shown that relatively simple algorithms that scale well with data always outperform more complex or clever algorithms that do not. And whether we acknowledge it or not, there will always be more data and better computing. So, AI researchers can either choose to ride this wave, or to ignore it. Riding the wave means recognizing all the progress that’s happened because of large data and large models, and then developing algorithms, tools, and datasets to take advantage of the coming progress. It also means leveraging large pre-trained models from vision and language applications that currently exist and, where possible, applying them to robotics tasks.

Furthermore, it’s highly likely that useful robotics tasks will lie within a relatively simple dataspace, and training a large foundation model will help us identify that dataspace. The dataspace hypothesis as applied to robotics roughly states that, while the space of possible tasks we could conceive of having a robot do is impossibly large and complex, the tasks that actually occur in the real world are much more limited. By training a single model on large amounts of data, we might be able to identify this dataspace.

If we believe that such a dataspace exists for robotics, which intuitively seems likely, then this line of thinking would suggest that robotics is not different from Computer Vision or Natural Language Processing in any fundamental way. So, the same recipe that worked for Computer Vision or Natural Language Processing should be able to discover the dataspace for robotics and yield a shockingly competent generalist robot.

Even if this doesn’t pan out, attempting to train an LLM for general robotics tasks could teach us important things about the dataspace of robotics tasks, and perhaps we can leverage this understanding to address the generalized robotics training challenge. Another persuasive argument for addressing the robot training challenge with enormous data sets and LLMs is that it seems to be the best approach for getting at “commonsense capabilities,” which pervade all of robotics.

Consider the task of having a mobile manipulation robot place a mug onto a table. Even if we ignore the challenging problems of finding and localizing the mug, there are a surprising number of subtleties to this problem. For example: What if the table is cluttered and the robot has to move other objects out of the way? What if the mug accidentally falls on the floor and the robot has to pick it up again, re-orient it, and place it on the table? And what if the mug has something in it, so it’s important it’s never overturned?
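
To see why these subtleties resist hand-coding, here is a minimal Python sketch of a scripted version of the task; the SceneState fields and the plan_place_mug routine are hypothetical simplifications, and the point is that every new edge case demands yet another explicit branch.

```python
from dataclasses import dataclass

# Hypothetical, simplified world state for the "place the mug on a table" task.
@dataclass
class SceneState:
    mug_visible: bool
    table_cluttered: bool
    mug_contains_liquid: bool
    mug_on_floor: bool

def plan_place_mug(scene: SceneState) -> list[str]:
    """Return the sequence of scripted steps a hand-coded policy would need.

    Every edge case from the text becomes another explicit branch; a learned
    policy with commonsense reasoning would be expected to handle these implicitly.
    """
    steps = []
    if not scene.mug_visible:
        return ["search for mug"]                              # can't even start
    if scene.table_cluttered:
        steps.append("move obstacles aside")                   # edge case: clutter
    if scene.mug_on_floor:
        steps += ["pick mug up from floor", "reorient mug"]    # edge case: dropped mug
    if scene.mug_contains_liquid:
        steps.append("keep mug upright throughout the motion") # edge case: spilling
    steps += ["grasp mug", "place mug on table"]
    return steps

print(plan_place_mug(SceneState(True, True, True, False)))
```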

These so-called “edge cases” are actually much more common than they might seem, and often make the difference between success and failure for a task. Moreover, these seem to require some sort of ‘common sense’ reasoning to deal with. Many experts argue that LLMs trained on a large amount of data are the best way we know of to yield key aspects of this ‘common sense’ capability. Thus, this might be the best way to address general robotics tasks.

On the flip side, some of the world’s top researchers argue that simply scaling up robotic training sets and associated LLMs may not offer a practical solution when it comes to general robotics tasks. On the other hand, almost no one directly disputes that this approach could work in theory. Instead, most arguments fall into one of two categories: (1) this approach is simply impractical in the real-world, and (2) even if it does “kind of work,” it won’t be good enough to “solve” the ultimate challenges of generalized robotics.

Experts cite four reasons that this brute-force scaling might prove impractical in the real world. First, we currently just don’t have much robotics data, and there’s no clear way we’ll get it. The Internet is chock-full of data for Computer Vision and Natural Language Processing, but very little for robotics. Recent efforts to collect very large robotics datasets have required tremendous amounts of time, money, and cooperation, yet have yielded only a tiny fraction of the amount of vision and text data available on the Internet.

Computer Vision and Natural Language Processing got so much data because they had an incredible “data flywheel”: tens of millions of people connecting to and using the Internet. Unfortunately for robotics, there seems to be no reason why people would upload a bunch of sensory input and corresponding action pairs. Collecting a very large robotics dataset seems quite hard, and given that many important “emergent” properties only showed up in vision and language models at scale, the inability to get a large dataset could render this scaling approach hopeless.

Second, robots have different embodiments. So, another challenge with collecting a very large robotics dataset is that robots come in a large variety of different shapes, sizes, and form factors. The output control actions that are sent to a Boston Dynamics Spot robot are very different from those sent to a KUKA iiwa arm. Even if we ignore the problem of finding some kind of common output space for a large trained model, the variety in robot embodiments means we’ll probably have to collect data from each robot type, and that makes the data-collection problem even harder.
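
One way to picture the embodiment problem is to compare action spaces directly. The sketch below uses assumed, simplified action dimensions (they are not official specifications for either robot) and shows one simple workaround, padding every embodiment’s action vector to a shared width with a validity mask, which is only one of several possible schemes.

```python
import numpy as np

# Illustrative (assumed) action spaces for two very different embodiments.
# The exact dimensions are simplifications, not official robot specifications.
action_spaces = {
    "quadruped_base": {"dims": 12, "meaning": "12 leg joint targets"},
    "7dof_arm":       {"dims": 7,  "meaning": "7 joint torque commands"},
}

# A single foundation model needs one output space that covers both.
# One simple workaround: pad every embodiment's action vector to the largest
# dimensionality and carry a mask saying which entries are real.
max_dims = max(spec["dims"] for spec in action_spaces.values())

def pad_action(action: np.ndarray, max_dims: int) -> tuple[np.ndarray, np.ndarray]:
    """Pad an embodiment-specific action to a shared width, with a validity mask."""
    padded = np.zeros(max_dims)
    mask = np.zeros(max_dims, dtype=bool)
    padded[: action.size] = action
    mask[: action.size] = True
    return padded, mask

arm_action = np.random.uniform(-1, 1, size=7)   # a sample 7-DoF arm command
padded, mask = pad_action(arm_action, max_dims)
print(padded.shape, mask.sum())                 # (12,) and 7 real entries
```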

Third, there is an extremely large variance among the environments in which we want robots to operate. For a robot to really be “general purpose,” it must be able to operate in any practical environment a human might want to put it in. This means operating in any possible home, factory, or office building it might find itself in. Collecting a dataset that has even just one example of every possible building seems impractical. Of course, the hope is that we would only need to collect data in a small fraction of these, and the rest will be handled by generalization. However, we don’t know how much data will be required for this generalization capability to kick in, and it very well could also be impractically large.

And, fourth, training a model on such a large robotics dataset might be too expensive and energy-intensive for real-world applications. It’s no secret that training large foundation models is expensive, both in terms of money and in energy consumption. GPT-4V, OpenAI’s biggest foundation model at the time of this writing, reportedly cost over $100 million and 50 million kWh of electricity to train.

This is well beyond the budget and resources that any academic lab can currently spare, so a larger robotics foundation model would need to be trained by a company (such as Google or Tesla) or a government agency of some kind. Additionally, depending on how large both the dataset and model itself for such an endeavor are, the costs may balloon to several billion dollars, which might make it completely infeasible. Beyond questions of practicality, there is the issue of whether it’s even possible to get a robotics model to a performance level that’s good enough for the real world.

We now know what can be achieved with Computer Vision and Natural Language Processing, but it remains unclear whether researchers can create something good enough for general purpose robotics. Consider just four objections: First, Vincent Vanhoucke of Google Robotics argues that most, if not all, robot learning approaches cannot be deployed for any practical task.

The reason? Real-world industrial and home applications typically require accuracy and reliability in excess of 99 percent. What exactly that means varies by application, but it’s safe to say that robot learning algorithms aren’t there yet. In fact, most results presented in academic papers top out at around an 80 percent success rate. While that might sound reasonably close to the 99+ percent threshold, people trying to actually deploy these algorithms have found that it isn’t; getting higher success rates requires asymptotically more effort the closer we get to 100 percent.

That means going from 85 to 90 percent might require just as much, if not more, effort than going from 40 to 80 percent. Therefore, getting up to 99+ percent is a fundamentally different animal than getting even up to 80 percent, one that might require a whole host of new techniques beyond just scaling.

Second, existing big models don’t get to 99+ percent even in Computer Vision and Natural Language Processing applications. As impressive and capable as current large models like GPT-4V and DETIC are, even they don’t achieve 99+ percent success rates on previously unseen tasks. And current robotics models are very far from this level of performance.

In fact, it’s safe to say that the entire robot learning community would be thrilled to have a general model that does as well on robotics tasks as GPT-4V does on Computer Vision and Natural Language Processing tasks. However, even if we had something like this, it wouldn’t perform at 99+ percent, and it’s not clear that it’s even possible to get there using the scaling paradigm.

Third, self-driving car companies have already tried the scaling approach and it hasn’t yet worked! Specifically, a number of self-driving car companies, most notably Tesla and Wayve, have tried training such an end-to-end big model on large amounts of data in order to achieve Level 5 autonomy. Not only do these companies have the engineering resources and money to train such models, but they also have the data.

Tesla, in particular, has a fleet of over 100,000 cars deployed in the real world from which it is constantly collecting and then annotating data. These cars are driven by humans who serve, in effect, as expert demonstrators, making the data well suited to large-scale supervised learning. And despite all this, Tesla has so far been unable to produce a Level 5 autonomous driving system. That’s not to say the approach doesn’t work at all. It competently handles a large number of situations, especially highway driving, and serves as a useful Level 2 driver assist system. However, it’s still far from 99+ percent performance. Moreover, data seems to suggest that Tesla’s approach is faring far worse than Waymo or Cruise, which both use more “modular” systems.

While it isn’t inconceivable that Tesla’s approach could end up catching up and surpassing its competitors’ performance in a year or so, the fact that it hasn’t worked yet should perhaps serve as evidence that the 99+ percent problem is hard to overcome for a large-scale Machine Learning approach. Moreover, given that self-driving is a special case of general robotics, Tesla’s case should give us reason to doubt the large-scale model approach as the complete solution to robotics training, especially in the medium term.

And, fourth, many real-world robotic tasks have a relatively long time horizon. That means accomplishing any such task requires performing a number of correct actions in sequence. Consider the relatively simple problem of making a cup of tea given an electric kettle, water, a box of tea bags, and a mug. Success requires pouring the water into the kettle, turning it on, then pouring the hot water into the mug, and placing a teabag inside it.

If we want to solve this with a model trained to output motor torque commands given pixels as input, we’ll need to send torque commands to all 7 motors at around 40 Hz. Let’s suppose that this tea-making task requires 5 minutes. That requires 7 * 40 * 60 * 5 = 84,000 correct torque commands. This is all just for a single stationary robot arm; things get much more complicated if the robot is mobile or has more than one arm. It is well known that errors for most tasks tend to compound over longer horizons.

This is one reason why, despite their ability to produce long sequences of text, even LLMs cannot yet produce completely coherent novels or long stories: small deviations from the true prediction tend to add up and yield extremely large deviations over long horizons. Given that most, if not all, robotics tasks of interest require sending at least thousands, if not hundreds of thousands, of torque commands in just the right order, even a model that performs fairly well is likely to struggle to fully perform real-world robotics tasks.
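
The arithmetic above, and the compounding effect, can be checked with a few lines of Python. Treating each torque command as an independent success or failure is a crude simplification, and the per-step reliabilities below are illustrative assumptions, but the sketch conveys why long horizons are so unforgiving.

```python
# The command-count arithmetic from the text, plus a rough illustration of how
# per-step errors compound over a long horizon. Treating each command as an
# independent success/failure is a simplification, and the per-step success
# rates below are illustrative assumptions.

motors, control_hz, minutes = 7, 40, 5
total_commands = motors * control_hz * 60 * minutes
print(total_commands)  # 84,000 torque commands for the five-minute tea task

# If each command is "correct enough" with independent probability p,
# the chance the whole sequence stays on track decays as p ** N.
for p in (0.999, 0.9999, 0.99999):
    print(f"per-step success {p}: whole-task success ~ {p ** total_commands:.2g}")
# Even 99.99 percent per-step reliability leaves only a tiny chance of getting
# all 84,000 commands right, which is the intuition behind compounding errors.
```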

So, what’s the bottom line? Mankind is facing a demographic crisis as the labor force ages and eventually shrinks. To avoid the adverse implications, we need a cost-effective substitute for human labor in many roles. General-purpose robots, ranging from robot arms to autonomous vehicles to humanoid robots, seem like the most realistic solution. Artificial Intelligence has recently achieved enormous breakthroughs in Computer Vision and Natural Language Processing by training ever larger foundational models, known as LLMs, on ever larger data sets.

Assuming continuing exponential progress in computing power and data generation, it’s natural to assume that this approach will deliver similar achievement in robotics. To date, Tesla and certain other autonomous automobile pioneers have implemented this approach with limited success. Today, roboticists and investors are trying to determine whether and when this “scaling paradigm” could resolve the enormous challenge of creating a generalized model for robot control.

Given that robots come in so many forms and need to be able to safely and reliably operate in a wide range of settings, this will become one of the biggest challenges of the coming decade. Given this trend, we offer the following forecasts for your consideration. First, at least for the remainder of the 2020s, scaling-up data and machine learning models will remain the “go-to approach” for developers of general-purpose robots.

Despite arguments showing robotics differs from AI’s Natural Language Processing and Computer Vision applications, the breakthroughs LLMs have achieved in those areas encourage academic researchers and pioneering enterprises to embrace scaling. Only time will tell whether this approach will lead to affordable, safe, and reliable robots working in homes, hospitals, factories, mines, and warehouses.

Second, a scaled-up learning solution for robotics, analogous to what we’ve seen for LLMs, will be prohibitively expensive unless researchers develop alternatives that require less data than currently forecast. Not surprisingly, top researchers are already thinking about creative ways to overcome this problem. For instance, a number of researchers are exploring ways to train robots in simulation and then transfer the resulting skills to the real world.
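
One widely discussed sim-to-real technique is domain randomization: training across many randomly perturbed copies of the simulator so that the learned policy cannot overfit to any single set of physics parameters. The sketch below is a generic illustration; make_sim_env, train_one_episode, and the parameter ranges are hypothetical stand-ins, not any particular simulator’s API.

```python
import random

# Hypothetical sketch of domain randomization for sim-to-real transfer:
# each training episode samples different physics parameters so the policy
# must learn behavior that is robust to the gap between simulation and reality.
# make_sim_env and train_one_episode stand in for a real simulator and learner.

def sample_randomized_physics():
    """Draw one random set of physics parameters (ranges are illustrative)."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass_kg": random.uniform(0.2, 1.0),
        "motor_latency_s": random.uniform(0.00, 0.05),
        "camera_noise_std": random.uniform(0.0, 0.02),
    }

def train_with_domain_randomization(num_episodes, make_sim_env, train_one_episode, policy):
    """Train across many randomly perturbed simulator instances."""
    for _ in range(num_episodes):
        physics = sample_randomized_physics()
        env = make_sim_env(**physics)     # build a perturbed simulator instance
        train_one_episode(policy, env)    # update the policy on this variant
    return policy
```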

Other experts want to leverage existing vision, language, and video data and then ‘sprinkle in’ some robotics data. Consider Google’s recent RT-2 model, which started from an LLM trained on internet-scale vision and language data and then fine-tuned it on a much smaller set of robotics data, producing impressive performance on robotics tasks. Perhaps through a combination of simulation and pretraining on general vision and language data, the industry will avoid actually having to collect a prohibitive amount of real-world robotics data in order to get scaling to work well for robotics tasks.
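
In the spirit of that recipe, one way to let a pretrained vision-language model emit robot actions is to discretize each action dimension into bins and represent the bins as ordinary text tokens, which the model can then be fine-tuned to produce. The sketch below illustrates that idea only; the bin count, action ranges, and seven-dimensional command are assumptions, not details of any specific model.

```python
import numpy as np

# Sketch of turning a continuous robot action into a short token string so a
# pretrained vision-language model can be fine-tuned to emit it as plain text.
# The bin count and action layout are illustrative assumptions.

NUM_BINS = 256  # discretize each action dimension into 256 bins

def action_to_tokens(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to an integer bin, then to text."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(tokens: str, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Invert the mapping so the model's text output can drive the robot."""
    bins = np.array([int(t) for t in tokens.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

cmd = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 0.7])  # e.g., 7 joint targets
text = action_to_tokens(cmd)
print(text)                    # a short string of integer bin indices
print(tokens_to_action(text))  # approximately recovers the original command
```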

Third, human-in-the-loop solutions will go a long way toward overcoming the “99+ percent problem.” Today’s successful machine learning systems, like Codex and ChatGPT, work well only because a human interacts with and sanitizes their output. Consider the case of coding with Codex: it isn’t intended to directly produce runnable, bug-free code. Instead, it acts as an “intelligent autocomplete function for programmers.”

In doing so, it makes the overall human-machine team more productive than either alone. Therefore, these models don’t have to achieve the 99+ percent performance threshold, because a human can help correct any issues during deployment. That’s not true in most commonplace robotics scenarios. Fortunately, there are some exceptions, such as autonomous delivery drones and air taxis.

Today’s “human flight protocols” are based on lots of rules about navigation, proximity, fuel reserves, and system redundancies. These procedures and safety rules build in time for human intervention even when a machine learning system is doing the flying. Where similar opportunities for human oversight exist, they are likely to dramatically accelerate robot implementation.
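
A simple pattern for keeping a human in the loop is confidence-gated deferral: the system acts autonomously only when its own confidence estimate clears a threshold and otherwise hands the decision to a human supervisor. The sketch below is generic; predict_action, ask_human, and the threshold value are assumed names and numbers, not any deployed system’s interface.

```python
# Minimal sketch of confidence-gated human-in-the-loop control.
# predict_action is assumed to return both a proposed action and a
# self-estimated confidence in [0, 1]; names and threshold are illustrative.

CONFIDENCE_THRESHOLD = 0.95

def act_with_human_fallback(observation, predict_action, ask_human):
    """Act autonomously only when confidence is high; otherwise defer to a human."""
    action, confidence = predict_action(observation)
    if confidence >= CONFIDENCE_THRESHOLD:
        return action, "autonomous"
    # Below threshold: hand the decision to a human operator rather than
    # risking a low-confidence mistake, trading throughput for reliability.
    return ask_human(observation, suggested=action), "human_override"

# Toy usage with stubbed-in components:
def fake_predict(obs):
    return "move_left", 0.6      # low confidence triggers the fallback

def fake_human(obs, suggested):
    return "stop_and_wait"       # the human overrides the suggestion

print(act_with_human_fallback({"image": None}, fake_predict, fake_human))
```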

Fourth, some roboticists will pursue “hybrid solutions” intended to combine the best elements of familiar machine learning and more traditional control system approaches. Several top roboticists now believe the best medium-term approach to reliable real-world systems will combine learning with so-called “classical approaches.” For instance, a real-world robot already deployed in dozens of hospital systems uses a hybrid system, combining AI learning for perception and a few select skills with classical SLAM and path-planning techniques for the rest of its capabilities. And this is not unique; several recent research papers explain how classical controls and planning, combined with learning-based approaches, can enable much more capability than either solution on its own.
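
A minimal sketch of such a hybrid pipeline is shown below: a stand-in for a learned perception module produces an occupancy grid, and a classical breadth-first-search planner (standing in for SLAM and path-planning machinery) computes a collision-free route. The perception stub and grid values are illustrative assumptions.

```python
from collections import deque

# Hybrid pipeline sketch: learned perception produces an occupancy grid,
# then a classical planner (breadth-first search here, standing in for more
# sophisticated SLAM / path planning) finds a collision-free route.

def learned_perception_stub(camera_image):
    """Stand-in for a neural perception module: returns a 0/1 occupancy grid."""
    return [
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
    ]

def plan_path(grid, start, goal):
    """Classical grid search: shortest obstacle-free path from start to goal."""
    rows, cols = len(grid), len(grid[0])
    queue, came_from = deque([start]), {start: None}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            path, node = [], (r, c)
            while node is not None:      # walk parent links back to the start
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # the learned map says there is no collision-free route

grid = learned_perception_stub(camera_image=None)
print(plan_path(grid, start=(0, 0), goal=(2, 4)))
```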

In the long term, either pure learning or an entirely different set of approaches might prove best, but in the short to medium term, most experts agree that this ‘middle path’ is extremely promising.

Fifth, progress will be slower than it should be because roboticists suffer from the pathologies common to all researchers. Fortunately, robotics is not prone to the falsification and reproducibility problems we typically find in biotechnology, psychology, and even physics. However, there is a tendency to report only successes and to hide the failures that, if shared, would keep others from wasting their time. If we hope to deliver affordable general-purpose robots by 2040, roboticists will need to acknowledge and publicize what doesn’t work, as well as what does.

And, sixth, cost-effective general-purpose robots will be deployed in the next 20 years only if the pioneers are open to “out-of-the-box” thinking. Like artificial intelligence, robotics is a cutting-edge discipline where there’s much more to learn than we already know. And it’s important to remember that every one of the current approaches was made possible only because the researchers who introduced them were willing to go beyond the “conventional wisdom.”
