Computing After Moore's Law: Technology Briefing
Transcript

Every sector of the economy, from finance, communications, and health care to energy, transportation, and national security, depends on having faster, cheaper, and better computing capabilities. For over 60 years, those capabilities have rested on increasingly cost-effective silicon-based devices delivering a reliable stream of price-performance improvements. But that has all started to change: the gap between computing demand and computing supply is growing rapidly.

Artificial intelligence is a primary reason for this accelerating demand. We are seeing a revolution in what's possible due to the development of deep artificial neural networks. They talk, they listen, they paint pictures, they translate between languages, they play games, they drive, they read X-rays and CT scans, and they expand their capabilities every day. And before they are useful, they have to be trained. Training the latest LLMs can take months and demands enormous computing power. According to experts, the computing demand for training LLMs has been doubling roughly every 3.5 months over the past dozen years. And that's not all: even after they are trained, these models require enormous amounts of computing to do their jobs.

At the same time, improvements to the chip-making technology that enabled this modern revolution in computing and communication are slowing down. The reason gets back to the chip-printing process. For several decades, we have been able to print more and more circuit devices onto the same one-inch silicon chip, leading to more and more powerful processors and larger and larger memories. Why? Because chips are made of transistors and wires, and our printing process kept making the transistors and the wires smaller and smaller. Every 18 months we got twice as many transistors and twice as many wires in the same silicon area, a phenomenon first described by Intel's Gordon Moore. This repeated doubling is called Moore's Law.

Twice the transistors in the same chip area might have made for hotter and hotter chips, but that didn't happen, because we could use lower and lower voltages to drive the smaller wires and transistors. This phenomenon, known as Dennard scaling, let us pack in more and more transistors, and switch them faster, without increasing the power per unit of silicon area. Every few years we doubled the transistors per unit area of silicon without increasing power; both chip vendors and customers reaped a windfall.

But all good things come to an end. Over the last fifteen years, the Dennard scaling "free lunch" has gone away. Chip makers can no longer keep increasing circuit speeds, and processor clocks no longer get faster with each generation. Fortunately, thanks to continued Moore's Law improvement in transistor density, they were still able to squeeze more and more cores onto chips. But that too is coming to an end. Why? The size of wires and transistors is now down to the 50-angstrom range, while the unit cell of the silicon crystal lattice is 5 angstroms on a side. So we are approaching the limits of how much silicon chips can shrink, for two reasons. First, as Moore himself said, "there's no getting around the fact that we make these things out of atoms, and that will mean the end of Moore's Law." Quantum effects prevent devices from operating reliably once they are built from too few atoms. Second, the cost of new fabs grows dramatically with each new generation of chips, so Moore's Law could die an "economic death" even before it hits its ultimate physical limits.

[Figure: Expected design costs through 5 nm]
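To make the physical limit concrete, here is a minimal back-of-envelope sketch in Python using only the figures quoted above (features around 50 angstroms, a 5-angstrom lattice cell, density doubling roughly every 18 months) plus the standard assumption that each generation shrinks linear dimensions by about 0.7x to double density:

```python
import math

# Figures from the text: features are around 50 angstroms; the silicon
# unit cell is about 5 angstroms on a side.
feature_nm = 5.0       # ~50 angstroms = 5 nm
lattice_nm = 0.5       # ~5 angstroms = 0.5 nm

# Classic Moore's Law scaling assumption: each generation shrinks linear
# dimensions by roughly 0.7x, which doubles transistor density (0.7^2 ~ 0.5).
shrink_per_gen = 0.7

# How many such generations before features reach the lattice constant?
generations_left = math.log(lattice_nm / feature_nm) / math.log(shrink_per_gen)
print(f"Generations until features hit the lattice constant: {generations_left:.1f}")
# -> roughly 6-7 generations, ignoring the economic limits that bite sooner.

# Historical cadence: density doubling roughly every 18 months.
years = 10
doublings = years * 12 / 18
print(f"Density gain over {years} years at that cadence: {2**doublings:.0f}x")
```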
This all says that we're at a crossroads. AI will need much more computing than is available using today's chips, and we cannot expect the industry trends of the past few decades to provide the needed extra computing. With Dennard scaling and Moore's Law running out of gas, we need something new.

What will be the answer? Fortunately, in 2024 we witnessed two potentially game-changing breakthroughs that were decades in the making. As we'll explain, one breakthrough was the launch of the first commercial "wafer-scale computing solution." This seems ideally positioned to provide an enormous boost to computing in the decade ahead. As explained earlier, we can't hope to put ever more transistors onto each square millimeter of silicon; instead, this breakthrough lets us put more square millimeters of silicon onto each chip. The second breakthrough was the first successful demonstration of practical, scalable graphene semiconductors. For 20 years, nano-scientists pursued the elusive goal of transforming 2D carbon sheets into the raw material that could drive the next digital revolution. That work finally paid off this year, paving the way for a whole new era in computing. Together, these two breakthroughs promise to remove the first truly existential threat to the Digital Techno-Economic Revolution that began in the 1970s. To understand why this is so important, let's take a look at these two breakthroughs and how we expect them to unleash the full potential of the 21st-century economy.

We'll start by examining the latest surge of progress in wafer-scale electronics and the implications of this suddenly practical technology. A wafer is a silicon disk, now about 12 inches in diameter. A chip is typically a square no more than an inch on a side, much smaller than a wafer. Normally, many copies of the same chip are printed onto the wafer. Wafer-scale integration is the idea that you make a single chip out of the whole wafer; that means you skip the step of cutting the wafer up.

The reason engineers have long dreamed of developing wafer-scale computing, and making it work, is that it is a way around off-chip communication, the chief barrier to higher computer performance. Why? Because chip-to-chip data movement is much slower and much less energy-efficient than on-chip data movement. And there is another problem beyond the time and energy needed to access off-chip data: a data bottleneck at the chip boundary, caused by the relatively few, thick off-chip wires. On a conventional chip, the bottom of the package is covered with tiny connection points for wires. Most are used to provide power, but many, perhaps a thousand, are for moving data. And since that number of connections is often not enough to move all the data in a timely way, the chip boundary becomes a choke point, in the same way freeway onramps become congested at rush hour as everyone tries to use the same lanes at the same time.

In contrast, when the whole wafer is treated as a single unit containing many kinds of chips, on-chip wires can be packed much more tightly, and the manufacturer can include many more of them. For example, on the first such commercial product, known as the Cerebras Wafer-Scale Engine (or WSE), there are around 20,000 on-chip wires connecting each of the un-diced "chips" to the others, and more than 80 percent of these are dedicated to data rather than control. That's more than a tenfold improvement in communication bandwidth compared to traditional separated, diced, and packaged chips.
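A quick sanity check of that claim, using only the figures quoted above and the simplifying assumption of comparable per-wire signaling rates (in practice, off-package links often run faster per wire but cost far more energy per bit):

```python
# Rough comparison of data paths per "chip," using the figures quoted above.
conventional_data_pins = 1_000      # "perhaps a thousand" data connections per package
wse_wires_per_die = 20_000          # on-wafer wires per un-diced "chip" on the WSE
wse_data_fraction = 0.80            # more than 80 percent of those carry data

wse_data_wires = wse_wires_per_die * wse_data_fraction
print(f"Data paths per chip: {conventional_data_pins} vs {wse_data_wires:.0f}")
print(f"Ratio: {wse_data_wires / conventional_data_pins:.0f}x")  # ~16x, i.e. "more than tenfold"
```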
This raises an obvious question: if wafer-scale computing has so many benefits, why is it only emerging now? The answer is that it's technically very challenging. In the 1970s and 1980s there were attempts to create wafer-scale solutions, but they all failed. Texas Instruments and ITT tried to develop wafer-scale processors, and a company called Trilogy raised over $200 million to build wafer-scale, high-performance systems. But even though the wafers of that era were only about 90 mm across, the technical challenges of making, powering, cooling, and packaging such a wafer, and of overcoming manufacturing flaws, proved beyond what was then possible.

Fortunately, things are different today. In August 2019, Cerebras first announced its Wafer Scale Engine; the current generation, with four trillion transistors on 46,225 square millimeters of silicon, is the largest chip ever built by a factor of 56. Creating this revolutionary product took years and required overcoming some major hurdles. What were they? First, Cerebras had to invent a way to communicate across the so-called "scribe lines" that normally separate chips. Second, the problem of yield-limiting defects had to be resolved. Third, a unique package had to be designed and built for the wafer, one compatible with power delivery and cooling that also keeps the wafer structurally stable and durable despite cycles of on-and-off power and variable workloads. And fourth, the power delivery and cooling problems themselves had to be solved. These challenges were so big that it took Cerebras from 2019 to 2024 to deliver a truly competitive commercial product.

Consider just one example. Because manufacturing methods have improved since the 1980s, defect rates have dropped, but not enough to ensure that all cores on a massive wafer are good. The WSE's uniform, small-core architecture enables redundancy that addresses these yield issues at very low cost: the design builds in extra cores, and the interconnections between cores make it feasible to "map around" individual defective ones. When bad cores are identified at manufacturing time, the interconnects are configured to avoid them, using instead the spare cores that are part of the physical design. This makes it possible to deliver a defect-free array of the designed size in the shipped product, despite inevitable manufacturing errors.
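Cerebras has not published its exact remapping scheme, so the following is only a toy Python sketch of the general idea: fabricate more physical cores than the logical array needs, then assign logical positions only to the cores that pass manufacturing test.

```python
# Toy illustration of "mapping around" defective cores with built-in spares.
# The real WSE remapping algorithm is not public; this only shows the concept.
import random

LOGICAL_PER_ROW = 6    # logical cores promised per row of the shipped array
PHYSICAL_PER_ROW = 8   # physical cores fabricated per row (includes spares)

def map_row(defective):
    """Assign logical slots to good physical cores in one row, skipping defects."""
    good = [core for core in range(PHYSICAL_PER_ROW) if core not in defective]
    if len(good) < LOGICAL_PER_ROW:
        return None                      # not enough good cores: row unrepairable
    return good[:LOGICAL_PER_ROW]        # logical index i -> physical core good[i]

# Simulate one wafer row in which each physical core fails with some probability
# (failure rate exaggerated here purely for illustration).
random.seed(0)
defects = {core for core in range(PHYSICAL_PER_ROW) if random.random() < 0.3}
print("defective cores:", sorted(defects))
print("logical -> physical map:", map_row(defects))
```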
The most important part of the wafer-scale story is not that it has overcome the many hurdles blocking its path, but that it has arrived at just the right time to make an enormous difference in the computing industry. Until now, the major role of computing in artificial intelligence has been training LLMs, and throughout this phase of the Generative AI revolution, Nvidia's GPUs have dominated the market. However, in the next 18 months, industry experts expect the market to reach an inflection point as the AI projects that many companies have been training and developing are finally deployed. At that point, AI workloads will shift from training to what the industry calls inference. Market researchers forecast that the AI inference market is on the cusp of explosive growth; for instance, Verified Market Research predicts that it will reach $90.6 billion by 2030.

At that point, speed and efficiency become much more important. And that raises the trillion-dollar question: will Nvidia's line of GPUs be able to maintain its top position in the era of inference? To answer that question, let's take a deeper look. Inference is the process by which a trained AI model evaluates new data and produces results, such as users chatting with an LLM or a self-driving car maneuvering through traffic. This is distinct from training, when the model is being shaped behind the scenes before being released. Performance during inference is critical to all AI applications, whether they involve split-second real-time interactions or the data analytics that drive long-term decision-making. Historically, AI inference has been performed on GPUs, largely because GPUs are far better than CPUs at the parallel computation needed to train over massive datasets. However, as demand for heavy inference workloads increases, GPUs consume significant power, generate high levels of heat, and are expensive to operate.

The third-generation Cerebras Wafer-Scale Engine has emerged just in time to exploit this opportunity. The CS-3 is a revolutionary AI processor that sets a new bar for inference performance and efficiency. It boasts 4 trillion transistors, is 56 times larger than the biggest GPUs, and contains 3,000 times more on-chip memory, making it the largest neural network chip ever produced. Consequently, individual WSE chips can handle huge workloads without having to be networked together, an architectural advantage that enables faster processing, greater scalability, and reduced power consumption. Furthermore, the CS-3 excels with LLMs. Benchmarks indicate that this Cerebras chip can process 1,800 tokens per second for the Llama 3.1 8B model, far outpacing current GPU-based solutions. Moreover, with pricing starting at just 10 cents per million tokens, it is very cost-effective. That means the Cerebras Wafer Scale Engine not only represents the first serious challenge to Nvidia's LLM hegemony, but also demonstrates that wafer-scale processors have the potential to resolve many short- and medium-term challenges that threaten to stymie computing over the next decade.
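To get a feel for the scale those quoted figures imply, here is a trivial Python calculation using only the throughput and price cited above (these are the vendor's published benchmark and list price, not measured operating costs):

```python
# Back-of-envelope on the quoted inference figures (Llama 3.1 8B on the CS-3).
tokens_per_second = 1_800            # reported throughput
price_per_million_tokens = 0.10      # dollars, reported starting price

tokens_per_day = tokens_per_second * 86_400
cost_per_day = tokens_per_day / 1_000_000 * price_per_million_tokens
print(f"Tokens per day at that rate: {tokens_per_day:,.0f}")    # ~155.5 million
print(f"Cost at the quoted price:    ${cost_per_day:,.2f}/day") # ~$15.55
```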
Now let's consider what lies beyond the next few years, when wafer-scale technology might buy silicon an extra decade. Fortunately, 2024 saw the dawning of another breakthrough that's potentially even bigger. As explained in the journal Nature, it took a research team at Georgia Tech led by Professor Walt de Heer 20 years to develop the first useful graphene-based semiconductor. A long-standing problem in graphene electronics is that graphene lacked the right band gap, so devices couldn't switch on and off at a useful ratio. Over the years, many researchers tried to address this with a variety of methods. The current effort started when de Heer produced so-called epitaxial graphene, a single layer that grows on a crystal face of silicon carbide. His team found that when this layer was made properly, the epitaxial graphene chemically bonded to the silicon carbide and began to show semiconducting properties. Over the next decade, they persisted in perfecting the material. The resulting technology achieved a 0.6 electron-volt band gap, a crucial step toward realizing graphene-based electronics.

Furthermore, the team's measurements showed that this graphene semiconductor has 10 times greater mobility than silicon. In other words, the electrons move with very low resistance, which in electronics translates to faster computing while producing far less heat. That not only means less cooling but also far less energy consumption. Just as important, graphene can also be used in components that operate in the terahertz part of the electromagnetic spectrum. Terahertz frequencies are 100 to 1,000 times higher than the clock rates of today's computers and have been suggested for use in future 6G communications systems. As such, this technology provides a viable path toward solving the out-of-control datacenter energy consumption problem as well as bypassing the processor speed limit we now face with silicon.

Another key appeal of epitaxial graphene is that it's compatible with conventional microelectronics manufacturing processes, meaning that it can be readily integrated with existing computing devices. So, while it's possible that something else will come along, the Georgia Tech discovery is currently the only two-dimensional semiconductor with all the properties needed for nanoelectronics, and its electrical properties are far superior to those of any other 2D semiconductor currently in development. Consequently, it's not unreasonable to think that epitaxial graphene could cause a paradigm shift in electronics, allowing for completely new technologies that take advantage of its unique properties. For example, the material allows the quantum mechanical wave properties of electrons to be exploited, which is a requirement for quantum computing.

Importantly, semiconducting epi-graphene and wafer-scale computing have significant synergies. When heated, carbon atoms migrate from the carbon face to the silicon face of the silicon carbide, forming a buffer layer chemically bonded to the substrate. So it should be possible to produce wafer-scale, single-crystal semiconducting epi-graphene (or SEG). Indeed, the lead researchers observe that there are no major hurdles to producing semiconducting epi-graphene at wafer scale.
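Before moving on, here is an idealized Python comparison of why the mobility figure quoted earlier matters for speed. The channel length and drive voltage are hypothetical, the silicon value is the textbook bulk electron mobility of roughly 1,400 cm²/V·s, and the epi-graphene value simply applies the 10x ratio reported above; the model ignores velocity saturation, contacts, and parasitics, so only the relative factor is meaningful.

```python
# Idealized transit-time comparison: with the same channel length L and voltage V,
# carrier transit time scales as t = L^2 / (mu * V), so 10x mobility implies
# roughly 10x faster intrinsic switching under these simplifying assumptions.
channel_cm = 100e-7          # hypothetical 100 nm channel, in cm
volts = 1.0                  # hypothetical drive voltage
mu_si = 1_400.0              # textbook bulk-silicon electron mobility, cm^2/(V*s)
mu_seg = 10 * mu_si          # "10 times greater mobility" per the Georgia Tech result

for name, mu in [("silicon", mu_si), ("epi-graphene", mu_seg)]:
    transit_s = channel_cm**2 / (mu * volts)
    print(f"{name:>12}: transit time ~ {transit_s * 1e15:.0f} fs")
```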
What's the bottom line? Even as the industry comes to the end of the silicon road mapped by Moore's Law, new breakthroughs are showing us a way forward. Two of the most exciting are wafer-scale silicon processors and semiconducting epi-graphene. These potential game-changers have been in development for 20 years or more, but only emerged as real-world solutions in 2024. The question is when and to what degree they will reignite the Digital Techno-Economic Revolution. Rather than being overwhelmed by uncertainty, it's time for technologists, investors, consumers, and policy makers to realize that these and other opportunities are all around us. We can either choose to seize them or fall victim to despair.

Given this trend, we offer the following forecasts for your consideration.

First, within the next five years, wafer-scale technology will begin to play a major role in AI inference applications. With the primary hurdles breached, this solution will become increasingly attractive to cloud computing vendors like AWS. Advantages in capital investment, speed, and energy savings will win the day, and Nvidia and other GPU vendors will be forced to jump on the wafer-scale bandwagon or be left behind.

Second, as we move into the 2030s, semiconducting epi-graphene (or SEG) will increasingly supplant silicon in electronic devices. After 60-plus years, experts widely concede that silicon is reaching the end of its useful life at the cutting edge of computing. By combining terahertz frequencies with low energy consumption and compatibility with existing manufacturing processes, epitaxial graphene on silicon carbide offers an ideal solution for chip vendors as well as users. Unless unanticipated technical problems arise, this will become the technology of choice in the years ahead.

Third, by the late 2030s, wafer-scale processors based on SEG will dominate data centers. If the Digital Techno-Economic Revolution is going to reach its full potential, computing will need to move beyond the constraints imposed by silicon and inter-chip communications. Combining the two breakthroughs discussed here puts us on the path to doing just that.