How LLMs are Changing Computing
We can and likely will scale the number of parameters infinitely.
Edited by Brian Birnbaum.
LLMs (large language models) like ChatGPT are emerging fast. As you add more parameters to these models, new capabilities emerge that were never explicitly coded for. From GPT-1 to GPT-3, we went from an interesting chatterbox to something that begins to exhibit a form of generalized intelligence.
Thus, as we keep adding parameters to these models, we can expect all sorts of useful capabilities to keep emerging. We also know how to add those parameters, such that a model with 100 trillion parameters or more is now conceivable and is the likely result of continuing down the current path. The industry has its hands full, with a clear execution roadmap.
Over the next decade, we will likely see the number of parameters in these models grow exponentially. They will become a ubiquitous computing platform, commoditizing intelligence across the board. Regarding the resulting computing requirements, two things stand out:
These models are large and will therefore require an ever-growing amount of memory to store (see the back-of-the-envelope sketch after this list).
As the models get larger, latency needs to keep decreasing for them to be truly useful. No one will want slow models.
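To make the memory point concrete, here is a rough back-of-the-envelope sketch in Python. The 2 bytes per parameter (fp16) and the 80 GB-per-GPU reference point are my own illustrative assumptions, not figures from any vendor, and the sketch ignores activations, optimizer state and KV caches, which only make the problem worse.

```python
# Back-of-the-envelope: memory needed just to store model weights.
BYTES_PER_PARAM = 2  # assumption: fp16/bf16 weights

def weight_memory_gb(num_params: float) -> float:
    """Gigabytes required to hold the raw weights."""
    return num_params * BYTES_PER_PARAM / 1e9

for name, params in [
    ("GPT-3 (175B)", 175e9),
    ("1 trillion", 1e12),
    ("100 trillion", 100e12),
]:
    gb = weight_memory_gb(params)
    # 80 GB is used here as a rough stand-in for a single high-end GPU today.
    print(f"{name}: ~{gb:,.0f} GB of weights (~{gb / 80:,.0f} x 80 GB GPUs)")
```

Even under these generous assumptions, a 100-trillion-parameter model needs on the order of thousands of today's GPUs just to hold its weights, before any computation happens.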
Today's GPUs do not have the architecture to accommodate the ever-growing number of parameters, because they cannot scale memory arbitrarily. Without enough memory, fitting the exponentially growing parameter counts of successive models becomes problematic. You can break a model into pieces, place each piece on a different GPU and stitch the outputs together, but this approach scales poorly and adds tremendous complexity.
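To show what splitting a model across GPUs looks like in practice, here is a minimal sketch of naive model parallelism in PyTorch. It assumes two CUDA devices are available; the layer sizes, device names and `TwoGPUModel` class are illustrative only, and real tensor- or pipeline-parallel frameworks are far more involved than this.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy network split across two GPUs, with activations shuttled between them."""

    def __init__(self):
        super().__init__()
        # Each half of the model lives on a different device.
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Every hop between devices adds transfer latency and code complexity,
        # which is exactly the scaling pain point described above.
        x = self.part2(x.to("cuda:1"))
        return x

if __name__ == "__main__":
    model = TwoGPUModel()
    out = model(torch.randn(8, 4096))
    print(out.shape)  # torch.Size([8, 4096])
```

With only two devices this is manageable; spread a trillion-parameter model across hundreds of GPUs and the device-to-device hops and orchestration logic dominate both the latency and the engineering effort.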
What we need is an architecture that disaggregates memory from compute, so that both can be scaled arbitrarily while staying on the same chip to minimize latency. One such approach is Wafer Scale Computing: building a single, gigantic chip that spans an entire silicon wafer, rather than cutting that wafer into many smaller individual chips as is traditionally done in semiconductor manufacturing.
Cerebras Systems, a California-based technology company, is one of the pioneers in wafer-scale computing. It has developed the Cerebras Wafer Scale Engine (WSE), a massive chip designed specifically for AI workloads. However, this sort of chip is the opposite of chiplets: if one part of the chip comes out defective, you risk having to throw the entire thing away, which makes great yields harder to obtain than with smaller dies.
Whether Cerebras will be a winner is a topic for a deep dive and, for now, the company is private. Regardless, the fundamental takeaway from this post is that LLMs will scale exponentially, and to accommodate them we need an architecture that can scale memory at will whilst decreasing latency. Whilst AMD and Nvidia do not do Wafer Scale Computing, they are well positioned to bring about disaggregated architectures with their respective interconnect technologies, Infinity Fabric and NVLink.
Until next time!
⚡ If you enjoyed the post, please feel free to share with friends, drop a like and leave me a comment.
You can also reach me at:
Twitter: @alc2022
LinkedIn: antoniolinaresc