How LLMs are Changing Computing
We can and likely will scale the number of parameters infinitely.
Edited by Brian Birnbaum.
LLMs (large language models) like ChatGPT are emerging fast. As you add more parameters to these models, new capabilities emerge that were never explicitly coded for. From GPT-1 to GPT-3, we went from an interesting chatterbox to something that begins to exhibit a form of generalized intelligence.
Thus, as we keep adding parameters to these models, we can expect all sorts of useful capabilities to keep emerging. We also know how to add those parameters, such that a model with 100 trillion parameters or more is now conceivable and is the likely result of continuing down the current path. The industry has its hands full, with a clear execution roadmap.
Over the next decade, we will likely see the number of parameters in these models grow exponentially. They will become a ubiquitous computing platform, commoditizing intelligence across the board. Regarding the resulting computing requirements, two things stand out:
These models are large and will therefore require an ever-growing amount of memory to store (see the back-of-the-envelope sketch after this list).
As the models get larger, latency needs to keep decreasing for them to be truly useful. No one will want slow models.
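To make the memory point concrete, here is a rough back-of-the-envelope sketch in Python. The 2 bytes per parameter (fp16) and the 80 GB-per-GPU reference point are my own illustrative assumptions, not figures from any vendor, and the sketch ignores activations, optimizer state and KV caches, which only make the problem worse.

```python
# Back-of-the-envelope: memory needed just to store model weights.
BYTES_PER_PARAM = 2  # assumption: fp16/bf16 weights

def weight_memory_gb(num_params: float) -> float:
    """Gigabytes required to hold the raw weights."""
    return num_params * BYTES_PER_PARAM / 1e9

for name, params in [
    ("GPT-3 (175B)", 175e9),
    ("1 trillion", 1e12),
    ("100 trillion", 100e12),
]:
    gb = weight_memory_gb(params)
    # 80 GB is used here as a rough stand-in for a single high-end GPU today.
    print(f"{name}: ~{gb:,.0f} GB of weights (~{gb / 80:,.0f} x 80 GB GPUs)")
```

Even under these generous assumptions, a 100-trillion-parameter model needs on the order of thousands of today's GPUs just to hold its weights, before any computation happens.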
Today's GPUs do not have the architecture to accommodate the ever-growing number of parameters, because they cannot scale memory arbitrarily. Without enough memory, fitting the exponentially growing parameter counts of successive models becomes problematic. You can break a model into pieces, place each piece on a different GPU and stitch the outputs together, but this approach scales poorly and adds tremendous complexity.
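To show what splitting a model across GPUs looks like in practice, here is a minimal sketch of naive model parallelism in PyTorch. It assumes two CUDA devices are available; the layer sizes, device names and `TwoGPUModel` class are illustrative only, and real tensor- or pipeline-parallel frameworks are far more involved than this.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy network split across two GPUs, with activations shuttled between them."""

    def __init__(self):
        super().__init__()
        # Each half of the model lives on a different device.
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Every hop between devices adds transfer latency and code complexity,
        # which is exactly the scaling pain point described above.
        x = self.part2(x.to("cuda:1"))
        return x

if __name__ == "__main__":
    model = TwoGPUModel()
    out = model(torch.randn(8, 4096))
    print(out.shape)  # torch.Size([8, 4096])
```

With only two devices this is manageable; spread a trillion-parameter model across hundreds of GPUs and the device-to-device hops and orchestration logic dominate both the latency and the engineering effort.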
What we need is an architecture that disaggregates memory from compute, so that both can be scaled arbitrarily while staying on the same chip to minimize latency. One such approach is Wafer Scale Computing: building a single, gigantic chip that spans an entire silicon wafer, rather than cutting that wafer into many smaller individual chips as is traditionally done in semiconductor manufacturing.
Cerebras Systems, a California-based technology company, is one of the pioneers in wafer-scale computing. It has developed the Cerebras Wafer Scale Engine (WSE), a massive chip designed specifically for AI workloads. However, this sort of chip is the opposite of chiplets: if one part of the chip comes out defective, you risk having to throw the entire thing away, which makes great yields harder to obtain than with smaller dies.
Whether Cerebras will be a winner is a topic for a deep dive and, for now, the company is private. Regardless, the fundamental takeaway from this post is that LLMs will scale exponentially, and to accommodate them we need an architecture that can scale memory at will whilst decreasing latency. Whilst AMD and Nvidia do not do Wafer Scale Computing, they are well positioned to bring about disaggregated architectures with their respective interconnect technologies, Infinity Fabric and NVLink.
Until next time!
⚡ If you enjoyed the post, please feel free to share with friends, drop a like and leave me a comment.
You can also reach me at:
Twitter: @alc2022
LinkedIn: antoniolinaresc