Memory godboxes could offer relief from the RAMpocalypse

In modern datacenters, storage can live anywhere — local to the machine, remotely accessed over the network, and/or shared between systems.

The next generation of servers will treat system memory in much the same way. Systems will still have some local DDR5, but the bulk of it will be remotely accessed from what some have taken to calling the memory godbox.

The ongoing DRAM shortage has created a perfect storm for the proliferation of these appliances, which not only allow memory to be pooled, but also allow data stored in that memory to be shared by multiple machines simultaneously. In effect, memory becomes a fungible resource.

More importantly, your next round of servers will probably support the tech, if they don’t already.

CXL finally has its moment to shine

The technology at the heart of these memory godboxes isn’t new. Compute Express Link (CXL) has been slowly gaining traction since its introduction seven years ago.

As a quick refresher, CXL defines a common, cache-coherent interface for connecting CPUs, memory, accelerators, and other peripherals.

The technology comes in three flavors: CXL.mem, CXL.cache, and CXL.io, which together have implications for disaggregated compute. Imagine a rack with a CPU node, GPU node, memory node, and storage node, all able to talk to one another completely independently. That's the core idea behind CXL.

CXL piggybacks off the PCIe standard, which means in theory it should be broadly compatible, but, up to this point, it’s primarily been used with memory devices.

The 1.0 spec opened the door to memory expansion modules, which allow you to add more memory by slotting them into a CXL-compatible PCIe slot. To the operating system (assuming you're running Linux, that is), the extra memory is largely transparent, showing up as if it were attached to another CPU socket, just one without any additional compute.
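
For the curious, that memory-only node is easy to spot from sysfs. Here's a minimal sketch, assuming a Linux host; the paths are standard, but whether a CPU-less node is actually CXL-backed depends on the platform:

```python
# Minimal sketch: list NUMA nodes and flag the CPU-less ones, which is how
# CXL memory expanders typically show up to Linux. Assumes /sys is mounted.
from pathlib import Path

def numa_nodes():
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        meminfo = (node / "meminfo").read_text()
        # The MemTotal line looks like: "Node 2 MemTotal:  268435456 kB"
        mem_kb = next(int(line.split()[3])
                      for line in meminfo.splitlines() if "MemTotal" in line)
        yield node.name, cpulist, mem_kb

for name, cpus, mem_kb in numa_nodes():
    kind = f"cpus {cpus}" if cpus else "memory-only (possibly CXL)"
    print(f"{name}: {mem_kb // 1024} MiB, {kind}")
```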

The 2.0 spec, which showed up in 2020, added basic support for switching, which meant memory could be pooled and then allocated to any number of connected systems.

AMD and Intel’s current crop of Epycs and Xeons already support these appliances. But while the memory can be partitioned and reallocated to different machines as needed, two machines can’t work on the same data simultaneously.
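
To make the distinction concrete, here's a toy model of 2.0-style pooling (not a real CXL or vendor API, just an illustration of the exclusive-ownership rule):

```python
# Toy model (illustrative only, not a real CXL or vendor API): 2.0-style
# pooling, where every slice of the pool is owned by exactly one host.
class MemoryPool:
    def __init__(self, total_gib: int):
        self.free_gib = total_gib
        self.slices = {}                      # host -> GiB currently assigned

    def assign(self, host: str, gib: int) -> None:
        if gib > self.free_gib:
            raise MemoryError("pool exhausted")
        self.free_gib -= gib
        self.slices[host] = self.slices.get(host, 0) + gib

    def release(self, host: str) -> None:
        self.free_gib += self.slices.pop(host, 0)

pool = MemoryPool(total_gib=1024)
pool.assign("host-a", 256)   # host-a sees an extra 256 GiB of far memory
pool.assign("host-b", 512)   # host-b gets its own, separate slice
pool.release("host-a")       # the capacity goes back to the pool for reuse
```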

Unless you were memory-constrained, the added complexity of CXL 2.0 didn't offer much benefit over simply using higher-capacity DIMMs in the first place.

At least, not until memory prices went through the roof.

Where things really get interesting is when the 3.0 spec arrives in AMD and Intel's next generation of Epycs and Xeons. In fact, from what we understand, Amazon's Graviton5 CPUs, which we looked at in December, already support the spec.

CXL 3.0 introduces two key capabilities that make it particularly interesting for memory appliances. The first is support for larger topologies: Multiple CXL switches can be stitched together into a fabric. The second is support for memory sharing: Rather than partitioning memory into slices only accessible to one machine at a time, memory can be shared between machines.

In theory, this could allow two machines running the same set of workloads to get by with a combined memory footprint closer to that of one. It's a bit like deduplication for memory. In fact, we already do this within a single host in virtualized environments like KVM; CXL 3.0 extends the idea across machines.
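
Within a single Linux host, the usual mechanism behind that dedup is kernel same-page merging (KSM). Here's a minimal sketch of how a process opts memory into it, assuming a kernel built with KSM and Python 3.8 or newer:

```python
# Minimal sketch: opt an anonymous buffer into kernel same-page merging (KSM),
# the intra-host cousin of the cross-machine sharing described above.
# Assumes Linux with CONFIG_KSM and Python 3.8+ (for mmap.madvise).
import mmap

PAGE = mmap.PAGESIZE
buf = mmap.mmap(-1, 64 * PAGE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf.write(b"\x00" * 64 * PAGE)        # identical pages: prime dedup candidates
buf.madvise(mmap.MADV_MERGEABLE)      # ask ksmd to merge duplicate pages
# If ksmd is running, merge statistics show up under /sys/kernel/mm/ksm/.
```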

There are security and performance implications to all of this. Thankfully, in CXL 3.1 and later, the consortium introduced confidential computing capabilities into the spec, allowing for isolation where necessary.

On the performance end of things, CXL 3.0 moves to PCIe 6.0 as a baseline, which provides 16 GB/s of bidirectional bandwidth per lane (roughly 8 GB/s in each direction). Assuming 64 lanes of CXL per CPU, that works out to an additional 512 GB/s of bandwidth in each direction. So memory bandwidth shouldn't be too much of an issue for most applications. Latency, on the other hand, is a different story.

CXL-attached memory is going to add some latency. However, as we’ve previously discussed, the latency isn’t as bad as you’re probably thinking — on the order of a NUMA hop, or about 170 to 250 nanoseconds of round trip latency. Obviously, the farther the memory appliance is from the host CPU, the worse the latency is going to be.
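
You can already get a feel for this on a live box: the kernel publishes a relative distance matrix for NUMA nodes, and far memory generally shows up with a noticeably larger distance than local DRAM. A quick way to eyeball it, assuming Linux (the values are relative costs, not nanoseconds):

```python
# Minimal sketch: print the kernel's NUMA distance matrix. Local DRAM is
# typically 10, a socket hop ~20ish, and CXL/far memory larger still.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    print(node.name, distances)
```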

Late last year, the CXL consortium ratified the 4.0 spec, which, among other things, doubles per-lane bandwidth from 16 GB/s to 32 GB/s by rebasing on PCIe 7.0. However, it'll be a while before we see appliances based on the spec.

Where’s my memory godbox?

There are several companies developing hardware for these kinds of networked memory appliances. 

Panmnesia's CXL 3.2-compatible PanSwitch is one of the most sophisticated examples. The switch features 256 lanes of connectivity, allowing CXL memory modules, devices, and CPUs to connect to one another and pool or share resources.

If you’re okay with memory pooling and don’t need the niceties of CXL 3.0, then there are already several memory appliances available that are compatible with the latest generation of Xeon 6 and Epyc Turin processors.

Liqid's composable memory platform, for example, can provide a pool of up to 100 TB of DDR5 to as many as 32 hosts. Meanwhile, UnifabriX Max systems provide CXL 1.1 or 2.0 connectivity to 16 or more hosts, with support for CXL 3.2 already in the works.

We suspect that as more CXL 3.0 compatible CPUs and GPUs hit the market, more of these memory godboxes will appear.

AI eats everything

Don't get too excited. While network-attached memory has the potential to reduce an enterprise's infrastructure spend, those same qualities make it attractive for the very thing driving the memory shortage in the first place.

AI adoption has driven demand for DRAM off the charts. In addition to the HBM used by GPUs, DDR5 is being used for key-value (KV) cache offload during inference.

These KV caches store model state and, in multi-tenant serving scenarios, can chew through significant amounts of memory, often more than the model itself.

Rather than discard these caches and recompute them when the model state is needed again, it's more efficient to offload them to system memory and eventually flash storage.
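
The tiering idea itself is simple enough. Here's a toy sketch with hypothetical names; real inference servers implement this very differently, and the eviction policy here is plain LRU:

```python
# Toy sketch of KV-cache tiering (hypothetical names, not a real inference
# server API): hot entries stay in local DRAM, colder ones spill to far
# (e.g. CXL-attached) memory, and only the coldest fall through to flash.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, dram_slots: int, far_slots: int):
        self.dram = OrderedDict()    # hottest tier, strictly bounded
        self.far = OrderedDict()     # slower, but far cheaper per gigabyte
        self.flash = {}              # last resort before recomputation
        self.dram_slots, self.far_slots = dram_slots, far_slots

    def put(self, session_id: str, kv_blob: bytes):
        self.dram[session_id] = kv_blob
        self.dram.move_to_end(session_id)
        if len(self.dram) > self.dram_slots:       # spill coldest DRAM entry
            sid, blob = self.dram.popitem(last=False)
            self.far[sid] = blob
        if len(self.far) > self.far_slots:         # then spill onward to flash
            sid, blob = self.far.popitem(last=False)
            self.flash[sid] = blob

    def get(self, session_id: str):
        for tier in (self.dram, self.far, self.flash):
            if session_id in tier:
                blob = tier.pop(session_id)
                self.put(session_id, blob)         # promote back to DRAM
                return blob
        return None                                # miss: recompute the cache
```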

The problem with using flash storage is that it has a finite write endurance. After a while it wears out. Instead, CXL memory vendors are positioning the tech as a more resilient alternative.

That’s bad news for enterprises looking to these memory godboxes for salvation from the RAMpocalypse. ®
