I’m trying to plan a new multi‑GPU deep learning workstation and I keep seeing mixed takes on whether NVLink still matters in 2026. A few years ago it sounded like the “must-have” feature for multi‑GPU training, but now people talk more about faster PCIe, better software sharding, and just using distributed training without relying on GPU-to-GPU memory pooling.
My use case is mostly PyTorch training and some inference experimentation with larger models (think hitting VRAM limits and wanting to avoid aggressive gradient checkpointing). I’m debating between a setup with two GPUs that can be bridged vs. two similar GPUs without NVLink, and I’m not sure what I’d actually gain in practice. Specifically: does NVLink meaningfully help with (1) all-reduce / gradient sync speed compared to PCIe, and (2) model parallelism where tensors have to move between GPUs a lot?
I’m also confused about whether “combined VRAM” is still a realistic expectation with current frameworks, or if that’s basically not a thing unless you’re using very specific setups.
For someone building a 2‑GPU deep learning rig today, when is NVLink still genuinely useful, and when is it not worth optimizing the build around?
TL;DR: NVLink’s still nice when you’re doing lots of cross‑GPU tensor traffic (model/pipeline parallel), but for plain data-parallel PyTorch it’s usually “meh” vs good PCIe + sane batch sizing. And “combined VRAM” is mostly not a thing unless you’re doing explicit sharding.
For your situation, I’ve done two‑GPU training both ways and honestly the biggest win I saw from NVLink was when I was forced into model parallel stuff (activations flying back and forth). All‑reduce/grad sync? It helped a bit, but it wasn’t night-and-day unless my step time was already dominated by comms. Also, I learned the hard way that you don’t magically get one big VRAM pool… PyTorch still treats them as separate devices, so you’re basically in FSDP/ZeRO/tensor-parallel land if you want to “use both.” If NVLink costs real money or limits your card choices, I’d be careful. gl!
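If you want to sanity-check the “comms only matter when comms dominate” point, a back-of-envelope calc gets you surprisingly far. The bandwidth figures below are rough assumptions (≈25 GB/s effective for PCIe 4.0 x16, ≈50 GB/s for a two-card NVLink bridge), not measurements from any specific hardware:

```python
# Back-of-envelope gradient all-reduce time for a 2-GPU data-parallel step.
# Bandwidths are illustrative assumptions, not benchmarks.

def allreduce_seconds(n_params: float, bytes_per_param: int,
                      bw_gb_s: float, n_gpus: int = 2) -> float:
    """Ring all-reduce moves ~2*(n-1)/n of the gradient bytes per GPU."""
    grad_bytes = n_params * bytes_per_param
    wire_bytes = grad_bytes * 2 * (n_gpus - 1) / n_gpus
    return wire_bytes / (bw_gb_s * 1e9)

model = 1.3e9  # 1.3B params, fp16 gradients
pcie = allreduce_seconds(model, 2, 25)    # ~25 GB/s assumed for PCIe 4.0 x16
nvlink = allreduce_seconds(model, 2, 50)  # ~50 GB/s assumed for an NVLink bridge
print(f"PCIe:   {pcie * 1000:.0f} ms/step")   # → PCIe:   104 ms/step
print(f"NVLink: {nvlink * 1000:.0f} ms/step")  # → NVLink: 52 ms/step
```

Either way it’s tens of milliseconds per step, and DDP overlaps the all-reduce with the backward pass anyway, which is exactly why compute-bound runs barely notice which link they’re on.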
For your situation, NVLink is still useful, but only in the “lots of GPU-to-GPU traffic” cases. For plain 2‑GPU PyTorch data-parallel, I honestly didn’t feel a night-and-day difference vs a solid PCIe setup… it was more like “nice to have” than “must-have”.
What I learned messing with a 2‑GPU rig (some runs bridged, some not):
- **(1) All-reduce / grad sync:** If your step time is dominated by compute, NVLink barely moves the needle. When I cranked batch size down (so comms mattered more), NVLink helped a bit, but it wasn’t magically doubling throughput. You still hit other bottlenecks first (CPU, dataloader, kernel launch overhead, etc.).
- **(2) Model/pipeline parallel:** This is where NVLink felt REAL. If you’re doing tensor/pipeline parallel or sharding that constantly moves activations/weights across GPUs, lower latency + higher bandwidth actually shows up. Otherwise… meh.
- **“Combined VRAM”:** yeah so, it’s not like you get one big pool automatically. In PyTorch, it only happens if you explicitly shard (FSDP/ZeRO-ish approaches) or do model parallel. If your model isn’t written/sharded for it, you’ll still OOM on the “biggest layer” GPU.
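To make the “no free VRAM pool” point concrete, here’s a rough sketch of the standard mixed-precision Adam accounting (≈16 bytes/param for model states: fp16 weight + fp16 grad = 4, plus fp32 master copy, momentum, and variance = 12; activations excluded), assuming ZeRO-3/FSDP-style full sharding simply divides those states across GPUs:

```python
# Rough per-GPU memory for *model states* under mixed-precision Adam.
# 16 bytes/param is the usual accounting; activations/buffers not included.

def model_state_gb(n_params: float, n_gpus: int, sharded: bool) -> float:
    bytes_per_param = 16
    per_gpu = bytes_per_param / n_gpus if sharded else bytes_per_param
    return n_params * per_gpu / 1e9

params = 7e9  # a 7B-param model
print(f"unsharded: {model_state_gb(params, 2, False):.0f} GB per GPU")  # → 112 GB
print(f"ZeRO-3:    {model_state_gb(params, 2, True):.0f} GB per GPU")   # → 56 GB
```

Note that even *fully sharded*, a 7B model’s states blow past two 24 GB cards, which is why people stack checkpointing or CPU offload on top of sharding. NVLink changes how fast the shards move, not how big they are.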
So I’d say: optimize around NVLink only if you *know* you’ll be doing heavy cross‑GPU tensor movement to dodge checkpointing/VRAM limits. Otherwise, I wouldn’t contort the whole build around it. gl!
Hmm, I’ve had a different experience than the “NVLink is mostly meh” takes… IMO if you’re *already* shopping in the tier where NVLink exists, it can be worth paying for, just not for the reasons people think.
For (1) all-reduce: on 2 GPUs I’d say NVLink is only a big deal if you’re doing fat comms every step (big model, big batch, fp32-ish, lots of gradients). If you’re more typical PyTorch DDP, decent PCIe 4/5 is often “good enough” and you’ll hit dataloader/optimizer/memory bottlenecks first. But when I did long runs where utilization kept dipping during sync, faster GPU↔GPU did actually smooth things out.
For (2) model parallel: this is where NVLink is legit. If you’re doing tensor/pipeline parallel or sharded attention blocks and you’re constantly moving activations, NVLink helps *a lot* vs PCIe. It’s not magic, but it hurts less.
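To see why the link matters so much more here, it helps to estimate the activation traffic. The sketch below assumes Megatron-style tensor parallelism with roughly two activation all-reduces per transformer layer in forward and two in backward; every number is illustrative, not taken from a specific model:

```python
# Rough per-step cross-GPU activation traffic for Megatron-style tensor
# parallelism on 2 GPUs. All figures are illustrative assumptions.

def tp_traffic_gb(batch: int, seq: int, hidden: int, layers: int,
                  bytes_per: int = 2,            # fp16 activations
                  allreduces_per_layer: int = 4  # 2 fwd + 2 bwd (assumed)
                  ) -> float:
    act_bytes = batch * seq * hidden * bytes_per
    return act_bytes * allreduces_per_layer * layers / 1e9

gb = tp_traffic_gb(batch=8, seq=2048, hidden=4096, layers=32)
print(f"~{gb:.0f} GB of activations cross the link per step")  # → ~17 GB
```

~17 GB/step is ~0.7 s over a ~25 GB/s PCIe link vs ~0.34 s over a ~50 GB/s NVLink bridge, and unlike DDP’s gradient sync this traffic sits on the critical path (the next layer can’t start until the all-reduce finishes), so it’s much harder to hide.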
The “combined VRAM” thing tho… nah, don’t plan on that. You still need explicit sharding (FSDP/DeepSpeed/TP), or you’ll OOM like normal.
Cost tip: if NVLink adds like $500+ overall, I’d rather just buy the bigger-VRAM cards (ex: 2x 24GB-class cards like the NVIDIA GeForce RTX 4090, which doesn’t even have a bridge connector) and skip the bridge drama. If the price delta is small, take NVLink and call it insurance. idk, but yeah, that’s how I’d budget it… gl!
So, one thing to keep in mind from a market perspective is that NVIDIA has basically turned NVLink into a "pay-to-play" feature for the enterprise tier. If you're looking at consumer cards like the GeForce RTX 4090 or the newer GeForce RTX 5090, you've probably noticed the physical bridge is just... gone (the RTX 3090 was the last consumer card to have one). And it's worse than it looks on the workstation side: even the current pro cards like the NVIDIA RTX 6000 Ada Generation and the NVIDIA L40S dropped the connector, so to get a real NVLink bridge in 2026 you're hunting older Ampere-era cards like the RTX A6000, or moving into H-series territory (like the NVIDIA H100 NVL) at a massive price premium. Honestly, I'm not sure the ROI is there for a 2-GPU build. From what I've seen of the market lately, NVIDIA is *really* pushing PCIe 5.0 as the "good enough" solution for most workstation users, while keeping the high-bandwidth NVLink/NVSwitch stuff for their big DGX clusters. If you're comparing a dual consumer setup vs a bridged "Pro" setup, you're often paying like 3x the price for maybe a 10-15% gain in specific LLM training scenarios. Unless your workload *absolutely* requires NVSwitch levels of bandwidth, that market gap is getting pretty hard to justify, tbh.
Honestly, coming from a DIY background where I build my own workstations, the NVLink thing is a double-edged sword. Everyone talks about the speed, but nobody mentions how much of a pain it is to get stable in a home-built setup. I tend to be pretty cautious about hardware lifespan, and cramming two cards close enough for a bridge usually means they run hotter than I'm comfortable with. If you're going the DIY route, you really have to weigh the thermal-management headache and finding a motherboard with the exact right slot spacing for the bridge. In my experience, professional builders handle that validation for a reason. But if you want to DIY it, here is what I noticed: