I keep seeing people say “you need Tensor Cores for AI,” but I’m not sure how true that is for my use case. I’m mainly doing PyTorch stuff like training small CNNs and fine-tuning a Transformer (FP16/bfloat16 when possible), and I’m comparing GPUs where one has more raw CUDA cores/VRAM but fewer or older Tensor Cores. I care about both training speed and inference latency, but I’m also worried about bottlenecks like memory bandwidth and VRAM limits. In practice, how much do Tensor Cores matter for AI workloads, and when are they basically a must-have vs. a nice-to-have?
For your situation, Tensor Cores matter a LOT… but not always. I’m not 100% sure on every workload, but in my PyTorch messing-around:
- If you’re doing FP16/bfloat16 and your model is matmul-heavy (Transformers), newer Tensor Cores can be a huge speed bump.
- If you’re VRAM-limited or bandwidth-limited, more VRAM can beat “better” Tensor Cores, honestly.
- For inference latency, smaller batch = sometimes less Tensor Core benefit (kernel launch/memory overheads).
So yeah: must-have for modern mixed-precision training, nice-to-have if you’re mostly bottlenecked by VRAM/memory. gl!
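If you want to sanity-check the small-batch point with numbers, here is a rough back-of-the-envelope sketch. It compares a matmul's FLOPs per byte against the GPU's FLOPs-per-byte balance. The `PEAK_FLOPS` and `MEM_BW` values are made-up placeholders, not specs of any real card, and the function names are just mine. It also ignores caching and kernel launch overhead, so treat it as a first-order guess only.

```python
def gemm_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul.

    Assumes each matrix is read or written exactly once (no cache reuse),
    and 2 bytes per element (FP16/BF16) by default.
    """
    flops = 2 * m * k * n  # one multiply + one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound(m, k, n, peak_flops, mem_bandwidth, bytes_per_elem=2):
    """Classify the matmul as compute-bound or memory-bound (roofline style)."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the GPU can do per byte
    intensity = gemm_intensity(m, k, n, bytes_per_elem)
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


# Hypothetical hardware numbers, for illustration only.
PEAK_FLOPS = 100e12  # 100 TFLOP/s of FP16 matmul throughput (assumed)
MEM_BW = 1e12        # 1 TB/s of memory bandwidth (assumed)

for batch in (1, 64, 4096):
    intensity = gemm_intensity(batch, 4096, 4096)
    verdict = bound(batch, 4096, 4096, PEAK_FLOPS, MEM_BW)
    print(f"batch={batch:5d}  intensity={intensity:8.1f} FLOP/byte  {verdict}")
```

With those placeholder numbers, a 4096x4096 layer at batch 1 or 64 comes out memory-bound and only the big batch is compute-bound, which is why faster Tensor Cores often don't show up in batch=1 latency.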
Great info, saved!
In my experience, they matter… but only *after* you clear the boring bottlenecks.
- If you’re training/fine-tuning Transformers in FP16/BF16, the matrix-multiply path is basically the whole game. Newer Tensor Cores = big throughput wins (sometimes 2–4x vs “just more shader cores”), **as long as** your shapes line up (dimensions that are multiples of 8 are the usual guidance for FP16/BF16) and you’re not starving the GPU on data loading or memory bandwidth.
- If you’re VRAM-limited (common with fine-tuning), more memory can beat faster math. Like, an extra 8–12GB can let you bump batch size, sequence length, or avoid offloading/checkpointing pain. A slower GPU that fits the model often finishes sooner than a faster one that’s constantly offloading to system RAM.
- If you’re bandwidth/latency-bound (small CNNs, tiny batch inference), you might not see much. Kernel launch overhead, memory traffic, and non-matmul ops dominate. Lowkey, “more CUDA cores” won’t save you there either.
- Practical rule I use:
  - **Must-have**: FP16/BF16 Transformer training, big GEMMs, attention-heavy workloads, or you care about tokens/sec.
  - **Nice-to-have**: small models, batch=1 inference, lots of augmentation/data-loader time, or anything constrained by VRAM.
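- Actionable: turn on AMP (`torch.autocast`, plus a gradient scaler if you use FP16; the older `torch.cuda.amp` spelling still shows up in a lot of tutorials), set `torch.backends.cuda.matmul.allow_tf32 = True` (for FP32-ish training; it only does anything on Ampere or newer), and profile with `torch.profiler` to see if you’re compute- vs memory-bound.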
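A minimal sketch of those three knobs together, in case it helps. The tiny MLP, the random data, and the names `train_step` / `profile_steps` are placeholders I made up, not anything from your setup. I'm assuming BF16 autocast (so no gradient scaler), which on the GPU side wants Ampere or newer; with FP16 you would add a gradient scaler. It falls back to CPU if no GPU is visible, so it should run anywhere PyTorch is installed.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Let FP32 matmuls and cuDNN convolutions use TF32 (no effect before Ampere or on CPU).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and data; swap in your own.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(64, 512, device=device)
y = torch.randn(64, 512, device=device)


def train_step():
    """One optimizer step with the forward pass under BF16 autocast."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        out = model(x)  # linear layers run in BF16 here
    # Compute the loss in FP32, outside the autocast region.
    loss = nn.functional.mse_loss(out.float(), y)
    loss.backward()
    optimizer.step()
    return loss.item()


def profile_steps(n_steps=3):
    """Profile a few steps and return a per-op summary table as a string."""
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        for _ in range(n_steps):
            train_step()
    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)


print(profile_steps())
```

If the table is dominated by matmul/GEMM kernels, faster Tensor Cores should help; if it's mostly copies, elementwise ops, or data-loader time, they probably won't.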
If you tell me the two exact GPUs + VRAM and your seq length/batch, I can give a “pick A vs B” call pretty confidently. gl!
Nice, didn't know that
To sum up the discussion so far: it’s basically a balancing act between math throughput and memory capacity, and VRAM usually wins if you’re tight on budget. Adding a market perspective here, you’re often choosing between the 'NVIDIA tax' and raw hardware value. For example, an NVIDIA GeForce RTX 4070 Ti Super is fantastic for its 4th gen Tensor Cores and FP8 support (which is getting more common for efficient fine-tuning and inference, though you need library support to actually use it), but that 16GB of VRAM can still be a major bottleneck for larger Transformers. Compare that to something like the AMD Radeon RX 7900 XT, which gives you 20GB of VRAM for less money; while its 'AI Accelerators' are getting better, the ROCm software stack still isn’t as seamless as CUDA for most PyTorch workflows. Honestly, if you want the best cost-to-performance ratio right now, a used NVIDIA GeForce RTX 3090 is still the king for AI hobbyists. You get that 24GB buffer, which is a must-have for decent batch sizes, and its 3rd gen Tensor Cores are more than enough for FP16/BF16 mixed-precision training. Tbh, unless you’re doing massive-scale inference where FP8 throughput is life-or-death, prioritize the VRAM (at least 20GB+) over the newest architectural bells and whistles.
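To put rough numbers on the VRAM side, here is a quick estimate sketch for full fine-tuning with AdamW under plain autocast. It assumes FP32 weights (4 bytes), FP32 gradients (4 bytes), and two FP32 Adam moments (8 bytes), so about 16 bytes per parameter before any activations. The 20% headroom for activations and the function names are my own guesses, not measured values, and LoRA or 8-bit optimizers would change the math a lot.

```python
GIB = 2**30


def training_state_gib(n_params, bytes_per_param=16):
    """Memory for weights + gradients + AdamW moments, in GiB.

    16 bytes/param assumes FP32 weights (4) + FP32 grads (4) + two FP32
    Adam moments (8). Activations are NOT included.
    """
    return n_params * bytes_per_param / GIB


def fits(n_params, vram_gib, activation_headroom=0.2):
    """Rough check: does the training state fit after reserving some VRAM
    for activations and framework overhead? The headroom is an assumed guess."""
    usable = vram_gib * (1.0 - activation_headroom)
    return training_state_gib(n_params) <= usable


# Example: a 1B-parameter model, treating the card sizes as GiB for simplicity.
for vram in (16, 20, 24):
    print(f"{vram} GiB card: state={training_state_gib(1_000_000_000):.1f} GiB, "
          f"fits={fits(1_000_000_000, vram)}")
```

Under those assumptions a 1B-parameter full fine-tune needs roughly 15 GiB of state alone, so it is tight on 16GB and comfortable on 24GB, which matches the "prioritize VRAM" advice.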
To give you a more accurate steer, could you clarify which specific GPU models you're deciding between and what your motherboard/CPU setup is? I'm particularly interested in whether you're limited by your power supply or physical space in the chassis, as those factors often dictate which architecture you can actually run reliably at high load. Technical compatibility is often the silent killer for these builds. Here is what I would keep in mind: