I’m trying to fine-tune a small/medium LLM on a single consumer GPU (RTX 3090, 24GB VRAM) and I’m not sure what’s realistic. I’m looking at LoRA/QLoRA and maybe 4-bit quantization, but I’m worried about VRAM, training speed, and stability. What model sizes and settings are actually doable on one GPU?
Hey, I feel you. Figuring out what’s “realistic” on a single RTX 3090 (24GB) is confusing at first. In my experience (a couple of years of on-and-off tinkering), you can absolutely fine-tune small/medium LLMs on a single 24GB card; you just have to pick the right approach.
Option A: full fine-tune (fp16/bf16) → Pros: simplest mental model, sometimes best quality. Cons: VRAM explodes, because you hold weights, gradients, and optimizer states for every parameter. On 24GB this is usually a no-go past roughly 1–3B params unless you use a tiny batch, short sequence length, and gradient checkpointing, and even then it’s painful.
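To put rough numbers on Option A, here is a back-of-the-envelope sketch. It assumes the common mixed-precision AdamW layout of about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 4 for an fp32 master copy, 4 + 4 for the two Adam moments) and ignores activations entirely. The function name and the 16-byte default are my assumptions, not something measured on a 3090.

```python
GIB = 2**30

def full_finetune_gib(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough VRAM for weights + gradients + AdamW states (activations excluded)."""
    return n_params * bytes_per_param / GIB

# Compare a few model sizes against a 24 GB card.
for billions in (1, 3, 7):
    need = full_finetune_gib(billions * 1e9)
    print(f"{billions}B params: ~{need:.0f} GiB before activations")
```

Under these assumptions 1B lands around 15 GiB and 3B around 45 GiB, which is why anything past the low end of that range needs extra tricks. Passing `bytes_per_param=8` (roughly: everything kept in bf16, also an assumption) puts 3B at about 22 GiB, right at the edge of the card.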
Option B: LoRA (8/16-bit base) → Pros: more stable than you’d expect, decent speed, manageable VRAM. Cons: the frozen base model still eats memory. I’ve had 7B run comfortably with LoRA if you keep sequence length sane (around 1k–2k tokens) and use gradient accumulation.
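A quick sketch of why LoRA is so much cheaper: each adapted linear layer only adds `rank * (d_in + d_out)` trainable parameters, while the frozen base weights cost the same as before. The shape below (hidden size 4096, 32 layers, adapters on `q_proj` and `v_proj`, rank 16) is an assumed Llama-style 7B layout for illustration; other models will differ.

```python
GIB = 2**30

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight.
    return rank * (d_in + d_out)

def base_weights_gib(n_params: float, bits: int) -> float:
    # Memory for the frozen base weights only, at the given precision.
    return n_params * bits / 8 / GIB

# Assumed Llama-style 7B shape; two adapted projections per layer.
hidden, layers, rank = 4096, 32, 16
trainable = layers * 2 * lora_params(hidden, hidden, rank)

print(f"trainable LoRA params: {trainable:,}")
for bits in (16, 8):
    print(f"7B base at {bits}-bit: ~{base_weights_gib(7e9, bits):.1f} GiB")
```

With these assumptions you train about 8.4M parameters instead of 7B, so gradients and optimizer states are tiny. The 16-bit base is still about 13 GiB of the 24, which is where the sequence-length limit comes from.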
Option C: QLoRA (4-bit base + LoRA) → Pros: this is the “3090 sweet spot” in my opinion. 7B is easy; 13B is doable but slower, and you’ll be juggling settings. Cons: a bit more fiddly (paged optimizers, occasional instability if the learning rate is too high).
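For Option C, a minimal setup sketch using Hugging Face `transformers`, `peft`, and `bitsandbytes` (plus `accelerate` for `device_map="auto"`). All of the values in `QLORA_SETTINGS` are illustrative starting points I picked, not tuned numbers, and the `target_modules` names assume a Llama-style model. The helper name `load_qlora_model` is hypothetical.

```python
# Illustrative starting points only; adjust for your model and data.
QLORA_SETTINGS = {
    "quant_type": "nf4",          # 4-bit NormalFloat quantization
    "double_quant": True,         # also quantize the quantization constants
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumes Llama-style layer names
}

def load_qlora_model(model_name: str, settings: dict = QLORA_SETTINGS):
    """Load a 4-bit base model and wrap it with LoRA adapters."""
    # Heavy imports live here so the settings can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=settings["quant_type"],
        bnb_4bit_use_double_quant=settings["double_quant"],
        bnb_4bit_compute_dtype=torch.bfloat16,  # Ampere cards like the 3090 support bf16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    # Prepares the quantized model for training and enables gradient checkpointing.
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    lora_config = LoraConfig(
        r=settings["lora_r"],
        lora_alpha=settings["lora_alpha"],
        lora_dropout=settings["lora_dropout"],
        target_modules=settings["target_modules"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)
```

You would call it with whatever model ID you’re using, then check `model.print_trainable_parameters()` on the result to confirm only the adapters are trainable.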
Practical settings that work well: gradient checkpointing on, micro-batch 1–2, gradient accumulation to reach your target effective batch size, a low starting LR (around 1e-4 for LoRA), and don’t crank context length unless you really need it. If you tell us the model size (7B vs 13B) and target context length, we can sanity-check your config. Good luck!
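In case it helps, the batch-size math from the settings above as a small sketch. The target effective batch of 16 is just an example I picked, and `training_kwargs` uses field names from Hugging Face `TrainingArguments`; the specific values are assumptions to tune, not recommendations from a benchmark.

```python
import math

def accumulation_steps(target_effective_batch: int, micro_batch: int, n_gpus: int = 1) -> int:
    # effective batch = micro_batch * n_gpus * accumulation steps
    return math.ceil(target_effective_batch / (micro_batch * n_gpus))

micro_batch = 2
steps = accumulation_steps(target_effective_batch=16, micro_batch=micro_batch)

# Keyword arguments intended for transformers.TrainingArguments (values are examples).
training_kwargs = {
    "per_device_train_batch_size": micro_batch,
    "gradient_accumulation_steps": steps,
    "gradient_checkpointing": True,
    "learning_rate": 1e-4,
    "bf16": True,
    "optim": "paged_adamw_8bit",  # paged optimizer from bitsandbytes, useful with QLoRA
}

print(training_kwargs)
```

These can be passed along as `TrainingArguments(output_dir="out", **training_kwargs)`. If you hit out-of-memory errors, drop the micro-batch to 1 and the accumulation steps go up to compensate, so the effective batch stays the same.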
Commenting to find later
Big if true
This ^
TIL! Thanks for sharing