Tags: GPU, LLM, VRAM, Inference, Benchmarking, Hardware

Which GPU Runs Which LLM? The Complete Hardware Guide

From a 3B model on a laptop to a 405B model on a GPU cluster — how to read VRAM, use the llmfit tool to check what fits, benchmark honestly, pick the right card for your use case, and decide whether to rent or buy. With interactive calculators you can drive yourself.

AgentSwarms Authors
May 30, 2026 · 22 min read

Everyone asks the same question the moment they try to run a model locally: will this LLM actually run on my GPU? It feels like it should be simple. It is — once you understand the single number that governs almost everything, learn to estimate it in your head, and know the one free tool that does the arithmetic for you. This is the complete, hands-on guide to matching large language models to the hardware that runs them.

We'll go from a 3-billion-parameter model on a laptop all the way to a 405-billion-parameter model on a rack of datacenter GPUs. Along the way you'll drive interactive calculators, see exactly which card runs which model, learn to benchmark without fooling yourself, understand why a $1,600 desktop card and a $28,000 datacenter card aren't really competing, and work out whether you should rent or buy. Let's start with the number that decides everything.

The one number that decides everything: VRAM

A GPU has its own dedicated memory — VRAM (video RAM). To generate text, the model's weights, its working memory for the conversation (the KV cache), and some runtime overhead must all fit in VRAM at the same time. If they don't, one of two things happens: the model refuses to load, or your framework spills the overflow into ordinary system RAM and inference slows to a crawl — often 10–50× slower. VRAM isn't one factor among many. It's the gate. Everything else is a tiebreaker.

The mental model

Compute (how fast the GPU does math) sets your speed. VRAM (how much fits) sets whether you can run the model at all. Beginners obsess over speed; the first real question is always capacity.

So before anything else, learn to estimate how much VRAM a model needs. Drag the sliders below — change the model size, the context length, and the quantization (we'll explain that next) and watch the requirement light up the GPUs that can hold it.

[Interactive VRAM estimator. Example reading: 6.1 GB needed = weights 4.0 GB + KV cache 0.7 GB + overhead 1.4 GB. GPU sizes compared: 12, 16, 24, 32, 48, 80, 141, and 192 GB.]

Total ≈ weights + KV cache (grows with context) + ~15% runtime overhead. This is the exact sum a tool like llmfit computes for you — drag the sliders and watch a 70B model leave a single 24GB card behind.

Interactive VRAM estimator. VRAM ≈ weights + KV cache + ~15% overhead. Weights scale with model size and precision; the KV cache grows with context length. Watch a 70B model fall off a 24GB card the moment you raise the context window.

The back-of-the-envelope formula

You can do the core estimate in your head. Each parameter takes a fixed number of bytes depending on precision: 2 bytes at FP16, 1 byte at INT8, half a byte at INT4. Multiply by the parameter count, then add roughly 20% for the KV cache and runtime overhead:

VRAM (GB) ≈ params(billions) × bytes_per_param × 1.2

bytes_per_param:  FP16 = 2   INT8 = 1   INT4 = 0.5

Examples (rule of thumb):
  7B  @ FP16 ≈ 7  × 2   × 1.2 ≈ 17 GB
  7B  @ INT4 ≈ 7  × 0.5 × 1.2 ≈  4 GB
  70B @ FP16 ≈ 70 × 2   × 1.2 ≈ 168 GB
  70B @ INT4 ≈ 70 × 0.5 × 1.2 ≈ 42 GB

The 2× shortcut

For a quick gut-check: an FP16 model needs about 2 GB of VRAM per billion parameters, and an INT4 model needs about 0.5 GB per billion. A 13B model at INT4 → ~7–8 GB → fits a humble 12GB card. Memorize those two anchors and you can size almost anything instantly.
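
The rule of thumb above is easy to turn into a few lines of code. This is a minimal sketch of that formula only; the function name `estimate_vram_gb` and the default `overhead=1.2` are illustrative choices, not part of any tool mentioned in this guide.

```python
# Bytes per parameter at each precision, as given in the rule of thumb.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM in GB: params (billions) x bytes per param x overhead factor."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


if __name__ == "__main__":
    for size in (7, 13, 70):
        for precision in ("FP16", "INT8", "INT4"):
            print(f"{size}B @ {precision}: ~{estimate_vram_gb(size, precision):.1f} GB")
```

Treat the result as a first-pass estimate: it folds the KV cache into a flat 20% allowance, so long contexts need to be budgeted separately.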

Two subtleties the rule of thumb hides. First, the KV cache grows with how much text the model is holding in context — a 128K-token conversation can add many gigabytes on top of the weights, which is why long-context use cases need far more headroom than the weights alone suggest. Second, training and fine-tuning need much more than inference: optimizer states and gradients can triple or quadruple the requirement. Everything in this guide is about inference unless we say otherwise.
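
To see how quickly context adds up, here is a rough sketch of the standard KV-cache size calculation for a transformer: one key vector and one value vector per layer, per KV head, per token. The layer count, KV-head count, and head dimension below are assumptions matching a Llama-3.1-8B-style model with grouped-query attention; substitute the values from your own model's config. Results are in GiB (1024³ bytes).

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_tokens,
                 bytes_per_value=2, batch_size=1):
    """KV cache size in GiB for a given context length and batch size."""
    # Factor 2 = one key vector plus one value vector per layer and KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch_size / 1024**3


# Assumed config (Llama-3.1-8B-like): 32 layers, 8 KV heads, head_dim 128, FP16 cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

Under these assumptions a full 128K-token context costs about four times the INT4 weights of the same 8B model, and the figure multiplies again with every concurrent sequence.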

Quantization: the lever that changes everything

If VRAM is the gate, quantization is the key that opens it. Models are trained in 16-bit precision, but you don't have to run them that way. Quantization stores each weight in fewer bits — 8, 4, even 3 — shrinking the model dramatically with a surprisingly small hit to quality. It is the single most important technique for running big models on small cards.

[Interactive precision ladder. Example reading at INT8: 1 byte per parameter, 7.0 GB for a 7B model, ~97% of quality kept. The safe default.]

Climb down the precision ladder. Each step roughly halves the memory. INT4 (the popular Q4_K_M format) keeps ~93% of quality while using a quarter of the memory of FP16 — the sweet spot most local setups land on.
  • FP16 / BF16 (2 bytes) — the reference. Matches the published weights exactly. Use when you have the VRAM and want maximum fidelity.
  • INT8 (1 byte) — half the memory, ~1–2% quality drop. A safe, almost-free win.
  • INT4 (0.5 bytes) — a quarter of the memory, ~5–7% quality drop. This is what makes a 70B model run on a single 48GB card. The community default (Q4_K_M) for local inference.
  • INT3 / INT2 — squeeze-territory. Real degradation; reserve for when nothing else fits.

GGUF, AWQ, GPTQ — what are these?

These are quantization formats you'll see on Hugging Face. GGUF is the format llama.cpp/Ollama use (great for CPU + GPU, the Q4_K_M naming). AWQ and GPTQ are GPU-optimized 4-bit formats common with vLLM. Same idea — fewer bits — different packaging for different runtimes.
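
To make "fewer bits per weight" concrete, the sketch below applies plain symmetric absmax quantization to a handful of made-up weights. It is a deliberately simplified illustration, not how GGUF, AWQ, or GPTQ work internally: real formats quantize in small blocks with per-block scales and, in some cases, calibration data. The sample weights and function names are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric absmax quantization: map floats to signed ints sharing one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input; any scale reproduces it
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88]  # made-up sample values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: ints={q} max abs error={worst:.4f}")
```

Running it shows the trade the precision ladder describes: the 4-bit version stores each weight in half the space of the 8-bit one, at the cost of a visibly larger rounding error.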

Which GPU runs which model?

Now the headline question. Combine the VRAM math with quantization and you get a clear map of what runs where. Toggle the precision in the matrix below: green means it fits on a single card, amber means you need to split it across two-to-four GPUs, and red means it takes a whole server.

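Fit matrix, snapshot of one precision setting (the multi-GPU counts are consistent with the INT4 rule of thumb). ✓ = fits on one GPU; ×N = number of GPUs needed.

GPU (VRAM)           3B   8B   14B  27B  32B  70B  405B
RTX 4090 (24GB)      ✓    ✓    ✓    ✓    ✓    ×2   ×11
RTX 5090 (32GB)      ✓    ✓    ✓    ✓    ✓    ×2   ×8
L40S (48GB)          ✓    ✓    ✓    ✓    ✓    ✓    ×6
A100 80GB (80GB)     ✓    ✓    ✓    ✓    ✓    ✓    ×4
H200 (141GB)         ✓    ✓    ✓    ✓    ✓    ✓    ×2
B200 (192GB)         ✓    ✓    ✓    ✓    ✓    ✓    ×2

Legend: ✓ fits on one GPU · ×2–4 multi-GPU · ×8+ a whole server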

Switch the precision and watch the board light up. INT4 is the great equalizer — it pulls a 70B model onto a single 48GB card, while at FP16 the same model needs four.

The fit matrix. Flip between FP16, INT8, and INT4 and watch the board change. At INT4 a single 48GB L40S swallows a 70B model; at FP16 that same model needs four GPUs. Quantization is the great equalizer.
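
The matrix can be approximated from the rule of thumb alone. The sketch below divides the estimated requirement by each card's VRAM and rounds up; it is an illustration under that assumption, not the code behind the interactive matrix. Real multi-GPU setups lose some memory per card to duplicated buffers and often prefer GPU counts that are powers of two, so read the counts as a floor.

```python
import math

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
GPUS_GB = {"RTX 4090": 24, "RTX 5090": 32, "L40S": 48,
           "A100 80GB": 80, "H200": 141, "B200": 192}
MODELS_B = [3, 8, 14, 27, 32, 70, 405]


def gpus_needed(params_billion, precision, vram_gb, overhead=1.2):
    """Minimum GPU count: rule-of-thumb requirement divided by per-card VRAM, rounded up."""
    need_gb = params_billion * BYTES_PER_PARAM[precision] * overhead
    return math.ceil(need_gb / vram_gb)


def print_matrix(precision):
    """Print one row per GPU with the number of cards each model size needs."""
    print(f"{precision:<10}" + "".join(f"{m}B".rjust(6) for m in MODELS_B))
    for name, vram in GPUS_GB.items():
        cells = "".join(f"x{gpus_needed(m, precision, vram)}".rjust(6) for m in MODELS_B)
        print(f"{name:<10}{cells}")


print_matrix("INT4")
print_matrix("FP16")
```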

A rough tiering you can memorize

  • 8–12 GB (RTX 3060, 4070, laptop GPUs): 3B–8B models at INT4. Great for learning, chat, and code assistants.
  • 16–24 GB (RTX 4080, 4090, A10, L4): up to ~14B comfortably, 32B at INT4 with a short context. The hobbyist and small-prod sweet spot.
  • 32–48 GB (RTX 5090, L40S): 32B at good precision, 70B at INT4. Serious single-card territory.
  • 80 GB (A100, H100): 70B at INT4 with ample room for long context, or at INT8 (near-FP16 quality) as a tight fit with short context, since the rule of thumb puts it at ~84 GB. Also high-throughput serving of smaller models with big batches.
  • 141–192 GB (H200, B200) and multi-GPU: 100B+ models, MoE giants like Mixtral and DeepSeek, and 405B with several cards linked by NVLink.

Mixture-of-Experts (MoE) is sneaky

Models like Mixtral 8x7B or DeepSeek activate only a fraction of their parameters per token, so they're fast, but all the experts must still sit in VRAM. Mixtral 8x7B, for example, uses roughly 13B parameters per token yet needs memory for all ~47B of them. Size by total parameters, not active ones.
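
A quick calculation shows how far off you land if you size an MoE model by its active parameters. The figures below are approximate published numbers for Mixtral 8x7B (about 46.7B total, about 12.9B active per token), used here purely as an illustration.

```python
def weights_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB."""
    return params_billion * bytes_per_param


TOTAL_B, ACTIVE_B = 46.7, 12.9  # approximate Mixtral 8x7B parameter counts

for label, bytes_per_param in (("FP16", 2.0), ("INT4", 0.5)):
    wrong = weights_gb(ACTIVE_B, bytes_per_param)   # sizing by active params (too small)
    right = weights_gb(TOTAL_B, bytes_per_param)    # sizing by total params (what VRAM must hold)
    print(f"{label}: active-only guess {wrong:.1f} GB, actual weights {right:.1f} GB")
```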

The fastest way to check: the llmfit tool

You don't have to do this arithmetic by hand. llmfit is a free, open-source command-line tool that detects your hardware (RAM, CPU, and GPU VRAM across NVIDIA, AMD, Apple Silicon, and Intel Arc) and tells you exactly which models will run well — and at what quantization. It's the single fastest way to answer 'can my machine run this?'

# Install (pick one)
brew install llmfit                              # macOS / Linux
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
uv tool install -U llmfit                         # via Python/uv
scoop install llmfit                              # Windows

# See what llmfit detected about your machine
llmfit system

# Rank models that fit your hardware perfectly
llmfit fit --perfect -n 5

# Get coding-focused recommendations as JSON
llmfit recommend --json --use-case coding --limit 3

# Plan the requirements for a specific model + context
llmfit plan "Qwen/Qwen3-4B" --context 8192

# Pretend you have a different GPU before you buy it
llmfit --memory=24G --ram=64G fit --perfect -n 5

The clever part is dynamic quantization: instead of a yes/no answer, llmfit walks down the precision ladder (Q8_0 → Q6_K → Q5_K → Q4_K_M → Q3_K_M → Q2_K) and reports the highest-quality quant that still fits your memory. It scores every model 0–100 across four dimensions — quality, speed, fit, and context — and weights them by your use case (chat favors speed, reasoning favors quality). The interactive terminal below shows the same llmfit fit command run on three different machines.

$ llmfit fit --perfect -n 3
→ detected: 24GB VRAM · RTX 4090
1. Qwen2.5 14B    Q5_K    score 94
2. Gemma 2 9B     Q8_0    score 91
3. Llama 3.1 8B   Q8_0    score 90

One command, hardware-aware. llmfit detects your RAM/VRAM, then ranks models by the best quantization that actually fits — switch the machine and the shortlist changes with it.

The same one-liner, three machines. Switch between a 16GB laptop, a 24GB workstation, and an 80GB server and watch llmfit re-rank the shortlist — always picking the best quantization that actually fits.
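
The idea of walking down the precision ladder can be sketched in a few lines. This is not llmfit's actual algorithm or scoring, which also weighs speed, context, and use case; it is a simplified illustration that uses nominal bit widths for each rung (real GGUF quants carry slightly more than their nominal bits for block scales) together with the 1.2× rule of thumb from earlier. The function name `best_quant` is hypothetical.

```python
# Nominal bits per weight for each rung, highest quality first (a simplification).
LADDER = [("Q8_0", 8), ("Q6_K", 6), ("Q5_K", 5), ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]


def best_quant(params_billion, memory_gb, overhead=1.2):
    """Return (name, estimated GB) for the first rung that fits, or None if none does."""
    for name, bits in LADDER:
        need_gb = params_billion * (bits / 8) * overhead
        if need_gb <= memory_gb:
            return name, round(need_gb, 1)
    return None


for params, memory in ((14, 24), (70, 48), (405, 24)):
    print(f"{params}B on {memory} GB -> {best_quant(params, memory)}")
```

Because this sketch optimizes for fit alone, it can pick a higher-precision quant than a tool that also reserves room for context and speed.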

Use the --memory flag before you spend money

The single most useful trick: llmfit --memory=48G fit --perfect simulates a GPU you don't own yet. Plan your purchase against the exact models you want to run, instead of guessing from a spec sheet.

llmfit is the CLI workhorse, but a few browser-based calculators are worth bookmarking too — they're handy for sharing a link or sizing a model you can't download yet:

  • LLMfit.io — web VRAM and generation-speed estimator for Llama, Mistral, Qwen, and DeepSeek.
  • NyxKrage's LLM Model VRAM Calculator (Hugging Face Space) — paste a HF model name, pick the quant and context, get the number.
  • APXML 'Can You Run This LLM?' — covers NVIDIA and Apple Silicon side by side.
  • gpu_poor (GitHub) — estimates memory and breaks down weights vs KV cache vs activations for both training and inference.

Choosing the right GPU for your use case

'Which GPU should I buy?' has no single answer — it depends entirely on what you're doing. Running a quantized model for fun has wildly different requirements than serving thousands of users or fine-tuning on your own data. Pick your use case below and see where it points.

[Interactive use-case picker. Example reading: recommended GPU is an RTX 4060 Ti 16GB or RTX 4090 (24GB). Consumer card, INT4 quants, zero cloud bill. 24GB comfortably runs a 14B model.]

Start from the job, not the GPU. Learning locally, serving a product, fine-tuning with LoRA, and full training each have a different right answer — and the gap between them is enormous.
  1. Define the workload: inference or training? One user or many concurrent? How big is the model you actually need (often smaller than you think)?
  2. Set the context length: long documents and agent histories balloon the KV cache. Budget VRAM for your worst-case context, not your average.
  3. Pick a precision: can you accept INT4? That alone may drop you from a datacenter card to a consumer one.
  4. Add headroom: leave 15–20% of VRAM free for the cache and runtime, or you'll hit out-of-memory errors under load.
  5. Then choose the card: the smallest, cheapest GPU that clears all of the above with headroom wins.
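
The last two steps, adding headroom and then taking the smallest card that clears the bar, can be sketched as a small helper. The capacity list mirrors the VRAM sizes used elsewhere in this guide; the function name, the 20% default headroom, and the example KV-cache figures are illustrative assumptions.

```python
CAPACITIES_GB = [12, 16, 24, 32, 48, 80, 141, 192]


def smallest_card(weights_gb, kv_cache_gb, headroom=0.20):
    """Smallest capacity whose usable share (1 - headroom) covers weights + worst-case KV cache."""
    need_gb = weights_gb + kv_cache_gb
    for capacity in sorted(CAPACITIES_GB):
        if need_gb <= capacity * (1 - headroom):
            return capacity
    return None  # no single card fits; plan for multiple GPUs


# 14B at INT4 (7 GB of weights): one user vs. several concurrent 8K contexts.
print(smallest_card(7, 2))    # single user, ~2 GB KV cache
print(smallest_card(7, 8))    # several concurrent users, ~8 GB KV cache
print(smallest_card(35, 3))   # 70B at INT4, short context
```

Note how the same model moves up a tier once the KV cache is budgeted for concurrent users rather than a single conversation.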

Desktop GPUs vs datacenter GPUs

Here's a question that confuses almost everyone: if an RTX 4090 has 24GB of fast memory for $1,600, why does an H100 with 80GB cost $28,000? Surely that's a 17× markup for ~3× the memory? The answer is that they aren't built for the same job — and a lot of what you pay for on a datacenter card is invisible on a spec sheet.

                            RTX 4090 (desktop flagship)   H100 (datacenter)
VRAM                        24 GB GDDR6X                  80 GB HBM3
Memory bandwidth            ~1.0 TB/s                     ~3.35 TB/s
Price                       ~$1,600                       ~$28,000
Multi-GPU link              PCIe only                     NVLink 900 GB/s
ECC memory                  No                            Yes
Partitioning (MIG)          No                            Up to 7 instances
FP8 / Transformer Engine    Limited                       Yes
Datacenter license          Prohibited by EULA            Licensed
Duty cycle                  Desktop / bursty              24/7 sustained

On paper a 4090 looks like a steal. The H100's premium is NVLink for splitting huge models, ECC for correctness, MIG for sharing one card across tenants, FP8 for speed — and a license that lets you legally rack it 24/7. Different jobs, not just different price tags.

What the datacenter premium actually buys, point by point:
  • Memory & bandwidth — datacenter cards use HBM (high-bandwidth memory): an H100 moves ~3.35 TB/s vs a 4090's ~1 TB/s. Bandwidth, not raw compute, is usually what caps token generation speed.
  • NVLink — datacenter GPUs link directly at 900 GB/s so a model can be split across many cards as if they were one. Consumer cards are stuck talking over slower PCIe.
  • ECC memory — error-correcting memory catches bit-flips that would silently corrupt a multi-day training run. Consumer cards skip it.
  • MIG (Multi-Instance GPU) — one A100/H100 can be carved into up to 7 isolated GPUs to serve many tenants. Consumer cards can't.
  • Licensing & duty cycle — NVIDIA's driver EULA restricts consumer GeForce cards in datacenters, and consumer cards aren't designed to run pinned at 100% 24/7. Datacenter cards are.
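
The bandwidth point lends itself to a quick estimate. When generating one token at a time for a single user, the GPU has to read essentially all of the weights from memory for every token, so memory bandwidth divided by the size of the weights gives a rough ceiling on tokens per second. The sketch below uses the bandwidth figures quoted above; it is an upper bound under that simplifying assumption, and measured speeds will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Rough upper bound on single-stream tokens/sec: every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 4.0  # e.g. an 8B model at INT4 (8 x 0.5)

for name, bandwidth in (("RTX 4090", 1000), ("H100", 3350)):
    print(f"{name}: at most ~{decode_ceiling_tok_s(bandwidth, WEIGHTS_GB):.0f} tok/s")
```

Batching changes the picture: several requests share each pass over the weights, which is why total throughput can climb well past this single-stream ceiling.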

The practical takeaway

Building locally, learning, or serving modest traffic? A desktop RTX card is a phenomenal deal — don't pay the datacenter tax you don't need. Running a 24/7 service, splitting a 100B+ model, or multi-tenant serving? That's exactly what the expensive cards are for.

Benchmarking: measure, don't guess

Once a model fits, the next question is how fast. But 'fast' isn't one number, and the headline 'tokens per second' on a vendor slide is almost always measured at a batch size that makes the marketing look good. To benchmark honestly you need to know which metrics matter and how they trade off.

  • TTFT (time to first token) — how long until the user sees anything. Dominated by the prefill of the prompt; this is what makes a chatbot feel responsive.
  • ITL / TPOT (inter-token latency) — the delay between each streamed token. Drives how fast the text appears to type.
  • Throughput (total tokens/sec) — how many tokens the server produces across all users at once. This sets your cost per token.
  • p95 / p99 latency — the slow tail. Averages hide the users who waited 4 seconds; the tail is what they remember.
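
If you log when each request was sent and when each streamed token arrived, all four metrics fall out of simple arithmetic. The sketch below shows one way to compute them; the sample timestamps are invented for illustration, and the percentile uses the simple nearest-rank method rather than any particular benchmark tool's definition.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_metrics(sent_at, token_times):
    """Return (TTFT, mean inter-token latency) in seconds for one streamed response."""
    ttft = token_times[0] - sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0
    return ttft, itl


# Hypothetical log: (time the request was sent, arrival time of each token), in seconds.
requests = [
    (0.00, [0.20, 0.25, 0.30, 0.35]),
    (0.10, [0.45, 0.52, 0.59, 0.66]),
    (0.20, [1.40, 1.50, 1.60, 1.70]),
]

ttfts = [request_metrics(sent, times)[0] for sent, times in requests]
total_tokens = sum(len(times) for _, times in requests)
wall_clock = max(times[-1] for _, times in requests) - min(sent for sent, _ in requests)

print(f"mean TTFT:  {statistics.mean(ttfts):.2f} s")
print(f"p99 TTFT:   {percentile(ttfts, 99):.2f} s")
print(f"throughput: {total_tokens / wall_clock:.1f} tok/s")
```

Even in this tiny sample the mean hides the third request, whose first token took far longer than the other two; that gap is exactly what the tail percentiles exist to expose.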

The crucial insight is the throughput-vs-latency tradeoff. Batching more requests together raises total throughput (and lowers your cost per token) but makes each individual user wait longer. Benchmarking is the art of finding the batch size where both are acceptable. Drag the slider and watch the two pull against each other:

[Interactive batch-size slider. Example reading: 218 total tok/s, 27 tok/s per user, 192 ms TTFT. Bars: throughput (serving efficiency) vs per-user latency cost.]

Bigger batches raise total throughput (lower $/token) but make every individual user wait longer. Benchmarking is finding the batch size where both are acceptable — measure tok/s, TTFT, and the p99 tail, never just one.

Batch size is the master dial of serving. Larger batches push total throughput up and cost/token down, but per-user latency and TTFT climb. There's no free lunch — only a sweet spot for your workload.

Tools that benchmark honestly

  • vLLM's `benchmark_serving.py` — the de facto standard for measuring real serving throughput and latency under concurrent load.
  • NVIDIA GenAI-Perf — vendor tool for TTFT, ITL, and throughput across batch sizes and concurrency levels.
  • llmfit bench — quick inference benchmarks plus community-contributed numbers right in the CLI.
  • llmperf / LLMPerf (Ray) — load-tests an API endpoint the way real traffic would.
  • MLPerf Inference — the industry's standardized, audited benchmark for comparing hardware apples-to-apples.
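
# Benchmark real serving throughput + latency with vLLM.
# Assumes a vLLM source checkout (for benchmarks/benchmark_serving.py) and a
# locally downloaded ShareGPT JSON file; the path below is an example.
# Newer vLLM releases expose the same benchmark as `vllm bench serve`.

# 1. Start the OpenAI-compatible server in the background; wait until it is ready
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct &

# 2. Send 500 ShareGPT prompts at 10 requests per second
python benchmarks/benchmark_serving.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --request-rate 10 \
  --num-prompts 500
# → reports TTFT, TPOT, and throughput at your target load

Benchmark your workload, not theirs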

A number measured with 2,000-token prompts and batch size 256 tells you nothing about your 200-token chat at batch size 4. Always benchmark with prompt lengths, output lengths, and concurrency that match how you'll actually use the model.

Cost, availability, and buy vs rent

GPUs are expensive and, for the top end, genuinely scarce. Understanding both the sticker price and the rental economics is what separates a sustainable setup from a surprise invoice. Here's the rough 2026 landscape, then an interactive way to find your own breakeven.

  • RTX 4090 (24GB) — ~$1,600 to buy; ~$0.30–0.70/hr in the cloud. The price-performance king for local work.
  • RTX 5090 (32GB) — ~$2,000 to buy; limited cloud availability as of mid-2026.
  • L40S (48GB) — ~$1/hr cloud; a popular inference-serving workhorse.
  • A100 80GB — ~$15k to buy; ~$1.07–3.40/hr cloud. The mature, widely-available datacenter standard.
  • H100 80GB — ~$28k to buy; ~$2.00–3.90/hr on specialist clouds, but $8–12/hr on hyperscalers like AWS/GCP/Azure.
  • H200 (141GB) / B200 (192GB) — the frontier; ~$2–4/hr where available, but supply is tight and often reserved.

Availability is a real constraint

Specialist clouds (RunPod, Lambda, Vast.ai, Spheron) are typically 60–85% cheaper than the big three hyperscalers for the same card. Spot/preemptible instances cut another 50–80% but can be reclaimed at any moment — perfect for fault-tolerant batch jobs, risky for a live service.

The big decision is buy vs rent, and it comes down to utilization. A GPU you use a few hours a week should almost always be rented. A GPU pinned near 24/7 usually justifies buying — or a long-term cloud reservation. The crossover point is the whole game. Pick a card and slide your monthly usage to find it:

[Interactive buy-vs-rent calculator. Example reading for a $28,000 card:]
Rent (cloud): $400/mo ($2.50/hr × 160 hrs)
Own (amortized): $898/mo ($28,000 over 36 months, plus power)
Breakeven at ~359 hrs/month. Below that, renting wins; above it, buying pays off. At 160 hrs, rent is cheaper.

Sporadic experiments? Rent by the second. A GPU pegged near 24/7 (730 hrs/mo)? Owning — or a long-term reservation — usually wins. The crossover is the whole decision.

Buy-vs-rent breakeven. Cloud cost scales linearly with hours used; ownership is a fixed monthly amortization. Below the breakeven hours, renting wins; above it, owning does. For bursty experimentation, the cloud is almost always cheaper.
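
The breakeven itself is one division. The sketch below reproduces the example reading above ($28,000 amortized over 36 months, rented at $2.50/hr); the $120/month power figure is an assumption inferred from the $898 monthly total, and the function names are illustrative.

```python
def monthly_cost_rent(hourly_rate, hours_per_month):
    """Cloud cost scales linearly with the hours you actually use."""
    return hourly_rate * hours_per_month


def monthly_cost_own(purchase_price, months, power_per_month):
    """Ownership is a fixed monthly amortization plus running costs."""
    return purchase_price / months + power_per_month


def breakeven_hours(purchase_price, months, power_per_month, hourly_rate):
    """Monthly usage at which renting costs the same as owning."""
    return monthly_cost_own(purchase_price, months, power_per_month) / hourly_rate


own = monthly_cost_own(28_000, 36, 120)        # power cost is an assumed figure
rent = monthly_cost_rent(2.50, 160)
hours = breakeven_hours(28_000, 36, 120, 2.50)
print(f"own: ${own:.0f}/mo, rent at 160 hrs: ${rent:.0f}/mo, breakeven: ~{hours:.0f} hrs/mo")
```

This simple model leaves out resale value, hosting, and the cost of idle capital, all of which shift the crossover in practice.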

A worked example, end to end

Let's make it concrete. Say you're building a customer-support assistant. You've decided a 14B model gives you the quality you need, you'll serve it to a handful of concurrent users, and your prompts plus retrieved context run to about 8K tokens. What hardware do you need?

  1. Size the model. 14B at INT4 is 14 × 0.5 = 7 GB of weights, or ≈ 8.4 GB once the 1.2× rule-of-thumb allowance is applied. Add ~2 GB of extra KV cache for 8K-token contexts → ~11 GB total.
  2. Check the fit. That clears a 16GB card with headroom, but for concurrent users you want room for a bigger batch and a larger KV cache, so step up to a 24GB card.
  3. Confirm with llmfit. llmfit --memory=24G plan "Qwen/Qwen2.5-14B" --context 8192 confirms it fits at a high-quality quant (Q5_K or Q8_0).
  4. Decide buy vs rent. Pre-launch with spiky traffic? Rent a 4090 or L4 at ~$0.30–0.70/hr. Steady 24/7 production? An owned 4090, or a reserved L40S for ECC + datacenter licensing.
  5. Benchmark before launch. Run benchmark_serving.py at your real prompt sizes and target request rate, then tune batch size for an acceptable TTFT.

The result

A 14B support assistant runs happily on a single 24GB GPU you can rent for under a dollar an hour, no datacenter card required. Sizing it correctly saved you from over-buying an 80GB H100 that costs roughly 17× as much as a 4090 ($28,000 vs $1,600).

The cheat sheet

  • VRAM is the gate. It decides if a model runs; compute decides how fast.
  • ~2 GB per billion params at FP16, ~0.5 GB at INT4. Memorize this and you can size anything.
  • Quantization is the key. INT4 keeps ~93% of quality at a quarter of the memory — it's how big models reach small cards.
  • Context inflates the KV cache. Long conversations and documents need real extra headroom.
  • Use llmfit. llmfit fit --perfect answers 'what runs on my machine?' in one command — and --memory simulates hardware before you buy.
  • Match the GPU to the job. Desktop cards are unbeatable value for local work; datacenter cards earn their price on NVLink, ECC, MIG, and 24/7 licensing.
  • Benchmark your own workload. TTFT, throughput, and the p99 tail — at your real prompt sizes and concurrency.
  • Rent for bursty, buy for sustained. Find the breakeven hours and let utilization decide.

The mystique around 'do I need a fancy GPU for AI?' evaporates once you can do the memory math. A model is just weights that have to fit, plus a cache that grows with context. Estimate the number, check it with llmfit, pick the smallest card that clears it with headroom, and benchmark before you trust it. That's the whole craft — and now it's yours.

The right GPU isn't the biggest one you can afford. It's the smallest one that runs your model with headroom to spare.

Want to put a model you've sized to work? Spin up a swarm in the AgentSwarms canvas and wire it to your chosen model — local or hosted — then watch the live cost-and-token meter confirm your math in real time.

