Skip to content

Qwen3.6 27B on AMD RX 7900 XT — BIOS & LM Studio Optimization Guide

Getting from 3 tok/s to 35 tok/s on a 20 GB AMD GPU requires fixing three bottlenecks in order:

  1. BIOS configuration,
  2. GPU layer offloading
  3. KV cache management.

BIOS Settings

SettingValueNotes
Above 4G DecodingEnabledPrerequisite — must be on before Resize BAR appears
Resizable BAR / SAMEnabledWithout this, model weights silently load into shared system RAM
CSMDisabledBlocks Resize BAR from activating even when "enabled"
PCIe Slot ModePCIe 4.0 x16Some boards default to 3.0, halving bandwidth
EXPO / XMPEnabledRAM at rated speed improves CPU <-> GPU transfer

Verify in GPU-Z: Resizable BAR -> Yes, Bus Interface → PCIe x16 4.0 @ x16 4.0, dedicated VRAM usage ~17–18 GB when model is loaded.


LM Studio Settings

GPU Offload — Set to 64 (all layers)

Leaving even a few layers on CPU forces every token to cross the PCIe bus. Always offload all layers. (~32 GB/s) instead of staying in VRAM (~640 GB/s).

⚠️ The Layer Trade-Off

With a large context window, all 64 layers on GPU can cause very slow prompt processing because the KV cache overflows VRAM and spills into system RAM. The fix is not to move layers back to CPU — that just trades one problem for another. Fix the KV cache instead (see below).

K/V Cache Quantization — The Key Fix

Set both K Cache and V Cache Quantization Type to Q8_0.

This halves KV cache VRAM usage with some quality loss (KL divergence < 0.04 on Qwen3.6). It resolves KV cache overflow, restoring fast prompt processing while keeping all 64 layers on GPU. Avoid Q4_0 — it noticeably degrades tool calling and long-context quality on Qwen models.

Other Settings

SettingValue
Context Length50000 (100K causes KV cache overflow)
Flash AttentionEnabled
Offload KV Cache to GPU MemoryEnabled
Unified KV CacheEnabled
Keep Model in MemoryEnabled
Evaluation Batch Size512

Results

Fix AppliedTok/s
Baseline (no Resize BAR)~3 tok/s
After BIOS fix~7 tok/s
After 64 GPU layers~18 tok/s
After K/V Cache Q8_0 + 50K context~35 tok/s

Final Working Configuration

Model:                  Qwen3.6-27B Q4_K_M
GPU:                    AMD RX 7900 XT (20 GB VRAM)
Context Length:         50000
GPU Offload:            64
Flash Attention:        ON
K Cache Quantization:   Q8_0
V Cache Quantization:   Q8_0
Evaluation Batch Size:  512

Released under the MIT License.