@patleeman
Created May 16, 2026 11:52
DS4 64GB Mac 1.58-bit quantization experiment prompt

DS4 64GB Mac Quantization Experiment Prompt

We want to investigate whether antirez/ds4 can be modified to produce a smaller-than-current-Q2 DeepSeek V4 Flash GGUF that may fit and run on a 64GB MacBook.

Context:

  • DS4 currently uses a DeepSeek V4 Flash-specific quantization scheme.
  • Current Q2 is not whole-model 2-bit. It aggressively quantizes routed MoE experts while leaving sensitive non-expert tensors at higher precision.
  • Existing DS4 Q2 policy:
    • routed expert gate/up: IQ2_XXS
    • routed expert down: Q2_K
    • shared experts, router, projections, output, etc.: higher precision
    • preferred variant uses imatrix activation statistics.
  • Unsloth has discussed dynamic 1.58-bit / ternary-style quantization. We want to borrow the useful idea — dynamic rescue of sensitive blocks — without throwing away DS4's architecture-specific insight.

Goal: Produce and evaluate an experimental quantization profile:

routed expert gate/up: dynamic 1.58-ish ternary, with Q2 rescue for sensitive blocks
routed expert down: Q2_K initially
everything else: unchanged from DS4 q2-imatrix

The product target is not maximum benchmark glory. The target is: can DeepSeek V4 Flash become realistically usable on a 64GB Mac with a practical context size and disk KV?

Recommended environment:

  • Linux CPU memory-optimized machine
  • 128GB system RAM minimum, 256GB preferred
  • 500GB-1TB disk
  • GPU not required for first phase

Important: system RAM is the main constraint for quantization, not VRAM.

Phase 0 — Baseline:

  1. Clone DS4:
git clone https://github.com/antirez/ds4
cd ds4
  2. Build tools:
make cpu
make -C gguf-tools
  3. Download current artifacts:
./download_model.sh q2-imatrix
./download_model.sh q4-imatrix
  4. Obtain the HF DeepSeek V4 Flash safetensors required by gguf-tools/deepseek4-quantize.

  5. Record:

  • q2-imatrix GGUF size
  • q4-imatrix GGUF size
  • tensor type breakdown
  • which tensor families dominate bytes

If no inspector exists, write a small GGUF tensor-size inspector.
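A minimal sketch of such an inspector, assuming the artifacts are standard GGUF v2/v3 files. It reports sizes as offset deltas (so alignment padding is included) and keeps the raw numeric ggml type id rather than guessing names. The digit-stripping family grouping is an illustrative assumption, not DS4's naming scheme.

```python
import io
import re
import struct
from collections import defaultdict

# GGUF metadata scalar types: type id -> struct format character.
_SCALAR = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}

def _read(f, fmt):
    """Read one little-endian value described by a struct format character."""
    return struct.unpack("<" + fmt, f.read(struct.calcsize("<" + fmt)))[0]

def _read_str(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    return f.read(_read(f, "Q")).decode("utf-8")

def _read_value(f, vtype):
    if vtype in _SCALAR:
        return _read(f, _SCALAR[vtype])
    if vtype == 8:                       # string
        return _read_str(f)
    if vtype == 9:                       # array: element type, count, elements
        etype, count = _read(f, "I"), _read(f, "Q")
        return [_read_value(f, etype) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

def tensor_sizes(f, file_size):
    """Return [(name, ggml_type_id, n_bytes)] sorted by data offset."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "I") < 2:
        raise ValueError("GGUF v1 is not handled by this sketch")
    n_tensors, n_kv = _read(f, "Q"), _read(f, "Q")
    alignment = 32                       # GGUF default when the key is absent
    for _ in range(n_kv):
        key = _read_str(f)
        value = _read_value(f, _read(f, "I"))
        if key == "general.alignment":
            alignment = value
    infos = []
    for _ in range(n_tensors):
        name = _read_str(f)
        n_dims = _read(f, "I")
        f.read(8 * n_dims)               # skip the uint64 dimensions
        ttype, offset = _read(f, "I"), _read(f, "Q")
        infos.append((offset, name, ttype))
    data_start = -(-f.tell() // alignment) * alignment
    infos.sort()
    ends = [off for off, _, _ in infos[1:]] + [file_size - data_start]
    return [(name, ttype, end - off)
            for (off, name, ttype), end in zip(infos, ends)]

def bytes_by_family(sizes):
    """Group by name with digit runs replaced by '*' (hypothetical grouping)."""
    totals = defaultdict(int)
    for name, ttype, n_bytes in sizes:
        totals[(re.sub(r"\d+", "*", name), ttype)] += n_bytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

For a real file, open it in binary mode and pass `os.path.getsize(path)` as `file_size`. The top entries of `bytes_by_family` answer the "which tensor families dominate bytes" question.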

Phase 1 — Offline feasibility first: Do not start with Metal/CUDA kernels.

Build an offline quantization quality harness that:

  1. Reads representative routed expert tensors from HF safetensors or GGUF.
  2. Samples multiple layers and experts.
  3. Handles gate/up/down separately.
  4. Quantizes with:
    • existing DS4 IQ2_XXS
    • proposed ternary/1.58-ish quantizer
  5. Measures:
    • plain MSE
    • imatrix-weighted MSE
    • per-layer/expert outliers

Question to answer: Can ternary/1.58-ish gate/up compete with IQ2_XXS under imatrix-weighted error?
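The two error metrics can be sketched as below. This assumes the imatrix supplies one non-negative importance value per weight column (as with llama.cpp-style activation statistics); how DS4 stores and applies its imatrix may differ. `worst_blocks` and its `(layer, expert)` keys are hypothetical helpers for the outlier report.

```python
def mse(original, dequantized):
    """Plain mean squared error between original and dequantized weights."""
    return sum((a - b) ** 2 for a, b in zip(original, dequantized)) / len(original)

def weighted_mse(original, dequantized, importance):
    """Squared error weighted per column, normalized by total importance."""
    total = sum(importance)
    if total == 0:
        raise ValueError("importance must not be all zero")
    num = sum(i * (a - b) ** 2
              for a, b, i in zip(original, dequantized, importance))
    return num / total

def worst_blocks(errors, k):
    """errors maps (layer, expert) -> weighted error; return the k largest."""
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Run both metrics on the same tensors for IQ2_XXS and for the ternary candidate. A large gap between plain and weighted error is itself a signal that the imatrix is doing real work for that tensor.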

Phase 2 — Add experimental ternary quantizer: Add an experimental format in gguf-tools/quants.*.

Conceptual encoding:

block:
  scale
  packed ternary codes {-1, 0, +1}
  optional group metadata

Start simple. Quality first, perfect packing later.
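A minimal reference quantizer for one block, as a sketch rather than the intended DS4 format. The assumptions: codes are chosen by a zero threshold expressed as a fraction of the (optionally importance-weighted) mean absolute weight, and the scale is the least-squares fit for those fixed codes. The function names and the default threshold of 0.5 are illustrative.

```python
def quantize_ternary(block, importance=None, threshold=0.5):
    """Return (scale, codes) for one block, with codes in {-1, 0, +1}."""
    n = len(block)
    imp = list(importance) if importance is not None else [1.0] * n
    total = sum(imp)
    if total == 0:                       # degenerate imatrix: fall back to uniform
        imp, total = [1.0] * n, float(n)
    mean_abs = sum(i * abs(w) for w, i in zip(block, imp)) / total
    cut = threshold * mean_abs           # weights at or below this become 0
    codes = [0 if abs(w) <= cut else (1 if w > 0 else -1) for w in block]
    # Least-squares scale for fixed codes: argmin_s sum_i imp_i * (w_i - s*c_i)^2
    den = sum(i * c * c for c, i in zip(codes, imp))
    num = sum(i * w * c for w, c, i in zip(block, codes, imp))
    return (num / den if den else 0.0), codes

def dequantize_ternary(scale, codes):
    """Reference dequantization: one multiply per weight."""
    return [scale * c for c in codes]
```

Sweeping `threshold` and the block length against the weighted error is the cheapest first experiment, since both knobs can be tested without touching the file format.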

Knobs:

  • block size
  • scale estimation
  • zero threshold
  • imatrix-weighted threshold
  • optional per-block fallback to Q2

Note: if initially packed as 2 bits/code, it may not produce real memory savings yet. That is acceptable for quality feasibility. Real 64GB feasibility eventually needs actual storage/runtime savings.
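For the eventual real packing, one common option is base-3 packing of five codes per byte, since 3^5 = 243 fits in 8 bits. That gives 1.6 bits per code, close to the log2(3) ≈ 1.585 limit that the "1.58-bit" name refers to. The sketch below is one possible layout, not a proposed DS4 on-disk format.

```python
def pack_trits(codes):
    """Pack codes in {-1, 0, +1} five per byte as base-3 digits."""
    out = bytearray()
    for start in range(0, len(codes), 5):
        value = 0
        # Build the byte so the first code of the chunk is the lowest digit.
        for c in reversed(codes[start:start + 5]):
            value = value * 3 + (c + 1)
        out.append(value)                # at most 3**5 - 1 = 242
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original number of codes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:n]
```

A 256-weight block takes 52 bytes of codes this way, versus 64 bytes at 2 bits per code. The divide/modulo decode is slower than bit shifts, so the runtime cost needs measuring before committing to it.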

Phase 3 — Dynamic rescue: Implement a budgeted rescue policy:

Start with routed gate/up blocks as ternary.
Compute imatrix-weighted error.
Promote worst blocks to Q2 until hitting quality threshold or size budget.
Keep routed down as Q2_K.

Track:

  • estimated size reduction
  • percentage ternary blocks
  • percentage rescued Q2 blocks
  • weighted error vs current DS4 q2-imatrix
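The rescue policy above can be sketched as a greedy loop over per-block errors. It assumes the weighted error of every block has already been computed for both encodings, and that blocks have a fixed byte cost in each encoding. All names are hypothetical, and the byte sizes in the usage note are illustrative only.

```python
def plan_rescue(ternary_err, q2_err, ternary_bytes, q2_bytes,
                byte_budget, err_target):
    """Promote the worst ternary blocks to Q2 until a stop condition is met.

    ternary_err / q2_err: per-block imatrix-weighted error for each encoding.
    Stops at err_target (quality threshold) or byte_budget (size budget).
    """
    n = len(ternary_err)
    total_err = sum(ternary_err)
    total_bytes = n * ternary_bytes
    extra = q2_bytes - ternary_bytes
    rescued = []
    # Worst blocks first, ranked by their ternary error.
    for b in sorted(range(n), key=lambda i: ternary_err[i], reverse=True):
        if total_err <= err_target:
            break                        # quality threshold reached
        if total_bytes + extra > byte_budget:
            break                        # size budget exhausted
        if q2_err[b] >= ternary_err[b]:
            continue                     # promotion would not help this block
        rescued.append(b)
        total_err -= ternary_err[b] - q2_err[b]
        total_bytes += extra
    return {
        "rescued": sorted(rescued),
        "pct_rescued": 100.0 * len(rescued) / n,
        "pct_ternary": 100.0 * (n - len(rescued)) / n,
        "bytes": total_bytes,
        "weighted_error": total_err,
    }
```

For example, `plan_rescue([4.0, 1.0, 9.0, 0.5], [1.0, 0.5, 2.0, 0.25], 54, 66, 240, 0.0)` promotes blocks 2 and 0 and then stops on the byte budget. The returned dictionary covers the tracked quantities except the comparison against current q2-imatrix, which needs the baseline error computed separately.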

Phase 4 — Generate experimental GGUF: Modify deepseek4-quantize policy:

routed expert gate/up: ternary or ternary+Q2 rescued blocks
routed expert down: Q2_K
non-expert tensors: same as q2-imatrix

Produce an artifact like:

DeepSeek-V4-Flash-Experimental-Q1GateUp-Q2Down.gguf

Record actual file size and tensor breakdown.

Phase 5 — Runtime support only after quality looks sane: Order:

  1. CPU/reference dequant for correctness.
  2. CUDA path if convenient.
  3. Metal path for final MacBook test.

Do not optimize first. Make it correct, then fast.

Phase 6 — Quality eval: Compare against DS4 q2-imatrix.

Minimum gates:

  • official continuation/logit tests if available
  • NLL / greedy continuation eval
  • coding prompts
  • JSON/tool-call reliability
  • long-context smoke test

Score:

  • size reduction
  • loadability
  • tokens/sec
  • quality degradation
  • failure modes

Phase 7 — 64GB Mac test: On the MacBook:

make
./ds4 -m experimental.gguf --ctx 4096 --nothink -n 64

Then increase:

ctx 8192
ctx 16384
ctx 32768

Measure:

  • does it load?
  • memory pressure?
  • Metal allocation failure?
  • tokens/sec?
  • qualitative behavior?
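The context sweep can be scripted so each step is recorded the same way. This sketch only reuses the flags from the command above (`-m`, `--ctx`, `--nothink`, `-n`) and assumes that invocation runs to completion without interactive input. It records exit code and wall time only, so memory pressure and Metal allocation failures still have to be observed separately.

```python
import subprocess
import time

def ds4_command(model, ctx, n_tokens=64):
    """Build the argv for one run, mirroring the manual command."""
    return ["./ds4", "-m", model, "--ctx", str(ctx),
            "--nothink", "-n", str(n_tokens)]

def sweep(model, contexts=(4096, 8192, 16384, 32768), run=subprocess.run):
    """Run each context size in order; stop at the first failure."""
    results = []
    for ctx in contexts:
        start = time.monotonic()
        proc = run(ds4_command(model, ctx), capture_output=True, text=True)
        results.append({
            "ctx": ctx,
            "exit_code": proc.returncode,
            "seconds": time.monotonic() - start,
        })
        if proc.returncode != 0:
            break                        # larger contexts will not fit either
    return results
```

Calling `sweep("experimental.gguf")` from the ds4 directory returns one record per attempted context size. The `run` parameter exists only so the loop can be exercised without the binary.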

Kill criteria:

  • ternary gate/up has much worse weighted error than IQ2_XXS
  • dynamic rescue needs so many Q2 blocks that size savings are tiny
  • hitting the size target requires aggressively shrinking routed down as well, and quality collapses when it is shrunk
  • real packing/runtime overhead erases theoretical savings
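The size-related kill criteria can be checked with back-of-envelope arithmetic before writing any packing code. The sketch assumes 256-weight blocks, a 16-bit scale per ternary block, 1.6 bits per code (five codes per byte), one flag bit per block to mark rescued blocks, and the upstream ggml figure of 2.0625 bits per weight for IQ2_XXS. DS4's actual formats and overheads may differ.

```python
IQ2_XXS_BPW = 2.0625   # upstream ggml: 66 bytes per 256 weights (assumed for DS4)

def ternary_bpw(block_size=256, code_bits=1.6, scale_bits=16):
    """Bits per weight for a ternary block including its scale."""
    return code_bits + scale_bits / block_size

def blended_bpw(rescue_fraction, block_size=256, flag_bits=1,
                q2_bpw=IQ2_XXS_BPW):
    """Average bits per weight when a fraction of blocks is promoted to Q2."""
    t = ternary_bpw(block_size)
    return ((1 - rescue_fraction) * t
            + rescue_fraction * q2_bpw
            + flag_bits / block_size)

def savings_vs_q2(rescue_fraction, q2_bpw=IQ2_XXS_BPW):
    """Fractional size reduction of gate/up relative to all-IQ2_XXS."""
    return 1 - blended_bpw(rescue_fraction, q2_bpw=q2_bpw) / q2_bpw
```

Under these assumptions gate/up shrinks by about 19% with no rescue and about 17% at a 10% rescue rate, and the saving falls roughly linearly as the rescue rate grows. Multiply by the gate/up share of total bytes from the inspector to get the whole-file effect.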

Continue criteria:

  • gate/up ternary saves meaningful bytes
  • Q2 rescue rate stays modest
  • eval degradation is small
  • estimated model+runtime memory fits under roughly 55GB

Recommended first concrete task: Write a GGUF/tensor size inspector and offline quant-error harness for routed expert tensors. Get hard numbers before kernel work.
