We want to investigate whether antirez/ds4 can be modified to produce a smaller-than-current-Q2 DeepSeek V4 Flash GGUF that may fit and run on a 64GB MacBook.
Context:
- DS4 currently uses a DeepSeek V4 Flash-specific quantization scheme.
- Current Q2 is not whole-model 2-bit. It aggressively quantizes routed MoE experts while leaving sensitive non-expert tensors at higher precision.
- Existing DS4 Q2 policy:
  - routed expert gate/up: IQ2_XXS
  - routed expert down: Q2_K
  - shared experts, router, projections, output, etc.: higher precision
  - preferred variant uses imatrix activation statistics.
- Unsloth has discussed dynamic 1.58-bit / ternary-style quantization. We want to borrow the useful idea — dynamic rescue of sensitive blocks — without throwing away DS4's architecture-specific insight.
Goal: Produce and evaluate an experimental quantization profile:
routed expert gate/up: dynamic 1.58-ish ternary, with Q2 rescue for sensitive blocks
routed expert down: Q2_K initially
everything else: unchanged from DS4 q2-imatrix
The product target is not maximum benchmark glory. The target is: can DeepSeek V4 Flash become realistically usable on a 64GB Mac with a practical context size and disk KV?
Recommended environment:
- Linux CPU memory-optimized machine
- 128GB system RAM minimum, 256GB preferred
- 500GB-1TB disk
- GPU not required for first phase
Important: system RAM is the main constraint for quantization, not VRAM.
Phase 0 — Baseline:
- Clone DS4:
  git clone https://github.com/antirez/ds4
  cd ds4
- Build tools:
  make cpu
  make -C gguf-tools
- Download current artifacts:
  ./download_model.sh q2-imatrix
  ./download_model.sh q4-imatrix
- Obtain the HF DeepSeek V4 Flash safetensors required by gguf-tools/deepseek4-quantize.
- Record:
  - q2-imatrix GGUF size
  - q4-imatrix GGUF size
  - tensor type breakdown
  - which tensor families dominate bytes
If no inspector exists, write a small GGUF tensor-size inspector.
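A minimal sketch of such an inspector, assuming the DS4 files follow the standard GGUF v2/v3 container layout (header, metadata key/values, tensor infos, aligned data section). Sizes are derived from consecutive tensor offsets so no per-type size table is needed; this means alignment padding is counted with the preceding tensor. The function name `inspect_gguf` and the "replace digits with N" family grouping are illustrative choices, not part of DS4.

```python
import re
import struct
from collections import defaultdict

# Byte widths of fixed-size GGUF metadata value types (type id -> size).
_FIXED = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_str(f):
    n = _read(f, "<Q")
    return f.read(n).decode("utf-8")


def _read_value(f, vtype):
    """Consume one metadata value; only uint32 and string values are returned."""
    if vtype in _FIXED:
        raw = f.read(_FIXED[vtype])
        return struct.unpack("<I", raw)[0] if vtype == 4 else None
    if vtype == _STRING:
        return _read_str(f)
    if vtype == _ARRAY:
        etype = _read(f, "<I")
        count = _read(f, "<Q")
        for _ in range(count):
            _read_value(f, etype)
        return None
    raise ValueError(f"unknown metadata type {vtype}")


def inspect_gguf(f, file_size):
    """Return (bytes by ggml type id, bytes by tensor-name family)."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "<I") < 2:
        raise ValueError("GGUF v1 uses a different header layout")
    n_tensors = _read(f, "<Q")
    n_kv = _read(f, "<Q")
    alignment = 32  # GGUF default when general.alignment is absent
    for _ in range(n_kv):
        key = _read_str(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.alignment" and value:
            alignment = value
    infos = []
    for _ in range(n_tensors):
        name = _read_str(f)
        n_dims = _read(f, "<I")
        for _ in range(n_dims):
            _read(f, "<Q")  # dimensions are not needed for offset-based sizes
        ttype = _read(f, "<I")
        offset = _read(f, "<Q")
        infos.append((offset, name, ttype))
    data_start = (f.tell() + alignment - 1) // alignment * alignment
    infos.sort()
    ends = [o for o, _, _ in infos[1:]] + [file_size - data_start]
    by_type, by_family = defaultdict(int), defaultdict(int)
    for (offset, name, ttype), end in zip(infos, ends):
        size = end - offset
        by_type[ttype] += size
        by_family[re.sub(r"\d+", "N", name)] += size
    return dict(by_type), dict(by_family)
```

To use it on a real file, open the GGUF in binary mode and pass the file object together with the file size on disk. The type keys are numeric ggml type ids; mapping them to names depends on the enum DS4 actually uses.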
Phase 1 — Offline feasibility first: Do not start with Metal/CUDA kernels.
Build an offline quantization quality harness that:
- Reads representative routed expert tensors from HF safetensors or GGUF.
- Samples multiple layers and experts.
- Handles gate/up/down separately.
- Quantizes with:
  - existing DS4 IQ2_XXS
  - proposed ternary/1.58-ish quantizer
- Measures:
- plain MSE
- imatrix-weighted MSE
- per-layer/expert outliers
Question to answer: Can ternary/1.58-ish gate/up compete with IQ2_XXS under imatrix-weighted error?
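The three measurements listed above can be sketched as follows. This assumes the imatrix supplies one non-negative importance value per weight position (for example, per input column, broadcast across rows); how DS4 actually stores and applies its imatrix is not confirmed here. The `factor=3.0` outlier cutoff is an arbitrary illustrative default.

```python
from statistics import median


def mse(orig, deq):
    """Plain mean squared error between original and dequantized weights."""
    return sum((a - b) ** 2 for a, b in zip(orig, deq)) / len(orig)


def weighted_mse(orig, deq, importance):
    """Squared error weighted by per-weight importance (assumed imatrix values)."""
    num = sum(w * (a - b) ** 2 for a, b, w in zip(orig, deq, importance))
    return num / sum(importance)


def outliers(errors, factor=3.0):
    """errors maps (layer, expert) -> error; return keys above factor x median."""
    cutoff = factor * median(errors.values())
    return sorted(k for k, e in errors.items() if e > cutoff)
```

Running both quantizers over the same sampled tensors and comparing `weighted_mse` per (layer, expert) gives a direct answer to the question, and `outliers` points at the experts most likely to need rescue.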
Phase 2 — Add experimental ternary quantizer:
Add an experimental format in gguf-tools/quants.*.
Conceptual encoding:
block:
  scale
  packed ternary codes {-1, 0, +1}
  optional group metadata
Start simple. Quality first, perfect packing later.
Knobs:
- block size
- scale estimation
- zero threshold
- imatrix-weighted threshold
- optional per-block fallback to Q2
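A minimal sketch of a per-block ternary quantizer exposing two of these knobs, the zero threshold and importance-weighted scale estimation. It is one possible starting point, not DS4's method: the threshold uses the `0.7 * mean(|x|)` heuristic from the Ternary Weight Networks line of work, and the scale is the weighted least-squares optimum for the chosen codes. The names `quantize_ternary`, `dequantize_ternary`, and `zero_factor` are hypothetical.

```python
def quantize_ternary(block, importance=None, zero_factor=0.7):
    """Quantize one block of floats to (scale, codes) with codes in {-1, 0, +1}."""
    n = len(block)
    w = importance if importance is not None else [1.0] * n
    # Zero threshold: values at or below it are coded as 0.
    threshold = zero_factor * sum(abs(x) for x in block) / n
    codes = [0 if abs(x) <= threshold else (1 if x > 0 else -1) for x in block]
    # Scale minimizing sum(w * (x - scale * code)^2) over the nonzero codes.
    den = sum(wi for wi, c in zip(w, codes) if c != 0)
    num = sum(wi * abs(x) for wi, x, c in zip(w, block, codes) if c != 0)
    scale = num / den if den > 0 else 0.0
    return scale, codes


def dequantize_ternary(scale, codes):
    """Reference dequantization: each weight is scale * code."""
    return [scale * c for c in codes]
```

Block size is controlled by how the caller slices the tensor. An imatrix-weighted threshold would replace the plain mean of `|x|` with a weighted one; that variant is left out here to keep the baseline simple.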
Note: if initially packed as 2 bits/code, it may not produce real memory savings yet. That is acceptable for quality feasibility. Real 64GB feasibility eventually needs actual storage/runtime savings.
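For reference, one simple way to get below 2 bits per code is base-3 packing: five ternary codes fit in one byte because 3^5 = 243 <= 256, giving 1.6 bits per code against the theoretical log2(3) ≈ 1.585. The sketch below illustrates that arithmetic only; the function names are hypothetical and a real format would also store the block scale.

```python
def pack_trits(codes):
    """Pack codes in {-1, 0, +1} into bytes, five codes per byte (base 3)."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        value = 0
        # Process the group back to front so the first code is the lowest digit.
        for c in reversed(codes[i:i + 5]):
            value = value * 3 + (c + 1)
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in a byte
    return bytes(out)


def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original number of codes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:n]
```

The decode path needs a divide/modulo or a 243-entry lookup table per byte, which is the kind of runtime overhead that should be measured before assuming the storage saving carries over to speed.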
Phase 3 — Dynamic rescue: Implement a budgeted rescue policy:
Start with routed gate/up blocks as ternary.
Compute imatrix-weighted error.
Promote worst blocks to Q2 until hitting quality threshold or size budget.
Keep routed down as Q2_K.
Track:
- estimated size reduction
- percentage ternary blocks
- percentage rescued Q2 blocks
- weighted error vs current DS4 q2-imatrix
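The budgeted policy above can be sketched as a greedy loop. This assumes each block already has two precomputed imatrix-weighted errors, one as ternary and one as Q2, and that every block has the same ternary and Q2 byte cost; `plan_rescue` and its parameters are hypothetical names. Blocks are promoted in order of largest error reduction, which is one reasonable reading of "worst blocks".

```python
def plan_rescue(blocks, ternary_bytes, q2_bytes, max_total_bytes, target_error):
    """blocks: list of (error_as_ternary, error_as_q2) per block.

    Promote blocks to Q2 until the summed error reaches target_error
    or the next promotion would exceed max_total_bytes.
    """
    delta = q2_bytes - ternary_bytes
    total_bytes = len(blocks) * ternary_bytes
    total_error = sum(err_t for err_t, _ in blocks)
    order = sorted(range(len(blocks)),
                   key=lambda i: blocks[i][0] - blocks[i][1], reverse=True)
    promoted = []
    for i in order:
        gain = blocks[i][0] - blocks[i][1]
        if total_error <= target_error or gain <= 0:
            break  # quality threshold met, or Q2 no longer helps
        if total_bytes + delta > max_total_bytes:
            break  # size budget exhausted
        promoted.append(i)
        total_error -= gain
        total_bytes += delta
    return {
        "promoted": promoted,
        "total_bytes": total_bytes,
        "total_error": total_error,
        "pct_rescued": 100.0 * len(promoted) / len(blocks) if blocks else 0.0,
    }
```

The returned dictionary covers the tracked quantities directly: `pct_rescued` and its complement give the Q2/ternary split, and `total_bytes` compared with an all-IQ2_XXS layout gives the estimated size reduction.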
Phase 4 — Generate experimental GGUF:
Modify deepseek4-quantize policy:
routed expert gate/up: ternary or ternary+Q2 rescued blocks
routed expert down: Q2_K
non-expert tensors: same as q2-imatrix
Produce an artifact like:
DeepSeek-V4-Flash-Experimental-Q1GateUp-Q2Down.gguf
Record actual file size and tensor breakdown.
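The policy change amounts to a name-to-type override on top of the existing q2-imatrix choices. The sketch below expresses that idea in isolation; the tensor name patterns (`ffn_gate_exps`, `ffn_up_exps`, `ffn_down_exps`) follow llama.cpp-style naming and are an assumption about DS4, and `TERNARY_RESCUE_Q2` is a placeholder label for the experimental format, not an existing type.

```python
import re

# Hypothetical override rules, checked in order; first match wins.
EXPERIMENTAL_RULES = [
    (re.compile(r"ffn_(gate|up)_exps"), "TERNARY_RESCUE_Q2"),
    (re.compile(r"ffn_down_exps"), "Q2_K"),
]


def choose_type(tensor_name, baseline_type):
    """Return the experimental type for routed experts, else the baseline type."""
    for pattern, qtype in EXPERIMENTAL_RULES:
        if pattern.search(tensor_name):
            return qtype
    return baseline_type
```

Keeping the override separate from the baseline table makes it easy to confirm that non-expert tensors are byte-for-byte identical to the q2-imatrix artifact.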
Phase 5 — Runtime support only after quality looks sane: Order:
- CPU/reference dequant for correctness.
- CUDA path if convenient.
- Metal path for final MacBook test.
Do not optimize first. Make it correct, then fast.
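One way to apply "correct, then fast" is to keep a dequantize-then-multiply reference and check every faster path against it. The sketch below does this for a single ternary block dot product; the fused variant stands in for what an optimized kernel might do and is an illustration, not DS4 code. All names are hypothetical.

```python
def dot_dequant(scale, codes, x):
    """Reference: dequantize each weight, then accumulate the dot product."""
    return sum(scale * c * xi for c, xi in zip(codes, x))


def dot_fused(scale, codes, x):
    """Faster shape: add/subtract activations by code, apply the scale once."""
    acc = 0.0
    for c, xi in zip(codes, x):
        if c > 0:
            acc += xi
        elif c < 0:
            acc -= xi
    return scale * acc


def matches_reference(scale, codes, x, rel_tol=1e-5):
    """True when the fused result agrees with the reference within rel_tol."""
    ref = dot_dequant(scale, codes, x)
    got = dot_fused(scale, codes, x)
    return abs(ref - got) <= rel_tol * max(1.0, abs(ref))
```

The same comparison pattern carries over to CUDA and Metal: run the kernel, run the CPU reference on the same block, and fail loudly on disagreement beyond tolerance.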
Phase 6 — Quality eval:
Compare against DS4 q2-imatrix.
Minimum gates:
- official continuation/logit tests if available
- NLL / greedy continuation eval
- coding prompts
- JSON/tool-call reliability
- long-context smoke test
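For the NLL and greedy continuation gate, the core arithmetic is small. The sketch below assumes both models can dump per-position logits for the same token sequence; how DS4 exposes logits is not specified here, so the inputs are plain lists of floats.

```python
import math


def token_nll(logits, target):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def mean_nll(logit_rows, targets):
    """Average NLL over positions; exp(mean_nll) is the perplexity."""
    return sum(token_nll(r, t) for r, t in zip(logit_rows, targets)) / len(targets)


def greedy_agreement(rows_a, rows_b):
    """Fraction of positions where both models pick the same argmax token."""
    same = sum(1 for a, b in zip(rows_a, rows_b)
               if a.index(max(a)) == b.index(max(b)))
    return same / len(rows_a)
```

Comparing `mean_nll` for the experimental GGUF against q2-imatrix on the same text, together with `greedy_agreement` between the two, gives two numbers that are cheap to track across quantizer iterations.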
Score:
- size reduction
- loadability
- tokens/sec
- quality degradation
- failure modes
Phase 7 — 64GB Mac test: On the MacBook:
make
./ds4 -m experimental.gguf --ctx 4096 --nothink -n 64
Then increase:
ctx 8192
ctx 16384
ctx 32768
Measure:
- does it load?
- memory pressure?
- Metal allocation failure?
- tokens/sec?
- qualitative behavior?
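The context sweep can be scripted so that each step is run and timed the same way. The sketch below reuses only the flags from the command shown above (`-m`, `--ctx`, `--nothink`, `-n`) and assumes the binary exits on its own after generating; if `ds4` needs a prompt or other arguments to run non-interactively, they would have to be added. It records load success and wall time only; memory pressure still has to be observed separately.

```python
import subprocess
import time


def build_command(model_path, ctx, n_tokens=64, binary="./ds4"):
    """Build the ds4 invocation for one context size."""
    return [binary, "-m", model_path, "--ctx", str(ctx),
            "--nothink", "-n", str(n_tokens)]


def run_sweep(model_path, contexts=(4096, 8192, 16384, 32768)):
    """Run each context size in order and stop at the first failure."""
    results = []
    for ctx in contexts:
        start = time.monotonic()
        proc = subprocess.run(build_command(model_path, ctx),
                              capture_output=True, text=True)
        results.append({
            "ctx": ctx,
            "ok": proc.returncode == 0,
            "seconds": time.monotonic() - start,
            "stderr_tail": proc.stderr[-500:],  # allocation errors often land here
        })
        if proc.returncode != 0:
            break
    return results
```

Calling `run_sweep("experimental.gguf")` from the `ds4` directory on the MacBook returns one record per context size, which is enough to see where loading or allocation first fails.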
Kill criteria:
- ternary gate/up has much worse weighted error than IQ2_XXS
- dynamic rescue needs so many Q2 blocks that size savings are tiny
- down must be aggressively shrunk and quality collapses
- real packing/runtime overhead erases theoretical savings
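The second and fourth criteria can be made concrete with a small bits-per-weight calculation. The defaults below are assumptions: 1.6625 bpw for ternary (five codes per byte plus one fp16 scale per 256 weights), 2.0625 bpw for IQ2_XXS as laid out in ggml (66 bytes per 256 weights), and the rescue format costing the same as IQ2_XXS. `flag_bpw` is a hypothetical per-weight cost for marking which blocks were rescued.

```python
def effective_bpw(rescue_rate, ternary_bpw=1.6625, rescue_bpw=2.0625, flag_bpw=0.0):
    """Average bits per weight for gate/up given the fraction of rescued blocks."""
    return (1 - rescue_rate) * ternary_bpw + rescue_rate * rescue_bpw + flag_bpw


def saving_vs_baseline(rescue_rate, baseline_bpw=2.0625, **kwargs):
    """Fractional size reduction of gate/up tensors relative to the baseline."""
    return 1 - effective_bpw(rescue_rate, **kwargs) / baseline_bpw
```

Under these assumptions the best case, with no rescued blocks, saves roughly 19% of gate/up bytes, and the saving shrinks linearly to zero as the rescue rate approaches 100%. That upper bound is worth comparing against the gate/up share of the total file before investing in kernels.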
Continue criteria:
- gate/up ternary saves meaningful bytes
- Q2 rescue rate stays modest
- eval degradation is small
- estimated model+runtime memory fits under roughly 55GB
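The last criterion is a simple sum that is worth writing down explicitly. The sketch below treats resident memory as model bytes plus in-memory KV bytes plus a fixed runtime overhead; all three inputs are placeholders to be replaced with measured values, and with disk KV the in-memory KV term may be much smaller than a per-token estimate suggests. GB here means 10^9 bytes.

```python
GB = 10 ** 9


def estimated_memory_gb(model_bytes, kv_bytes_per_token, ctx, overhead_bytes):
    """Rough resident memory estimate in GB for a given context size."""
    return (model_bytes + kv_bytes_per_token * ctx + overhead_bytes) / GB


def fits_budget(model_bytes, kv_bytes_per_token, ctx, overhead_bytes, budget_gb=55.0):
    """True when the estimate stays under the budget (default from the criterion above)."""
    return estimated_memory_gb(model_bytes, kv_bytes_per_token, ctx,
                               overhead_bytes) <= budget_gb
```

For example, with purely illustrative inputs of a 50 GB model, 100 kB of KV per token, and 1 GB of overhead, a 32768-token context estimates to about 54.3 GB and passes, while 65536 tokens does not.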
Recommended first concrete task: Write a GGUF/tensor size inspector and offline quant-error harness for routed expert tensors. Get hard numbers before kernel work.