DS4 64GB Mac Quantization Experiment Prompt

We want to investigate whether antirez/ds4 can be modified to produce a smaller-than-current-Q2 DeepSeek V4 Flash GGUF that may fit and run on a 64GB MacBook.

Context:

DS4 currently uses a DeepSeek V4 Flash-specific quantization scheme.
Current Q2 is not whole-model 2-bit. It aggressively quantizes routed MoE experts while leaving sensitive non-expert tensors at higher precision.
Existing DS4 Q2 policy:
- routed expert gate/up: IQ2_XXS
- routed expert down: Q2_K
- shared experts, router, projections, output, etc.: higher precision
- preferred variant uses imatrix activation statistics.
Unsloth has discussed dynamic 1.58-bit / ternary-style quantization. We want to borrow the useful idea — dynamic rescue of sensitive blocks — without throwing away DS4's architecture-specific insight.

Goal: Produce and evaluate an experimental quantization profile:

- routed expert gate/up: dynamic ternary (about 1.58 bits per weight, since log2(3) ≈ 1.585), with Q2 rescue for sensitive blocks
- routed expert down: Q2_K initially
- everything else: unchanged from DS4 q2-imatrix

The product target is not maximum benchmark glory. The target is: can DeepSeek V4 Flash become realistically usable on a 64GB Mac with a practical context size and disk KV?

Recommended environment:

Linux CPU memory-optimized machine
128GB system RAM minimum, 256GB preferred
500GB-1TB disk
GPU not required for first phase

Important: system RAM is the main constraint for quantization, not VRAM.

Phase 0 — Baseline:

Clone DS4:

git clone https://github.com/antirez/ds4
cd ds4

Build tools:

make cpu
make -C gguf-tools

Download current artifacts:

./download_model.sh q2-imatrix
./download_model.sh q4-imatrix

Obtain the HF DeepSeek V4 Flash safetensors required by gguf-tools/deepseek4-quantize.

Record:

q2-imatrix GGUF size
q4-imatrix GGUF size
tensor type breakdown
which tensor families dominate bytes

If no inspector exists, write a small GGUF tensor-size inspector.
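
A minimal sketch of such an inspector, assuming a GGUF v2 or v3 file with little-endian 64-bit length fields. Tensor sizes are derived from the gaps between data offsets, so they include alignment padding and do not need a per-type block-size table. The `family_bytes` grouping (replacing digits in tensor names with `N`) is a hypothetical heuristic, since the actual DS4 tensor names are not given here.

```python
import re
import struct
from collections import defaultdict

# struct codes for fixed-width GGUF metadata value types
_SCALAR = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}

def _read(f, fmt):
    size = struct.calcsize("<" + fmt)
    return struct.unpack("<" + fmt, f.read(size))[0]

def _read_str(f):
    n = _read(f, "Q")  # assumes GGUF v2+ (uint64 string lengths)
    return f.read(n).decode("utf-8", errors="replace")

def _read_value(f, vtype):
    if vtype == 8:                      # string
        return _read_str(f)
    if vtype == 9:                      # array: element type, count, elements
        etype = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, etype) for _ in range(count)]
    return _read(f, _SCALAR[vtype])

def tensor_sizes(path):
    """Return [(name, ggml_type_id, nbytes)] ordered by data offset."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "I")
        n_tensors = _read(f, "Q")
        n_kv = _read(f, "Q")
        meta = {}
        for _ in range(n_kv):
            key = _read_str(f)
            meta[key] = _read_value(f, _read(f, "I"))
        infos = []
        for _ in range(n_tensors):
            name = _read_str(f)
            n_dims = _read(f, "I")
            _dims = [_read(f, "Q") for _ in range(n_dims)]
            ggml_type = _read(f, "I")
            offset = _read(f, "Q")      # relative to start of tensor data
            infos.append((name, ggml_type, offset))
        align = meta.get("general.alignment", 32)
        data_start = (f.tell() + align - 1) // align * align
        f.seek(0, 2)
        data_len = f.tell() - data_start
    infos.sort(key=lambda t: t[2])
    ends = [t[2] for t in infos[1:]] + [data_len]
    # size = gap to the next tensor, so it includes alignment padding
    return [(name, t, end - off) for (name, t, off), end in zip(infos, ends)]

def family_bytes(sizes):
    """Group bytes by tensor name with digits replaced (hypothetical heuristic)."""
    totals = defaultdict(int)
    for name, _t, nbytes in sizes:
        totals[re.sub(r"\d+", "N", name)] += nbytes
    return dict(totals)
```

Sorting the `family_bytes` result by value should show directly which tensor families dominate bytes in the q2-imatrix and q4-imatrix files.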

Phase 1 — Offline feasibility first: Do not start with Metal/CUDA kernels.

Build an offline quantization quality harness that:

- Reads representative routed expert tensors from HF safetensors or GGUF.
- Samples multiple layers and experts.
- Handles gate/up/down separately.
- Quantizes with:
  - existing DS4 IQ2_XXS
  - proposed ternary/1.58-ish quantizer
- Measures:
  - plain MSE
  - imatrix-weighted MSE
  - per-layer/expert outliers

Question to answer: Can ternary/1.58-ish gate/up compete with IQ2_XXS under imatrix-weighted error?
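
The three measurements can be sketched as follows. This assumes the imatrix supplies one non-negative importance value per weight position in the row being compared (typically one per input column); how DS4 stores and applies its imatrix is not specified here, so the weighting is an assumption. Plain Python lists are used for clarity; a real harness would likely use an array library. The outlier rule (error above a multiple of the median) is a hypothetical choice.

```python
import statistics

def mse(orig, deq):
    """Plain mean squared error between original and dequantized weights."""
    return sum((a - b) ** 2 for a, b in zip(orig, deq)) / len(orig)

def weighted_mse(orig, deq, importance):
    """Importance-weighted MSE: sum(w * err^2) / sum(w)."""
    total_w = sum(importance)
    if total_w == 0:
        return mse(orig, deq)  # fall back to plain MSE when no statistics exist
    return sum(w * (a - b) ** 2
               for w, a, b in zip(importance, orig, deq)) / total_w

def flag_outliers(errors, factor=3.0):
    """errors: {(layer, expert): weighted_mse}. Flag entries above factor * median."""
    med = statistics.median(errors.values())
    return sorted(k for k, v in errors.items() if v > factor * med)
```

Running both quantizers over the same sampled tensors and comparing `weighted_mse` per (layer, expert) pair gives a direct answer to the question above.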

Phase 2 — Add experimental ternary quantizer: Add an experimental format in gguf-tools/quants.*.

Conceptual encoding:

block:
  scale
  packed ternary codes {-1, 0, +1}
  optional group metadata

Start simple. Quality first, perfect packing later.

Knobs:

block size
scale estimation
zero threshold
imatrix-weighted threshold
optional per-block fallback to Q2
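
A minimal sketch of a per-block ternary quantizer exposing these knobs. It is not DS4 code: the function names are hypothetical, the zero threshold is expressed as a fraction of the block's mean absolute value (0.7 is only a common heuristic starting point), and the scale is the least-squares optimum for the chosen codes under the optional importance weights.

```python
def quantize_ternary_block(x, weights=None, zero_threshold=0.7):
    """Quantize one block to (scale, codes) with codes in {-1, 0, +1}.

    weights: optional per-element importance (e.g. from an imatrix).
    zero_threshold: values with |x| <= zero_threshold * mean|x| become 0.
    """
    n = len(x)
    w = weights if weights is not None else [1.0] * n
    cut = zero_threshold * sum(abs(v) for v in x) / n
    codes = [0 if abs(v) <= cut else (1 if v > 0 else -1) for v in x]
    # Least-squares scale: minimizes sum(w * (x - scale * code)^2)
    num = sum(wi * v * c for wi, v, c in zip(w, x, codes))
    den = sum(wi * c * c for wi, c in zip(w, codes))
    scale = num / den if den > 0 else 0.0
    return scale, codes

def dequantize_ternary_block(scale, codes):
    """Reference dequantization: each weight is scale * code."""
    return [scale * c for c in codes]
```

An imatrix-weighted threshold could be added by scaling `cut` per element, and the per-block Q2 fallback by comparing this block's weighted error against the Q2 error for the same block.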

Note: if codes are initially packed at 2 bits each, the format is no smaller than the existing Q2 types and produces no real memory savings yet. That is acceptable for quality feasibility. Real 64GB feasibility eventually needs actual storage/runtime savings.
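
One way to get below 2 bits per code is base-3 packing: five ternary codes fit in one byte because 3^5 = 243 ≤ 256, which gives 1.6 bits per code before scale overhead. The sketch below is an illustration under that assumption, not a proposed on-disk layout; the 16-bit scale per block in `bits_per_weight` is likewise an assumption.

```python
import math

def pack_trits(codes):
    """Pack codes in {-1, 0, +1} five per byte, first code in the lowest base-3 digit."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        byte = 0
        for c in reversed(codes[i:i + 5]):
            byte = byte * 3 + (c + 1)
        out.append(byte)  # at most 242, so it always fits in one byte
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original number of codes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:n]

def bits_per_weight(block_size, scale_bits=16):
    """Effective bits per weight for one block: packed codes plus one scale."""
    return (math.ceil(block_size / 5) * 8 + scale_bits) / block_size
```

With a 256-weight block and a 16-bit scale this works out to 1.6875 bits per weight. Base-3 decoding costs a divide or table lookup per byte at runtime, which is part of the packing/runtime overhead the kill criteria ask about.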

Phase 3 — Dynamic rescue: Implement a budgeted rescue policy:

Start with routed gate/up blocks as ternary.
Compute imatrix-weighted error.
Promote the worst blocks to Q2 until the quality threshold is met or the size budget is exhausted.
Keep routed down as Q2_K.
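
The policy above can be sketched as a greedy loop over blocks sorted by ternary error. All names and the stopping rules are assumptions for illustration: the inputs are precomputed per-block imatrix-weighted errors for the ternary and Q2 encodings, and sizes are given in bits per block.

```python
def plan_rescue(ternary_err, q2_err, ternary_bits, q2_bits,
                size_budget_bits, target_err):
    """Greedily promote the worst ternary blocks to Q2.

    Stops when total error reaches target_err or the next promotion
    would exceed size_budget_bits. Returns (use_q2, total_err, total_bits).
    """
    n = len(ternary_err)
    use_q2 = [False] * n
    total_err = sum(ternary_err)
    total_bits = n * ternary_bits
    extra = q2_bits - ternary_bits
    # Worst blocks first, ranked by ternary error
    order = sorted(range(n), key=lambda i: ternary_err[i], reverse=True)
    for i in order:
        if total_err <= target_err:
            break
        if total_bits + extra > size_budget_bits:
            break
        if q2_err[i] >= ternary_err[i]:
            continue  # promotion would not reduce error for this block
        use_q2[i] = True
        total_err -= ternary_err[i] - q2_err[i]
        total_bits += extra
    return use_q2, total_err, total_bits
```

Ranking by error reduction per extra bit instead of raw ternary error is an alternative worth comparing once real numbers exist.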

Track:

estimated size reduction
percentage ternary blocks
percentage rescued Q2 blocks
weighted error vs current DS4 q2-imatrix
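
The size estimate is simple arithmetic once the rescue rate is known. The sketch below assumes the gate/up byte count comes from the tensor breakdown, that bits-per-weight figures are supplied by the caller, and that any per-block type flag needed to mark rescued blocks can be expressed as an extra bits-per-weight term; all parameter names are hypothetical.

```python
def estimate_gate_up_savings(gate_up_bytes, base_bpw, ternary_bpw,
                             rescue_bpw, rescue_frac, flag_bpw=0.0):
    """Estimate new size and savings for routed gate/up tensors.

    gate_up_bytes: current bytes of routed gate/up tensors at base_bpw.
    rescue_frac:   fraction of blocks promoted back to the rescue type.
    flag_bpw:      overhead for marking which blocks were rescued.
    """
    mixed_bpw = ((1.0 - rescue_frac) * ternary_bpw
                 + rescue_frac * rescue_bpw + flag_bpw)
    new_bytes = gate_up_bytes * mixed_bpw / base_bpw
    return {
        "mixed_bpw": mixed_bpw,
        "new_bytes": new_bytes,
        "saved_bytes": gate_up_bytes - new_bytes,
        "pct_ternary": 100.0 * (1.0 - rescue_frac),
        "pct_rescued": 100.0 * rescue_frac,
    }
```

If `saved_bytes` is small at the rescue rate needed to hit the quality threshold, that corresponds directly to the second kill criterion.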

Phase 4 — Generate experimental GGUF: Modify deepseek4-quantize policy:

routed expert gate/up: ternary or ternary+Q2 rescued blocks
routed expert down: Q2_K
non-expert tensors: same as q2-imatrix

Produce an artifact like:

DeepSeek-V4-Flash-Experimental-Q1GateUp-Q2Down.gguf

Record actual file size and tensor breakdown.

Phase 5 — Runtime support only after quality looks sane: Order:

CPU/reference dequant for correctness.
CUDA path if convenient.
Metal path for final MacBook test.

Do not optimize first. Make it correct, then fast.

Phase 6 — Quality eval: Compare against DS4 q2-imatrix.

Minimum gates:

official continuation/logit tests if available
NLL / greedy continuation eval
coding prompts
JSON/tool-call reliability
long-context smoke test
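
Two of these gates reduce to small scoring helpers. The sketch assumes per-token log-probabilities and raw model outputs have already been collected for both the baseline and the experimental GGUF; how to extract them from ds4 is not covered here, and the function names are hypothetical.

```python
import json
import math

def mean_nll(token_logprobs):
    """Mean negative log-likelihood over a sequence of natural-log token probabilities."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs):
    """Perplexity is exp(mean NLL)."""
    return math.exp(mean_nll(token_logprobs))

def json_valid_rate(outputs):
    """Fraction of model outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            pass
    return ok / len(outputs)
```

Comparing these numbers on identical prompts for q2-imatrix and the experimental file gives a degradation figure that is easier to track than qualitative impressions alone.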

Score:

size reduction
loadability
tokens/sec
quality degradation
failure modes

Phase 7 — 64GB Mac test: On the MacBook:

make
./ds4 -m experimental.gguf --ctx 4096 --nothink -n 64

Then increase:

ctx 8192
ctx 16384
ctx 32768

Measure:

does it load?
memory pressure?
Metal allocation failure?
tokens/sec?
qualitative behavior?
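
The context sweep can be scripted so each run is recorded the same way. This sketch only reuses the flags shown in the command above (`-m`, `--ctx`, `--nothink`, `-n`); it records exit code and wall time, and stops at the first failing context size. Memory pressure and Metal allocation failures still need to be observed separately, for example in Activity Monitor and in the captured stderr.

```python
import subprocess
import time

CTX_SIZES = [4096, 8192, 16384, 32768]

def build_cmd(model, ctx, n_tokens=64, binary="./ds4"):
    """Build the ds4 command line using only the flags from the manual test."""
    return [binary, "-m", model, "--ctx", str(ctx), "--nothink", "-n", str(n_tokens)]

def run_sweep(model, runner=subprocess.run):
    """Run each context size in order; stop at the first non-zero exit code."""
    results = []
    for ctx in CTX_SIZES:
        start = time.monotonic()
        proc = runner(build_cmd(model, ctx), capture_output=True, text=True)
        results.append({
            "ctx": ctx,
            "returncode": proc.returncode,
            "seconds": time.monotonic() - start,
        })
        if proc.returncode != 0:
            break
    return results

if __name__ == "__main__":
    for row in run_sweep("experimental.gguf"):
        print(row)
```

The `runner` parameter exists only so the sweep logic can be exercised without the ds4 binary present.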

Kill criteria:

ternary gate/up has much worse weighted error than IQ2_XXS
dynamic rescue needs so many Q2 blocks that size savings are tiny
fitting the size budget requires shrinking routed down below Q2_K, and quality collapses when it is
real packing/runtime overhead erases theoretical savings

Continue criteria:

gate/up ternary saves meaningful bytes
Q2 rescue rate stays modest
eval degradation is small
estimated model+runtime memory fits under roughly 55GB

Recommended first concrete task: Write a GGUF/tensor size inspector and offline quant-error harness for routed expert tensors. Get hard numbers before kernel work.

patleeman/ds4-q158-experiment-prompt.md
