Drop-in compression proxy for LLMs. 532 regex rules, bounded LRU cache, local LLM semantic stage, 15–75% token savings. Same Rust engine that runs in production behind app.nyquest.ai — now with a published Docker image.
No telemetry · No phone-home · Privacy mode by default
No "lite" version. No fork that lags behind. The compression engine that processes every request on the production app is the one in this repository — sanitized for public release with hardcoded keys and internal paths removed. v3.2.0 adds a RegexSet prefilter, zero-allocation rule misses via Cow<str>, a bounded LRU compression cache, per-rule hit telemetry, a regression suite of 47 passing tests, and a multi-stage Dockerfile auto-published to GHCR.
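The prefilter and zero-allocation-miss ideas can be illustrated with a small Python sketch. It is not the engine's Rust code: a literal-substring check stands in for `regex::RegexSet`, returning the untouched input object stands in for `Cow::Borrowed`, and the three rules are hypothetical examples rather than entries from the real rule set.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    literal: str          # lowercase substring that must be present for the pattern to match
    pattern: re.Pattern
    replacement: str


# Hypothetical rules for illustration only; the real rule set is not reproduced here.
RULES = [
    Rule("please_note", "please note", re.compile(r"\bplease note that\s+", re.I), ""),
    Rule("in_order_to", "in order to", re.compile(r"\bin order to\b", re.I), "to"),
    Rule("double_space", "  ", re.compile(r" {2,}"), " "),
]


def fail_closed(original: str, candidate: str) -> str:
    """Return the candidate only if it is not longer than the original."""
    return candidate if len(candidate) <= len(original) else original


def compress(text: str) -> str:
    out = text
    for rule in RULES:
        # Cheap prefilter: skip the regex entirely when its literal is absent.
        if rule.literal not in out.lower():
            continue
        out = rule.pattern.sub(rule.replacement, out)
    # On a full miss, `out` is still the same object as `text` (no new string built).
    return fail_closed(text, out)


print(compress("Please note that we meet in order to plan."))  # we meet to plan.
```

The final `fail_closed` call mirrors the length contract described in the feature notes: if a rewrite ever produced a longer string, the original would be returned instead.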
All three install paths below (Docker image, install script, manual build) converge on a running proxy at localhost:5400, with the health check, web dashboard, OpenAI-compatible endpoint, and Anthropic-native endpoint all available.
# Multi-stage Dockerfile: Rust builder + Debian-slim runtime, non-root user, tini PID-1
docker pull ghcr.io/nyquest-ai/nyquest-engine:3.2.0
docker run --rm --network host \
  -e NYQUEST_CACHE_CAPACITY=2048 \
  ghcr.io/nyquest-ai/nyquest-engine:3.2.0
# Or via docker-compose (includes healthcheck, log volume, host networking)
git clone https://github.com/Nyquest-ai/nyquest-rust-fullstack-3.2.0 ~/nyquest \
  && cd ~/nyquest \
  && bash scripts/run-local.sh
# Hardware check → deps → Rust → build → semantic stage → preflight → wizard → start
git clone https://github.com/Nyquest-ai/nyquest-rust-fullstack-3.2.0 ~/nyquest \
&& bash ~/nyquest/install.sh
cargo build --release
./target/release/nyquest preflight -v   # System check + tier recommendation
./target/release/nyquest install        # Interactive setup wizard
./target/release/nyquest serve          # Start proxy on :5400
nyquest install --defaults \
  --set port=8080 \
  --set compression_level=0.7 \
  --set semantic_enabled=true
After install: curl localhost:5400/health · dashboard at localhost:5400/dashboard · cache snapshot at localhost:5400/admin/v2/compression/cache
Every request flows through this pipeline. Stages 2 and 5 are toggleable; stages 1, 3, 4, 6 always run.
Stages run in microseconds (1, 3, 4) to a few hundred ms (5, GPU). Hot path stays under 1ms (p50 <1ms measured) · cached identical content returns in ~19µs (~3,730× speedup vs cold).
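The cache behind the ~19µs hit can be pictured with a minimal Python sketch of a bounded LRU. It assumes the key described in the feature notes (`sha256(content)` plus level, profile, and mode) and the documented defaults of 2048 entries and a 64KB entry cap. The class and method names are hypothetical, and the engine's Rust implementation may differ.

```python
import hashlib
from collections import OrderedDict


class CompressionCache:
    """Bounded LRU cache sketch; names here are illustrative, not the engine's API."""

    def __init__(self, capacity=2048, max_entry_size=64 * 1024):
        self.capacity = capacity
        self.max_entry_size = max_entry_size
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    @staticmethod
    def key(content, level, profile, mode):
        # Hash the content so the key stays small regardless of prompt length.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        return (digest, level, profile, mode)

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        return None

    def put(self, key, compressed):
        # Oversized results are never cached.
        if len(compressed.encode("utf-8")) > self.max_entry_size:
            return
        self._entries[key] = compressed
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
            self.evictions += 1
```

Because level, profile, and mode are part of the key, the same prompt compressed at 0.5 and at 1.0 occupies two separate entries.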
A single 0.0–1.0 knob. Server default is 0.7 (balanced). Override per-request with x-nyquest-level.
| Level | Strategy | Rules Active | Typical Savings |
|---|---|---|---|
| 0.0 | Pass-through (metrics only) | — | 0% |
| 0.2 | Filler removal | ~95 rules | 5–12% |
| 0.5 | Structural compression | ~250 rules | 15–28% |
| 0.7 | Default · balanced | ~390 rules | 18–33% |
| 1.0 | Aggressive + format + minify | 532 rules | 23–35% rules · up to 75% with semantic |
The preflight command (nyquest preflight -v) inspects your machine and recommends a tier. You can always force a different one in nyquest.yaml.
Pure regex pipeline. No semantic stage. Ridiculously fast hot path — sub-millisecond p50.
Rules + Qwen 2.5 1.5B on a small GPU. 200–350ms semantic latency. The production tier.
No GPU? CPU runs the semantic stage at 1–4 seconds per call. Same savings, slower path.
Measured live on the v3.2.0 Docker container (Ubuntu 24.04 host, --network host). Full breakdown in CHANGELOG.md.
| Test | Result | Detail |
|---|---|---|
| Reference prompt (319 tokens) @ L0.3 | 26.6% saved | 319 → 234 tokens |
| Reference prompt @ L0.5 | 28.2% saved | 319 → 229 tokens |
| Reference prompt @ L1.0 | 34.8% saved | 319 → 208 tokens |
| Aggregate across 8 realistic scenarios @ L1.0 | 22.6% mean | 2,201 → 1,703 tokens · Haiku 4.5 conservative profile |
| Semantic stage — system prompts | 55.9% saved | 200–350ms on GPU |
| Semantic stage — conversation history | 75% saved | 200–350ms on GPU |
| Health throughput (ab -n 5000 -c 1) | 8,266 req/s | p50 <1ms · host networking |
| Concurrent (ab -n 5000 -c 20) | 18,287 req/s | p50 1ms · p95 1ms · p99 2ms |
| Cache hit (warm replay) | ~3,730× speedup | ~70.9 ms cold → ~19 µs hit on ~500-char prompt |
| Failed requests | 0 / 10,000 | across single-thread + concurrent runs |
| Memory footprint | 48.5 MB RSS | 15 threads · Docker container |
| Test suite | 47 / 47 passing | 16 safety-invariant snapshots · 6 measurement reports · 25 role personas |
Compression rows measured against Claude Haiku 4.5, which uses the engine's conservative profile (raised in v3.2.0 with specific-marker matching so -mini / -flash / -nano / -light always win over family-name substrings). Frontier models (Opus, Sonnet, GPT-5.5, Grok-3) use the aggressive profile and typically see 6–10 percentage points higher savings on the same prompts.
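The marker-precedence rule can be sketched in a few lines of Python. The four markers come from the text above. The family substrings and the fallback to the conservative profile are assumptions made for illustration, not the engine's actual tables.

```python
# Markers named in the text; a match on any of these selects the conservative profile.
SPECIFIC_MARKERS = ("-mini", "-flash", "-nano", "-light")

# Hypothetical family substrings for models described as using the aggressive profile.
AGGRESSIVE_FAMILIES = ("opus", "sonnet", "gpt-5", "grok-3")


def select_profile(model: str) -> str:
    name = model.lower()
    # Specific markers are checked first, so they win over any family match.
    if any(marker in name for marker in SPECIFIC_MARKERS):
        return "conservative"
    if any(family in name for family in AGGRESSIVE_FAMILIES):
        return "aggressive"
    # Assumption: unknown or small models fall back to the conservative profile.
    return "conservative"


print(select_profile("gpt-5-mini"))         # conservative
print(select_profile("claude-sonnet-4-6"))  # aggressive
```

Checking markers before families is what keeps a name like `gpt-5-mini` from being treated as a frontier model just because it contains `gpt-5`.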
Lifetime stats from the live engine's /metrics endpoint, processing real traffic behind app.nyquest.ai:
Average is low because production traffic skews toward short prompts — the long-context savings are where the engine earns its keep.
Per-prompt compression on Haiku 4.5 (conservative profile). Each scenario is a verbose, professional system prompt of the kind real users send. Re-measured on v3.2.0 (tests/v320_compression_report.rs).
| Scenario | @ 0.5 | @ 0.7 | @ 1.0 |
|---|---|---|---|
| Customer Support | 28.1% | 33.4% | 34.7% |
| Legal Review | 14.1% | 16.0% | 27.5% |
| Data Science | 14.6% | 16.5% | 18.1% |
| Travel Planning | 22.9% | 23.3% | 25.9% |
| Code Review | 11.7% | 13.8% | 20.7% |
| Financial Advisor | 12.5% | 10.9% | 17.5% |
| HR Policy | 11.6% | 13.6% | 18.2% |
| Medical Education | 9.3% | 10.8% | 14.3% |
| AGGREGATE (mean) | 15.9% | 17.7% | 22.6% |
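A quick arithmetic check on the table, as a Python sketch: the AGGREGATE row at 1.0 appears to match the token-weighted figure from the benchmark table (2,201 → 1,703 tokens) rather than the unweighted average of the eight percentages. That reading is an inference from the published numbers, not something the report states.

```python
def pct_saved(before: int, after: int) -> float:
    """Percentage of tokens removed."""
    return (before - after) / before * 100


# Per-scenario savings at level 1.0, copied from the table above.
level_1_0 = [34.7, 27.5, 18.1, 25.9, 20.7, 17.5, 18.2, 14.3]

unweighted = sum(level_1_0) / len(level_1_0)
token_weighted = pct_saved(2201, 1703)

print(f"{unweighted:.1f}")      # 22.1
print(f"{token_weighted:.1f}")  # 22.6
```

The gap is small, but it means longer scenarios contribute more to the headline figure than shorter ones.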
Savings calculated against published provider rates. With BYOK + the engine in front of your app, these come out of your bill, not Nyquest's.
| Model | Price / 1M input | 100M tok/mo cost | Saved @ 0.7 | Saved @ 1.0 |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $100 | $17.70 | $22.60 |
| Claude Sonnet 4.6 | $3.00 | $300 | $53.10 | $67.80 |
| Claude Opus 4.8 | $5.00 | $500 | $88.50 | $113.00 |
| GPT-5.5 | $5.00 | $500 | $88.50 | $113.00 |
| Grok-3 | $3.00 | $300 | $53.10 | $67.80 |
Standard provider rates as of May 2026. Savings calculated against v3.2.0 aggregate mean (8-scenario suite, conservative profile). Add the semantic stage and the savings double or triple on long-context workloads.
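The table's arithmetic can be reproduced for other volumes or rates with a short Python sketch. The function name is hypothetical; the inputs below are the Sonnet row and the 17.7% aggregate at level 0.7.

```python
def monthly_savings(price_per_million: float, tokens_per_month: int, savings_rate: float):
    """Return (baseline input cost, dollars saved) for one month."""
    baseline = price_per_million * tokens_per_month / 1_000_000
    return baseline, baseline * savings_rate


baseline, saved = monthly_savings(3.00, 100_000_000, 0.177)
print(f"${baseline:.2f} baseline, ${saved:.2f} saved")  # $300.00 baseline, $53.10 saved
```

Only input tokens are modeled here, matching the table's "Price / 1M input" column.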
Point your existing OpenAI or Anthropic client at localhost:5400 instead of the cloud endpoint. Done. The proxy auto-detects which provider you're calling and translates between formats as needed.
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:5400/v1",
    api_key="your-api-key",
)
# Use any model; Nyquest auto-routes to the right provider
response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",  # or gpt-5-5, grok-3, gemini-3-flash
    messages=[{"role": "user", "content": "Hello!"}],
)
import anthropic
client = anthropic.Anthropic(
    base_url="http://localhost:5400",
    api_key="your-anthropic-key",
)
response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
# Anthropic native
curl -X POST http://localhost:5400/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":100,"messages":[{"role":"user","content":"Hello!"}]}'
# OpenAI-compatible
curl -X POST http://localhost:5400/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","messages":[{"role":"user","content":"Hello!"}]}'
# Override compression level for this request
curl -H "x-nyquest-level: 1.0" ...
# Route to a specific provider
curl -H "x-nyquest-base-url: https://api.openai.com" ...
# Enable OpenClaw agent mode (tool pruning, schema min, sliding window)
curl -H "x-nyquest-openclaw: true" ...
# Override response-compression age (turns left untouched)
curl -H "x-nyquest-response-age: 2" ...
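For Python clients, the same per-request overrides can be assembled as a plain header dictionary. The helper below is a hypothetical convenience, not part of Nyquest. It uses only the four header names shown above and assumes the level must stay within the documented 0.0–1.0 range.

```python
def nyquest_headers(level=None, base_url=None, openclaw=False, response_age=None):
    """Build per-request override headers; omitted options add no header."""
    headers = {}
    if level is not None:
        if not 0.0 <= level <= 1.0:
            raise ValueError("level must be between 0.0 and 1.0")
        headers["x-nyquest-level"] = str(level)
    if base_url is not None:
        headers["x-nyquest-base-url"] = base_url
    if openclaw:
        # Only the documented "true" value is ever sent.
        headers["x-nyquest-openclaw"] = "true"
    if response_age is not None:
        headers["x-nyquest-response-age"] = str(response_age)
    return headers


print(nyquest_headers(level=1.0, response_age=2))
# {'x-nyquest-level': '1.0', 'x-nyquest-response-age': '2'}
```

With the OpenAI Python SDK, a dictionary like this can be passed as `extra_headers=` on `client.chat.completions.create(...)`; whether the proxy accepts every combination of overrides is not covered here.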
Rules are compiled into a regex::RegexSet at startup. On every call, only candidate rules run; rules that can't match are skipped without a full text scan. Combined with Cow<str> zero-allocation misses, the hot path stays well under 1ms.
Bounded LRU cache keyed on sha256(content) + level + profile + mode. Default 2048 entries / 64KB max entry size. Tunable via NYQUEST_CACHE_CAPACITY / NYQUEST_CACHE_MAX_ENTRY_SIZE. Warm replay: ~3,730× speedup on a ~500-char prompt.
Specific model markers (-mini, -flash, -nano, -light) always win over family-name substrings.
tool_use and image blocks pass through byte-identical. Python tuples preserved. Four-digit dates preserved. Role declarations rewritten as Role: X. instead of stripped. Fail-closed length contract: returns original if compression produced a longer string.
/admin/v2/compression/rules exposes per-category and per-rule hit counts (no prompt content). /admin/v2/compression/cache exposes cache snapshot (entries, hits, misses, evictions). /dashboard for live request stats.
docker-compose.yml with --network host and a prod profile (read-only root FS, resource caps). Auto-published to ghcr.io/nyquest-ai/nyquest-engine on every v* tag.
16 safety-invariant snapshots (including tool_use pass-through and non-English text). 6 measurement reports. 25 role personas across 7 categories.
| Command | What it does |
|---|---|
| nyquest install | Interactive 11-section setup wizard |
| nyquest install --defaults | Headless mode with defaults (CI / Docker) |
| nyquest preflight -v | OS, CPU, RAM, disk, GPU, Ollama, network; verbose with tier recommendation |
| nyquest doctor | 10-point health check (config, port, keys, dashboard, logs, systemd) |
| nyquest config show | Display all resolved config values |
| nyquest config set `<key>` `<value>` | Set a single value (dot-notation) |
| nyquest serve | Start the proxy server (default if no command) |
v3.2.0 is current. main is what runs in production. Pull the Docker image from ghcr.io/nyquest-ai/nyquest-engine:3.2.0 or build from source. PRs welcome — see CONTRIBUTING.md.