Engine — Nyquest · Open Source Compression Proxy for LLMs

// Quickstart

Three ways to install

All three converge on a running proxy at localhost:5400. Health check, web dashboard, OpenAI-compatible endpoint, Anthropic-native endpoint — all there.

# Multi-stage Dockerfile — Rust builder + Debian-slim runtime, non-root user, tini PID-1
docker pull ghcr.io/nyquest-ai/nyquest-engine:3.2.0
docker run --rm --network host \
  -e NYQUEST_CACHE_CAPACITY=2048 \
  ghcr.io/nyquest-ai/nyquest-engine:3.2.0

# Or via docker-compose (includes healthcheck, log volume, host networking)
git clone https://github.com/Nyquest-ai/nyquest-rust-fullstack-3.2.0 ~/nyquest \
  && cd ~/nyquest \
  && bash scripts/run-local.sh

# Hardware check → deps → Rust → build → semantic stage → preflight → wizard → start
git clone https://github.com/Nyquest-ai/nyquest-rust-fullstack-3.2.0 ~/nyquest \
  && bash ~/nyquest/install.sh

cargo build --release
./target/release/nyquest preflight -v    # System check + tier recommendation
./target/release/nyquest install         # Interactive setup wizard
./target/release/nyquest serve           # Start proxy on :5400

nyquest install --defaults \
  --set port=8080 \
  --set compression_level=0.7 \
  --set semantic_enabled=true

After install: curl localhost:5400/health · dashboard at localhost:5400/dashboard · cache snapshot at localhost:5400/admin/v2/compression/cache

// Pipeline

Six stages, one binary

Every request flows through this pipeline. Stages 2 and 5 are toggleable; stages 1, 3, 4, 6 always run.

01Normalizededup, conflict resolution, speculation boundaries

02OpenClaw Agent Mode7-strategy agentic optimization (toggleable)

03Cache Reordersort for provider prefix-cache hits

04Rule Compress532 regex rules · RegexSet prefilter · LRU cache · telegraph · code minify · format optimizer

05Semantic LLMQwen 2.5 1.5B via Ollama · 56% system · 75% history (toggleable)

06Auto-scale + Forwarddynamic level based on context window · provider routing

Stages run in microseconds (1, 3, 4) to a few hundred ms (5, GPU). Hot path stays under 1ms (p50 <1ms measured) · cached identical content returns in ~19µs (~3,730× speedup vs cold).

// Compression Levels

Dial it from off to aggressive

A single 0.0–1.0 knob. Server default is 0.7 (balanced). Override per-request with x-nyquest-level.

Level	Strategy	Rules Active	Typical Savings
`0.0`	Pass-through (metrics only)	—	0%
`0.2`	Filler removal	~95 rules	5–12%
`0.5`	Structural compression	~250 rules	15–28%
`0.7`	Default · balanced	~390 rules	18–33%
`1.0`	Aggressive + format + minify	532 rules	23–35% rules · up to 75% with semantic

// Hardware Tiers

Run anywhere from a Pi to a workstation

The preflight command (nyquest preflight -v) inspects your machine and recommends a tier. You can always force a different one in nyquest.yaml.

TIER 1

Rules only

15–37% savings

Pure regex pipeline. No semantic stage. Ridiculously fast hot path — sub-millisecond p50.

2+ CPU cores
512 MB RAM
No GPU required
Runs on a Raspberry Pi 4

TIER 2 · RECOMMENDED

GPU semantic

15–75% savings

Rules + Qwen 2.5 1.5B on a small GPU. 200–350ms semantic latency. The production tier.

4+ CPU cores
6 GB RAM
2+ GB VRAM (GTX 1080+ class)
Ollama installed

TIER 3

CPU semantic

15–75% savings

No GPU? CPU runs the semantic stage at 1–4 seconds per call. Same savings, slower path.

4+ CPU cores
8 GB RAM
No GPU
Ollama installed

// Benchmarks

Production numbers, not synthetic

Measured live on the v3.2.0 Docker container (Ubuntu 24.04 host, --network host). Full breakdown in CHANGELOG.md.

Test	Result	Detail
Reference prompt (358 tokens) @ L0.3	26.6% saved	319 → 234 tokens
Reference prompt @ L0.5	28.2% saved	319 → 229 tokens
Reference prompt @ L1.0	34.8% saved	319 → 208 tokens
Aggregate across 8 realistic scenarios @ L1.0	22.6% mean	2,201 → 1,703 tokens · Haiku 4.5 conservative profile
Semantic stage — system prompts	55.9% saved	200–350ms on GPU
Semantic stage — conversation history	75% saved	200–350ms on GPU
Health throughput (ab -n 5000 -c 1)	8,266 req/s	p50 <1ms · host networking
Concurrent (ab -n 5000 -c 20)	18,287 req/s	p50 1ms · p95 1ms · p99 2ms
Cache hit (warm replay)	~3,730× speedup	~70.9 ms cold → ~19 µs hit on ~500-char prompt
Failed requests	0 / 10,000	across single-thread + concurrent runs
Memory footprint	48.5 MB RSS	15 threads · Docker container
Test suite	47 / 47 passing	16 safety-invariant snapshots · 6 measurement reports · 25 role personas

Compression rows measured against Claude Haiku 4.5, which uses the engine's conservative profile (raised in v3.2.0 with specific-marker matching so -mini / -flash / -nano / -light always win over family-name substrings). Frontier models (Opus, Sonnet, GPT-5.5, Grok-3) use the aggressive profile and typically see 6–10 percentage points higher savings on the same prompts.

// Production Cumulative

Lifetime stats from the live engine's /metrics endpoint, processing real traffic behind app.nyquest.ai:

18,085

Requests

6.1M

Tokens In

1.54M

Tokens Saved

10.9%

Avg Mixed

77.7%

Single-Req Max

Average is low because production traffic skews toward short prompts — the long-context savings are where the engine earns its keep.

// 8 Real-World Scenarios

Per-prompt compression on Haiku 4.5 (conservative profile). Each scenario is a verbose, professional system prompt of the kind real users send. Re-measured on v3.2.0 (tests/v320_compression_report.rs).

Scenario	@ 0.5	@ 0.7	@ 1.0
Customer Support	28.1%	33.4%	34.7%
Legal Review	14.1%	16.0%	27.5%
Data Science	14.6%	16.5%	18.1%
Travel Planning	22.9%	23.3%	25.9%
Code Review	11.7%	13.8%	20.7%
Financial Advisor	12.5%	10.9%	17.5%
HR Policy	11.6%	13.6%	18.2%
Medical Education	9.3%	10.8%	14.3%
AGGREGATE (mean)	15.9%	17.7%	22.6%

// Cost Impact at Scale

What 100M tokens/month looks like

Savings calculated against published provider rates. With BYOK + the engine in front of your app, these come out of your bill, not Nyquest's.

Model	Price / 1M input	100M tok/mo cost	Saved @ 0.7	Saved @ 1.0
Claude Haiku 4.5	$1.00	$100	$17.70	$22.60
Claude Sonnet 4.6	$3.00	$300	$53.10	$67.80
Claude Opus 4.8	$5.00	$500	$88.50	$113.00
GPT-5.5	$5.00	$500	$88.50	$113.00
Grok-3	$3.00	$300	$53.10	$67.80

Standard provider rates as of May 2026. Savings calculated against v3.2.0 aggregate mean (8-scenario suite, conservative profile). Add the semantic stage and the savings double or triple on long-context workloads.

// Integration

Drop-in. Nothing to change in your code.

Point your existing OpenAI or Anthropic client at localhost:5400 instead of the cloud endpoint. Done. The proxy auto-detects which provider you're calling and translates between formats as needed.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5400/v1",
    api_key="your-api-key",
)

# Use any model — Nyquest auto-routes to the right provider
response = client.chat.completions.create(
    model="claude-haiku-4-5-20251001",  # or gpt-5-5, grok-3, gemini-3-flash
    messages=[{"role": "user", "content": "Hello!"}],
)

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:5400",
    api_key="your-anthropic-key",
)

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)

# Anthropic native
curl -X POST http://localhost:5400/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":100,
       "messages":[{"role":"user","content":"Hello!"}]}'

# OpenAI-compatible
curl -X POST http://localhost:5400/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"model":"claude-haiku-4-5-20251001",
       "messages":[{"role":"user","content":"Hello!"}]}'

# Override compression level for this request
curl -H "x-nyquest-level: 1.0" ...

# Route to a specific provider
curl -H "x-nyquest-base-url: https://api.openai.com" ...

# Enable OpenClaw agent mode (tool pruning, schema min, sliding window)
curl -H "x-nyquest-openclaw: true" ...

# Override response-compression age (turns left untouched)
curl -H "x-nyquest-response-age: 2" ...

// Features

What ships in v3.2.0

532 rules, 19 categories

Filler removal, verbose phrases, imperative conversion, clause collapse, code minify (Python/JS/Bash), JSON→CSV, AI-output noise, and more. Tier-gated by compression level. Every rule has an atomic hit counter exposed via the admin API.

RegexSet prefilter

Each category builds a regex::RegexSet at startup. On every call, only candidate rules run — rules that can't match are skipped without a full text scan. Combined with Cow<str> zero-allocation misses, the hot path stays well under 1ms.

Bounded LRU cache

Compression results keyed by sha256(content) + level + profile + mode. Default 2048 entries / 64KB max entry size. Tunable via NYQUEST_CACHE_CAPACITY / NYQUEST_CACHE_MAX_ENTRY_SIZE. Warm replay: ~3,730× speedup on a ~500-char prompt.

Local LLM stage

Qwen 2.5 1.5B via Ollama. Condenses long system prompts (~56%) and conversation history (~75%) with provider-aware fallback to extractive compression on timeout.

OpenClaw Agent Mode

7-strategy pipeline for autonomous agents: tool result pruning, schema minimization, thought-block compression, error dedup, sliding window, cache injection, file-view condensation.

Per-model profiles

Aggressive (Opus, Sonnet, GPT-5), Balanced (unknown), Conservative (Haiku, Mini, Flash, Nano, Light). v3.2.0 fixed selector ordering so specific markers (-mini, -flash, -nano, -light) always win over family-name substrings.

Safety invariants

tool_use and image blocks pass through byte-identical. Python tuples preserved. Four-digit dates preserved. Role declarations rewritten as Role: X. instead of stripped. Fail-closed length contract: returns original if compression produced a longer string.

Admin observability

/admin/v2/compression/rules exposes per-category and per-rule hit counts (no prompt content). /admin/v2/compression/cache exposes cache snapshot (entries, hits, misses, evictions). /dashboard for live request stats.

Docker + GHCR

Multi-stage Dockerfile (Rust builder + Debian-slim runtime, non-root user, tini PID-1). docker-compose.yml with --network host and a prod profile (read-only root FS, resource caps). Auto-published to ghcr.io/nyquest-ai/nyquest-engine on every v* tag.

Test suite

47 / 47 passing. 16 regression snapshots for safety invariants (Python tuple syntax, four-digit dates, profile-detection ordering, byte-identical tool_use pass-through, non-English text). 6 measurement reports. 25 role personas across 7 categories.

Privacy mode

No telemetry. No phone-home. Prompt content is never logged — admin endpoints expose hit counts and regex sources, not data. Run it air-gapped.

Encrypted key vault

Optional AES-256-GCM encryption for API keys at rest. Keys never logged, never written to request logs, only decrypted in memory at dispatch.

// CLI

Operate from the command line

Command	What it does
`nyquest install`	Interactive 11-section setup wizard
`nyquest install --defaults`	Headless mode with defaults (CI / Docker)
`nyquest preflight -v`	OS, CPU, RAM, disk, GPU, Ollama, network — verbose with tier rec
`nyquest doctor`	10-point health check (config, port, keys, dashboard, logs, systemd)
`nyquest config show`	Display all resolved config values
`nyquest config set <key> <value>`	Set a single value (dot-notation)
`nyquest serve`	Start the proxy server (default if no command)

The engine that powers
Nyquest. Open source.

The exact code running behind app.nyquest.ai.

Three ways to install

Six stages, one binary

Dial it from off to aggressive

Run anywhere from a Pi to a workstation

Rules only

GPU semantic

CPU semantic

Production numbers, not synthetic

What 100M tokens/month looks like

Drop-in. Nothing to change in your code.

What ships in v3.2.0

Operate from the command line

Run it. Star it. Open an issue.

The engine that powersNyquest. Open source.

The exact code running behind app.nyquest.ai.

Three ways to install

Six stages, one binary

Dial it from off to aggressive

Run anywhere from a Pi to a workstation

Rules only

GPU semantic

CPU semantic

Production numbers, not synthetic

What 100M tokens/month looks like

Drop-in. Nothing to change in your code.

What ships in v3.2.0

Operate from the command line

Run it. Star it. Open an issue.

The engine that powers
Nyquest. Open source.