How It Works — Nyquest

// Request Lifecycle

From keystroke to response

Six stages on the request path. Each one is observable in the inspect panel.

01 · AUTH

Authenticate

Three credential types are accepted: a JWT for the web app, an nq-v1-... key for API integrations, or BYOK keys for direct provider access. Statefulness depends on the credential type: JWT requests carry memory, while API-key requests are stateless by design.
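As a rough illustration of how the credential type could decide statefulness, here is a minimal Python sketch. The shape-based detection, the `Credential` enum, and the choice to treat BYOK requests as stateless are all assumptions made for this example; the page only states that JWT requests carry memory and API-key requests do not.

```python
from enum import Enum


class Credential(Enum):
    JWT = "jwt"
    API_KEY = "api_key"
    BYOK = "byok"


def classify_credential(token: str) -> Credential:
    """Guess the credential type from its shape (illustrative heuristic only)."""
    if token.startswith("nq-v1-"):
        return Credential.API_KEY
    # A compact JWT is three non-empty segments separated by dots.
    parts = token.split(".")
    if len(parts) == 3 and all(parts):
        return Credential.JWT
    # Assumption: anything else is treated as a bring-your-own provider key.
    return Credential.BYOK


def uses_memory(cred: Credential) -> bool:
    """Only JWT (web app) requests carry memory; BYOK is assumed stateless here."""
    return cred is Credential.JWT
```

A real gateway would verify signatures and look keys up server-side rather than trusting the token's shape.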

02 · DISPATCH

Tier Selection

The AutoRouter classifies the request — complexity and domain (code, reasoning, creative, vision, general) — and picks a tier (T0–T3) and model. Model choice inside the tier is informed by judged-quality scorecards kept per model, per domain, refreshed continuously from graded real answers. Circuit breakers track provider health and fail over automatically.

03 · RECALL

Memory + Context

For app sessions, the memory system embeds your message and vector-searches your fact store for relevant prior context, scored by confidence. Sanitized facts are injected into system context. API-key requests skip this step.

04 · GROUND

Web Grounding (optional)

If the message benefits from current information, a detector triggers web search. Top results are fetched, content-type-gated and size-capped, then stitched into the prompt. Sources are recorded for provenance.

05 · COMPRESS

Compression Engine

The compression sidecar applies regex rules, prunes turn history above the context cap, and optionally invokes a semantic LLM stage on long system prompts. Token telemetry is captured. Tool schemas pass through untouched.

06 · STREAM

Dispatch + Stream

The optimized request goes upstream through the provider layer — OpenRouter, NVIDIA Build, or Nyquest's local tier — to the selected model. Tokens stream back via SSE, with response headers carrying telemetry: original tokens, compressed tokens, savings %, web-grounding sources, and the model that actually served the request.

// Routing Layer

The tier system

The dispatcher is what makes Nyquest a routing platform rather than a direct proxy. Every chat request is classified into a tier, with explicit defaults per plan.

Tier	Default Model	Available On	Use Case
T0	`local (llama 3.2)`	All accounts	The simplest requests are served on Nyquest's own hardware at $0 when the local tier is available — streaming included.
T1	`claude-haiku-4-5`	Pro	Default cheap path. Fast iteration, simple questions, quick drafts.
T2	`claude-sonnet-4-6`	Pro	Pro default for complex tasks. Strong reasoning, instruction-following.
T3	`claude-opus-4-6`	Pro	Heaviest workloads. Reasoning-intensive analysis, long-horizon planning.
Free chain	`rotating free models`	All accounts (free mode + failover)	A provider-diverse chain of currently-free models, live-probed around the clock and rebuilt automatically when the free ecosystem shifts.

Free accounts route to whichever catalog models are currently no-cost (the available set rotates) and can also use BYOK with their own provider keys. The dispatcher tiers above apply once an account becomes Pro, which happens on first wallet funding.

Circuit breakers

If a tier fails N times in a row (default 5, configurable), the dispatcher opens the breaker for 60 seconds and routes to the next tier. Health is tracked per-tier and visible in admin.
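The breaker behavior described above can be sketched in a few lines of Python. This is a minimal sketch, not the dispatcher's actual code: the `CircuitBreaker` class, the `pick_tier` helper, the injectable clock, and the ordered fallback list are hypothetical names and choices introduced for illustration. Only the default of 5 consecutive failures and the 60-second open window come from the text.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, closes after `cooldown_s`."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and reset the failure streak.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0  # any success breaks the consecutive-failure streak

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def pick_tier(order: List[str], breakers: Dict[str, CircuitBreaker]) -> Optional[str]:
    """Return the first tier in `order` whose breaker is closed (assumed fallback order)."""
    for tier in order:
        if not breakers[tier].is_open():
            return tier
    return None
```

Passing the clock in as a parameter keeps the sketch testable without real waiting; what "next tier" means in production is not specified here, so the caller supplies the order.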

Manual override

Pin any specific model from the picker and bypass the dispatcher entirely. The pinned model handles the whole conversation — no automatic switching.

No black box

Every request shows which tier was selected, which model actually served it, why, and how many tokens it cost. Visible in the inspect panel and usage analytics.

// Self-Learning Routing

A router that learns

Routing isn't a static table. A sample of real answers is independently graded by an AI judge on factuality, consistency, completeness, and hedging — and those grades flow straight back into which model gets your next request.

Per-domain scorecards

Quality is tracked per model, per domain — the model that writes the best code isn't the one that reasons best. Scores are recency-weighted with a two-week half-life, so silent provider upgrades (and regressions) show up in routing within days.
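One common way to get a two-week half-life is exponential decay on the age of each grade. The sketch below assumes that form; the function names and the `(score, age_days)` input shape are hypothetical, and the actual weighting used by the router may differ.

```python
from typing import List, Optional, Tuple

HALF_LIFE_DAYS = 14.0  # two-week half-life, as stated in the text


def recency_weight(age_days: float) -> float:
    """A grade loses half its weight every HALF_LIFE_DAYS."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def weighted_score(grades: List[Tuple[float, float]]) -> Optional[float]:
    """Recency-weighted mean of (score, age_days) pairs; None if there are no grades."""
    total_weight = sum(recency_weight(age) for _, age in grades)
    if total_weight == 0:
        return None
    return sum(score * recency_weight(age) for score, age in grades) / total_weight
```

With this weighting, a grade from today counts twice as much as one from two weeks ago, which is why a recent regression moves the score quickly.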

Bandit exploration

When a proven winner exists, you get the proven winner. A small, strictly capped exploration budget goes to the most promising under-tested candidates, allocated by upper-confidence-bound (UCB) bandit math — the same class of algorithm behind large recommender systems.
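For readers unfamiliar with UCB, the sketch below shows the textbook UCB1 rule in Python. It is an illustration under assumptions: the `stats` layout, the `explore` flag (standing in for the capped exploration budget, whose mechanics the page does not specify), and the exploration constant are all hypothetical.

```python
import math
from typing import Dict, Tuple


def ucb1(mean: float, pulls: int, total_pulls: int, c: float = math.sqrt(2)) -> float:
    """Mean quality plus an uncertainty bonus that shrinks as a model is tested more."""
    if pulls == 0:
        return math.inf  # untested candidates are tried first
    return mean + c * math.sqrt(math.log(total_pulls) / pulls)


def choose_model(stats: Dict[str, Tuple[float, int]], explore: bool) -> str:
    """stats maps model name -> (mean judged quality, number of graded answers)."""
    if not explore:
        # Exploit: serve the model with the best observed mean.
        return max(stats, key=lambda m: stats[m][0])
    total = sum(pulls for _, pulls in stats.values())
    return max(stats, key=lambda m: ucb1(stats[m][0], stats[m][1], total))
```

For example, with `{"a": (0.9, 100), "b": (0.8, 2)}` the exploit path picks `a`, while the explore path picks `b` because its two samples leave much more uncertainty about its true quality.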

Self-healing catalog

Models that upstream providers list but can't actually serve are detected on first failure and removed automatically — and re-enabled automatically when they come back. The free-model chain is live-probed every few hours. What's in the picker works.

// Memory Subsystem

Memory is separate from history

Conversation history is the literal messages. Memory is a curated, embedding-indexed store of facts extracted across conversations.

Extractor

After a conversation turn, an LLM-driven extractor distills noteworthy facts. Output is structured: subject, predicate, confidence, source.

Sanitizer

Before storage, facts are scrubbed for PII, medication names, and numerics that look like keys or credentials. Anything ambiguous gets dropped, not stored.
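A drop-on-doubt sanitizer could look roughly like the Python sketch below. The patterns, the tiny medication list, and the decision to drop the whole fact instead of redacting part of it are assumptions made for illustration; the production rules are not published on this page.

```python
import re
from typing import Optional

# Hypothetical patterns: email addresses, long opaque tokens, long digit runs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KEY_LIKE = re.compile(r"\b[A-Za-z0-9_-]{20,}\b")
LONG_DIGITS = re.compile(r"\d{9,}")
MEDICATIONS = {"ibuprofen", "metformin"}  # placeholder list, not the real one


def sanitize(fact: str) -> Optional[str]:
    """Return the fact unchanged if it looks safe, or None to drop it."""
    if EMAIL.search(fact) or KEY_LIKE.search(fact) or LONG_DIGITS.search(fact):
        return None
    words = re.findall(r"[a-z]+", fact.lower())
    if any(word in MEDICATIONS for word in words):
        return None
    return fact
```

Returning `None` instead of a partially redacted string mirrors the "dropped, not stored" policy: a fact that trips any check never reaches the store.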

Confidence scoring

Each fact carries a 0..1 score. Newer and more-corroborated facts rank higher. Stale or contradicted facts decay.
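The structured extractor output and the 0..1 score range can be pictured as a small record type. The field names below follow the text (subject, predicate, confidence, source); the `Fact` class itself and its validation are a hypothetical sketch, and the decay and corroboration logic is intentionally left out because its formula is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str       # who or what the fact is about
    predicate: str     # what is asserted about the subject
    confidence: float  # score in the range 0..1
    source: str        # where the fact came from, e.g. a conversation id

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

For example, `Fact("user", "prefers Rust", 0.8, "conv-1")` is accepted, while a confidence of 1.5 raises `ValueError`.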

Vector recall

At chat time, your message is embedded and the top-K facts are retrieved by vector search. Only the most relevant ones are injected into context.
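Top-K recall can be sketched with plain cosine similarity. This is a minimal sketch under assumptions: the in-memory list of `(text, embedding, confidence)` tuples, multiplying similarity by confidence to rank, and the default `k` are all illustrative choices, not the production design.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(query_vec: Sequence[float],
           facts: List[Tuple[str, Sequence[float], float]],
           k: int = 3) -> List[str]:
    """Return up to k fact texts, ranked by similarity times confidence."""
    scored = [(cosine(query_vec, emb) * conf, text) for text, emb, conf in facts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Facts with no positive relevance are never injected.
    return [text for score, text in scored[:k] if score > 0]
```

A real deployment would use an approximate-nearest-neighbor index instead of scanning every fact, but the ranking idea is the same.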

Provenance

Every recalled fact records what triggered its retrieval and which conversation it came from — auditable in the grounding-events log.

Stateless API

Memory recall fires only on JWT (web app) requests. API-key requests skip recall entirely — clean developer-API behavior.

// Web Grounding

Live data, on demand

When a chat needs current information, the web subsystem fetches actual page content — not cached snippets, not summaries from a middleman.

Your message
    │
    ▼
Detector    scores grounding need
    │
    ▼
Search      DuckDuckGo · top results
    │
    ▼
Fetcher     sandboxed · content-type gated · size-capped
    │
    ▼
Grounding   stitched into system prompt · sources logged
    │
    ▼
Response includes: x-nyquest-web-grounded · x-nyquest-web-sources · x-nyquest-web-urls

No third-party search API key required. Toggle web grounding per-request via header, or let the detector decide.
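The fetcher's content-type gate and size cap amount to a small filter applied to each fetched page. The Python sketch below assumes an allow-list of text media types and a 200,000-byte cap; both values, and the `gate` function name, are hypothetical and chosen only to make the idea concrete.

```python
from typing import Optional

ALLOWED_TYPES = ("text/html", "text/plain")  # assumed allow-list
MAX_BYTES = 200_000                          # assumed size cap


def gate(content_type: str, body: bytes) -> Optional[bytes]:
    """Reject non-text responses and truncate oversized bodies."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_TYPES:
        return None
    return body[:MAX_BYTES]
```

Gating before parsing keeps binaries and very large pages out of the prompt regardless of what the search results point to.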

// Under the Hood

The compression engine

Stage 05 of the lifecycle is a separate Rust service that runs as a sidecar to the backend. It's open-source and the same engine you can self-host.

Normalize

Whitespace cleanup, markdown noise removal, tokenizer-unfriendly character handling. Reduces entropy before any other stage runs.

Rule compression

532 regex rules across 19 categories: filler removal, verbose-phrase substitution, redundancy collapse, structural noise. A RegexSet prefilter skips rules that cannot match, and rule misses return a borrowed Cow<str> without allocating, which keeps the hot path sub-millisecond on typical prompts.
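The prefilter idea can be illustrated in Python, even though the engine itself is written in Rust. In this sketch the three rules are invented examples, and a single combined alternation stands in for Rust's RegexSet; Python has no direct equivalent, so this version only answers "does any rule match?" and is coarser than the real thing.

```python
import re

# Hypothetical rules in the spirit of "verbose-phrase substitution" and "filler removal".
RULES = [
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\bdue to the fact that\b", re.I), "because"),
    (re.compile(r"\bplease note that\s+", re.I), ""),
]

# One combined pattern used as a cheap "can anything match?" check.
PREFILTER = re.compile("|".join(f"(?:{p.pattern})" for p, _ in RULES), re.I)


def compress(text: str) -> str:
    """Apply every rule, but skip all work when the prefilter finds nothing."""
    if not PREFILTER.search(text):
        return text  # miss: the original string object is returned, no copy made
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

For instance, `compress("In order to win, due to the fact that it rains")` returns `"to win, because it rains"`, while a prompt with no matching phrase comes back as the same object.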

Context optimization

Trims older turns once history exceeds a token cap (50K tokens in production), while preserving the last 4 turns verbatim and honoring a 6-turn minimum. Critical instructions stay in scope.
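A trimming loop along these lines would satisfy the stated limits. This sketch rests on two loud assumptions: that whole turns are dropped oldest-first, and that the "6-turn minimum" means history is never cut below 6 turns. The `trim_history` name and the pluggable `count_tokens` callback are hypothetical.

```python
from typing import Callable, List


def trim_history(turns: List[str],
                 count_tokens: Callable[[str], int],
                 cap: int = 50_000,
                 keep_last: int = 4,
                 min_turns: int = 6) -> List[str]:
    """Drop the oldest turns until the history fits under `cap` tokens.

    Assumption: never go below max(keep_last, min_turns) turns, even if
    the remaining history is still over the cap.
    """
    kept = list(turns)  # oldest first; the input list is not modified
    floor = max(keep_last, min_turns)
    while len(kept) > floor and sum(count_tokens(t) for t in kept) > cap:
        kept.pop(0)
    return kept
```

The token counter is passed in because the right tokenizer depends on the target model; a whitespace split is enough for experimenting with the logic.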

Semantic stage

For long history or large system prompts, an LLM rewrites and condenses the text while preserving meaning. On timeout, the stage falls back to extractive (rule-based) compression.

Provider awareness

Per-LLM hooks adapt compression to the target provider's tokenizer and prompt conventions. Tool schemas, image blocks, and audio blocks pass through untouched.

Telemetry

Every call returns original tokens, compressed tokens, savings %, latency, and which stages fired. Visible to admin and inspect-panel users.
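The savings percentage follows directly from the two token counts. The small Python record below shows the arithmetic; the `Telemetry` class and its field names are hypothetical and do not reflect the engine's actual response schema.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    original_tokens: int
    compressed_tokens: int

    @property
    def savings_pct(self) -> float:
        """Share of tokens removed by compression, as a percentage."""
        if self.original_tokens == 0:
            return 0.0  # nothing to compress, so nothing saved
        saved = self.original_tokens - self.compressed_tokens
        return 100.0 * saved / self.original_tokens
```

For example, a prompt cut from 1,200 to 800 tokens reports savings of about 33.3%.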

Open-source at Nyquest-ai/nyquest-rust-fullstack-3.2.0 · Docker image at ghcr.io/nyquest-ai/nyquest-engine · MIT/Apache-2.0 · v3.2

// Performance

Engine benchmarks

Measured on the v3.2 Docker container with --network host. Sub-millisecond hot path; cached identical content returns in microseconds; semantic stage adds latency only on long inputs.

Test	Result	Detail
Health throughput (single thread)	8,266 req/s	p50 <1ms · host networking
Concurrent (20 workers, 5000 req)	18,287 req/s	p50 1ms · p95 1ms · p99 2ms
Cache hit (warm replay)	~3,730×	~70.9 ms cold → ~19 µs hit
Failed requests	0 / 10,000	across single + concurrent runs
Memory footprint	48.5 MB RSS	15 threads · Docker container
Production, live traffic (measured July 2026)	33% avg token savings	30-day window · grows 15% → 75% as conversations run 7 → 19 messages

// Reliability

How the pipeline reduces hallucination risk

Compression isn't just about cost. Cleaner inputs reduce the conditions that cause models to hallucinate.

Hallucinations increase when:

Prompts are ambiguous or instructions are buried in verbosity

Constraints are diffuse, spread across paragraphs

Context window fills with redundant or contradictory history

Critical instructions are silently evicted at token limits

The pipeline addresses each:

Normalization extracts and surfaces buried constraints

Conflict detection finds and resolves contradictions in rules

Redundancy pruning collapses repeated instructions

Context optimization keeps important instructions in window
