// Request Lifecycle

From keystroke to response

Six stages on the request path. Each one is observable in the inspect panel.

01 · AUTH

Authenticate

Three credential types are accepted: a JWT for the web app, an nq-v1-... key for API integrations, or BYOK keys for direct provider access. Statefulness differs by type: JWT requests carry memory, while API-key requests are stateless by design.
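As a rough illustration of how a credential's shape could map to statefulness, here is a minimal sketch. The `Credential` enum, `classify_credential`, and `carries_memory` are hypothetical names, and treating any unrecognized token as a BYOK key is an assumption, not documented behavior. Only the `nq-v1-` prefix and the JWT/API-key memory rule come from the text above.

```python
from enum import Enum


class Credential(Enum):
    JWT = "jwt"          # web app session
    API_KEY = "api_key"  # nq-v1-... integration key
    BYOK = "byok"        # caller-supplied provider key


def classify_credential(token: str) -> Credential:
    """Guess the credential type from its shape (illustrative heuristic only)."""
    if token.startswith("nq-v1-"):
        return Credential.API_KEY
    # A JWT in compact form is three non-empty segments joined by dots.
    parts = token.split(".")
    if len(parts) == 3 and all(parts):
        return Credential.JWT
    # Assumption: anything else is treated as a provider (BYOK) key.
    return Credential.BYOK


def carries_memory(cred: Credential) -> bool:
    """Memory applies to JWT requests only; API-key requests are stateless."""
    return cred is Credential.JWT
```

A real service would verify the JWT signature and look the API key up in a store; this sketch only shows the branching.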
02 · DISPATCH

Tier Selection

The dispatcher classifies the request and picks a tier (T1/T2/T3) based on your plan and the work involved. Per-tier circuit breakers track health; if a provider degrades, the request fails over to the next tier automatically.

03 · RECALL

Memory + Context

For app sessions, the memory system embeds your message and vector-searches your fact store for relevant prior context, scored by confidence. Sanitized facts are injected into system context. API-key requests skip this step.

04 · GROUND

Web Grounding (optional)

If the message benefits from current information, a detector triggers web search. Top results are fetched, content-type-gated and size-capped, then stitched into the prompt. Sources are recorded for provenance.

05 · COMPRESS

Compression Engine

The compression sidecar applies regex rules, prunes turn history above the context cap, and optionally invokes a semantic LLM stage on long system prompts. Token telemetry is captured. Tool schemas pass through untouched.

06 · STREAM

Dispatch + Stream

The optimized request goes upstream to the selected model. Tokens stream back via SSE, with response headers carrying telemetry: original tokens, compressed tokens, savings %, web-grounding sources, and the model that actually served the request.

// Routing Layer

The tier system

The dispatcher is what makes Nyquest a routing platform rather than a direct proxy. Every chat request is classified into a tier, with explicit defaults per plan.

Tier     Default Model       Available On              Use Case
T1       claude-haiku-4-5    Pro                       Default cheap path. Fast iteration, simple questions, quick drafts.
T2       claude-sonnet-4-6   Pro                       Pro default for complex tasks. Strong reasoning, instruction-following.
T3       claude-opus-4-7     Pro                       Heaviest workloads. Reasoning-intensive analysis, long-horizon planning.
T1-alt   gemini-flash-1-5    All accounts (failover)   Budget alternative. Used when T1 circuit-breaker trips on a provider outage.

Free accounts route to whichever models are currently no-cost in the catalog (the available set rotates), and can also bring their own provider keys (BYOK). The dispatcher tiers above apply once an account upgrades to Pro, which happens on first wallet funding.

Circuit breakers
If a tier fails N times in a row (default 5, configurable), the dispatcher opens the breaker for 60 seconds and routes to the next tier. Health is tracked per-tier and visible in admin.
Manual override
Pin any specific model from the picker and bypass the dispatcher entirely. The pinned model handles the whole conversation — no automatic switching.
No black box
Every request shows which tier was selected, which model actually served it, why, and how many tokens it cost. Visible in the inspect panel and usage analytics.
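The breaker behaviour described above can be sketched roughly as follows. The threshold (5) and the 60-second open window come from the text. The `Breaker` class, `pick_tier`, and the `TIER_ORDER` failover order are hypothetical names and assumptions made for illustration, not the dispatcher's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TIER_ORDER = ["T1", "T2", "T3"]  # assumed failover order


@dataclass
class Breaker:
    threshold: int = 5       # consecutive failures before opening
    cooldown: float = 60.0   # seconds the breaker stays open
    failures: int = 0
    opened_at: Optional[float] = None

    def is_open(self, now: float) -> bool:
        if self.opened_at is None:
            return False
        if now - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start counting again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures = 0  # any success resets the consecutive count
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now


def pick_tier(preferred: str, breakers: Dict[str, Breaker], now: float) -> Optional[str]:
    """Return the first tier at or after `preferred` whose breaker is closed."""
    start = TIER_ORDER.index(preferred)
    for tier in TIER_ORDER[start:]:
        if not breakers[tier].is_open(now):
            return tier
    return None  # every candidate tier is currently open
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the logic deterministic and easy to test.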
// Memory Subsystem

Memory is separate from history

Conversation history is the literal messages. Memory is a curated, embedding-indexed store of facts extracted across conversations.

Extractor
After a conversation turn, an LLM-driven extractor distills noteworthy facts. Output is structured: subject, predicate, confidence, source.
Sanitizer
Before storage, facts are scrubbed for PII, medication names, and numeric strings that look like keys or credentials. Anything ambiguous is dropped rather than stored.
Confidence scoring
Each fact carries a 0..1 score. Newer and more-corroborated facts rank higher. Stale or contradicted facts decay.
Vector recall
At chat time, your message is embedded and the top-K facts are retrieved by vector search. Only the most relevant ones are injected into context.
Provenance
Every recalled fact records what triggered its retrieval and which conversation it came from — auditable in the grounding-events log.
Stateless API
Memory recall fires only on JWT (web app) requests. API-key requests skip recall entirely — clean developer-API behavior.
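To make the recall step concrete, here is a minimal sketch of top-K retrieval over stored facts. The `Fact` fields mirror the extractor output listed above (subject, predicate, confidence, source). The `embedding` field, the `recall` function, the `min_score` cutoff, and the choice to multiply cosine similarity by confidence are all assumptions for illustration; the actual ranking formula is not documented here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Fact:
    subject: str
    predicate: str
    confidence: float            # 0..1 score
    source: str                  # conversation the fact came from
    embedding: Sequence[float]   # hypothetical: vector from an embedding model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall(query: Sequence[float], facts: List[Fact],
           k: int = 3, min_score: float = 0.2) -> List[Fact]:
    """Return up to k facts ranked by similarity weighted by confidence."""
    scored = [(cosine(query, f.embedding) * f.confidence, f) for f in facts]
    scored = [(s, f) for s, f in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:k]]
```

A production store would use an approximate nearest-neighbour index rather than a linear scan; the scan keeps the example self-contained.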
// Web Grounding

Live data, on demand

When a chat needs current information, the web subsystem fetches actual page content — not cached snippets, not summaries from a middleman.

Your message
Detector: scores grounding need
Search: DuckDuckGo · top results
Fetcher: sandboxed · content-type gated · size-capped
Grounding: stitched into system prompt · sources logged
Response: includes x-nyquest-web-grounded · x-nyquest-web-sources · x-nyquest-web-urls

No third-party search API key required. Toggle web grounding per-request via header, or let the detector decide.
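The fetcher's content-type gate and size cap, plus the stitching step, might look something like the sketch below. `ALLOWED_TYPES`, `MAX_BYTES`, `gate_page`, and `stitch` are hypothetical names and values chosen for illustration; the real allow-list, cap, and prompt layout are not stated in this document.

```python
from typing import Dict, Optional

ALLOWED_TYPES = ("text/html", "text/plain")  # hypothetical allow-list
MAX_BYTES = 200_000                          # hypothetical size cap


def gate_page(content_type: str, body: bytes) -> Optional[str]:
    """Return decoded, size-capped text, or None if the content type is rejected."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_TYPES:
        return None
    return body[:MAX_BYTES].decode("utf-8", errors="replace")


def stitch(system_prompt: str, pages: Dict[str, str]) -> str:
    """Append each fetched page to the system prompt, tagged with its URL."""
    blocks = [f"[source: {url}]\n{text}" for url, text in pages.items()]
    return "\n\n".join([system_prompt, *blocks])
```

Keeping the gate as a pure function of the response's content type and body makes it testable without any network access.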

// Under the Hood

The compression engine

Stage 05 of the lifecycle is a separate Rust service that runs as a sidecar to the backend. It's open-source and the same engine you can self-host.

Normalize
Whitespace cleanup, markdown noise removal, tokenizer-unfriendly character handling. Reduces entropy before any other stage runs.
Rule compression
532 regex rules across 19 categories: filler removal, verbose-phrase substitution, redundancy collapse, structural noise. A RegexSet prefilter, plus Cow<str> return values that avoid allocation when a rule does not match, keep the hot path sub-millisecond on typical prompts.
Context optimization
Trims older turns based on a token cap (50K on prod) while preserving the last 4 turns verbatim and a 6-turn minimum. Critical instructions stay in scope.
Semantic stage
For long history or large system prompts, an LLM rewrites and condenses while preserving meaning. Falls back to extractive (rule-based) on timeout.
Provider awareness
Per-LLM hooks adapt compression to the target provider's tokenizer and prompt conventions. Tool schemas, image blocks, and audio blocks pass through untouched.
Telemetry
Every call returns original tokens, compressed tokens, savings %, latency, and which stages fired. Visible to admin and inspect-panel users.
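A minimal sketch of the context-optimization and telemetry steps follows, written in Python for brevity even though the engine itself is Rust. It assumes the 6-turn minimum means history is never trimmed below six turns, and it only drops whole turns, so every surviving turn (including the last four) stays verbatim. `rough_tokens`, `trim_history`, and `savings_percent` are hypothetical names, and the word-count tokenizer is a crude stand-in for a real one.

```python
from typing import Callable, List


def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(turns: List[str], cap: int = 50_000, min_turns: int = 6,
                 count: Callable[[str], int] = rough_tokens) -> List[str]:
    """Drop the oldest turns until the total fits the cap or min_turns remain."""
    kept = list(turns)
    while len(kept) > min_turns and sum(count(t) for t in kept) > cap:
        kept.pop(0)  # oldest turn goes first
    return kept


def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Share of tokens removed, as a percentage of the original count."""
    if original_tokens == 0:
        return 0.0
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens
```

For example, a request reduced from 1,000 to 223 tokens reports a saving of 77.7%.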

Open-source at Nyquest-ai/nyquest-rust-fullstack-3.2.0 · Docker image at ghcr.io/nyquest-ai/nyquest-engine · MIT/Apache-2.0 · v3.2.0

// Performance

Engine benchmarks

Measured on the v3.2.0 Docker container with --network host. Sub-millisecond hot path; cached identical content returns in microseconds; semantic stage adds latency only on long inputs.

Test                                Result               Detail
Health throughput (single thread)   8,266 req/s          p50 <1ms · host networking
Concurrent (20 workers, 5000 req)   18,287 req/s         p50 1ms · p95 1ms · p99 2ms
Cache hit (warm replay)             ~3,730×              ~70.9 ms cold → ~19 µs hit
Failed requests                     0 / 10,000           across single + concurrent runs
Memory footprint                    48.5 MB RSS          15 threads · Docker container
Production cumulative               1.54M tokens saved   across 18,085+ real requests · max 77.7% on a single request
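As a sanity check on how the cache-hit multiplier relates to the two latencies quoted, the arithmetic can be reproduced directly. The helper names below are hypothetical. Because the inputs are themselves rounded (~70.9 ms, ~19 µs), the result only approximately matches the published ~3,730×, and the per-request average is an upper estimate since the request count is given as "18,085+".

```python
def cache_speedup(cold_ms: float, hit_us: float) -> float:
    """Ratio of cold latency to cache-hit latency, with units aligned to microseconds."""
    return (cold_ms * 1000.0) / hit_us


def avg_tokens_saved(total_saved: float, requests: int) -> float:
    """Mean tokens saved per request."""
    return total_saved / requests


speedup = cache_speedup(70.9, 19.0)            # about 3,732, in line with ~3,730x
per_request = avg_tokens_saved(1.54e6, 18_085)  # about 85 tokens per request
```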
// Reliability

How the pipeline reduces hallucination risk

Compression isn't just about cost. Cleaner inputs reduce the conditions that cause models to hallucinate.

Hallucinations increase when:

Prompts are ambiguous or instructions are buried in verbosity
Constraints are diffuse, spread across paragraphs
Context window fills with redundant or contradictory history
Critical instructions silently evict at token limits

The pipeline addresses each:

Normalization extracts and surfaces buried constraints
Conflict detection finds and resolves contradictions in rules
Redundancy pruning collapses repeated instructions
Context optimization keeps important instructions in window

// The Concept

Signal processing for language

Same principle as audio compression. Preserve the frequencies that carry meaning. Remove the noise that doesn't.

Figure: original signal vs. compressed signal.

See it in action

Open the app and send a message. The full pipeline runs on every request, and the inspect panel shows you what happened.
