Every request flows through the same pipeline: authenticate, classify, optionally enrich with memory and web context, compress, dispatch to the right model, stream the response back. Visible. Auditable. Overridable at every step.
// Request Lifecycle
From keystroke to response
Six stages on the request path. Each one is observable in the inspect panel.
01 · AUTH
Authenticate
Three credential types are accepted: a JWT for the web app, an nq-v1-... API key for integrations, or BYOK keys for direct provider access. Statefulness differs by credential type: JWT requests carry memory, while API-key requests are stateless by design.
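A minimal sketch of how the three credential types could be told apart and mapped to statefulness. Only the nq-v1- prefix and the JWT/API-key memory behavior come from the text above; the names `AuthContext` and `classify_credential`, the shape-based JWT check, and the treatment of BYOK keys as stateless are assumptions for illustration. No signature verification is shown.

```python
from dataclasses import dataclass

API_KEY_PREFIX = "nq-v1-"  # prefix from the text; the rest of the key format is not specified


@dataclass(frozen=True)
class AuthContext:
    kind: str         # "jwt", "api_key", or "byok"
    use_memory: bool  # per the text, only JWT (web app) requests carry memory


def classify_credential(token: str) -> AuthContext:
    """Guess the credential type from its shape (hypothetical heuristic, not verification)."""
    if token.startswith(API_KEY_PREFIX):
        return AuthContext(kind="api_key", use_memory=False)
    # A compact JWT has three non-empty, dot-separated segments.
    parts = token.split(".")
    if len(parts) == 3 and all(parts):
        return AuthContext(kind="jwt", use_memory=True)
    # Assumption: anything else is a BYOK provider key, treated here as stateless.
    return AuthContext(kind="byok", use_memory=False)
```

A real gateway would verify the JWT signature and look the API key up before trusting either; the point here is only that the memory decision is made once, at the auth stage.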
02 · DISPATCH
Tier Selection
The dispatcher classifies the request and picks a tier (T1/T2/T3) based on your plan and the work involved. Per-tier circuit breakers track health; if a provider degrades, the request fails over to the next tier automatically.
03 · RECALL
Memory + Context
For app sessions, the memory system embeds your message and vector-searches your fact store for relevant prior context, scored by confidence. Sanitized facts are injected into system context. API-key requests skip this step.
04 · GROUND
Web Grounding (optional)
If the message benefits from current information, a detector triggers web search. Top results are fetched, content-type-gated and size-capped, then stitched into the prompt. Sources are recorded for provenance.
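The content-type gate and size cap on fetched pages can be sketched as one small function. The allowlist, the 200,000-byte cap, and the name `gate_page` are assumptions; the text does not state the actual values.

```python
from typing import Optional

ALLOWED_TYPES = ("text/html", "text/plain")  # assumed allowlist
MAX_BYTES = 200_000                          # assumed size cap


def gate_page(content_type: str, body: bytes, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Return page text truncated to the cap, or None if the content type is not allowed."""
    # Drop parameters such as "; charset=utf-8" before comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_TYPES:
        return None
    # Truncate on bytes first, then decode leniently so a split character cannot raise.
    return body[:max_bytes].decode("utf-8", errors="replace")
```

Pages that return None would simply be left out of the grounding context.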
05 · COMPRESS
Compression Engine
The compression sidecar applies regex rules, prunes turn history above the context cap, and optionally invokes a semantic LLM stage on long system prompts. Token telemetry is captured. Tool schemas pass through untouched.
06 · STREAM
Dispatch + Stream
The optimized request goes upstream to the selected model. Tokens stream back via SSE, with response headers carrying telemetry: original tokens, compressed tokens, savings %, web-grounding sources, and the model that actually served the request.
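On the client side, consuming the SSE stream comes down to collecting `data:` lines until a blank line ends the event. The sketch below follows the standard SSE framing; the payload format inside each event is not specified in the text, so it is returned as a raw string, and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # Lenient choice: also flush a final event that lacks a trailing blank line.
    if buffer:
        yield "\n".join(buffer)
```

The telemetry headers arrive with the HTTP response itself, so they can be read before the first event is parsed.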
// Routing Layer
The tier system
The dispatcher is what makes Nyquest a routing platform rather than a direct proxy. Every chat request is classified to a tier, with explicit defaults per plan.
| Tier | Default Model | Available On | Use Case |
| --- | --- | --- | --- |
| T1 | claude-haiku-4-5 | Pro | Default cheap path. Fast iteration, simple questions, quick drafts. |
| T2 | claude-sonnet-4-6 | Pro | Pro default for complex tasks. Strong reasoning, instruction-following. |
| T3 | | | Budget alternative. Used when T1 circuit-breaker trips on a provider outage. |
Free accounts route to whichever models are currently no-cost in the catalog (the available set rotates), and can also use BYOK with their own provider keys. The dispatcher tiers above apply once an account becomes Pro on first wallet funding.
Circuit breakers
If a tier fails N times in a row (default 5, configurable), the dispatcher opens the breaker for 60 seconds and routes to the next tier. Health is tracked per-tier and visible in admin.
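The breaker behavior described here (5 consecutive failures, 60 seconds open, route to the next tier) can be sketched as follows. The class and function names, the injectable clock, and the T1, T2, T3 failover order are assumptions for illustration, not the dispatcher's actual code.

```python
import time
from typing import Callable, Optional


class TierBreaker:
    """Opens after `threshold` consecutive failures and stays open for `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0  # any success resets the consecutive-failure count
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def pick_tier(preferred: str, order: list[str],
              breakers: dict[str, TierBreaker]) -> Optional[str]:
    """Return the preferred tier, or the next tier in `order` whose breaker is closed."""
    for tier in order[order.index(preferred):]:
        if not breakers[tier].is_open():
            return tier
    return None  # every remaining tier is unhealthy
```

Passing the clock in as a parameter keeps the cooldown testable without sleeping for 60 seconds.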
Manual override
Pin any specific model from the picker and bypass the dispatcher entirely. The pinned model handles the whole conversation — no automatic switching.
No black box
Every request shows which tier was selected, which model actually served it, why, and how many tokens it cost. Visible in the inspect panel and usage analytics.
// Memory Subsystem
Memory is separate from history
Conversation history is the literal messages. Memory is a curated, embedding-indexed store of facts extracted across conversations.
Extractor
After a conversation turn, an LLM-driven extractor distills noteworthy facts. Output is structured: subject, predicate, confidence, source.
Sanitizer
Before storage, facts are scrubbed for PII, medication names, and numerics that look like keys or credentials. Anything ambiguous gets dropped, not stored.
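The drop-rather-than-redact policy can be illustrated with a few regex checks. These patterns are made up for the example and are far cruder than a production sanitizer; medication names would need a lexicon and are left out. `sanitize_fact` is a hypothetical name.

```python
import re
from typing import Optional

# Illustrative patterns only; the real sanitizer's rules are not given in the text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
KEY_LIKE = re.compile(r"\b[A-Za-z0-9_-]{24,}\b")  # long opaque tokens
LONG_NUMBER = re.compile(r"\b\d{9,}\b")            # account-number-like digit runs


def sanitize_fact(text: str) -> Optional[str]:
    """Return the fact unchanged if it looks safe, or None to drop it entirely."""
    for pattern in (EMAIL, PHONE, KEY_LIKE, LONG_NUMBER):
        if pattern.search(text):
            return None  # ambiguous or sensitive: drop, do not store a redacted copy
    return text
```

Returning None instead of a masked string mirrors the rule that anything ambiguous is never written to the fact store.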
Confidence scoring
Each fact carries a 0..1 score. Newer and more-corroborated facts rank higher. Stale or contradicted facts decay.
Vector recall
At chat time, your message is embedded and the top-K facts are retrieved by vector search. Only the most relevant ones are injected into context.
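A brute-force sketch of top-K recall. Weighting cosine similarity by the fact's confidence, the 0.2 relevance floor, and the `Fact` and `recall` names are assumptions; a production system would use a vector index rather than scanning every fact.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    confidence: float       # 0..1 score
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query: list[float], facts: list[Fact],
           k: int = 3, min_score: float = 0.2) -> list[Fact]:
    """Rank facts by similarity weighted by confidence; keep the top k above a floor."""
    scored = [(cosine(query, f.embedding) * f.confidence, f) for f in facts]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:k]]
```

The floor is what keeps irrelevant facts out of context even when fewer than k good matches exist.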
Provenance
Every recalled fact records what triggered its retrieval and which conversation it came from — auditable in the grounding-events log.
Stateless API
Memory recall fires only on JWT (web app) requests. API-key requests skip recall entirely — clean developer-API behavior.
// Web Grounding
Live data, on demand
When a chat needs current information, the web subsystem fetches actual page content — not cached snippets, not summaries from a middleman.
Your message
  │
  ▼
Detector: scores grounding need
  │
  ▼
Search: DuckDuckGo · top results
  │
  ▼
Fetcher: sandboxed · content-type gated · size-capped
  │
  ▼
Grounding: stitched into system prompt · sources logged
  │
  ▼
Response includes: x-nyquest-web-grounded · x-nyquest-web-sources · x-nyquest-web-urls
No third-party search API key required. Toggle web grounding per-request via header, or let the detector decide.
// Under the Hood
The compression engine
Stage 05 of the lifecycle is a separate Rust service that runs as a sidecar to the backend. It's open-source and the same engine you can self-host.
Normalize
Whitespace cleanup, markdown noise removal, tokenizer-unfriendly character handling. Reduces entropy before any other stage runs.
Rule compression
532 regex rules across 19 categories: filler removal, verbose-phrase substitution, redundancy collapse, structural noise. A RegexSet prefilter skips rules that cannot match, and rule misses return a borrowed Cow<str> without allocating, which keeps the hot path sub-millisecond on typical prompts.
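The prefilter idea can be shown with a Python analog. The engine itself is Rust, so this is not its code: the three rules are invented stand-ins, a single combined pattern plays the role of the RegexSet, and returning the input object unchanged plays the role of a borrowed Cow<str>.

```python
import re

# Three made-up rules standing in for the real rule set.
RULES = [
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
    (re.compile(r"\bplease note that\s+", re.IGNORECASE), ""),
    (re.compile(r"[ \t]{2,}"), " "),
]

# One combined pattern acts as the prefilter: if it finds nothing, no rule can match.
PREFILTER = re.compile("|".join(f"(?:{p.pattern})" for p, _ in RULES), re.IGNORECASE)


def compress(text: str) -> str:
    if PREFILTER.search(text) is None:
        return text  # miss: hand back the same object, no new string is built
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

A Rust RegexSet additionally reports which patterns matched, so only those rules need to run; the combined pattern here only answers whether any rule matched.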
Context optimization
Trims older turns based on a token cap (50K on prod) while preserving the last 4 turns verbatim and a 6-turn minimum. Critical instructions stay in scope.
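One way to read the trimming rule, as a sketch: drop the oldest turns until the history fits the cap, never go below the turn minimum, and flag the most recent turns as verbatim so later stages leave them alone. This interpretation, the function name, and the pluggable `count_tokens` callback are assumptions.

```python
from typing import Callable


def trim_history(turns: list[str], cap_tokens: int, count_tokens: Callable[[str], int],
                 keep_last: int = 4, min_turns: int = 6) -> list[tuple[str, bool]]:
    """Drop oldest turns until under the cap; return (turn, is_verbatim) pairs.

    Stops at `min_turns` even if the history is still over the cap. The last
    `keep_last` surviving turns are flagged so other stages do not rewrite them.
    """
    kept = list(turns)
    total = sum(count_tokens(t) for t in kept)
    while total > cap_tokens and len(kept) > max(keep_last, min_turns):
        total -= count_tokens(kept.pop(0))  # oldest turn goes first
    first_verbatim = max(0, len(kept) - keep_last)
    return [(turn, i >= first_verbatim) for i, turn in enumerate(kept)]
```

With the production cap this would be called as `trim_history(turns, 50_000, count_tokens)`, where `count_tokens` wraps whatever tokenizer matches the target model.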
Semantic stage
For long history or large system prompts, an LLM rewrites and condenses while preserving meaning. Falls back to extractive (rule-based) on timeout.
Provider awareness
Per-LLM hooks adapt compression to the target provider's tokenizer and prompt conventions. Tool schemas, image blocks, and audio blocks pass through untouched.
Telemetry
Every call returns original tokens, compressed tokens, savings %, latency, and which stages fired. Visible to admin and inspect-panel users.
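The savings figure is presumably the share of original tokens removed. A small sketch under that assumption (the function name and the one-decimal rounding are illustrative):

```python
def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed by compression, rounded to one decimal place."""
    if original_tokens <= 0:
        return 0.0  # avoid dividing by zero on empty input
    return round(100.0 * (original_tokens - compressed_tokens) / original_tokens, 1)
```

For example, a 1,000-token prompt compressed to 223 tokens gives `savings_percent(1000, 223)`, which evaluates to 77.7.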
Measured on the v3.2.0 Docker container with --network host. Sub-millisecond hot path; cached identical content returns in microseconds; semantic stage adds latency only on long inputs.
| Test | Result | Detail |
| --- | --- | --- |
| Health throughput (single thread) | 8,266 req/s | p50 <1ms · host networking |
| Concurrent (20 workers, 5000 req) | 18,287 req/s | p50 1ms · p95 1ms · p99 2ms |
| Cache hit (warm replay) | ~3,730× | ~70.9 ms cold → ~19 µs hit |
| Failed requests | 0 / 10,000 | across single + concurrent runs |
| Memory footprint | 48.5 MB RSS | 15 threads · Docker container |
| Production cumulative | 1.54M tokens saved | across 18,085+ real requests · max 77.7% on a single request |
// Reliability
How the pipeline reduces hallucination risk
Compression isn't just about cost. Cleaner inputs reduce the conditions that cause models to hallucinate.
Hallucinations increase when:
Prompts are ambiguous or instructions are buried in verbosity
Constraints are diffuse, spread across paragraphs
Context window fills with redundant or contradictory history
Critical instructions are silently evicted at token limits
The pipeline addresses each:
Normalization extracts and surfaces buried constraints
Conflict detection finds and resolves contradictions in rules