How We Built a Security-Sensitive Ask Pipeline That Doesn't Leak Your Data
2026-02-26
By Emre Salmanoglu

Building AI-powered compliance answers across three surfaces without cutting corners on access control.

Slack Ask
Security
Compliance
AI

Security answers live in the worst possible places: buried in audit reports, scattered across policy PDFs, locked behind NDAs. The people who need those answers fastest (sales teams mid-deal, customer success teams handling renewals) are usually the farthest from the source.

At Orbiq, we built Slack Ask so teams can type /orbiq ask in any channel and get sourced, access-controlled answers from their trust center in seconds. What looks like a simple Slack command on the surface is actually a distributed retrieval-and-generation pipeline with strict privilege boundaries, staged evidence scoring, and multi-model routing.

This post walks through how we designed it and why "just wrap a prompt" was not an option.

Three surfaces, one platform

We didn't start with three ask surfaces. We started with one, then learned that different contexts demand different trust boundaries and latency contracts.

Today, Orbiq has three entrypoints into the same underlying ask infrastructure:

Ask API — a Q&A API for users, agents, and integrations. Used by compliance teams and automated workflows that need structured, typed responses; multi-select, select, boolean, and free-text answer types are all supported.

Trust Center Answer — contact-scoped answers for approved external contacts visiting a company's trust center portal. Every response respects the contact's access level, NDA status, and document assignments.

Slack Ask — conversational Q&A from Slack with strict evidence gating. Designed for speed, but never at the cost of surfacing something the asker shouldn't see. It supports multiple questions in a single ask and applies a per-document sensitivity warning policy, e.g. "This document contains NDA-sensitive information; make sure you have one in place."

All three converge on the same platform, but they differ intentionally in authentication, response contracts, and how aggressively they optimize for latency.

Layered privilege model

When you're generating answers from sensitive compliance documents, the auth model has to be more than "check a token at the edge." We use four layers of privilege, each scoped to limit blast radius.

Contact Claims

In addition to the standard tenant claims, we augment every public Trust Center request with contact-level claims: NDA status, session freshness, and document access assignments.

Service-to-service security

Each domain has dedicated authentication, and we validate every token using a timing-safe comparison. If you compromise one token, you get access to one domain, not all of them.
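A minimal sketch of per-domain token validation with a timing-safe compare. The domain names and token registry here are illustrative, not Orbiq's actual configuration:

```python
import hmac

# Hypothetical per-domain token registry; in production these would come
# from a secrets manager, never from source code.
SERVICE_TOKENS = {
    "ask": "ask-domain-secret",
    "trust-center": "tc-domain-secret",
}

def authenticate_service(domain: str, presented_token: str) -> bool:
    """Validate a service token for one domain using a timing-safe compare.

    A compromised token grants access to its own domain only.
    """
    expected = SERVICE_TOKENS.get(domain)
    if expected is None:
        return False  # unknown domain: fail closed
    return hmac.compare_digest(expected.encode(), presented_token.encode())
```

`hmac.compare_digest` runs in constant time relative to the input length, which closes the timing side channel a naive `==` comparison would leave open.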

Tenant-scoped data access

Inside the async layer, every data read requires a tenant-specific access token fetched from a secrets manager at runtime. If the token is missing or revoked, the system fails closed. Even if a request reaches the jobs platform, it can't read data without a valid tenant-scoped secret.
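The fail-closed behavior can be sketched like this. The in-memory dict stands in for a real secrets-manager client (e.g. Vault or AWS Secrets Manager), and all names are assumptions for illustration:

```python
class MissingTenantSecret(Exception):
    """Raised when no valid tenant-scoped token exists; callers must fail closed."""

# Stand-in for a secrets manager queried at runtime.
_SECRETS = {"tenant-123/data-access": "s3cr3t"}

def get_tenant_token(tenant_id: str) -> str:
    token = _SECRETS.get(f"{tenant_id}/data-access")
    if not token:
        # Fail closed: a missing or revoked token means no reads at all,
        # even for requests that already reached the jobs platform.
        raise MissingTenantSecret(tenant_id)
    return token

def read_tenant_data(tenant_id: str, query: str) -> list:
    token = get_tenant_token(tenant_id)  # raises before any data is touched
    # ... perform the data read using the tenant-scoped token ...
    return []
```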

Slack ingress verification

Slack commands and events go through signature verification with timestamp freshness checks to prevent replay attacks. Signature validation uses timing-safe comparison before any callback is accepted.
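Slack's documented request-signing scheme (HMAC-SHA256 over `v0:{timestamp}:{body}` with the app's signing secret) makes this straightforward to sketch; the freshness window below is the commonly used five minutes:

```python
import hashlib
import hmac
import time

def verify_slack_request(signing_secret: str, timestamp: str, body: bytes,
                         signature: str, max_age_s: int = 300) -> bool:
    """Verify a Slack request signature with a timestamp freshness check."""
    # Reject stale timestamps to prevent replay of captured requests.
    if abs(time.time() - int(timestamp)) > max_age_s:
        return False
    basestring = f"v0:{timestamp}:{body.decode()}".encode()
    expected = "v0=" + hmac.new(
        signing_secret.encode(), basestring, hashlib.sha256
    ).hexdigest()
    # Timing-safe comparison before any callback is accepted.
    return hmac.compare_digest(expected, signature)
```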

Where we say "no" — leak prevention before the model sees anything

Most AI systems try to prevent data leaks through prompt engineering. We do it in the retrieval layer, before any content reaches the model context window.

Contact gating. For trust center answers, the async layer loads the contact's access profile and rejects requests from non-approved contacts before retrieval begins.

Access-level enforcement on documents. We classify every document with an access level, and enforce it before evidence text enters the model context:

  • Internal — never surfaced externally, period.
  • Public — shared with the world.
  • Restricted — shareable only if the document is explicitly assigned to the requesting contact.
  • Requires NDA — shareable only if the contact has a completed NDA.
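The four access levels above reduce to a small decision function. This is a sketch under assumed field names (`assigned_doc_ids`, `has_completed_nda` are illustrative, not the real schema):

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    id: str
    has_completed_nda: bool = False
    assigned_doc_ids: set = field(default_factory=set)

def may_surface(access_level: str, contact: Contact, doc_id: str) -> bool:
    """Decide whether a document's text may enter the model context
    for an external contact. Internal content is never surfaced."""
    if access_level == "public":
        return True
    if access_level == "restricted":
        return doc_id in contact.assigned_doc_ids
    if access_level == "requires_nda":
        return contact.has_completed_nda
    return False  # "internal" and anything unrecognized: fail closed
```

Note the default branch: an unknown access level is treated like internal content, so a misclassified document stays hidden rather than leaking.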

Search-level filtering for external users. Even the initial search query excludes internal content in its filters. You can't retrieve what you can't query.

Sensitivity flagging for internal users. When an answer references NDA-protected or internal-only content, Slack Ask flags it in the response before anyone can copy-paste it into an external conversation.

Response contracts: why we use both 200 and 202

Not every surface can afford to wait for generation to finish, and not every surface should pretend answers are instant.

Admin Ask: optimistic sync with async fallback

When a human or agent submits a question via the api/v1/ask endpoint, the system creates the ask record, queues the job, and polls for up to three seconds. If a terminal state arrives in time, it returns a 200 with the answer inline. If not, it returns 202 with a reference ID for polling.

This gives a fast-path developer experience when generation is quick (many typed, structured questions resolve in under two seconds) without blocking the client on long-tail requests.
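The optimistic-sync pattern can be sketched as follows. The `create_and_queue` and `poll_status` callables are assumed stand-ins for the real job platform, and the polling interval is illustrative:

```python
import time

def submit_ask(question: str, create_and_queue, poll_status,
               timeout_s: float = 3.0, interval_s: float = 0.25):
    """Return (200, record) if the job reaches a terminal state within
    timeout_s; otherwise (202, {"ask_id": ...}) for client-side polling."""
    ask_id = create_and_queue(question)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        record = poll_status(ask_id)
        if record["status"] in ("completed", "failed"):
            return 200, record  # fast path: answer inline
        time.sleep(interval_s)
    return 202, {"ask_id": ask_id}  # slow path: hand back a reference ID
```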

Trust Center Answer: async-first by design

External-facing answers always return 202 with an answer ID. The frontend polls for status, and the server updates the record when generation completes.

This keeps portal request lifetimes predictable and avoids timeout issues for external users on slower connections. It also means the trust center frontend can show meaningful loading states instead of a hanging spinner.

Following the Open Responses standard

Our response contracts align with the Open Responses specification, an open-source standard for building multi-provider, interoperable LLM interfaces. Open Responses defines a shared schema for request/response patterns, streaming events, and tool invocation across providers. By adopting it, our 200 fast-path and 202 async-with-polling patterns follow a portable contract that any client or integration can implement against, regardless of which model provider sits behind the pipeline.

Staged retrieval: paying only for what you need

The Slack Ask pipeline doesn't run a single monolithic retrieval step. It runs progressively through stages, and stops early when it has enough evidence.

Stage 1: Knowledge base. Search the structured knowledge base for direct matches. If the answer is well-covered by existing Q&A pairs or policy statements, we score and potentially return here.

Stage 2: Document metadata. If knowledge base evidence is insufficient, search document metadata for relevant sources. This is cheaper than full-text retrieval but often enough to identify the right documents.

Stage 3: Reranked document snippets. Only if the first two stages don't produce high-confidence evidence do we perform full snippet extraction, reranking, and deep context assembly.

Each stage produces an evidence score. If a stage clears the confidence threshold, the pipeline short-circuits. For common compliance questions like "Are you SOC 2 certified?", "Where is data hosted?", "Do you support SSO?", the answer usually resolves at stage one, saving both latency and token cost.
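The short-circuit logic is simple to express. Here each stage is a callable returning evidence plus a score; the 0.75 threshold is a placeholder, not the production value:

```python
def staged_retrieve(question: str, stages, threshold: float = 0.75):
    """Run retrieval stages in order; stop at the first one whose
    evidence score clears the confidence threshold."""
    best_evidence, best_score = [], 0.0
    for name, stage in stages:
        evidence, score = stage(question)
        if score >= threshold:
            return name, evidence, score  # short-circuit: enough evidence
        if score > best_score:
            best_evidence, best_score = evidence, score
    # No stage cleared the bar; fall through with the best evidence found.
    return "fallthrough", best_evidence, best_score
```

For a question like "Are you SOC 2 certified?", the knowledge-base stage would clear the threshold immediately, so the metadata and snippet stages never run and never spend tokens.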

Model routing and observability

We route to different models based on the question type and desired output format:

  • Short, factual questions route to a fast, cost-efficient model.
  • Questions that benefit from deeper reasoning route to a thinking-optimized model.
  • Default questions go through a fuller generation flow.

Model keys are environment-configurable, so we can swap providers without code changes.
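A sketch of environment-configurable routing; the env-var names and fallback model are assumptions for illustration, not the real configuration keys:

```python
import os

# Hypothetical mapping from question type to an env var holding a model key.
ROUTES = {
    "factual": "MODEL_FAST",        # short, factual questions
    "reasoning": "MODEL_THINKING",  # questions needing deeper reasoning
    "default": "MODEL_DEFAULT",     # fuller generation flow
}

def pick_model(question_type: str) -> str:
    env_key = ROUTES.get(question_type, ROUTES["default"])
    # Swapping providers is a config change, not a code change.
    return os.environ.get(env_key, "mistral-small-latest")
```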

The important part is observability parity. Every model call is traced with the same instrumentation:

  • Provider, model, and token usage metadata are captured per call.
  • Generation and embedding spans are exported to our observability platform.
  • Numeric confidence scores (e.g. search_answer_confidence, slack_ask_confidence) are written as scored metrics.

This gives us apples-to-apples comparison when tuning quality, latency, and cost across models. When we test a new provider or model version, we can see exactly how confidence and cost distributions shift before rolling it out.

On-demand OCR with deduplication

Not every document in a trust center has machine-readable text. Some are scanned PDFs, some are image-heavy compliance certificates.

When the pipeline encounters a document without extracted text, it triggers on-demand OCR:

  1. Resolve the active published version.
  2. Fetch the asset properties and prepare limits to prevent resource exhaustion.
  3. Run OCR and persist the extracted markdown for future reuse.
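The steps above can be sketched as a cached lookup keyed on the published version, so OCR runs at most once per version. The dict cache stands in for persistent storage, and field names are illustrative:

```python
_ocr_cache = {}  # keyed by published version ID; stands in for persisted markdown

def get_document_text(doc: dict, run_ocr, max_pages: int = 50) -> str:
    """Return a document's text, running OCR on demand at most once
    per published version."""
    if doc.get("text"):
        return doc["text"]  # machine-readable text already exists
    version_id = doc["published_version_id"]
    if version_id in _ocr_cache:
        return _ocr_cache[version_id]  # dedup: reuse persisted extraction
    if doc.get("page_count", 0) > max_pages:
        raise ValueError("document exceeds OCR limits")  # resource-exhaustion guard
    markdown = run_ocr(version_id)
    _ocr_cache[version_id] = markdown  # persist for future reuse
    return markdown
```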

Confidence as a first-class output

Every answer carries a confidence score, normalized to a [0, 1] range and persisted alongside the answer. Confidence can come from multiple sources depending on the path:

  • Model-provided confidence fields in structured output.
  • Evidence and citation heuristics based on retrieval quality.
  • Stage evaluator outcomes from the progressive pipeline.

We write confidence to both storage and our observability platform, which lets us monitor reliability trends over time and build product controls on top.

For example, we can show a "low confidence" indicator in the UI, require human review before an answer is shared externally, or share confidence history and performance as part of our customer success conversations.
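Combining whichever signals a given path produced into a single [0, 1] score might look like this. The weights are purely illustrative, not the production tuning:

```python
def normalize_confidence(model_conf=None, evidence_conf=None, stage_conf=None) -> float:
    """Blend available confidence signals into a [0, 1] score.
    Missing signals are simply excluded and the weights renormalized."""
    candidates = ((model_conf, 0.5), (evidence_conf, 0.3), (stage_conf, 0.2))
    signals = [(c, w) for c, w in candidates if c is not None]
    if not signals:
        return 0.0  # no signal at all: report zero confidence
    total_weight = sum(w for _, w in signals)
    score = sum(c * w for c, w in signals) / total_weight
    return max(0.0, min(1.0, score))  # clamp to [0, 1]
```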

What this architecture gives us

We could have built Slack Ask as a thin wrapper around a retrieval-augmented generation call. It would have shipped faster. It also would have leaked data the first time someone asked about a document they shouldn't have access to.

Instead, we treat "ask" as a security-sensitive distributed workflow:

  • Strict privilege boundaries with layered auth from edge to data layer.
  • Leak prevention at the retrieval layer, not just in prompts.
  • Latency-aware response contracts that match each surface's UX needs.
  • Staged retrieval that optimizes cost without sacrificing answer quality.
  • Multi-model routing with consistent observability across providers.
  • Confidence scoring as a foundation for trust.

If you're building AI features on top of sensitive data, the architecture around the model matters as much as the model itself. The hard part isn't getting an LLM to generate a plausible answer; it's building production-grade AI infrastructure around it.

We chose EU-sovereign models

One decision we haven't touched on yet: model provider selection. For our default production pipeline, we opted for Mistral, an EU-headquartered, EU-sovereign model provider.

In our benchmarks, Mistral models perform roughly 5–10% below Anthropic's Claude on our answer evaluation suite. That's a gap, and we track it continuously through our confidence scores. But for an EU-first platform serving European enterprises navigating NIS2, DORA, and GDPR requirements, data sovereignty isn't negotiable.

Our architecture makes this trade-off manageable. Model keys are environment-configurable, so switching providers per surface or per tenant is possible.

Observability parity means we can measure the quality delta precisely and make informed decisions about when to route to which provider. And staged retrieval means the model sees less, so the performance gap matters less: high-quality evidence in, high-quality answers out, regardless of which model does the final generation.

We'd rather ship slightly less impressive benchmarks with full EU data residency than chase leaderboard scores that don't match our customers' compliance reality.


Orbiq is a trust center platform that helps B2B companies turn security and compliance from sales blockers into competitive advantages. Learn more about Slack Ask or see our trust center in action.