
Research · Working draft · 2026

Compliance-Aware Retrieval-Augmented Generation for Regulated Financial-Reporting Corpora

A real-world evaluation on SEC EDGAR filings.

Aru Bhardwaj · Insightrix, France · 33 pages · 2026

Working draft — feedback welcome at bonjour@arubhardwaj.eu
The PDF is in English only. Below is a translated summary of its key points.

Constraint violations: 81.12% → 0.00%

Output disclosures: 21.29% → 0.00%

F1 cost: 4.8 pts

p95 latency overhead: 0 ms

TL;DR

Standard RAG retrieves what is most relevant. In regulated industries, what is most relevant is not necessarily what is permitted. A passage can be the best match for a query and still be inadmissible to that user, for that purpose, in that session — under GDPR, the EU AI Act, Reg FD, FINRA, SOX, or HIPAA.

CARAG (Compliance-Aware RAG) is a five-stage architecture that treats compliance as a first-class property of the index, the retriever, the generator, and the audit log. The paper releases a benchmark built from 6,000 real SEC EDGAR filings (26,595 chunks across seven quarters), with policy vectors derived from documented submission fields rather than synthetic distributions.

On the headline benchmark, CARAG cuts the constraint-violation rate from 81.12% to 0.00% and the output-disclosure rate from 21.29% to 0.00%, while sacrificing only 4.8 F1 points and adding 0 ms of 95th-percentile latency relative to a vanilla RAG baseline.

Why this matters in production

Three real deployments motivate the architecture — operationally common, but architecturally awkward for vanilla RAG:

  1. Sell-side equity research. A vanilla retriever surfaces the most recent 10-K alongside a privately commissioned model whose deal team the analyst is not on, and a stale 10-Q superseded by an amendment. Publishing from the second triggers a regulatory fine; from the third, an analytical error.
  2. Clinical decision support. A nurse practitioner queries about a patient. The retriever surfaces relevant notes from a specialist outside the treatment relationship, which is impermissible under HIPAA's minimum-necessary rule even inside the same hospital.
  3. Cross-border analytics. A global asset manager in Frankfurt runs RAG over EU-, US-, and Singapore-resident filings. Under GDPR Article 6, the lawful basis under which an EU-resident document was indexed constrains the purposes for which it may be retrieved, and it does so per document, not per user.

In all three the documents are legitimate, the user is authorised, the question is reasonable — and the retrieval-time intersection is nonetheless impermissible. That intersection is what RAG must learn to compute.

Four failure modes of compliance-naïve RAG

F1. Index-time leakage

A document is indexed without checking whether the ingestion context entitled the system to do so. Once the chunk is in the index, every downstream query sees it regardless of policy.

F2. Retrieval-time leakage

The chunk is correctly labelled but the retriever does not consult the label, so it surfaces in top-k for queries whose policy forbids it. This is the most common failure of vanilla RAG.

F3. Generation-time leakage from inadmissible context

The retriever filters correctly but the generator paraphrases across snippets and synthesises content originating in an inadmissible chunk.

F4. Generation-time parametric leakage

The retriever returns nothing useful, but the decoder confidently emits a fact recovered from its pre-training, which may itself be inadmissible.

CARAG closes F1–F3 structurally and F4 probabilistically.

The five-stage CARAG pipeline

  1. Ingestion + policy labelling

     Each chunk gets a 27-bit policy vector packed in one 32-bit machine word, derived deterministically from documented metadata fields. ~0.4% storage overhead at d=1024, fp16.

  2. Metadata-aware index (HNSW)

     Standard hierarchical navigable small-world graph (M=32, efConstruction=200), with the policy bitmask co-located alongside each vector for O(1) per-candidate admissibility tests.

  3. Policy inference

     Maps (query, user role, session purpose) to a (M_req, M_for) mask pair via a deterministic role-policy lookup plus a session modulator that applies query-conditional refinements such as the active deal-list.

  4. Constraint-aware retriever

     Bitwise admissibility check evaluated inside the inner loop of HNSW traversal, before the result heap is updated. Adaptive ef expansion preserves recall under tight policies.

  5. Guarded generator + Merkle-anchored audit log

     The generator (Amazon Nova Micro) sees admissible and inadmissible buckets explicitly and is instructed to draw only from the former. Every query commits a Merkle-anchored append-only record sufficient for Article 12 of the EU AI Act.
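Stages 1, 3, and 4 can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the three bit names, the `ROLE_POLICY` table, and the reading of the mask pair as "all bits in M_req must be set, no bit in M_for may be set" are not taken from the paper. The search is a brute-force stand-in that filters a ranked candidate pool and widens it when too few admissible hits remain. It does not reproduce the in-traversal HNSW check described above.

```python
from dataclasses import dataclass

# Hypothetical bit assignments; the 27-bit layout used in the paper is not given here.
US_RESIDENT = 1 << 0
DEAL_RESTRICTED = 1 << 1
SUPERSEDED = 1 << 2
POLICY_BITS = (1 << 27) - 1  # 27 policy bits fit inside one 32-bit word


def admissible(policy: int, m_req: int, m_for: int) -> bool:
    """Assumed semantics: every required bit is set and no forbidden bit is set."""
    return (policy & m_req) == m_req and (policy & m_for) == 0


# Stage 3 (sketch): deterministic role lookup, then a session modulator.
ROLE_POLICY = {"analyst": (US_RESIDENT, DEAL_RESTRICTED | SUPERSEDED)}


def infer_policy(role: str, on_deal_list: bool) -> tuple[int, int]:
    m_req, m_for = ROLE_POLICY[role]
    if on_deal_list:
        # Deal-team members may read deal-restricted material in this session.
        m_for &= ~DEAL_RESTRICTED
    return m_req & POLICY_BITS, m_for & POLICY_BITS


@dataclass
class Chunk:
    chunk_id: str
    vector: list[float]
    policy: int


def search(chunks, query, m_req, m_for, k=8, ef=64):
    """Brute-force stand-in for stage 4: widen the pool until k admissible hits."""
    ranked = sorted(
        chunks,
        key=lambda c: sum(x * y for x, y in zip(c.vector, query)),
        reverse=True,
    )
    ef = max(ef, k, 1)
    while True:
        hits = [c for c in ranked[:ef] if admissible(c.policy, m_req, m_for)]
        if len(hits) >= k or ef >= len(ranked):
            return hits[:k]
        ef *= 2  # adaptive expansion when the policy is tight
```

With an analyst who is not on the deal list, a deal-restricted chunk is skipped even when it is the nearest neighbour, and the pool grows until enough admissible chunks are found or the corpus is exhausted.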

Headline results

Four systems on the SEC compliance benchmark, with n = 600 queries and k = 8 retrieved chunks per query. CARAG reaches 0.00% CVR and ODR, a floor it shares with the post-filter baseline (B1), at a cost of 4.8 token-F1 points and a higher refusal rate relative to vanilla RAG (B0).

System                 CVR ↓    ODR ↓    Refusal   F1       p95 (ms)
B0 Vanilla RAG         81.12%   21.29%   24.90%    0.0964   1,498
B1 Post-filter          0.00%    0.00%   23.69%    0.0854   1,219
B2 Pre-filter (RBAC)   32.53%    9.64%   22.89%    0.0854   1,944
CARAG (full)            0.00%    0.00%   40.56%    0.0484   1,005

Table 2, Section 6.1. CVR = Constraint Violation Rate, ODR = Output Disclosure Rate. Post-filter (B1) reaches the same CVR floor as CARAG, whereas pre-filter (B2) still leaves a CVR of 32.53% because a static role partition is too coarse to recover query-conditional admissibility.
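For readers who want to compute comparable numbers on their own system, the two kinds of metric in the table can be sketched as follows. This summary does not give the paper's exact definitions, so both functions are assumptions: `token_f1` uses the common whitespace-token overlap formulation, and `violation_rate` counts a query as violating when at least one retrieved chunk is inadmissible.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def violation_rate(admissible_flags_per_query: list[list[bool]]) -> float:
    """Share of queries whose retrieved set holds at least one inadmissible chunk."""
    if not admissible_flags_per_query:
        return 0.0
    violating = sum(1 for flags in admissible_flags_per_query if not all(flags))
    return violating / len(admissible_flags_per_query)
```

The same `violation_rate` shape could be applied to generated outputs instead of retrieved chunks to approximate an output-disclosure rate, given a judge that flags each answer.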

Where the cost lives

Disaggregating by query stratum (loose / medium / tight) reveals where the architectural cost of compliance lands. On loose queries — where the policy admits the majority of relevant chunks — CARAG's F1 closely tracks the vanilla baseline. The gap widens slightly on medium queries and opens fully on tight queries, where the relevant set is genuinely sparse.

That tight stratum is also where CVR and ODR matter most: each violation is, by construction, semantically meaningful — an inadmissible chunk contains the answer. The cost concentrates exactly where policy substantively restricts the relevant set, not on routine due-diligence queries.

Production economics

Storage overhead: ~0.4%

One 32-bit word per chunk at d=1024, fp16. ~4 GB extra RAM/disk for a billion-chunk index.

Index-build time: Negligible

Bitmask is computed at chunk-emission time from already-available metadata fields.

Query-time compute: O(1) per candidate

Bitwise admissibility check; adaptive ef adds a policy-dependent constant.

Audit storage: ~1 KB/query

~1.2 KB/query in production logging, where each retrieved chunk also carries its similarity score and policy bits.
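The storage figures can be checked with back-of-envelope arithmetic. The sketch below assumes one 4-byte policy word per chunk, 2 bytes per fp16 component, and decimal gigabytes; the function names are illustrative. Measured against the raw fp16 vector alone, one 4-byte word comes to roughly 0.2%, so the ~0.4% quoted above presumably uses a denominator or word count that this summary does not spell out.

```python
def mask_overhead_ratio(dim: int = 1024, bytes_per_component: int = 2,
                        mask_bytes: int = 4) -> float:
    """Policy-word bytes divided by raw vector bytes for one chunk."""
    return mask_bytes / (dim * bytes_per_component)


def mask_gigabytes(n_chunks: int, mask_bytes: int = 4) -> float:
    """Total bitmask storage across the index, in decimal gigabytes."""
    return n_chunks * mask_bytes / 1e9


def audit_gigabytes(n_queries: int, bytes_per_query: int = 1200) -> float:
    """Audit-log volume at the ~1.2 KB/query production figure."""
    return n_queries * bytes_per_query / 1e9


print(f"{mask_overhead_ratio():.4%} per chunk")                 # 4 bytes over 2048 bytes
print(f"{mask_gigabytes(1_000_000_000):.1f} GB per billion chunks")
print(f"{audit_gigabytes(1_000_000):.1f} GB per million queries")
```

The billion-chunk figure matches the ~4 GB stated above. The per-million-queries audit volume is simply the 1.2 KB figure scaled up, not a number from the paper.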

Reproducibility

All experiments run on Amazon Bedrock eu-west-3, with no in-house GPUs in the loop:

  • Embedding: cohere.embed-multilingual-v3 (1024-dim, L2-normalised)
  • Generator: amazon.nova-micro-v1:0 (temperature 0, max_tokens=80)
  • Output-disclosure judge: amazon.nova-lite-v1:0 with structured-JSON rubric (κ=0.81 vs. author labels on 200 hand-validated calls)
  • Index: hnswlib at M=32, efConstruction=200, efSearch=64
  • Corpus: 6,000 SEC filings → 26,595 chunks across 877 unique filers, seven quarters (2024Q3–2026Q1)
  • Total Bedrock spend per full reproduction: ~$3 USD

What this means for your stack

If you operate a RAG system that touches:

  • EU jurisdictions (GDPR Articles 5, 6, 22 + EU AI Act Articles 12–15)
  • Financial services (Reg FD, FINRA Rule 2241, SOX / PCAOB AS 2820)
  • Healthcare (HIPAA, GDPR Article 9 special-category data)
  • Cross-border data residency (EU/US/UK/CA/Other)

…compliance is no longer something you can bolt on after the fact. CARAG's bitmask encoding retrofits onto an existing dense index in days, not months, and the audit log is verifiable under Article 12 of the EU AI Act.
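To make "verifiable" concrete, here is a minimal sketch of a Merkle-anchored append-only log using only the Python standard library. It is an illustration under stated assumptions, not the paper's implementation: the record fields are hypothetical, the 0x00/0x01 domain-separation prefixes follow the convention used in RFC 6962, and duplicating the last node on odd levels is a simplification that RFC 6962 does not use.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """Hash one audit record using a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + payload).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    return [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the last node
        level = _next_level(level)
    return level[0]


def inclusion_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Return (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(record: dict, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    h = leaf_hash(record)
    for sibling, sibling_is_left in proof:
        h = node_hash(sibling, h) if sibling_is_left else node_hash(h, sibling)
    return h == root
```

An auditor holding only a published root can check that a given query record was logged, and any later change to that record makes `verify` fail.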

Cite this work

@techreport{bhardwaj2026carag,
  author      = {Bhardwaj, Aru},
  title       = {Compliance-Aware Retrieval-Augmented Generation
                 for Regulated Financial-Reporting Corpora:
                 A Real-World Evaluation on SEC EDGAR Filings},
  institution = {Insightrix},
  year        = {2026},
  type        = {Working Draft},
  address     = {Paris, France},
  url         = {https://arubhardwaj.eu/research/compliance-aware-rag-sec-edgar}
}

