Research · Working draft · 2026
Compliance-Aware Retrieval-Augmented Generation for Regulated Financial-Reporting Corpora
A real-world evaluation on SEC EDGAR filings.
Aru Bhardwaj · Insightrix, France · 33 pages · 2026
Constraint violations: 81.12% → 0.00%
Output disclosures: 21.29% → 0.00%
F1 cost: 4.8 pts
p95 latency overhead: 0 ms
TL;DR
Standard RAG retrieves what is most relevant. In regulated industries, what is most relevant is not necessarily what is permitted. A passage can be the best match for a query and still be inadmissible to that user, for that purpose, in that session — under GDPR, the EU AI Act, Reg FD, FINRA, SOX, or HIPAA.
CARAG (Compliance-Aware RAG) is a five-stage architecture that treats compliance as a first-class property of the index, the retriever, the generator, and the audit log. The paper releases a benchmark built from 6,000 real SEC EDGAR filings (26,595 chunks across seven quarters), with policy vectors derived from documented submission fields rather than synthetic distributions.
On the headline benchmark, CARAG cuts the constraint-violation rate from 81.12% to 0.00% and the output-disclosure rate from 21.29% to 0.00%, while sacrificing only 4.8 F1 points and adding 0 ms of 95th-percentile latency relative to a vanilla RAG baseline.
Why this matters in production
Three real deployments motivate the architecture — operationally common, but architecturally awkward for vanilla RAG:
1. Sell-side equity research. A vanilla retriever surfaces three documents for an analyst: the most recent 10-K, a privately commissioned model belonging to a deal team the analyst is not on, and a stale 10-Q superseded by an amendment. Publishing from the second triggers a regulatory fine; publishing from the third, an analytical error.
2. Clinical decision support. A nurse practitioner queries about a patient. The retriever surfaces relevant notes from a specialist outside the treatment relationship, which is impermissible under HIPAA's minimum-necessary rule, even inside the same hospital.
3. Cross-border analytics. A global asset manager in Frankfurt runs RAG over EU-, US-, and Singapore-resident filings. Under GDPR Article 6, the lawful basis under which an EU-resident document was indexed constrains the purposes for which it may be retrieved, and that constraint applies per document, not per user.
In all three the documents are legitimate, the user is authorised, the question is reasonable — and the retrieval-time intersection is nonetheless impermissible. That intersection is what RAG must learn to compute.
Four failure modes of compliance-naïve RAG
F1. Index-time leakage
A document is indexed without checking whether the ingestion context entitled the system to do so. Once the chunk is in the index, every downstream query sees it regardless of policy.
F2. Retrieval-time leakage
The chunk is correctly labelled but the retriever does not consult the label, so it surfaces in top-k for queries whose policy forbids it. The modal failure of vanilla RAG.
F3. Generation-time leakage from inadmissible context
The retriever filters correctly but the generator paraphrases across snippets and synthesises content originating in an inadmissible chunk.
F4. Generation-time parametric leakage
The retriever returns nothing useful, but the decoder confidently emits a fact recovered from its pre-training, and that fact may itself be inadmissible.
CARAG closes F1–F3 structurally and F4 probabilistically.
The five-stage CARAG pipeline
1. Ingestion + policy labelling
   Each chunk gets a 27-bit policy vector packed in one 32-bit machine word, derived deterministically from documented metadata fields. ~0.4% storage overhead at d=1024, fp16.
2. Metadata-aware index (HNSW)
   Standard hierarchical navigable small-world graph (M=32, efConstruction=200), with the policy bitmask co-located alongside each vector for O(1) per-candidate admissibility tests.
3. Policy inference
   Maps (query, user role, session purpose) to a (M_req, M_for) mask pair via a deterministic role-policy lookup plus a session modulator that applies query-conditional refinements such as the active deal-list.
4. Constraint-aware retriever
   Bitwise admissibility check evaluated inside the inner loop of HNSW traversal, before the result heap is updated. Adaptive ef expansion preserves recall under tight policies.
5. Guarded generator + Merkle-anchored audit log
   The generator (Amazon Nova Micro) sees admissible and inadmissible buckets explicitly and is instructed to draw only from the former. Every query commits a Merkle-anchored append-only record sufficient for Article 12 of the EU AI Act.
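The bitmask mechanics of stages 1, 3, and 4 can be illustrated with a minimal sketch. Everything named here is hypothetical: the flag names, the `ROLE_POLICY` table, the session modulator, and the reading of the mask pair (a chunk is admissible when it carries every bit in `M_req` and no bit in `M_for`) are assumptions for illustration, not the paper's actual encoding. A brute-force scan stands in for HNSW traversal; the only point is that the admissibility test runs before a candidate reaches the result heap.

```python
import heapq
from dataclasses import dataclass

# Hypothetical flag layout; the paper's 27-bit layout is not reproduced here.
PUBLIC = 1 << 0           # publicly filed document
DEAL_RESTRICTED = 1 << 2  # tied to a restricted deal
SUPERSEDED = 1 << 3       # replaced by a later amendment
MASK_32 = 0xFFFFFFFF      # keep masks inside one 32-bit word


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    vector: tuple
    policy: int  # policy bits packed into one 32-bit word


# Hypothetical role-policy lookup: role -> (M_req, M_for).
ROLE_POLICY = {
    "analyst": (0, DEAL_RESTRICTED | SUPERSEDED),
    "external": (PUBLIC, DEAL_RESTRICTED | SUPERSEDED),
}


def infer_masks(role, on_deal_list):
    """Role lookup plus a simple session modulator (assumed semantics)."""
    m_req, m_for = ROLE_POLICY[role]
    if on_deal_list:
        # Lift the deal restriction for this session only.
        m_for &= ~DEAL_RESTRICTED & MASK_32
    return m_req, m_for


def admissible(policy, m_req, m_for):
    """O(1) test: all required bits present, no forbidden bit present."""
    return (policy & m_req) == m_req and (policy & m_for) == 0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def constrained_top_k(query, chunks, m_req, m_for, k):
    """Brute-force stand-in for HNSW: filter before the heap is updated."""
    heap = []  # min-heap of (score, chunk_id), at most k entries
    for c in chunks:
        if not admissible(c.policy, m_req, m_for):
            continue  # inadmissible candidates never enter the heap
        item = (dot(query, c.vector), c.chunk_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)
    return [cid for _, cid in sorted(heap, reverse=True)]


chunks = [
    Chunk("10-K", (1.0, 0.0), PUBLIC),
    Chunk("deal-model", (0.9, 0.1), DEAL_RESTRICTED),
    Chunk("old-10-Q", (0.8, 0.2), PUBLIC | SUPERSEDED),
]
m_req, m_for = infer_masks("analyst", on_deal_list=False)
result = constrained_top_k((1.0, 0.0), chunks, m_req, m_for, k=2)
print(result)  # ['10-K']
```

With `on_deal_list=True` the same analyst would also see `deal-model`, while the superseded 10-Q stays excluded in every session. In a real graph index the filter would sit inside the traversal loop, and the search width would need to grow when few candidates pass the test.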
Headline results
Four systems on the SEC compliance benchmark (n = 600 queries, k = 8 retrieved chunks per query). CARAG attains the lowest CVR and ODR (0.00% each, tied with the post-filter baseline B1), at a token-F1 cost of 0.048 relative to vanilla RAG (0.096 to 0.048).
| System | CVR ↓ | ODR ↓ | Refusal | F1 | p95 (ms) |
|---|---|---|---|---|---|
| B0 Vanilla RAG | 81.12% | 21.29% | 24.90% | 0.096 | 41,498 |
| B1 Post-filter | 0.00% | 0.00% | 23.69% | 0.085 | 41,219 |
| B2 Pre-filter (RBAC) | 32.53% | 9.64% | 22.89% | 0.085 | 41,944 |
| CARAG (full) | 0.00% | 0.00% | 40.56% | 0.048 | 41,005 |
Table 2, Section 6.1. CVR = Constraint Violation Rate, ODR = Output Disclosure Rate. Post-filter (B1) reaches the same CVR floor as CARAG. Pre-filter (B2) does not: its CVR stays at 32.53%, because a static role partition is too coarse to recover query-conditional admissibility.
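As a rough illustration of how a rate like CVR could be computed, the sketch below counts a query as violating when at least one inadmissible chunk appears in its top-k. That per-query definition, the function name, and the data layout are assumptions; the paper's exact definition in Section 6.1 may differ.

```python
def constraint_violation_rate(retrieved, inadmissible):
    """Fraction of queries whose top-k contains at least one inadmissible chunk.

    retrieved:    dict of query_id -> list of retrieved chunk ids
    inadmissible: dict of query_id -> set of chunk ids forbidden for that query
    """
    if not retrieved:
        return 0.0
    violations = sum(
        1
        for qid, chunk_ids in retrieved.items()
        if any(cid in inadmissible.get(qid, set()) for cid in chunk_ids)
    )
    return violations / len(retrieved)


# Toy data: q1 and q4 each retrieve a forbidden chunk, q2 and q3 do not.
retrieved = {"q1": ["a", "b"], "q2": ["c"], "q3": ["d", "e"], "q4": ["f"]}
inadmissible = {"q1": {"b"}, "q3": {"x"}, "q4": {"f"}}
cvr = constraint_violation_rate(retrieved, inadmissible)
print(cvr)  # 0.5
```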
Where the cost lives
Disaggregating by query stratum (loose / medium / tight) reveals where the architectural cost of compliance lands. On loose queries — where the policy admits the majority of relevant chunks — CARAG's F1 closely tracks the vanilla baseline. The gap widens slightly on medium queries and opens fully on tight queries, where the relevant set is genuinely sparse.
That tight stratum is also where CVR and ODR matter most: each violation is, by construction, semantically meaningful — an inadmissible chunk contains the answer. The cost concentrates exactly where policy substantively restricts the relevant set, not on routine due-diligence queries.
Production economics
Storage overhead
~0.4%
One 32-bit word per chunk at d=1024, fp16. ~4 GB extra RAM/disk for a billion-chunk index.
Index-build time
Negligible
Bitmask is computed at chunk-emission time from already-available metadata fields.
Query-time compute
O(1) per candidate
Bitwise admissibility check; adaptive ef adds a policy-dependent constant.
Audit storage
~1 KB/query
~1.2 KB/query in production logging where each retrieved chunk also carries similarity score and policy bits.
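The storage figures above can be checked with back-of-envelope arithmetic. The sketch below assumes decimal gigabytes and counts only the raw fp16 vector in the denominator; the variable names are illustrative.

```python
BYTES_PER_MASK = 4   # one 32-bit word per chunk
DIM = 1024
BYTES_PER_FP16 = 2

vector_bytes = DIM * BYTES_PER_FP16         # 2048 bytes per fp16 vector
mask_ratio = BYTES_PER_MASK / vector_bytes  # mask size relative to the raw vector

n_chunks = 1_000_000_000
extra_gb = n_chunks * BYTES_PER_MASK / 1e9  # extra storage for a billion chunks

audit_gb_per_million = 1_000_000 * 1.2e3 / 1e9  # at ~1.2 KB per query

print(f"{mask_ratio:.2%}", extra_gb, audit_gb_per_million)
```

This reproduces the ~4 GB figure for a billion-chunk index. Against the raw fp16 vector alone, the mask ratio comes out near 0.2%, so the ~0.4% quoted above presumably rests on a denominator or per-chunk layout that this summary does not spell out.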
Reproducibility
All experiments run on Amazon Bedrock eu-west-3, with no in-house GPUs in the loop:
- Embedding: cohere.embed-multilingual-v3 (1024-dim, L2-normalised)
- Generator: amazon.nova-micro-v1:0 (temperature 0, max_tokens=80)
- Output-disclosure judge: amazon.nova-lite-v1:0 with structured-JSON rubric (κ=0.81 vs. author labels on 200 hand-validated calls)
- Index: hnswlib at M=32, efConstruction=200, efSearch=64
- Corpus: 6,000 SEC filings → 26,595 chunks across 877 unique filers, seven quarters (2024Q3–2026Q1)
- Total Bedrock spend per full reproduction: ~$3 USD
What this means for your stack
If you operate a RAG system that touches:
- EU jurisdictions (GDPR Articles 5, 6, 22 + EU AI Act Articles 12–15)
- Financial services (Reg FD, FINRA Rule 2241, SOX / PCAOB AS 2820)
- Healthcare (HIPAA, GDPR Article 9 special-category data)
- Cross-border data residency (EU/US/UK/CA/Other)
…compliance is no longer something you can bolt on after the fact. CARAG's bitmask encoding retrofits onto an existing dense index in days, not months, and the audit log is verifiable under Article 12 of the EU AI Act.
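One generic way to make an append-only log verifiable is to hash each query record into a leaf of a Merkle tree and anchor the root somewhere tamper-evident. The sketch below uses SHA-256 from the standard library and is not the paper's implementation; the record fields and the duplicate-last-node padding rule are assumptions.

```python
import hashlib
import json


def leaf_hash(record):
    """Hash a canonical JSON encoding of one audit record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    # The 0x00 prefix separates leaf hashes from interior-node hashes.
    return hashlib.sha256(b"\x00" + payload.encode("utf-8")).digest()


def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single 32-byte root."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed padding: duplicate the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical audit records; field names are illustrative only.
log = [
    {"query_id": "q1", "role": "analyst", "chunks": ["10-K"], "m_for": 12},
    {"query_id": "q2", "role": "analyst", "chunks": [], "m_for": 12},
]
root = merkle_root([leaf_hash(r) for r in log])

# Tampering with any committed record changes the root.
tampered = [dict(log[0], chunks=["deal-model"]), log[1]]
tampered_root = merkle_root([leaf_hash(r) for r in tampered])
print(root.hex() != tampered_root.hex())  # True
```

An auditor who holds a previously anchored root can recompute it from the stored records and detect any later modification. A production design would likely follow a vetted tree construction rather than the simple padding rule used here.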
Cite this work
@techreport{bhardwaj2026carag,
author = {Bhardwaj, Aru},
title = {Compliance-Aware Retrieval-Augmented Generation
for Regulated Financial-Reporting Corpora:
A Real-World Evaluation on SEC EDGAR Filings},
institution = {Insightrix},
year = {2026},
type = {Working Draft},
address = {Paris, France},
url = {https://arubhardwaj.eu/research/compliance-aware-rag-sec-edgar}
}
Bring CARAG-style architecture to your stack
This is the kind of architecture I deploy as a Fractional CTO. If your team operates a RAG system that touches regulated data — EU, healthcare, financial, or cross-border — let's talk.