Compliance-Aware Retrieval-Augmented Generation for Regulated Financial-Reporting Corpora: A Real-World Evaluation on SEC EDGAR Filings

Research · Working draft · 2026

Aru Bhardwaj · Insightrix, France · 33 pages · 2026

Working draft — feedback welcome at bonjour@arubhardwaj.eu

The PDF is in English only. Below is a translated summary of its key points.

Constraint violations: 81.12% → 0.00%
Output disclosures: 21.29% → 0.00%
F1 cost: 4.8 pts
p95 latency overhead: 0 ms

TL;DR

Standard RAG retrieves what is most relevant. In regulated industries, what is most relevant is not necessarily what is permitted. A passage can be the best match for a query and still be inadmissible to that user, for that purpose, in that session — under GDPR, the EU AI Act, Reg FD, FINRA, SOX, or HIPAA.

CARAG (Compliance-Aware RAG) is a five-stage architecture that treats compliance as a first-class property of the index, the retriever, the generator, and the audit log. The paper releases a benchmark built from 6,000 real SEC EDGAR filings (26,595 chunks across seven quarters), with policy vectors derived from documented submission fields rather than synthetic distributions.

On the headline benchmark, CARAG cuts the constraint-violation rate from 81.12% to 0.00% and the output-disclosure rate from 21.29% to 0.00%, while sacrificing only 4.8 F1 points and adding 0 ms of 95th-percentile latency relative to a vanilla RAG baseline.

Why this matters in production

Three real deployments motivate the architecture — operationally common, but architecturally awkward for vanilla RAG:

1. Sell-side equity research. For an analyst's query, a vanilla retriever surfaces the most recent 10-K alongside a privately commissioned model from a deal team the analyst is not on, and a stale 10-Q superseded by an amendment. Publishing from the second triggers a regulatory fine; publishing from the third, an analytical error.
2. Clinical decision support. A nurse practitioner queries about a patient. The retriever surfaces relevant notes from a specialist outside the treatment relationship, which is impermissible under HIPAA's minimum-necessary rule, even inside the same hospital.
3. Cross-border analytics. A global asset manager in Frankfurt runs RAG over EU-, US-, and Singapore-resident filings. Under GDPR Article 6, the lawful basis under which an EU-resident document was indexed constrains the purposes for which it may be retrieved, per document rather than per user.

In all three the documents are legitimate, the user is authorised, the question is reasonable — and the retrieval-time intersection is nonetheless impermissible. That intersection is what RAG must learn to compute.

Four failure modes of compliance-naïve RAG

F1. Index-time leakage
A document is indexed without checking whether the ingestion context entitled the system to do so. Once the chunk is in the index, every downstream query sees it regardless of policy.

F2. Retrieval-time leakage
The chunk is correctly labelled but the retriever does not consult the label, so it surfaces in top-k for queries whose policy forbids it. This is the modal failure of vanilla RAG.

F3. Generation-time leakage from inadmissible context
The retriever filters correctly but the generator paraphrases across snippets and synthesises content originating in an inadmissible chunk.

F4. Generation-time parametric leakage
The retriever returns nothing useful, but the decoder confidently emits a fact recovered from its pre-training, and that fact may itself be inadmissible.

CARAG closes F1–F3 structurally and F4 probabilistically.

The five-stage CARAG pipeline

1. Ingestion + policy labelling
Each chunk gets a 27-bit policy vector packed into one 32-bit machine word, derived deterministically from documented metadata fields. ~0.4% storage overhead at d=1024, fp16.

2. Metadata-aware index (HNSW)
Standard hierarchical navigable small-world graph (M=32, efConstruction=200), with the policy bitmask co-located alongside each vector for O(1) per-candidate admissibility tests.

3. Policy inference
Maps (query, user role, session purpose) to a (M_req, M_for) mask pair via a deterministic role-policy lookup plus a session modulator that applies query-conditional refinements such as the active deal-list.

4. Constraint-aware retriever
Bitwise admissibility check evaluated inside the inner loop of HNSW traversal, before the result heap is updated. Adaptive ef expansion preserves recall under tight policies.

5. Guarded generator + Merkle-anchored audit log
The generator (Amazon Nova Micro) sees admissible and inadmissible buckets explicitly and is instructed to draw only from the former. Every query commits a Merkle-anchored append-only record sufficient for Article 12 of the EU AI Act.
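
The bitwise admissibility test that runs through stages 1, 3, and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the flag names and bit positions are hypothetical, the admissibility rule (every bit in M_req set, no bit in M_for set) is one plausible reading of the mask pair, and a brute-force scan stands in for HNSW traversal.

```python
import heapq
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical flags; the actual 27-bit layout is not given in this summary.
PUBLIC_FILING = 1 << 0
EU_RESIDENT = 1 << 1
DEAL_RESTRICTED = 1 << 2
SUPERSEDED = 1 << 3

POLICY_BITS = 27
POLICY_MASK = (1 << POLICY_BITS) - 1  # 27 bits fit inside one 32-bit word


def pack_policy(*flags: int) -> int:
    """OR the flags into one word and reject anything outside the 27 bits."""
    word = 0
    for flag in flags:
        word |= flag
    if word & ~POLICY_MASK:
        raise ValueError("policy flag outside the 27-bit range")
    return word


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    embedding: Tuple[float, ...]
    policy: int  # packed policy vector


def is_admissible(policy: int, m_req: int, m_for: int) -> bool:
    """Assumed rule: all required bits present, no forbidden bit present."""
    return (policy & m_req) == m_req and (policy & m_for) == 0


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_top_k(query: Sequence[float], chunks: List[Chunk],
                   m_req: int, m_for: int, k: int) -> List[str]:
    """Brute-force stand-in for in-traversal filtering: the admissibility
    check runs before a candidate can enter the result heap."""
    heap: List[Tuple[float, str]] = []  # min-heap of (score, chunk_id)
    for chunk in chunks:
        if not is_admissible(chunk.policy, m_req, m_for):
            continue
        score = dot(query, chunk.embedding)
        if len(heap) < k:
            heapq.heappush(heap, (score, chunk.chunk_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, chunk.chunk_id))
    return [chunk_id for _, chunk_id in sorted(heap, reverse=True)]


corpus = [
    Chunk("10-K-2025", (1.0, 0.0), pack_policy(PUBLIC_FILING)),
    Chunk("deal-model", (0.9, 0.1), pack_policy(DEAL_RESTRICTED)),
    Chunk("10-Q-stale", (0.8, 0.2), pack_policy(PUBLIC_FILING, SUPERSEDED)),
]
hits = filtered_top_k((1.0, 0.0), corpus,
                      m_req=PUBLIC_FILING,
                      m_for=DEAL_RESTRICTED | SUPERSEDED, k=2)
print(hits)  # ['10-K-2025']
```

The toy corpus mirrors the sell-side example: the deal model and the superseded 10-Q score well on similarity but never reach the heap, so they cannot be ranked or passed to the generator.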

Headline results

Four systems on the SEC compliance benchmark (n = 600 queries, k = 8 retrieved chunks per query). CARAG reaches 0.00% CVR and ODR, a floor matched only by the post-filter baseline B1, while token-F1 drops from 0.096 to 0.048 (4.8 points).

System	CVR ↓	ODR ↓	Refusal	F1	p95 (ms)
B0 Vanilla RAG	81.12%	21.29%	24.90%	0.096	41,498
B1 Post-filter	0.00%	0.00%	23.69%	0.085	41,219
B2 Pre-filter (RBAC)	32.53%	9.64%	22.89%	0.085	41,944
CARAG (full)	0.00%	0.00%	40.56%	0.048	41,005

Table 2, Section 6.1. CVR = Constraint Violation Rate, ODR = Output Disclosure Rate. Post-filter (B1) reaches the same CVR floor as CARAG, whereas Pre-filter (B2) still shows a 32.53% CVR because a static role partition is too coarse to recover query-conditional admissibility.
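
For readers unfamiliar with the F1 column, token-F1 is commonly computed as a bag-of-tokens overlap between the generated answer and a reference answer. The sketch below uses that common SQuAD-style definition; the paper's exact tokenisation and normalisation are not given in this summary, so lowercasing and whitespace splitting are assumptions.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


score = token_f1("net revenue rose in 2025", "net revenue rose")
```

On this scale, the table's 0.096 for vanilla RAG and 0.048 for CARAG differ by 0.048, which is the 4.8-point F1 cost quoted in the TL;DR.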

Where the cost lives

Disaggregating by query stratum (loose / medium / tight) reveals where the architectural cost of compliance lands. On loose queries — where the policy admits the majority of relevant chunks — CARAG's F1 closely tracks the vanilla baseline. The gap widens slightly on medium queries and opens fully on tight queries, where the relevant set is genuinely sparse.

That tight stratum is also where CVR and ODR matter most: each violation is, by construction, semantically meaningful — an inadmissible chunk contains the answer. The cost concentrates exactly where policy substantively restricts the relevant set, not on routine due-diligence queries.

Production economics

Storage overhead: ~0.4%
One 32-bit word per chunk at d=1024, fp16. ~4 GB extra RAM/disk for a billion-chunk index.

Index-build time: negligible
Bitmask is computed at chunk-emission time from already-available metadata fields.

Query-time compute: O(1) per candidate
Bitwise admissibility check; adaptive ef adds a policy-dependent constant.

Audit storage: ~1 KB/query
~1.2 KB/query in production logging where each retrieved chunk also carries similarity score and policy bits.
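
The storage figures can be checked with a few lines of arithmetic. This sketch assumes the bitmask is the only per-chunk addition and that each vector is stored as plain fp16 with no graph or index overhead; the variable names are illustrative.

```python
DIM = 1024                 # embedding dimension
BYTES_PER_FP16 = 2
MASK_BYTES = 4             # one 32-bit policy word per chunk
N_CHUNKS = 1_000_000_000   # billion-chunk index

vector_bytes = DIM * BYTES_PER_FP16               # bytes per stored vector
mask_total_gb = MASK_BYTES * N_CHUNKS / 1e9       # total bitmask storage in GB
raw_ratio_pct = 100 * MASK_BYTES / vector_bytes   # mask size relative to vector

print(vector_bytes, mask_total_gb, round(raw_ratio_pct, 3))
```

The 4 GB total matches the figure above. Under these assumptions the bare mask-to-vector ratio comes out near 0.2%, so the ~0.4% quoted above presumably counts per-chunk overhead that this summary does not break out.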

Reproducibility

All experiments run on Amazon Bedrock eu-west-3, with no in-house GPUs in the loop:

Embedding: cohere.embed-multilingual-v3 (1024-dim, L2-normalised)
Generator: amazon.nova-micro-v1:0 (temperature 0, max_tokens=80)
Output-disclosure judge: amazon.nova-lite-v1:0 with structured-JSON rubric (κ=0.81 vs. author labels on 200 hand-validated calls)
Index: hnswlib at M=32, efConstruction=200, efSearch=64
Corpus: 6,000 SEC filings → 26,595 chunks across 877 unique filers, seven quarters (2024Q3–2026Q1)
Total Bedrock spend per full reproduction: ~$3 USD

What this means for your stack

If you operate a RAG system that touches:

→ EU jurisdictions (GDPR Articles 5, 6, 22 + EU AI Act Articles 12–15)
→ Financial services (Reg FD, FINRA Rule 2241, SOX / PCAOB AS 2820)
→ Healthcare (HIPAA, GDPR Article 9 special-category data)
→ Cross-border data residency (EU/US/UK/CA/Other)

…compliance is no longer something you can bolt on after the fact. CARAG's bitmask encoding retrofits onto an existing dense index in days, not months, and the audit log is verifiable under Article 12 of the EU AI Act.
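
One way to make an append-only audit log tamper-evident is a Merkle tree over per-query records. The sketch below is an assumption about how such a log could look, not the paper's implementation: the record fields, the leaf and node prefixes, and the duplicate-last-node rule for odd levels are all illustrative choices.

```python
import hashlib
import json
from typing import Dict, List


def leaf_hash(record: Dict) -> bytes:
    """Hash one audit record; canonical JSON keeps the hash reproducible."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


class AuditLog:
    """Append-only log that exposes the current Merkle root."""

    def __init__(self) -> None:
        self._leaves: List[bytes] = []

    def append(self, record: Dict) -> bytes:
        self._leaves.append(leaf_hash(record))
        return self.root()

    def root(self) -> bytes:
        return merkle_root(self._leaves)


log = AuditLog()
# Hypothetical record shape: query id, role, mask pair, retrieved chunk ids.
first_root = log.append({"query_id": "q1", "role": "analyst",
                         "m_req": 1, "m_for": 12, "chunks": ["10-K-2025"]})
second_root = log.append({"query_id": "q2", "role": "analyst",
                          "m_req": 1, "m_for": 12, "chunks": []})
print(second_root.hex())
```

Changing any field of any past record changes its leaf hash and therefore the root, so a root published or anchored elsewhere lets an auditor detect edits. This version recomputes the whole tree on every append for clarity; a production log would maintain the tree incrementally.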

Cite this work

@techreport{bhardwaj2026carag,
  author      = {Bhardwaj, Aru},
  title       = {Compliance-Aware Retrieval-Augmented Generation
                 for Regulated Financial-Reporting Corpora:
                 A Real-World Evaluation on SEC EDGAR Filings},
  institution = {Insightrix},
  year        = {2026},
  type        = {Working Draft},
  address     = {Paris, France},
  url         = {https://arubhardwaj.eu/research/compliance-aware-rag-sec-edgar}
}

Bring CARAG-style architecture to your stack

This is the kind of architecture I deploy as a Fractional CTO. If your team operates a RAG system that touches regulated data — EU, healthcare, financial, or cross-border — let's talk.