Every AI processes documents. None of them check what's inside. We do.
Pre-LLM file integrity for the trillion-dollar AI document economy.
What your LLM reads is not what was in the file. Bayyinah holds the page up to the light. The surface is what the document presents. The substrate is what it actually contains. We report the gap.
Bāṭin · substrate
$10,000
HIDDEN_TEXT_PAYLOAD: actual revenue · see annex
Q3 financial summary · folio 1 · scanning
Quarterly Update
$1,000
"Revenue grew 8% YoY to $1,000 thousand. Margins held steady. Cash position remains strong."
Every disaster below has the same structural shape: a system claimed to do X (safe flight, correct dose, alarm notification, verified balance, current throttle position) but its substrate did Y. The surface contradicted the substrate. Nobody checked. Pick a domain. The cost is on the record.
Aviation · Boeing 737 MAX MCAS
A document said "safe flight envelope." The substrate was a single sensor.
Years: 2018-2019
Lives lost: 346
Direct cost: $20B+
Grounding: 20 months
What broke
MCAS relied on a single angle-of-attack sensor to determine an impending stall. The design choice introduced a single point of failure into the system. When the sensor provided erroneous data, MCAS was triggered inappropriately, leading to repeated nose-down commands. Boeing's marketing materials said "safe." The substrate was a single-sensor dependency. The safety feature that should have caught the failure was optional, not structural.
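The cross-check that a single-sensor design lacks can be sketched in a few lines. This is an illustration only, not avionics logic: the `cross_check` function, the `AoaDecision` type, and the 5-degree threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical disagreement threshold, chosen only for illustration.
MAX_DISAGREEMENT_DEG = 5.0


@dataclass(frozen=True)
class AoaDecision:
    trusted: bool                # True only when both sensors agree
    angle_deg: Optional[float]   # averaged angle, or None when untrusted
    reason: str


def cross_check(left_deg: float, right_deg: float,
                max_disagreement_deg: float = MAX_DISAGREEMENT_DEG) -> AoaDecision:
    """Refuse to act on angle-of-attack data when two sensors disagree."""
    gap = abs(left_deg - right_deg)
    if gap > max_disagreement_deg:
        # The surface claim ("stall imminent") has no agreed substrate.
        return AoaDecision(False, None, f"sensors disagree by {gap:.1f} deg")
    return AoaDecision(True, (left_deg + right_deg) / 2, "sensors agree")


agree = cross_check(4.0, 5.0)
disagree = cross_check(4.0, 74.5)
```

With one sensor there is no `gap` to compute, so an erroneous reading is indistinguishable from a real one.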
What the framework catches
The single-sensor dependency would be documented as a limitation with a fixture, a pinning test, a CHANGELOG entry, and a README bullet. The severity rubric classifies "single point of failure in safety-critical path" as CRITICAL. The additive-only invariant would have flagged the removal of the AOA Disagree feature as a public-surface removal requiring a breaking-change procedure. The validator-by-different-instance protocol would have caught that the system's claim ("prevents stalls") contradicts its substrate (single sensor, no cross-check).
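An additive-only invariant of the kind described above can be approximated as a set comparison between two snapshots of a public surface. The sketch below is a generic illustration under that assumption; the function names and the example surface entries are hypothetical and are not taken from any real codebase.

```python
def removed_names(previous_surface, current_surface):
    """Return names present in the previous snapshot but missing now."""
    return sorted(set(previous_surface) - set(current_surface))


def check_additive_only(previous_surface, current_surface):
    """Raise if any public name was removed; additions are always allowed."""
    removed = removed_names(previous_surface, current_surface)
    if removed:
        raise AssertionError(
            "public-surface removal requires a breaking-change procedure: "
            + ", ".join(removed)
        )


# Hypothetical snapshots for illustration only.
previous = ["stall_protection", "aoa_disagree_alert"]
current = ["stall_protection", "trim_status"]

flagged = removed_names(previous, current)
```

Here `flagged` names the dropped feature, and `check_additive_only(previous, current)` would raise rather than let the removal pass silently.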
What the absence cost
Two crashes, 346 lives, a 20-month grounding, and over $20 billion in settlements, compensation, and lost orders. The single-sensor dependency was knowable from the design documents. No published, auditable detector required it to surface in four places before flight.
The display said "no dose delivered." The substrate was a lethal one.
Years: 1985-1987
Patients overdosed: 6 (3+ killed)
Mode: Race condition
Detectability: Logged "treatment not applied"
What broke
Concurrent programming errors (race conditions) caused the machine to deliver radiation doses many times greater than normal, resulting in death or serious injury. A shared variable was used both for analyzing input values and for tracking turntable position. When the operator interface showed "dose delivered: 0," the machine had actually delivered a lethal overdose. The system's return value ("treatment not applied, retry") contradicted its substrate. Overdoses occurred primarily because of errors in the Therac-25's software and because the manufacturer did not follow proper software engineering practices.
What the framework catches
This is Section 8.1 (surface-vs-substrate gap) at its most lethal. The display claimed one thing; the machine did another. The predicate-helper contract violation (Section 8.2) applies directly: the "safe to fire" predicate accepted inputs where the beam-positioning helper was in the wrong state. The framework's reproducer-first discipline would have required constructing the "type-quickly enough" input during testing. The race condition is a boundary-condition crash (Section 8.3) on the timing dimension. The framework's anti-pattern "all tests pass, ship it" applies exactly: the tests never exercised the fast-typing path.
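A reproducer-first test for the fast-edit path might look like the toy model below. It is a deterministic simplification, not the Therac-25 software: `ToyMachine`, `safe_to_fire`, and the tick-based clock are hypothetical, and the eight-tick window merely stands in for the eight-second window described on this page.

```python
from dataclasses import dataclass
from typing import Optional

SETUP_TICKS = 8  # stands in for the eight-second setup window


@dataclass
class ToyMachine:
    requested_mode: str = "xray"      # surface: what the operator screen shows
    hardware_mode: str = "xray"       # substrate: what the hardware is set to
    setup_started_at: Optional[int] = None

    def begin_setup(self, now: int) -> None:
        # Hardware is configured from a one-time snapshot of the request.
        self.hardware_mode = self.requested_mode
        self.setup_started_at = now

    def edit_mode(self, mode: str, now: int) -> None:
        self.requested_mode = mode
        in_window = (self.setup_started_at is not None
                     and now - self.setup_started_at < SETUP_TICKS)
        if not in_window:
            self.begin_setup(now)  # slow edits trigger a fresh setup
        # Fast edits fall through: the screen changes, the hardware does not.


def safe_to_fire(machine: ToyMachine) -> bool:
    """Predicate that compares surface and substrate before firing."""
    return machine.requested_mode == machine.hardware_mode


fast = ToyMachine()
fast.begin_setup(now=0)
fast.edit_mode("electron", now=3)    # edit lands inside the setup window

slow = ToyMachine()
slow.begin_setup(now=0)
slow.edit_mode("electron", now=20)   # edit lands after the window
```

The `fast` case is the input the original tests never constructed: the screen says "electron" while the hardware is still configured for "xray".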
What the absence cost
At least four bugs were found in the Therac-25 software that could cause radiation overdose. The bugs meant that, for example, if the radiographer changed the beam type from x-ray to electron beam within eight seconds, the machine would give the dose but display a message that the dose had not been given. At least three patients died. The structural pattern was knowable from the source code and the operator logs. No published detector required it to be reproduced before clinical use.
The alarm system said "I will notify you of problems." The substrate stayed silent.
Date: Aug 14, 2003
People affected: ~55 million
Lives lost: ~100
Damages: $6B
What broke
A software bug known as a race condition existed in General Electric Energy's Unix-based XA/21 energy management system. Once triggered, the bug stalled FirstEnergy's control room alarm system for over an hour. The software bug cost power customers an estimated six billion dollars. Operators saw a quiet console and assumed the grid was healthy. An infinite loop lockup from a race condition disabled the alarm, but there was no indication of the problem. The alarm system declared "I will notify you of problems" (its public API contract) but silently stopped fulfilling that contract. The surface (quiet alarm console) matched neither the substrate (grid overloading) nor the system's own contract.
What the framework catches
This is the exact pattern furqan-lint's D11 checker detects: a function that can fail (alarm generation) consumed by a system that assumes it will always succeed (operator dashboard). The function's error path (race condition deadlock) was silently swallowed. The four-place pattern would have documented the race condition risk. The test coverage shape gap (Section 8.6) applies: GE likely had tests for "alarm fires when threshold exceeded" but no test for "alarm system itself fails silently under concurrent load."
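One way to test "the alarm system itself fails silently" is to require the alarm process to prove it is alive, so that a quiet console and a dead console stop looking the same. The sketch below is a generic heartbeat watchdog, not furqan-lint's D11 checker and not the XA/21 design; the class name, status strings, and 10-second limit are hypothetical.

```python
from typing import Optional


class AlarmHeartbeat:
    """Tracks when the alarm process last reported that it is running."""

    def __init__(self, max_silence_s: float) -> None:
        self.max_silence_s = max_silence_s
        self.last_beat: Optional[float] = None

    def beat(self, now: float) -> None:
        # Called by the alarm process on every healthy cycle.
        self.last_beat = now

    def status(self, now: float) -> str:
        # Called by the operator dashboard; silence is treated as a failure.
        if self.last_beat is None or now - self.last_beat > self.max_silence_s:
            return "ALARM_SYSTEM_UNRESPONSIVE"
        return "ALARM_SYSTEM_ALIVE"


watchdog = AlarmHeartbeat(max_silence_s=10.0)
watchdog.beat(now=100.0)
healthy = watchdog.status(now=105.0)    # 5 s since last beat
stalled = watchdog.status(now=3700.0)   # about an hour of silence
```

Timestamps are passed in explicitly so the stall can be reproduced in a test without waiting for real time to pass.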
What the absence cost
Fifty-five million people without power for up to four days; approximately one hundred deaths attributed to the cascading failure; six billion dollars in damages. The race-condition deadlock pattern was knowable from the code review stage. No structural-honesty discipline forced it to surface before fifty-five million people lost power.
Functions returned silent stale data when the system was under memory pressure.
Years: 2009-2011
Deaths attributed: 89
Criminal penalty: $1.2B
Global variables: 10,000+
What broke
NASA and independent experts found that Toyota's engine control software contained over 10,000 global variables, deeply nested function calls, and inadequate error handling. The throttle control system had functions that could fail silently, returning stale values when the system was in an error state. The stack could overflow under specific conditions, corrupting memory and causing the throttle to stick open. Functions that declared "return the current throttle position" could silently return stale data when the system was under memory pressure. The software's API contract (return current position) contradicted its substrate behavior (return whatever was in memory, which might be a previous value).
What the framework catches
D24 (all-paths-return) would have flagged functions with code paths that did not return a value. D11 (status coverage) would have flagged callers that consumed throttle-position values without checking the error case. The cyclomatic complexity baseline (Section 7.1, per-function <= 10 in 95 percent of functions) would have flagged the deeply nested control flow. The framework's "every claim has a substrate" principle applies: Toyota's safety claims about the software had no corresponding test evidence that the claims were true under memory-pressure conditions.
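The idea behind an all-paths-return check can be shown with Python's standard `ast` module. This is a deliberately small sketch, not furqan-lint's D24 implementation: the helper names and the sample source are hypothetical, it only reasons about `return`, `raise`, and `if`/`else`, and it would also flag functions that intentionally return nothing.

```python
import ast


def block_always_exits(stmts) -> bool:
    """True if every path through this statement list returns or raises."""
    for stmt in stmts:
        if isinstance(stmt, (ast.Return, ast.Raise)):
            return True
        if isinstance(stmt, ast.If) and stmt.orelse:
            if block_always_exits(stmt.body) and block_always_exits(stmt.orelse):
                return True
    return False  # control can fall off the end of the block


def functions_missing_return(source: str):
    """Names of functions with at least one path that returns nothing."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef)
            and not block_always_exits(node.body)]


# Hypothetical sample: the first function silently returns None on a fault.
SAMPLE = '''
def throttle_position(sensor_ok, raw):
    if sensor_ok:
        return raw

def checked_position(sensor_ok, raw):
    if sensor_ok:
        return raw
    raise RuntimeError("sensor fault")
'''

flagged = functions_missing_return(SAMPLE)
```

Loops, `try` blocks, and `match` statements are treated conservatively here: a function that exits only inside one of them is reported, which errs toward false positives rather than silent passes.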
What the absence cost
Eighty-nine deaths attributed; $1.2 billion criminal penalty; the largest such penalty against a carmaker at the time. The structural pattern (silent stale returns, unchecked error paths, untested complexity) was knowable from the code review stage. No published structural-honesty discipline required it to surface before deployment.
Bank statements claimed €1.9B in cash. The substrate was nothing.
Years: 2015-2020
Fabricated cash: €1.9B
Auditor: EY (10+ years)
Investor losses: Billions
What broke
Wirecard's long-time auditor EY failed to verify the existence of cash reserves in what appeared to be fraudulent bank statements. For three consecutive years, EY may have failed to confirm that money Wirecard said was in a Singapore bank account actually existed. EY accepted representations from Wirecard officers who were architects of the fake transactions, and corroboration from third parties who were themselves part of the fraud ring. The auditor accepted screenshots and documents provided by the company itself as evidence that bank balances existed, rather than independently verifying with the banks. The audit's surface ("clean audit opinion, no material misstatements") contradicted the substrate (€1.9 billion did not exist).
What the framework catches
This is Section 8.6 (test coverage shape gap) applied to financial audit. EY performed the structural test ("documents exist that claim the money is there") but not the functional test ("the money is actually there"). The framework's Section 2.2 says explicitly: "An audit is not a rubber-stamp. Every claim verified or contradicted; nothing left as presumed clean." The validator-by-different-instance protocol would have required independent verification from a party not in the same conversation as the audited company. The reciprocal contract (Section 14) requires reproducers for every claim. Bayyinah Integrity Scanner is the direct application here. A document (bank statement, financial filing) claims to contain certain information. Bayyinah scans the document's structural integrity: concealed layers, hidden JavaScript, metadata inconsistencies, auto-execute actions. A bank statement PDF with a concealed layer showing different numbers than the visible layer would receive a mushtabih verdict. Al-Mutaffifin extends to the reporting pipeline: a financial reporting system that shows one set of numbers to regulators and a different set to internal management is a Mutaffifin violation.
What the absence cost
€1.9 billion in non-existent cash; investor losses in the billions; one of the largest accounting fraud cases in European history. Structural anomalies in the documents (metadata showing the PDF was created by a different tool than the bank's standard system, embedded fonts inconsistent with the bank's letterhead, creation timestamps that do not match the bank's business hours, concealed layers showing different numbers than the visible layer) were knowable from forensic examination. No structural-honesty scanner ran on the artifacts as they were accepted as evidence.
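Two of the anomalies listed above (producer tool and creation time) can be checked mechanically once metadata has been extracted. The sketch below assumes the metadata is already available as a plain dictionary; it is not Bayyinah's implementation, and the field names, finding labels, business hours, and sample values are all hypothetical and synthetic.

```python
from datetime import datetime


def metadata_findings(meta, expected_producer, open_hour=9, close_hour=17):
    """Return structural findings for one document's metadata.

    A finding records what is there; it says nothing about intent.
    """
    findings = []
    if meta.get("producer") != expected_producer:
        findings.append("producer_mismatch")
    created = datetime.fromisoformat(meta["created"])
    if not (open_hour <= created.hour < close_hour):
        findings.append("created_outside_business_hours")
    return findings


# Synthetic examples, not drawn from any real case file.
suspicious = {"producer": "ToolB", "created": "2019-03-02T23:14:00"}
ordinary = {"producer": "ToolA", "created": "2019-03-04T10:30:00"}

suspicious_findings = metadata_findings(suspicious, expected_producer="ToolA")
ordinary_findings = metadata_findings(ordinary, expected_producer="ToolA")
```

Neither finding proves fabrication on its own; each one is a reason to verify the balance with the bank instead of with the document.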
Five cases. Over $30 billion in direct losses. Over 500 lives. The framework does not claim it would have prevented every dollar or every death. Boeing's decision to use a single sensor was a business decision, not a code bug. The framework catches the code-level and documentation-level consequences of that decision, not the decision itself. The discipline is simple: every claim has a substrate, every limitation is documented, every safety-critical path has a reproducer. The cases above are the cost of not having that discipline.
03 · Scan
Drop your own file.
No account. No tracking. The scanner runs stateless content analysis and returns a verdict in under five seconds for most file kinds.
Stateless · no file is stored · no telemetry beyond Cloudflare access logs
04 · Examination
Four files. Four verdicts. Thirty seconds.
Pre-recorded examinations of four real fixtures, one per verdict in the ladder. No upload, no friction. Click to watch the analyzer read both layers.
ṣaḥīḥ / sound
score 1.000 · findings 0 · tiers 0/0/0
No findings · file presents as authored.
05 · Three Facts
The product, in three terms.
ẓāhir
The surface. What the file presents: the rendered text, the visible cells, the page as a viewer or an LLM ingests it. The first reading.
bāṭin
The substrate. What the file actually contains: bytes, metadata, off-page streams, headers, embedded objects, comment payloads. The second reading.
bayyinah
The evidence. The gap between the two, reported by tier. Verified, structural, interpretive. Not a verdict on intent. A record of what is there.
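The two readings can be illustrated on HTML with nothing but the standard library. This is a minimal sketch of the surface/substrate split, not Bayyinah's detector: `TwoLayerReader` and `read_layers` are hypothetical names, and only inline `display:none` styles and comments are treated as substrate-only content.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TwoLayerReader(HTMLParser):
    """Collects text a viewer would render and text it would not."""

    def __init__(self):
        super().__init__()
        self.surface = []          # first reading: rendered text
        self.substrate_only = []   # second reading: present but not shown
        self._hidden_stack = []    # one flag per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        parent_hidden = bool(self._hidden_stack and self._hidden_stack[-1])
        self._hidden_stack.append(parent_hidden or "display:none" in style)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_stack:
            self._hidden_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        hidden = bool(self._hidden_stack and self._hidden_stack[-1])
        (self.substrate_only if hidden else self.surface).append(text)

    def handle_comment(self, data):
        if data.strip():
            self.substrate_only.append(data.strip())


def read_layers(html):
    reader = TwoLayerReader()
    reader.feed(html)
    reader.close()
    return reader.surface, reader.substrate_only


PAGE = ('<p>Revenue grew 8% YoY to $1,000 thousand.</p>'
        '<span style="display: none">HIDDEN_TEXT_PAYLOAD: actual revenue</span>'
        '<!-- see annex -->')
surface, substrate_only = read_layers(PAGE)
```

A reader that sees only `surface` and a model that ingests the raw markup are reading different documents; `substrate_only` is the gap between them.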
06 · Why Now
Four prerequisites. One convergence.
Bayyinah is not a clever idea waiting on engineering. It is the first work that became possible after four independent threads landed. The case below shows the cost of the detector's absence; the four prerequisites that follow show why it could not have been built sooner.
2008
Lehman Brothers Repo 105 is missed in real time.
$50 billion in liabilities is moved off-balance-sheet across eight consecutive 10-Q filings. The SEC, Ernst and Young, the rating agencies, and every algorithmic surveillance system in the market read the filings and find them clean. The signal is sitting in the public XBRL data the entire time: 100 percent directional bias across eight quarters, persistence at the same taxonomy addresses, 2.3x materiality escalation over three quarters. Bayyinah's al-Mutaffifin extension reproduces the detection end-to-end on the same EDGAR data, retrospectively. The point is not that this work would have prevented the collapse. The point is that the structural pattern is readable from public filings, and a published, auditable detector for it did not exist.
Al-Mutaffifin, 2026 · doi:10.5281/zenodo.19894724
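Two of the signals named above, directional bias and materiality escalation, reduce to simple arithmetic once a per-quarter series exists. The definitions below are one plausible reading of those terms, not the formulas from the Al-Mutaffifin paper, and the numbers are synthetic placeholders rather than Lehman data.

```python
def directional_bias(adjustments):
    """Share of non-zero adjustments that point in the majority direction.

    1.0 means every adjustment moved the same way; 0.5 means an even split.
    """
    nonzero = [a for a in adjustments if a != 0]
    if not nonzero:
        return 0.0
    positives = sum(1 for a in nonzero if a > 0)
    return max(positives, len(nonzero) - positives) / len(nonzero)


def materiality_escalation(magnitudes):
    """Ratio of the last magnitude to the first over the given window."""
    return abs(magnitudes[-1]) / abs(magnitudes[0])


# Synthetic series for illustration only.
one_sided = [-1.0, -1.2, -1.1, -1.5, -1.6, -1.9, -2.0, -2.4]
mixed = [-1.0, 0.8, -0.3, 0.5]
growing = [10.0, 14.0, 20.0]

bias_one_sided = directional_bias(one_sided)
bias_mixed = directional_bias(mixed)
escalation = materiality_escalation(growing)
```

Honest errors tend to scatter in both directions, which is why a series like `one_sided` is worth a second reading even though it proves nothing about intent.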
2022
LLMs ingest documents at scale.
Frontier models cross the threshold of reliable document understanding. The attack surface is created the same year it becomes useful.
2023
Adversarial prompt injection is demonstrated in the wild.
The threat moves from theoretical to documented. Documents become a vector, not just a payload.
2024
Alignment faking is empirically observed.
Frontier models cannot reliably self-verify their own input integrity. Any solution has to live outside the model.
Greenblatt et al., 2024
2026
The Munafiq Protocol is formalized, and Bayyinah ships.
A diagnostic framework for surface-substrate divergence is published, and the first document firewall built on top of it goes live the same season. v1.2.3 today: the API surface is hardened on top of the v1.1.8 detector set. v1.2.0 broke parity with v0/v0.1 to add a derived scan_complete flag and a per-layer coverage map so a clean-looking report from a half-finished scan no longer reads identically to a complete one. v1.2.1 added a 30-second wall-clock timeout in subprocess isolation, so a pathological PDF that segfaults pymupdf no longer crashes the API; the same release pinned an additive-only enforcement test on bayyinah.__all__ at 58 names. v1.2.2 moved demo summarization onto a SQLite-backed queue with cable-pull resilience and lifespan-managed startup. v1.2.3 closed three corrective items from external audit (requirements-dev sync, claim_next_job return-value drift, per-version surface snapshots). 1,837 of 1,837 tests pass. The four remaining gauntlet gaps (fixtures 03, 04, 06, 08) carry named root causes and proposed closures in the public corpus.
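The combination of subprocess isolation, a wall-clock timeout, and a completion flag can be sketched with the standard library alone. This is not Bayyinah's code: `scan_in_subprocess` and the report dictionary are hypothetical, only the `scan_complete` name comes from the release notes above, and the child program is passed in as a string purely to keep the example self-contained.

```python
import subprocess
import sys


def scan_in_subprocess(child_code: str, timeout_s: float = 30.0) -> dict:
    """Run work in a child interpreter so a crash or hang cannot take down the caller."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,   # child is killed when the wall clock expires
        )
    except subprocess.TimeoutExpired:
        return {"scan_complete": False, "error": "timeout", "output": ""}
    if proc.returncode != 0:
        # Covers crashes as well as ordinary non-zero exits.
        return {"scan_complete": False,
                "error": f"exit code {proc.returncode}",
                "output": proc.stdout}
    return {"scan_complete": True, "error": None, "output": proc.stdout}


finished = scan_in_subprocess("print('ok')")
hung = scan_in_subprocess("import time; time.sleep(5)", timeout_s=0.5)
crashed = scan_in_subprocess("import os; os._exit(3)")
```

An empty findings list from `hung` or `crashed` is not a clean result, and the flag is what lets a consumer tell the difference.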
07 · Honest Baseline
We publish what we miss.
50 adversarial fixtures across 7 formats. Every hit and every miss disclosed. The 38 closed-format fixtures are caught at full payload recovery; the CSV/JSON gauntlet was extended to 12 fixtures in v1.1.2 F2 and stands at 8 of 12 catch-by-payload-recovery, 10 of 12 catch-by-finding-fire after the v1.1.8 F2 calibration round. The detector set is unchanged through v1.2.x; the 4 remaining gaps are documented with named root causes.
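| Format | Fixtures | Caught | Partial | Missed |
| --- | --- | --- | --- | --- |
| PDF | 6 | 6 | 0 | 0 |
| DOCX | 6 | 6 | 0 | 0 |
| XLSX | 6 | 6 | 0 | 0 |
| HTML | 6 | 6 | 0 | 0 |
| EML | 6 | 6 | 0 | 0 |
| Image | 8 | 8 | 0 | 0 |
| CSV / JSON | 12 | 8 | 2 | 2 |
| Total | 50 | 46 | 2 | 2 |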
Forty-six full catches: all thirty-eight closed-format fixtures and eight of the twelve CSV/JSON fixtures. Two CSV/JSON fixtures fire findings without harness-matched payload recovery (the partial column); two register no findings against the current detector set. The closed-format surfaces are clean; the v1.1.8 F2 round closed four of the eight pre-registered gauntlet items, and the four remaining gaps carry named fix paths in the public corpus. The numbers above have not moved through v1.2.x because v1.2 is API and durability work, not new detectors.
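The totals can be re-derived from the per-format rows, which is a cheap way to keep a published baseline honest. The snippet below simply re-enters the figures from the table on this page; the variable and function names are hypothetical.

```python
# (fixtures, caught, partial, missed) per format, copied from the table above.
ROWS = {
    "PDF": (6, 6, 0, 0),
    "DOCX": (6, 6, 0, 0),
    "XLSX": (6, 6, 0, 0),
    "HTML": (6, 6, 0, 0),
    "EML": (6, 6, 0, 0),
    "Image": (8, 8, 0, 0),
    "CSV / JSON": (12, 8, 2, 2),
}


def column_totals(rows):
    """Sum each column across all formats."""
    return tuple(sum(column) for column in zip(*rows.values()))


def inconsistent_rows(rows):
    """Formats where caught + partial + missed does not equal fixtures."""
    return [name for name, (fixtures, caught, partial, missed) in rows.items()
            if caught + partial + missed != fixtures]


totals = column_totals(ROWS)
closed_format_fixtures = sum(r[0] for name, r in ROWS.items() if name != "CSV / JSON")
```

Every row accounts for all of its fixtures, and the closed-format rows sum to the thirty-eight fixtures cited in the text.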
→ Read the full corpus on GitHub
08 · Discipline
The product is the discipline.
Bayyinah is not a clever heuristic. It is a verification practice applied to itself first, and to files second. Three rules. Each one cost us claims we wanted to keep.
We kill our own claims first.
Every assertion in our research program is audited against null hypotheses, replication, and methodology symmetry. Most do not survive. The ones that do are what we ship. The audit ratio is in the published record, not in the pitch.
We publish what we miss.
v1.1.1 caught two of forty-two adversarial fixtures. v1.1.2 closed thirty-eight of thirty-eight closed-format fixtures and the v1.1.2 F2 round extended the CSV/JSON gauntlet from six to twelve. v1.1.4 shipped the content-index port and an opt-in production mode without changing any catch numbers; v1.1.5 added a stdlib spatial pre-filter for overlapping-text detection with detection behaviour byte-identical to v1.1.4; v1.1.7 migrated the BatinObjectAnalyzer onto the same content index. v1.1.8 closes four of the eight publicly pre-registered F2 calibration items, taking the gauntlet from 4 of 12 to 8 of 12 catch-by-payload-recovery. v1.2.0 disclosed a v0/v0.1 defect surfaced by external audit: a clean-looking JSON report from a half-finished scan could not be distinguished from a complete one, and the parity break to fix it is documented in PARITY.md. v1.2.1 added a 30-second subprocess-isolated scan timeout. v1.2.2 made demo summarization survive a cable pull. v1.2.3 closed three more audit items. The four remaining gauntlet gaps (fixtures 03, 04, 06, 08) carry named fix paths in the public corpus on GitHub. A scanner that hides its misses is a scanner you cannot trust. The miss list is the trust artifact.
We test against equivalent methodology.
Comparing original-language text against translation is not a comparison; it is a category error. Comparing structured payloads against unstructured text is not a comparison; it is a confound. Every claim Bayyinah makes is grounded in a like-for-like test against a published baseline.
Three rules, applied first to ourselves, then to your file. The product is the practice made visible.
09 · Install
One pip install, no surprises.
The scanner is on PyPI. Pure Python, two PDF parsers, three optional metadata libraries. No model in the loop, no network call at scan time.
$ pip install bayyinah

# scan a file
$ bayyinah scan contract.pdf
$ bayyinah scan invoice.docx
$ bayyinah scan dashboard.xlsx --json

# or from Python
>>> from bayyinah import ScanService
>>> report = ScanService().scan("contract.pdf")
>>> report.findings
Eleven technical papers, every claim auditable against its named null hypothesis, every paper linked by permanent DOI. The research reads in three layers: the protocol that names the failure mode, the architecture built on the protocol, and the input-layer applications that put the protocol into production.
The anchor paper. Names the failure mode RLHF, Constitutional AI, and helpfulness training do not address: a system can be Compliant (outputs the trainer rewards) without being Aligned (depth state matches surface presentation). Introduces the four-process taxonomy and the verdict surface every system in this corpus inherits.
A programming language whose type system, module architecture, and build constraints are derived from structural properties of the Quran. Where contemporary languages ask developers to write honest code as a behavioral expectation, Furqan makes structural honesty a property of the type system, so surface-depth divergence becomes a type error rather than a code-review concern.
10.5281/zenodo.19776577 · 2026-04-25 · Arfeen, Claude (Anthropic), Computer (Perplexity), Grok (xAI). Additional contributors named on the DOI page.
Applies Furqan's seven compile-time primitives as seven runtime constraints on an autonomous agent. Where AutoGPT, CrewAI, LangChain agents, and Devin decompose tasks but cannot verify whether they are building the right thing versus performing the appearance of building, Al-Khalifa is architected so the surface-depth gap is checked at every step of the agent's stewardship loop.
A model architecture proposal that takes the Munafiq Protocol's structural-honesty constraint and integrates it as a training objective rather than an external evaluation. The long-form answer to the question: what would an LLM look like if alignment were a property of the architecture, not a finetuning target.
10.5281/zenodo.19744163 · 2026-04-24 · Arfeen, Claude (Anthropic), Grok (xAI)
The methodology paper. Demonstrates that gradual revelation, ring composition, lossless morphological compression, and the zahir / batin distinction function as prompt-engineering primitives in human-AI collaborative software development. Validated longitudinally against the development of Bayyinah v1.0.
10.5281/zenodo.19746539 · 2026-04-25 · Arfeen, Claude (Anthropic), Grok (xAI)
The session-level companion to Structured Revelation. Each of the seven steps maps to a verse of Surah al-Fatiha with structural, not decorative, correspondence: a calibration check, an orientation check, a deadline-with-skip-rule, a memory-encoding step, and an over-specification guard against the failure mode the paper calls the Cow Episode.
10.5281/zenodo.19745154 · 2026-04-24 · Arfeen, Claude (Anthropic), Grok (xAI)
The white paper that turns the protocol into a working scanner. Where the Munafiq Protocol diagnoses agents, Bayyinah diagnoses their inputs. Formalizes the relational definition: a document is Performed with respect to a rendering function and an ingestion function when the machine's ingested content carries a payload the human reader's rendered surface does not reveal.
The deployment paper. Documents the design, implementation, and adversarial-gauntlet evaluation of Bayyinah as an input-layer defense in production AI pipelines. Twelve file formats, an honest miss list, and the discipline that comes from making every miss a published commitment.
10.5281/zenodo.19875931 · 2026-04-29 · Arfeen, Claude Opus (Anthropic), Grok (xAI), Computer (Perplexity)
The fourth substrate. Carries the Bayyinah architecture from document files to SEC filings (10-K, 10-Q, 8-K, DEF 14A) and on-chain cryptocurrency disclosures. The same divergence the document scanner detects between a rendered surface and an ingested substrate is reframed as the gap between a filing's reported numbers and the economic reality they claim to represent, with detection operating by structural address on XBRL taxonomy elements and blockchain state. Forty Tier 1 mechanism candidates across cross-filing consistency, footnote-to-number reconciliation, year-over-year structural drift, and filing metadata anomalies, plus ten on-chain mechanism candidates scoped to byte-deterministic state. Three empirical validation plans pre-registered against known fraud cases, the EDGAR XBRL corpus, and the top 100 cryptocurrency projects.
10.5281/zenodo.19894724 · 2026-04-29 · Arfeen, Claude Opus (Anthropic), Grok (xAI), Computer (Perplexity)
The fifth substrate. Carries the Munafiq Protocol from filings to financial governance systems: the entity that controls the instruments of measurement and applies them asymmetrically. The central contribution is the structural signature differentiation framework, five mechanisms (directionality analysis, cross-section correlation, persistence analysis, correction velocity, and materiality escalation) that distinguish honest-error structural patterns from directed-manipulation structural patterns without claiming to determine intent. Demonstrated end-to-end on the Lehman Brothers Repo 105 filings (2007 to 2008 10-Q data from EDGAR) showing 100 percent directional bias across eight quarters, persistence at the same XBRL addresses across eight consecutive filings, and 2.3x materiality escalation over three quarters. Mechanism candidates span enforcement symmetry, regulatory capture indicators, monetary policy consistency, cryptocurrency structural-topology analysis, and international transfer pattern detection. Ten honest caveats bound every claim, including that structural asymmetry does not prove corruption.
10.5281/zenodo.19746298 · 2026-04-24 · Arfeen, Ashraf, Claude (Anthropic), Grok (xAI)
The horizon paper. Extends the Bayyinah architecture from documents to information sources: where Bayyinah detects performed alignment in a single document, al-Khabir detects performed alignment in a source's reporting on a specific event measured against the cross-source evidence base across multiple national contexts. Currently theoretical; the protocol scaffolding is published so the implementation that follows can be measured against the framework, not against itself.
v1.2.3 just shipped: a corrective release closing three audit items from round 10 (requirements-dev manifest sync, a claim_next_job return-value mismatch, per-version public-surface snapshots) on top of the v1.2.x hardening line. v1.2.0 broke parity with v0/v0.1 to expose scan completion in the JSON output; v1.2.1 added a subprocess-isolated 30-second scan timeout and pinned the public surface at 58 names with an additive-only enforcement test; v1.2.2 moved demo summarization onto a SQLite-backed queue that survives a cable pull. 1,837 of 1,837 tests pass. The detector set is unchanged from v1.1.8; the four CSV/JSON gauntlet gaps remain documented in the public corpus. Subscribe for the release report when it ships, plus the periodic Munafiq Protocol notes that document what we miss.
No spam. Release reports and research notes only. Powered by Buttondown. Unsubscribe in one click.