Model Weight Integrity: Verifying Authenticity Before Every Load

Why model weights are executable code

A model weight file determines what the model computes — altering weights changes behavior as surely as altering source code
Unlike source code, weight files are large binary blobs that are difficult to inspect visually
A weight file that can be silently replaced between training and inference gives an attacker control over model behavior without touching any application code
Threat model: supply chain attack on the weight hosting infrastructure, man-in-the-middle on the weight download, insider modification post-training

Cryptographic hash verification

Every approved weight file has a SHA-256 (or SHA-3) hash recorded in a Vault-backed manifest
The hash is computed at training completion and signed by the training pipeline's identity before being written to the manifest
At load time, the serving infrastructure recomputes the hash and compares it to the signed manifest value
Hash mismatch: load aborted, alert fired, the serving pod does not start
Verification runs on every load — not just at deployment time
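
The load-time check above can be sketched in a few lines of Python. This is a minimal illustration, not the serving infrastructure's actual code: the names WeightIntegrityError, sha256_file, and verify_before_load are hypothetical, and the expected hash is assumed to come from a manifest whose signature has already been validated elsewhere.

```python
import hashlib
import hmac
from pathlib import Path


class WeightIntegrityError(RuntimeError):
    """Raised when a weight file does not match its manifest entry."""


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_before_load(path: Path, expected_hex: str) -> None:
    """Raise unless the recomputed hash equals the manifest value.

    Assumption: expected_hex was read from a manifest whose signature has
    already been checked. This function does not verify the signature.
    """
    actual = sha256_file(path)
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise WeightIntegrityError(f"hash mismatch for {path.name}")
```

Raising an exception, rather than returning a boolean, makes it harder for a caller to ignore a mismatch by accident: if the exception is left uncaught, the process exits and the pod does not start.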

HSM-backed signing keys

The signing key for weight manifests is stored in the HSM (same HSM as Vault's unseal key)
Training pipeline authenticates to the HSM via its OIDC workload identity before signing
The signing key cannot be extracted from the HSM: a compromised training pipeline can request signatures only while it still holds authenticated HSM access, and it can never exfiltrate the key to forge signatures offline
Key rotation: a new signing key is issued annually; during the rotation window both the old and new keys are trusted, after which the old key is retired
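
The rotation window can be modeled as a per-key trust interval, with old and new intervals overlapping during the transition. The sketch below is one possible shape for that check, under stated assumptions: SigningKey, trusted_keys, and manifest_signature_ok are hypothetical names, and the verify callback stands in for the real public-key signature check, which is not shown and is not an actual HSM or Vault API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    not_before: datetime  # when the key starts being trusted
    not_after: datetime   # end of trust, including the rotation overlap


def trusted_keys(keys: Iterable[SigningKey], now: datetime) -> list[SigningKey]:
    """Return every key whose trust window contains `now`."""
    return [k for k in keys if k.not_before <= now < k.not_after]


def manifest_signature_ok(
    manifest: bytes,
    signature: bytes,
    key_id: str,
    keys: Iterable[SigningKey],
    now: datetime,
    verify: Callable[[str, bytes, bytes], bool],
) -> bool:
    """Accept only if the signing key is currently trusted and `verify` passes.

    `verify(key_id, manifest, signature)` is a hypothetical callback standing
    in for the real public-key check.
    """
    if key_id not in {k.key_id for k in trusted_keys(keys, now)}:
        return False
    return verify(key_id, manifest, signature)


# Example: annual keys with an assumed one-month overlap.
utc = timezone.utc
old_key = SigningKey("2024", datetime(2024, 1, 1, tzinfo=utc), datetime(2025, 1, 31, tzinfo=utc))
new_key = SigningKey("2025", datetime(2025, 1, 1, tzinfo=utc), datetime(2026, 1, 31, tzinfo=utc))
```

Checking the trust window before calling the signature check means a signature from a retired key is rejected even when it is cryptographically valid.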

Weight storage and access control

Approved weights stored in a locked S3 bucket (Object Lock, COMPLIANCE mode) with versioning enabled
IAM policy: only the signing principal can write new weight versions; serving infrastructure is read-only
Weights are downloaded over TLS with certificate pinning to the storage endpoint, so a MITM presenting any other certificate fails the connection
No serving infrastructure keeps weights locally beyond the current serving session: weights are downloaded, verified, loaded, and released when the session ends
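
Certificate pinning, as mentioned above, reduces to comparing a fingerprint of the presented certificate against a known-good set. The following is a minimal sketch, assuming the pin is a SHA-256 fingerprint of the DER-encoded leaf certificate; cert_fingerprint and pin_matches are hypothetical names, and pinning the public key instead of the whole certificate is an equally valid design that is not shown here.

```python
import hashlib
import hmac


def cert_fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_cert).hexdigest()


def pin_matches(der_cert: bytes, pinned_fingerprints: set[str]) -> bool:
    """True if the presented certificate matches any pinned fingerprint.

    Several pins are allowed so the endpoint certificate can be renewed
    without an outage (assumption: pins are distributed out of band).
    """
    presented = cert_fingerprint(der_cert)
    return any(
        hmac.compare_digest(presented, pin.lower()) for pin in pinned_fingerprints
    )
```

On a live connection the DER bytes can be obtained from Python's ssl module with SSLSocket.getpeercert(binary_form=True) after the handshake; the download should be refused when pin_matches returns False.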

What hash verification does and does not cover

Hash verification answers one question precisely: is this the exact file that was signed at training completion? It answers that question with cryptographic certainty. It does not, and cannot, answer a different question: was the training process that produced this file compromised before it ever got signed?

A weight file modified after training (swapped on a storage volume, intercepted in transit, or replaced by a compromised serving node) is caught by hash verification on every load, as long as the verification step itself runs on trusted code and checks against the signed manifest. This is the threat model hash verification exists for, and it covers it completely.

A backdoor introduced during training — through poisoned training data, a compromised training pipeline that produces a model with hidden behavior, or a subtly adversarial fine-tuning step — produces a weight file that is entirely legitimate from the hash verification's perspective. It was signed correctly, by the correct pipeline, because the pipeline itself produced the backdoored weights as its normal output. There is no hash to mismatch.

Detecting training-time backdoors is an open problem, not a solved one

Be direct about this with anyone relying on this module for assurance: there is no reliable, general technique for detecting a backdoor introduced during training by examining the resulting weights. This is an active area of research, and published results are typically specific to particular backdoor types, trigger patterns, or model architectures — they do not generalize to "detect any backdoor in any model."

What this module's controls actually provide, honestly stated:

Hash verification and HSM-backed signing close the post-training tampering vector completely. This is real, mechanical security and should be implemented regardless.
Behavioral evaluation against a fixed test set (a canary suite run on every model update) can catch backdoors that manifest as general capability regression: a model that got measurably worse on its evaluation suite is suspicious regardless of cause. It is not designed to catch, and should not be relied upon to catch, a backdoor that is narrow, targeted, and never touches the evaluation distribution, which describes most backdoors of actual concern.
Output distribution monitoring in production can surface a deployed model behaving differently than its evaluation suite predicted — useful for catching drift, environment mismatches, or a backdoor that happens to trigger on production traffic patterns not represented in evaluation. It is not a backdoor detector; it is a "this model is behaving unexpectedly" detector, and a sophisticated backdoor is specifically designed not to trigger this.
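
The canary-suite comparison described above can be reduced to a per-check regression test against the currently deployed model. The sketch below assumes each check yields a single score where higher is better; CanaryResult, regressions, and the 0.02 tolerance are illustrative choices, not values from this module. An empty result means only that no measured capability regressed. It says nothing about a narrow backdoor outside the evaluation distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryResult:
    name: str
    baseline: float   # score of the currently deployed model
    candidate: float  # score of the model being evaluated


def regressions(results: list[CanaryResult], tolerance: float = 0.02) -> list[str]:
    """Names of canary checks where the candidate dropped by more than `tolerance`."""
    return [r.name for r in results if r.baseline - r.candidate > tolerance]


# Hypothetical scores for three canary checks.
results = [
    CanaryResult("arithmetic", baseline=0.90, candidate=0.80),
    CanaryResult("summaries", baseline=0.90, candidate=0.895),
    CanaryResult("code", baseline=0.70, candidate=0.75),
]
flagged = regressions(results)  # only "arithmetic" exceeds the tolerance
```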

What this means operationally

The defensible position is: control the training pipeline's integrity as rigorously as you control the weight storage's integrity. Hash verification and signing protect the chain from training completion onward — extend the same supply chain discipline (Modules 1–3) to the training pipeline itself: who can modify training code, what data sources feed training, what review gates exist before a training run that will produce a production model. This is where backdoor risk is actually addressed — not at the weight-loading step, where it's already too late to detect most backdoors that matter.

For externally sourced weights (Module 26's multi-provider weight management), this risk is structurally higher because you have no visibility into the training pipeline at all. The mitigation is not technical verification of the weights themselves; it is provenance and reputation of the source, combined with the broadest practical behavioral evaluation before promotion, with the explicit understanding that evaluation establishes a minimum level of confidence, not a guarantee.

Multi-provider weight management

When using weights from external providers (Hugging Face, model registries), apply the same hash and signature verification pipeline described above for post-acquisition integrity
Verify the provider's published hash against the downloaded file before adding to the approved manifest
Do not download weights directly to serving infrastructure — pull to a staging environment, verify, sign, then promote to the approved store
Provider-published hashes should be verified against a secondary source (the provider's GPG-signed release notes, not just the download page)
Run the full behavioral evaluation suite (Module 24) before promoting any externally-sourced weight to production, and treat the result as informative rather than conclusive for training-time integrity
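
The staging-side hash checks in the list above might look like the following sketch. It assumes the two published hashes (one from the download page, one from the signed release notes) have already been extracted as hex strings, and that the GPG signature on the release notes was verified separately; vet_external_weight is a hypothetical name. The returned digest is what would then be signed and written to the approved manifest.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 without loading it fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def vet_external_weight(path: Path, download_page_hash: str, signed_notes_hash: str) -> str:
    """Return the verified digest, or raise ValueError if any check fails.

    Assumption: signed_notes_hash comes from release notes whose signature
    was already verified; that step is outside this sketch.
    """
    page = download_page_hash.strip().lower()
    notes = signed_notes_hash.strip().lower()
    if not hmac.compare_digest(page, notes):
        raise ValueError("provider's published hashes disagree")
    actual = sha256_file(path)
    if not hmac.compare_digest(actual, notes):
        raise ValueError("downloaded file does not match published hash")
    return actual
```

Requiring the two published values to agree before touching the file means a tampered download page is rejected even if the attacker also replaced the file to match it.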

Key takeaways

Hash verification and HSM-backed signing completely close the post-training tampering vector — implement this regardless of anything else in this module
Detecting a backdoor introduced during training by examining the resulting weights is an open research problem with no general solution — do not represent behavioral monitoring as solving this
Behavioral evaluation and output monitoring are real, useful controls for capability regression and production drift — they are not backdoor detectors, and should not be the basis for a claim that backdoored weights will be caught
Backdoor risk is addressed by securing the training pipeline itself, not by inspecting its output — apply the same supply chain rigor to training infrastructure as to weight storage