Model validation testing asks one question: does this model do what it was built to do on data it has never seen, and will it keep doing so in production? It is a different job from testing an AI application. Application testing checks how a chatbot or agent behaves; model validation checks the statistical model underneath it for accuracy, calibration, robustness, fairness, and stability over time. The US Federal Reserve defines it cleanly as "the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses" (Federal Reserve SR 11-7). This is the model-validation counterpart to our AI and LLM application testing guide; here is what a validation pass actually covers.
What is model validation, and how is it different from testing an AI application?
The two test different objects. An AI application is the system a user touches (a chatbot, a retrieval pipeline, an agent), and testing it means checking response quality, prompt-injection resistance, and tool use. A model is the quantitative engine inside, defined by SR 11-7 as "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates" (SR 11-7). Validating that model means asking whether its predictions are accurate, well-calibrated, robust, fair, and stable. NIST frames the underlying idea as validation: "confirmation, through the provision of objective evidence, that the requirements for a specific intended use or application have been fulfilled" (NIST AI RMF, citing ISO 9000:2015). The two share a spine, since the NIST framework says AI systems "should be tested before their deployment and regularly while in operation" (NIST AI RMF), but the methods diverge: prompt-level evaluation harnesses for the application, statistical metrics for the model. Vervali's AI and LLM application testing guide covers the application half; this piece covers the model.
Why does a model need independent validation at all?
Because the people who build a model are the worst placed to find its blind spots. SR 11-7 puts the principle directly: "Validation involves a degree of independence from model development and use. Generally, validation should be done by people who are not responsible for development or use and do not have a stake in whether a model is determined to be valid" (SR 11-7). It names the mechanism "effective challenge," meaning "critical analysis by objective, informed parties who can identify model limitations and assumptions and produce appropriate changes" (SR 11-7). That guidance is written for banking models and carries the force of supervisory expectation only there, so treat it as the clearest available framework rather than a universal legal rule. Outside banking, the guidance is voluntary but points in the same direction: NIST lists "valid and reliable" as the foundational trustworthiness characteristic on which the others rest (NIST AI RMF).
What does model validation actually measure?
| Dimension | What validation checks | Reference |
|---|---|---|
| Generalisation | Does it perform on data it was not trained on? | scikit-learn (hold-out test set, cross-validation) |
| Calibration | Do predicted probabilities match observed frequencies? | scikit-learn (probability calibration) |
| Robustness | Does performance hold under noisy or shifted inputs? | NIST AI RMF (MEASURE) |
| Fairness | Are errors distributed unfairly across groups? | NIST SP 1270 (bias) |
| Stability over time | Does accuracy hold as data and relationships change? | Gama et al. (concept drift) |
The first two are where most validation effort goes, and they are the easiest to get wrong.
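The robustness and fairness rows come down to the same move: recompute a metric on perturbed inputs, or on one group at a time, and compare it with the headline number. The sketch below is one way to do that under stated assumptions. The frozen `model`, the synthetic data, the two groups "A" and "B", the noise level, and the choice of false-negative rate as the fairness metric are all hypothetical, picked only to make the comparison visible.

```python
import random

def model(x):
    # Hypothetical frozen classifier under validation: predicts 1 above 0.5.
    return int(x > 0.5)

def make_rows(n, seed=0):
    # Synthetic data, assumed for illustration. The true cut-off is 0.5 for
    # group "A" and 0.4 for group "B", so the model under-serves group "B".
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        group = "A" if i % 2 == 0 else "B"
        x = rng.random()
        y = int(x > (0.5 if group == "A" else 0.4))
        rows.append((x, y, group))
    return rows

def accuracy(rows):
    return sum(model(x) == y for x, y, _ in rows) / len(rows)

def perturb(rows, sigma, seed=1):
    # Robustness probe: add Gaussian noise to the input, keep the label.
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sigma), y, g) for x, y, g in rows]

def false_negative_rate_by_group(rows):
    # Fairness probe: share of true positives the model misses, per group.
    rates = {}
    for group in sorted({g for _, _, g in rows}):
        positives = [x for x, y, g in rows if g == group and y == 1]
        missed = sum(model(x) == 0 for x in positives)
        rates[group] = missed / len(positives)
    return rates

rows = make_rows(2000)
clean_acc = accuracy(rows)
noisy_acc = accuracy(perturb(rows, sigma=0.2))
fnr = false_negative_rate_by_group(rows)

print(f"accuracy on clean inputs: {clean_acc:.3f}")
print(f"accuracy on noisy inputs: {noisy_acc:.3f}")
print(f"false-negative rate by group: {fnr}")
```

A single overall accuracy would hide both findings: the drop under noise and the gap between groups only appear once the metric is recomputed on the perturbed set and on each slice.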
How do you test that a model generalises?
You hold data back. A model that memorises its training set looks excellent and predicts nothing useful, the failure scikit-learn describes as overfitting: "a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data" (scikit-learn). The defence is to "hold out part of the available data as a test set" and, for a more stable estimate, to use k-fold cross-validation, in which "the training set is split into k smaller sets" and each set takes a turn as the validation fold while the model trains on the rest. Even then, "a test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV" (scikit-learn). Validation that reports a single accuracy number on the data the model trained on is not validation.
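A minimal sketch of that failure and its defence, written with only the Python standard library so the mechanics stay visible (in practice scikit-learn's `train_test_split` and `cross_val_score` cover the same ground). The synthetic data, the `Memoriser` and `Threshold` classes, and the 400/100 split are assumptions made for illustration, not taken from any of the cited sources.

```python
import random

def make_rows(n, seed=0):
    # Synthetic data, assumed for illustration: y = 1 when x > 0.5,
    # with 20% of labels flipped at random.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows

class Memoriser:
    """Repeats the labels it has seen; guesses 0 for anything new."""
    def fit(self, rows):
        self.seen = dict(rows)
        return self
    def predict(self, x):
        return self.seen.get(x, 0)

class Threshold:
    """Predicts 1 above a cut-off chosen to maximise training accuracy."""
    def fit(self, rows):
        cuts = [i / 20 for i in range(21)]
        self.cut = max(
            cuts, key=lambda c: sum(int(x > c) == y for x, y in rows)
        )
        return self
    def predict(self, x):
        return int(x > self.cut)

def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)

def k_fold_scores(model_cls, rows, k=5):
    # Each fold is the validation set once; the other k-1 folds train.
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(model_cls().fit(train), held_out))
    return scores

rows = make_rows(500)
dev, final_test = rows[:400], rows[400:]  # final test set stays untouched

memoriser = Memoriser().fit(dev)
threshold = Threshold().fit(dev)
mem_train_acc = accuracy(memoriser, dev)
mem_test_acc = accuracy(memoriser, final_test)
thr_test_acc = accuracy(threshold, final_test)
cv_scores = k_fold_scores(Threshold, dev, k=5)

print(f"memoriser: train {mem_train_acc:.2f}, held-out {mem_test_acc:.2f}")
print(f"threshold: held-out {thr_test_acc:.2f}")
print("threshold 5-fold scores:", [round(s, 2) for s in cv_scores])
```

The memoriser's training score is the number a careless report would quote. Only the held-out figure says anything about unseen data, and the spread of the fold scores shows how much a single split can flatter or punish a model.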
What is calibration, and why does it matter?
Calibration is whether the model's confidence means what it says. A "well calibrated" classifier is one whose probability outputs "can be directly interpreted as a confidence level," so that "among the samples to which it gave a predict_proba value close to, say, 0.8, approximately 80% actually belong to the positive class" (scikit-learn). This matters the moment a downstream decision uses the probability rather than the label, as most risk, fraud, pricing, and triage systems do. A model can be accurate on the label and badly miscalibrated on the probability, and a validation pass that only reports accuracy will miss it entirely.
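The gap between accuracy and calibration is easy to show with a toy example. The sketch below bins predictions by confidence and compares the mean predicted probability in each bin with the observed positive rate, summarised as an expected calibration error. The function name, the ten equal-width bins, and the two hand-built sets of scores are assumptions for illustration; scikit-learn's `calibration_curve` produces the per-bin comparison on real data.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    # Weighted average gap between stated confidence and observed frequency.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_confidence - observed_rate)
    return ece

def accuracy(probs, labels, cutoff=0.5):
    return sum(int(p >= cutoff) == y for p, y in zip(probs, labels)) / len(labels)

# Twenty hand-built cases: ten where 8 are truly positive, ten where 2 are.
labels = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

# Model A says 0.8 and 0.2, which matches the observed rates.
calibrated = [0.8] * 10 + [0.2] * 10
# Model B ranks the cases identically but claims near-certainty.
overconfident = [0.99] * 10 + [0.01] * 10

acc_calibrated = accuracy(calibrated, labels)
acc_overconfident = accuracy(overconfident, labels)
ece_calibrated = expected_calibration_error(calibrated, labels)
ece_overconfident = expected_calibration_error(overconfident, labels)

print(f"accuracy: A {acc_calibrated:.2f}, B {acc_overconfident:.2f}")
print(f"calibration error: A {ece_calibrated:.3f}, B {ece_overconfident:.3f}")
```

Both models make the same label decisions, so an accuracy-only report cannot tell them apart. The calibration error separates them, which is the difference that matters once a price or a risk limit is computed from the probability.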
How do you keep a model valid after deployment?
By watching for drift. NIST is explicit that "AI systems may require more frequent maintenance and triggers for conducting corrective maintenance due to data, model, or concept drift" (NIST AI RMF) and that "validity and reliability for deployed AI systems are often assessed by ongoing testing or monitoring" (NIST AI RMF). The peer-reviewed taxonomy separates two cases: real concept drift, where "the relation between the input data and the target variable changes over time" (a change in the conditional relationship), and virtual or data drift, where "the distribution of the incoming data changes... without affecting" that relationship (Gama et al., ACM Computing Surveys, 2014). Both degrade a deployed model, and only one is visible in the inputs alone, which is why post-deployment validation watches predictions and outcomes, rather than incoming features alone.
Which standards and frameworks govern model validation?
Several, at different altitudes. NIST's AI Risk Management Framework provides the MEASURE function, including the expectation that "the AI system to be deployed is demonstrated to be valid and reliable" with "limitations of the generalizability... documented" (NIST AI RMF, MEASURE 2.5). For organisational governance, ISO/IEC 42001:2023 "specifies requirements for establishing, implementing, maintaining, and continually improving an Artificial Intelligence Management System (AIMS)" (ISO), and ISO/IEC 5259 addresses "data quality for analytics and machine learning (ML)" (ISO). For fairness, NIST identifies "three categories of bias in AI, systemic, statistical, and human," noting that statistical biases "often arise when algorithms are trained on one type of data and cannot extrapolate beyond those data" (NIST SP 1270). And for regulated financial models, SR 11-7 remains the reference standard for independent validation.
How do you detect model drift in practice?
Drift detection is a monitoring problem, and the two kinds are caught in different places. Data drift, where "the distribution of the incoming data changes... without affecting" the input-output relationship (Gama et al., ACM Computing Surveys, 2014), is visible in the inputs alone: you compare the live feature distribution against the distribution the model trained on and alert when it moves. Concept drift, where "the relation between the input data and the target variable changes over time" (Gama et al.), only surfaces once you have outcomes to compare predictions against, so it needs labelled feedback, not input monitoring alone. A working setup tracks both: a baseline captured at validation time, live input-distribution checks for data drift, and a rolling comparison of predictions against realised outcomes for concept drift, with thresholds that trigger re-validation or retraining. That is the operational form of NIST's point that "validity and reliability for deployed AI systems are often assessed by ongoing testing or monitoring" (NIST AI RMF).
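The setup described above can be sketched in a few lines. This version assumes a single numeric feature, uses a two-sample Kolmogorov-Smirnov statistic for the input check, and uses a plain accuracy comparison for the outcome check. The `check_drift` function, its thresholds (0.1 and 0.05), the frozen `model`, and the two simulated live streams are hypothetical choices for illustration; real thresholds would be set per model and per feature, and a library routine such as SciPy's `ks_2samp` could replace the hand-written statistic.

```python
import random
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    # Largest gap between the two empirical cumulative distributions.
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )

def check_drift(baseline_inputs, baseline_accuracy, live_inputs,
                live_predictions, live_outcomes,
                ks_threshold=0.1, max_accuracy_drop=0.05):
    # Thresholds are illustrative assumptions, not recommended values.
    ks = ks_statistic(baseline_inputs, live_inputs)
    hits = sum(p == y for p, y in zip(live_predictions, live_outcomes))
    live_accuracy = hits / len(live_outcomes)
    return {
        "ks": ks,
        "live_accuracy": live_accuracy,
        "data_drift": ks > ks_threshold,
        "concept_drift": baseline_accuracy - live_accuracy > max_accuracy_drop,
    }

def model(x):
    # Hypothetical frozen model, validated when the true cut-off was 0.5.
    return int(x > 0.5)

rng = random.Random(0)
baseline_inputs = [rng.random() for _ in range(2000)]
baseline_accuracy = 1.0  # by construction of this toy baseline

# Stream 1: inputs shift upward, the input-output relationship is unchanged.
shifted = [rng.random() + 0.3 for _ in range(2000)]
report_data = check_drift(
    baseline_inputs, baseline_accuracy, shifted,
    [model(x) for x in shifted], [int(x > 0.5) for x in shifted],
)

# Stream 2: inputs look the same, but the true cut-off has moved to 0.7.
unchanged = [rng.random() for _ in range(2000)]
report_concept = check_drift(
    baseline_inputs, baseline_accuracy, unchanged,
    [model(x) for x in unchanged], [int(x > 0.7) for x in unchanged],
)

print("stream 1:", report_data)
print("stream 2:", report_concept)
```

The second stream is the case that input monitoring cannot catch: the feature distribution is indistinguishable from the baseline, and only the comparison against realised outcomes raises the flag.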
What should you look for in a model validation service?
Four things separate a validation service from a rubber stamp. First, independence: the validators should not be the people who built or operate the model, which is the "effective challenge" by "objective, informed parties" that SR 11-7 names as the guiding principle (SR 11-7). Second, coverage of all five dimensions rather than headline accuracy alone: generalisation, calibration, robustness, fairness, and drift, each measured rather than asserted. Third, documented limits: NIST expects a deployed system to be "demonstrated to be valid and reliable" with "limitations of the generalizability... documented" (NIST AI RMF, MEASURE 2.5), so a credible report states where the model should not be trusted, rather than only where it performs. Fourth, a continuous arrangement rather than a one-time certificate, because drift makes any single validation a snapshot with an expiry date. A service that reports one accuracy number and signs off is testing the easy part.
The verdict: validate the model, then keep validating it
Model validation is not a single accuracy score signed off before launch. It is a repeatable test of whether the model generalises beyond its training data, whether its probabilities are trustworthy, whether its errors fall fairly, and whether any of that is still true a quarter after deployment. The frameworks agree on the shape: prove validity before release, document the limits, and monitor for drift afterward. Treating validation as a one-time gate is how a model that passed in March quietly stops working by June. Vervali's model validation testing service is built around that continuous view rather than a single pre-launch report.
Sources
- Federal Reserve, SR 11-7 Guidance on Model Risk Management (attachment), 2011. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107a1.pdf
- NIST, AI Risk Management Framework (AI 100-1), 2023. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
- scikit-learn, Cross-validation: evaluating estimator performance. https://scikit-learn.org/stable/modules/cross_validation.html
- scikit-learn, Probability calibration. https://scikit-learn.org/stable/modules/calibration.html
- Gama, Žliobaitė, Bifet, Pechenizkiy, Bouchachia, "A Survey on Concept Drift Adaptation," ACM Computing Surveys, 2014. https://dl.acm.org/doi/10.1145/2523813
- ISO/IEC 42001:2023, Artificial intelligence management system. https://www.iso.org/standard/81230.html
- ISO/IEC 5259-1:2024, Data quality for analytics and machine learning. https://www.iso.org/standard/81088.html
- ISO/IEC 22989:2022, Artificial intelligence concepts and terminology. https://www.iso.org/standard/74296.html
- NIST, Towards a Standard for Identifying and Managing Bias in Artificial Intelligence (SP 1270), 2022. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf