Principles into Practice
Measuring Reliability: how do we decide if an AI system is “suitably” reliable?
Filed under:
Reliability

Determining what a suitably reliable AI system looks like involves defining performance thresholds that balance operational needs with the potential consequences of failure. In defence contexts, suitable reliability is measured against mission-critical requirements, where failure could result in loss of life, strategic disadvantage, or breaches of international law (see card: What legal considerations must I be aware of when assessing and managing ethical risk?). For definitions of Reliable, Robust, and Secure, see card: What does Reliability mean in the context of AI development for UK Defence?
 
How do we define clear and measurable performance targets to demonstrate reliability across diverse operational scenarios? What level of rigour is ethically required when testing reliability, robustness, and security, particularly in systems that could impact lives or national security? 

  1. Referring to the Use Case Library (JSP 936 Part 2) 
  2. Thresholds for reliability 
  3. Reliability testing 

 
 
1. Use Case Library 
Referring to the Use Case Library (JSP 936 Part 2) is a good place to see whether similar projects have already been undertaken and what kind of testing has been found to be appropriate. This can help inform the further analysis that needs to be undertaken (see case studies in suite 4). 
 
 
2. Thresholds for reliability
There are a number of thresholds for reliability that you will want to consider. These include performance metrics (ensuring that you are genuinely testing the right thing within the system's Operational Design Domain (ODD)) and context-specific standards determined by the operational environment and its governance. For example, reliability thresholds in combat scenarios may demand a much higher standard than in training environments. Where AI systems are actually replacing people or established ways of doing things, they must, at a minimum, perform as reliably as the human or traditional systems they replace. 
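As a purely illustrative sketch of what a quantitative threshold check might look like, the Python snippet below compares a conservative (Wilson lower-bound) estimate of observed reliability against a baseline. The 97% human-baseline figure, the trial counts, and the function name wilson_lower_bound are hypothetical assumptions for illustration, not MOD-prescribed values.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success rate.

    Gives a conservative estimate of true reliability from `successes`
    out of `trials` test runs (z = 1.96 corresponds to ~95% confidence).
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom

# Hypothetical baseline: the human/legacy process succeeds 97% of the time,
# so the AI system must demonstrably meet or exceed that within its ODD.
HUMAN_BASELINE = 0.97

observed = wilson_lower_bound(successes=991, trials=1000)
print(f"Lower-bound reliability: {observed:.3f}")
print("Meets threshold" if observed >= HUMAN_BASELINE else "Below threshold")
```

The reason for using a lower confidence bound rather than the raw success rate is that a small test set can overstate reliability; requiring a conservative estimate to clear the threshold guards against declaring a system suitably reliable on flattering but thin evidence.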
 
 
3. Reliability testing 
Reliability testing should include considerations such as the following (a minimal scoring sketch follows this list): 
  • The Generative dimension focuses on the AI's ability to create high-quality and contextually appropriate content. This includes tasks such as text generation, where the AI produces coherent and meaningful summaries or narratives; image generation for realistic visuals or creative art; and code generation for functional and optimised software. Metrics for this benchmark often assess fluency, coherence, diversity, and alignment with user intent, ensuring the outputs meet the expected quality and originality. 
  • The Adaptive dimension evaluates how well an AI system can adjust to new environments, tasks, or inputs without extensive retraining. This includes generalisation (the ability to perform well on tasks it was not explicitly trained for); transfer learning, where knowledge from one domain enhances performance in another; and personalisation, which ensures the AI adjusts its outputs based on user preferences or specific contexts. Adaptability is measured by performance on a range of unseen tasks or domains, highlighting the system's versatility and flexibility. 
  • The Learning dimension focuses on the AI's capacity to improve over time through interaction or additional data. This includes evaluating how efficiently the AI learns, its ability to retain knowledge and avoid forgetting prior skills (lifelong learning), and its responsiveness to user feedback. Metrics for this benchmark measure how quickly the system learns, its robustness in dynamic environments, and the quality of improvements over time, ensuring the AI remains effective in evolving situations. 
  • The Ethical dimension addresses the AI's alignment with ethical principles, ensuring fair, transparent, and responsible operation. This includes mitigating bias to avoid discriminatory outputs; providing clear and interpretable explanations for decisions (explainability); safeguarding user data and ensuring compliance with privacy regulations such as GDPR; and maintaining safety to minimise risks of harm or misuse. Success in this benchmark reflects the AI's adherence to ethical guidelines and its ability to address potential ethical concerns effectively. 
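By way of illustration only, the sketch below shows one way a test harness might record scores against the four dimensions above. The dimension names mirror the list, but the specific metrics, scores, and pass thresholds are invented placeholders rather than prescribed MOD benchmarks.

```python
from dataclasses import dataclass

@dataclass
class DimensionResult:
    """Scores for one reliability dimension, normalised to [0, 1]."""
    dimension: str              # e.g. "generative", "adaptive"
    metrics: dict[str, float]   # metric name -> score
    threshold: float            # minimum acceptable average score

    def passed(self) -> bool:
        avg = sum(self.metrics.values()) / len(self.metrics)
        return avg >= self.threshold

# Hypothetical results for the four dimensions described above.
results = [
    DimensionResult("generative", {"coherence": 0.92, "intent_alignment": 0.88}, 0.85),
    DimensionResult("adaptive", {"unseen_task_score": 0.81, "transfer_gain": 0.77}, 0.75),
    DimensionResult("learning", {"retention": 0.94, "feedback_response": 0.86}, 0.80),
    DimensionResult("ethical", {"bias_audit": 0.97, "explainability": 0.90}, 0.90),
]

for r in results:
    status = "PASS" if r.passed() else "FAIL"
    print(f"{r.dimension:<10} {status}")
```

Recording results per dimension, rather than as a single aggregate score, makes it harder for strong performance in one dimension (say, generative quality) to mask a failure in another (say, the ethical dimension).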

Each of these considerations needs to be tested against failure points, edge cases, and degraded conditions (see case study suites). 
 
Across Reliability, Robustness, and Security alike, it is also essential to consider what happens if the system does fail: how does it respond, and what is the worst-case scenario?
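To make the fail-safe point concrete, here is a minimal, hypothetical sketch of a degraded-input check. The classify function, its inputs, and the ABSTAIN fallback are invented for illustration and stand in for whatever fail-safe behaviour the real system's requirements specify.

```python
SAFE_FALLBACK = "ABSTAIN"  # hypothetical safe default: defer to a human operator

def classify(signal: list[float]) -> str:
    """Toy classifier that abstains rather than guessing on bad input."""
    # Degraded-condition check: refuse to act on empty or corrupted input.
    if not signal or any(x != x for x in signal):  # x != x detects NaN
        return SAFE_FALLBACK
    return "THREAT" if sum(signal) / len(signal) > 0.5 else "NO_THREAT"

# Edge cases and degraded conditions: the worst case should be a safe
# abstention, never a confident wrong answer.
assert classify([]) == SAFE_FALLBACK
assert classify([float("nan"), 0.2]) == SAFE_FALLBACK
assert classify([0.9, 0.8, 0.7]) == "THREAT"
print("Fail-safe behaviour checks passed")
```

The design choice illustrated here is that the worst-case failure mode is engineered to be an explicit, detectable abstention rather than a silently wrong output, which is what makes the worst-case scenario analysable at all.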

Disclaimer

This tool has been created in collaboration with Dstl as part of an AI Research project. The intent is for this tool to help generate discussion between project teams that are involved in the development of AI tools and techniques within MOD. It is hoped that this will result in an increased awareness of the MOD’s AI ethical principles (as set out in the Ambitious, Safe and Responsible policy paper) and ensure that these are considered and discussed at the earliest stages of a project’s lifecycle and throughout. This tool has not been designed to be used outside of this context. 
The use of this information does not negate the need for an ethical risk assessment, or the other processes set out in JSP 936 Part 1 (Dependable AI), the MOD's policy on responsible AI use and development. This training tool has been published to encourage more discussion and awareness of AI ethics across MOD science and technology and development teams within academia and industry, and demonstrates our commitment to the practical implementation of our AI ethics principles.