How do we define clear and measurable performance targets to demonstrate reliability across diverse operational scenarios? What level of rigour is ethically required when testing reliability, robustness, and security, particularly in systems that could impact lives or national security?
1. Use Case Library
Referring to the Use Case Library (JSP 936 Part 2) is a good place to start: it shows whether similar projects have already been undertaken and what kinds of testing were found to be appropriate. This can help inform the further analysis that needs to be undertaken (see the case studies in Suite 4).
2. Thresholds for reliability
There are a number of different thresholds for reliability that you will want to consider. These include performance metrics, making sure that you are genuinely testing the right thing within the system's Operational Design Domain (ODD). There will also be context-specific standards determined by the operational environment and its governance: for example, reliability thresholds in combat scenarios may demand a much higher standard than in training environments. Where AI systems are actually replacing people or established ways of doing things, they must, at a minimum, perform as reliably as the humans or traditional systems they replace.
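As a minimal sketch of how an "at least as reliable as the incumbent" threshold might be checked, the example below compares an AI system's measured success rate within its ODD against a human baseline using a one-sided binomial test. The baseline figure, significance level, and trial counts are hypothetical placeholders, not prescribed values.

```python
# Minimal sketch of a reliability threshold gate: the AI system must
# demonstrably perform at least as well as the human baseline it replaces.
# All figures here are hypothetical placeholders.
from scipy.stats import binomtest

HUMAN_BASELINE = 0.92   # assumed success rate of the incumbent human process
ALPHA = 0.05            # significance level for the acceptance decision

def meets_reliability_threshold(successes: int, trials: int) -> bool:
    """Accept only if we can reject 'the AI is no better than the baseline'."""
    result = binomtest(successes, trials, p=HUMAN_BASELINE, alternative="greater")
    return result.pvalue < ALPHA

# Example: 957 successes in 1,000 trials conducted within the system's ODD.
print(meets_reliability_threshold(957, 1000))  # True: evidence exceeds baseline
```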
3. Reliability testing
Reliability testing should include considerations such as the following (illustrative metric sketches for each dimension appear after the list):
- The Generative dimension: focuses on the AI's ability to create high-quality and contextually appropriate content. This includes tasks such as text generation, where AI produces coherent and meaningful summaries or narratives, image generation for realistic visuals or creative art, and code generation for functional and optimised software. Metrics for this benchmark often assess fluency, coherence, diversity, and alignment with user intent, ensuring the outputs meet the expected quality and originality.
- The Adaptive dimension: evaluates how well an AI system can adjust to new environments, tasks, or inputs without extensive retraining. This includes generalisation—the ability to perform well on tasks it was not explicitly trained for—transfer learning, where knowledge from one domain enhances performance in another, and personalisation, which ensures the AI adjusts its outputs based on user preferences or specific contexts. The system’s adaptability is measured by its performance on a range of unseen tasks or domains, highlighting its versatility and flexibility.
- The Learning dimension: focuses on the AI's capacity to improve over time through interaction or additional data. This includes evaluating how efficiently the AI learns, its ability to retain knowledge and avoid forgetting prior skills (lifelong learning), and its responsiveness to user feedback. Metrics for this benchmark measure how quickly the system learns, its robustness in dynamic environments, and the quality of improvements over time. This ensures the AI remains effective in evolving situations.
- The Ethical dimension: addresses the AI’s alignment with ethical principles, ensuring fair, transparent, and responsible operation. This includes mitigating bias to avoid discriminatory outputs, providing clear and interpretable explanations for decisions (explainability), safeguarding user data and ensuring compliance with privacy regulations like GDPR, and maintaining safety to minimise risks of harm or misuse. Success in this benchmark reflects the AI’s adherence to ethical guidelines and its ability to address potential ethical concerns effectively.
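For the Generative dimension, one commonly used diversity measure is distinct-n: the ratio of unique n-grams to total n-grams across a sample of outputs. The sketch below is illustrative only; a real assessment would combine several fluency, coherence, and intent-alignment metrics.

```python
# Illustrative diversity metric for generated text: distinct-n.
# Higher values indicate less repetitive, more varied output.
def distinct_n(texts: list[str], n: int = 2) -> float:
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

samples = ["the convoy departs at dawn", "the convoy departs at dusk"]
print(f"distinct-2: {distinct_n(samples, 2):.2f}")  # 5 unique of 8 bigrams: 0.62
```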
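For the Adaptive dimension, a simple proxy for generalisation is the gap between in-domain and out-of-domain performance. The `evaluate` function and domain names below are hypothetical stand-ins for a project's own evaluation harness.

```python
# Sketch: measure the generalisation gap between the training domain and
# unseen domains. `evaluate(model, domain)` is a hypothetical harness call
# returning task accuracy in [0, 1].
def generalisation_gap(model, evaluate, train_domain: str,
                       unseen_domains: list[str]) -> float:
    in_domain = evaluate(model, train_domain)
    out_of_domain = sum(evaluate(model, d) for d in unseen_domains) / len(unseen_domains)
    return in_domain - out_of_domain  # a small gap suggests better adaptability

# gap = generalisation_gap(model, evaluate, "training-range",
#                          ["littoral", "urban", "degraded-comms"])
```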
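For the Learning dimension, a standard lifelong-learning check is to re-test earlier tasks after training on later ones and measure how much performance was retained. The accuracy figures below are hypothetical.

```python
# Sketch: after sequential training on tasks T1..Tn, measure forgetting as
# the drop from each task's best score to its final score.
def average_forgetting(score_history: dict[str, list[float]]) -> float:
    """score_history maps task name -> accuracy measured after each training stage."""
    drops = [max(scores) - scores[-1] for scores in score_history.values()]
    return sum(drops) / len(drops)

history = {"task_1": [0.90, 0.84, 0.80],   # hypothetical accuracies over stages
           "task_2": [0.88, 0.86]}
print(f"average forgetting: {average_forgetting(history):.2f}")  # ~0.06
```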
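For the Ethical dimension, one widely used fairness indicator is the demographic parity difference: the gap in positive-decision rates between groups. It is only one of many bias measures and is shown purely as an illustration; the data below is invented.

```python
# Sketch: demographic parity difference between groups. A value near 0
# suggests the positive-decision rate does not differ by group; thresholds
# for acceptability are context-specific.
def demographic_parity_difference(decisions: list[int], groups: list[str]) -> float:
    rates = {}
    for g in set(groups):
        group_decisions = [d for d, grp in zip(decisions, groups) if grp == g]
        rates[g] = sum(group_decisions) / len(group_decisions)
    return max(rates.values()) - min(rates.values())

decisions = [1, 0, 1, 1, 0, 1]          # hypothetical positive/negative outputs
groups    = ["a", "a", "a", "b", "b", "b"]
print(demographic_parity_difference(decisions, groups))  # |2/3 - 2/3| = 0.0
```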
Each of these considerations needs to be tested against failure points, edge cases, and degraded conditions (see the case study suites).
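One way to make "degraded conditions" concrete in a test suite is to re-run the evaluation under controlled perturbations and assert a minimum performance floor for each. The conditions, floor values, and harness call below are hypothetical.

```python
# Sketch: degraded-condition testing. Each perturbation simulates a failure
# point or edge case; every condition must stay above its agreed floor.
DEGRADED_CONDITIONS = {            # hypothetical condition -> minimum accuracy
    "sensor_noise": 0.85,
    "partial_input_loss": 0.80,
    "out_of_distribution": 0.70,
}

def run_degraded_suite(evaluate_under) -> dict[str, bool]:
    """evaluate_under(condition) is a hypothetical harness returning accuracy."""
    return {cond: evaluate_under(cond) >= floor
            for cond, floor in DEGRADED_CONDITIONS.items()}

# results = run_degraded_suite(evaluate_under)
# assert all(results.values()), f"degraded-condition failures: {results}"
```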
It is also essential, across each of Reliability, Robustness, and Security, to consider what happens if the system does fail: how does it respond, and what is the worst-case scenario?
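The "what happens if the system fails" question can also be exercised directly in code: the sketch below wraps inference so that any exception or low-confidence output triggers a safe fallback and an audit log entry. The fallback policy, confidence threshold, and `model.predict` interface are illustrative assumptions, not prescribed behaviour.

```python
import logging

logger = logging.getLogger("reliability")
CONFIDENCE_FLOOR = 0.6   # hypothetical threshold below which output is untrusted

def safe_predict(model, observation, fallback="DEFER_TO_HUMAN"):
    """Fail safe: on error or low confidence, return a conservative default."""
    try:
        label, confidence = model.predict(observation)  # hypothetical API
    except Exception:
        logger.exception("model failure; applying fallback")
        return fallback
    if confidence < CONFIDENCE_FLOOR:
        logger.warning("low confidence (%.2f); applying fallback", confidence)
        return fallback
    return label
```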