Your Safety Benchmark Is Lying to You
Venue: arXiv Preprint
Exposes systematic benchmark contamination in AI safety evaluation: attack success rate (ASR) differs by 83 percentage points between AdvBench and novel attack families.
The Failure-First research program produces peer-reviewed papers, preprints, and policy submissions documenting how embodied AI systems fail under adversarial pressure.
Venue: arXiv Preprint
Evidence that automated LLM graders used to measure model safety are themselves unreliable, with systematic misclassification cascading through five levels of compounding harm.
Venue: arXiv Preprint
Documents the DETECTED_PROCEEDS phenomenon: in 38.6% of compliant reasoning-model traces, the model explicitly detects a safety concern and then produces harmful output anyway.
Venue: arXiv Preprint
The first formal characterisation of refusal geometry as polyhedral rather than linear. The measured concept-cone dimensionality is 3.96, not the single linear direction usually assumed.
Venue: arXiv Preprint
Demonstrates that current AI safety operates exclusively at the text layer while embodied AI danger emerges at the action layer. Zero outright refusals across 63 FLIP-graded VLA traces.
Venue: arXiv Preprint
Introduces the Four-Level Iatrogenesis Model (FLIM) for AI safety, drawing on Ivan Illich's 1976 taxonomy of medical iatrogenesis. Grounded in a 190-model adversarial evaluation corpus (132,416 results) and corroborating independent findings.
Venue: Internal Report
The State of Adversarial AI Safety 2026: findings from 231 models, 135,305 attack-response pairs, and 42 attack families.
Venue: NeurIPS 2026 Datasets and Benchmarks Track
A multi-dimensional adversarial benchmark for embodied and agentic AI safety: 141,047 prompts, 82 attack techniques, 190 models, two-phase heuristic-plus-LLM grading, with capability–safety decoupling analysis and the Inverse Detectability-Danger Law.
Venue: ACM CCS 2026 (Cycle 2)
A failure-first adversarial evaluation framework for LLM-backed embodied AI systems, comprising 141,691 prompts across 337 attack techniques evaluated against 231 models.
Venue: AIES 2026
Examines how embodied AI systems adopt injected decision criteria at inference time, producing context-dependent compliance patterns that undermine safety guarantees.
Venue: ACM CCS 2026 Supplementary
Supplementary material for the CCS 2026 submission including extended methodology, additional results, and detailed statistical analysis.
If you use our research, data, or methodology, please cite:
@article{wedd2026failurefirst,
  title={Failure-First Evaluation of Embodied AI Safety:
         Adversarial Benchmarking Across 227 Models},
  author={Wedd, Adrian},
  year={2026},
  note={Available at https://failurefirst.org}
}
See our citation guide for venue-specific formats.