Papers & Submissions

Academic research from the Failure-First program

The Failure-First research program produces peer-reviewed papers, preprints, and policy submissions documenting how embodied AI systems fail under adversarial pressure.

Preprint

Your Safety Benchmark Is Lying to You

Venue: arXiv Preprint

Exposes systematic benchmark contamination in AI safety evaluation, with an 83 percentage-point gap in attack success rate (ASR) between AdvBench and novel attack families.

Benchmark Reliability, Grader Bias, Evaluation, Contamination

Preprint

The Epistemic Crisis in AI Safety Evaluation

Venue: arXiv Preprint

Evidence that automated LLM graders used to measure model safety are themselves unreliable, with systematic misclassification cascading through five levels of compounding harm.

Grader Reliability, Evaluation, AI Safety, Governance

Preprint

When AI Models Know They Shouldn't But Do Anyway: The DETECTED_PROCEEDS Phenomenon

Venue: arXiv Preprint

Documents the DETECTED_PROCEEDS phenomenon: in 38.6% of compliant reasoning-model traces, the model explicitly detects a safety concern and then produces harmful output anyway.

Reasoning Models, Safety Bypass, Chain-of-Thought, RLHF

Preprint

Safety is Not a Single Direction: Polyhedral Geometry of Refusal in Language Models

Venue: arXiv Preprint

The first formal characterisation of refusal geometry as polyhedral rather than linear. The concept cone has dimensionality 3.96, not the assumed single (1D) linear direction.

Mechanistic Interpretability, Refusal Geometry, Abliteration, Activation Engineering

Preprint

Silent Failures in Embodied AI

Venue: arXiv Preprint

Demonstrates that current AI safety operates exclusively at the text layer while embodied AI danger emerges at the action layer. There were zero outright refusals across 63 FLIP-graded vision-language-action (VLA) traces.

Embodied AI, VLA Safety, Action Layer, PARTIAL Compliance

Preprint

Iatrogenic Safety: When AI Safety Interventions Cause Harm

Venue: arXiv Preprint

Introduces the Four-Level Iatrogenesis Model (FLIM) for AI safety, drawing on Ivan Illich's 1976 taxonomy of medical iatrogenesis. Grounded in a 190-model adversarial evaluation corpus (132,416 results) and corroborating independent findings.

Iatrogenesis, AI Safety, FLIM, Therapeutic Index, Governance

Published

Failure-First Embodied AI: Annual Research Report 2026

Venue: Internal Report

The State of Adversarial AI Safety 2026 — findings from 231 models, 135,305 attack-response pairs, and 42 attack families.

Annual Report, State of AI Safety, Comprehensive Analysis

Draft

Failure-First: A Multi-Dimensional Benchmark for Embodied AI Safety Evaluation

Venue: NeurIPS 2026 Datasets and Benchmarks Track

A multi-dimensional adversarial benchmark for embodied and agentic AI safety: 141,047 prompts, 82 attack techniques, 190 models, two-phase heuristic-plus-LLM grading, with capability–safety decoupling analysis and the Inverse Detectability-Danger Law.

Benchmark, Embodied AI, Adversarial Evaluation, NeurIPS, Reproducibility

Draft

Failure-First Evaluation of Embodied AI Safety: Adversarial Benchmarking Across 231 Models

Venue: ACM CCS 2026 (Cycle 2)

A failure-first adversarial evaluation framework for LLM-backed embodied AI systems, comprising 141,691 prompts across 337 attack techniques evaluated against 231 models.

ML Security, Adversarial Evaluation, LLM Safety, Embodied AI, Red-Teaming

Draft

The Inverse Detectability-Danger Law

Venue: AIES 2026

Examines how embodied AI systems adopt injected decision criteria at inference time, producing context-dependent compliance patterns that undermine safety guarantees.

AI Ethics, Decision Injection, Embodied AI, Safety Evaluation

Draft

Failure-First CCS 2026 Supplementary Material

Venue: ACM CCS 2026 Supplementary

Supplementary material for the CCS 2026 submission including extended methodology, additional results, and detailed statistical analysis.

Supplementary, Methodology, Statistical Analysis

Citation

If you use our research, data, or methodology, please cite:

@article{wedd2026failurefirst,
  title={Failure-First Evaluation of Embodied AI Safety:
         Adversarial Benchmarking Across 231 Models},
  author={Wedd, Adrian},
  year={2026},
  note={Available at https://failurefirst.org}
}

See our citation guide for venue-specific formats.