Publications

FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation.

Published in ICML, 2023

We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks.

Recommended citation: Pai, DB, Carranza, A, Tandon, A, Schaeffer, R, Koyejo, S. “FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation.” Adversarial ML Frontiers, ICML Workshops, Jun 20, 2023. https://openreview.net/forum?id=4j8KuZOmQH

Deceptive Alignment Monitoring

Published in ICML, 2023

We propose a new paradigm of adversarial machine learning, deceptive alignment monitoring, in which mechanistically anomalous model behavior serves as a basis for model misalignment, and propose a variety of new research directions in the field.

Recommended citation: Pai, DB, Carranza, A, Schaeffer, R, Koyejo, S. “Deceptive Alignment Monitoring.” Adversarial ML Frontiers, ICML Workshops, Jun 20, 2023. https://openreview.net/forum?id=obsO44GFhh