Posts by Collection

portfolio

publications

Deceptive Alignment Monitoring

Published in ICML, 2023

We propose a new paradigm of adversarial machine learning, deceptive alignment monitoring, in which mechanistically anomalous model behavior serves as a basis for detecting model misalignment, and outline a variety of new research directions in the field.

Recommended citation: Pai, DB, Carranza, A, Schaeffer, R, Koyejo, S. “Deceptive Alignment Monitoring.” Adversarial ML Frontiers, ICML Workshops, Jun 20, 2023. https://openreview.net/forum?id=obsO44GFhh

FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

Published in ICML, 2023

We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks.

Recommended citation: Pai, DB, Carranza, A, Tandon, A, Schaeffer, R, Koyejo, S. “FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation.” Adversarial ML Frontiers, ICML Workshops, Jun 20, 2023. https://openreview.net/forum?id=4j8KuZOmQH

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.