Title
Group Robust Best-of-K Decoding of Language Models for Pluralistic Alignment
KIAS Author
Yoon, Sangwoong
Journal
Neural Information Processing Systems 2024 Pluralistic Alignment Workshop, 2024
Abstract
The desirable behaviour of a chat agent can be described with multiple criteria, such as harmlessness, helpfulness, and conciseness, each of which is represented by a reward model. Individual users, or groups of users, may attach different significance to each criterion, but in many practical scenarios it is difficult to know how much a given user or group would weigh one criterion against another. Instead of assuming knowledge of the weights among the criteria, we propose a robust alignment approach that maximises the worst-case criterion across the group of reward models. To test this approach, we use best-of-K rejection sampling to demonstrate the properties of an algorithm that employs our robust objective. Finally, we propose several avenues for future exploration that may lead to more practical algorithms than group robust best-of-K rejection sampling.
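
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_robust_best_of_k`, the toy candidates, and the hard-coded scores are all hypothetical, and a real system would sample the K candidates from a language model and score them with learned reward models. The sketch assumes only what the abstract states, namely that among K candidates the one with the highest worst-case reward across the group of reward models is kept.

```python
from typing import Callable, Sequence

# A reward model is represented here as any callable mapping a response to a score.
RewardModel = Callable[[str], float]


def group_robust_best_of_k(
    candidates: Sequence[str],
    reward_models: Sequence[RewardModel],
) -> str:
    """Return the candidate whose worst-case reward over all models is highest."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if not reward_models:
        raise ValueError("at least one reward model is required")

    def worst_case(response: str) -> float:
        # Score of the least-satisfied criterion for this response.
        return min(model(response) for model in reward_models)

    return max(candidates, key=worst_case)


# Hypothetical scores for K = 3 candidates under two criteria:
# (helpfulness, conciseness). These numbers are made up for illustration.
toy_scores = {
    "long detailed answer": (0.9, 0.1),
    "balanced answer": (0.5, 0.6),
    "one-word answer": (0.2, 0.95),
}
candidates = list(toy_scores)
reward_models = [
    lambda response: toy_scores[response][0],  # stand-in helpfulness model
    lambda response: toy_scores[response][1],  # stand-in conciseness model
]

robust_choice = group_robust_best_of_k(candidates, reward_models)

# For contrast: selecting by the unweighted mean of the criteria.
mean_choice = max(candidates, key=lambda r: sum(toy_scores[r]) / len(toy_scores[r]))
```

In this toy setting the worst-case rule selects "balanced answer" (worst-case score 0.5), whereas averaging the two criteria selects "one-word answer" (mean 0.575), which does poorly on helpfulness. The contrast shows why no weights over the criteria need to be assumed: the robust rule protects whichever criterion is least satisfied.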