Language Model Alignment and Preference Optimization

How can we generate language model outputs that align with human preferences?

This theme studies how to generate language model outputs that better align with human preferences. I focus on designing and evaluating generation strategies, both for how well they align with those preferences and for how robust they are to reward hacking.

Key questions

  • How can generation strategies better reflect human preferences?
  • What makes a generation strategy robust to reward hacking?

Related Publications