Language Model Alignment and Preference Optimization

How can we generate language model outputs that align with human preferences?

This theme studies generation strategies whose outputs better align with human preferences. I focus on how such strategies can be designed and evaluated, both for alignment with human preferences and for robustness against reward hacking.

Key questions

How can generation strategies better reflect human preferences?
What makes a generation strategy robust to reward hacking?

Related Publications

NAACL 2025

Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment

Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe

Theme:LLM Alignment

arXiv

Transactions on Machine Learning Research

Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe, Mitsuki Sakamoto, Eiji Uchibe

Theme:LLM Alignment

arXiv

EMNLP 2024

Filtered Direct Preference Optimization

Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

Theme:LLM Alignment

arXiv

ICML 2024 Workshop on Models of Human Feedback for AI Alignment

Filtered Direct Preference Optimization

Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

Theme:LLM Alignment

arXiv

ICML 2024 Workshop on Models of Human Feedback for AI Alignment

Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe

Theme:LLM Alignment

arXiv

ICML 2024

Model-Based Minimum Bayes Risk Decoding

Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe

Theme:LLM Alignment

arXiv

Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative

Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya

Theme:LLM Alignment

arXiv
