Authors : Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano
Paper Link : https://arxiv.org/abs/2109.10862
Blog : https://openai.com/blog/summarizing-books/

 

Summarizing Books with Human Feedback

Scaling human oversight of AI systems for tasks that are difficult to evaluate.


  • Uses RL to learn a summarization method (= policy) that reflects preferences learned from human feedback (= reward modeling)
    -> more plausible summaries than supervised learning alone
  • Rather than summarizing the entire book at once, splits the book into pieces, summarizes each piece, and then recursively summarizes those summaries, all with one shared RL policy (a minimal sketch follows this list)
    -> scalability
  • Fine-tunes GPT-3 with human-in-the-loop RL
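A minimal sketch of the recursive summarization loop above, assuming a hypothetical summarize() call that wraps the fine-tuned policy (e.g., an API call to a GPT-3-style model); the fixed-size chunking is illustrative, whereas the paper aligns chunks with natural passage boundaries:

```python
CHUNK_CHARS = 2000  # assumed chunk size, for illustration only


def summarize(text: str) -> str:
    """Placeholder for the learned policy; swap in a real model call."""
    raise NotImplementedError


def split_into_chunks(text: str, size: int = CHUNK_CHARS) -> list[str]:
    """Naive fixed-size split; the paper uses natural section boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summarize_recursively(text: str) -> str:
    """Summarize each chunk, join the chunk summaries, and recurse
    until the remaining text fits in a single chunk."""
    if len(text) <= CHUNK_CHARS:
        return summarize(text)
    chunk_summaries = [summarize(chunk) for chunk in split_into_chunks(text)]
    return summarize_recursively("\n".join(chunk_summaries))
```

Using one shared policy at every level of the recursion tree is what lets the approach scale to books of arbitrary length.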

https://www.youtube.com/watch?v=lL-nq8zhi18 

  • To tackle the reward-engineering and reward-exploitation problems, learns a reward that reflects human preferences from sequential queries (a minimal sketch follows this list)
  • Human-in-the-loop RL
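
A minimal sketch of that preference-based reward learning step, assuming pairwise queries in which a human labels which of two trajectory segments is preferred and the reward model is fit with the Bradley-Terry cross-entropy objective used in this line of work (e.g., PEBBLE); the network, shapes, and random stand-in labels are illustrative, not the papers' exact setup:

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps each observation to a scalar; segment reward is the sum."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, seg_len, obs_dim) -> (batch,) summed reward
        return self.net(segment).sum(dim=1).squeeze(-1)


def preference_loss(model, seg_a, seg_b, prefs):
    """Bradley-Terry model: P(A preferred) = sigmoid(R(A) - R(B));
    prefs[i] = 1.0 if segment A won the query, else 0.0."""
    logits = model(seg_a) - model(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)


# Toy usage with random tensors standing in for real human queries.
obs_dim, seg_len, batch = 8, 10, 32
model = RewardModel(obs_dim)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
seg_a = torch.randn(batch, seg_len, obs_dim)
seg_b = torch.randn(batch, seg_len, obs_dim)
prefs = torch.randint(0, 2, (batch,)).float()
loss = preference_loss(model, seg_a, seg_b, prefs)
opt.zero_grad(); loss.backward(); opt.step()
```

The learned reward then stands in for a hand-engineered environment reward when training the policy, which is what removes the reward-engineering burden and mitigates exploitation of a mis-specified reward.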

 

  • Related papers:
  1. PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training (ICML 2021)
    - Paper link: https://arxiv.org/abs/2106.05091
    - Site: https://sites.google.com/view/icml21pebble
    - Code: https://github.com/pokaxpoka/B_Pref
  2. B-Pref: Benchmarking Preference-Based Reinforcement Learning (NeurIPS 2021, Datasets and Benchmarks Track)
    - Openreview link: https://openreview.net/forum?id=ps95-mkHF_
    - Code: https://github.com/pokaxpoka/B_Pref
