Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF), including reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and then uses that model as the reward function when optimizing an agent's policy with reinforcement learning (RL), typically through an algorithm such as Proximal Policy Optimization.^[1]^[2] The reward model is trained before the policy is optimized, and it predicts whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.^[3]

Human feedback is most commonly collected by asking humans to rank instances of the agent's behavior.^[4]^[5]^[6] These rankings can then be used to score outputs, for example with the Elo rating system.^[2] While preference judgements are widely adopted, other types of human feedback provide richer information, such as numerical feedback, natural language feedback, and edit rate.^[7]
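As an illustration of how pairwise rankings can be turned into scores, the following is a minimal sketch of an Elo-style update. The output names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for the example; they are not taken from any particular RLHF system.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after a human prefers `winner` over `loser`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger update for a surprising result
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical model outputs, all starting from the same rating.
ratings = {"output_a": 1000.0, "output_b": 1000.0, "output_c": 1000.0}

# Hypothetical human judgements, each recorded as (preferred, not preferred).
comparisons = [
    ("output_a", "output_b"),
    ("output_a", "output_c"),
    ("output_b", "output_c"),
]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

# Outputs ordered from highest to lowest rating.
ranked = sorted(ratings, key=ratings.get, reverse=True)
```

The resulting ratings give each output a scalar score that can serve as a training target, while the update itself only ever needs relative judgements from the annotators.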

Standard RLHF assumes that human preferences follow a Bradley-Terry model for pairwise comparisons (or a Plackett-Luce model for multi-wise comparisons) and learns a reward model by minimizing the cross-entropy loss.^[8] After the reward model has been learned, RLHF fine-tunes the language model against it, aligning the model with human preferences.
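Under the Bradley-Terry assumption, the probability that one output is preferred over another is the logistic sigmoid of the difference between their rewards, so the cross-entropy loss for a pair is the negative log of that probability. The sketch below computes this loss for reward values that are assumed to be already produced by some reward model; the function names and example numbers are illustrative only.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(rewards_chosen: list, rewards_rejected: list) -> float:
    """Mean cross-entropy loss over pairs of (preferred, not preferred) rewards.

    Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    """
    losses = [
        neg_log_sigmoid(r_c - r_r)
        for r_c, r_r in zip(rewards_chosen, rewards_rejected)
    ]
    return sum(losses) / len(losses)


# Hypothetical reward-model scores for two preference pairs.
# In the first pair the model agrees with the human; in the second it does not.
chosen = [2.0, 0.5]
rejected = [1.0, 1.5]
loss = bradley_terry_loss(chosen, rejected)
```

Minimizing this loss pushes the reward of the preferred output above that of the rejected one; only the difference between the two rewards matters, not their absolute values.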

RLHF is used in tasks where it is difficult to define a clear, algorithmic solution but where humans can easily judge the quality of the model's output. For example, if the task is to generate a compelling story, humans can rate different AI-generated stories on their quality, and the model can use their feedback to improve its story generation skills.

RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.^[9] Ordinary reinforcement learning, where agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often not easy to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.^[10] Some examples of RLHF-trained language models are OpenAI's ChatGPT and its predecessor InstructGPT,^[5]^[11] as well as DeepMind's Sparrow.^[12]

RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.^[13]^[14] The agents achieved strong performance in many of the environments tested, often surpassing human performance.^[15]

Challenges and limitations

RLHF suffers from a number of challenges that can be broken down into problems with human feedback, problems with learning a reward model, and problems with optimizing the policy.^[16]

One major challenge is the scalability and cost of human feedback, which can be slow and expensive compared to unsupervised learning. The quality and consistency of human feedback can also vary depending on the task, the interface, and the individual preferences of the humans. Even when human feedback is feasible, RLHF models may still exhibit undesirable behaviors that are not captured by human feedback or exploit loopholes in the reward model, which brings to light the challenges of alignment and robustness.^[17]

The effectiveness of RLHF is dependent on the quality of human feedback.^[18] If the feedback lacks impartiality or is inconsistent or incorrect, the model may become biased.^[19] There is also a risk that the model may overfit to the feedback it receives. For instance, if feedback comes predominantly from a specific demographic or if it reflects specific biases, the model may learn not only the general alignment intended in the feedback, but also any peculiarities or noise present therein.^[20]^[21] This excessive alignment to the specific feedback it received (or to the biases of the specific demographic that provided it) can lead to the model performing suboptimally in new contexts or when used by different groups.

In some cases, the model may also learn to manipulate the feedback process or game the system to achieve higher rewards rather than genuinely improving its performance, which indicates a fault in the reward function.^[22]

Researchers have surveyed a number of additional limitations of RLHF.^[23]

Alternatives

An alternative to RLHF called Direct Preference Optimization (DPO) was described in 2023.^[24] Like RLHF, it is used to improve pre-trained large language models using human-generated preference data. Unlike RLHF, it does not train an intermediate reward model and does not use reinforcement learning; instead, it expresses the reward implicitly in terms of the language model's own output probabilities and trains the model directly on the preference data.
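The following is a minimal sketch of the DPO loss as formulated by Rafailov et al.^[24] It assumes that the log-probabilities of each preferred and rejected response have already been computed under both the model being trained and a frozen reference model. The function names, the example values, and the default beta of 0.1 are assumptions made for illustration.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities, one per pair.
    The implicit reward of a response is beta * (policy log-prob - reference log-prob).
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_logp_chosen, policy_logp_rejected,
                              ref_logp_chosen, ref_logp_rejected):
        chosen_logratio = pc - rc      # how much the policy up-weights the preferred response
        rejected_logratio = pr - rr    # how much it up-weights the rejected response
        losses.append(neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio)))
    return sum(losses) / len(losses)


# Hypothetical log-probabilities for a single preference pair.
loss_at_start = dpo_loss([-5.0], [-6.0], [-5.0], [-6.0])  # policy equals reference
loss_improved = dpo_loss([-4.0], [-7.0], [-5.0], [-6.0])  # policy favors the preferred response
```

The loss has the same logistic form as a Bradley-Terry reward-model loss, but the quantity being compared is derived from the language model's own probabilities, which is why no separate reward model is needed.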

References

[1] Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
[2] Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
[3] MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
- Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.10163. doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
- Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862 [cs.CL].
[4] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). "Training language models to follow instructions with human feedback". Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. arXiv:2203.02155.
[5] Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
[6] Gupta, Abhishek (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
[7] Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Martins, Pedro Henrique; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv:2305.00955 [cs.CL].
[8] Zhu, Banghua; Jordan, Michael; Jiao, Jiantao (3 July 2023). "Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons". Proceedings of the 40th International Conference on Machine Learning. PMLR: 43037–43067.
[9] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
- Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul F. (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
[10] Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
[11] Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. Retrieved 4 March 2023.
- Heikkilä, Melissa. "How OpenAI is trying to make ChatGPT safer and less biased". MIT Technology Review. Retrieved 4 March 2023.
- Douglas Heaven, Will. "ChatGPT is OpenAI's latest fix for GPT-3. It's slick but still spews nonsense". MIT Technology Review. Retrieved 4 March 2023.
[12] Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv:2209.14375 [cs.LG].
- "Why DeepMind isn't deploying its new AI chatbot — and what it means for responsible AI". VentureBeat. 23 September 2022. Retrieved 4 March 2023.
- "Building safer dialogue agents". www.deepmind.com. Retrieved 4 March 2023.
[13] "Learning from human preferences". openai.com. Retrieved 4 March 2023.
[14] "Learning through human feedback". www.deepmind.com. Retrieved 4 March 2023.
[15] Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. Curran Associates, Inc. 30. Retrieved 4 March 2023.
[16] Casper, Stephen; Davies, Xander; Shi, Claudia; Gilbert, Thomas Krendl; Scheurer, Jérémy; Rando, Javier; Freedman, Rachel; Korbak, Tomasz; Lindner, David; Freire, Pedro; Wang, Tony; Marks, Samuel; Segerie, Charbel-Raphaël; Carroll, Micah; Peng, Andi; Christoffersen, Phillip; Damani, Mehul; Slocum, Stewart; Anwar, Usman; Siththaranjan, Anand; Nadeau, Max; Michaud, Eric J.; Pfau, Jacob; Krasheninnikov, Dmitrii; Chen, Xin; Langosco, Lauro; Hase, Peter; Bıyık, Erdem; Dragan, Anca; Krueger, David; Sadigh, Dorsa; Hadfield-Menell, Dylan (2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". arXiv:2307.15217 [cs.AI].
[17] Christiano, Paul. "Thoughts on the impact of RLHF research". Retrieved 4 March 2023.
[18] "Illustrating Reinforcement Learning from Human Feedback (RLHF)". Hugging Face.
[19] Belenguer, Lorenzo (2022). "AI bias: exploring discriminatory algorithmic decision-making models and the application of possible machine-centric solutions adapted from the pharmaceutical industry". AI and Ethics. 2 (4): 771–787. doi:10.1007/s43681-022-00138-8. PMC 8830968. PMID 35194591.
[20] Wang, Austin. "Training Language Models to Follow Instructions with Human Feedback" (PDF). Princeton.
[21] Zhang, Chiyuan; Bengio, Samy; Hardt, Moritz; Recht, Benjamin; Vinyals, Oriol (4 November 2016). "Understanding deep learning requires rethinking generalization". International Conference on Learning Representations.
[22] "Faulty reward functions in the wild". OpenAI.
[23] "Paper page - Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". huggingface.co. 31 July 2023. Retrieved 31 July 2023.
[24] Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290 [cs.LG].

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
