AI Glossary

RLHF

Reinforcement Learning from Human Feedback

RLHF is a fine-tuning method in which people rate a model's responses and the model learns to prefer the higher-rated ones. This makes the model more helpful and better aligned with users' expectations.

Uses human ratings of responses to fine-tune the model.
A stage that follows initial training on large text corpora.
Improves usefulness and safety, but does not eliminate hallucinations.

RLHF is a fine-tuning stage that follows the initial training of a large language model. People rate or compare the model's responses, and a reward model is trained on those ratings to predict which responses people prefer. The language model is then fine-tuned so that it more often generates responses people would rate highly.
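The two steps described above (building a reward model from human comparisons, then shifting the model toward higher-reward responses) can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any production system is implemented: responses are represented by made-up two-number feature vectors, the reward model is a simple linear function trained with a pairwise logistic loss, and the "language model" is just a softmax distribution over three fixed candidate responses. Real RLHF pipelines use neural networks and typically add safeguards (such as a penalty for drifting too far from the original model) that are omitted here. All names and numbers are hypothetical.

```python
import math

def reward(w, x):
    """Linear reward: a stand-in for a neural reward model."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights so that human-preferred responses score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            margin = reward(w, chosen) - reward(w, rejected)
            p_chosen = 1.0 / (1.0 + math.exp(-margin))  # P(chosen beats rejected)
            # Gradient ascent on log(p_chosen), the pairwise logistic objective.
            for i in range(dim):
                w[i] += lr * (1.0 - p_chosen) * (chosen[i] - rejected[i])
    return w

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def finetune_policy(logits, rewards, lr=0.5, steps=100):
    """Raise the probability of high-reward responses (exact policy gradient)."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # d(expected reward)/d(logit_k) = p_k * (r_k - expected reward)
        for k in range(len(logits)):
            logits[k] += lr * probs[k] * (rewards[k] - expected)
    return logits

# Hypothetical features: [helpfulness signal, rudeness signal].
# Each pair is (response the rater preferred, response the rater rejected).
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.6]),
    ([0.9, 0.2], [0.4, 0.2]),
]
w = train_reward_model(comparisons, dim=2)

# Three candidate responses the toy "model" can produce, equally likely at first.
candidates = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
rewards = [reward(w, c) for c in candidates]
final_probs = softmax(finetune_policy([0.0, 0.0, 0.0], rewards))

print("reward weights:", w)
print("response probabilities after fine-tuning:", final_probs)
```

In this sketch the learned weights end up rewarding the helpfulness signal and penalizing the rudeness signal, and the fine-tuned distribution favors the first candidate. Note that nothing in either step teaches the toy model new facts; it only reweights responses it could already produce.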

The goal is to align the model's behavior with users' expectations: clearer, safer, and more helpful answers. RLHF does not, however, eliminate hallucinations, and it is not a way to add new factual knowledge. It relies on human judgments (human-in-the-loop) collected beforehand and adjusts behavior on top of patterns the model has already learned.

Related terms