Explaining Reinforcement Learning from Human Feedback with Star Trek

Microsoft announced today that it will include results from a Large Language Model based on GPT-3 in Bing results. They will also release a new version of the Edge browser that will include a ChatGPT-like bot.

GPT-3 has been around for more than two years. What has caused this sudden leap forward in the capabilities of Large Language Models 🤔?

The answer is – *Reinforcement Learning From Human Feedback* or RLHF.

By combining the capabilities of a large language model with those of another model trained on end users' preferences, we end up with the uncannily accurate results that ChatGPT seems to produce.

Ok – but how does RLHF work? Let me try and explain with a (ridiculous) analogy.

In the Star Trek series, the Replicator is a device that can produce pretty much anything on demand.

When Captain Picard says, “Tea, Earl Grey, Hot!” it produces the perfect cup of tea. But how might you train a Replicator? With RLHF, of course!

Let’s see how:

1. Feed the Replicator with all the beverage recipes in the known universe.

2. Train it to predict the recipe for a given prompt. That is, when a user says “Tea, Earl Grey, Hot!”, it should be able to predict what goes into the beverage.

3. Train *another* model – let’s call it the “Tea Master 2000” – on Captain Picard’s preferences.

4. When the Replicator generates a beverage, the Tea Master responds with a score. +10 for a perfect cup of tea, -10 for mediocre swill.

5. We now use Reinforcement Learning (RL) to optimize the Replicator so that it earns a perfect score of ten.

6. After much optimization, the Replicator can generate the perfect cup of tea – tuned to Captain Picard’s preferences.
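
If you like to see things in code, here is a minimal sketch of steps 4 to 6 under some heavy assumptions. The Replicator is reduced to a softmax policy over a tiny, made-up menu (`RECIPES`), the Tea Master is a hard-coded lookup table (`tea_master_score`) rather than a trained model, and the optimizer is plain REINFORCE with a running-average baseline. Production RLHF systems typically use more elaborate algorithms, so treat this as an illustration of the feedback loop and nothing more. All names here are hypothetical.

```python
import math
import random

# Hypothetical menu: the only "beverages" this toy Replicator can produce.
RECIPES = ["earl grey, hot", "earl grey, lukewarm", "chamomile, hot", "raktajino"]


def tea_master_score(recipe):
    """Stand-in for the Tea Master 2000: +10 for the perfect cup, -10 for swill.

    Assumption: a fixed lookup table replaces the trained preference model.
    """
    scores = {
        "earl grey, hot": 10.0,
        "earl grey, lukewarm": -2.0,
        "chamomile, hot": -5.0,
        "raktajino": -10.0,
    }
    return scores[recipe]


def softmax(logits):
    """Turn raw preferences (logits) into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_replicator(steps=2000, lr=0.05, seed=0):
    """Optimize the Replicator's policy against the Tea Master with REINFORCE."""
    rng = random.Random(seed)
    logits = [0.0] * len(RECIPES)  # start with no preference between recipes
    baseline = 0.0                 # running average of scores, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # The Replicator produces a beverage by sampling from its policy.
        choice = rng.choices(range(len(RECIPES)), weights=probs)[0]
        # The Tea Master scores it.
        reward = tea_master_score(RECIPES[choice])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Policy-gradient step: d log pi(choice) / d logit_i = 1[i == choice] - probs[i]
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    for recipe, prob in zip(RECIPES, train_replicator()):
        print(f"{recipe:>20}: {prob:.3f}")
```

Running it prints the probability the trained policy assigns to each recipe. Beverages that score well get sampled more often over time, and beverages that score badly get pushed down.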

If you substitute the Replicator with an LLM like GPT-3, and substitute the Tea Master with another ML model called the *Preference* model, then you have seen RLHF in action!
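
Where does a Preference model come from? One common approach, which I am assuming here rather than describing any particular system, is to show people two outputs, ask which one they prefer, and fit a model to those comparisons. The sketch below does this for beverages with a Bradley-Terry style model that learns a single score per drink. The comparison data (`COMPARISONS`) is invented for illustration, and a real preference model would be a neural network scoring text, not a lookup table of scores.

```python
import math

# Hypothetical taste-test results: (preferred, rejected) pairs, as if
# Captain Picard had tried two cups and pointed at the better one.
COMPARISONS = [
    ("earl grey, hot", "earl grey, lukewarm"),
    ("earl grey, hot", "chamomile, hot"),
    ("earl grey, lukewarm", "raktajino"),
    ("chamomile, hot", "raktajino"),
    ("earl grey, hot", "raktajino"),
    ("earl grey, lukewarm", "chamomile, hot"),
]


def train_preference_model(comparisons, epochs=500, lr=0.1):
    """Fit one scalar score per beverage from pairwise preferences."""
    scores = {name: 0.0 for pair in comparisons for name in pair}
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Modelled probability that the winner is preferred over the loser.
            diff = scores[winner] - scores[loser]
            p_win = 1.0 / (1.0 + math.exp(-diff))
            # Gradient ascent on log(p_win): raise the winner, lower the loser.
            step = lr * (1.0 - p_win)
            scores[winner] += step
            scores[loser] -= step
    return scores


if __name__ == "__main__":
    learned = train_preference_model(COMPARISONS)
    for name, score in sorted(learned.items(), key=lambda kv: -kv[1]):
        print(f"{name:>20}: {score:+.2f}")
```

The learned scores play the role of the Tea Master's ratings: they rank the beverages in the order the comparisons imply, and an RL loop can then use them as its reward signal.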

The real thing is a lot more complicated, but I will take any opportunity to generate Star Trek TNG-themed content 🖖.
