With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by 🤗 Hugging Face, so pre-trained language models can be loaded directly via transformers. At this point, most decoder and encoder-decoder architectures are supported.

Highlights:

  • PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
  • AutoModelForCausalLMWithValueHead & AutoModelForSeq2SeqLMWithValueHead: transformer models with an additional scalar output for each token, which can be used as a value function in reinforcement learning.
  • Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
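The per-token scalar output mentioned above can be sketched in plain PyTorch. This is a conceptual illustration, not trl's actual implementation: the class name `ValueHead` and the single linear layer are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Illustrative sketch: map each token's hidden state to a scalar value."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # One linear layer producing a single scalar per token (assumed design)
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.summary(hidden_states).squeeze(-1)

head = ValueHead(hidden_size=8)
hidden = torch.randn(2, 5, 8)   # stand-in for transformer hidden states
values = head(hidden)
print(values.shape)  # torch.Size([2, 5])
```

In the trl models, a head like this sits on top of the base transformer, and the resulting per-token values serve as the value-function estimates PPO needs alongside the (query, response, reward) triplets.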
