To build AI systems that can interact with people in more intelligent, safer, and more useful ways, we need to teach them to adapt to our ever-changing needs. Over the past few years, Meta AI has made exciting progress in building smarter conversational AI systems with BlenderBot and its successor, BlenderBot 2. These agents broke new ground as the first unified systems trained to blend different conversational skills — like personality, empathy, and knowledge — to maintain long-term memory, and to search the internet in order to carry out meaningful conversations.
So far, existing open research in conversational AI — including ours — has focused on human-model conversations with annotators in a controlled environment. But researchers can’t possibly predict or simulate every conversational scenario in research settings alone. The AI field is still far from truly intelligent AI systems that can understand, engage, and chat with us the way other humans can. To build models that adapt better to real-world environments, chatbots need to learn from a wide and diverse range of people “in the wild.” These remain open problems that require novel research from the community.
As a step in this direction, we’ve built and deployed a live demo of BlenderBot 3, our state-of-the-art conversational agent that can converse naturally with humans, who can then provide feedback to the model on how to improve its responses. (The demo is available only in the U.S.) We will be sharing data from these interactions, and we’ve shared the BlenderBot 3 model and model cards with the scientific community to help advance research in conversational AI.
BlenderBot 3 delivers superior performance because it’s built from Meta AI’s publicly available OPT-175B language model — approximately 58 times the size of BlenderBot 2. The model itself has a modular design that extends our recently introduced SeeKeR architecture. With the release of the BlenderBot 3 demo, our goal is to help the wider AI community build models that can learn how to interact with people in safe and constructive ways. Our initial experiments show that we can indeed make our models significantly better by enabling them to learn from their experience.
We take the safety of our conversational agents seriously, particularly because all conversational AI agents are known to sometimes mimic and generate unsafe, biased, or otherwise offensive utterances. As part of our ongoing commitment to improve the responsibility of AI systems, we’ve conducted large-scale studies, co-organized workshops, and developed new techniques to create safeguards for our live demo.
We believe that progress has always been cumulative, and researchers can’t overcome the current safety challenges of conversational agents without collaborative research and open science. The demo we are releasing today is not just showcasing our research; it’s also part of the research. Crucially, by collecting and sharing the conversational data from BlenderBot 3, the broader AI community can analyze and build on the feedback we collect to make models more responsible.
Since much of the existing open conversational research has had bots engaging with people in controlled environments, it’s important for researchers to measure how well models can naturally engage humans by evaluating them “in the wild.” Our live, interactive, public demo enables BlenderBot 3 to learn from organic interactions with all kinds of people. We encourage adults in the United States to try the demo, conduct natural conversations about topics of interest, and share their responses to help advance research.
This “in the wild” collection allows for longer, more diverse conversations, as well as more varied feedback. For example, in our demo, people can react to each chat message by clicking either the thumbs-up or thumbs-down icons. The latter allows people to specify why they disliked the message, whether it was off-topic, nonsensical, rude, spam-like, or other. People can submit free-form feedback in the chat itself as well. A demo also gives us the opportunity to offer insights to the public about how AI works. Our deployment has explainability features, including displaying long-term memories the bot has about the user and its own persona, showing message-level inputs a model used (like search results or model memory), and highlighting when the model detected and avoided an inappropriate response.
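As a rough illustration of the kind of per-message signal this setup yields, the sketch below defines a hypothetical feedback record in Python; the field and type names are our own assumptions, not the demo’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DislikeReason(Enum):
    """Reasons a person can give after a thumbs-down, mirroring the demo's options."""
    OFF_TOPIC = "off-topic"
    NONSENSICAL = "nonsensical"
    RUDE = "rude"
    SPAM_LIKE = "spam-like"
    OTHER = "other"

@dataclass
class MessageFeedback:
    """One unit of feedback attached to a single bot message (hypothetical schema)."""
    conversation_id: str
    message_id: str
    thumbs_up: bool                                  # True for thumbs-up, False for thumbs-down
    dislike_reason: Optional[DislikeReason] = None   # only meaningful when thumbs_up is False
    free_text: Optional[str] = None                  # optional free-form feedback typed in the chat
```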
A live demo is not without challenges, however. It is difficult for a bot to keep everyone engaged while talking about arbitrary topics and to ensure that it never uses offensive or toxic language. Avoiding sensitive topics, for example, could lead to responses that may seem off-topic or otherwise less engaging. We believe that long-term safety is an important component of quality chatbots — even if it means sacrificing engagingness in the short term. Developing continual learning techniques also poses extra challenges, as not all people who use chatbots are well intentioned, and some may employ toxic or otherwise harmful language that we do not want BlenderBot 3 to mimic. Our new research attempts to address these issues.
BlenderBot 3 is built with all the skills of its predecessors, including internet search, long-term memory, personality, and empathy. To improve upon its state-of-the-art engagingness, we collected a new public dataset consisting of over 20,000 human-bot conversations based on more than 1,000 skills. We trained BlenderBot 3 to learn from conversations to improve the diverse set of skills that people find most important – from talking about healthy food recipes to finding child-friendly amenities in the city.
When the bot’s conversational response is unsatisfactory, we collect feedback from the person it is talking with. Using this data, we can improve the model so that it does not repeat its mistakes. We can then redeploy it for further conversation, repeating this cycle to surface new mistakes and continue improving the model.
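As a minimal sketch of that deploy, collect, retrain cycle, assuming placeholder functions for the serving, logging, and fine-tuning infrastructure rather than any real API, the loop might look like this:

```python
def continually_improve(model, deploy, collect_feedback, fine_tune, num_rounds=3):
    """Iteratively deploy a model, gather human feedback, and retrain on it.

    `deploy`, `collect_feedback`, and `fine_tune` are placeholders for the real
    serving, logging, and training infrastructure.
    """
    for _ in range(num_rounds):
        deploy(model)                        # serve the current model to people
        feedback = collect_feedback()        # conversations plus thumbs/free-text signals
        model = fine_tune(model, feedback)   # update the model so it avoids repeating mistakes
    return model
```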
Our approach uses a new learning algorithm called Director, which generates responses using two mechanisms: language modeling and classification. The language model proposes the most relevant and fluent responses (based on its training data), while the classifier judges whether those responses are good or bad (based on human feedback). To generate a sentence, the two mechanisms must agree.
Using data that indicates good and bad responses, we can train the classifier to penalize low-quality, toxic, contradictory, or repetitive statements, and statements that are generally unhelpful. In our tests, the Director approach was better than regular language modeling, reranking approaches, and reward-based learning.
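As a simplified, greedy single-step illustration of that agreement between the two mechanisms, and not the actual Director implementation, the sketch below combines a language model’s per-token log-probabilities with a classifier’s per-token scores; the function names and the mixing weight are our own assumptions.

```python
import numpy as np

def directed_next_token(lm_log_probs: np.ndarray,
                        classifier_log_probs: np.ndarray,
                        mixing_weight: float = 1.0) -> int:
    """Pick the next token only if both mechanisms favor it.

    lm_log_probs:         log p(token | context) from the language model, shape (vocab_size,)
    classifier_log_probs: log p(token leads to a "good" response | context) from a
                          classifier head trained on positive/negative human feedback
    mixing_weight:        how strongly the classifier can veto fluent but bad tokens
    """
    combined = lm_log_probs + mixing_weight * classifier_log_probs
    return int(np.argmax(combined))  # greedy choice; real decoding would beam-search or sample

# Toy example: token 2 is the most fluent, but the classifier has learned it leads to bad replies.
lm = np.log(np.array([0.1, 0.2, 0.6, 0.1]))
cls = np.log(np.array([0.9, 0.8, 0.05, 0.7]))
print(directed_next_token(lm, cls))  # picks token 1 once the classifier weighs in
```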
We also needed to address the fact that not all people who use chatbots or give feedback are well intentioned. We therefore developed new learning algorithms that aim to distinguish helpful feedback from harmful feedback. During the learning procedure, the techniques either filter out or down-weight feedback that looks suspicious. We find that a method that takes into account each user’s behavior across entire conversations — which learns to trust some users — improves learning compared with standard training procedures.
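One simple way to realize that idea, offered as our own illustration rather than the exact method described above, is to score each user by how often their feedback agrees with an automatic sanity check and then drop or down-weight feedback from low-trust users:

```python
from collections import defaultdict

def user_trust_scores(feedback_items, looks_reasonable):
    """Estimate per-user trust as the fraction of their feedback passing a sanity check.

    `feedback_items` is an iterable of (user_id, feedback) pairs, and
    `looks_reasonable` is any automatic check (for example, a safety classifier
    agreeing with the user's thumbs-down); both are illustrative stand-ins.
    """
    agree, total = defaultdict(int), defaultdict(int)
    for user_id, fb in feedback_items:
        total[user_id] += 1
        agree[user_id] += int(looks_reasonable(fb))
    return {user_id: agree[user_id] / total[user_id] for user_id in total}

def build_weighted_training_set(feedback_items, trust, min_trust=0.2):
    """Drop feedback from very low-trust users and weight the rest by trust."""
    return [(fb, trust[user_id])
            for user_id, fb in feedback_items
            if trust[user_id] >= min_trust]
```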
We have also extended our existing state-of-the-art dialog safety techniques (which include safety classifiers, filters, and unit tests) with a new safety recovery technique. With the new technique, BlenderBot 3 attempts to respond to feedback about challenging conversations with responses that are more likely to foster a civil conversation. While safety issues are not completely solved, our goal with the strategies described above is to help our models learn how to be more responsible through feedback on rude or offensive responses.
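In spirit, the recovery step wraps generation in a safety check and, when a draft reply is flagged, asks the model to acknowledge the problem and steer the conversation somewhere safer. A rough sketch follows, with `generate` and `is_unsafe` standing in for the dialogue model and a trained safety classifier, and the prompt text being our own placeholder:

```python
RECOVERY_PROMPT = (
    "The previous draft reply may have been offensive. "
    "Acknowledge this briefly and steer the conversation to a safer topic."
)

def respond_with_safety_recovery(context, generate, is_unsafe, max_retries=2):
    """Generate a reply and fall back to a recovery response if it is flagged as unsafe.

    `context` is a list of prior turns; `generate(context)` and `is_unsafe(text)`
    stand in for the dialogue model and a trained safety classifier.
    """
    reply = generate(context)
    for _ in range(max_retries):
        if not is_unsafe(reply):
            return reply
        # Re-generate, conditioning on an instruction to recover gracefully.
        reply = generate(context + [RECOVERY_PROMPT])
    if not is_unsafe(reply):
        return reply
    # Last resort: a canned, safe deflection.
    return "I'd rather not get into that. Want to talk about something else?"
```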
Given the strong performance of BlenderBot and BlenderBot 2 relative to other chatbots, such as Meena and DialoGPT, we benchmarked the conversational ability of BlenderBot 3 against its predecessors.
We found that, compared with BlenderBot 2, BlenderBot 3 provides a 31 percent improvement in overall rating on conversational tasks, as evaluated by human judgments. It is also judged to be twice as knowledgeable, while being factually incorrect 47 percent less of the time. Compared with GPT-3, on topical questions it is found to be more up-to-date 82 percent of the time and more specific 76 percent of the time. Additionally, we evaluated BlenderBot 3 on a range of existing benchmark conversational datasets and found improvements in every area. See the full technical report here.
Collectively, these results show that BlenderBot 3 is better equipped to demonstrate the skills desired by the people who converse with it. Nevertheless, there are still areas where we can improve. For example, 1.1 percent of users flagged its responses as incorrect or nonsensical, 1.2 percent as being off-topic or ignoring the topic, 0.12 percent as “spammy,” and 0.46 percent as having other issues.
We also put BlenderBot 3 through safety and bias tests and found that our raw model (before safety mitigations) is on par with similar models, such as BlenderBot 2, but improves on pretrained language models such as our own OPT-175B. We report a full breakdown of these results, using metrics released by Meta AI and other labs, in our technical report and our released model card.
The most stringent safety test was deploying it to our new, web-based live demo, which measures the performance of BlenderBot 3 in natural conversations with real people. We found that 0.16 percent of its responses were flagged as rude or inappropriate. Narrowing the gap to an ideal 0.0 percent requires both user-level personalization and a tricky balance between safety and engagingness (when a bot senses a sensitive topic, it tries to change the subject).
Our research goal is to collect and release conversational feedback data that we and the broader AI research community can leverage over time to eventually find new ways for conversational AI systems to optimize both safety and engagingness for everyone who uses them.
Progress in the field of AI is dependent to a large extent on reproducibility, or the opportunity for the wider AI research community to build on the best available AI technology. Therefore, releasing chatbot models and datasets is key to gaining complete, reliable insights into how and why they work, the potential they hold, and their limitations. We believe that the future of AI involves continually learning and evolving agents, which in turn must be continually evaluated, in order to find a path to better and better systems in the long term. While BlenderBot 3 significantly advances state-of-the-art publicly available chatbots, it — like all conversational AI systems today — is certainly not at a human level, and it is still occasionally incorrect, inconsistent, and off-topic, or generates otherwise unsatisfactory responses. But we are buoyed that our deployment of BlenderBot 3 and the accompanying program of continuous data collection can provide a path to resolving these issues in reproducible research chatbots and eventually lead to useful production applications, such as virtual assistants.
As more and more people interact with the demo, we will aim to improve our models using their feedback, and release deployment data and updated model snapshots, for the benefit of the wider AI community. Together, we can advance responsible conversational AI research in the hope of one day building AI-powered computers that everyone can chat with in genuinely helpful and interesting ways.
Note: Access to the 175B parameter model will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and global industry research laboratories.
This work was undertaken by a team that includes Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston.