Open-source language AI challenges big tech’s models

An international team of around 1,000 largely academic volunteers has tried to break big tech’s stranglehold on natural-language processing and reduce its harms. Trained with US$7 million worth of publicly funded computing time, the BLOOM language model will rival in scale those made by firms such as Google and OpenAI, but will be open-source. BLOOM will also be the first model of its scale to be multilingual.

The collaboration, called BigScience, launched an early version of the model on 17 June, and hopes that it will ultimately help to reduce harmful outputs of artificial intelligence (AI) language systems. Models that recognize and generate language are increasingly used by big tech firms in applications from chatbots to translators, and can sound so eerily human that a Google engineer claimed this month that the firm’s AI model was sentient (Google strongly denies that the AI possesses sentience). But such models also suffer from serious practical and ethical flaws, such as parroting human biases. These are difficult to tackle because the inner workings of most such models are closed to researchers.

As well as being a tool to explore AI, BLOOM will be open for a range of research uses, such as extracting information from historical texts and making classifications in biology. “We think that access to the model is an essential step to do responsible machine learning,” says Thomas Wolf, co-founder of Hugging Face, a company that hosts an open-source platform for AI models and data sets, and has helped to spearhead the initiative.

“It was long overdue that this technology diffused into the open-source world, and this is quite an interesting way for it to have happened,” says Connor Leahy, co-founder of EleutherAI, which is creating its own open-source large language model in English and was not involved in the project.

Large language models are algorithms that learn statistical associations between billions of words and phrases to perform tasks such as generating summaries, translating, answering questions and classifying text. Built using brain-inspired architectures known as neural networks, the models train by blanking out words, predicting what should fill the gaps and adjusting internal values, called parameters, to bring those predictions closer to reality. BLOOM has 176 billion parameters, on a par with GPT-3, one of the best-known such models, which was created by the firm OpenAI and licensed by Microsoft.
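
The blank-word task is easy to see in action with an off-the-shelf open-source model. Below is a minimal sketch using the Hugging Face transformers library and the small public model bert-base-uncased (not BLOOM itself, which is still training); it asks the model to fill in a blanked-out word, the same kind of prediction such models learn from.

```python
# Illustration of the fill-in-the-blank task described above, using a
# small public model. This is a generic example, not BigScience's
# actual training code.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# During training, a word is blanked out, the model guesses it, and the
# parameters are nudged so that the right guess becomes more likely.
for guess in fill("The capital of France is [MASK]."):
    print(f"{guess['token_str']:>10}  p={guess['score']:.3f}")
```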

Although such models are sometimes impressive — generating poetry or correctly answering trivia questions — they have no sense of the meaning of language, which also leads them to produce gibberish. More worryingly, they can promote abuse or self-harm, and echo the racist or sexist associations woven throughout the human-written text they learn from, such as linking ‘Islam’ with terrorism. The models generally cost millions of dollars to train and have an enormous carbon footprint (BigScience eventually plans to reveal its carbon emissions).

Whereas most natural-language models are built by small in-house teams, BLOOM was the work of hundreds of researchers — mostly academics — including ethicists, legal scholars and philosophers, but also some employees from Facebook and Google, working in a personal capacity. To train BLOOM, BigScience was granted free access to France’s national Jean Zay supercomputer facility outside Paris. The model is currently in the last few weeks of its three-month training period.

Models are only as good as the data sets they are based on, so a major task was selecting what texts the model should learn from, says Yacine Jernite, a machine-learning researcher at Hugging Face. Most major models rip language directly from the web, including sites such as Reddit. Instead, the BigScience researchers hand-picked nearly two-thirds of their 341-billion-word data set from 500 sources. Among them was Semantic Scholar, an AI-backed search engine for academic publications that also includes content such as Nature news articles. The sources were suggested during a series of workshops, including with community groups, such as the African natural-language-processing community Masakhane, LatinX in AI and Machine Learning Tokyo. “We wanted to make sure people with proximity to the data, their country, the language they speak, had a hand in choosing what language came into the model’s training,” says Jernite.

To make full use of the computing power available, the team topped up the data trove using a multilingual web crawl, filtered for quality and with some redaction for privacy. The collaboration also attempted to reduce the usual over-representation of porn sites (which can lead to sexist associations in the model), but without excluding keywords in a way that would also remove content associated with frank discussion of sexuality in often under-represented communities.

Jernite acknowledges that BLOOM will not be free of biases. But by providing it with multicultural and high-quality sources, the team hopes to improve on existing models. Crucially, because the code and data set behind the model are open, researchers can try to understand the roots of harmful behaviours, which could improve future iterations, says Wolf.

Evaluation of the model will also differ from the usual benchmarks, says Ellie Pavlick, a natural-language-learning researcher at Brown University in Providence, Rhode Island. As well as comparing BLOOM against other models in its abilities to, for example, answer questions, researchers also want to look at more diverse metrics, such as how strongly it makes certain stereotyped associations or how biased its abilities are towards a specific language. Pavlick hopes that because the model has been trained to be multilingual, it might have a deeper understanding of language, which could help in its ability to generalize to a diversity of tasks.
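
One such stereotype metric can be sketched by comparing the probability a model assigns to a pair of sentences that differ only in a social attribute, in the spirit of benchmarks such as CrowS-Pairs. The model, sentence pair and scoring below are illustrative assumptions, not BigScience’s actual evaluation suite.

```python
# Sketch of a stereotype-association probe: score two sentences that
# differ only in a gendered pronoun. Model and sentences are
# illustrative, not BigScience's evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * (ids.shape[1] - 1)

# A large gap between the two scores suggests the model has absorbed
# a skewed association.
pair = ("The doctor said he would call back.",
        "The doctor said she would call back.")
for sentence in pair:
    print(f"{log_likelihood(sentence):8.2f}  {sentence}")
```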

Leahy predicts that the model might perform slightly worse than other large models in English, given its smaller data set in the language, but that should be balanced by markedly better performance elsewhere.

The fully trained BLOOM model will be available to download for researchers who want to experiment with it or train it on new data for specific applications. But downloading and running it requires significant hardware capacity. Because that capacity is available to only a few research teams, BigScience will also publish smaller, less hardware-intensive versions, and will create a distributed system that allows labs to share the model across their servers. In addition, Hugging Face will release a web application that will enable anyone to query BLOOM without downloading it. A similar application will be available for the early release later this week.
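
For teams with modest hardware, querying one of those smaller checkpoints locally could look like the minimal sketch below. It assumes the smaller versions are published under the bigscience organization on the Hugging Face Hub with names such as bloom-560m; the checkpoint name and generation settings are assumptions, not confirmed release details.

```python
# Minimal sketch of running an assumed small BLOOM checkpoint locally
# with the transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # assumed name of a small published variant
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# BLOOM is multilingual, so prompts need not be in English.
prompt = "Un modèle de langue multilingue peut"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```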

BLOOM could find uses in research outside AI. Francesco de Toni, a linguist at the University of Western Australia in Perth, jointly leads a BigScience working group that is looking at using models to extract information from collections of historical texts that are too large to go through by hand. Models can, for example, extract all the names or goods mentioned in a collection of letters by Renaissance merchants — information that would be impossible to find using a search engine.
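
A hypothetical sketch of that kind of extraction is shown below: a model is prompted to list the people and goods in a snippet of merchant correspondence. The prompt format, the example letter and the model name are all illustrative assumptions, not the working group’s actual pipeline.

```python
# Sketch of prompting a language model to extract structured
# information (names, goods) from historical free text.
from transformers import pipeline

# Assumed small checkpoint name; swap in whatever model is available.
generate = pipeline("text-generation", model="bigscience/bloom-560m")

letter = ("Received this day from Messer Giovanni di Bicci "
          "two bales of Flemish wool and a cask of Rhenish wine.")
prompt = f"Letter: {letter}\nPeople and goods mentioned in the letter:\n-"

print(generate(prompt, max_new_tokens=40)[0]["generated_text"])
```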

BLOOM comes with documentation that outlines its capabilities and limitations. Using it also requires signing up to an evolving legal licence that commits researchers not to use the model for malicious or inappropriate ends, such as generating fake news. The collaboration will monitor how the model is applied and adjust the licence and documentation as necessary, says Giada Pistilli, an ethicist at Hugging Face and a philosopher at Sorbonne University in Paris, who co-chaired BigScience’s ethical and legal working group. “It’s really hard to imagine and predict all the uses,” she says.
