Modern AI Text Generation: An Exploration of GPT-3, Wu Dao 2.0 & other NLP Advances

Today's AI models like ChatGPT and Stable Diffusion are capable of composing essays, music, artwork and imagery that are highly realistic. The casual observer will likely not suspect that these compositions are created by a machine or, if they do suspect that they are simulated, are likely to be impressed by the quality of the output. The technology in these areas is growing at a rapid pace. Within this last year alone, there has been a paradigm shift in model development as research groups are ingesting (nearly) the entire world's worth of information on the internet to train massive deep learning models capable of performing fantastic or frightening feats, depending on your perspective. In this article, we explore an AI compositional technology, known as generative modeling, and demonstrate its ability to simulate human-realistic text. This article is the first of a series on the subject and, in a forthcoming article, we will identify and illustrate practical applications of this fascinating technology for your business.

Simulating Human Text

The ability to simulate human text is a hot topic in the natural language processing (NLP) world, particularly with the recent release of ChatGPT; this is largely due to recent innovations that are quantifiably improving on traditional NLP benchmarks. This technology is currently being used for a variety of industrial applications such as customer service, human and computer language translation, chatbot communication, and text summarization, to name a few. The cornerstone of the NLP models used to create simulated text is the Generative Pre-Trained Transformer (GPT) family of models:

  • generative: the ability to generate new data instances.

  • pre-trained: the model is pre-trained on a set of text data, meaning that the training need only be done once and can be reused in the future. GPT models are also amenable to transfer learning, which conceivably allows them to be used as a base for other NLP tasks.

  • transformer: a model architecture developed by Google in 2017 that is both computationally efficient and shown to be accurate in translating languages, say from English to French.

In June of 2020, OpenAI released GPT-3: version 3 of a language model capable of producing highly realistic human-like text given an input sequence of text, known as a prompt. The release of GPT-3 created quite a buzz in the AI research community and in the general media as OpenAI touted its capabilities. Language modeling is certainly not a new concept. Of particular note is a recent advancement made by Google researchers in their development of transformers, which OpenAI leverages in GPT-3 and which has become a staple of modern NLP models. What then makes OpenAI's GPT-3 model so special and why all the hype? There are several factors:

  • scale of the training data: The model was trained on a massive data set comprising hundreds of billions of words taken from the internet, e.g., CommonCrawl and Wikipedia. In effect, OpenAI basically ingested the world of text as was publicly available on CommonCrawl from 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering.

  • scale of the model: when we speak of model complexity, we are referring to the number of model parameters. You can analogously think of these parameters as all the knobs and buttons you fiddle with when you get a new stereo system. Once you get the sound just the way you like it, which may take a few rounds of iterative adjustment, we say that your sound system is "tuned". In the case of a neural network, a parameter is a weight that determines the strength of a single neuron-neuron connection and tuning the model means to adjust the weights of all such parameters in such a way that a chosen performance metric (say, prediction accuracy) is optimized. Deep learning models contain lots of neuron-neuron connections and so it is common to see the number of model parameters in the millions. The GPT-3 deep learning model is something quite special in that it consists of 175 billion model parameters and, at the time of release, was the largest neural network ever created. 

  • cost to tune the model: to tune such a complex model requires a lot of hardware, compute cycles, electricity, and manpower. It is estimated that OpenAI spent anywhere from $4.6M to $12M to train the GPT-3 model.

  • level of realism: the text completions are often highly realistic. Consider, for example, this article, which was written entirely by GPT-3, i.e., a machine. Yes, it isn't Henry David Thoreau, but in reading the article you get the sense that an actual human being wrote it. Given the level of realism, some even pondered whether GPT-3 was the first instance of an artificial general intelligence (AGI), i.e., an AI agent able to understand or learn any intellectual task that a human being can.

  • generalizability: a fascinating aspect of GPT-3 is that it can take as input text that follows any language structure and this can be used for a wide variety of applications including translating languages, writing topical essays, summarizing long texts, answering questions, and even creating computer code. It has been stated that GPT-3 does not need to be “trained” for various language tasks in the traditional machine learning sense, since its training data was composed of the world's worth of text on the internet.
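
To make the notion of a model parameter from the list above concrete, here is a minimal sketch that counts the parameters of a small, fully connected network. The layer sizes are hypothetical and chosen only for illustration, and the sketch also counts one bias term per neuron, which the description above does not mention. GPT-3 is a transformer rather than a plain fully connected network, so this illustrates the bookkeeping only, not GPT-3's actual layout.

```python
def count_dense_parameters(layer_sizes, include_biases=True):
    """Count the trainable parameters of a fully connected network.

    layer_sizes lists the number of neurons in each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per neuron-neuron connection
        if include_biases:
            total += n_out      # assumption: one bias per receiving neuron
    return total


# Hypothetical toy network: 784 inputs, two hidden layers, 10 outputs.
toy_layers = [784, 512, 256, 10]
toy_params = count_dense_parameters(toy_layers)
print(f"toy network parameters: {toy_params:,}")

# Parameter count quoted in the article for GPT-3.
GPT3_PARAMETERS = 175_000_000_000
print(f"GPT-3 is roughly {GPT3_PARAMETERS / toy_params:,.0f} times larger")
```

Even this toy network has over half a million parameters, which is why counts in the millions are routine and why 175 billion stands out.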

The hype of GPT-3 was also met with resistance:

  • lack of semantic understanding and real-world reasoning: semantic understanding is the ability of a machine to process the meaning and context behind real-world information. Some experts deny that GPT-3 has these capabilities, likening the effort to "a glorified $10M+ auto-complete software" that is "nowhere close to AGI".

  • bias in generated text: the unfortunate reality is that the internet is littered with examples of racist, sexist, ageist, or otherwise discriminatory remarks. As GPT-3 is trained on most of the world's text data, it means that the text that is generated, given the proper prompts, may also contain societal bias, which could be harmful and offensive.

  • negative use cases: because GPT-3 has the potential to imitate text generated by real humans, it may be used for ill gain, e.g., phishing or impersonation. OpenAI is well aware of these issues, and in their release of GPT-2 (the predecessor of GPT-3) they expressed concern that such models could be used to "generate deceptive, biased, or abusive language at scale".

Even the CEO of OpenAI chimed in on the hype surrounding GPT-3, stating, 

The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.

Simulation in Action

Is GPT-3 more hype than substance? Let's see it in action to get a sense of its capabilities and limitations. To start, let's see how GPT-3 views itself by feeding it the prompt "OpenAI's GPT-3 model is" and asking it to keep writing. Here is a collection of several responses:

  • a major improvement over the company's previous work. It's still nowhere near as sophisticated as a human brain, but it's far better than its predecessors. The GPT-2 model, which impressed many experts last year, topped out at a vocabulary of about 10,000 words. GPT-3's knowledge base is more than twice as extensive. The model is also capable of a few other impressive feats. It can listen to a sentence and then generate the next sentence of a story, and it can also listen and then answer questions about a text. For example, the model can listen to a list of events and then answer whether one event happened before another.

  • a step forward in the field of generative adversarial networks (GANs), which are a class of machine learning algorithms that pit two neural networks against each other. The first network, the generator, creates fake data, and the second network, the discriminator, tries to tell the difference between real and fake.

  • a special kind of GAN that generates text from a varied set of large bodies of text, including books and web paragraphs. Whereas the GPT-2 uses a dictionary of possible two- and three-char sequences, the GPT-3 model uses a dictionary of all possible sequence-lengths. The model achieves state-of-the-art performance, generating text that is often indistinguishable from humans. The GPT-3 benchmark task consists in training a GPT-3 model on a corpus to predict the next character of a sequence of random length 32 that is sampled uniformly at random from a vocabulary of sequences of length 128.

The completions are human-realistic and appear informative: they seem to give us insight into the underlying model (a GAN), tell us that it is better than its predecessor (GPT-2), hint at how GPT-3 results can be bettered, e.g., by supplying good examples in list form, and come with an admission that the current state of the model is not at the sophistication of the human brain. Quite impressive, really, given that the prompt contains only a few words. However, a caveat to the above generations, and indeed to all AI-generated text, is that there is no guarantee that the text is factually accurate. For example, in the above, GPT-3 describes itself as a GAN model, but OpenAI mainly uses GANs for image generation, not text generation. Since the model is not publicly released, it is questionable whether GANs were used in any capacity involving GPT-3, except for OpenAI's DALL-E project, which is a 12B parameter version of GPT-3 and "creates images from text captions for a wide range of concepts expressible in natural language". This highlights one of the deficiencies of text generation: just because it sounds human doesn't mean it is accurate.

Let's see how GPT-3 performs with other tasks. Table 1 shows sample completions created by GPT-3 given a prompt. In these examples, the prompt comes in a few flavors:

  1. A conversational pattern prefaced with tokens, e.g., English:

  2. The beginning of a sentence.

  3. A list of examples expressing a similar theme or relation.

  4. A single question without any preceding patterns or examples to go by.

Table 1: GPT-3 sample completions given an input prompt.

In the first example, each English phrase is paired with its French translation. Four such pairs are given as examples, ending with a blank "French:" entry, which is a cue for GPT-3 to produce the French translation of the final English phrase, "How good is the wine here?". The result is "Quelle est la qualité des vins ici?", which according to Google Translate means "What is the quality of the wines here?". That result is very impressive given that we supplied only a few examples for GPT-3 to understand what we are looking for. This is an example of what the NLP community refers to as few-shot learning: emulating what humans can generally do with a new language task given just a few examples or simple instructions.
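
A few-shot prompt of this kind is just structured text, so it can be assembled programmatically. The sketch below is one possible way to do so; the helper name `build_few_shot_prompt` and the four example pairs are hypothetical stand-ins, since the exact pairs used in Table 1 are not reproduced here. Only the final English phrase comes from the article.

```python
def build_few_shot_prompt(examples, query,
                          source_label="English", target_label="French"):
    """Assemble labeled example pairs followed by an unanswered query."""
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{source_label}: {source_text}")
        lines.append(f"{target_label}: {target_text}")
    lines.append(f"{source_label}: {query}")
    # The blank target entry is the cue for the model to fill in the answer.
    lines.append(f"{target_label}:")
    return "\n".join(lines)


# Hypothetical example pairs, for illustration only.
example_pairs = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
    ("Where is the train station?", "Où est la gare ?"),
    ("I would like a coffee.", "Je voudrais un café."),
]

prompt = build_few_shot_prompt(example_pairs, "How good is the wine here?")
print(prompt)
```

The resulting string would then be sent to the text generation model as its prompt. Swapping the labels and pairs adapts the same pattern to other tasks, such as the occupation and nationality examples.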

The second example is simply the beginning of a statement, "I love to eat". GPT-3 completes the prompt with two sentences, the first of which is very reasonable and the second of which starts to get a little silly, with a complaint about how the person "accidentally" swallows food too quickly, leaving them "feeling sick from the bitter taste". This is an important example because it illustrates how GPT-3 can go off the rails, so to speak, from "standard" language. In the process of generating text, GPT-3 fills in one word at a time, choosing each from a list of candidates that have been assigned a probability score behind the scenes. Once a word is chosen, it is appended to the original prompt and the process is repeated until a sentence is formed, then a paragraph, and so on. The user can partly influence the degree of randomness in these generations through an API parameter named temperature, which ranges from 0 (cold) to 1 (hot). Hot temperatures encourage more randomness in the word chosen to continue the prompt, while a cold temperature is tantamount to asking for the most popular and common completion. So, you can imagine turning the temperature up quite high to get really creative completions, at the risk of them straying far from the "expected" result. The GPT-3 API has another parameter, max_length, that lets you control the maximum length of the simulated responses, e.g., to just a few words or a few paragraphs. Given the right settings, it is easy to see how you can encourage GPT-3 to output highly creative text that strays far from reality and reads more like a random stream of consciousness than “normal” communication.
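
The word-by-word process and the effect of temperature can be sketched in a few lines. This is a simplified illustration, not OpenAI's implementation: `toy_scores` is a hypothetical stand-in for the model, the candidate words and scores are invented, and real models operate on sub-word tokens rather than whole words. Dividing scores by the temperature before a softmax is a common way to implement this kind of control, and a temperature of 0 is treated here as always picking the top candidate.

```python
import math
import random


def apply_temperature(scores, temperature):
    """Convert raw candidate scores into probabilities at a given temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy, all probability on the top candidate.
        best = max(scores, key=scores.get)
        return {word: 1.0 if word == best else 0.0 for word in scores}
    scaled = {word: score / temperature for word, score in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {word: math.exp(value - peak) for word, value in scaled.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def sample_next_word(scores, temperature, rng):
    """Draw one candidate word according to its temperature-adjusted probability."""
    probs = apply_temperature(scores, temperature)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


def generate(prompt, score_fn, temperature=0.7, max_length=5, seed=0):
    """Append one sampled word at a time until max_length words are added."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(max_length):
        word = sample_next_word(score_fn(text), temperature, rng)
        text = f"{text} {word}"  # the chosen word becomes part of the prompt
    return text


def toy_scores(text):
    """Hypothetical stand-in for a language model; it ignores the context."""
    return {"pizza": 3.0, "pasta": 2.0, "rocks": 0.0}


for temperature in (0.2, 1.0):
    probs = apply_temperature(toy_scores(""), temperature)
    print(temperature, {word: round(p, 3) for word, p in probs.items()})
print(generate("I love to eat", toy_scores, temperature=1.0, max_length=3))
```

At the colder setting nearly all of the probability sits on the top-scoring word, while at the hotter setting the unlikely candidate "rocks" gets a real chance of being picked, which is how silly completions creep in. Note that `max_length` in this sketch counts whole words, whereas a real API may measure length differently.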

The third and fourth examples are given in a particular style that helps to steer the expected result. For example, "Lewis Hamilton is a Formula 1 driver. Tom Cruise is an actor." forms an obvious pattern of the name of a famous person followed by what they do for a living. We then ask GPT-3 to complete "Joe Biden is a", and it returns "politician.", which is correct and reasonable. In the next example, however, we change the rules by providing examples of famous people along with their nationality. GPT-3 completes the prompt "Angela Merkel is" correctly with "German". These are more examples of few-shot learning, and GPT-3 performs quite well given that it was not trained specifically for these tasks.

The final example involves a fun question and is meant to demonstrate that GPT-3 is far from artificial general intelligence. We pose a simple question, "What weighs more, the moon or a cookie?", a question that a child is likely to answer correctly. GPT-3 answers the question incorrectly in its first two responses but correctly in its third, with the additional observation that "the cookie is easier to eat", an astute observation! This demonstrates that GPT-3 generally lacks real-world reasoning, as some critics suggest. It's not able to comprehend physics in the way that humans do. What it is doing is finding words that optimally complete a prompt based on the data that it has been given. Without a prompt containing a list of examples comparing the relative weights of several objects, GPT-3 is simply filling in the blanks with text that is likely to have been written somewhere but is not necessarily beholden to the laws of physics or reality.

Competition in the Generative Model Space

For many, GPT-3 represents a game changer in NLP technology, if only for the sheer size of the underlying model and the amount of information OpenAI absorbed to train it. However, on June 1, 2021, OpenAI was met with serious competition when the Beijing Academy of Artificial Intelligence (BAAI) conference introduced Wu Dao 2.0: a multi-modal (text + image) model containing 1.75 trillion parameters, an order of magnitude larger than the mammoth GPT-3 model.

Figure 1: Number of parameters for popular NLP models. Note that the parameter counts on the y-axis are in log scale, e.g., 10^9 = 1,000,000,000 (billion) and 10^12 = 1,000,000,000,000 (trillion).

The South China Morning Post has stated that Wu Dao 2.0 has surpassed OpenAI and Google with their new language processing model. As with the release of GPT-3, there is quite a buzz about the capabilities of Wu Dao 2.0, which include text generation, image recognition, and image generation tasks. The multimodality of the model allows it to write essays, poems, and couplets in traditional Chinese, as well as to caption images and create nearly photorealistic artwork from natural language descriptions. As it stands, Wu Dao 2.0 is not available to the general public.

Today, there is heavy competition in the generative AI space as companies, and indeed entire nations, battle for technological dominance.

That dominance is important when it comes to disseminating information at a massive scale. As AI text generation models improve, so does their potential to be used for malicious purposes. OpenAI openly stated in a research paper that GPT-3 could be misused for "misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting." As a result, OpenAI has not released GPT-3 as an open source product. In fact, in September 2020 GPT-3 was licensed exclusively to Microsoft, and if you wish to experiment with the beta version of GPT-3, you must apply to OpenAI for an API key, which comes with restricted use. You will also have to pay for the use of GPT-3 and will need their permission to include it in a commercial product. Simply put, OpenAI is being careful with its creation and is keeping the reins on it to ensure proper use. Like Xyonix, they generally operate under the credo of doing AI for good, and we commend them for that effort.

With restrictions limiting access to such technologies, that leaves researchers in a bit of a bind, knowing that these tools might be very useful for their business but not having immediate access to them or an assurance that they will be available to them in the future. 

With Wu Dao 2.0 inaccessible and GPT-3 currently only available as a commercial product, are there open source alternatives? 

Thankfully, the answer is: yes! For example, EleutherAI is a grassroots collective of researchers working on GPT-NEO: a project dedicated to training a model equivalent to the full-sized GPT-3 and making it available to the public under an open source license. GPT-NEO currently comes in two flavors: a 1.3-billion and a 2.7-billion parameter model. Let's take GPT-NEO 2.7B for a spin and ask it to describe itself as we did for GPT-3. Here are several independently generated responses to the prompt "EleutherAI's GPT-NEO 2.7B is":

  • bringing 20 GPT-AlexNet models per minitCloud.

  • the first GPT based AI training tool that supports the popularity of BERT, the asynchronous sequence-to-sequence model that was developed to solve GPT-2.

  • the first production-ready production-ready smoothly running (at least it runs fast and smooth as a production-ready variant of ELITE).

  • an accurate tool to achieve a high level of translation accuracy on the English-German NMT and German-English MT tasks.

  • the result of months of intense work.

As with GPT-3, these responses give us a glimpse of GPT-NEO's capabilities and highlight the level of effort in training the model. GPT-NEO can also handle prompts that contain multiple lines of text. For example, let's repeat some of the experiments we did with GPT-3 completions for GPT-NEO 2.7B (see Table 2):

Table 2: Sample GPT-NEO 2.7B completions given a prompt.

For each of these prompts, GPT-NEO does a good job in understanding the context of the provided prompts and returns a sensible human-realistic response. However, these responses are lacking in real-world accuracy in comparison to GPT-3:

  • "What is good life here?" is not the same as "How good is the wine here?".

  • Joe Biden was indeed a US Senator for Delaware, but that stint ended in 2009, and more timely and relevant answers might be Vice President or President.

  • Angela Merkel is not Italian.

So, while GPT-NEO 2.7B model outputs are impressive, they tend to lack the quality that can more readily be achieved with GPT-3. Likely, this is directly related to the fact that the GPT-3 model is almost 65 times larger than GPT-NEO 2.7B and is trained on a much larger data set. Still, GPT-NEO models play a valuable role in allowing open source access to a developing and important technology. Just in the last few weeks, EleutherAI released GPT-J-6B (GPT-J), a model the group claims performs at nearly the same level as an equivalent-sized GPT-3 model across multiple tasks. True to its name, GPT-J-6B contains approximately 6 billion model parameters, more than twice the size of its 2.7B predecessor.
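
The size comparisons quoted in this article follow directly from the parameter counts. The short sketch below reproduces that arithmetic; the counts are the approximate figures given in the text, and the dictionary and helper names are purely illustrative.

```python
# Approximate parameter counts as quoted in this article.
PARAMETER_COUNTS = {
    "GPT-NEO 1.3B": 1.3e9,
    "GPT-NEO 2.7B": 2.7e9,
    "GPT-J-6B": 6e9,
    "GPT-3": 175e9,
    "Wu Dao 2.0": 1.75e12,
}


def size_ratio(larger, smaller):
    """Return how many times more parameters `larger` has than `smaller`."""
    return PARAMETER_COUNTS[larger] / PARAMETER_COUNTS[smaller]


comparisons = [
    ("GPT-3", "GPT-NEO 2.7B"),
    ("GPT-J-6B", "GPT-NEO 2.7B"),
    ("GPT-3", "GPT-J-6B"),
    ("Wu Dao 2.0", "GPT-3"),
]
for larger, smaller in comparisons:
    ratio = size_ratio(larger, smaller)
    print(f"{larger} is about {ratio:.1f}x the size of {smaller}")
```

The first ratio comes out at about 64.8, which is the "almost 65 times" figure above, and the last is the factor of 10 separating Wu Dao 2.0 from GPT-3.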

An important aspect of EleutherAI’s work is its deliberate attempt to reduce harmful societal bias in its text completions. To do so, EleutherAI claims to have performed “extensive bias analysis” on their training data and made “difficult editorial decisions” to exclude datasets they felt were “unacceptably negatively slanted” towards certain groups or viewpoints. With GPT-J made available as an open source product, we at Xyonix applaud EleutherAI’s efforts towards developing “AI technology for good”.

Summary

Generative text models have gotten a lot of attention and hype over the last year. Their ability to generate human-like text, coupled with training on nearly the entire world's text data on the internet, has established a new modeling paradigm. These models are massive, with OpenAI's GPT-3 setting a parameter size record in June of 2020 only to be outdone in June of 2021 by BAAI's Wu Dao 2.0 model, which is 10 times larger. While access to the GPT-3 and Wu Dao 2.0 models is restricted, GPT-NEO and GPT-J offer open source alternatives, with the caveat that their current largest model is roughly 29 times smaller than GPT-3 and nearly 300 times smaller than Wu Dao 2.0 and may suffer from a relative lack of quality in certain tasks as a result.

As impressive as these technologies are, they are certainly far from perfect. For the practitioner, working with AI text generators can require a fair amount of work in developing suitable text prompts and finding the right API parameters to achieve satisfactory results. The temperature parameter, for example, controls the degree of randomness in next-word completions and, if set too high, may result in generated text that resembles a random stream of consciousness bordering on utter nonsense. That might be great for certain artistic expressions such as song lyrics or poetry but likely won’t fare well for other applications, e.g., a mental health chatbot, which requires sound and stable advice to be delivered in a timely and relevant manner.

These trailblazing models aren't intelligent like a human: they can't reason based on the laws of physics, nor do they have the deeper understanding of life that sentient humans naturally acquire over years of experience on planet Earth. As such, claims that these models represent a first generation of Artificial General Intelligence are hype, not reality. Still, with the world's data ingested, the simulations that are generated are often very impressive. So much so that researchers leading the charge in generative text modeling are wisely holding onto the reins of their products, knowing that they can be used for harm as well as good. One thing is certain: with the race for AI dominance afoot and given the pace at which these models are advancing, we are bound for a wild ride over the next few years. We can't wait to see what's in store for us in the future!