Three Models Leading the Neural Network Revolution
In recent years, we have seen great advances in machine learning and artificial intelligence that could usher in a new era of progress. In the area of natural language processing, three transformer-based models have been the cornerstone of this innovation: GPT, BERT, and T5.
- By Troy Hiltbrand
- February 13, 2023
In the past couple of years, there have been revolutionary advances in machine learning (ML) and artificial intelligence (AI). These advances demonstrate that ML and AI are moving from science fiction to science fact and that they have the capacity for transformational change across many industries. From DALL-E and Lensa showing how machines can create art to ChatGPT showing that machines can write articles, poetry, song lyrics, and even programming code, this domain is on the cusp of huge advances.
Underlying these amazing demonstrations of the business value of ML and AI is a set of technologies that fall into the family of neural networks called transformers. As an analytics leader, you don’t necessarily have to understand all the technical details associated with how these are programmed and the inner workings of their code, but it is important to understand what they are and what makes them unique.
In 2017, a group of researchers at Google and the University of Toronto developed a new type of neural network architecture: the transformer. Originally, the team's goal was to enable machine translation, but their findings have gone well beyond translation and have revolutionized multiple arenas in the ML world. Unlike the recurrent neural nets (RNNs) of the past, which processed data sequentially, one token at a time, transformers allow computation to be distributed and parallelized. This means they can process huge amounts of data and train very large models.
What Makes the Transformer Special
There are three concepts that enable transformers to succeed where RNNs fell short: positional encoding, attention, and self-attention.
Positional encoding removes the need to process one word of a sentence at a time. Each word in the corpus is encoded to have both the text of the word and the position in the sentence. This allows the model to be built in a distributed fashion across multiple processors and leverage mass parallelization.
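As a rough illustration, the sinusoidal encoding scheme from the original transformer paper can be sketched in a few lines of Python. The sequence length and embedding size below are arbitrary choices for the example, not values from any particular model:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original transformer paper.

    Returns a seq_len x d_model table; each row is added to the matching
    token's embedding so the position travels with the token itself.
    """
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Even dimensions use sine, odd dimensions use cosine,
            # at wavelengths that grow geometrically with the dimension.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Because position is baked into every token vector, all tokens can be
# processed at once rather than one at a time as in an RNN.
pe = positional_encoding(seq_len=4, d_model=8)
```

Each token's vector now carries its own position, which is what lets the rest of the network operate on all positions in parallel.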
Attention is a very important concept for machine translation. When translating language, it is not enough to translate word by word. The model needs to learn, from the patterns of word placement in the input and output sentences of its training data, which input words matter most when producing each output word, and then mirror those patterns when translating new phrases. This ability to leverage learned patterns is at the core of attention. Beyond matching word positions between sentences, this pattern matching extends to grammatical gender, plurality, and other rules of grammar involved in translation.
Self-attention is the mechanism by which a neural network identifies features from within the data itself. In computer vision problems and convolutional neural nets (CNNs), the network can identify features such as object edges and shapes in unlabeled data and use them in the model. In natural language processing (NLP), self-attention finds similar patterns in unlabeled text that represent parts of speech, grammar rules, homonyms, synonyms, and antonyms. These features extracted from within the data are then used to better train the network for future processing.
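A stripped-down sketch of the self-attention computation, in plain Python, is shown below. Real transformers learn separate query, key, and value projections; those learned weights are omitted here (queries, keys, and values are all just the raw token vectors) to keep the example minimal:

```python
import math

def softmax(xs):
    """Turn raw similarity scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified self-attention over token vectors X.

    Each token compares itself with every token in the sequence (including
    itself) and returns a weighted blend of all token vectors, so each
    output position reflects the context of the whole input.
    """
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this token with every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)
        # Weighted mix of all token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

# Three toy 2-dimensional "token embeddings"
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Because every token attends to every other token in one pass, the whole computation is a set of independent weighted sums, which is exactly what makes it easy to parallelize across processors.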
With these concepts, multiple groups have built large language models that leverage these transformers to do some incredible machine learning tasks related to NLP.
The Top Three Transformer Models
GPT stands for Generative Pre-trained Transformer. GPT-3 is the third generation of this transformer model and is the one gaining momentum today with an anticipated GPT-4 on the near-term horizon. GPT-3 was developed by OpenAI using 45TB of text data, or the equivalent of almost all the content on the public web.
GPT-3 is a neural network with over 175 billion machine learning parameters that allow it to effectively perform natural language processing and natural language generation (NLG). The results of the GPT model are very human-like in word usage, sentence structure, and grammar. This model is the cornerstone of ChatGPT, released by OpenAI to demonstrate how the model can solve real-world problems.
BERT stands for Bidirectional Encoder Representations from Transformers. In this neural net, every output element is connected to every input element. This enables the bidirectional nature of the language model. In past language models, the text was processed sequentially either left-to-right or right-to-left, but only in a single direction. The BERT framework was pre-trained by Google using all the unlabeled text from Wikipedia but can be further refined with other question-and-answer data sets.
The BERT model aims to understand the context and meaning of words within a sentence. BERT can be leveraged for tasks such as semantic role labeling of words, sentence classification, or word disambiguation based on the sentence context. BERT can support interaction in over 70 languages. Google leverages BERT as a core component of many of its products, including developer-facing services in the Google Cloud Platform.
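The "bidirectional" distinction can be made concrete with a toy attention mask. A left-to-right model in the GPT style only lets each token see earlier positions, while a BERT-style encoder lets every token see the whole sentence in both directions. The function below is purely illustrative, not code from either model:

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len mask: 1 = may attend, 0 = blocked.

    causal=True  -> left-to-right (GPT-style): token i sees tokens 0..i only.
    causal=False -> bidirectional (BERT-style): every token sees every token.
    """
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

bidirectional = attention_mask(4, causal=False)  # all ones
left_to_right = attention_mask(4, causal=True)   # lower-triangular
```

Seeing context on both sides of a word is what lets BERT disambiguate words like "bank" from the full sentence rather than only from the words that came before.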
T5 stands for Text-to-Text Transfer Transformer. T5 was developed by Google in 2019. Researchers were looking for an NLP model that would leverage transfer learning and have the features of a transformer, hence the name transfer transformer. This model differs from BERT in that it uses both an encoder and a decoder, so its inputs and its outputs are both text strings. This is where the text-to-text portion of the name comes from.
The model was trained, leveraging both unsupervised and supervised methods, on a large portion of the Common Crawl data set. T5 was designed to be transferable to other use cases by using its model as a base and then transferring it and fine-tuning it to solve domain-specific tasks.
Common Use Cases
Because of the transformer revolution we are experiencing, many NLP problems and use cases are being solved using these new and improved methods. This makes it possible for businesses to more effectively perform tasks that require text summarization, question answering, automatic text classification, text comparison, text and sentence prediction, natural language querying (including voice search), and message blocking based on policy violations (e.g., offensive or vulgar material, profanity).
As companies experience the power of these new models, many additional use cases will be identified, and businesses will find ways to derive value from integrating them into their existing and new products. We will see more products arrive on the market with intelligent features leveraging these three models.
Looking Forward
At this stage, many of these algorithms are still in the demonstration and experimentation phase, but companies such as Microsoft and Google are actively looking at ways to incorporate them into other products to make them better, smarter, and more capable of interacting in an intelligent manner with users. The AI revolution that is upon us will possibly define the coming decade much in the same way that the introduction of the internet defined the 1990s and 2000s, so it is important to understand what these algorithms are and start to identify where on your strategic road map they should be planned.
About the Author
Troy Hiltbrand is the senior vice president of digital product management and analytics at Partner.co where he is responsible for its enterprise analytics and digital product strategy. You can reach the author via email.