I decided to try to understand one of OpenAI's latest research papers (from last June). But to do so, we must first read and assimilate the articles it is based on. After Semi-Supervised Sequence Learning, I will now try to summarize Attention Is All You Need, published by Google in June 2017.
Once again, this article is an innovative improvement on pre-existing models, so we need a few concepts to understand what Google's team managed to do. The most important one is the attention mechanism. It is a kind of algorithm that tries to solve a specific problem: parsing spatio-temporal information, i.e. keeping track of what refers to what across a sequence. Let's explain what that means with an example:
A woman in a blue dress was approached by a woman in a white dress. She cut a few slices of an apple. She then gave a slice to the woman in blue.
The question we want to answer is "Who cut the apple?". For us humans, it is quite simple: we read in the last sentence that she gave a slice to the woman in blue, so "she" can only be the woman in white. But that is not so easy for our computers. That is why researchers created attention mechanisms: algorithms that allow a model to concentrate on certain parts of a sentence (they also work on images and videos). For our article, the application is text translation.
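To make this a bit more concrete, here is a toy sketch in Python of what a single attention step does. The vectors are random and made up, so the numbers themselves mean nothing; the point is only to show the mechanics: the word we want to resolve ("she") scores every word of the sentence, and the softmax of those scores tells the model where to "look".

```python
import numpy as np

# Toy illustration of one attention step (made-up random vectors, not a
# trained model): the query word scores every word in the sentence, and
# the softmax of those scores becomes the attention weights.

np.random.seed(0)
words = ["woman_blue", "woman_white", "apple", "slice", "she"]
d = 8                                         # embedding size (arbitrary)
embeddings = np.random.randn(len(words), d)   # one vector per word

query = embeddings[words.index("she")]        # the word we want to resolve

scores = embeddings @ query / np.sqrt(d)      # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

context = weights @ embeddings                # weighted sum of word vectors
for word, weight in zip(words, weights):
    print(f"{word:12s} {weight:.2f}")
```

With trained embeddings, the weight on "woman_white" would be high, which is exactly the "she can only be the woman in white" reasoning encoded as numbers.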
Attention mechanisms are usually paired with recurrent neural networks, and the computation cost (the computing time, which depends on the calculation capability of our machine) is enormous. What the Google team set out to do is use only attention mechanisms, removing RNNs from the model, to try to get better results both in computation cost and in precision.
And that is another notion we need to make explicit: how do you measure the precision of a translation, which is not an exact science (meaning that two professional human translators won't always produce the same translation)? The answer is that you can't have an absolute measure, but researchers created a tool specifically for evaluating a translation machine against professional human translations: the BLEU score. It will never be a clear, absolute measure of precision, but it is a useful yardstick for comparing different models.
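As an illustration, here is roughly how a BLEU score can be computed in practice with the nltk library. The sentences below are hypothetical, and real evaluations are done over whole test sets with several references, not a single sentence:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# A candidate translation is compared against one or more human references
# by counting overlapping n-grams. The score only makes sense for comparing
# systems on the same data, not as an absolute measure of quality.

references = [["the", "woman", "in", "white", "cut", "the", "apple"]]
candidate  = ["the", "woman", "in", "blue", "cut", "an", "apple"]

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```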
The model created in this article is called the Transformer, and I will not go into too much detail about it. Simply put, they created a quite complicated model, with a parallelization of the attention mechanisms (multi-head attention), allowing the model "to jointly attend to information from different representation subspaces at different positions."
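For the curious, here is a minimal numpy sketch of that idea, multi-head self-attention, heavily simplified: one sentence, random untrained weights, no masking, no batching, and none of the rest of the Transformer architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """Simplified multi-head self-attention for one sentence.
    X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    # Project the input into queries, keys and values, then split into heads.
    Q = (X @ Wq).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    # Scaled dot-product attention, computed in parallel for every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_k)

    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Tiny usage example with random weights (an illustration, not a trained model).
seq_len, d_model, num_heads = 5, 16, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```

Each "head" is one attention mechanism running in parallel with its own projections, which is what lets the model look at several positions and aspects of the sentence at the same time.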
This Transformer model was tested on English-to-German and English-to-French translation tasks. On each, it beat the previous best models, with a BLEU score better by about 2 points (from roughly 26 to 28 on English-to-German, which is quite a gap), and a training computation cost reduced by more than a factor of ten (while still being on the order of 10^18 operations).
This model is a great innovation, not so much because it gets better results, but mostly because it relies on a single type of mechanism, which makes it a lot easier to set up and to understand.
That is it for this article; next time we will finally study OpenAI's article from June. As always, I used articles from towardsdatascience.com as references, and borrowed some of their examples. If you want more details, I encourage you to read Google's full article, Attention Is All You Need (https://arxiv.org/pdf/1706.03762.pdf). Hope you enjoyed it and see you next time!