Vladimir Steiner

Understanding "Improving Language Understanding with Unsupervised Learning" Part 1

Updated: Oct 12, 2022

I decided to try to understand one of OpenAI's latest research papers (from last June). But to do so, we must first read and assimilate the articles it is based on. This post will therefore cover a research paper from Google, Semi-Supervised Sequence Learning, dating from November 2015.



This article is not too complicated, as it deals more with experimental results than with new concepts, but it still relies on a few notions that need to be explained. Two are essential for this paper; the first is the difference between supervised and unsupervised learning.


The two most frequent uses of Supervised Learning

Supervised learning is when, during the learning phase, we tell our machine the output to expect for each input, because we already know the categories we are dealing with. It is often used for classification problems: for example, if you want to be able to tell the difference between apartments and houses based on price and surface area. We give our machine many examples, telling it each time whether the data corresponds to a flat or a house. If this is done well, our machine should be able to correctly classify new examples afterwards.
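To make this concrete, here is a minimal sketch in Python (my own illustration, not something from the paper) using scikit-learn's LogisticRegression; the prices, surface areas and labels are invented purely as an example.

```python
# A minimal sketch of supervised classification: labelled examples of
# (price, surface) pairs are used to train a classifier that can then
# label new, unseen examples. The numbers are made up for illustration.
from sklearn.linear_model import LogisticRegression

# Each row is [price in k€, surface in m²]; labels: 0 = flat, 1 = house
X_train = [[150, 40], [220, 65], [310, 120], [450, 180], [180, 55], [390, 150]]
y_train = [0, 0, 1, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X_train, y_train)  # the "learning phase": inputs plus known outputs

print(clf.predict([[200, 60], [500, 200]]))  # classify new examples
```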


An example of Unsupervised Learning

Unsupervised learning is different because we don't know in advance exactly what we want the result to be. It is usually used when we would like to find clusters, i.e. relations between certain data points, without knowing beforehand what those relations will be. For more detail on the difference between the two, you can read https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
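For contrast, here is a rough sketch of an unsupervised approach on the same kind of toy data, using scikit-learn's KMeans to find groups without ever being given labels; again, the numbers are made up and this is only my illustration.

```python
# A minimal sketch of unsupervised learning: the same kind of data, but
# without labels. KMeans groups the points into clusters on its own;
# we never tell it which point is a flat and which is a house.
from sklearn.cluster import KMeans

X = [[150, 40], [220, 65], [310, 120], [450, 180], [180, 55], [390, 150]]

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)  # cluster assignments discovered from the data alone
```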


The second crucial notion is the LSTM, used when we deal with information as sequences. This special neural network was the first solution found to one of the main problems of recurrent neural networks (RNNs, used in deep learning): short-term memory. If we work on a long enough sequence, our network will forget the beginning. To quote Michael Nguyen in Illustrated Guide to LSTM’s and GRU’s: A step by step explanation, "if a sequence is long enough, they’ll have a hard time carrying information from earlier time steps to later ones. So if you are trying to process a paragraph of text to do predictions, RNN’s may leave out important information from the beginning". The LSTM is a neural network that is able to retain information over the long term. I won't go into more detail here, and advise you to read Mr. Nguyen's post https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21.
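To show what this looks like in practice, here is a tiny sketch of an LSTM processing a batch of sequences; I use PyTorch as an arbitrary choice of framework (not necessarily what the paper's authors used), and all sizes are placeholders.

```python
# A minimal sketch of an LSTM processing sequences. The cell carries a
# hidden state and a cell state from step to step, which is what lets it
# keep information from early in the sequence.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

batch = torch.randn(8, 100, 32)   # 8 sequences, 100 time steps, 32 features each
outputs, (h_n, c_n) = lstm(batch)

print(outputs.shape)  # torch.Size([8, 100, 64]) - one output per time step
print(h_n.shape)      # torch.Size([1, 8, 64])   - final hidden state per sequence
```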



Now that we know the basics, we can go back to the article at hand. In Semi-Supervised Sequence Learning, the team explains that they tried to improve on classical supervised learning for classifying several datasets, by adding unlabelled data and using it to prepare the LSTM in different ways before the supervised training.


They tried two different approaches: sequence autoencoding and language modelling. Once again, I will try to explain the principle behind each.


The sequence autoencoder (SA) is a neural network that tries to reproduce its input after squeezing it through an encoder. Concretely, the SA reads a paragraph or a sentence as input, encodes it into a compact representation, and then tries to reproduce the original text from it.
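Below is a simplified sketch of what such a sequence autoencoder could look like in PyTorch; it follows the idea described above (encode the sequence, then decode it back), but the architecture, sizes and training details are my own placeholders rather than the paper's exact setup.

```python
# A simplified sketch of a sequence autoencoder: an encoder LSTM reads
# the whole input sequence into its final state, and a decoder LSTM must
# reproduce the input sequence from that state.
import torch
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)
        _, state = self.encoder(emb)           # compress the sequence into (h, c)
        dec_out, _ = self.decoder(emb, state)  # decode, conditioned on that summary
        return self.out(dec_out)               # predict the original tokens back

model = SequenceAutoencoder(vocab_size=10000)
tokens = torch.randint(0, 10000, (4, 50))      # 4 fake sentences of 50 token ids
logits = model(tokens)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), tokens.reshape(-1))
print(loss.item())  # reconstruction loss: how well the input was reproduced
```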

Recurrent language models (LMs) are a bit different (and are not really explained in the article): they are trained to predict the next word in a sequence. After using these two methods to pretrain the LSTM, the team found that they obtained better results than with random initialization of the LSTM (which is the usual starting point).
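To illustrate the general recipe, here is a rough PyTorch sketch of the LM-LSTM idea: pretrain an LSTM on next-word prediction over unlabelled text, then copy its weights into the LSTM of a supervised classifier instead of initializing it randomly. All names and hyperparameters below are placeholders I chose, not the paper's.

```python
# Step 1: unsupervised pretraining (next-word prediction on unlabelled text).
# Step 2: reuse the pretrained LSTM weights to initialize a classifier.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_classes = 10000, 64, 128, 2

# 1) Language-model pretraining
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
lm_head = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (4, 50))   # stand-in for real unlabelled text
hidden, _ = lstm(embed(tokens[:, :-1]))
lm_loss = nn.CrossEntropyLoss()(lm_head(hidden).reshape(-1, vocab_size),
                                tokens[:, 1:].reshape(-1))

# 2) Supervised fine-tuning: start from the pretrained weights, not random ones
classifier_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
classifier_lstm.load_state_dict(lstm.state_dict())  # the "LM-LSTM" initialization
cls_head = nn.Linear(hidden_dim, num_classes)

_, (h_n, _) = classifier_lstm(embed(tokens))     # labelled examples would go here
logits = cls_head(h_n[-1])                       # classify from the final hidden state
```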


More precisely, after experiments on six different datasets, they found that the LM-LSTM was sometimes better and sometimes worse than the original neural networks. The SA-LSTM, however, was better or equal on every one.


What is interesting here is that the two methods added before the LSTM are unsupervised, so you can feed in lots of data without spending weeks labelling it by hand. They manage to bring the error rate down from 20% to 15%, which is great given that unlabelled data costs virtually nothing.


That is it for this article. Next time we will study Attention Is All You Need, so that we can eventually understand OpenAI's article from June. As always, I used articles from towardsdatascience.com as references, and of course the Google paper Semi-Supervised Sequence Learning (https://arxiv.org/pdf/1511.01432.pdf). Hope you enjoyed it, and see you next time!




