Hi there! As always, I will try to summarize an article, but this time I'll also give my opinion afterwards.
This article, written by Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov from the Facebook AI Research team, is about using subword information to improve the skipgram model. The idea is that by exploiting the internal structure of words, the model can build good representations even for rare words, by learning rules at the character level (inflected forms or declensions).
The team starts from the skipgram model, whose purpose is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$
where $\mathcal{C}_t$ is the context, i.e. the set of indices of the words surrounding $w_t$, and $w_c$ is a word of that context. An expression for $p(w_c \mid w_t)$ must be chosen, and a softmax is not well suited to this problem, as it amounts to predicting only one word of the context. The authors therefore consider the prediction of context words as a set of independent binary classification problems: the words in the context are positive examples, and negative examples are sampled at random from the vocabulary. With these considerations, the objective (the negative log-likelihood to minimize) becomes:

$$\sum_{t=1}^{T} \Big[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \Big]$$
with $\ell(x) = \log(1 + e^{-x})$ the logistic loss and $\mathcal{N}_{t,c}$ the set of negative examples sampled for the pair $(w_t, w_c)$.
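To make the negative-sampling objective concrete, here is a minimal NumPy sketch of the loss for a single target word, its context words and a few sampled negatives. The helper names (`logistic_loss`, `skipgram_ns_loss`), the toy embeddings and the way the scoring function is passed in are my own illustration, not code from the paper.

```python
import numpy as np

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), written in a numerically stable way
    return np.logaddexp(0.0, -x)

def skipgram_ns_loss(score, target, context_words, negatives):
    """Negative-sampling loss for one target word.

    score(w, c) is any scoring function between a word and a context word
    (for the plain skipgram model, the dot product of their two vectors).
    Context words are positive examples; `negatives` maps each context
    word to the negative words sampled for that pair.
    """
    loss = 0.0
    for c in context_words:
        loss += logistic_loss(score(target, c))       # pull positives closer
        for n in negatives[c]:
            loss += logistic_loss(-score(target, n))  # push negatives away
    return loss

# toy usage with random embeddings, just to show how the pieces fit together
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=10) for w in ["cat", "sat", "mat", "dog", "car"]}
score = lambda w, c: emb[w] @ emb[c]
print(skipgram_ns_loss(score, "cat", ["sat", "mat"],
                       {"sat": ["dog"], "mat": ["car"]}))
```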
The function "s" is the scoring function between our word and the positive or neg
ative examples. The most commonly used way of scoring is by assigning a vector to each word and computing the scalar product between wt and wc. However, the article is about using a different method, to make use of subword information. Each word will now be represented as a group of bags of n-characters, with the special character “<” at the start and “>” at the end. The whole word will be kept as well. For example, the word whose will be represented with n = 3 as: <wh, who, hos, ose, se>, and <whose>. It needs to be noted that the tri-gram who from whose is not the same as the word <who>. It will be those n-grams that will now be represented as vectors, and the scalar product will be done between every n-grams of each words.
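As an illustration of this representation, here is a small Python sketch that extracts the character n-grams of a word with the "<" and ">" boundary symbols and scores a (word, context) pair by summing the n-gram vectors. The helper names (`char_ngrams`, `subword_score`) and the toy embedding tables are mine, not the authors' implementation (the real fastText code additionally hashes n-grams into a fixed number of buckets).

```python
import numpy as np
from collections import defaultdict

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, plus the whole word as a special sequence."""
    padded = f"<{word}>"
    grams = [padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)]
    grams.append(padded)          # keep the whole word, e.g. "<whose>"
    return grams

def subword_score(word, context, ngram_vecs, context_vecs, n_min=3, n_max=6):
    """s(w, c): sum of the word's n-gram vectors, dotted with the context vector."""
    w_vec = sum(ngram_vecs[g] for g in char_ngrams(word, n_min, n_max))
    return float(w_vec @ context_vecs[context])

print(char_ngrams("whose", 3, 3))
# ['<wh', 'who', 'hos', 'ose', 'se>', '<whose>']

# toy n-gram and context embeddings, drawn at random for the example
rng = np.random.default_rng(0)
ngram_vecs = defaultdict(lambda: rng.normal(size=10))
context_vecs = defaultdict(lambda: rng.normal(size=10))
print(subword_score("whose", "who", ngram_vecs, context_vecs))
```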
The model was tested on datasets from 9 different languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian, using Wikipedia data.
An interesting part of the results shows what the three most important n-grams in a word are, depending on the language and the type of word. For example, in French the model puts emphasis on the root and the ending of a verb, which lets a human identify the tense used. In English, the model learns to split compound words like submarine, but also affixes as in politeness.
The test results were compared with two other models, the usual skipgram and cbow, both from the word2vec library. The results vary a lot with a few parameters. One of them is the amount of data used for training: the new model (sisg) performs better with 5% of the German data than the cbow baseline trained on the full dataset, and does the same on English with only 1% of the data.
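To give an idea of how such a comparison can be set up, here is a minimal sketch using gensim's implementations of the three models. Gensim, the toy corpus and the hyperparameters below are my assumptions for illustration, not the authors' actual setup (their implementation was released as the fastText library).

```python
from gensim.models import FastText, Word2Vec

# toy corpus: one tokenized sentence per list; replace with real Wikipedia text
sentences = [
    ["the", "politeness", "of", "the", "submarine", "crew"],
    ["whose", "submarine", "is", "this"],
]

common = dict(vector_size=100, window=5, min_count=1, epochs=10)

sisg_like = FastText(sentences, sg=1, min_n=3, max_n=6, **common)  # subword n-grams
skipgram  = Word2Vec(sentences, sg=1, **common)                    # baseline skipgram
cbow      = Word2Vec(sentences, sg=0, **common)                    # baseline cbow

# the subword model can still build a vector for a word never seen in training
print(sisg_like.wv["politenesses"][:5])
```

The last line shows the practical benefit of subword information: the model with character n-grams can produce a vector for an out-of-vocabulary word by summing the vectors of its n-grams, which the plain skipgram and cbow baselines cannot do.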
Another important parameter is the range of n chosen for the n-grams. The authors arbitrarily chose n-grams of length 3 to 6 and found afterwards that this range works well for English and German. For n < 3, the n-grams apparently do not carry any relevant information, and beyond 6 they seem too long to bring character-level information.
This method is interesting, as it allows training on less data and gives the model a deeper understanding of the language, without making training that much longer. The authors even say that, since it needs less data, it is faster overall despite the decomposition into n-grams (we do not get any precise numbers, however). It seems, though, to really work only on languages written in the Latin alphabet (and maybe especially on Germanic languages).
And that is the main problem: the results only discuss English and German in depth, with other languages like French mentioned in passing. Yet at the start of the article it is said that the model was trained on 9 languages, and in my opinion it would have been better to see the results for every language, compared against the baselines. Not reporting anything about them makes me think they were not good enough to make it into the paper, which is a pity, as it makes the results less convincing.
In the same way, we do not get precise numbers about training speed compared with methods that do not use subword information. I regret that the results are not complete and precise enough to be fully convincing.