Using Language Models

In the last year, language models have changed my approach to natural language processing. Some (relatively) fresh results from XLNet show that large transformer-style models work really well for many language ML tasks. On the surface this seems quite similar to training word embeddings [1], but with the advantage of training far more layers and parameters (weights). As many have pointed out, this is a great boon to any NLP practitioner. And in the first half of 2019, the pace of progress has not slowed down.

There is a downside to this. Replicating or retraining your own models from scratch is becoming increasingly hard. It often means resorting to tricks like accumulating gradients, recalculating some intermediate activations during the backward pass, or simply leaving parts of a model frozen. Saving up money for some time to buy an Nvidia 1080 Ti, only to find that you cannot train many models on it, can be a bummer (luckily there is free Colab, but it feels bad to depend on that). A more serious problem is gauging how much a certain architecture and the training choices really contribute to the final result. Anna Rogers has written a very good piece about this. Another problem is that I do not really know what data has been fed to the model.
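To make the gradient accumulation trick a bit more concrete, here is a minimal, framework-free sketch. It assumes a one-parameter linear model with a mean squared error loss; the toy data and the function names are made up for illustration and do not come from any library. It shows that summing suitably scaled gradients over small micro-batches gives the same gradient as one large batch, which is why a small GPU can imitate a bigger batch size this way.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Accumulate gradients over micro-batches, weighting each by its share of the data."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        # Scale each micro-batch so the running sum equals the full-batch mean.
        total += grad_mse(w, micro_batch) * len(micro_batch) / len(data)
    return total


# Toy data (hypothetical): (x, y) pairs that roughly follow y = 3x.
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]
w = 0.5

full_batch = grad_mse(w, data)
accumulated = accumulated_grad(w, data, micro_batch_size=2)
print(full_batch, accumulated)
```

In a real training loop the same idea means calling the backward pass several times before a single optimizer step.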

But setting these side notes aside, language models help enormously, and I would like to show two libraries that make working with them incredibly easy.

Zalando's Flair

Flair is, together with ULMFiT, one of the older RNN-style language models. This might be why it is not SOTA anymore, but it still performs very well, and the library is incredibly easy to use, with a built-in model downloader, training code and all kinds of tooling around the models. They also have a simple tutorial, which I can recommend if you want to get started quickly.

To show you how easy it is, here is a short sample using their pre-trained sequence model, which predicts NER tags.

from flair.data import Sentence
from flair.models import SequenceTagger

sentence = Sentence('German Chancellor Angela Merkel and British Prime Minister Boris Johnson .....')
tagger = SequenceTagger.load('ner')

# Run the tagger, then print every entity span it found
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)

And if we want to train our own tagger on top of several stacked embeddings:

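import flair.datasets
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.embeddings import FlairEmbeddings, WordEmbeddings, StackedEmbeddings

corpus = flair.datasets.WIKINER_ENGLISH()
ner_dict = corpus.make_tag_dictionary('ner')

# The embedding choices below are illustrative; any mix of embeddings can be stacked
embedding_types = [
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]
embeddings = StackedEmbeddings(embeddings=embedding_types)

# hidden_size is an example value, not a tuned setting
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=ner_dict,
    tag_type='ner',
)
trainer = ModelTrainer(tagger, corpus)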

There are even built-in ColumnCorpus, CSVCorpus and ClassificationCorpus dataset loaders, including one for fastText-style inputs (also quite a useful toolkit to have).
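For reference, a fastText-style classification file holds one example per line, with each label prefixed by `__label__` and followed by the text. The sketch below parses such lines in plain Python so you can see the layout. It does not use Flair, and `parse_fasttext_line` is a hypothetical helper name of my own.

```python
def parse_fasttext_line(line, label_prefix="__label__"):
    """Split one fastText-style line into (labels, text)."""
    labels = []
    tokens = line.strip().split()
    # Leading tokens that start with the prefix are labels; the rest is the text.
    while tokens and tokens[0].startswith(label_prefix):
        labels.append(tokens.pop(0)[len(label_prefix):])
    return labels, " ".join(tokens)


# Made-up sample lines, the second one carrying two labels.
sample_lines = [
    "__label__positive I really enjoyed this movie",
    "__label__negative __label__boring Two hours I will never get back",
]

parsed = [parse_fasttext_line(line) for line in sample_lines]
for labels, text in parsed:
    print(labels, "->", text)
```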

Huggingface's pytorch-transformers

This is a well-known PyTorch reimplementation of modern architectures such as BERT and XLNet (the originals were released with help from Google, in TensorFlow). Not only are the models pretty easy to work with, the pre-trained weights are also available on PyTorch Hub and can be downloaded and used with built-in tools.

Using a pretrained network as the front part of your own network can be as easy as this:

import torch
from pytorch_transformers import BertModel, BertTokenizer

pretrained_modelname = "bert-base-uncased"

# This will download the model if it is not found in local storage
tokenizer = BertTokenizer.from_pretrained(pretrained_modelname)
model = BertModel.from_pretrained(pretrained_modelname)

encoded_text = tokenizer.encode("Enter the text you need the embeddings from here")
input_tensor = torch.tensor([encoded_text])
print("Input tensor: ", input_tensor)
with torch.no_grad():
    output_tuple = model(input_tensor)
    last_hidden_states = output_tuple[0]
print("Last hidden states: ", last_hidden_states)
print("Shape (size): ", last_hidden_states.size())

Now, this is already a great solution for the many cases where simply using the embeddings as input features gives better results (especially when coming from word-vector embeddings). And, as the Flair example shows, it leaves room to experiment with combining embeddings (simply concatenating them per word).
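Concatenating embeddings per word can be sketched in a few lines of plain Python. The vectors below are made-up toy numbers; in practice one list would come from a word-vector lookup and the other from a language model. `stack_embeddings` is a hypothetical helper, not part of either library.

```python
def stack_embeddings(*embedding_sources):
    """Concatenate per-token vectors from several sources into one vector per token."""
    lengths = {len(source) for source in embedding_sources}
    if len(lengths) != 1:
        raise ValueError("All sources must embed the same number of tokens")
    # zip(*sources) walks token by token; each token's vectors are joined end to end.
    return [
        [value for vector in token_vectors for value in vector]
        for token_vectors in zip(*embedding_sources)
    ]


tokens = ["language", "models"]
# Toy vectors (hypothetical values): 2 dimensions from one source, 3 from the other.
word_vectors = [[0.1, 0.2], [0.3, 0.4]]
contextual_vectors = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]

stacked = stack_embeddings(word_vectors, contextual_vectors)
for token, vector in zip(tokens, stacked):
    print(token, len(vector), vector)
```

Note that subword tokenizers produce a different number of tokens than a word-level lookup, so the subword vectors would first have to be pooled back to words. The sketch assumes that has already happened.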

But there is more. There are built-in models ready for word-level classification (like NER tagging) and for generic text classification, and there are example run scripts in the examples folder.

To extend the model training scripts so they can run generic text classification problems, I added the following code:

class ImdbProcessor(DataProcessor):
    """Processor for the special IMDB dataset."""

    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "imdbtrain.tsv")))
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "imdbtrain.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "imdbtest.tsv")), "dev")

    def get_labels(self):
        """Still same classes luckily."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for line in lines:
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = None
            label = line[2]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

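processors = {
    "imdb": ImdbProcessor,
}

output_modes = {
    "imdb": "classification",
}

# Number of labels per task; the dictionary name is assumed from the GLUE example script
GLUE_TASKS_NUM_LABELS = {
    "imdb": 2,
}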

In this case I named it IMDB to make it work with the IMDB classification dataset. It can now work on any dataset with tab-separated values [2].
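The file layout is easy to get wrong, so here is a small sketch of writing and reading such a tab-separated file with only the standard library. The file name matches the one the processor reads. The two reviews, the temporary directory and the `Example` tuple (a stand-in for the script's `InputExample`) are made up for illustration.

```python
import csv
import os
import tempfile
from collections import namedtuple

# Stand-in for the example container used by the training script (hypothetical).
Example = namedtuple("Example", ["guid", "text_a", "text_b", "label"])

rows = [
    ["0", "A wonderful, moving film.", "1"],
    ["1", "Dull plot and wooden acting.", "0"],
]

data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "imdbtrain.tsv")

# Write one example per line: id <TAB> text <TAB> label, without a header row.
with open(path, "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read the file back and build examples the same way the processor does.
with open(path, newline="", encoding="utf-8") as handle:
    lines = list(csv.reader(handle, delimiter="\t"))

examples = [
    Example(guid="train-%s" % line[0], text_a=line[1], text_b=None, label=line[2])
    for line in lines
]
print(examples[0])
```

Texts that contain tabs or newlines would need cleaning before they are written in this format.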

Other libraries

Some libraries I am watching but haven't tested yet.

spaCy embedding pretraining: I use spaCy quite often for fast text cleaning/mangling and for creating rule matchers based on regex in combination with NER tags. Previously I have tested and used their built-in NER training and classification modules. My bet is that this will be just as great.

Deepset FARM: By the folks who also released a BERT model trained on German. It looks very neatly done and is built on top of pytorch_transformers.

JohnSnow NLP: A colleague of mine tried to do NLP on Spark some time ago and used this library. He checked it out again this week and saw a bunch of new features. I am not completely sold on distributed text processing just yet, but I do want to try it out.

There is a lot more to say, but I am going to leave it at this for now. Hopefully this will get you interested in trying out these great techniques and in digging deeper into their examples and source code.

  1. Word embeddings can still be used, although most models now learn embeddings for subword units. See sentencepiece or GPT-2's encoder.

  2. In this case col 0: id, col 1: text, col 2: class labels → 0 or 1