Pre-trained Transformers with Hugging Face

Get started with the transformers package from Hugging Face for sentiment analysis, translation, zero-shot text classification, summarization, and named-entity recognition (English and French)

Transformers are certainly among the hottest deep learning models at the moment.

Originally, transformers were introduced as a novel architecture for language translation [Attention Is All You Need, 2017].

And in fact, they are still mainly used for natural language processing. Here are a few examples:

  • text classification:
    • sentiment analysis: is this text positive or negative?
    • document sorting: in which folder should this document or email go?
  • text generation:
    • write a blog post from scratch about what to do in the garden in spring. This is one of the things GPT-3 can do.
  • named-entity recognition:
    • extract important entities from the text, such as person names, dates, prices, locations, etc.
  • part-of-speech tagging:
    • tag parts of the text according to their grammatical role: nouns, verbs, adjectives, ...
  • translation, of course

In this article, you will use pre-trained transformer models to perform some of these tasks.

We are going to use the Hugging Face library, which provides an easy, high-level interface to either TensorFlow or PyTorch for transformers.

Hugging Face also maintains a hub, which makes it easy to share and access models and datasets.

So let's get started!

You can run this tutorial on Google Colab by clicking here.

There is no need for a specific background in programming or machine learning, you'll see that Hugging Face makes it very easy for anybody to use pre-trained transformer models.

Installation on Google Colab

The transformers package is not installed by default on Google Colab. So let's install it with pip:

In [1]:
!pip install transformers[sentencepiece]
Collecting transformers[sentencepiece]
  Downloading https://files.pythonhosted.org/packages/b5/d5/c6c23ad75491467a9a84e526ef2364e523d45e2b0fae28a7cbe8689e7e84/transformers-4.8.1-py3-none-any.whl (2.5MB)
     |████████████████████████████████| 2.5MB 25.9MB/s 
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (20.9)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (3.0.12)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (1.19.5)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
     |████████████████████████████████| 3.3MB 36.5MB/s 
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (4.5.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (2.23.0)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (4.41.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (2019.12.20)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
     |████████████████████████████████| 901kB 32.8MB/s 
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423fe0b/huggingface_hub-0.0.12-py3-none-any.whl
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (3.13)
Collecting sentencepiece==0.1.91; extra == "sentencepiece"
  Downloading https://files.pythonhosted.org/packages/f2/e2/813dff3d72df2f49554204e7e5f73a3dc0f0eb1e3958a4cad3ef3fb278b7/sentencepiece-0.1.91-cp37-cp37m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 24.4MB/s 
Requirement already satisfied: protobuf; extra == "sentencepiece" in /usr/local/lib/python3.7/dist-packages (from transformers[sentencepiece]) (3.12.4)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers[sentencepiece]) (2.4.7)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < "3.8"->transformers[sentencepiece]) (3.4.1)
Requirement already satisfied: typing-extensions>=3.6.4; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < "3.8"->transformers[sentencepiece]) (3.7.4.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers[sentencepiece]) (2021.5.30)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers[sentencepiece]) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers[sentencepiece]) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers[sentencepiece]) (3.0.4)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers[sentencepiece]) (1.15.0)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers[sentencepiece]) (7.1.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers[sentencepiece]) (1.0.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from protobuf; extra == "sentencepiece"->transformers[sentencepiece]) (57.0.0)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, sentencepiece, transformers
Successfully installed huggingface-hub-0.0.12 sacremoses-0.0.45 sentencepiece-0.1.91 tokenizers-0.10.3 transformers-4.8.1

Sentiment analysis in English

In this article, we will use the high-level pipeline interface, which makes it extremely easy to use pre-trained transformer models.

Basically, we just need to tell the pipeline what we want to do, and possibly which model to use for the task.

Here we're going to do sentiment analysis in English, so we select the sentiment-analysis task, and the default model:

In [3]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

The pipeline is ready, and we can now use it:

In [3]:
classifier(["this is a great tutorial, thank you", 
            "your content just sucks"])
Out[3]:
[{'label': 'POSITIVE', 'score': 0.9998582601547241},
 {'label': 'NEGATIVE', 'score': 0.9971919059753418}]

We sent two sentences through the pipeline. The first one is predicted to be positive and the second one negative with very high confidence.

Sounds good!

Now let's see what happens if we send in French sentences:

In [4]:
classifier(["Ton tuto est vraiment bien", 
            "il est complètement nul"])
Out[4]:
[{'label': 'POSITIVE', 'score': 0.7650704979896545},
 {'label': 'POSITIVE', 'score': 0.8282670974731445}]

This time, the classification does not work...

Indeed, the second sentence, which means "this tutorial is complete crap", is classified as positive.

That's not a surprise: the default model for the sentiment analysis task has been trained on English text, so it does not understand French.

Sentiment analysis in Dutch, German, French, Spanish and Italian

So what can you do if you want to work with text in another language, say French?

You just need to search the hub for a French classification model.

Several models are available, and I decided to select nlptown/bert-base-multilingual-uncased-sentiment.

We can specify this model as the one to be used when we create our sentiment-analysis pipeline:

In [5]:
multilang_classifier = pipeline("sentiment-analysis", 
                                model="nlptown/bert-base-multilingual-uncased-sentiment")

In [6]:
multilang_classifier(["Ton tuto est vraiment bien", 
                      "il est complètement nul"])
Out[6]:
[{'label': '5 stars', 'score': 0.5787978172302246},
 {'label': '1 star', 'score': 0.9223358035087585}]

And it worked! The second sentence is properly classified as very negative.

You might be wondering why the confidence for the first sentence is lower. It's probably because this sentence also scores high on '4 stars'.
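
To double-check this hypothesis, we can build a second pipeline that returns the scores of all five ratings. Passing return_all_scores=True at pipeline creation worked with the version of transformers installed above; more recent releases replace it with top_k=None, so treat this as a sketch to adapt to your version:

multilang_classifier_all = pipeline("sentiment-analysis",
                                    model="nlptown/bert-base-multilingual-uncased-sentiment",
                                    return_all_scores=True)
# Returns one list of five {'label', 'score'} dicts per input sentence
multilang_classifier_all(["Ton tuto est vraiment bien"])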

Now let's try with an actual review for a restaurant near my place:

In [8]:
import pprint
sentence="Contente de pouvoir retourner au restaurant... Quelle déception... L accueil peu chaleureux... Un plat du jour plus disponible à 12h45...rien à me proposer à la place... Une pizza pas assez cuite et pour finir une glace pleine de glaçons... Et au gout très fade... Je pensais que les serveuses seraient plus aimable à l idée de retrouver leur clientèle.. Dommage"
pprint.pprint(sentence)
('Contente de pouvoir retourner au restaurant... Quelle déception... L accueil '
 'peu chaleureux... Un plat du jour plus disponible à 12h45...rien à me '
 'proposer à la place... Une pizza pas assez cuite et pour finir une glace '
 'pleine de glaçons... Et au gout très fade... Je pensais que les serveuses '
 'seraient plus aimable à l idée de retrouver leur clientèle.. Dommage')
In [9]:
multilang_classifier([sentence])
Out[9]:
[{'label': '2 stars', 'score': 0.5843755602836609}]

2 stars! On Google Reviews, this review has 1 star. Not a bad prediction.
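
By the way, the labels are plain strings like '1 star' or '5 stars'. If you need a numeric rating downstream, a tiny helper (my own, not part of the library) does the job:

def to_stars(prediction):
    # e.g. {'label': '2 stars', 'score': ...} -> 2
    return int(prediction['label'].split()[0])

to_stars(multilang_classifier([sentence])[0])  # 2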

Translation English-French

Let's try and do a bit of translation, from English to French.

Again, we search the hub, and we end up with this pipeline:

In [10]:
en_to_fr = pipeline("translation_en_to_fr", 
                    model="Helsinki-NLP/opus-mt-en-fr")

In [12]:
en_to_fr("your tutorial is really good")
Out[12]:
[{'translation_text': 'votre tutoriel est vraiment bon'}]

This works well. Let's translate in the other direction. For this, we need to change the task and the model:

In [13]:
fr_to_en = pipeline("translation_fr_to_en", 
                    model="Helsinki-NLP/opus-mt-fr-en")

In [14]:
fr_to_en("ton tutoriel est super")
Out[14]:
[{'translation_text': 'Your tutorial is great.'}]

Excellent translation!
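
Note that, just like the classification pipelines, the translation pipelines accept a list of sentences, so you can translate a whole batch in one call:

fr_to_en(["ton tutoriel est super",
          "je n'ai pas tout compris"])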

Zero-shot classification in French

Nowadays, very large deep-learning models are trained on very large datasets collected from the internet.

These models already encode a lot of knowledge about language, so they need little additional training to perform well on a new task.

Typically, it's possible to fine-tune these models to a specific use case, like text classification, with a very small additional dataset. This is called few-shot learning.

And sometimes we can even do zero-shot learning: the model performs a specific task without any task-specific training. This is what we're going to do now.

We search the hub for a French zero-shot classification model, and we create this pipeline:

In [15]:
classifier = pipeline("zero-shot-classification", 
                      model="BaptisteDoyen/camembert-base-xlni")

In the example below, I propose a sequence to classify into categories, and I also specify the categories.

It's important to note that the model has not been trained on these categories: you can change them at will!

In [16]:
sequence = "Colin est en train d'écrire un article au sujet du traitement du langage naturel"
candidate_labels = ["science","politique","education", "news"]
classifier(sequence, candidate_labels)     
Out[16]:
{'labels': ['science', 'news', 'education', 'politique'],
 'scores': [0.4613836407661438,
  0.20861364901065826,
  0.20573210716247559,
  0.12427058815956116],
 'sequence': "Colin est en train d'écrire un article au sujet du traitement du langage naturel"}

The predicted probabilities are actually quite sound! This sentence is indeed about science, news, and education. And not at all related to politics.

But if we try this one:

In [21]:
sequence = "Laurent Wauquiez reconduit à la tête de la région Rhône-Alpes-Auvergne à la suite du deuxième tour des élections."
candidate_labels = ["politique", "musique"]
classifier(sequence, candidate_labels)   
Out[21]:
{'labels': ['politique', 'musique'],
 'scores': [0.6573010087013245, 0.34269899129867554],
 'sequence': 'Laurent Wauquiez reconduit à la tête de la région Rhône-Alpes-Auvergne à la suite du deuxième tour des élections.'}

It works!

Feel free to try other sentences and other categories. You can also change the model if you wish to do zero-shot classification in English or in another language.
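
Two parameters are worth knowing about. By default, the pipeline turns each category into an English hypothesis ("This example is {}."), which is not ideal for French text, and the scores of the candidate labels sum to one. You can supply your own template, and score the labels independently when they are not mutually exclusive. The template below is just a plausible choice of mine, and depending on your transformers version the flag is multi_label (multi_class in older releases):

classifier(sequence,
           candidate_labels,
           hypothesis_template="Ce texte parle de {}.",  # French hypothesis template
           multi_label=True)                             # labels scored independently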

Summarization in French

Summarizing text is an interesting application of transformers.

Here, we use a model trained on a dataset obtained by scraping https://actu.orange.fr/, again found on the Hugging Face hub:

In [22]:
summarizer = pipeline("summarization", 
                       model="moussaKam/barthez-orangesum-title")

Let's use the first two paragraphs of an article about Covid-19 published in Le Monde:

In [26]:
import pprint
sentence = "La pandémie ne marque pas le pas. Le variant Delta poursuit son essor planétaire au grand dam de pays impatients de retrouver une vie normale. La pandémie a fait près de quatre millions de morts dans le monde depuis que le bureau de l’Organisation mondiale de la santé (OMS) en Chine a fait état de l’apparition de la maladie fin décembre 2019, selon un bilan établi par l’Agence France-Presse (AFP) à partir de sources officielles, lundi à 12 heures. Les Etats-Unis sont le pays le plus touché tant en nombre de morts (603 967) que de cas. Le Brésil, qui compte 513 474 morts, est suivi par l’Inde (396 730), le Mexique (232 564) et le Pérou (191 899), le pays qui déplore le plus de morts par rapport à sa population. Ces chiffres, qui reposent sur les bilans quotidiens des autorités nationales de santé, sont globalement sous-évalués. L’Organisation mondiale de la santé (OMS) estime que le bilan de la pandémie pourrait être deux à trois fois plus élevé que celui officiellement calculé."
pprint.pprint(sentence)
('La pandémie ne marque pas le pas. Le variant Delta poursuit son essor '
 'planétaire au grand dam de pays impatients de retrouver une vie normale. La '
 'pandémie a fait près de quatre millions de morts dans le monde depuis que le '
 'bureau de l’Organisation mondiale de la santé (OMS) en Chine a fait état de '
 'l’apparition de la maladie fin décembre 2019, selon un bilan établi par '
 'l’Agence France-Presse (AFP) à partir de sources officielles, lundi à 12 '
 'heures. Les Etats-Unis sont le pays le plus touché tant en nombre de morts '
 '(603 967) que de cas. Le Brésil, qui compte 513 474 morts, est suivi par '
 'l’Inde (396 730), le Mexique (232 564) et le Pérou (191 899), le pays qui '
 'déplore le plus de morts par rapport à sa population. Ces chiffres, qui '
 'reposent sur les bilans quotidiens des autorités nationales de santé, sont '
 'globalement sous-évalués. L’Organisation mondiale de la santé (OMS) estime '
 'que le bilan de la pandémie pourrait être deux à trois fois plus élevé que '
 'celui officiellement calculé.')
In [30]:
summarizer(sentence, max_length=80)
Out[30]:
[{'summary_text': 'Coronavirus : près de 4 millions de morts dans le monde'}]

Terse, but not bad!
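
If you'd prefer a longer summary, the pipeline forwards generation arguments to the underlying model, so you can also set a minimum length (the exact output will depend on the model and version):

# Force the generated summary to be at least 30 tokens long
summarizer(sentence, min_length=30, max_length=80)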

Named entity recognition in French

Named entity recognition can serve as the basis of many interesting apps!

For example, one could analyse financial reports, looking for dates, prices, and company names.

Let's see how to do this.

Here, we use a French equivalent of BERT, called CamemBERT, fine-tuned for named-entity recognition:

In [4]:
ner = pipeline("token-classification", model="Jean-Baptiste/camembert-ner")

In [7]:
nes = ner("Colin est parti à Saint-André acheter de la mozzarella")
pprint.pprint(nes)
[{'end': 5,
  'entity': 'PER',
  'index': 1,
  'score': 0.94243556,
  'start': 0,
  'word': '▁Colin'},
 {'end': 23,
  'entity': 'LOC',
  'index': 5,
  'score': 0.99605554,
  'start': 17,
  'word': '▁Saint'},
 {'end': 24,
  'entity': 'LOC',
  'index': 6,
  'score': 0.9967083,
  'start': 23,
  'word': '-'},
 {'end': 29,
  'entity': 'LOC',
  'index': 7,
  'score': 0.99609375,
  'start': 24,
  'word': 'André'}]

We need to do a bit of post-processing to merge consecutive tokens that share the same entity type.

Here is a simple algorithm to do so (it can certainly be improved!):

In [10]:
cur = None  # entity type of the group being built
agg = []    # tokens accumulated for the current group
for ne in nes:
    entity = ne['entity']
    if entity != cur and agg:
        # The entity type changed: print the finished group.
        print(cur, ner.tokenizer.convert_tokens_to_string(agg))
        agg = []
    cur = entity
    agg.append(ne['word'])
# Print the last group.
print(cur, ner.tokenizer.convert_tokens_to_string(agg))
PER Colin
LOC Saint-André

We found two named entities:

  • PERSON: Colin
  • LOCATION: Saint-André
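
Note that the token-classification pipeline can do this grouping for us: starting with transformers 4.7, it accepts an aggregation_strategy argument. A minimal sketch:

# Let the pipeline merge tokens belonging to the same entity
ner_grouped = pipeline("token-classification",
                       model="Jean-Baptiste/camembert-ner",
                       aggregation_strategy="simple")
ner_grouped("Colin est parti à Saint-André acheter de la mozzarella")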

Outlook

In this article, you have learned how easy it is to use pre-trained transformers with the transformers package from Hugging Face.

This is possible thanks to all the researchers who designed and trained the models, and who shared them on the Hugging Face hub.

If you want to know more about transformers, I encourage you to follow the Hugging Face course. This should take you no more than a day, and probably less if you're already familiar with deep learning.

In the next article, we will see how to fine-tune a transformer for a specific task, with transfer learning.


Please let me know what you think in the comments! I’ll try and answer all questions.

And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)
