Should we shift from BERT to ALBERT?

Naman Bansal
3 min read · Mar 19, 2020

In this article we will see whether we should make the shift from BERT to ALBERT, or whether it is too early to say.

Thought of the day: At the pace at which developments are happening in ML/AI, especially in NLP right now, I believe there will come a time in the next few years when we might get a model that can be applied to every application!
- Naman Bansal

Coming to the article!

What is BERT?

Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained model developed by Google AI Language. It is trained on a very large corpus of around 3,300 million words in total, comprising all of Wikipedia (2,500 million words) and the BookCorpus (800 million words). The fact that it comes pre-trained is a boon for individuals like us who want to try it out or apply it to our own tasks.

It is based on the Transformer architecture, which was proposed in the paper Attention Is All You Need. It is a huge improvement over seq2seq models such as RNNs, which struggle to handle long-range dependencies.
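Because BERT ships pre-trained, getting contextual embeddings out of it takes only a few lines. As a quick illustration (my own sketch, not part of the original post), here is a minimal example using the open-source Hugging Face Transformers library listed in the references; "bert-base-uncased" is one of the publicly released checkpoints.

```python
# Minimal illustration (not from the original post): loading pre-trained BERT
# with the Hugging Face Transformers library and getting contextual embeddings.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The pasta here is delicious!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768) contextual vectors
```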

What is ALBERT?

ALBERT, short for A Lite BERT, achieves better results while reducing the number of parameters by around 80%, which is in itself a huge accomplishment.
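You can get a rough feel for the parameter reduction yourself; the sketch below (again my own, using Hugging Face Transformers, so the exact counts depend on the model size and version you pick) compares the base variants of the two models.

```python
# Rough parameter-count comparison (illustrative sketch, not from the original post).
from transformers import BertModel, AlbertModel

bert = BertModel.from_pretrained("bert-base-uncased")
albert = AlbertModel.from_pretrained("albert-base-v2")

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"BERT-base:   {count(bert) / 1e6:.0f}M parameters")    # roughly 110M
print(f"ALBERT-base: {count(albert) / 1e6:.0f}M parameters")  # roughly 12M
```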

According to this blog,

The success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations. By focusing improvement efforts on these aspects of the model architecture, it is possible to greatly improve both the model efficiency and performance on a wide range of NLP tasks.

Performance Benchmarks in NLP:

  • SQuAD — The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable. Version 2 of SQuAD contains more than 150,000 questions.
  • RACE — A dataset for benchmark evaluation of reading comprehension methods. Collected from English exams for Chinese middle and high school students aged 12 to 18, RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), covering a variety of topics carefully designed to evaluate students’ ability in understanding and reasoning.
State of the art results on SQuAD and RACE benchmarks for ALBERT
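If you want to explore these benchmarks yourself, both are available through the Hugging Face `datasets` library; this is an addition of mine and not something used in the original post.

```python
# Optional sketch: loading the SQuAD 2.0 and RACE benchmarks with the
# Hugging Face `datasets` library (not used in the original post).
from datasets import load_dataset

squad = load_dataset("squad_v2")      # ~130k training + ~12k validation questions
race = load_dataset("race", "all")    # middle- and high-school reading comprehension

print(squad["train"][0]["question"])
print(race["train"][0]["question"])
```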

How to implement a pre-trained ALBERT model for a custom corpus?

Below we will learn how to implement the ALBERT model for a custom corpus.

The dataset used is a restaurant review dataset, and we aim to identify the names of the dishes with the ALBERT model.

For this blog we will use the implementation given by Li Xiaohong:

Step 1: Downloading dataset and preparing files
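The original post follows Li Xiaohong's data-preparation code, which is not reproduced here. As a stand-in, here is a minimal sketch that assumes a hypothetical reviews.csv with a text column and a dishes column (dish names separated by semicolons) and converts each review into word-level BIO tags for the dish-name spans.

```python
# Hypothetical data preparation: reviews.csv and its columns are assumptions made
# for illustration, not the files used in the original post.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("reviews.csv")  # hypothetical file with `text` and `dishes` columns

def tag_tokens(text, dishes):
    """Whitespace-tokenise a review and tag dish words with a simple BIO scheme."""
    tokens = text.split()
    labels = ["O"] * len(tokens)
    dish_words = {w.lower() for dish in dishes for w in dish.split()}
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") in dish_words:
            labels[i] = "I-DISH" if i > 0 and labels[i - 1] != "O" else "B-DISH"
    return tokens, labels

examples = [tag_tokens(text, str(dishes).split(";"))
            for text, dishes in zip(df["text"], df["dishes"])]
train_examples, test_examples = train_test_split(examples, test_size=0.1, random_state=42)
```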

Step 2: Using transformer and defining layers
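Li Xiaohong's implementation defines the Transformer layers explicitly; as a rough equivalent (an assumption on my part, continuing from the Step 1 sketch), the snippet below simply loads the pre-trained ALBERT encoder with a token-classification head from Hugging Face Transformers.

```python
# Sketch continuing from Step 1: pre-trained ALBERT encoder plus a token-level
# classification head (a stand-in for the layer definitions in the original repo).
from transformers import AlbertTokenizerFast, AlbertForTokenClassification

label_list = ["O", "B-DISH", "I-DISH"]  # same tag set as in the Step 1 sketch
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForTokenClassification.from_pretrained(
    "albert-base-v2", num_labels=len(label_list)
)
```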

Step 3: Using LAMB optimiser and fine-tuning ALBERT according to your corpus
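The original post uses the LAMB optimiser bundled with Li Xiaohong's repository; the sketch below swaps in the Lamb implementation from the torch-optimizer package instead, which is my own assumption.

```python
# LAMB optimiser sketch using the torch-optimizer package (an assumption; the
# original post relies on the LAMB implementation in Li Xiaohong's repository).
import torch_optimizer

optimizer = torch_optimizer.Lamb(
    model.parameters(),   # `model` comes from the Step 2 sketch
    lr=1e-5,              # small learning rate, typical for fine-tuning
    weight_decay=0.01,
)
```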

Step 4: Training the model for custom corpus
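A minimal fine-tuning loop over the examples from the Step 1 sketch could look like the following; sub-word pieces inherit the label of the word they come from, and special or padding tokens are masked with -100 so the loss ignores them. This is my own sketch, not the training code from the original repository.

```python
# Minimal fine-tuning loop (sketch): single-example batches for brevity.
import torch

label2id = {label: i for i, label in enumerate(label_list)}

def encode(tokens, labels):
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True,
                    padding="max_length", max_length=128, return_tensors="pt")
    word_ids = enc.word_ids(0)
    # each sub-word piece gets its word's label; special/padding tokens get -100
    enc["labels"] = torch.tensor([[label2id[labels[w]] if w is not None else -100
                                   for w in word_ids]])
    return enc

model.train()
for epoch in range(3):
    for tokens, labels in train_examples:
        batch = encode(tokens, labels)
        loss = model(**batch).loss   # the classification head computes the loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```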

Step 5: Prediction
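Finally, running the fine-tuned model on a new review and keeping the tokens that are not tagged O gives the dish names. Again, this is a sketch that continues from the previous steps, with a made-up example sentence.

```python
# Prediction sketch: extract the tokens the model tags as part of a dish name.
id2label = {i: label for label, i in label2id.items()}
model.eval()

def extract_dishes(review):
    enc = tokenizer(review, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [tok for tok, p in zip(tokens, pred_ids) if id2label[p] != "O"]

print(extract_dishes("The butter chicken and garlic naan were amazing."))
```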

Results

Here we can see that the model successfully identifies the names of the dishes.

Conclusion

While ALBERT has significantly fewer parameters than BERT and gives better results, implementing it is a bit more computationally expensive than BERT due to its structure.

So there still needs to be more development on making the training part faster before ALBERT can outperform BERT completely.

You can check out the full code at this link.

References:

ALBERT paper, Li Xiaohong’s GitHub repository, Google AI blog post on ALBERT, BERT paper, Transformer paper (Attention Is All You Need), Hugging Face open-source Transformers library

Author of the blog:

Naman Bansal — GitHub, LinkedIn
