What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. It is Google's technique for pre-training language representations for NLP. This means the machine learning community can now use BERT models that have already been pre-trained on a very large corpus (the BERT paper reports pre-training on English Wikipedia, about 2,500 million words, plus the BooksCorpus) and fine-tune them for a wide variety of NLP tasks such as question answering, named entity recognition (NER), and classification tasks like sentiment analysis.
In the BERT paper, the authors present two model sizes: BERT Base and BERT Large. Both are stacks of encoder layers: 12 for Base and 24 for Large. If you understand the concept of Transformers, you will see that BERT is essentially the encoder stack of the Transformer, trained with the same attention mechanism. But why is it called bidirectional?
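If you want to verify these sizes yourself, here is a minimal sketch using the Hugging Face transformers library and its "bert-base-uncased" / "bert-large-uncased" checkpoints (an assumption on my part; the paper itself predates this library):

```python
# Inspect the two model sizes via their configs, assuming the
# Hugging Face transformers library is installed.
from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")

print(base.num_hidden_layers, base.hidden_size)    # 12, 768
print(large.num_hidden_layers, large.hidden_size)  # 24, 1024
```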
What does bidirectional mean?
The Transformer encoder reads the entire sequence of words at once, which is the opposite of directional models that read the input sequentially, either from left to right or from right to left. This bidirectional approach helps the model learn the meaning and intent of a word based on its entire surrounding context. Since we will use BERT for toxicity classification, we will explain only the steps BERT takes for classification tasks.
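You can see bidirectionality in action with a masked-word prediction sketch, again assuming the Hugging Face transformers library (the example sentence is mine, not from the paper): the model can only pick a sensible word for [MASK] by using the context on both sides of it.

```python
# BERT fills in a masked word using context from BOTH sides of [MASK].
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# "to deposit his money" (right-hand context) is what makes
# "bank" the likely completion here.
for pred in fill("The man went to the [MASK] to deposit his money."):
    print(pred["token_str"], round(pred["score"], 3))
```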
What is the input of BERT?
The input to BERT is a special sequence that starts with the [CLS] token, which stands for classification. As in the Transformer, BERT takes a sequence of word embeddings (vectors) as input, which is fed from the first encoder layer up through the last layer in the stack. Each layer in the stack applies self-attention to the sequence and then passes the result through a feed-forward network before delivering it to the next encoder layer.
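Here is a small sketch of what that input looks like in practice, assuming the Hugging Face BertTokenizer (the example sentence is hypothetical):

```python
# The tokenizer prepends [CLS] and appends [SEP] automatically.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("You are a wonderful person")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'you', 'are', 'a', 'wonderful', 'person', '[SEP]']
```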
What is the output of BERT?
The output of the BERT model contains one vector of size hidden_size (768 for BERT Base) per input token, and the vector at the first position corresponds to the [CLS] token. This output vector can be used as the input to our classifier neural network for classifying the toxicity of the text. In the BERT paper, the authors achieve great results using only a single-layer neural network as the classifier.
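To make this concrete, here is a minimal sketch of a single-layer classifier on top of the [CLS] vector, assuming PyTorch and the Hugging Face transformers library; the binary toxic/non-toxic setup and the input sentence are our running example, not the paper's:

```python
# Feed the [CLS] output vector into one linear layer for classification.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # 768 -> 2 labels

inputs = tokenizer("You are a wonderful person", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# Position 0 of the last hidden state is the [CLS] vector.
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
logits = classifier(cls_vector)                  # shape: (1, 2)
print(logits.softmax(dim=-1))
```

In a real setup, the linear layer (and optionally BERT itself) would be fine-tuned on labeled toxic-comment data; the snippet above only shows the forward pass.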