
Finding diseases present in a given text.


Model Overview

INTRODUCTION:

BERT has become a sensational model in the machine learning community by providing state-of-the-art results on various NLP tasks such as NER (Named Entity Recognition), question answering, etc. It is a technique developed by Google.

BERT's key technical innovation is applying the Transformer, a well-known attention model, to language modeling. This is in contrast to previous approaches, where a text sequence was examined either left-to-right or as a combination of separately trained left-to-right and right-to-left passes. BERT's results show that a bidirectionally trained language model can develop a deeper sense of language context and background than single-direction language models. BERT is pre-trained on two NLP tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), which previous models lacked.

The research team behind BERT describes it as:
“BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from the unlabeled text by jointly conditioning on both the left and right contexts. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.”

NEED:



Consider the word "bank" in two sentences such as "I need to go to the bank to withdraw money" and "We sat on the bank of the river." If we attempt to predict the meaning of "bank" by taking only the left or only the right context, we will misunderstand at least one of the two sentences. One approach is to consider both the left and the right context before making a prediction. That is exactly what BERT does: being a bidirectional model, it handles these kinds of errors easily.
The most remarkable part of BERT is that we can fine-tune it by adding only a couple of extra output layers to create state-of-the-art models for a wide range of NLP tasks.

Pre-BERT era:

Before BERT came into existence, different models were used for understanding the context of a language.

Word2Vec and GloVe

Learning the representation/context of a language by pre-training a model on text data was done using word embedding techniques like Word2Vec and GloVe. These embeddings changed the way we performed NLP tasks. We now had embeddings that could capture contextual information and relationships among words. These embeddings were used to train models on downstream NLP applications and improve their predictions.



One constraint of these embeddings was that they used extremely shallow language models. This limited the amount of data they could parse and how much of the language's context they could capture, which motivated the use of deeper and more complex language models (layers of LSTMs and GRUs).

Another key constraint was that these models did not take the context of a word into account when producing its representation. Take the "bank" example above: the same word has different meanings in different contexts, yet embedding techniques like Word2Vec give the same vector for "bank" in both sentences.
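The snippet below is a minimal sketch of this limitation, assuming the gensim library and its downloadable "glove-wiki-gigaword-50" GloVe vectors; it simply shows that a static embedding lookup returns the same vector for "bank" regardless of the surrounding sentence.

```python
# Minimal sketch: static embeddings ignore context.
# Assumes gensim is installed and can download the GloVe vectors.
import gensim.downloader as api
import numpy as np

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained GloVe embeddings

# Two sentences where "bank" has different meanings.
sentence_a = "I deposited cash at the bank"
sentence_b = "We sat on the bank of the river"

# The lookup never sees the sentence, only the word itself.
vec_a = vectors["bank"]
vec_b = vectors["bank"]
print(np.allclose(vec_a, vec_b))  # True: one vector per word, context ignored
```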


ELMo and ULMFiT



Then came ELMo to overcome the problem of polysemy, i.e. the same word having different meanings depending on its context. Instead of training shallow feed-forward networks as in Word2Vec, the word embeddings are learned using stacked layers of bidirectional LSTMs. Because the same word can receive multiple ELMo embeddings depending on its context, the polysemy problem is addressed. Creating such a model on a large corpus is known as pre-training; the resulting model can then be trained further on the dataset we care about.

ULMFiT took this a step further: by fine-tuning a pre-trained language model, it can give strong results even when the target data is small. This process is called transfer learning, and ULMFiT proved especially effective for classification problems.

This technique is behind many of the NLP breakthroughs that followed.

OpenAI’s GPT



OpenAI's GPT is the next step when it comes to pre-training and fine-tuning. Instead of an LSTM-based architecture, it uses a Transformer-based architecture for language modeling. Used for transfer learning, this model can be fine-tuned not just for classification but for many NLP tasks such as common-sense reasoning.

This paved the way for the Transformer-based model BERT.

How BERT works:

Architecture:

As discussed earlier, BERT is a Transformer-based model. There are currently two variants:
BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters



The encoder layers used in this architecture are those of the Transformer.
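As a quick check of these numbers, the sketch below loads both variants with the Hugging Face transformers library (an assumed tooling choice, not something this report specifies) and prints their layer, head, and approximate parameter counts.

```python
# Sketch: inspecting the two BERT variants with Hugging Face transformers.
# The library and checkpoint names are assumptions for illustration.
from transformers import BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {model.config.num_hidden_layers} layers, "
          f"{model.config.num_attention_heads} attention heads, "
          f"~{n_params / 1e6:.0f}M parameters")
```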

Feature Extraction:

As stated in the introduction, BERT is pre-trained on two NLP tasks: MLM and NSP.

Masked Language Modeling(MLM):



GPT falls short on this task because it is unidirectional, so information from one direction is lost. That is where MLM comes in.

Consider the sentence "Apple is of red color." In MLM, a word, here "red", is replaced with the "[MASK]" token, so the sentence becomes "Apple is of [MASK] color.", and the model tries to predict the masked word ("red") from the context of the sentence. This is the high-level working of a masked language model.
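The sketch below demonstrates this with a pre-trained BERT checkpoint via the Hugging Face "fill-mask" pipeline; the library and checkpoint choices are assumptions made for illustration.

```python
# Sketch: masked-word prediction with a pre-trained BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks candidate words for the [MASK] position by probability.
for pred in fill_mask("Apple is of [MASK] color."):
    print(pred["token_str"], round(pred["score"], 3))
```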

For training the MLM objective, the authors selected 15% of the tokens for prediction and handled the selected tokens as follows:
80% of the time the words were replaced with the masked token [MASK].
10% of the time the words were replaced with random words.
10% of the time the words were left unchanged.

MLM helps the model understand the context of and relationships between the words in a sentence.

Next Sentence Prediction(NSP)



NSP is used where the task requires understanding the relationship between sentences, for example in a question answering system.
Given two sentences P and Q, the model simply predicts whether Q is actually the sentence that follows P or just a random sentence.

Like MLM, the authors also defined the conditions for training the NSP model (a small inference sketch follows the list):
For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence
For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus
The labels for the first case would be ‘IsNextSentence’ and ‘NotNextSentence’ for the second case
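The sketch below runs NSP inference with BERT's pre-trained next-sentence head via Hugging Face transformers; the library, checkpoint, and example sentences are assumptions for illustration.

```python
# Sketch: next-sentence prediction with BERT's pre-trained NSP head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_p = "He was admitted to the hospital."
sentence_q = "The doctors ran a series of tests."

inputs = tokenizer(sentence_p, sentence_q, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0: Q follows P ("IsNextSentence"); index 1: Q is random ("NotNextSentence").
probs = torch.softmax(logits, dim=-1)[0]
print(f"IsNextSentence: {probs[0]:.3f}, NotNextSentence: {probs[1]:.3f}")
```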

Pre-processing of the data:



BERT requires the input data to be represented according to a specific set of rules before it is given to the model for training. The input representation is the sum of three embeddings (a tokenizer sketch follows the list):

Position Embeddings: BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of the Transformer which, unlike an RNN, is not able to capture "sequence" or "order" information.

Segment Embeddings: BERT can also take sentence pairs as input for some tasks. It therefore learns a distinct embedding for the first and the second sentence to help the model distinguish between them: all tokens of the first sentence share one segment embedding (often labeled EA) and all tokens of the second sentence share another (EB).

Token Embeddings: These are the embeddings learned for the specific token from the WordPiece token vocabulary.
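The sketch below shows, using the Hugging Face tokenizer (an assumed tooling choice), how a sentence pair is turned into the token ids and segment ids that feed the token and segment embeddings; position ids are generated internally by the model.

```python
# Sketch: preparing a sentence pair for BERT.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("He has a fever.", "The diagnosis was malaria.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["input_ids"])       # looked up in the token (WordPiece) embeddings
print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B
```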


APPLICATION:


BioBERT


By leveraging this technology we can build domain-specific NER systems. Here, the model is specific to the medical domain and is able to identify diseases present in a given sentence. BioBERT is a BERT model that is pre-trained on biomedical corpora, and it can be fine-tuned to meet our requirements.


Data used


The pre-trained BioBERT model is trained on PubMed abstracts, PMC full-text articles, and various medical-related journals and publications. In the proposed model it is further fine-tuned on biomedical named entity recognition datasets (NCBI Disease and BC5CDR). These datasets comprise sentences tagged at the word level: a word is labeled 'B-disease' or 'I-disease' if it is part of a disease name and 'O' otherwise. This is what allows the model to identify disease names present in a given text.
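For concreteness, here is a minimal sketch of what one tagged sentence in this scheme might look like; the sentence and its tags are illustrative, not taken from the actual datasets.

```python
# Sketch: the word-level BIO tagging scheme for disease NER.
# 'B-disease' marks the first word of a disease name, 'I-disease' marks
# the following words, and 'O' marks everything else.
tagged_sentence = [
    ("The", "O"),
    ("patient", "O"),
    ("was", "O"),
    ("diagnosed", "O"),
    ("with", "O"),
    ("lung", "B-disease"),
    ("cancer", "I-disease"),
    (".", "O"),
]
```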


Training


While keeping the already-trained layers of the BioBERT model fixed, it is trained on our dataset by adding an output layer for our desired labels. A new model is thus created from the pre-trained one, and it can be used for further inference.
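The sketch below shows one way such a model could be set up with the Hugging Face transformers library; the checkpoint name, the label set, and the layer-freezing code follow the description above but are assumptions rather than the author's exact implementation.

```python
# Sketch: a disease-NER model built on a pre-trained BioBERT checkpoint.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-disease", "I-disease"]
checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Keep the pre-trained encoder constant; only the new token-classification
# head on top is updated during fine-tuning on the NER dataset.
for param in model.bert.parameters():
    param.requires_grad = False
```

Fine-tuning can then proceed with a standard token-classification training loop, after which the resulting model can be used to tag disease names in new text.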



To understand more about BERT, refer to the paper published by Google AI: https://arxiv.org/pdf/1810.04805.pdf
To understand more about BioBERT, refer to https://academic.oup.com/bioinformatics/article/36/4/1234/5566506
