Google QUEST Q&A Labeling: Kaggle Competition

Manikanth
Mar 2, 2022 · 26 min read

Table of Contents:

  1. Introduction
  2. Business problem
  3. About the data set
  4. Prerequisites
  5. EDA
  6. Feature engineering
  7. Data Preprocessing
  8. Experiments performed
  9. Comparison of models
  10. Final solution
  11. Kaggle submission
  12. Flask demo app
  13. Future work
  14. Reference

Introduction:

The Google QUEST Q&A Labeling competition was conducted by Google to build better subjective question-answering algorithms. The resulting Q&A model can be fed downstream into another model for answering or questioning, helping it understand context and hold more human-like conversations. At present, computers are really good at answering questions with single, verifiable answers, but humans are often still better at answering questions about opinions, recommendations, or personal experiences.

We humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context — something computers aren't trained to do well yet. Questions can take many forms: some have multi-sentence elaborations, others may be simple curiosity or a fully developed problem. They can have multiple intents, or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong.

So the CrowdSource team at Google Research, a group dedicated to advancing NLP and other types of ML science via crowdsourcing, has collected data on a number of these quality scoring aspects.

Business problem:

The goal is to create a more human-like question answering system that can answer a given question with an intuitive understanding of it. Such a system can attract users and address their questions in a more human way, which can increase participation in question-and-answer forums and enable more natural conversational chatbots.

Problem statement: Create intelligent question and answer systems that can reliably predict context without relying on complicated and opaque rating guidelines.

About the dataset:

The data for this competition includes questions and answers from 70 StackExchange-like websites. Our task is to predict target values of 30 labels for each question-answer pair. Target labels with the prefix question_ relate to the question_title and/or question_body features in the data. Target labels with the prefix answer_ relate to the answer feature.

Each row contains a single question and a single answer to that question, along with additional features (such as the question user name, answer user name, user page URL, question category, etc.). The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions. This is not a binary prediction challenge: target labels are aggregated from multiple raters and can take continuous values in the range [0,1], so predictions must also be in that range.

Since we have to predict 30 labels, each with a value between 0 and 1, for every instance (one row / single data point), this is a “multi-label regression problem”.

If the data is not yet clear, do not worry; we will perform EDA on the data set in the section below.

Prerequisites

This post assumes that you are already familiar with machine learning techniques such as regression (Linear Regression, Linear Support Vector Machines) and deep learning techniques such as multi-layered perceptrons, convolutional neural networks, LSTMs, underfitting, overfitting, probability, text processing, Python syntax and data structures, the Keras library, etc.

Among natural language processing transformer models, in this post we will be using the BERT model.

Exploratory data analysis:

Let’s take a look at all the available features and labels, their types, and get some insights from the data.
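The loading cell itself is not reproduced here; a minimal sketch, assuming the standard Kaggle input paths for this competition:

import pandas as pd

# Assumed Kaggle input layout for the google-quest-challenge data
train = pd.read_csv("../input/google-quest-challenge/train.csv")
test = pd.read_csv("../input/google-quest-challenge/test.csv")
submission = pd.read_csv("../input/google-quest-challenge/sample_submission.csv")

print("train:", train.shape)          # (6079, 41)
print("test:", test.shape)
print("submission:", submission.shape)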

I have used a Kaggle notebook for this challenge. By running the above cell we learn that we have 6079 training instances and 41 columns in the training data set, of which 30 are class labels and 11 are features. The test data set has 11 feature columns with no labels, and the submission data set has 31 columns: one unique ID column and the 30 class labels.

Let's have a look at the data using the data frame head method.

Let's run the data frame info method to check whether there are any null values and what data types are present in our train data set.
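These two calls are straightforward; in a notebook they would simply be:

train.head()   # first few rows of the data frame
train.info()   # column dtypes and non-null counts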

Running the info method shows that there are no null values, that the 10 features have type object, and that the 30 labels have type float64. Below are the features we can provide as input to our model; later we will see, using some feature engineering techniques, which features are important and which we can ignore.

Features:

1 question_title
2 question_body
3 question_user_name
4 question_user_page
5 answer
6 answer_user_name
7 answer_user_page
8 url
9 category
10 host

Of the 41 columns, 10 are features, 30 are class labels, and one column, qa_id, is the unique ID for every instance.

  • 21 class labels are for questions, i.e. the labels that start with “question_…”.
  • 9 class labels are for answers, i.e. the labels that start with “answer_…”.
  • In total we have 30 class labels.
  • Of the 10 features, question_title, question_body and answer contain text, and the labels depend mostly on these three features, since human raters labeled the data based on this text.

Now let's check the distribution of word and character counts in the question_title feature.
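The plotting cell is not included in the post; a rough sketch, assuming the train and test data frames from above plus seaborn/matplotlib, could be:

import matplotlib.pyplot as plt
import seaborn as sns

def plot_word_char_dist(col):
    # Compare word-count and character-count distributions of a text column in train vs test
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    for df, name in [(train, "train"), (test, "test")]:
        sns.kdeplot(df[col].str.split().str.len(), ax=axes[0], label=name)
        sns.kdeplot(df[col].str.len(), ax=axes[1], label=name)
    axes[0].set_title(col + ": word count")
    axes[1].set_title(col + ": character count")
    axes[0].legend(); axes[1].legend()
    plt.show()

plot_word_char_dist("question_title")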

Running the above code produces the plots below; from them we can observe that train and test have very similar distributions of characters and words.

  • Most question titles have 5–10 words, in both train and test.
  • Most question titles have 40–60 characters, in both train and test.

Similarly, we will check the word and character counts for the question_body and answer features. We can reuse the code snippet above for both, simply replacing question_title with the respective feature, which gives the plots below.

From the question_body word and character counts we can observe that both distributions are heavily right-skewed.

  • Most question_body texts have fewer than 2500 characters.
  • Most question_body texts have fewer than 1000 words.

Similarly, we will check the answer feature.

  • As with question_body, the answer distribution is also skewed.
  • There may be some extreme outlier instances with very high word/character counts in both the question_body and answer features.

Let's analyze the sequence lengths of the question_body and answer features, since we have observed skewness.
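The percentile cell is not shown; a sketch that reproduces the printout below (exact formatting aside) could be:

import numpy as np

word_counts = train["question_body"].str.split().str.len().values
for lo, hi, step in [(0, 100, 10), (90, 100, 1), (99, 100, 0.1)]:
    for p in np.arange(lo, hi + step, step):
        p = round(p, 1)
        print(f"{p}th percentile of question_body input sequence", np.percentile(word_counts, p))
    print()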

Running the above code prints percentiles over the ranges 0–100 (step 10), 90–100 (step 1) and 99–100 (step 0.1), so we can see how the word counts of the question_body feature are distributed.

0th percentile of question_body input sequence 1.0
10th percentile of question_body input sequence 34.0
20th percentile of question_body input sequence 48.0
30th percentile of question_body input sequence 61.0
40th percentile of question_body input sequence 76.0
50th percentile of question_body input sequence 93.0
60th percentile of question_body input sequence 116.0
70th percentile of question_body input sequence 143.0
80th percentile of question_body input sequence 192.0
90th percentile of question_body input sequence 282.1999999999998
100th percentile of question_body input sequence 4665.0

90th percentile of question_body input sequence 282.1999999999998
91th percentile of question_body input sequence 300.0
92th percentile of question_body input sequence 318.0
93th percentile of question_body input sequence 342.0
94th percentile of question_body input sequence 371.0
95th percentile of question_body input sequence 433.0
96th percentile of question_body input sequence 499.0
97th percentile of question_body input sequence 578.0
98th percentile of question_body input sequence 722.0
99th percentile of question_body input sequence 1026.7200000000066
100th percentile of question_body input sequence 4665.0

99.1th percentile of question_body input sequence 1087.5959999999995
99.2th percentile of question_body input sequence 1170.0
99.3th percentile of question_body input sequence 1195.9079999999994
99.4th percentile of question_body input sequence 1337.580000000069
99.5th percentile of question_body input sequence 1486.0
99.6th percentile of question_body input sequence 1580.0
99.7th percentile of question_body input sequence 1811.0
99.8th percentile of question_body input sequence 1942.0520000000042
99.9th percentile of question_body input sequence 3216.0800000000672
100th percentile of question_body input sequence 4665.0

Above is the output of the cell. We can observe that the 99th percentile of question_body word counts is about 1026, so 99% of question_body texts have fewer than roughly 1026 words and only 1% are longer.

99.8% of question_body texts have fewer than about 1942 words.

Similarly, we will check the answer feature. We can use the same code snippet as above, replacing question_body with answer, which gives the output below.

0th percentile of answer input sequence 2.0
10th percentile of answer input sequence 26.0
20th percentile of answer input sequence 40.0
30th percentile of answer input sequence 55.0
40th percentile of answer input sequence 72.0
50th percentile of answer input sequence 91.0
60th percentile of answer input sequence 114.0
70th percentile of answer input sequence 149.0
80th percentile of answer input sequence 196.0
90th percentile of answer input sequence 293.0
100th percentile of answer input sequence 8158.0

90th percentile of answer input sequence 293.0
91th percentile of answer input sequence 312.9800000000005
92th percentile of answer input sequence 340.0
93th percentile of answer input sequence 363.0799999999999
94th percentile of answer input sequence 392.0
95th percentile of answer input sequence 428.0
96th percentile of answer input sequence 488.28000000000065
97th percentile of answer input sequence 548.6599999999999
98th percentile of answer input sequence 620.4399999999996
99th percentile of answer input sequence 880.880000000001
100th percentile of answer input sequence 8158.0

99.1th percentile of answer input sequence 916.5959999999995
99.2th percentile of answer input sequence 956.2560000000012
99.3th percentile of answer input sequence 1035.6319999999978
99.4th percentile of answer input sequence 1108.044000000018
99.5th percentile of answer input sequence 1157.8799999999974
99.6th percentile of answer input sequence 1213.8160000000007
99.7th percentile of answer input sequence 1338.7879999999932
99.8th percentile of answer input sequence 1679.8920000000098
99.9th percentile of answer input sequence 2122.1140000000178
100th percentile of answer input sequence 8158.0

From the output above we can observe that 99.9% of answer texts have fewer than about 2122 words.

So far we have analyzed the question_title, question_body and answer features; let's analyze the other features as well, starting with the category feature.

Analyzing the category feature
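The plotting code is omitted in the post; a simple sketch using pandas value_counts could be:

import matplotlib.pyplot as plt

# Compare the category distribution in train and test
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
train["category"].value_counts().plot(kind="bar", ax=axes[0], title="train categories")
test["category"].value_counts().plot(kind="bar", ax=axes[1], title="test categories")
plt.show()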

We have only five different categories, and in both train and test most of the question-answer pairs come from the technology category, so the category distribution of train and test is similar.

By examining the question and answer text of each category we can also observe the insights below.

  • Five unique categories are present in the category feature.
  • Technology and Stackoverflow have the highest counts, and the two are related topics.
  • Life_arts has the lowest count.
  • The distributions of categories in train and test are the same.
  • Life_arts & culture follow general English syntax and structure.
  • Science uses LaTeX, with expressions wrapped in the symbol $.
  • Technology & stackoverflow contain code snippets and logs.

Let's plot word clouds for the three text features, for both train and test.
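The word-cloud cell is not shown; a sketch using the wordcloud package (an assumed dependency) could be:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def show_wordcloud(df, col, title):
    text = " ".join(df[col].astype(str))
    wc = WordCloud(stopwords=STOPWORDS, max_words=100, background_color="white").generate(text)
    plt.figure(figsize=(8, 4))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

for col in ["question_title", "question_body", "answer"]:
    show_wordcloud(train, col, "train - " + col)
    show_wordcloud(test, col, "test - " + col)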

We can observe that many of the most frequently used words match between the train and test sets.

Analyzing the labels
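The counting cell is missing from the post; assuming the 30 targets are the columns after the first 11 (qa_id plus the 10 features), it would look like:

target_cols = train.columns[11:]
for col in target_cols:
    print(f"{col}: no. of unique label values: {train[col].nunique()}")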

question_asker_intent_understanding: no. of unique label values: 9
question_body_critical: no. of unique label values: 9
question_conversational: no. of unique label values: 5
question_expect_short_answer: no. of unique label values: 5
question_fact_seeking: no. of unique label values: 5
question_has_commonly_accepted_answer: no. of unique label values: 5
question_interestingness_others: no. of unique label values: 9
question_interestingness_self: no. of unique label values: 9
question_multi_intent: no. of unique label values: 5
question_not_really_a_question: no. of unique label values: 5
question_opinion_seeking: no. of unique label values: 5
question_type_choice: no. of unique label values: 5
question_type_compare: no. of unique label values: 5
question_type_consequence: no. of unique label values: 5
question_type_definition: no. of unique label values: 5
question_type_entity: no. of unique label values: 5
question_type_instructions: no. of unique label values: 5
question_type_procedure: no. of unique label values: 5
question_type_reason_explanation: no. of unique label values: 5
question_type_spelling: no. of unique label values: 3
question_well_written: no. of unique label values: 9
answer_helpful: no. of unique label values: 9
answer_level_of_information: no. of unique label values: 9
answer_plausible: no. of unique label values: 9
answer_relevance: no. of unique label values: 9
answer_satisfaction: no. of unique label values: 17
answer_type_instructions: no. of unique label values: 5
answer_type_procedure: no. of unique label values: 5
answer_type_reason_explanation: no. of unique label values: 5
answer_well_written: no. of unique label values: 9

By observing the output above we can conclude that:

  • The output labels are regression (real-valued) targets, but their distribution is not continuous.
  • Apart from answer_satisfaction (17 unique values), every label takes a small set of discrete values, mostly either 5 or 9 (question_type_spelling has only 3).
  • Using this insight we can post-process predictions to get a better score.

Plotting bar plots for all the labels would make this page very long, so it is better to view them in the kernel linked below. From those plots we can observe that the label values are imbalanced: for some labels almost all instances take a single value (e.g. question_type_spelling, question_not_really_a_question), i.e. the label distributions are very dissimilar.

Correlation between target variables
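The heatmap cell is not shown; a sketch, reusing target_cols from the label analysis above, could be:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(18, 14))
sns.heatmap(train[target_cols].corr(), cmap="RdBu_r", center=0)
plt.title("Correlation between the 30 target labels")
plt.show()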

From the correlation heatmap above we can observe that answer_helpful, answer_level_of_information, answer_plausible, answer_relevance and answer_satisfaction are somewhat correlated with each other.

Now let's analyze the host feature.

  • All questions and answers in the data set were extracted from 63 websites.
  • Most of the questions and answers come from stackoverflow.com, which matches the category analysis showing that most pairs fall under technology and stackoverflow.

Feature engineering:

Below are some of the feature engineering techniques we can experiment with to see which features are important; a short code sketch follows each list.

Text count based features:

  • Number of characters in the question_title
  • Number of characters in the question_body
  • Number of characters in the answer
  • Number of words in the question_title
  • Number of words in the question_body
  • Number of words in the answer
  • Number of unique words in the question_title
  • Number of unique words in the question_body
  • Number of unique words in the answer
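A minimal sketch of these count-based meta features, assuming the train/test data frames from before:

def add_count_features(df):
    # Character, word and unique-word counts for the three text columns
    for col in ["question_title", "question_body", "answer"]:
        df[col + "_num_chars"] = df[col].str.len()
        df[col + "_num_words"] = df[col].str.split().str.len()
        df[col + "_num_unique_words"] = df[col].apply(lambda t: len(set(str(t).split())))
    return df

train = add_count_features(train)
test = add_count_features(test)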

TF-IDF based features:

  • Character Level N-Gram TF-IDF of question_title
  • Character Level N-Gram TF-IDF of question_body
  • Character Level N-Gram TF-IDF of answer
  • Word Level N-Gram TF-IDF of question_title
  • Word Level N-Gram TF-IDF of question_body
  • Word Level N-Gram TF-IDF of answer
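A sketch of the TF-IDF features; the n-gram ranges and max_features below are assumptions, not the exact values used in the original experiments:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

def tfidf_features(train_texts, test_texts, analyzer, ngram_range, max_features=128):
    # Fit on the train text, transform both train and test
    vec = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range, max_features=max_features)
    return vec.fit_transform(train_texts), vec.transform(test_texts)

train_tfidf, test_tfidf = [], []
for col in ["question_title", "question_body", "answer"]:
    for analyzer, ngrams in [("word", (1, 2)), ("char", (2, 4))]:
        tr, te = tfidf_features(train[col], test[col], analyzer, ngrams)
        train_tfidf.append(tr)
        test_tfidf.append(te)

X_tfidf_train = hstack(train_tfidf)
X_tfidf_test = hstack(test_tfidf)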

Web Scraping Features

The train data set contains user page URL features that lead to the user's profile page. Using web scraping on those pages we can extract some potentially useful features, such as the user's gold, silver and bronze badge counts and reputation score; a hedged sketch follows.
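A heavily hedged sketch of such scraping with requests and BeautifulSoup; the CSS selectors are placeholders and would need to be adapted to the actual markup of each site:

import requests
from bs4 import BeautifulSoup

def scrape_user_stats(user_page_url):
    # Placeholder selectors: ".reputation" and ".badgecount" are assumptions about the page markup
    resp = requests.get(user_page_url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    rep_node = soup.select_one(".reputation")
    badge_nodes = soup.select(".badgecount")
    reputation = int(rep_node.get_text(strip=True).replace(",", "")) if rep_node else 0
    badges = [int(b.get_text(strip=True)) for b in badge_nodes] + [0, 0, 0]
    gold, silver, bronze = badges[:3]
    return reputation, gold, silver, bronze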

Preprocessing Text Feature

This data set has no null values; the Kaggle problem description states that none are present.

When we deal with text, we generally perform some basic cleaning: lower-casing all the words, removing special tokens (like ‘%’, ‘$’, ‘#’, etc.), removing HTML tags, and replacing \r and \n (newline) characters with spaces.

Below is a code snippet to remove HTML tags and special characters.
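The cleaning cell is not reproduced in the post; a sketch of the steps described above could be:

import re
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(str(text), "html.parser").get_text()   # strip HTML tags
    text = text.replace("\r", " ").replace("\n", " ")           # replace \r and \n with spaces
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)                 # drop special characters
    text = re.sub(r"\s+", " ", text).strip()                    # collapse repeated spaces
    return text.lower()

for col in ["question_title", "question_body", "answer"]:
    train[col] = train[col].apply(clean_text)
    test[col] = test[col].apply(clean_text)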

Sample text answer data before preprocessing

same text after preprocessing

Build models:

Let's start with a base model and then build more complex models. We have three text features, so let's convert the text into vectors that we can feed to our models.
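The GloVe loading cell is not shown; a sketch, assuming the 300-dimensional GloVe file is available in the Kaggle input directory, could be:

import numpy as np

GLOVE_PATH = "../input/glove840b300dtxt/glove.840B.300d.txt"  # assumed path

embedding_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(" ")
        embedding_index[values[0]] = np.asarray(values[1:], dtype="float32")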

Using the above code we store each word and its 300-dimensional vector as a key-value pair in embedding_index. After loading embedding_index we convert our text into tokens using the code snippet below.
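A sketch of the tokenization and padding for question_title; the maximum sequence length is an assumption:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN_TITLE = 30  # assumed maximum length for question_title

tokenizer_title = Tokenizer()
tokenizer_title.fit_on_texts(train["question_title"])
title_seq_train = tokenizer_title.texts_to_sequences(train["question_title"])
title_seq_train = pad_sequences(title_seq_train, maxlen=MAX_LEN_TITLE, padding="post")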

Using the above code we convert each text into a sequence of unique token IDs and then pad the sequences to the same length by appending zeros. We do the same for the other text features.

After tokenization and padding, we map each token in a sequence to its 300-dimensional GloVe vector, so that similar words end up close together and dissimilar words far apart.
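A sketch of building the embedding matrix for question_title from embedding_index:

import numpy as np

EMBED_DIM = 300
word_index = tokenizer_title.word_index
embedding_matrix_title = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix_title[i] = vector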

Executing the above cell gives the embedding weights for the question_title feature; reusing the same code, we can build the embedding weights for the question_body and answer features as well.

Building Base LSTM model

In this model we use a bidirectional LSTM layer for each of the three input branches, concatenate the three outputs, pass them through a few dense layers, and use a sigmoid activation in the output layer because we need a probability-like score between 0 and 1 for each label; a sketch follows.
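The layer sizes, dropout rate and loss below are assumptions, and embedding_matrix_body / embedding_matrix_answer with their maximum lengths are assumed to be built the same way as for question_title:

import tensorflow as tf
from tensorflow.keras import layers, Model

def text_branch(max_len, embedding_matrix, name):
    # One input branch: frozen GloVe embedding followed by a bidirectional LSTM
    inp = layers.Input(shape=(max_len,), name=name)
    x = layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                         weights=[embedding_matrix], trainable=False)(inp)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    return inp, x

title_in, title_feat = text_branch(MAX_LEN_TITLE, embedding_matrix_title, "question_title")
body_in, body_feat = text_branch(MAX_LEN_BODY, embedding_matrix_body, "question_body")
ans_in, ans_feat = text_branch(MAX_LEN_ANSWER, embedding_matrix_answer, "answer")

x = layers.concatenate([title_feat, body_feat, ans_feat])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(30, activation="sigmoid")(x)  # one probability-like score per label

model = Model([title_in, body_in, ans_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")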

We use a Keras callback to compute the Spearman score on the validation data at the end of each epoch; a sketch of such a callback follows.
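A minimal version of such a callback, computing the mean column-wise Spearman correlation with scipy:

import numpy as np
from scipy.stats import spearmanr
from tensorflow.keras.callbacks import Callback

class SpearmanCallback(Callback):
    def __init__(self, valid_inputs, valid_targets):
        super().__init__()
        self.valid_inputs = valid_inputs
        self.valid_targets = valid_targets

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.valid_inputs)
        # Average the per-label Spearman correlations over the 30 target columns
        scores = [spearmanr(self.valid_targets[:, i], preds[:, i]).correlation
                  for i in range(self.valid_targets.shape[1])]
        print("epoch", epoch + 1, "val spearman:", np.nanmean(scores))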

Why are we using the Spearman score as the metric?

We need to compare the predicted values with the true values, i.e. measure how similar they are. Spearman correlation uses the ranks of the data, so it is a robust measure of the similarity between the 30 predicted and true label columns, and it is also the evaluation metric of this Kaggle competition.

Using the above model we achieved a Spearman score of 0.279374. This score uses only the three text features: question_title, question_body and answer.

Base LSTM Model + 16 Feature Engineering Features:

We have our three basic text features, question_title, question_body and answer; now we will experiment with adding engineered features on top of them.

We will provide the feature engineering features below as additional inputs to our model and see whether they affect model performance.

  • 9 Feature engineering features (9 dim) — question_title_num_chars, question_body_num_chars, answer_num_chars, question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words
  • 3 TF-IDF features (384 dim) — TF-IDF question_title, TF-IDF question_body, TF-IDF answer.
  • 4 Web scraping features (4 dim) — reputation, gold, silver, bronze.

Using all the features above we achieved a score of 0.0564, which is very low compared to the previous model with just the three text features; adding all the engineered features degraded performance drastically. In the next experiment we will remove some of the features and check the model's performance again.

Base LSTM model applying 13 Feature Engineering features

We remove the TF-IDF features, since the model above performed worse than the base model with only the three text features, and TF-IDF adds the most dimensions.

Now we experiment with only the features below and check the performance of our model.

  • 3 text features — question_title, question_body, answer
  • 9 Feature engineering features (9 dim) — question_title_num_chars, question_body_num_chars, answer_num_chars, question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words
  • 4 Web scraping features (4 dim) — reputation, gold, silver, bronze.

After removing the TF-IDF features we achieved a Spearman score of -0.00246, which is still very low compared to the base model with only the three text features, so let's remove more features and check the model's performance.

Base LSTM model applying 9 Feature Engineering features

We again remove the TF-IDF features (which hurt performance above, as they add the most dimensions) and this time also remove the web scraping features, keeping only:

  • 3 text features — question_title, question_body, answer
  • 9 Feature engineering features (9 dim) — question_title_num_chars, question_body_num_chars, answer_num_chars, question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words

The model with only the meta features performed better than the models with TF-IDF and web scraping features, achieving the highest Spearman score for this architecture: 0.2871.

I also repeated the experiment with the base LSTM model and all feature engineering features using 100-dimensional embeddings, but the score was very low, so lower-dimensional embeddings did not help.

Overview of the base models

  • The model with the three basic text features (question_title, question_body, answer) plus meta features achieved the best performance, compared to the models with TF-IDF and web scraping features.
  • The best base model achieved a Spearman score of 0.2871.
  • TF-IDF and web scraping features are not needed for the best performance; the meta features are the important ones.

Universal Sentence Encoder

The Universal Sentence Encoder makes getting sentence level embeddings as easy as it has historically been to lookup the embeddings for individual words. The sentence embeddings can then be trivially used to compute sentence level meaning similarity as well as to enable better performance on downstream classification tasks using less supervised training data.

From the plot above we can observe that similar sentences are grouped together, so USE can help our model, which needs to measure similarity between questions and answers. For more implementation details check out this link.

For this we will use the TensorFlow Universal Sentence Encoder pre-trained embedding.

Let's load the embedding module from TensorFlow Hub:

import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
uni_sen_embed = hub.load(module_url)

If you look at the output, a single word gets a 512-dimensional embedding, and a sentence with multiple words also gets a 512-dimensional embedding.

Now let's convert each of our text features into 512-dimensional embeddings using the code snippet below.
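A sketch of encoding the three text columns in batches with the loaded USE module; the validation split would be encoded the same way:

import numpy as np

def use_embed(texts, batch_size=64):
    # Encode texts in batches; each text becomes a 512-d vector
    out = []
    for i in range(0, len(texts), batch_size):
        out.append(uni_sen_embed(texts[i:i + batch_size]).numpy())
    return np.vstack(out)

text_cols = ["question_title", "question_body", "answer"]
embeddings_train = {col: use_embed(train[col].tolist()) for col in text_cols}
embeddings_test = {col: use_embed(test[col].tolist()) for col in text_cols}

# Stack the three 512-d blocks side by side into one feature matrix per split
X_train_use = np.hstack([embeddings_train[col] for col in text_cols])
X_test_use = np.hstack([embeddings_test[col] for col in text_cols])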

As sketched above, the embeddings_train, embeddings_valid and embeddings_test dictionaries hold the embeddings of the three text features, which we stack into a single numpy array per split.

Having obtained the embeddings with the Universal Sentence Encoder, let's build the model.
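A sketch of the USE model: a small stack of dense layers on top of the precomputed embeddings (layer sizes and loss are assumptions):

from tensorflow.keras import layers, Model

inp = layers.Input(shape=(X_train_use.shape[1],))
x = layers.Dense(512, activation="relu")(inp)
x = layers.Dropout(0.2)(x)
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(30, activation="sigmoid")(x)

use_model = Model(inp, out)
use_model.compile(optimizer="adam", loss="binary_crossentropy")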

Training the above USE model we achieved a Spearman score of 0.33029, far better than the base LSTM model, using only the three basic text features. Note that training took much less time than the base model: since the embeddings are precomputed, we only need to train three dense layers, so one epoch takes less than a second and we can afford many more epochs. Let's run more experiments on this USE model with the engineered features that worked for the base LSTM model, plus some new ones. Below are the models we will build using USE.

1. USE + 9 Meta feature
2. USE + L2 distance similarity features
3. USE + cosine similarity features
4. USE + all distance similarity features + 9 meta features

Since the base-model experiments scored better with meta features than with TF-IDF and web scraping features, we will not use TF-IDF or web scraping features here.

USE + 9 Meta Features

Let's build the USE + meta features model. There is no complex architecture involved here, just one extra input that is concatenated with the USE embeddings.

We trained it for 100 epochs and observed that the training and validation loss curves stay smooth: the model neither overfits nor diverges (you can see the curves in the image below), and training remains fast. After 100 epochs the validation loss starts to deviate slightly from the training loss, so around 100 epochs is a good stopping point.

With USE + 9 meta features we achieved a Spearman score of 0.3713, better than simple USE with only the basic features.

USE + L2 distance similarity + 9 Meta Features Model

The most common distance is the Euclidean one which is defined by
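d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}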


Using the Euclidean distance above, we compute distances between the following pairs of embeddings:

question_title_USE_embedding - answer_USE_embedding
question_body_USE_embedding - answer_USE_embedding
question_body_USE_embedding - question_title_USE_embedding

These three distance measures, computed from the three text features, are fed as additional inputs to our USE model.

We use the same model as above, stacking the extra distance features with np.hstack as in the earlier code, together with the 9 meta features; a sketch follows.
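A sketch of those distance features, assuming the USE embedding dictionaries from before and a meta_features_train array holding the 9 meta features (a hypothetical name):

import numpy as np

def l2_dist(a, b):
    # Row-wise Euclidean distance between two embedding matrices
    return np.linalg.norm(a - b, axis=1, keepdims=True)

dist_train = np.hstack([
    l2_dist(embeddings_train["question_title"], embeddings_train["answer"]),
    l2_dist(embeddings_train["question_body"], embeddings_train["answer"]),
    l2_dist(embeddings_train["question_body"], embeddings_train["question_title"]),
])

X_train_l2 = np.hstack([X_train_use, dist_train, meta_features_train])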

The score did not improve over the USE + 9 meta features model; this model reached a similar score of 0.36124 but needed more epochs to get there.

USE + Cosine distance + 9 Meta features

Now let's perform the same experiment, but with cosine distance instead of L2 distance.

Euclidean distance is like using a ruler to measure the distance, but it is probably not the best choice here. For example, Ronaldo is close to Messi because they both have high ratings for shooting, speed and dribbling, but a young player like Joao Felix, who has a similar profile to Messi, ends up further away simply because his attributes are weaker, even though they are in the same proportion, as illustrated in the images below.

This is where cosine similarity comes in: it is a measure of similarity between two vectors based on the angle between them.

Given two vectors of attributes x and y, the cosine similarity cos(θ) is expressed using their dot product and magnitudes (see the formula below). In the example, the angle between Messi and Joao Felix is smaller than the angle between Messi and Ronaldo, even though Joao Felix is further away in Euclidean terms.
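cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2} \, \sqrt{\sum_{i} y_i^2}}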

Cosine similarity allows us to better capture “style” rather than pure “statistics” attributes.

The model achieved a Spearman score of 0.371537, similar to the two USE models above.


USE + All distance features + 9 Meta features

Let's stack all the distance features used in the models above and see how the model performs.

By stacking all the distance features we achieved the maximum score of 0.37261, better than all the USE models we have experimented with so far.

BERT (Bidirectional Encoder Representations from Transformers):

Before diving into the BERT model, let's have a quick high-level view of what BERT is.

BERT is a deep learning model that has produced state-of-the-art results on a wide variety of natural language processing tasks. It stands for Bidirectional Encoder Representations from Transformers. It has been pre-trained on Wikipedia and BooksCorpus and requires only task-specific fine-tuning.

It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

It’s not an exaggeration to say that BERT has significantly altered the NLP landscape. Imagine using a single model that is trained on a large unlabelled dataset to achieve State-of-the-Art results on 11 individual NLP tasks. And all of this with little fine-tuning. That’s BERT! It’s a tectonic shift in how we design NLP models.

BERT has inspired many recent NLP architectures, training approaches and language models, such as Google’s TransformerXL, OpenAI’s GPT-2, XLNet, ERNIE2.0, RoBERTa, etc.

It is basically a bunch of Transformer encoders stacked together (not the whole Transformer architecture, just the encoder). The concept of bidirectionality is the key differentiator between BERT and its predecessor, OpenAI GPT: BERT is bidirectional because its self-attention layers attend in both directions.

There are a few things we need to know about BERT before we start experimenting with it.

  • First, It’s easy to get that BERT stands for Bidirectional Encoder Representations from Transformers. Each word here has a meaning to it and we will encounter that one by one. For now, the key takeaway from this line is — BERT is based on the Transformer architecture.
  • Second, BERT is pre-trained on a large corpus of unlabelled text including the entire Wikipedia(that’s 2,500 million words!) and Book Corpus (800 million words). This pre training step is really important for BERT’s success. This is because as we train a model on a large text corpus, our model starts to pick up the deeper and intimate understandings of how the language works. This knowledge is the swiss army knife that is useful for almost any NLP task.
  • Third, BERT is a deeply bidirectional model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.
  • Fourth, and finally, the biggest advantage of BERT is that it brought NLP its own “ImageNet moment”, and the most impressive aspect is that we can fine-tune it by adding just a couple of additional output layers to create state-of-the-art models for a variety of NLP tasks.

Architecture of BERT

BERT is a multi-layer bidirectional Transformer encoder. There are two models introduced in the paper.

  • BERT base — 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
  • BERT Large — 24 layers, 16 attention heads and, 340 million parameters.

For an in-depth understanding of the building blocks of BERT (aka Transformers), you should definitely check this awesome post — The Illustrated Transformer.

Here’s a representation of BERT Architecture

Preprocessing Text for BERT

The input representation used by BERT is able to represent a single text sentence as well as a pair of sentences (e.g., question answering) in a single sequence of tokens.

  • The first token of every input sequence is the special classification token — [CLS]. This token is used in classification tasks as an aggregate of the entire sequence representation. It is ignored in non-classification tasks.
  • For single text sentence tasks, this [CLS] token is followed by the WordPiece tokens and the separator token — [SEP].
  • For sentence pair tasks, the WordPiece tokens of the two sentences are separated by another [SEP] token. This input sequence also ends with the [SEP] token.
  • A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar to token/word embeddings with a vocabulary of 2.
  • A positional embedding is also added to each token to indicate its position in the sequence.

The BERT developers defined a specific set of rules to represent the input text before feeding it into the model.

For starters, every input embedding is a combination of 3 embeddings:

  • Position Embeddings: BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information
  • Segment Embeddings: BERT can also take sentence pairs as inputs for tasks (Question-Answering). That’s why it learns a unique embedding for the first and the second sentences to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly for EB)
  • Token Embeddings: These are the embeddings learned for the specific token from the WordPiece token vocabulary

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

Such a comprehensive embedding scheme contains a lot of useful information for the model.

These combinations of preprocessing steps make BERT so versatile. This implies that without making any major change in the model’s architecture, we can easily train it on multiple kinds of NLP tasks.

Tokenization: BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the existing words in the vocabulary are iteratively added.

With this background on BERT, let's fine-tune a BERT model and see how it performs on our task. First, load the BERT model from TensorFlow Hub:

hub_url_bert = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3"
bert_layer = hub.KerasLayer(hub_url_bert, trainable=False)

After loading the pre-trained BERT model, we get the vocab file and create a tokenizer object, which will be used to convert each sentence into tokens, masks and segments as input for the BERT model.
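A sketch of the tokenizer setup; it assumes the tokenization.py module from the official BERT repository is available, and reads the vocab file shipped with the hub module:

import tokenization  # tokenization.py from the official BERT GitHub repository

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)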

Now let's transform our input features into BERT input features. The code snippets below convert the input text into input IDs, an input mask and input segments for BERT.

In the below _trim_input function:

  • if the input sentence has the number of tokens > 512, the sentence is trimmed down to 512. To trim the number of tokens, 256 tokens from the start and 256 tokens from the end are kept and the remaining tokens are dropped.

Ex. suppose an answer has 700 tokens, to trim this down to 512, 256 tokens from the beginning are taken and 256 tokens from the end are taken and concatenated to make 512 tokens. The remaining [700-(256+256) = 288] tokens that are in the middle of the answer are dropped.

  • The logic makes sense because in large texts, the beginning part usually describes what the text is all about and the end part describes the conclusion of the text. This is also closely related to the target features that we need to predict.

In the _convert_to_bert_inputs function below we concatenate the three text fields into a single sequence and convert it into BERT-compatible inputs.
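A sketch of both helpers; the per-field token budgets (30/239/239, which with the four special tokens sum to 512) and the head-plus-tail trimming are assumptions in the spirit described above:

MAX_SEQUENCE_LENGTH = 512

def _head_tail(tokens, max_len):
    # Keep roughly the first and last halves of an over-long token list
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[-(max_len - half):]

def _trim_input(title, question, answer, t_max_len=30, q_max_len=239, a_max_len=239):
    t = tokenizer.tokenize(title)[:t_max_len]
    q = _head_tail(tokenizer.tokenize(question), q_max_len)
    a = _head_tail(tokenizer.tokenize(answer), a_max_len)
    return t, q, a

def _convert_to_bert_inputs(title, question, answer):
    # [CLS] title [SEP] question [SEP] answer [SEP], padded to MAX_SEQUENCE_LENGTH
    tokens = ["[CLS]"] + title + ["[SEP]"] + question + ["[SEP]"] + answer + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    ids = ids + [0] * (MAX_SEQUENCE_LENGTH - len(ids))
    masks = [1] * len(tokens) + [0] * (MAX_SEQUENCE_LENGTH - len(tokens))
    # Segment 0 for title + question, segment 1 for the answer (one possible choice)
    segments = [0] * (len(title) + len(question) + 3) + [1] * (len(answer) + 1)
    segments = segments + [0] * (MAX_SEQUENCE_LENGTH - len(segments))
    return ids, masks, segments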

We then transform the text features of the train data set into BERT training inputs.

We do the same for the validation and test data sets. Next, we create the BERT layer that the model will use to obtain BERT's embedding features.

Using this layer, we build our BERT model, train it on the data we converted for BERT, and check its performance; a sketch of the model follows.
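A sketch of the model: the three BERT inputs go through the hub layer (assumed to return pooled and sequence outputs for this module version), and the pooled output feeds a 30-unit sigmoid head. Dropout, learning rate and loss are assumptions; note that fine-tuning the BERT weights themselves requires creating the hub layer with trainable=True:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bert_model(max_len=512):
    input_word_ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    x = layers.Dropout(0.2)(pooled_output)
    out = layers.Dense(30, activation="sigmoid")(x)

    model = Model([input_word_ids, input_mask, segment_ids], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss="binary_crossentropy")
    return model

bert_model = build_bert_model()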

Fine-tuning the BERT model achieved a score of 0.3913, the highest of all the models we have experimented with.

Overview of all the experiments so far:


So far we have experimented with five base-model variants and five Universal Sentence Encoder variants.

  • The bidirectional LSTM model achieved its best performance with the three text features plus the 9 meta features.
  • The training data is small, only about 6K instances, and neural networks need more data to extract patterns and features, so we used transfer learning.
  • USE achieved the best performance among the non-BERT models, a Spearman score of 0.37261, with the text features, the distance features and the 9 meta features.
  • TF-IDF and web scraping features did not improve model performance.
  • USE takes very little time to train, so it can be run for many more epochs to fit the data and reach its best score.
  • Fine-tuning the BERT model achieved the highest Spearman score of all the experiments performed so far, 0.3912.

Kaggle submission:

If you are submitting on Kaggle, make sure your notebook runs offline (without internet access) before submitting your final best-solution notebook.

We achieved a public score of 0.36514 and a private score of 0.34118. This is reasonable: some kernels with 0.44+ public scores had much lower private scores, so our model is not overfitting the training data.

Flask demo app:

I have also built a Flask web app to demonstrate the model's predictions. You provide the question title, question description and answer, click submit, and you get the predictions for the 30 labels as a horizontal bar plot and as a table. Use this link to access the web app files, download them to your local machine and run app.py.

After providing the input data (question title, question body and answer) you will see the web page below, where you can view the predictions for the question-answer pair.

Future work

  • We experimented with only one transformer-based model, BERT, but we could run more experiments with similar transformer-based models such as RoBERTa, XLNet, ALBERT, etc., and with ensembles of transformer-based models.

References:

Data Analysis:

  1. https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe
  2. https://www.kaggle.com/mobassir/jigsaw-google-q-a-eda

Feature engineering:

  1. https://www.kaggle.com/c/google-quest-challenge/discussion/130041 — meta features.
  2. https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe?scriptVersionId=25618132&cellId=65 — tfidf, count based features.
  3. https://towardsdatascience.com/hands-on-transformers-kaggle-google-quest-q-a-labeling-affd3dad7bcb — web scraping features

Universal Sentence encoder:

  1. https://www.kaggle.com/abazdyrev/use-features-oof
  2. https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder
  3. https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html
  4. https://gaurav5430.medium.com/universal-sentence-encoding-7d440fd3c7c7

BERT References:

  1. BERT for Dummies step by step tutorial by Michel Kana
  2. Demystifying BERT: Groundbreaking NLP Framework by Mohd Sanad Zaki Rizvi
  3. A visual guide to using BERT by Jay Alammar
  4. BERT Fine tuning By Chris McCormick and Nick Ryan
  5. How to use BERT in Kaggle competitions — Reddit Thread
  6. BERT GitHub repository
  7. BERT — SOTA NLP model Explained by Rani Horev
  8. YOUTUBE — BERT Pretranied Deep Bidirectional Transformers for Language Understanding algorithm by Danny Luo
  9. State-of-the-art pre-training for natural language processing with BERT by Javed Quadrud-Din


Manikanth

Data scientist | Helping business leverage their data using machine learning to drive results. https://linktr.ee/manikanthp