U.S Patent Phrase Matching

Raghavi_bala
8 min read · Nov 22, 2022

Natural Language Processing (NLP) is used for many tasks, and one of them is text similarity prediction: given 2 sentences, we try to predict how similar they are.

U.S. Patent Phrase Matching is one such problem. Given an anchor text and a target text, we have to predict a score. The score is the dependent variable, and it tells us how similar the anchor and target texts are.

Data Description.

We can download this dataset from Kaggle and load it into a pandas DataFrame to take a look. This is the train dataset, and it has a “score” column, which is the dependent variable. The test dataset is the same except that it has no “score” column.

We can see that there are 4 columns in total apart from the “id” column, which can be ignored.

We can use the describe() function in pandas to get the number of unique values, the most frequent value (top) and how often it occurs (freq).
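To get a feel for what describe() reports on text columns, here is a minimal sketch on a tiny made-up frame. The rows are purely illustrative and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical rows that only mimic the layout of the train file.
toy = pd.DataFrame({
    "anchor": ["abatement", "abatement", "acid absorption"],
    "target": ["abatement of pollution", "act of abating", "acid immersion"],
    "context": ["A47", "A47", "B08"],
})

# With only text columns, describe() reports count, unique, top and freq.
summary = toy.describe()
print(summary)
```

The "unique" row is the number of distinct values, "top" is the most frequent value and "freq" is how many times it appears.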

Data Analysis.

There are no missing values in the dataset.

This is the distribution of the “score” column. We can see that 1.0 is the least frequent score, and that the data is imbalanced.

Distribution of Score
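The two checks above (missing values and the score distribution) could be done roughly like this. It's a sketch on made-up scores, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical scores, made up for illustration.
toy = pd.DataFrame({"score": [0.5, 0.25, 0.5, 0.0, 0.75, 0.25, 0.5, 1.0]})

# Number of missing entries in the column.
missing = int(toy["score"].isnull().sum())

# Share of rows per score value; uneven shares point to imbalance.
share = toy["score"].value_counts(normalize=True).sort_index()

print(missing)
print(share)
```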

Anchor Column: From the data analysis we found that there are a total of 733 unique values in anchor.

print(f"Number of unique values in ANCHOR column: {data.anchor.nunique()}")

We can use the code below to create a word cloud out of all the anchor text.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# `data` is the train dataframe loaded earlier
anchor_desc = data.anchor.values
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800,
                      height=800,
                      background_color='white',
                      min_font_size=10,
                      stopwords=stopwords).generate(' '.join(anchor_desc))

# Draw the cloud on a square figure without axes
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

The word cloud is shown below.

The larger the word, the higher its frequency, so a word cloud helps us tell the most frequent words from the least frequent ones at a glance.
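If you'd rather see the numbers behind the picture, a plain word count gives roughly the same information. This is a sketch on made-up phrases. It also skips the stop-word filtering that WordCloud applies, so the counts won't match the cloud exactly.

```python
from collections import Counter

# Hypothetical anchor phrases, made up for illustration.
anchor_desc = ["acid absorption", "absorption rate", "air flow", "acid bath"]

# Count how often each word occurs across all phrases.
word_counts = Counter(" ".join(anchor_desc).split())

# The words with the highest counts are the ones drawn largest in a word cloud.
print(word_counts.most_common(3))
```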

We then look at the number of words in each anchor text. We can see that most anchor texts consist of 2 words, and the longest has 5 words.
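Counting words per phrase takes one line in pandas. Here's a small sketch, again on made-up rows. The column name anchor_words is just a name I picked for the example.

```python
import pandas as pd

# Hypothetical anchor phrases, made up for illustration.
toy = pd.DataFrame(
    {"anchor": ["abatement", "acid absorption", "air flow line", "acid absorption"]}
)

# Split each phrase on whitespace and count the resulting words.
toy["anchor_words"] = toy["anchor"].str.split().str.len()

# How many phrases have 1 word, 2 words, and so on.
length_counts = toy["anchor_words"].value_counts().sort_index()

print(length_counts)
print(toy["anchor_words"].max())
```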

Target Column: From the data analysis we found that there are a total of 29340 unique values in target.

The word cloud of the Target column is shown below.

We then look at the number of words in each target text. We can see that most target texts consist of 2 words, and the maximum length is 15 words.

Feature Engineering.

Apart from the anchor, target and score columns we have seen so far, we have some other useful features. One of them comes from a new .csv file named title.csv.

It’s a separate dataset published to add additional information to this dataset. So far we haven’t talked about one column in the dataset, the context column. It contains codes, which are not useful on their own, but with the help of the titles dataset we can get the meaning of each code that we find in the context column.

You can find this dataset here. We will merge the dataset we have with the titles dataset on the context column. It’s similar to a SQL JOIN. The code below does this for us, and the resulting dataframe can be seen below as well.
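As a rough sketch of the merge described above: the column names "code" and "title" for the titles file are my assumption here, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical training rows.
train = pd.DataFrame({
    "anchor": ["abatement", "acid absorption"],
    "target": ["act of abating", "acid immersion"],
    "context": ["A47", "B08"],
})

# Hypothetical titles rows; "code" and "title" are assumed column names.
titles = pd.DataFrame({
    "code": ["A47", "B08"],
    "title": ["furniture; domestic articles", "cleaning"],
})

# A left join keeps every training row and attaches the matching title.
merged = train.merge(titles, how="left", left_on="context", right_on="code")
print(merged[["anchor", "context", "title"]])
```

A left join is the safe choice here: a context code with no match in the titles file would get a missing title instead of silently dropping the training row.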

After we do this, we’ll perform some of the most basic pre-processing steps for text, like removing punctuation and converting everything to lower case. We will also drop the context column, as we have the title in place of it.
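Those two pre-processing steps could look something like this. The helper name clean_text is hypothetical, and it only strips ASCII punctuation.

```python
import string


def clean_text(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting and re-joining also collapses any repeated spaces.
    return " ".join(no_punct.lower().split())


print(clean_text("FURNITURE; Domestic Articles, or Appliances"))
# furniture domestic articles or appliances
```

With pandas you would apply it to a column, for example data["title"].apply(clean_text).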

Before building our model: some prerequisites.

As we’re dealing with text data, we have to embed it, that is, convert it into numbers, as machines don’t understand text. There are many ways to do this, ranging from simple methods like TF-IDF to complex deep learning models like BERT.

In this case, we want our model to understand the context of the sentence, so we will be using a contextual word embedding model similar to BERT: DeBERTa (Decoding-enhanced BERT with disentangled attention).

It is a transformer-based model. You can get the pre-trained model from Hugging Face and use it.

Data Split.

We will be doing k-fold cross-validation as well.

You can simply write a function and pass it your training data and the k value, that is, the number of splits. It will make the splits for you. I have used k=5 here.

This will create a separate column named ‘kfold’ holding an integer from 0–4, grouping the whole data into 5 separate bins.
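Here's one minimal way such a function could be written. The name create_folds and the seed are my own choices for this sketch, and the actual function may well differ (for example, it could use scikit-learn's KFold or StratifiedKFold).

```python
import numpy as np
import pandas as pd


def create_folds(data: pd.DataFrame, num_splits: int, seed: int = 42) -> pd.DataFrame:
    """Shuffle the rows and give each one a fold id between 0 and num_splits - 1."""
    data = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Cycling through 0..num_splits-1 keeps the folds as equal in size as possible.
    data["kfold"] = np.arange(len(data)) % num_splits
    return data


# Hypothetical data: 10 rows split into 5 folds.
toy = pd.DataFrame({"anchor": [f"phrase {i}" for i in range(10)]})
folded = create_folds(toy, num_splits=5)
print(folded["kfold"].value_counts().sort_index())
```

Since the score is imbalanced, a stratified split would keep the score distribution similar across folds. Plain shuffling is just the simplest thing that works.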

Now that we are done with the “train-test split”, as we call it in the community, we can go ahead and tokenize the text.

Tokenization.

Here’s a simple trick that’ll come in handy. We mostly know which model we’re going to use, and hence we use the respective tokenizer for that model. But what if we’re too busy to go search for the tokenizer (it’s literally given in the code snippet in the documentation)? Some of us are really lazy, and Hugging Face came up with a solution for that.

AutoTokenizer: we can pass our model name into it and, ta-da, it’ll pick the tokenizer that goes with your model.

As I mentioned before, I’ll be using DeBERTa here. Hence my model_name will be “microsoft/deberta-base”.

If you go ahead and print the model, you can see a few of its important features and constraints. The max_len of this model is 512. No worries, we don’t have lengthy texts, as we saw from our EDA (do your EDA, kids 😉). In our case max_len will be 40.

Input Data Preparation.

Just like ML models, deep learning models don’t simply take the raw data in our csv as input. We have to cater to their special needs. As we’re using transformers, we have to provide two inputs after processing our texts: input_ids and attention_mask. We can get both of these by passing our text into the tokenizer we got from AutoTokenizer and reading them from its dictionary output. The code below will do it for you.

I wrote a function that takes id_, anchor, target, code, title, score, tokenizer, max_len, train_status=True as input and outputs a dictionary which has “input_ids”, “attention_mask” and “ids”, along with labels depending on the boolean value of train_status.

If train_status = True, we get labels along with the dictionary. We do this to get the dependent variable (label or target, it has many names) to pass to the model, and we use it while evaluating.
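A function with that signature could be sketched as follows. Two things here are assumptions of mine: the way anchor, target and title are combined into two segments, and the ToyTokenizer class, which is only a whitespace stand-in so the sketch runs without downloading a model. In practice you would pass the tokenizer returned by AutoTokenizer.from_pretrained(model_name) instead.

```python
class ToyTokenizer:
    """Whitespace stand-in that mimics the dict output of a Hugging Face tokenizer."""

    def __init__(self):
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}

    def _ids(self, text):
        # Give every new word the next free integer id.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]

    def __call__(self, text, text_pair, max_length, padding="max_length", truncation=True):
        ids = [1] + self._ids(text) + [2] + self._ids(text_pair) + [2]
        ids = ids[:max_length]
        pad = max_length - len(ids)
        # attention_mask is 1 for real tokens and 0 for padding.
        return {"input_ids": ids + [0] * pad, "attention_mask": [1] * len(ids) + [0] * pad}


def prepare_input(id_, anchor, target, code, title, score, tokenizer, max_len, train_status=True):
    # Assumption: anchor is the first segment, target plus title the second.
    # `code` is accepted to match the signature but is not used in this sketch.
    encoded = tokenizer(anchor, target + " " + title, max_length=max_len,
                        padding="max_length", truncation=True)
    sample = {
        "ids": id_,
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
    }
    if train_status:
        sample["labels"] = score
    return sample


tokenizer = ToyTokenizer()
sample = prepare_input("row-0", "abatement", "act of abating", "A47", "furniture",
                       0.5, tokenizer, max_len=12)
print(sample)
```

For the test data, call it with train_status=False and the "labels" key is simply left out.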

Here you can notice that the fold value is given as 0. This is where the k-fold comes into play. Every row whose “kfold” column value is 0 goes to the validation data, and the rest, the other 4 folds, go in as train data.
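Selecting the train and validation rows for a given fold is a simple filter on the “kfold” column. A small sketch on made-up rows (the names train_df and valid_df are just for this example):

```python
import pandas as pd

# Hypothetical frame that already has a kfold column.
toy = pd.DataFrame({
    "anchor": [f"phrase {i}" for i in range(10)],
    "kfold": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
})

fold = 0
# Rows in the chosen fold are held out for validation, the rest are for training.
train_df = toy[toy["kfold"] != fold].reset_index(drop=True)
valid_df = toy[toy["kfold"] == fold].reset_index(drop=True)

print(len(train_df), len(valid_df))
```

Changing fold to 1, 2, 3 or 4 gives the other four train/validation combinations.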

Model Building.

So, we got the input. Now, let’s go ahead and build the model. We will use the Functional API in TensorFlow to build a functional model.

Look into the model: we’re using TFAutoModel. Sounds familiar? It’s similar to AutoTokenizer. We’re using “mse” as the loss, as the target column in this problem is a continuous value, just like in a regression problem. We’re using Dense(1) because our final output is going to be a single value. And for the optimizer, we leave it to our go-to, “Adam”.
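To make the choice of loss a bit more concrete, here is a hand-rolled version of what “mse” computes. Keras does this for us during training, so this is only an illustration, and the scores below are made up.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between true and predicted scores."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true scores and model outputs.
actual = [0.5, 0.25, 1.0, 0.0]
predicted = [0.5, 0.5, 0.75, 0.25]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the differences are squared, a prediction that is far off is punished much more than one that is slightly off, which suits a score that lives on a continuous scale.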

Note that we have only written a function to build the model. Let’s call it!

model = build_model(model_name, max_len)

Training.

Train until we see convergence. It proves that our model is learning and not just wasting our GPU. Luckily we have quite an intelligent model: it converges in 10 epochs.

OOPS!! Couldn’t capture the whole image along with the 10 epochs. Here’s the complete picture.

Now that we’re done with the training, let’s go ahead and plot the actual vs. predicted values to see if our model learnt something.

YAY!! We can see from the plot that our model is learning, and we can actually see it converge.

Final Step.

I took 20 datapoints and predicted them using our model.

From this we can see how close they are. The overlap shows that our model did a great job predicting the values.

That was one hell of a ride, and we’re back at the station. You read this whole thing and made it till the end. If you’d like to implement this yourself, tweak the model a bit and see if it does better.

FIND THE CODE BELOW

clone it => https://github.com/Raghavi02bala/U.S-Patent-Phrase-Matching.git
