Python: AI Language Processing "Like a GloVe"

 

"great game nice graphics" - nuanced and thought-provoking critic

Context

In this article, I'm going to walk through the process of using Python and the TensorFlow framework (via the Keras module) to sort user reviews of video games and gaming products into positive or negative categories based on the language each review contains. These are common tools and methods of Natural Language Processing (NLP) and sentiment analysis. Text-based problems are well suited to recurrent neural networks (RNNs), whose internal memory allows them to model sequential data; models of this type power many modern applications such as Google Translate. The specific approach used in this report is Long Short-Term Memory (LSTM) modeling- an extension of the RNN that can retain input over long ranges, such as the context carried across a paragraph of text. ¹ 

Along the way, it will be necessary to employ a method of word embedding- the system by which words are given a numeric representation and their similarity is quantified. For this I am using GloVe, or Global Vectors for Word Representation, a fantastic project from Stanford University that derives quantifiable relationships between English words based on the statistics of their co-occurrence². More on this implementation in the corresponding section!
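
To make the idea concrete before we start, here is a minimal sketch of what GloVe provides. It is illustrative only and assumes the glove.6B.100d.txt file used later in the Model Building section has been downloaded locally: every word maps to a dense vector, and the cosine similarity between two vectors quantifies how related the words are. The word pairs below are just examples; related pairs tend to score noticeably higher than unrelated ones.

#illustrative sketch: load the pretrained GloVe vectors and compare two words
import numpy as np

#read each line as a word followed by its 100 vector components
glove = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

#cosine similarity between two word vectors
def similarity(w1, w2):
    v1, v2 = glove[w1], glove[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

#a related pair should score noticeably higher than an unrelated one
print(similarity('game', 'player'))
print(similarity('game', 'refrigerator'))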

Below are the modules I'm using for the modeling itself as well as all the supporting processing and visualization tasks.

#pandas for dataframe management
import pandas as pd
#additional manipulation
import numpy as np
from numpy import zeros
from numpy import array
from numpy import asarray
#basic statistics
from statistics import median, mean
#regex for strings
import re
#plotting and wordclouds
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
#keras tokenizer and modeling
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.callbacks import EarlyStopping
#keras string padding
from tensorflow.keras.preprocessing.sequence import pad_sequences
#training and testing 
from sklearn.model_selection import train_test_split
#model metrics
from sklearn import metrics

Data Retrieval

The dataset for this tutorial comes from Jianmo Ni's collection of Amazon product reviews³. It comes as a .json file with one review per line, which we will read into a pandas dataframe before previewing. 

#read json file into pandas dataframe
df = pd.read_json("/Volumes/EXT128/WGU/Advanced Analytics/nlp/Video_Games_5.json",lines=True)
#investigate head
df.head()


#report dimensions
dims = df.shape
print('The dimensions of the original review set are',dims,'.')
The dimensions of the original review set are (497577, 12) .

Well there you have it! This dataframe preview gives us an idea of the extent of the original data and its character. There are almost 500,000 reviews along with miscellaneous metadata, as well as an attempt by Amazon to summarize each review with a few key words and a star rating. 

Data Processing and Exploration

There are a number of things to accomplish before we can begin modeling. The first thing I notice about the original data is that there are more variables than we need- for this project, I only care about the review text and the star rating, which are under the column names reviewText and overall respectively. The first task in the cleaning process will be to remove the columns that aren't of interest.

#remove undesired columns
df.drop(columns=['verified','reviewTime','reviewerID','asin','reviewerName','summary','unixReviewTime','vote','style','image'],inplace=True)

Now it's time to make the kind of decision analysts must make- remember that in a business situation, the needs of the stakeholders must be carefully understood before taking these steps. It has been established that this is a binary classification problem, and it seems prudent to consider 4 or 5 stars as positive sentiment and 1 or 2 stars as negative. That leaves us with the 3-star reviews- are they positive or negative? Any answer seems completely subjective. In my opinion, the presence of 3-star reviews adds noise to the model, so the decision for this project will be to remove those observations. Once that's done, I'm going to assign a value of 0 to the negative reviews and 1 to the positive ones. 

#remove 3-star reviews
df = df[df['overall'] != 3]
#assign binary outcome
df['overall'] = np.where(df['overall'] > 3, 1, 0)
#reset indexes for skips introduced by removing rows
df = df.reset_index()
df.drop(columns=['index'],inplace=True)

The next step also involves some decision making. Some degree of preprocessing is always required in NLP tasks, even with a tokenizer like the one in Keras and an embedding method like GloVe, both of which are equipped to some degree to handle raw text. Removing punctuation and numbers is pretty standard, along with making the text lowercase. Where I have seen differing opinions is the removal of stopwords, since the embedding process is typically equipped to handle them by assigning them appropriate weights⁴. As usual we have to make a decision, and the one I am going with here is to remove them as part of the cleaning procedure. I believe this step reduces noise in the model, reduces computational expense in the training process, and allows more meaningful patterns to show up in visualizations like the wordcloud. We'll instantiate a list of stopwords to filter for and also define a function to perform the basic preprocessing steps.

#define list of stopwords
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", 
             "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during",
             "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", 
             "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into",
             "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or",
             "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", 
             "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's",
             "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up",
             "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's",
             "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've",
             "your", "yours", "yourself", "yourselves"]
#define standardizer function
def standardize(text):
    #make text lowercase
    txt = text.lower()
    #remove stopwords
    txt = ' '.join([word for word in txt.split() if word not in stopwords])
    #remove punctuation and numbers
    txt = re.sub('[^a-z]', ' ', txt)
    #remove single characters
    txt = re.sub(r'(?:^| )\w(?:$| )', ' ', txt)
    #collapse multiple spaces
    txt = re.sub(r'\s+', ' ', txt)
    return txt
#cast reviews as string
df['reviewText'] = df['reviewText'].astype(str)
#list for review storage
lst = []
#standardize each review 
for review in df['reviewText']:
    lst.append(standardize(review))
#place list into dataframe
df['reviewText'] = lst
#allow display of full text
pd.options.display.max_colwidth = 5000
#review processed df contents
df.head(5)


Alright! That's much cleaner than before. The reviews have lost their syntactical structure but retained the key words a machine learning algorithm would be interested in. The positive examples, as well as the single visible negative review, make sense with respect to the words they feature. Let's generate the image featured at the top of this article: the wordcloud of positive review keywords. For comparison, I'm also going to generate the negative version. 

#import image for styling
mask = np.array(Image.open('/Volumes/EXT128/WGU/Advanced Analytics/nlp/Vector-Game-Controller-PNG-Clipart.png'))
#generate positive wordcloud
pos = WordCloud(background_color ='white',mask=mask,colormap='crest').generate(" ".join(i for i in df.reviewText[df.overall==1]))
#show
plt.imshow(pos)
plt.axis("off")
plt.show()

#generate negative wordcloud
neg = WordCloud(background_color ='white',mask=mask,colormap='rocket').generate(" ".join(i for i in df.reviewText[df.overall==0]))
#show
plt.imshow(neg)
plt.axis("off")
plt.show()

Note that by default the wordcloud module uses collocations- especially in the positive version, you're seeing two-word combinations you'd expect, such as 'great game' and 'video game', rather than a giant 'game' dominating the cloud. This is nice, since from the head call I'd imagine the word 'game' appears in just about every review. That said, the positive cloud is about what I would expect, but the negative one is not as loaded with typical negative language as I thought it would be. If you look carefully, there are words like 'bad' but also words like 'good'- the language of the negative reviews is overall more nuanced than I expected, and the benefit of LSTM is that it can often pick up on those nuances. A side note from the perspective of someone who has played video games for much of his life: it's interesting to see words like 'still' so visible- it reminds me of common complaints against major game franchises that release similar content year after year. 
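
If you'd rather see single words only, the collocation behavior can be switched off through the collocations argument of the WordCloud constructor. The snippet below is a quick variant of the positive cloud rather than one of the images shown in this article; it also drops the ubiquitous 'game' via the stopwords argument.

#variant of the positive cloud: single words only, with 'game' excluded
single = WordCloud(background_color='white', mask=mask, colormap='crest',
                   collocations=False, stopwords={'game'}).generate(" ".join(i for i in df.reviewText[df.overall==1]))
plt.imshow(single)
plt.axis("off")
plt.show()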

It's now time to start processing the text in ways specific to the embedding and modeling process. First, embedding. Text doesn't mean anything to computers, so we express the textual data in numeric form- a vector, specifically. When we eventually assign numbers to words, the resulting vectors must all have the same dimension in order to be used as inputs to a neural network. Since there will be very short and very long reviews, we have to decide on a maximum review length, which will also set the vector dimension. I'm going to devise a statistical justification for that maximum. Consider the distribution of review lengths:

#import basic statistics
from statistics import median, mean
#initialize list of sequence lengths
lengths = []
#loop through reviews, counting words and storing the counts
for review in df.reviewText:
    length = len(review.strip().split(" "))
    lengths.append(length)
#average review length
meanlen = mean(lengths)
print('The average review is', round(meanlen), 'words.')
#median review length
medlen = median(lengths)
print('The median review length is', medlen,'words.')
The average review is 67 words.
The median review length is 21 words.

#histogram of review length
pd.Series(lengths).hist(bins=100,range=[0,1000])
plt.title('Review Lengths Histogram')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.show()

#length quantiles
np.set_printoptions(suppress=True)
np.quantile(lengths,[0,0.25,0.5,0.75,1])
array([ 1., 6., 20., 65., 3232.])

It's clear that reviews of fewer than 100 words dominate, and specifically a length of 65 words marks the third quartile. Let's remove longer reviews from the analysis. We will keep the very short reviews, because they tend to contain single-word sentiments that are relatively simple for the model to pick up and learn. This is the final paring-down of the raw data set, and the following code reports the final number of reviews.

#fix maximum length
maxlen = 65
#subset raw reviews for maximum length
reduced_lengths = list(np.where(np.array(lengths) <= maxlen)[0])
rev_in = [df.reviewText[i] for i in reduced_lengths]
#create vector of sentiment labels
labels = [df['overall'][i] for i in reduced_lengths]
#concatenate into dataframe
DF = pd.DataFrame(list(zip(rev_in, labels)),
               columns =['review_text', 'sentiment'])
#verify final number of reviews
rev_tot = len(DF)
print("There are a total of",rev_tot,'reviews under study.')
There are a total of 336419 reviews under study.

This has been a pretty long preprocessing sequence- I'd say we're over halfway done! Now that the review length has been set, we're going to tokenize the texts with Keras. This is basically the penultimate NLP processing step- here, the individual words in each review are treated as the units that carry semantic meaning. This step splits the reviews into their tokens, which creates a structured vocabulary index that an ML algorithm can model on. Here goes!

#initialize tokenization 
tokenizer = Tokenizer()
#tokenize and build vocabulary set
tokenizer.fit_on_texts(rev_in)
#find dictionary length
vocab_len = len(tokenizer.word_index)
print('The final vocabulary size is', vocab_len, 'words.')
The final vocabulary size is 66776 words.

There you have it: 66,776 unique words across the whole body of 336,419 reviews. That's a pretty small vocabulary, but it's fair to conclude that this context is fairly restrictive in terms of what you can expect people to say. Now that a vocabulary index is established, I want to assign a sequence of indexes to each review and compare the output to the original text.

#create word-index sequences
seq = tokenizer.texts_to_sequences(DF.review_text)
#sample the first review vector
seq[0]

[1, 99, 70, 14, 1404, 1126, 2]

#examine standardized review index 0
DF.review_text[0]

'game bit hard get hang of great '

Right there we get an interesting sense of the most common words. As expected, comparing the sequence [1, 99, 70, 14, 1404, 1126, 2] to its source shows that 'game' is the single most frequent word in the dataset.
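
If you want to check that mapping yourself, the fitted tokenizer exposes it in both directions; a quick look (the expected values follow from the sample sequence above):

#the tokenizer maps words to frequency-ranked indexes and back
print(tokenizer.word_index['game'])   #expected: 1
print(tokenizer.index_word[2])        #expected: 'great', per the sample sequence above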

You'll recall how we set the maximum allowable review length to 65? It's now time to apply the real reasoning behind that choice by placing sequences like this one into vectors that all have the same length- where a review is too short to fill the vector, we pad the sequence with zeros. Appropriately, the process is called sequence padding, and Keras has a function just for this purpose.

#pad out word sequences length 65
pad = pad_sequences(seq, padding='post', maxlen=maxlen)
#compare output
pad[0]

array([   1,   99,   70,   14, 1404, 1126,    2,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
      dtype=int32)

Hopefully what we have been up to thus far is now a little clearer. I'm going to wrap up this section on preprocessing by preparing the train/test split with a 0.1 test proportion:

#import train/test split function
from sklearn.model_selection import train_test_split
#retrieve explainers and targets
X = pad
y = DF.sentiment
#call split function at 0.1 test size, set random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state= 43022)

Model Building

Thanks for reading this far! The writeup for this and the remaining sections is on the way! For now, I'll provide the code.

#initialize embedding dictionary
embeddings_dictionary = dict()
#access GloVe matrix
glove_file = open('/Volumes/EXT128/WGU/Advanced Analytics/nlp/glove/glove.6B.100d.txt', encoding="utf8")

#loop through GloVe database vocabulary
for line in glove_file:
    #split the line into tokens
    records = line.split()
    #capture word entry
    word = records[0]
    #capture the 100 word scores as an array
    vector_dimensions = asarray(records[1:], dtype='float32')
    #create dictionary entry with word:score pairs
    embeddings_dictionary[word] = vector_dimensions
#close file connection after loop completion
glove_file.close()

#initialize a matrix with the size of our vocabulary and 100 columns
embedding_matrix = zeros((vocab_len + 1, 100))
#loop through tokens
for word, index in tokenizer.word_index.items():
    #find each word in the vocabulary
    embedding_vector = embeddings_dictionary.get(word)
    #check to see if vocabulary space has scores 
    if embedding_vector is not None:
        #assign scores to words in the project vocabulary
        embedding_matrix[index] = embedding_vector

#initialize sequential model
model = Sequential()
#create embedding layer with the vocabulary size, embedding dimension, and input length
embedding_layer = Embedding(vocab_len + 1, 100, weights=[embedding_matrix], input_length=maxlen , trainable=False)
#implement embedding layer
model.add(embedding_layer)
#add LSTM layer with 128 neurons
model.add(LSTM(128,dropout = 0.5))
#add final classification layer
model.add(Dense(1, activation='sigmoid'))

#compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

#summarize model
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 65, 100)           6677700   
                                                                 
 lstm (LSTM)                 (None, 128)               117248    
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
=================================================================
Total params: 6,795,077
Trainable params: 117,377
Non-trainable params: 6,677,700
_________________________________________________________________
None
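
While the full writeup for this section is still on the way, here is a quick sanity check on where the parameter counts in that summary come from, worked out in plain Python (the LSTM figure is the standard four-gate count for 100-dimensional inputs and 128 units):

#reproduce the parameter counts reported by model.summary()
embed_params = (vocab_len + 1) * 100           #one frozen 100-dimensional vector per vocabulary index
lstm_params = 4 * ((100 + 128) * 128 + 128)    #four gates, each with input weights, recurrent weights, and a bias
dense_params = 128 + 1                         #one weight per LSTM unit plus a bias
print(embed_params, lstm_params, dense_params) #6677700 117248 129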

#import early stopping module
from tensorflow.keras.callbacks import EarlyStopping
#add early stop for successive increases in validation loss across epochs
es = EarlyStopping(monitor="val_loss",verbose=2,mode='min',patience=2)
#fit model with a maximum of 20 epochs, a validation split, and the early stopping criteria
history = model.fit(X_train, y_train, batch_size=128, epochs=20, verbose=1, validation_split=0.2, callbacks=[es])

Epoch 1/20
1893/1893 [==============================] - 258s 135ms/step - loss: 0.2429 - acc: 0.9072 - val_loss: 0.1846 - val_acc: 0.9310
Epoch 2/20
1893/1893 [==============================] - 268s 142ms/step - loss: 0.1920 - acc: 0.9256 - val_loss: 0.1706 - val_acc: 0.9356
Epoch 3/20
1893/1893 [==============================] - 270s 142ms/step - loss: 0.1775 - acc: 0.9316 - val_loss: 0.1810 - val_acc: 0.9405
Epoch 4/20
1893/1893 [==============================] - 268s 142ms/step - loss: 0.1679 - acc: 0.9353 - val_loss: 0.1424 - val_acc: 0.9474
Epoch 5/20
1893/1893 [==============================] - 279s 147ms/step - loss: 0.1608 - acc: 0.9382 - val_loss: 0.1521 - val_acc: 0.9481
Epoch 6/20
1893/1893 [==============================] - 263s 139ms/step - loss: 0.1559 - acc: 0.9402 - val_loss: 0.1398 - val_acc: 0.9488
Epoch 7/20
1893/1893 [==============================] - 263s 139ms/step - loss: 0.1516 - acc: 0.9419 - val_loss: 0.1319 - val_acc: 0.9509
Epoch 8/20
1893/1893 [==============================] - 261s 138ms/step - loss: 0.1480 - acc: 0.9431 - val_loss: 0.1298 - val_acc: 0.9505
Epoch 9/20
1893/1893 [==============================] - 270s 142ms/step - loss: 0.1447 - acc: 0.9444 - val_loss: 0.1273 - val_acc: 0.9524
Epoch 10/20
1893/1893 [==============================] - 274s 145ms/step - loss: 0.1420 - acc: 0.9462 - val_loss: 0.1277 - val_acc: 0.9522
Epoch 11/20
1893/1893 [==============================] - 264s 139ms/step - loss: 0.1397 - acc: 0.9467 - val_loss: 0.1249 - val_acc: 0.9536
Epoch 12/20
1893/1893 [==============================] - 262s 138ms/step - loss: 0.1366 - acc: 0.9476 - val_loss: 0.1252 - val_acc: 0.9533
Epoch 13/20
1893/1893 [==============================] - 263s 139ms/step - loss: 0.1356 - acc: 0.9475 - val_loss: 0.1228 - val_acc: 0.9533
Epoch 14/20
1893/1893 [==============================] - 226s 120ms/step - loss: 0.1333 - acc: 0.9490 - val_loss: 0.1221 - val_acc: 0.9541
Epoch 15/20
1893/1893 [==============================] - 161s 85ms/step - loss: 0.1332 - acc: 0.9493 - val_loss: 0.1234 - val_acc: 0.9550
Epoch 16/20
1893/1893 [==============================] - 150s 79ms/step - loss: 0.1321 - acc: 0.9496 - val_loss: 0.1280 - val_acc: 0.9548
Epoch 16: early stopping

Model Validation

#plot loss for training and validation
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Train and Validation Model Loss over Epochs')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()
#plot accuracy for training and validation
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Train and Validation Model Accuracy over Epochs')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()


#import confusion plotting module
from sklearn import metrics
#create prediction and truth vectors
y_pred = model.predict(X_test)
y_pred = np.where(y_pred > 0.5, 1,0)
y_true = y_test
#create confusion matrix and visualization
confusion_matrix = metrics.confusion_matrix(y_true, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = ['Unfavorable', 'Favorable'])
#plot matrix
cm_display.plot(colorbar=False,cmap = 'PuBuGn')
plt.show()
#create binary classification metrics
Accuracy = metrics.accuracy_score(y_true, y_pred)
Precision = metrics.precision_score(y_true, y_pred)
Sensitivity_recall = metrics.recall_score(y_true, y_pred)
Specificity = metrics.recall_score(y_true, y_pred, pos_label=0)
F1Score = metrics.f1_score(y_true, y_pred)
#display all binary classification metrics
print("--- Test Set Metrics ---")
print("Accuracy:", Accuracy)
print("Precision:", Precision)
print("Sensitivity:", Sensitivity_recall)
print("Specificity:", Specificity)
print("F1 Score:", F1Score)

--- Test Set Metrics ---
Accuracy: 0.9560073717377088
Precision: 0.964507225198896
Sensitivity: 0.9871381568014889
Specificity: 0.6923726428370391
F1 Score: 0.9756914788778661
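
As a cross-check (a small sketch, not part of the original metric code), the sensitivity and specificity above can be recovered directly from the raw counts in the confusion matrix computed earlier:

#recover sensitivity and specificity from the confusion matrix counts
tn, fp, fn, tp = confusion_matrix.ravel()
print("Sensitivity:", tp / (tp + fn))   #share of favorable reviews correctly identified
print("Specificity:", tn / (tn + fp))   #share of unfavorable reviews correctly identified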

References

¹Srivastava, P. Essentials of Deep Learning: Introduction to Long Short Term Memory. Analytics Vidhya. 05/2020. https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/

²Pennington, J. et al. GloVe: Global Vectors for Word Representation. Stanford. 08/2014. https://nlp.stanford.edu/projects/glove/

³McAuley, J. Amazon Product Data. UCSD. 05/2021. https://nijianmo.github.io/amazon/index.html

⁴Dernoncourt, F. The Effect of Stopword Filtering prior to Word Embedding. StackExchange. 01/2016. https://stats.stackexchange.com/questions/201372/the-effect-of-stopword-filtering-prior-to-word-embedding-training