Modelling

Build an NLP pipeline, transform the dataset, and predict review helpfulness.

Recap

Amazon reviews that are voted as helpful are assumed to be the reviews people naturally want to read more of. Reviews are an important part of a consumer's purchasing decision; they are just as important as, if not more important than, star ratings, which are not descriptive enough on their own. However, a review may lack a meaningful helpfulness score if it is new and has yet to gain traction, and a lack of participation and feedback from other purchasers can also leave it insufficiently scored. Predicting review helpfulness before other users even see a review would therefore let Amazon push these reviews up and provide a better buying experience. In short: a classification problem calls for classifier models, and text data calls for natural language processing.
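
The models below expect a binary helpful label on each review. The labelling step is not part of this section; the sketch below shows one plausible way to derive it from vote counts, where the DataFrame and column names (df_raw, helpful_votes, total_votes) and the 0.5 ratio threshold are illustrative assumptions rather than the project's actual logic.

from pyspark.sql import functions as F

# Hypothetical labelling step: keep reviews that received votes and mark a review
# as helpful (1) when at least half of its votes were helpful votes.
# `df_raw`, `helpful_votes`, `total_votes` and the 0.5 threshold are assumptions;
# `df_labeled` would play the role of the `df_clean` DataFrame used below.
df_labeled = (df_raw
              .filter(F.col("total_votes") > 0)
              .withColumn("helpful",
                          (F.col("helpful_votes") / F.col("total_votes") >= 0.5).cast("int")))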

NLP Pipeline

Using PySpark ML NLP Pipeline

First, process review text with the following pipeline.

# Spark NLP and Spark ML imports (a running Spark session with Spark NLP is assumed)
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, StopWordsCleaner, Stemmer
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF

# Document & tokenize
document_assembler = DocumentAssembler().setInputCol("review_text_l").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("review_words")

# Clean tokens: remove stopwords, stem, then convert annotations back to a plain string array
remover  = StopWordsCleaner().setInputCols(["review_words"]).setOutputCol("review_words_stop").setCaseSensitive(False).setStopWords(eng_stopwords)
stemmer  = Stemmer().setInputCols(["review_words_stop"]).setOutputCol("review_words_lemstem")
finisher = Finisher().setInputCols(["review_words_lemstem"]).setOutputCols(["token_features"]).setOutputAsArray(True).setCleanAnnotations(False)

# Run the annotation pipeline to produce the token_features array
pipeline_stem = Pipeline(stages=[document_assembler, tokenizer, remover, stemmer, finisher])
df_clean_nlp = pipeline_stem.fit(df_clean).transform(df_clean)

# Vectorize the cleaned tokens with TF-IDF
hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)
df_featurizedData = hashingTF.transform(df_clean_nlp)
idf = IDF(inputCol="rawFeatures", outputCol="features")
df_nlp = idf.fit(df_featurizedData).transform(df_featurizedData)
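
The cross-validation code that follows fits on a train DataFrame that is not created in this excerpt. A minimal sketch, assuming an 80/20 random split of the TF-IDF output (the split ratio and seed are assumptions):

# Assumed hold-out split of the vectorized data; `train` is used by the models below
train, test = df_nlp.randomSplit([0.8, 0.2], seed=42)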

Second, train a Logistic Regression and a Random Forest classifier on the processed, TF-IDF-vectorized dataset.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Logistic Regression on the TF-IDF features
lr = LogisticRegression(featuresCol='features', labelCol='helpful')

# Hyper-parameter grid: regularisation strength, elastic-net mixing, iteration count
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.3, 0.0])
             .addGrid(lr.elasticNetParam, [0.0])
             .addGrid(lr.maxIter, [20, 100])
             .build())

# Evaluator
evaluator = MulticlassClassificationEvaluator(labelCol='helpful', predictionCol="prediction")

# Create 3-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(train)

# Score the training set and print the main classification metrics
predictions = cvModel.transform(train)
print(evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "f1"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"}))
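
Before moving on to the Random Forest, it can be useful to see which hyper-parameter combination the cross-validation preferred. This inspection step is not in the original code, but relies only on standard CrossValidatorModel attributes:

# Mean evaluator metric for each paramGrid entry, and the winning model's parameters
print(cvModel.avgMetrics)
print(cvModel.bestModel.extractParamMap())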
from pyspark.ml.classification import RandomForestClassifier

# Random Forest classifier on the same TF-IDF features
rfc = RandomForestClassifier(impurity="gini", featuresCol='features', labelCol="helpful")

# Hyper-parameter grid: split criterion and number of bins
paramGrid = (ParamGridBuilder()
             .addGrid(rfc.impurity, ['gini', 'entropy'])
             .addGrid(rfc.maxBins, [32, 100])
             .build())

# Evaluator
evaluator = MulticlassClassificationEvaluator(labelCol='helpful', predictionCol="prediction")

# Create 3-fold CrossValidator
cv = CrossValidator(estimator=rfc, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(train)

# Score the training set and print the same metrics as for Logistic Regression
predictions = cvModel.transform(train)
print(evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "f1"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"}))
print(evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"}))
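
The metrics above are computed on the training data, which tends to overstate performance. A short sketch of scoring the hold-out split instead, assuming the test DataFrame from the earlier split:

# Evaluate the best cross-validated model on data it has not seen during training
test_predictions = cvModel.transform(test)
print(evaluator.evaluate(test_predictions, {evaluator.metricName: "f1"}))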

Using SparkNLP Pipeline With Sentence Encoders

Here, the raw review text is embedded with a pretrained sentence encoder and classified end-to-end in a single Spark NLP pipeline; two encoders are compared below.

Universal Sentence Encoder
# USE (Universal Sentence Encoder) sentence embedding
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach
from pyspark.ml import Pipeline

document = DocumentAssembler()\
    .setInputCol("review_text")\
    .setOutputCol("document")

# Pretrained USE model produces one embedding per document
embeddingsSentence = UniversalSentenceEncoder.load('tfhub_use_en_2.4.0_2.4_1587136330099')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Trainable classifier head on top of the sentence embeddings
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("helpful")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document,
        embeddingsSentence,
        classifierdl
    ])
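
The evaluation code below reads from a metrics DataFrame that is not built in this excerpt. A minimal sketch, assuming the pipeline is fit on the training split and scored on the hold-out split (both splits retain the raw review_text column):

# Assumed fit/score step: collect the true label and the predicted class for evaluation
use_model = use_clf_pipeline.fit(train)
metrics = use_model.transform(test).select('helpful', 'prediction.result')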
from sklearn.metrics import classification_report, accuracy_score

# The prediction annotation holds an array with a single string label; unpack and cast it
metrics_final = metrics.toPandas()
metrics_final['result'] = metrics_final['result'].apply(lambda x: x[0])
metrics_final['result'] = metrics_final['result'].astype('int')

print(classification_report(metrics_final.helpful, metrics_final.result))
print(accuracy_score(metrics_final.helpful, metrics_final.result))
Uncased BERT
# BERT sentence embedding (uncased base model)
from sparknlp.annotator import BertSentenceEmbeddings

document = DocumentAssembler()\
    .setInputCol("review_text")\
    .setOutputCol("document")

bert_cmlm = BertSentenceEmbeddings.load('sent_bert_base_uncased_en_2.6.0_2.4_1598346203624')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("helpful")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)

# Note: the original listed the USE stage here; the BERT embedding stage is bert_cmlm
bert_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_cmlm,
        classifierdl
    ])
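
As with the USE pipeline, the step that produces metrics is not shown in the excerpt; a minimal sketch using the same assumed splits:

# Assumed fit/score step for the BERT pipeline
bert_model = bert_clf_pipeline.fit(train)
metrics = bert_model.transform(test).select('helpful', 'prediction.result')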
from sklearn.metrics import classification_report, accuracy_score

# Unpack the single-element prediction array and cast to int, as before
metrics_final = metrics.toPandas()
metrics_final['result'] = metrics_final['result'].apply(lambda x: x[0])
metrics_final['result'] = metrics_final['result'].astype('int')

print(classification_report(metrics_final.helpful, metrics_final.result))
print(accuracy_score(metrics_final.helpful, metrics_final.result))