Data Preparation

Perform the necessary data preprocessing before modelling.


  1. Remove rows containing NaN values

  2. Keep reviews with total_votes > 10

df = df.filter(col('total_votes') > 10)
df.count()
  3. Combine review_headline with review_body into a single review_text column. The two columns have nulls in different rows, so filling nulls with empty strings and concatenating them minimizes information loss.
df = df.fillna("", "review_body")
df = df.fillna("", "review_headline")
df = df.withColumn('review_text', F.concat('review_headline', F.lit(" "), 'review_body'))
df.show(1, vertical = True, truncate = False)
  4. Create helpful_ratio and helpful columns for modelling.

    helpful_ratio = helpful_votes / total_votes

    helpful_ratio < 0.5: not helpful = 0

    helpful_ratio >= 0.5: helpful = 1

df = df.withColumn('helpful_ratio', F.col('helpful_votes') / F.col('total_votes'))
df = df.withColumn('helpful', F.when(F.col('helpful_ratio') < 0.5, 0).otherwise(1))
  5. Further clean the text with regex operations
from pyspark.sql.functions import col, lower, regexp_replace, split

def clean_text(c):
    c = lower(c)                                  # lowercase everything
    c = regexp_replace(c, '"', '')                # strip double quotes
    c = regexp_replace(c, r"^rt ", "")            # drop a leading "rt " marker
    c = regexp_replace(c, r"https?://\S+", "")    # remove URLs
    c = regexp_replace(c, r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", "")  # strip HTML tags and entities
    return c
df_clean = df.withColumn('review_text', clean_text(col("review_text")))
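As a quick sanity check, the same patterns can be exercised locally with Python's `re` module. This is a standalone mirror of `clean_text` for inspection only, not part of the Spark pipeline (Spark's `regexp_replace` uses Java regex, but these patterns behave the same in both engines):

```python
import re

def clean_text_local(s: str) -> str:
    """Mirror of the Spark clean_text UDF-style column expression."""
    s = s.lower()                                 # lowercase everything
    s = s.replace('"', '')                        # strip double quotes
    s = re.sub(r'^rt ', '', s)                    # drop a leading "rt " marker
    s = re.sub(r'https?://\S+', '', s)            # remove URLs
    s = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', s)  # HTML tags/entities
    return s

print(clean_text_local('Great <b>phone</b> &amp; charger! https://ex.ample/x "five stars"'))
```

Running it on a review containing markup, an entity, and a URL shows all three removed while the surrounding words survive.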