Assignment Requirements

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:

  • Content-Based Filtering
  • User-User Collaborative Filtering
  • Item-Item Collaborative Filtering

As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
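
For instance, a minimal tf-idf sketch of that more challenging route, assuming a hypothetical character vector summaries (one plot summary per movie) and the tm package:

# hedged sketch: build tf-idf item profiles from plot summaries
# `summaries` is a hypothetical character vector, one summary per movie
library(tm)
corpus <- VCorpus(VectorSource(summaries))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
profiles <- as.matrix(dtm) # rows = movies, columns = tf-idf term weights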

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.

Environment Prep

if (!require('recommenderlab')) { install.packages('recommenderlab'); library(recommenderlab) }
if (!require('ggplot2')) { install.packages('ggplot2'); library(ggplot2) }
if (!require('XML')) { install.packages('XML'); library(XML) }

Data Import

For this project we will use a subset of Dataset 2+ of the Jester dataset.

This dataset is used courtesy of Dr. Ken Goldberg.

Publication Citation: Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

To make the import easier, the .zip file has been downloaded ahead of time and the resulting .xls converted to a simple csv. We also use the joke-text .zip in the final step.
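
That one-time conversion might have looked like the sketch below; the .xls file name here is an assumption about the Dataset 2+ download.

# one-time prep sketch: convert the downloaded .xls to a headerless csv
library(readxl)
xls <- read_excel("jester_dataset2plus.xls", col_names = FALSE) # hypothetical file name
write.table(xls, "jesterfinal151cols.csv", sep = ",", row.names = FALSE, col.names = FALSE)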

jester <- read.csv("jesterfinal151cols.csv", header=F)
jester <- jester[,1:101] # drop the last 50 joke columns so the remaining 100 match our available joke text
Full <- dim(jester)

Subset

Our goal here is simply to get a manageable subset of the full dataset, one with a sparseness ratio that makes the basic modeling concepts easier to understand and implement.

Subset Method 1: User

Subset by the number of jokes each user rated.

# V1 holds the number of jokes each user rated; to keep the utility matrix
# manageable, keep only the slice of users whose rating count equals the cutoff
n <- .7 # the 70th-percentile cutoff keeps the matrix semi-sparse
cutoff <- quantile(jester$V1, n)[[1]]
temp <- jester[jester$V1 == cutoff,]

# Jester encodes missing ratings as 99; convert them to NA
temp[temp==99] <- NA
Sub1 <- dim(temp)

Subset Method 2: Jokes

Subset by removing columns with no ratings at all (all NA) and columns with no missing values; in this subset the latter catches only the V1 rating-count column, since no joke was rated by every user.

# https://stackoverflow.com/questions/11330138/find-columns-with-all-missing-values
allNAs <- sapply(temp, function(x) all(is.na(x))) # jokes nobody in this subset rated
#table(allNAs)
temp <- temp[,!allNAs]
noNAs <- sapply(temp, function(x) all(!is.na(x))) # columns with no NAs (here, only V1)
#table(noNAs)
utility <- temp[,!noNAs]
Sub2 <- dim(utility)
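
To check that the result really is only semi-sparse, we can compute the fraction of missing cells directly (output omitted):

# overall sparsity of the final utility matrix (fraction of NA cells)
mean(is.na(utility))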

Data Reduction Table

percentReduction <- round((1 - Sub2/Full) * 100, 0)
review <- t(data.frame(Full, Sub1, Sub2, percentReduction, 
                       row.names=c("Users", "Jokes")))
knitr::kable(review)
                   Users   Jokes
Full               50692     101
Sub1                 447     101
Sub2                 447      79
percentReduction      99      22

Final Subset Review

populated <- colSums(!is.na(utility))
quantile(populated)
##    0%   25%   50%   75%  100% 
##   1.0  17.5  41.0 110.0 335.0
qplot(populated, bins = 50) + ggtitle("Distribution of Number of Times a Joke was Rated in Final Utility Matrix")

Data Preparation

In this section we will convert our utility matrix into a “realRatingMatrix” object for use with the “recommenderlab” package, do some initial data exploration, and split our data into train and test sets to prepare for modeling.

RecommenderLab Matrix

jesterUtility <- as(as.matrix(utility), "realRatingMatrix")
rm(jester, temp, review) #clean house
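
A few recommenderlab accessors are handy for sanity-checking the converted object (output omitted):

nratings(jesterUtility)           # total number of non-missing ratings
summary(rowCounts(jesterUtility)) # ratings per user
summary(colCounts(jesterUtility)) # ratings per joke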

Data Exploration

mean_rating <- colMeans(jesterUtility, na.rm = T)
quantile(mean_rating)
##        0%       25%       50%       75%      100% 
## -4.456250  1.251562  2.145089  2.885979  4.872396
goodrating <- quantile(mean_rating, .5)
qplot(mean_rating) + ggtitle("Distribution of Mean Joke Ratings") + geom_vline(xintercept = goodrating, col='orange')

image(jesterUtility[1:80,], main="Heatmap: First 80 Users v Jokes")

Modeling

K-Fold Cross Validation

We will use 10-fold cross-validation, holding out a single rating per test user (an all-but-1 given protocol), to compare the different models.

# 10-fold cross-validation; given = -1 holds out one rating per test user
set.seed(643)
eval_sets <- evaluationScheme(data=jesterUtility, 
                              method = "cross-validation",
                              given = -1,
                              goodRating = goodrating, # the median of the per-joke mean ratings
                              k = 10)
eval_sets
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=2.145089
## Data set: 447 x 79 rating matrix of class 'realRatingMatrix' with 6207 ratings.
getData(eval_sets, "train")
## 396 x 79 rating matrix of class 'realRatingMatrix' with 5486 ratings.
getData(eval_sets, "known")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 670 ratings.
getData(eval_sets, "unknown")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 51 ratings.
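
As a sanity check on the all-but-1 protocol, every test user should contribute exactly one rating to the “unknown” set (output omitted):

# with given = -1, each test user has exactly one held-out rating
table(rowCounts(getData(eval_sets, "unknown")))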

Comparing Multiple Models

We’ll train Collaborative Filtering models, varying three factors:

  • Item Based or User Based
  • Similarity Measures of cosine or pearson
  • Normalization by centering (the default) or Z-score

For simplicity, we will consider content-based filtering, binarization, and neighborhood size selection out of scope here.

models_to_evaluate <- list(
    IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
    IBCF_pear = list(name = "IBCF", param = list(method = "pearson")),
    IBCF_cos_Z = list(name = "IBCF", param = list(
        normalize = "Z-score", method="cosine")),
    UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
    UBCF_pear = list(name = "UBCF", param = list(method = "pearson")),
    UBCF_cos_Z = list(name = "UBCF", param = list(
        normalize = "Z-score", method="cosine")),
    random = list(name = "RANDOM", param =  NULL)
)

n_recommendations <- c(1, 3, 5, 10, 15, 25, 40)

list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations, 
                         type = "topNList")

evalNums <- evaluate(x = eval_sets, 
               method = models_to_evaluate, 
               type="ratings")

ROC | AUC

plot(list_results, annotate=1, legend = "bottomright") 
title("ROC curve")

Precision | Recall

plot(list_results, "prec/rec", annotate = 1, legend = "topright")
title("Precision-recall")

Get RMSE for models

evalTable <- avg(evalNums)
evalTable <- t(sapply(evalTable, rbind))
colnames(evalTable) <- c("RMSE", "MSE", "MAE")
knitr::kable(evalTable)
                 RMSE        MSE       MAE
IBCF_cos     4.086914   16.83027  3.026274
IBCF_pear    3.944595   15.72619  2.915857
IBCF_cos_Z   4.155497   17.36644  3.022101
UBCF_cos     3.922120   15.52978  2.914728
UBCF_pear    3.893935   15.31391  2.862430
UBCF_cos_Z   3.885944   15.25021  2.863518
random       5.241212   27.54999  3.809554

Results

Across all models, User-Based Collaborative Filtering with cosine similarity and Z-score normalization scored at or near the top on AUC, precision vs. recall, and RMSE.

Let’s use this model to recommend a single joke to a single user.

Single User: Pred + Fav

eval_recommender <- Recommender(data=getData(eval_sets, "train"), 
                                method = "UBCF", 
                                parameter = list(method="cosine", normalize="Z-score"))

# extract a top-1 prediction for one user using the best model
mypredict <- predict(object = eval_recommender, 
                     newdata = getData(eval_sets, "known"), 
                     type = "topNList", n=1)
recc_user_1 <- mypredict@items[[1]]
user1_topjoke <- mypredict@itemLabels[recc_user_1]
recjokeRating <- mypredict@ratings[[1]]

# what was this user's highest-rated known joke?
test <- normalize(getData(eval_sets, "known"), method="Z-score")
test <- as(test, 'matrix')
user1_favjoke <- names(which.max(test[1,]))
favjokeRating <- max(test[1,], na.rm=T)
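
Putting the two side by side (values depend on the fold):

# compare the recommendation to the user's known favorite, both on the normalized scale
sprintf("Recommended: %s (predicted %.2f) vs. favorite: %s (rated %.2f)",
        user1_topjoke, recjokeRating, user1_favjoke, favjokeRating)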

Get Joke Helper Function

getJoke <- function(recnum) {
    # read a joke's HTML straight from the zip file, in memory, without unzipping to disk
    recnum <- as.numeric(gsub("[A-Z]", "", recnum))
    filename <- paste0('jokes/init', recnum, '.html')
    joke <- readLines(unz('jester_dataset_1_joke_texts.zip', filename))
    # inspired by: http://www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
    html <- htmlTreeParse(joke, useInternal = TRUE)
    text <- unlist(xpathApply(html, '//td', xmlValue))
    text <- gsub('\n', ' ', text)        # flatten line breaks
    text <- sub("^[^a-zA-Z]*", "", text) # strip leading non-letters
    text <- gsub('  ', ' ', text)        # collapse double spaces
    text <- gsub("\\\"", "'", text)      # swap escaped quotes for apostrophes
    paste(text, collapse = ' ')          # return the joke as a single string
}
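
With the helper in place, the texts behind both joke IDs can be pulled straight from the zip (output omitted):

# print the recommended joke and the user's known favorite
cat(getJoke(user1_topjoke), "\n\n", getJoke(user1_favjoke))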