Assignment Requirements

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:

  • Content-Based Filtering
  • User-User Collaborative Filtering
  • Item-Item Collaborative Filtering

As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
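
For instance, a minimal tf-idf sketch of that more challenging route, assuming a hypothetical character vector summaries (one plot summary per movie) and the tm package:

# hedged sketch: build tf-idf item profiles from plot summaries
# `summaries` is a hypothetical character vector, one summary per movie
library(tm)
corpus <- VCorpus(VectorSource(summaries))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
profiles <- as.matrix(dtm) # rows = movies, columns = tf-idf term weights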

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.

Environment Prep

if (!require('recommenderlab')) { install.packages('recommenderlab'); library(recommenderlab) }
if (!require('ggplot2')) { install.packages('ggplot2'); library(ggplot2) }
if (!require('XML')) { install.packages('XML'); library(XML) }

Data Import

For this project we will use a subset of Dataset 2+ of the Jester dataset.

This dataset is used courtesy of Dr. Ken Goldberg.

Publication Citation: Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

To make the import easier, the .zip file has been downloaded ahead of time and the resulting .xls converted to a simple csv. We also use the joke-text .zip in the final step.
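
That one-time conversion might have looked like the sketch below; the .xls file name here is an assumption about the Dataset 2+ download.

# one-time prep sketch: convert the downloaded .xls to a headerless csv
library(readxl)
xls <- read_excel("jester_dataset2plus.xls", col_names = FALSE) # hypothetical file name
write.table(xls, "jesterfinal151cols.csv", sep = ",", row.names = FALSE, col.names = FALSE)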

jester <- read.csv("jesterfinal151cols.csv", header=F)
jester <- jester[,1:101] # drop the last 50 joke columns so the remaining 100 match our available joke text
Full <- dim(jester)

Subset

Our goal here is simply to get a manageable subset of the full dataset, one with a sparseness ratio that makes the basic modeling concepts easier to understand and implement.

Subset Method 1: User

Subset by the number of jokes each user rated.

# V1 holds the number of jokes each user rated; to keep the utility matrix
# manageable, keep only the slice of users whose rating count equals the cutoff
n <- .7 # the 70th-percentile cutoff keeps the matrix semi-sparse
cutoff <- quantile(jester$V1, n)[[1]]
temp <- jester[jester$V1 == cutoff,]

# Jester encodes missing ratings as 99; convert them to NA
temp[temp==99] <- NA
Sub1 <- dim(temp)

Subset Method 2: Jokes

Subset by removing columns with no ratings at all (all NA) and columns with no missing values; in this subset the latter catches only the V1 rating-count column, since no joke was rated by every user.

# https://stackoverflow.com/questions/11330138/find-columns-with-all-missing-values
allNAs <- sapply(temp, function(x) all(is.na(x))) # jokes nobody in this subset rated
#table(allNAs)
temp <- temp[,!allNAs]
noNAs <- sapply(temp, function(x) all(!is.na(x))) # columns with no NAs (here, only V1)
#table(noNAs)
utility <- temp[,!noNAs]
Sub2 <- dim(utility)
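
To check that the result really is only semi-sparse, we can compute the fraction of missing cells directly (output omitted):

# overall sparsity of the final utility matrix (fraction of NA cells)
mean(is.na(utility))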

Data Reduction Table

percentReduction <- round((1 - Sub2/Full) * 100, 0)
review <- t(data.frame(Full, Sub1, Sub2, percentReduction, 
                       row.names=c("Users", "Jokes")))
knitr::kable(review)
                   Users   Jokes
Full               50692     101
Sub1                 447     101
Sub2                 447      79
percentReduction      99      22

Final Subset Review

populated <- colSums(!is.na(utility))
quantile(populated)
##    0%   25%   50%   75%  100% 
##   1.0  17.5  41.0 110.0 335.0
qplot(populated, bins = 50) + ggtitle("Distribution of Number of Times a Joke was Rated in Final Utility Matrix")

Data Preparation

In this section we will convert our utility matrix into a “realRatingMatrix” object for use with the “recommenderlab” package, do some initial data exploration, and split our data into train and test sets to prepare for modeling.

RecommenderLab Matrix

jesterUtility <- as(as.matrix(utility), "realRatingMatrix")
rm(jester, temp, review) #clean house
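
A few recommenderlab accessors are handy for sanity-checking the converted object (output omitted):

nratings(jesterUtility)           # total number of non-missing ratings
summary(rowCounts(jesterUtility)) # ratings per user
summary(colCounts(jesterUtility)) # ratings per joke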

Data Exploration

mean_rating <- colMeans(jesterUtility, na.rm = T)
quantile(mean_rating)
##        0%       25%       50%       75%      100% 
## -4.456250  1.251562  2.145089  2.885979  4.872396
goodrating <- quantile(mean_rating, .5)
qplot(mean_rating) + ggtitle("Distribution of Mean Joke Ratings") + geom_vline(xintercept = goodrating, col='orange')

image(jesterUtility[1:80,], main="Heatmap: First 80 Users v Jokes")

Modeling

K-Fold Cross Validation

We will use 10-fold cross-validation, holding out a single rating per test user (an all-but-1 given protocol), to compare the different models.

# 10-fold cross-validation; given = -1 holds out one rating per test user
set.seed(643)
eval_sets <- evaluationScheme(data=jesterUtility, 
                              method = "cross-validation",
                              given = -1,
                              goodRating = goodrating, # the median of the per-joke mean ratings
                              k = 10)
eval_sets
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=2.145089
## Data set: 447 x 79 rating matrix of class 'realRatingMatrix' with 6207 ratings.
getData(eval_sets, "train")
## 396 x 79 rating matrix of class 'realRatingMatrix' with 5486 ratings.
getData(eval_sets, "known")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 670 ratings.
getData(eval_sets, "unknown")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 51 ratings.
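
As a sanity check on the all-but-1 protocol, every test user should contribute exactly one rating to the “unknown” set (output omitted):

# with given = -1, each test user has exactly one held-out rating
table(rowCounts(getData(eval_sets, "unknown")))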

Comparing Multiple Models

We’ll train Collaborative Filtering models, varying three factors:

  • Item Based or User Based
  • Similarity Measures of cosine or pearson
  • Normalization by centering (the default) or Z-score

For simplicity, we will consider content-based filtering, binarization, and neighborhood size selection out of scope here.

models_to_evaluate <- list(
    IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
    IBCF_pear = list(name = "IBCF", param = list(method = "pearson")),
    IBCF_cos_Z = list(name = "IBCF", param = list(
        normalize = "Z-score", method="cosine")),
    UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
    UBCF_pear = list(name = "UBCF", param = list(method = "pearson")),
    UBCF_cos_Z = list(name = "UBCF", param = list(
        normalize = "Z-score", method="cosine")),
    random = list(name = "RANDOM", param =  NULL)
)

n_recommendations <- c(1, 3, 5, 10, 15, 25, 40)

list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations, 
                         type = "topNList")

evalNums <- evaluate(x = eval_sets, 
               method = models_to_evaluate, 
               type="ratings")

ROC | AUC

plot(list_results, annotate=1, legend = "bottomright") 
title("ROC curve")

Precision | Recall

plot(list_results, "prec/rec", annotate = 1, legend = "topright")
title("Precision-recall")

Get RMSE for models

evalTable <- avg(evalNums)
evalTable <- t(sapply(evalTable, rbind))
colnames(evalTable) <- c("RMSE", "MSE", "MAE")
knitr::kable(evalTable)
                 RMSE        MSE       MAE
IBCF_cos     4.086914   16.83027  3.026274
IBCF_pear    3.944595   15.72619  2.915857
IBCF_cos_Z   4.155497   17.36644  3.022101
UBCF_cos     3.922120   15.52978  2.914728
UBCF_pear    3.893935   15.31391  2.862430
UBCF_cos_Z   3.885944   15.25021  2.863518
random       5.241212   27.54999  3.809554

Results

Across all models, User-Based Collaborative Filtering with cosine similarity and Z-score normalization scored at or near the top on AUC, precision vs. recall, and RMSE.

Let’s use this model to recommend a single joke to a single user.

Single User: Pred + Fav

eval_recommender <- Recommender(data=getData(eval_sets, "train"), 
                                method = "UBCF", 
                                parameter = list(method="cosine", normalize="Z-score"))

# extract a top-1 prediction for one user using the best model
mypredict <- predict(object = eval_recommender, 
                     newdata = getData(eval_sets, "known"), 
                     type = "topNList", n=1)
recc_user_1 <- mypredict@items[[1]]
user1_topjoke <- mypredict@itemLabels[recc_user_1]
recjokeRating <- mypredict@ratings[[1]]

# what was this user's highest-rated known joke?
test <- normalize(getData(eval_sets, "known"), method="Z-score")
test <- as(test, 'matrix')
user1_favjoke <- names(which.max(test[1,]))
favjokeRating <- max(test[1,], na.rm=T)
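
Putting the two side by side (values depend on the fold):

# compare the recommendation to the user's known favorite, both on the normalized scale
sprintf("Recommended: %s (predicted %.2f) vs. favorite: %s (rated %.2f)",
        user1_topjoke, recjokeRating, user1_favjoke, favjokeRating)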

Get Joke Helper Function

getJoke <- function(recnum) {
    # read a joke's HTML straight from the zip file, in memory, without unzipping to disk
    recnum <- as.numeric(gsub("[A-Z]", "", recnum))
    filename <- paste0('jokes/init', recnum, '.html')
    joke <- readLines(unz('jester_dataset_1_joke_texts.zip', filename))
    # inspired by: http://www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
    html <- htmlTreeParse(joke, useInternal = TRUE)
    text <- unlist(xpathApply(html, '//td', xmlValue))
    text <- gsub('\n', ' ', text)        # flatten line breaks
    text <- sub("^[^a-zA-Z]*", "", text) # strip leading non-letters
    text <- gsub('  ', ' ', text)        # collapse double spaces
    text <- gsub("\\\"", "'", text)      # swap escaped quotes for apostrophes
    paste(text, collapse = ' ')          # return the joke as a single string
}
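
With the helper in place, the texts behind both joke IDs can be pulled straight from the zip (output omitted):

# print the recommended joke and the user's known favorite
cat(getJoke(user1_topjoke), "\n\n", getJoke(user1_favjoke))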