The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.
For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/], or another dataset of your choosing. Implement at least two different recommendation algorithms.
As an example of implementing a content-based recommender, you could build item profiles for a subset of MovieLens movies by scraping http://www.imdb.com/ or by using the API at https://www.omdbapi.com/ (which recently instituted a small monthly fee). A more challenging approach would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
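As a rough illustration of that idea, the sketch below builds tf-idf item profiles with the tm package and computes item-item cosine similarities. The movies data frame (with title and plot columns) is a placeholder for whatever summaries you scrape; it is not used elsewhere in this report.
library(tm)
# `movies` is a hypothetical data frame with columns `title` and `plot`
corpus <- VCorpus(VectorSource(movies$plot))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# tf-idf weighted document-term matrix: one row per movie = its content profile
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
profiles <- as.matrix(dtm)
rownames(profiles) <- movies$title
# cosine similarity between item profiles drives the content-based recommendations
norms <- sqrt(rowSums(profiles^2))
item_sim <- (profiles %*% t(profiles)) / (norms %o% norms)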
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.
if (!require('recommenderlab')) {install.packages('recommenderlab'); library(recommenderlab)}
if (!require('ggplot2')) {install.packages('ggplot2'); library(ggplot2)}
if (!require('XML')) {install.packages('XML'); library(XML)}
For this project we will use a subset of Dataset 2+ of the Jester dataset.
This dataset is used courtesy of Dr. Ken Goldberg.
Publication Citation: Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
To make it easier to import, the .zip file was downloaded and the resulting .xls converted to a simple csv. We also use the joke-text .zip in the final step.
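For reproducibility, the one-time conversion looked roughly like the sketch below; the .xls file name is an assumption (chosen to match the csv used here), so adjust it to whatever the downloaded archive actually contains.
# one-time prep (not run in this report): convert the Dataset 2+ spreadsheet to csv
# the .xls file name is an assumption; unzip the archive from
# http://eigentaste.berkeley.edu/dataset/ and point read_excel at the extracted file
xls <- readxl::read_excel("jesterfinal151cols.xls", col_names = FALSE)
write.table(xls, "jesterfinal151cols.csv", sep = ",", row.names = FALSE, col.names = FALSE)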
jester <- read.csv("jesterfinal151cols.csv", header=F)
jester <- jester[,1:101] # keep V1 (each user's rating count) plus the first 100 jokes, which match the available joke text
Full <- dim(jester)
Our goal here is simply to get a manageable subset of the full dataset, but with a reasonable sparseness ratio, so the basic modeling concepts are easier to understand and implement.
Subset by the number of jokes a user rated.
# to keep the utility matrix manageable, keep only users whose rating count equals the cutoff (the 70th percentile)
n <- .7 # keeps the matrix semi-sparse
cutoff <- quantile(jester$V1, n)[[1]]
temp <- jester[jester$V1 == cutoff,]
# change 99s (default NA) to NAs
temp[temp==99] <- NA
Sub1 <- dim(temp)
Subset further by removing columns (jokes) with no ratings at all and columns rated by every remaining user; the latter step also drops the V1 rating-count column.
# https://stackoverflow.com/questions/11330138/find-columns-with-all-missing-values
allNAs <- sapply(temp, function(x) all(is.na(x)))
#table(allNAs)
temp <- temp[,!allNAs]
noNAs <- sapply(temp, function(x) all(!is.na(x)))
#table(noNAs)
utility <- temp[,!noNAs]
Sub2 <- dim(utility)
percentReduction <- round((1 - Sub2/Full) * 100, 0)
review <- t(data.frame(Full, Sub1, Sub2, percentReduction,
row.names=c("Users", "Jokes")))
knitr::kable(review)
|                 | Users| Jokes|
|:----------------|-----:|-----:|
|Full             | 50692|   101|
|Sub1             |   447|   101|
|Sub2             |   447|    79|
|percentReduction |    99|    22|
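As a quick check on the sparseness we were aiming for (a small addition, not part of the original subsetting steps), the share of empty cells in the final utility matrix can be computed directly:
# proportion of cells in the utility matrix that are NA (unrated)
sparsity <- mean(is.na(utility))
round(sparsity, 3)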
populated <- colSums(!is.na(utility))
quantile(populated)
## 0% 25% 50% 75% 100%
## 1.0 17.5 41.0 110.0 335.0
qplot(populated, bins = 50) + ggtitle("Distribution of Number of Times a Joke was Rated in Final Utility Matrix")
In this section we will convert our utility matrix into a “realRatingMatrix” object for use with the “recommenderlab” package, do some initial data exploration, and split our data into train and test sets to prepare for modeling.
jesterUtility <- as(as.matrix(utility), "realRatingMatrix")
rm(jester, temp, review) #clean house
mean_rating <- colMeans(jesterUtility, na.rm = T)
quantile(mean_rating)
## 0% 25% 50% 75% 100%
## -4.456250 1.251562 2.145089 2.885979 4.872396
goodrating <- quantile(mean_rating, .5)
qplot(mean_rating) + ggtitle("Distribution of Joke Rating") + geom_vline(xintercept = goodrating, col='orange')
image(jesterUtility[1:80,], main="Heatmap: First 80 Users v Jokes")
We will use 10-fold cross-validation, holding out one item per test user (“all-but-1”, i.e. leave-one-out at the item level), to compare the different models.
# 10-fold cross-validation, withholding 1 item per test user for evaluation (given = -1)
set.seed(643)
eval_sets <- evaluationScheme(data=jesterUtility,
method = "cross-validation",
given = -1,
goodRating = goodrating, # the median of the per-joke mean ratings
k = 10)
eval_sets
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=2.145089
## Data set: 447 x 79 rating matrix of class 'realRatingMatrix' with 6207 ratings.
getData(eval_sets, "train")
## 396 x 79 rating matrix of class 'realRatingMatrix' with 5486 ratings.
getData(eval_sets, "known")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 670 ratings.
getData(eval_sets, "unknown")
## 51 x 79 rating matrix of class 'realRatingMatrix' with 51 ratings.
We’ll train collaborative filtering models with an eye toward 3 factors: the algorithm (item-based vs. user-based), the similarity measure (cosine vs. Pearson), and the normalization (default centering vs. Z-score).
For simplicity, we will consider content-based filtering, binarization, and neighborhood size selection out of scope here, aside from the brief sketch below.
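For reference, here is an unevaluated sketch of how the binarization and neighborhood-size variations could be folded into the same comparison; binarize() and the nn / k neighborhood parameters are standard recommenderlab options, and the model names are only illustrative.
# not evaluated in this report: binarized ratings and explicit neighborhood sizes
jesterBinary <- binarize(jesterUtility, minRating = goodrating) # TRUE where rating >= goodrating
more_models <- list(
UBCF_cos_nn10 = list(name = "UBCF", param = list(method = "cosine", nn = 10)),
UBCF_cos_nn50 = list(name = "UBCF", param = list(method = "cosine", nn = 50)),
IBCF_cos_k10 = list(name = "IBCF", param = list(method = "cosine", k = 10))
)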
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
IBCF_pear = list(name = "IBCF", param = list(method = "pearson")),
IBCF_cos_Z = list(name = "IBCF", param = list(
normalize = "Z-score", method="cosine")),
UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
UBCF_pear = list(name = "UBCF", param = list(method = "pearson")),
UBCF_cos_Z = list(name = "UBCF", param = list(
normalize = "Z-score", method="cosine")),
random = list(name = "RANDOM", param = NULL)
)
n_recommendations <- c(1, 3, 5, 10, 15, 25, 40)
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations,
type = "topNList")
evalNums <- evaluate(x = eval_sets,
method = models_to_evaluate,
type="ratings")
plot(list_results, annotate=1, legend = "bottomright")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "topright")
title("Precision-recall")
evalTable <- avg(evalNums)
evalTable <- t(sapply(evalTable, rbind))
colnames(evalTable) <- c("RMSE", "MSE", "MAE")
knitr::kable(evalTable)
|           |     RMSE|      MSE|      MAE|
|:----------|--------:|--------:|--------:|
|IBCF_cos   | 4.086914| 16.83027| 3.026274|
|IBCF_pear  | 3.944595| 15.72619| 2.915857|
|IBCF_cos_Z | 4.155497| 17.36644| 3.022101|
|UBCF_cos   | 3.922120| 15.52978| 2.914728|
|UBCF_pear  | 3.893935| 15.31391| 2.862430|
|UBCF_cos_Z | 3.885944| 15.25021| 2.863518|
|random     | 5.241212| 27.54999| 3.809554|
Across all models, user-based collaborative filtering using cosine similarity with Z-score normalization (UBCF_cos_Z) had high marks in AUC, precision vs. recall, and RMSE.
Let’s use this model to recommend a single joke to a single user.
eval_recommender <- Recommender(data=getData(eval_sets, "train"), method = "UBCF", parameter = list(method="cosine", normalize="Z-score"))
# extract prediction for one user using best model
mypredict <- predict(object = eval_recommender, newdata = getData(eval_sets, "known"), type = "topNList", n=1)
recc_user_1 <- mypredict@items[[1]]
user1_topjoke <- mypredict@itemLabels[recc_user_1]
recjokeRating <- mypredict@ratings[[1]]
# what was this user's highest-rated joke among their known ratings?
test <- normalize(getData(eval_sets, "known"), method="Z-score")
test <- as(test, 'matrix')
user1_favjoke <- names(which.max(test[1,] ))
favjokeRating <- max(test[1,], na.rm=T)
getJoke <- function(recnum) {
  # read the joke's HTML directly from the zip archive, in memory, without unzipping to disk
  recnum <- as.numeric(gsub("[A-Z]", "", recnum)) # e.g. "V63" -> 63
  filename <- paste0('jokes/init', recnum, '.html')
  joke <- readLines(unz('jester_dataset_1_joke_texts.zip', filename))
  # inspired by: http://www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
  html <- htmlTreeParse(joke, useInternal = TRUE)
  text <- unlist(xpathApply(html, '//td', xmlValue))
  text <- gsub('\n', ' ', text)          # drop line breaks
  text <- sub("^[^a-zA-Z]*", "", text)   # strip leading non-letter characters
  text <- gsub(' +', ' ', text)          # collapse runs of spaces
  text <- gsub("\\\"", "'", text)        # swap double quotes for single quotes
  paste(text, collapse = ' ')            # return the joke as a single string
}
print(paste0("Favorite Joke: ", user1_favjoke, " | Normalized Rating: ", favjokeRating))
## [1] "Favorite Joke: V36 | Normalized Rating: 1.81023781447643"
print(getJoke(user1_favjoke))
## [1] " A guy walks into a bar, orders a beer and says to the bartender, 'Hey, I got this great Polish Joke...' The barkeep glares at him and says in a warning tone of voice: 'Before you go telling that joke you better know that I'm Polish, both bouncers are Polish and so are most of my customers' 'Okay' says the customer,'I'll tell it very slowly.' "
print(paste0("Recommended Joke: ", user1_topjoke, " | Normalized Rating Prediction: ", recjokeRating))
## [1] "Recommended Joke: V63 | Normalized Rating Prediction: 0.790665187194103"
print(getJoke(user1_topjoke))
## [1] " An engineer, a physicist and a mathematician are sleeping in a room. There is a fire in the room. The engineer wakes up, sees the fire, picks up the bucket of water and douses the fire and goes back to sleep. Again there is fire in the room. This time, the physicist wakes up, notices the bucket, fills it with water, calculates the optimal trajectory and douses the fire in minimum amount of water and goes back to sleep. Again there is fire. This time the mathematician wakes up. He looks at the fire, looks at the bucket and the water and exclaims, 'A solution exists' and goes back to sleep. "