On the Difficulty of Evaluating Baselines: A Study on Recommender Systems

2019-05-04

Steffen Rendle, Li Zhang, Yehuda Koren

Code

github.com/srendle/libfm (in paper)
github.com/tohtsky/myFM

Abstract

Numerical evaluations with comparisons to baselines play a central role when judging research in recommender systems. In this paper, we show that running baselines properly is difficult. We demonstrate this issue on two extensively studied datasets. First, we show that results for baselines that have been used in numerous publications over the past five years for the Movielens 10M benchmark are suboptimal. With a careful setup of a vanilla matrix factorization baseline, we are not only able to improve upon the reported results for this baseline but even outperform the reported results of any newly proposed method. Secondly, we recap the tremendous effort that was required by the community to obtain high quality results for simple methods on the Netflix Prize. Our results indicate that empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community.

Tasks

Collaborative Filtering, Recommendation Systems

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
MovieLens 10M	Bayesian timeSVD++ flipped	RMSE	0.75	—	Unverified
MovieLens 10M	Bayesian timeSVD++	RMSE	0.75	—	Unverified
MovieLens 10M	Bayesian SVD++	RMSE	0.76	—	Unverified
MovieLens 10M	SGD MF	RMSE	0.77	—	Unverified
MovieLens 10M	U-RBM	RMSE	0.82	—	Unverified
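
The table reports RMSE (root mean squared error), where lower is better. As a minimal sketch of how this metric is computed, assuming plain Python lists of predicted and observed ratings: the function name and the sample ratings below are illustrative only and are not taken from the paper, libFM, or myFM.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the squared differences, then take the square root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on a 1-5 scale, invented for illustration only.
example_predicted = [3.5, 4.0, 2.0, 5.0]
example_actual = [4.0, 4.0, 1.0, 4.5]
print(rmse(example_predicted, example_actual))
```

Because RMSE squares each error before averaging, a few large misses raise the score more than many small ones, which is one reason differences in the second decimal place of the table can reflect substantial tuning effort.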
