|Keywords: recommender system; experimental design; algorithm evaluation; reflection and discussion
|Details Matter: Revisiting Experimental Settings in Recommender Systems
|SHI Shaoyun,WANG Chenyang,MA Weizhi,ZHANG Min,LIU Yiqun,MA Shaoping
|Department of Computer Science and Technology, Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
|With the development of the Internet in recent years, information overload has become a critical issue for users on various online platforms. To address this issue, recommender systems have emerged and become a vital part of people's daily lives. They not only make it easier for users to access the information and services they need, but also benefit companies by improving resource utilization. Personalized recommendation algorithms have therefore gained increasing attention in industry and have attracted a surge of research interest at the same time. Restricted by practical factors such as platform access and efficiency, many researchers in personalized recommendation cannot evaluate their algorithms on online systems. Offline evaluation has thus become the most common practice in this research area. However, different recommendation scenarios involve heterogeneous types of data, and most user behaviors are implicit feedback containing considerable noise. These factors lead to complicated and divergent experimental settings in offline recommendation experiments. In practice, many important details are easily neglected, and different researchers may understand the detailed settings differently. For example, it is unclear whether the item pool for negative sampling during training should include the items a user interacted with in the validation/test set. One may sample negative items only from items the user never interacted with, or one may also treat validation/test items as possible negatives. Similarly, various detailed settings exist in the other stages, from training to testing (e.g., data preprocessing, the usage of known negative samples, the choice of candidates in Top-N ranking). These experimental details are usually omitted from research papers, yet they can affect the comparison between recommendation algorithms. 
Moreover, these settings partly determine the scientific rigor of an experimental design, and some of them may even lead to opposite or wrong conclusions. Given these observations, this work thoroughly revisits the details in different aspects of recommendation experiments, including data preprocessing, model training, validation, testing, and evaluation metrics. We enumerate the common choices in each aspect and, for some of them, run empirical experiments to demonstrate the effects of different experimental settings. We show that some settings indeed reverse the relative ranking of different recommendation algorithms. Finally, we conclude with a guiding summary of experimental details, classifying each principle as optional, recommended, or required. With the help of this summary, researchers can better avoid possible implementation traps and design recommendation experiments in a scientifically sound way.
|Key words: recommender system; experimental settings; algorithm evaluation; revisiting and discussion
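
The negative-sampling question raised in the abstract can be made concrete with a minimal Python sketch. It is an illustration only, not the authors' implementation: the function name `sample_negative`, its parameters, and the assumption that items are integer ids in `range(n_items)` are all hypothetical. The flag `exclude_heldout` switches between the two item pools the abstract describes.

```python
import random


def sample_negative(train_items, heldout_items, n_items, exclude_heldout, rng):
    """Draw one negative item id for a single user by rejection sampling.

    Hypothetical helper for illustration; item ids are assumed to be
    integers in range(n_items).

    train_items:     items the user interacted with in the training set
    heldout_items:   items the user interacted with in validation/test
    exclude_heldout: True  -> sample only from never-interacted items
                     False -> held-out items may also be drawn as negatives
    """
    excluded = set(train_items)
    if exclude_heldout:
        excluded |= set(heldout_items)
    if len(excluded) >= n_items:
        raise ValueError("no candidate negative items left for this user")
    while True:
        item = rng.randrange(n_items)
        if item not in excluded:
            return item


if __name__ == "__main__":
    rng = random.Random(0)
    train, heldout = {0, 1}, {2, 3}
    # Pool A: validation/test items are never used as training negatives.
    print(sample_negative(train, heldout, 5, True, rng))
    # Pool B: validation/test items may appear as training negatives.
    print(sample_negative(train, heldout, 5, False, rng))
```

The two calls differ only in the item pool, which is the kind of detail a paper would need to state explicitly for its results to be comparable with others.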