Cite this article:
SHI Shaoyun, WANG Chenyang, MA Weizhi, ZHANG Min, LIU Yiqun, MA Shaoping. Details Matter: Revisiting Experimental Settings in Recommender Systems[J]. Journal of Cyber Security, 2021, 6(5): 52-67
Abstract:
In recent years, with the rapid development of the Internet, users receive massive amounts of information on various online platforms, and information overload has become a critical problem. Against this background, recommender systems have gradually permeated almost every scenario of people's work and daily life and have become an indispensable component. They not only help users quickly obtain the information and services they need, but also improve resource utilization and thereby bring more benefit to enterprises. Personalized recommendation algorithms have therefore received broad attention from industry and are also one of the research hotspots in academia. In personalized recommendation research, constrained by factors such as platform access and efficiency, most researchers cannot deploy their algorithms on online systems for evaluation, so offline evaluation has become the dominant methodology in the field. However, personalized recommendation involves complex scenarios, the available data are highly diverse, and user behaviors are mostly implicit feedback containing considerable noise. As a result, the experimental settings of offline evaluation are complicated and variable, with many details that are easy to overlook yet highly important. For example, when sampling negative items for training, one may sample only from items the user has never interacted with, or additionally treat the items in the validation and test sets as unknown interactions and include them in the sampling pool. Similar implementation details arise in many other stages from training to testing (e.g., dataset preprocessing, the usage of known negative samples, and the choice of the candidate set for Top-N ranking). These experimental details are usually not explicitly mentioned in academic papers, yet they implicitly affect the comparison of model performance, determine the scientific soundness of experiments, and may even lead to opposite or incorrect analytical conclusions. This paper systematically discusses and reflects on the detailed settings of recommender system experiments from multiple perspectives, including dataset preprocessing, model training, validation and testing, and performance evaluation. For each stage, we enumerate several common settings and verify the actual impact of some of them on real-world datasets. The experimental results show that certain details do lead to different conclusions about which model is better. Finally, we present a guiding summary of experimental details in recommender systems, covering three categories of settings: optional, recommended, and required. We hope it helps recommendation researchers avoid pitfalls in implementation details and design experiments more scientifically and soundly.
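For illustration only, the following Python sketch (variable and function names are hypothetical, not taken from the paper) contrasts the two negative-sampling pools described in the abstract: excluding every item the user has interacted with, versus treating validation/test items as unknown interactions that remain in the pool.

```python
import random

def sample_negatives(user, all_items, train_pos, heldout_pos,
                     treat_heldout_as_unknown=False, n_neg=1):
    """Sample n_neg training negatives for one user (hypothetical helper).

    treat_heldout_as_unknown=False: exclude all interacted items, including
        those held out in the validation/test sets.
    treat_heldout_as_unknown=True: validation/test items count as unknown
        interactions and may be drawn as negatives.
    """
    excluded = set(train_pos[user])
    if not treat_heldout_as_unknown:
        excluded |= set(heldout_pos[user])
    pool = [item for item in all_items if item not in excluded]
    return random.sample(pool, n_neg)

# Example: items 0-9, user "u" trained on {1, 2}, item 3 held out for testing.
# With the default setting, item 3 can never be sampled as a negative;
# with treat_heldout_as_unknown=True, it can.
```

Which pool is used is exactly the kind of unstated detail the paper argues should be reported, since the two choices give the model different views of the held-out items.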
Keywords: recommender system; experiment design; algorithm evaluation; revisiting and discussion
DOI: 10.19363/J.cnki.cn10-1380/tn.2021.09.04
Received: 2021-05-09; Revised: 2021-08-05
Funding: This work was supported by the National Key Research and Development Program of China (No. 2018YFC0831900), the National Natural Science Foundation of China (No. 61672311, No. 61532011), and the Guoqiang Institute of Tsinghua University.
|
Details Matter: Revisiting Experimental Settings in Recommender Systems
SHI Shaoyun, WANG Chenyang, MA Weizhi, ZHANG Min, LIU Yiqun, MA Shaoping
(Department of Computer Science and Technology, Institute for Artificial Intelligence, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China)
Abstract:
With the development of the Internet in recent years, information overload has become a critical issue for users on various online platforms. To address this issue, recommender systems stand out and have come to constitute a vital part of people's daily lives. They not only make it easier for users to access the information and services they need, but also benefit companies by improving resource utilization. Therefore, personalized recommendation algorithms have gained increasing attention in industry and have attracted a surge of research interest at the same time. Restricted by practical factors such as platform access and efficiency, many researchers in personalized recommendation have no access to online systems to evaluate their algorithms. Thus, offline evaluation has become the most common practice in the research area. However, different recommendation scenarios involve heterogeneous types of data. Furthermore, most user behaviors are implicit feedback with plenty of noise. These factors lead to complicated and divergent settings in offline recommendation experiments. In practice, many important details are easily neglected, and different researchers may have different perceptions of these detailed settings. For example, it is debatable whether the item pool for negative sampling during training should include the items interacted with in the validation/test sets: one may sample negative items only from non-interacted items, or also treat validation/test items as possible negatives. Similarly, various detailed settings exist in other stages from training to testing (e.g., data preprocessing, the usage of known negative samples, and the choice of candidates in Top-N ranking). These experimental details are usually omitted in research papers but potentially affect the comparison between recommendation algorithms. Moreover, these settings partly determine the scientific soundness of experimental designs, and some of them may even lead to opposite or wrong conclusions. Given these observations, this work thoroughly revisits the details in different aspects of recommendation experiments, including data preprocessing, model training, validation, testing, and evaluation metrics. We enumerate the common choices in each aspect and couple some of them with empirical experiments to demonstrate the effects of different settings. We show that some settings indeed flip the relative ranking of recommendation algorithms. Finally, we conclude with a guiding summary of experimental details, classifying settings as optional, recommended, or required. With the help of this summary, researchers can better avoid implementation traps and design recommendation experiments in a scientific way.
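As a minimal sketch of the "choice of candidates in Top-N ranking" mentioned above (the helper below is a hypothetical illustration, not the paper's evaluation code), Hit Ratio@K can be computed by ranking the held-out item either against all non-interacted items or against a small sampled candidate set (e.g., 1 positive plus 99 random negatives).

```python
import random

def hit_at_k(score_fn, user, test_item, all_items, interacted, k=10,
             n_candidates=None):
    """score_fn(user, item) -> float is the trained model's scoring function.

    n_candidates=None: full ranking over all non-interacted items.
    n_candidates=100 : sampled protocol with 1 positive and 99 negatives.
    """
    negatives = [i for i in all_items if i not in interacted and i != test_item]
    if n_candidates is not None:
        negatives = random.sample(negatives, n_candidates - 1)
    ranked = sorted(negatives + [test_item],
                    key=lambda i: score_fn(user, i), reverse=True)
    return int(test_item in ranked[:k])
```

Averaged over users, the two protocols can order the same pair of models differently, which is one way such a detail could flip an experimental conclusion.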
Key words: recommender system; experimental settings; algorithm evaluation; revisiting and discussion