Wide & Deep Learning for Recommender Systems (2016)
In this paper, we present Wide & Deep learning — jointly trained wide linear models and deep neural networks — to combine the benefits of memorization and generalization for recommender systems.
1. Overview of the Recommender System
A complete recommender system consists of two main parts: retrieval and ranking, as shown in Figure 1 (the retrieval system returns a short list of items that best match the query using various signals, usually a combination of machine-learned models and human-defined rules; the ranking system then ranks these items by their scores).

图1 Overview of the recommender system
2. WIDE & DEEP LEARNING
The paper proposes a Wide & Deep learning framework for the ranking module, as shown in Figure 2.

图2 The spectrum of Wide & Deep models
The framework consists of two components: the Wide Component and the Deep Component. Why are both needed? The paper offers the following explanation:
- Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort
- With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features
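The cross-product feature transformation used by the Wide Component can be sketched in a few lines. The feature pair below is the paper's own `AND(gender=female, language=en)` example; the helper name and dictionary encoding are my own illustrative choices:

```python
def cross_product(features, pair):
    """phi(x) = 1 iff both binary features in `pair` are active in x."""
    a, b = pair
    return 1 if features.get(a, 0) and features.get(b, 0) else 0

# Example from the paper: AND(gender=female, language=en)
x = {"gender=female": 1, "language=en": 1}
print(cross_product(x, ("gender=female", "language=en")))  # 1
print(cross_product(x, ("gender=female", "language=es")))  # 0
```

Such cross-products let a linear model memorize that a specific feature co-occurrence correlates with the label, which is exactly the "memorization" role described above.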
In addition, the Wide Component and the Deep Component are jointly trained: their outputs are fed into one logistic loss function, as shown in Figure 3, so the parameters of both components are updated simultaneously during training.

图3 Wide & Deep model
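The joint prediction and training step can be sketched as follows. The prediction follows the paper's formulation, P(Y=1|x) = σ(wᵀ_wide[x, φ(x)] + wᵀ_deep a^(lf) + b); the dimensions, random inputs, and plain SGD update are illustrative (the paper actually uses FTRL with L1 regularization for the wide part and AdaGrad for the deep part):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_wide, d_deep = 8, 4                         # illustrative dimensions
w_wide = rng.normal(size=d_wide)
w_deep = rng.normal(size=d_deep)
b = 0.0

x_wide = rng.integers(0, 2, size=d_wide).astype(float)  # raw + cross-product features
a_deep = rng.normal(size=d_deep)              # final hidden activations a^(lf)

# Joint prediction: both components feed one logit
p = sigmoid(w_wide @ x_wide + w_deep @ a_deep + b)

# Joint training: one logistic loss backpropagates into BOTH parts at once
y = 1.0
grad_logit = p - y                            # dL/dlogit for the log loss
w_wide -= 0.1 * grad_logit * x_wide           # wide update (paper: FTRL + L1)
w_deep -= 0.1 * grad_logit * a_deep           # deep update (paper: AdaGrad)
b -= 0.1 * grad_logit
```

The key difference from an ensemble is visible here: a single loss gradient updates the parameters of both parts in the same step, rather than training each model separately and combining their scores afterwards.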
Using app recommendation as a case study, the paper describes a production deployment of the Wide & Deep model. Figure 4 shows the pipeline of this app recommender system, which consists of three stages: Data Generation, Model Training, and Model Serving:
- Data Generation: generates the training data
- Model Training: trains the model; the concrete model structure is shown in Figure 5
- Model Serving: serves the model (with a response time of about 10 ms)

图4 Apps recommendation pipeline overview

图5 Wide & Deep model structure for apps recommendation
Figure 6 shows the experimental results for Offline AUC and Online Acquisition Gain.

图6 Offline & online metrics of different models
Interestingly, Deep has a lower Offline AUC than Wide, yet its Online Acquisition Gain is 2.9% higher than Wide's. Several explanations are possible:
- Compared with Deep, Wide is more prone to over-learning the offline dataset, i.e. overfitting
- Offline metrics and online metrics are not linearly correlated
In short, how to design offline metrics or offline tests remains an important research topic in its own right.
Figure 7 shows the experimental results for Serving Latency; clearly, serving latency depends mainly on the batch size and the number of threads.

图7 Serving latency
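The serving optimization the paper describes (splitting each large scoring batch into smaller mini-batches and running them in parallel threads) can be sketched as below. The function names are my own, and `score` is a stand-in for the real Wide & Deep forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def score(batch):
    # Placeholder for the Wide & Deep forward pass over one mini-batch
    return [0.5 for _ in batch]

def parallel_score(candidates, batch_size=4, num_threads=2):
    """Split candidates into mini-batches and score them in parallel."""
    batches = [candidates[i:i + batch_size]
               for i in range(0, len(candidates), batch_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(score, batches)     # preserves batch order
    return [s for batch in results for s in batch]

scores = parallel_score(list(range(10)))
```

In the paper this mini-batch parallelism cut the client-side latency roughly in half relative to scoring all candidates in a single batch.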
3. Summary
- Wide & Deep model structure: on top of the Wide part, a Deep module is introduced for feature extraction (wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations; deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings)
- Joint training: compared with training schemes such as ensembling and stacking, joint training is a novel approach, in which both parts are optimized simultaneously against the same loss