Data Mining Cross-Feature Case Study (Data Mining Feature Engineering Resource, CSDN Library)

Uploaded 2017-07-31 21:18:12 · PDF, 285 KB

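### Data Mining Cross-Feature Case Study: Repeat Buyer Prediction in E-Commerce

#### Abstract

In e-commerce, data mining is widely used for customer behavior analysis, personalized recommendation, and market trend forecasting. During large promotions such as the "Double 11" shopping festival, merchants use promotions to attract large numbers of new customers. Many of these new customers, however, are one-time buyers who contribute little to a merchant's long-term sales growth. Identifying which new customers are likely to become loyal customers is therefore important: it helps reduce promotion cost and raise the return on investment (ROI). This article describes a feature engineering approach built around cross (interaction) features and applies it to repeat buyer prediction.

#### Background and Problem Definition

At the International Joint Conference on Artificial Intelligence (IJCAI) 2015, Alibaba hosted an international competition on repeat buyer prediction, based on sales data from the 2014 Tmall "Double 11" shopping festival. The competition drew 753 teams. The goal was to build a model that predicts which new buyers of a merchant will buy from that merchant again. The organizers provided user demographic information (age range and gender), six months of user activity logs, and labeled pairs of new buyers and merchants.

#### Method and Implementation

The team combined data preprocessing, feature engineering, and model training:

1. **Feature engineering**:
   - **User features**: user profiles built from age range, gender, and activity statistics derived from the logs.
   - **Merchant, brand, category, and item features**: profiles describing each of these entities, derived from the same logs.
   - **Interaction features**: features mined from the interaction records between entities, such as clicks, add-to-favourite actions, and purchases, to capture user preferences.
   - **Cross features**: features that combine two entities, such as the number of interactions between a user and a merchant. These played a key role in the prediction models.
2. **Model selection and tuning**:
   - The classifiers trained include Factorization Machine, Logistic Regression, Random Forest, GBM, and XGBoost.
   - Multiple classifiers were blended with ensemble techniques to further improve performance.
   - K-fold cross-validation was used to evaluate model performance, and models were adjusted based on the results.
3. **Results and analysis**:
   - The solution placed first in stage 1 of the competition.
   - The cross features in particular improved prediction accuracy.

#### Key Technical Points

1. **The importance of feature engineering**: Feature engineering is one of the key factors in the success of a machine learning project. Cleaning, transforming, and combining raw data yields more meaningful features and better models. In this case, the user, merchant, and interaction features, and the cross features in particular, captured user behavior patterns that the prediction models could exploit.
2. **The role of cross features**: A cross feature is a new feature formed by combining two or more basic features. The number of interactions between a user and a merchant is a typical example. Such features capture associations between different dimensions of the data and help expose complex behavior patterns.
3. **Model selection and tuning**: Choosing models that fit the task matters. Starting with simple models allows a quick first baseline, and adding more complex models and ensembles afterwards improves overall performance.

#### Conclusion

This case study shows that data mining has broad application in e-commerce. A rich feature set, and cross features in particular, can substantially improve prediction models, and careful model selection and tuning is also important for good results. The approach offers an effective method for repeat buyer prediction and a useful reference for other data analysis work in e-commerce.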


Repeat Buyer Prediction for E-Commerce

Guimei Liu⋆, Tam T. Nguyen⋆, Gang Zhao, Wei Zha⋆, Jianbo Yang, Jianneng Cao⋆, Min Wu⋆, Peilin Zhao⋆, Wei Chen⋆

Data Analytics Department, Institute for Infocomm Research, Singapore 138632,

{liug,nguyentt,zhaw,caojn,wumin,zhaop}@i2r.a-star.edu.sg

Development Bank of Singapore, {george.g.zhao, nus.waltchan}@gmail.com

General Electric, jianbo.yang@ge.com

ABSTRACT

A large number of new buyers are often acquired by merchants during promotions. However, many of the attracted buyers are one-time deal hunters, and the promotions may have little long-lasting impact on sales. It is important for merchants to identify who can be converted to regular loyal buyers and then target them to reduce promotion cost and increase the return on investment (ROI). At International Joint Conferences on Artificial Intelligence (IJCAI) 2015, Alibaba hosted an international competition for repeat buyer prediction based on the sales data of the “Double 11” shopping event in 2014 at Tmall.com. We won the first place at stage 1 of the competition out of 753 teams. In this paper, we present our winning solution, which consists of comprehensive feature engineering and model training. We created profiles for users, merchants, brands, categories, items and their interactions via extensive feature engineering. These profiles are not only useful for this particular prediction task, but can also be used for other important tasks in e-commerce, such as customer segmentation, product recommendation, and customer base augmentation for brands. Feature engineering is often the most important factor for the success of a prediction task, but not much work can be found in the literature on feature engineering for prediction tasks in e-commerce. Our work provides some useful hints and insights for data science practitioners in e-commerce.

Keywords

Repeat Buyer Prediction; Feature Engineering; E-commerce

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

KDD ’16, August 13-17, 2016, San Francisco, CA, USA

© 2016 ACM. ISBN 978-1-4503-4232-2/16/08...$15.00

DOI: http://dx.doi.org/10.1145/2939672.2939674

1. INTRODUCTION

Large business-to-consumer (B2C) e-commerce websites, such as Amazon and Alibaba, often run nationwide sales promotions on special days like Black Friday and Double 11 (Singles’ Day). Merchants acquire new customers during these events. However, most new customers are one-time deal hunters, and promotions to them usually do not generate return on investment (ROI) as expected by merchants. Therefore, merchants need to identify potential loyal ones from these new customers, so as to conduct targeted advertisements (and promotions) towards them to lower the promotion cost. It is difficult for any individual merchant to identify its potential loyal customers as it has little information on its new customers. B2C e-commerce websites instead have the click stream data and purchase history of all the customers at all the merchants on their platforms. Thus, they can learn the preferences and habits of the new customers from their historical data, and then predict how likely a new customer is to buy again from the same merchant.

At IJCAI 2015, Alibaba hosted an international competition for repeat buyer prediction based on the sales data of the “Double 11” day of 2014 at Tmall.com, the largest B2C platform in China. Double 11 is the biggest online shopping event in China with sales (in Tmall and Taobao) at US$5.8 billion in 2013, US$9.3 billion in 2014, and over US$14.3 billion in 2015. Data provided to the competition include a number of merchants and their new buyers acquired during the event, and six months of user activity log data before the event. The task is to predict which new customers of a given merchant would buy items from the same merchant again within six months. These new buyers are called repeat buyers of the respective merchants.

We won the first place at stage 1 of the competition. Our winning solution consists of comprehensive feature engineering and model training. In particular, we generated various types of features to describe users, merchants, brands, categories, items and their interactions from different aspects. We have trained various classification models, including Factorization Machine [14, 11], Logistic Regression [1, 2], Random Forest [5], GBM [10], and XGBoost [6]. We have also used ensemble techniques to blend multiple classifiers together to further improve the performance.
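The text above does not say how the classifiers were blended. As a minimal sketch, assuming a weighted average of predicted probabilities (a common blending choice, not necessarily the one used in the winning solution), with hypothetical model outputs and weights:

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    predictions: one list of probabilities per model, all of equal length.
    weights: optional per-model weights; equal weights if omitted.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row per (user, merchant) pair
    ]


# Hypothetical repeat-purchase probabilities from three classifiers
# for the same four (new buyer, merchant) pairs.
model_a = [0.10, 0.80, 0.30, 0.05]
model_b = [0.20, 0.60, 0.40, 0.05]
model_c = [0.30, 0.70, 0.20, 0.05]

blended = blend([model_a, model_b, model_c], weights=[2.0, 1.0, 1.0])
```

In practice the weights would be chosen on held-out data; the values here are placeholders.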

The repeat buyer prediction problem can be formulated as a typical classification problem, as most of the competition participants did. Model training of this task is not much different from that of other classification tasks. Instead, feature engineering is the main component that distinguishes this task from others. Feature engineering, an integral part of data science, is often the key to the success of a machine learning project. It can be more difficult than learning because it is domain-specific, while machine learning algorithms are largely general-purpose. Much trial and error can go into feature design, and it is typically where most of the effort in a machine learning project goes [8]. While thousands of classification algorithms have been proposed and studied in the research community, not much work has been reported on feature engineering for prediction tasks in e-commerce. Therefore, in this paper we focus on feature engineering. We will describe how to generate various types of features from user activity log data and study the importance of these features via extensive experiments. The features we generated can be used in all kinds of e-commerce applications, such as customer segmentation, product recommendation, and customer base augmentation for brands. We hope that our work can be valuable for data science practitioners, who need to develop solutions for prediction tasks in e-commerce.

Footnotes:

http://ijcai-15.org/index.php/repeat-buyers-prediction-competition

https://en.wikipedia.org/wiki/Singles_Day

Table 1: Statistics of training and testing data

| data | #users | #merchants | #pairs | #positive pairs | positive% |
|---|---|---|---|---|---|
| train | 212,062 | 1,993 | 260,864 | 15,952 | 6.12% |
| test | 212,108 | 1,993 | 261,477 | 16,037 | 6.13% |

The rest of the paper is organized as follows. Section 2 gives the problem description. Section 3 describes the features we have generated. Model ensemble is briefly described in Section 4. In Section 5, the importance of features is studied and top features are listed. Finally, Section 6 concludes the paper.

2. PROBLEM DESCRIPTION

For the repeat buyer prediction competition, the following data are provided as shown on the top of Figure 1: demographic information of users, six months of user activity log data prior to the “Double 11” promotion, and training and testing ⟨new buyer, merchant⟩ pairs, where the first purchase of the new buyer from the merchant is on the “Double 11” promotion. User demographic data contains the age and gender of users. The age values are divided into seven ranges. The class label of a training ⟨new buyer, merchant⟩ pair is known, and it indicates whether the new buyer bought items from the merchant again within six months after the “Double 11” promotion. The class labels of testing ⟨new buyer, merchant⟩ pairs are hidden. The task is to predict the class labels of the testing pairs. The competition was carried out in two stages. In stage 1, all the data were released to the contestants except for class labels of testing pairs, which were released after stage 1. Stage 2 ran on the cloud platform of Alibaba for bigger data, and the data were not released. Therefore, in this paper, we focus on the data of stage 1.

Table 1 shows the statistics of the training and testing data. The set of merchants in training data and that in testing data are the same except for a single merchant. Users in the training and testing data have no overlap. The second last column is the number of positive ⟨new buyer, merchant⟩ pairs such that the new buyer bought items from the merchant again within six months. The last column is the percentage of such positive pairs. The percentage of positive pairs is around 6%, which indicates that most of the new buyers are indeed one-time deal hunters.
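The percentages quoted from Table 1 can be re-derived from the pair counts. The short check below uses only the figures in the table; the helper name `positive_rate` is illustrative.

```python
# Counts copied from Table 1: (number of pairs, number of positive pairs).
TABLE_1 = {
    "train": (260_864, 15_952),
    "test": (261_477, 16_037),
}


def positive_rate(pairs: int, positives: int) -> float:
    """Return the share of positive pairs as a percentage."""
    return 100.0 * positives / pairs


for split, (pairs, positives) in TABLE_1.items():
    rate = positive_rate(pairs, positives)
    # Negatives per positive shows how imbalanced the classes are.
    imbalance = (pairs - positives) / positives
    print(f"{split}: {rate:.2f}% positive, {imbalance:.1f} negatives per positive")
```

With roughly 15 negative pairs for every positive one, the classification problem is heavily imbalanced.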

The user activity log data contains the following fields: user_id, merchant_id, item_id, cat_id, brand_id, action_type and time_stamp. Action type takes four values: 0 for click, 1 for add-to-cart, 2 for purchase and 3 for add-to-favourite. Products sold in different merchants are assigned different item_ids even if the products are exactly the same. Table 2 shows the statistics of the user activity log data. Many merchants in the log data do not have new buyers in the training or testing data. They are included in the log data because some new buyers visited them. The activities of the new buyers at these merchants are valuable information for inferring the preferences and habits of the new buyers.

Table 2: Statistics of log activity data

| #rows | #users | #merchants | #items | #categories | #brands |
|---|---|---|---|---|---|
| 54,925,330 | 424,170 | 4,995 | 1,090,390 | 1,658 | 8,444 |

Table 3: Statistics of action types

| click | add-to-cart | purchase | add-to-favourite |
|---|---|---|---|
| 48,550,713 (88.39%) | 76,750 (0.14%) | 3,292,144 (5.99%) | 3,005,723 (5.47%) |

Table 3 shows the number of the four types of actions. The majority of actions are clicks. The number of add-to-cart actions is very small, so we merge the add-to-cart actions with click actions.
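The merge described above amounts to recoding action type 1 as 0 before counting. A minimal sketch, using the action codes given in the text and a hypothetical toy list of actions:

```python
from collections import Counter

# Action codes as defined for the log data.
CLICK, ADD_TO_CART, PURCHASE, ADD_TO_FAVOURITE = 0, 1, 2, 3


def merge_cart_into_click(action_type: int) -> int:
    """Map add-to-cart (1) onto click (0); leave other codes unchanged."""
    return CLICK if action_type == ADD_TO_CART else action_type


# Hypothetical toy log: one action_type code per row.
toy_actions = [0, 0, 1, 2, 3, 0, 1, 2]
merged = [merge_cart_into_click(a) for a in toy_actions]

# After merging, only three action types remain to be counted.
counts = Counter(merged)
```

Applied to the figures in Table 3, the merged click count would be 48,550,713 + 76,750 = 48,627,463 rows.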

The user activity log data provided in this competition are very typical in e-commerce prediction tasks. However, the log data are not in a form that is amenable to learning. We need to construct new features from them and then join the new features with the training and testing data. In the next section, we describe how we do this.
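The construct-then-join step can be illustrated with the simplest kind of feature, per-entity action counts. This is a sketch under assumptions: the toy rows, the feature names, and the helper functions are hypothetical, and only three of the log fields are used.

```python
from collections import defaultdict

CLICK, PURCHASE = 0, 2  # action codes from the log description

# Hypothetical toy log rows: (user_id, merchant_id, action_type).
log = [
    (1, 10, 0), (1, 10, 2), (1, 11, 0),
    (2, 10, 0), (2, 10, 3), (2, 10, 2),
]
# Hypothetical (user_id, merchant_id) pairs to be scored.
pairs = [(1, 10), (2, 10), (3, 11)]


def count_by(rows, key_index):
    """Count log rows per entity id and action type."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[key_index]][row[2]] += 1
    return counts


user_counts = count_by(log, 0)
merchant_counts = count_by(log, 1)


def features_for(user_id, merchant_id):
    """Join entity-level counts onto one pair; unseen ids get zeros."""
    u = user_counts.get(user_id, {})
    m = merchant_counts.get(merchant_id, {})
    return {
        "user_id": user_id,
        "merchant_id": merchant_id,
        "user_clicks": u.get(CLICK, 0),
        "user_purchases": u.get(PURCHASE, 0),
        "merchant_clicks": m.get(CLICK, 0),
        "merchant_purchases": m.get(PURCHASE, 0),
    }


feature_rows = [features_for(u, m) for u, m in pairs]
```

Each entry of `feature_rows` is one training or testing pair extended with columns derived from the log, which is the shape a classifier expects.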

3. FEATURE ENGINEERING

The user activity log data contain five entities: users, merchants, brands, categories and items. The characteristics of these entities and their interactions can be predictive of the class labels. For example, users are more likely to buy again from a merchant selling snacks than from a merchant selling electronic products within six months, since snacks are cheaper and are consumed much faster than electronic products. We generated a large number of features to describe the characteristics of the five types of entities and their pairwise interactions. In the rest of this section, we first give an overview of all the generated features, and then describe the features in detail.

3.1 Overview of features and profiles

The features we generated range from basic counts to complex features like similarity scores, trends, PCA (Principal Component Analysis) and LDA (Latent Dirichlet Allocation) features. All the features of an entity form the profile of the entity. We have five entity profiles and five interaction profiles as shown at the bottom of Figure 1. Table 4 gives a summary of the types of features contained in these profiles. User-merchant interaction is the most important interaction among the five pairwise interaction profiles as the task is to predict whether a user will return to a merchant to buy again. Therefore, user-merchant profile contains more features than the other interaction profiles.
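A user-merchant interaction profile can be sketched with a few basic count features. The three features below (per-action counts, number of active days, and the share of a user's actions that go to the merchant) are illustrative assumptions, not the feature list from Table 4, and the day strings are a hypothetical stand-in for time_stamp.

```python
from collections import defaultdict

# Hypothetical toy rows: (user_id, merchant_id, action_type, day).
log = [
    (1, 10, 0, "1101"), (1, 10, 0, "1101"), (1, 10, 2, "1111"),
    (1, 11, 0, "1105"),
    (2, 10, 0, "1110"), (2, 10, 2, "1111"),
]


def user_merchant_profile(rows):
    """Build a small feature dict for every (user, merchant) pair in rows."""
    action_counts = defaultdict(lambda: defaultdict(int))
    active_days = defaultdict(set)
    user_totals = defaultdict(int)
    for user_id, merchant_id, action_type, day in rows:
        key = (user_id, merchant_id)
        action_counts[key][action_type] += 1
        active_days[key].add(day)
        user_totals[user_id] += 1

    profile = {}
    for key, counts in action_counts.items():
        pair_total = sum(counts.values())
        profile[key] = {
            "clicks": counts.get(0, 0),
            "purchases": counts.get(2, 0),
            "favourites": counts.get(3, 0),
            "active_days": len(active_days[key]),
            # Fraction of all the user's logged actions at this merchant.
            "share_of_user_actions": pair_total / user_totals[key[0]],
        }
    return profile


profile = user_merchant_profile(log)
```

The same grouping pattern extends to the other pairwise profiles by changing the key, for example to (user_id, brand_id).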

The original training/testing data contain only user_ids and merchant_ids as shown on the top of Figure 1. We expanded the training/testing data by adding age_range and gender of users, item_id, brand_id, and category_id as shown in the middle of Figure 1, where item_id is the id of the item bought by the user from the merchant on the Double 11 day,
