Data Mining Cross-Feature Case Study

[Figure 1: Feature engineering on raw data. Raw-data records (user_id, merchant_id, item_id, cat_id, brand_id, action_type, time_stamp) are expanded into training/testing records (user_id, age_range, gender, merchant_id, item_id, cat_id, brand_id, prob) and used to build the generated feature profiles: user, merchant, brand, category and item profiles, plus user-merchant (UM), user-brand (UB), user-category (UC), merchant-brand (MB) and merchant-category (MC) profiles.]

[Table 1: Summary of features and profiles, listing activity, age-related and gender-related features for each entity and interaction profile.]

brand_id and category_id are the brand and category of the item. If a user bought multiple items from a merchant on the Double 11 day, then the most frequent one is used. The features are joined with the expanded training/testing data as follows: entity features are joined based on their respective ids; age-related features are joined by the respective entity id and age_range; gender-related features are joined by the respective entity id and gender; interaction features are joined by the two entity ids involved.

3.2 Count/ratio features

Each entity has three types of actions: click, purchase and add-to-favourite, recorded over the six-month period from 12 May, 2014 to 11 Nov, 2014. Figure 2 shows the action history of an example entity, with the three types of actions shown separately.

[Figure 2: Action history of an example entity, showing clicks, purchases and add-to-favourite actions from 12 May 2014 to the Double 11 day.]

Action count, action ratio and day count. Action counts are the numbers of clicks, purchases and add-to-favourite actions in each month (monthly counts) or over the whole data period (overall counts). Count features are the basis for generating more complex features. For the entity shown in Figure 2, the monthly click counts are (1, 1, 1, 2, 1, 3, 3), the monthly purchase counts are (0, 1, 0, 0, 0, 0, 1), and the monthly add-to-favourite counts are (0, 0, 0, 0, 1, 1, 0). The overall counts of click, purchase and add-to-favourite are 12, 2 and 2, respectively. Action ratio is the proportion of a particular action type over all action types, and it can be calculated in each month or over the whole data period. For the entity in Figure 2, the overall click ratio is 12/(12+2+2)=0.75, and the June click ratio is 1/(1+1+0)=0.5.

Day counts are the number of days with a particular action type in each month or over the whole data period. Day count features are mainly used to differentiate regular buyers from occasional buyers. For example, a user with 10 purchase actions in one single day is different from another user with the same number of purchase actions spread over 10 different days; the latter user is considered a more regular buyer than the former.

Action count, action ratio and day count features can be generated for pairs of entities as well. For example, for each (user, merchant) pair, monthly click counts are the number of times the user clicked some items of the merchant in each month. For a (merchant, brand) pair, overall purchase counts are the number of times products of the brand were purchased from the merchant over the whole data period. Not all the monthly features are directly used, otherwise there would be too many features. Instead, we use these features to generate more complex features, such as monthly aggregation features. A worked sketch of the basic count, ratio and day-count computations is given below.
Product diversity features. For a user, product diversity features are the numbers of unique items, brands and categories that the user clicked, purchased or added to favourites in each month or over the whole data period. For a merchant, product diversity features are defined in a similar way. For a (user, merchant) pair, product diversity features are the numbers of unique items, brands and categories of the merchant that were clicked, purchased or added to favourites by the user in each month or over the whole data period. The intuition behind product diversity features is that if a user is interested in more items of a merchant, then the user is more likely to buy again from the merchant (see Figures 6(a), 7(d) and 7(i)).

Penetration features. The penetration feature of an item is defined as the number of users who have purchased the item in a given time interval. We have also computed penetration features for merchants, brands and categories. A large customer base usually indicates that the entity has a good reputation, so users are more likely to come back.

3.3 Aggregation features

Monthly aggregation features are the mean, standard deviation, max and median of monthly action counts, monthly day counts, monthly product diversity counts and monthly penetration counts.

User aggregation features are calculated for merchants, brands, categories, (merchant, brand) pairs and (merchant, category) pairs. The user-purchase-day-aggregation features of a merchant are calculated by first counting the number of days that each individual user bought items from the merchant, and then calculating the mean, standard deviation, max and median over all the users of the merchant. User-purchase-item-aggregation features of a merchant are defined in a similar way over the number of unique items that each user purchased from the merchant. For click and add-to-favourite actions, user-action-day-aggregation and user-action-item-aggregation features are calculated in the same way for merchants. For other entities, only the purchase action is considered. User aggregation features are also generated by considering only users of a specific gender or age range. The intuition behind user aggregation features is that, given a merchant, if users visited it or bought items from it more than once on average, then new buyers of the merchant on the Double 11 day are more likely to come back as well (see Figures 5(c), 5(d), 6(e), 6(f), 7(c), 7(e) and 7(f)).

Merchant aggregation features are generated for users. Given a user, his/her merchant-purchase-day-aggregation features are calculated by first counting the number of days that the user bought items from each individual merchant, and then calculating the mean, standard deviation, max and median over all the merchants from which the user made at least one purchase. Merchant-purchase-item-aggregation features are calculated in a similar way over the number of unique items that the user purchased from each merchant. For click and add-to-favourite actions, merchant-action-day-aggregation and merchant-action-item-aggregation features are calculated in the same way. Merchant aggregation features reflect users' habits: if a user tends to buy from or visit merchants multiple times on average, then he/she is likely to buy from new merchants again (see Figures 5(a), 5(b) and 7(l)). A small sketch of the two-step day-count aggregation is shown below.
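As a concrete illustration of the two-step aggregation, this sketch computes user-purchase-day-aggregation features for merchants from a hypothetical purchase log; the DataFrame and column names are assumptions made for the example, not the paper's actual schema.

```python
import pandas as pd

# Hypothetical purchase log: one row per (user, merchant, day) with a purchase.
purchases = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3, 3, 3],
    "merchant_id": [10, 10, 10, 20, 20, 20, 20],
    "day":         ["2014-06-01", "2014-07-02", "2014-06-01",
                    "2014-08-05", "2014-08-05", "2014-09-06", "2014-10-07"],
})

# Step 1: number of distinct purchase days of each user at each merchant.
days_per_user = (purchases.groupby(["merchant_id", "user_id"])["day"]
                           .nunique().rename("buy_day_num"))

# Step 2: aggregate over all users of each merchant.
user_purchase_day_agg = (days_per_user.groupby("merchant_id")
                                       .agg(["mean", "std", "max", "median"])
                                       .add_prefix("M_user_buy_day_num_"))
print(user_purchase_day_agg)
```

Swapping the roles of merchant_id and user_id in the two group-by steps yields the merchant aggregation features for users in the same way.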
3.4 Recent activity features

Double 11 features are the counts of clicks, purchases and add-to-favourite actions on the Double 11 day. The ratios of the Double 11 counts to the overall counts are also calculated. For the entity in Figure 2, its Double 11 click count is 1, its Double 11 click ratio is 1/12=0.083 and its Double 11 buy ratio is 1/2=0.5. If a user has a high Double 11 buy ratio, then the user is more likely to be a one-time deal hunter.

Latest one-week features and latest one-month features are counts/ratios of clicks, add-to-favourite actions and purchases in the last one week and in the last one month before Double 11, respectively.

3.5 Complex features

Trend features are calculated based on monthly features. Given monthly counts or monthly ratios y = (y_1, ..., y_7) over the seven months from May to November, the slope of the trend line is calculated as

    a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}

where n = 7 and x_i = i. We also calculated the deviation of the latest month from the previous months and normalized it using either the mean or the standard deviation, i.e., d = (y_7 - \mu)/\mu or d = (y_7 - \mu)/\sigma, where y_7 is the feature value in November, and \mu and \sigma are the mean and standard deviation of the feature values over the previous six months. A short numerical sketch of these two calculations follows.
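The sketch below implements the slope and deviation calculations for a single monthly series, using the example entity's monthly click counts from Figure 2; the function names are my own, not the paper's.

```python
import numpy as np

def trend_slope(y):
    """Least-squares slope of a monthly feature series y = (y_1, ..., y_n)."""
    y = np.asarray(y, dtype=float)
    n = len(y)                   # n = 7 months, May to November
    x = np.arange(1, n + 1)      # x_i = i
    return (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)

def latest_month_deviation(y):
    """Deviation of the last month from the previous months, normalized
    by their mean and by their standard deviation."""
    y = np.asarray(y, dtype=float)
    prev, last = y[:-1], y[-1]
    mu, sigma = prev.mean(), prev.std()
    return (last - mu) / mu, (last - mu) / sigma

monthly_clicks = [1, 1, 1, 2, 1, 3, 3]   # example entity from Figure 2
print(trend_slope(monthly_clicks))
print(latest_month_deviation(monthly_clicks))
```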
Repeat buyer features. The repeat buyer number of a merchant is defined as the number of users who bought from the merchant on at least two different days. For items, brands and categories, the repeat buyer number is defined as the number of users who bought the item/brand/category on at least two different days. The repeat buyer ratio of a merchant/item/brand/category is defined as the ratio of repeat buyers to all buyers (including non-repeating buyers) of the merchant/item/brand/category.

A repeat buy day of a user at a merchant is defined as a day such that the user bought items from the merchant both before and on that day. The repeat day number of a merchant is the sum of the repeat buy days of all its users. The repeat buy day ratio is the ratio of the repeat day number of a merchant to the sum of the buy days of all the users of the merchant.

Repeat buyer features are also calculated for pairs of entities. For example, for a (merchant, brand) pair, the repeat buyer number is the number of users who bought items of the brand on at least two different days from the merchant. A high number or a high proportion of repeat buyers indicates that the entity is widely liked, so the customers are more likely to come back again. Our experiment results confirmed this (see Figures 7(a), 7(g) and 7(j)). A small sketch of the repeat buyer number and ratio computation is given below.
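The following sketch derives the repeat buyer number and repeat buyer ratio of each merchant from a hypothetical purchase log; the column names and the tiny sample are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical purchase log with one row per (user, merchant, day) purchase.
purchases = pd.DataFrame({
    "user_id":     [1, 1, 2, 3, 3, 3, 4],
    "merchant_id": [10, 10, 10, 10, 10, 20, 20],
    "day":         ["06-01", "07-02", "06-15", "08-05", "09-06", "09-06", "10-07"],
}).drop_duplicates()

# Number of distinct purchase days of each user at each merchant.
buy_days = (purchases.groupby(["merchant_id", "user_id"])["day"]
                      .nunique().rename("buy_day_num").reset_index())

# Repeat buyer number: users with purchases on at least two different days.
repeat_buyers = (buy_days[buy_days["buy_day_num"] >= 2]
                 .groupby("merchant_id").size().rename("repeat_buyer_num"))
all_buyers = buy_days.groupby("merchant_id").size().rename("buyer_num")

features = pd.concat([repeat_buyers, all_buyers], axis=1).fillna(0)
features["repeat_buyer_ratio"] = features["repeat_buyer_num"] / features["buyer_num"]
print(features)
```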
Market share features measure how important a brand/category is to a merchant, or how important a merchant is to a brand/category. Take a (merchant, brand) pair as an example. Let N_MB be the number of purchases of the brand from the merchant, N_M the total number of purchases from the merchant, and N_B the number of purchases of the brand from all the merchants. Similarly, we define U_MB as the number of users buying the brand from the merchant, U_M the total number of buyers of the merchant, and U_B the number of buyers of the brand from all the merchants. The following four features are then generated:
1) merchant's market share on the brand = N_MB/N_B
2) merchant's user share on the brand = U_MB/U_B
3) brand's market share within the merchant = N_MB/N_M
4) brand's user share within the merchant = U_MB/U_M
The first two features measure how important a merchant is to a brand, and the last two features measure how important a brand is to a merchant. Similarly, we can calculate market share features for (merchant, category) pairs.

User-merchant similarity features measure how similar a user and a merchant are based on brands or categories. They are calculated based on the four market share features defined above and the preferences of users on brands/categories. The preferences are measured by the number of times or the number of days the user clicked, purchased, or added to favourites the brand/category. Suppose that a merchant has five brands with respective market shares (0.1, 0.2, 0.05, 0.3, 0.01), and that the numbers of times a user bought the five brands are (0, 1, 2, 0, 2). We can compute the inner product of the two vectors and take it as the similarity score between the user and the merchant, that is, 0.1×0+0.2×1+0.05×2+0.3×0+0.01×2=0.32. We can also take the max, instead of the sum, over all brands, in which case the similarity score is 0.2×1=0.2. Intuitively, the more similar a user and a merchant are, the more likely the user will buy from the merchant again (see Figure 7(h)). The sketch below reproduces this worked example.
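This sketch evaluates the four market share ratios for one (merchant, brand) pair and reproduces the worked similarity example; the purchase and buyer counts at the top are made-up numbers, with variable names chosen to mirror the definitions in the text.

```python
import numpy as np

# Hypothetical counts for one (merchant, brand) pair.
N_MB, N_M, N_B = 40, 500, 200   # purchases: brand-from-merchant, merchant total, brand total
U_MB, U_M, U_B = 30, 350, 150   # corresponding unique-buyer counts
merchant_market_share_on_brand = N_MB / N_B   # how important the merchant is to the brand
merchant_user_share_on_brand   = U_MB / U_B
brand_market_share_in_merchant = N_MB / N_M   # how important the brand is to the merchant
brand_user_share_in_merchant   = U_MB / U_M

# Worked user-merchant similarity example from the text.
brand_market_share = np.array([0.1, 0.2, 0.05, 0.3, 0.01])  # the merchant's five brands
user_buy_times     = np.array([0, 1, 2, 0, 2])               # user's purchases per brand
sim_sum = float(np.dot(brand_market_share, user_buy_times))   # inner product -> 0.32
sim_max = float(np.max(brand_market_share * user_buy_times))  # max variant   -> 0.2
print(sim_sum, sim_max)
```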
PCA features are generated based on the similarity between merchants. Given a pair of merchants, we use the number of users who bought items from both of them as their similarity score. The total number of merchants is 4,995, so a matrix of 4,995 × 4,995 is built. This matrix is highly sparse, with most elements equal to 0. Simply adding it into the feature list does not obviously improve the accuracy of classification models, but it dramatically increases the model training time. As such, we applied PCA (principal component analysis) [3, 12] to the similarity matrix. Then, for each merchant, the top-10 principal coordinates are used as merchant features.

LDA features. Latent Dirichlet Allocation (LDA) [4] is often used in text mining to retrieve topics from a corpus of documents. It views each document as a mixture of various topics, where each topic is characterized by a distribution over words. The retrieved distribution of topics can be taken as a feature of the document. We first model users as documents and merchants as words. Given a user, we extract from the activity log data all the merchants from which the user purchased items, and treat these merchants as the words of the user's document. By applying LDA on these user documents, we generate features (i.e., distributions of topics over merchants) for users. Similarly, we model merchants as documents and users as words, and generate features (i.e., distributions of topics over users) for merchants. We set the number of topics to 40 based on the performance of the predictive models. A sketch of both constructions is shown after this paragraph.
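The sketch below builds both feature families on small random stand-ins for the real matrices (the paper's matrices are 4,995 merchants and the full user set). It uses scikit-learn's TruncatedSVD, a PCA-style decomposition that works directly on sparse matrices, rather than whatever PCA implementation the authors used, and scikit-learn's LatentDirichletAllocation with 40 topics; all sizes and names here are assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

rng = np.random.default_rng(0)

# --- PCA-style features from a merchant-merchant shared-buyer matrix ---
# Toy stand-in for the 4995 x 4995 sparse matrix of co-buyer counts.
n_merchants = 200
co_buyers = sparse_random(n_merchants, n_merchants, density=0.02,
                          random_state=0, data_rvs=lambda k: rng.integers(1, 50, k))
svd = TruncatedSVD(n_components=10, random_state=0)    # top-10 principal coordinates
merchant_pca_features = svd.fit_transform(co_buyers)    # shape: (n_merchants, 10)

# --- LDA features: users as documents, merchants as words ---
n_users = 300
user_merchant_counts = sparse_random(n_users, n_merchants, density=0.03,
                                     random_state=1, data_rvs=lambda k: rng.integers(1, 5, k))
lda = LatentDirichletAllocation(n_components=40, random_state=0)
user_topic_features = lda.fit_transform(user_merchant_counts)  # (n_users, 40) topic mixtures

print(merchant_pca_features.shape, user_topic_features.shape)
```

Modelling merchants as documents and users as words is the same call with the count matrix transposed.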
3.6 Age/gender related features

Different user groups may favor different types of products. For example, clothes and cosmetics are more attractive to women, while electronic products are more appealing to men. As such, we generated features to describe the popularity of merchants, brands, categories and items within different user groups, where users are grouped based on their gender or age range. These features include overall buy counts, monthly aggregation of monthly buy counts, penetration features and repeat buyer features. Only users of a particular age range or a particular gender are considered when calculating these features.

3.7 Feature selection

We have generated 1364 features in total. It is crucial to identify the important ones and remove those that are of little use, to reduce the training cost [13, 7]. During the competition, we tested both the wrapper method described in [15, 16] and the feature ranking function provided by XGBoost, and we found that all the methods yield very similar feature rankings. In this paper, we report the results obtained with the feature ranking function of XGBoost. Besides ranking all the features together, we also group features based on their types or the profiles they belong to. We rank the features within each group separately and output the top features. We have also evaluated the importance of each feature group by leaving it out.

4. MODEL TRAINING

In the competition, we trained various classification models, including Factorization Machine [14], Logistic Regression [1], Random Forest [5], GBM [10] and XGBoost [6], where grid search was used to select the optimal parameters. XGBoost performed the best.

To further improve the performance, we used ensemble techniques to blend together the predictions made by the above single classifiers. The blending model is basically a weighted sum, defined as

    p(u, m) = \sum_{i=1}^{k} w_i \cdot p_i(u, m)

where p(u, m) is the final probability that a user u will make a repeated purchase from a merchant m, p_i(u, m) is the probability predicted by the i-th single model, w_i is the weight assigned to the i-th single model, and k is the number of single models.

We tested two methods to assign the weight w_i to the i-th single model. For the first method, we manually assign weights to the single models, such that single models with higher AUC (area under the ROC curve) scores receive bigger weights. For the second method, we built a classifier to learn the weights. In particular, we generated a k-dimensional feature vector for each user-merchant pair (u, m), where the i-th dimension is the probability p_i(u, m) given by the i-th single model, and we trained a linear model on these feature vectors to learn the weights. A minimal sketch of the weighted blend is given below.
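A minimal sketch of the weighted-sum blend follows; the model scores and weights are made up, and the weight normalization is a convenience added here to keep the blended output on a probability scale, not something stated in the paper.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted-sum blending: p(u, m) = sum_i w_i * p_i(u, m).

    predictions: array of shape (k, n) holding the scores of k single
    models for n (user, merchant) pairs; weights: length-k array.
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # optional normalization (added here)
    return weights @ predictions

# Hypothetical scores of three single models for four user-merchant pairs.
p_single = [[0.10, 0.40, 0.70, 0.20],   # e.g. XGBoost
            [0.15, 0.35, 0.60, 0.25],   # e.g. GBM
            [0.05, 0.50, 0.65, 0.30]]   # e.g. logistic regression
w = [0.5, 0.3, 0.2]                      # manually assigned, larger for higher-AUC models
print(blend(p_single, w))
```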
Our experiments showed that manually assigned weights are often as good as, and sometimes even better than, the learned weights in this application. Therefore, in the competition, we mainly assigned weights manually to blend the predictions of different models together. We did this in an incremental manner: in each round we kept the best prediction thus far and blended it with the prediction of a single model. If the resulting AUC score did not improve, we discarded both the blended model and the single model; otherwise, we updated the best prediction using the blended model.

[Figure 3: AUC of single models and the blending model.]

Figure 3 shows the AUC scores of the single models on the testing data. The two linear models, Factorization Machine and Logistic Regression, performed closely. Although their scores are not very high, they contribute to the overall AUC score in the blending model. In the ensemble algorithm family, Random Forest has the worst AUC score; however, we found that bagging of Random Forest models can improve the score significantly. XGBoost has the best AUC score of 0.70282. Compared with the runner-up, Gradient Boosting Machine, its improvement is more than 0.7%. We blended around 20 single models with various parameter settings and feature settings, and achieved an AUC score of 0.70494, an improvement of 0.21% over the best single model (i.e., XGBoost).

5. A PERFORMANCE STUDY

In this section, we evaluate the importance of features in groups. We first conduct experiments on the training data by five-fold cross validation to measure the importance of each group of features (Section 5.1), and to find the top features locally in each group as well as globally in the full feature set (Section 5.2). Extensive experimental results show that some features are less important: removing them has only a marginal effect on the performance of the predictive models. Such a finding can help a user determine the subset of features to be applied in real applications, for a good balance between model accuracy and training time. Then, we carry out experiments on the testing data (Section 5.3), studying how the prediction accuracy increases as we incrementally use more features to train models.

5.1 Importance of feature groups

We have generated 1364 features in total. They are organized in groups, either by type or by profile, as summarized in Table 4 and discussed in Section 3. Features in the same group are generated based on the same hypothesis, and their importance is evaluated together. If one group of features turns out to be unimportant, we can remove the whole group to save the cost of both feature engineering and model training. Table 5 lists the feature groups together with their sizes, and its second row shows the AUC score of the full feature set. We study the merchant-brand (MB) profile and the merchant-category (MC) profile together, since they are of small sizes. Likewise, the user-brand (UB) profile and the user-category (UC) profile are investigated together.

[Table 5: Feature groups and their AUC scores, listing for each feature group its number of features, its AUC score when used alone, and the leave-out AUC score when it is excluded from the full feature set.]

We use five-fold cross validation to evaluate the importance of each group of features. The predictive model in the experiments is XGBoost with the parameter setting eta=0.04, rounds=400, max_depth=7, min_child_weight=200 and subsample=0.8. The five folds are the same for all the experiments, and the reported results are averages over the five folds. A sketch of this evaluation loop is shown below.
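The sketch below runs this five-fold evaluation on a synthetic stand-in for the engineered feature matrix (the real one has 1364 features); in the scikit-learn wrapper of XGBoost, eta maps to learning_rate and rounds to n_estimators. The synthetic data and class imbalance are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the engineered feature matrix; imbalanced labels.
X, y = make_classification(n_samples=20000, n_features=50, n_informative=10,
                           weights=[0.94, 0.06], random_state=0)

# Parameter setting reported in Section 5.1.
model = XGBClassifier(learning_rate=0.04, n_estimators=400, max_depth=7,
                      min_child_weight=200, subsample=0.8, eval_metric="auc")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(auc.mean(), auc.std())
```

Evaluating a feature group alone corresponds to passing only that group's columns as X; the leave-out score corresponds to passing all columns except that group's.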
We first train XGBoost with each group of features alone. The "AUC" column in Table 5 and the AUC bars in Figure 4 are the averaged AUC scores of XGBoost over the five folds. Clearly, the higher the score, the more important the group of features. We also evaluate the importance of a group of features by considering the full feature set but excluding the group. In this way, if the AUC score drops more, then the group is more important. The "leave-out AUC" column in Table 5 and the leave-out AUC bars in Figure 4 give the results.

[Figure 4: AUC scores of feature groups: (a) entity and interaction profiles, (b) feature types.]

Among the five entity profiles in Table 5, the merchant profile and the brand profile have the highest AUC scores: 0.6601 and 0.65818, respectively. However, if one of them is removed, the leave-out AUC score is even a bit higher than 0.70036, the AUC score of the full feature set. This indicates the redundancy of the two profiles. The user profile has the lowest AUC score, but the AUC score drops the most if it is removed. This implies that the user profile has important information that does not exist in other profiles. Among the interaction profiles, the MB and MC profiles together have the highest AUC score, and the AUC score drops the most if they are excluded.

Among all the feature types, monthly aggregation has the highest AUC score, 0.68729. When all the 276 aggregation features (164 monthly aggregation features, 88 user aggregation features and 24 merchant aggregation features) are used to build XGBoost, the AUC score is 0.6945, which is just 0.82% lower than the AUC score when all the features are used. LDA has the lowest leave-out AUC score (i.e., the AUC score drops the most if LDA features are excluded), which is 0.23% lower than the AUC of the full feature set. These results suggest that any single feature group can be removed without decreasing the AUC score much.

The "leave-out AUC" column in Table 5 indicates that some feature groups may be redundant given all other features: the merchant profile, brand profile, UB and UC profiles, monthly aggregation features, merchant aggregation features, latest-one-month features, user-merchant similarity features and age-related features. The leave-out AUC scores of these feature groups are even slightly higher than that of the full feature set. For example, if we exclude the merchant profile from the full feature set, then the leave-out AUC score is 0.70096, which is slightly higher than 0.70036. If we remove all of the above seemingly redundant feature groups, the number of remaining features is 691 (50.7% of the total) and the AUC score becomes 0.69936, which is lower than the AUC score of the full feature set.

5.2 Top features

XGBoost calculates a gain score for each feature, which measures how important a feature is to the model. We rank features based on their average gain scores over the five folds. Each feature has two rankings: 1) the profile ranking is the ranking of a feature when only features in the corresponding profile are used to build XGBoost models, and 2) the global ranking is the ranking of a feature when all of the 1364 features are used to build XGBoost models. Table 6 shows the top one or two features in each profile based on profile ranking; the global rankings of these features are given in the "global rank" column.

[Table 6: Features with high profile ranking.]

Among the features in the user profile, the average numbers of items clicked in or purchased from merchants are the top-2 features. For the merchant profile, the average and standard deviation of the number of days that users made a purchase in the merchant are the top-2 features in the profile, with the latter being the top feature globally. In the user-merchant, merchant-brand and merchant-category profiles, features that rank high locally in the profiles also have high global rankings.

Table 7 lists the top-20 features based on global ranking; those already reported in Table 6 are not repeated. The top features are mainly from user aggregation (7 features), repeat buyer (3) and product diversity (3), which together account for almost two thirds of the top-20 features.

[Table 7: Features with high global ranking.]

The sketch below shows how such a gain-based ranking can be extracted from a trained model.
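This sketch extracts per-feature gain scores from a trained XGBoost model via the booster's get_score API; the synthetic data and feature names are placeholders for the 1364 engineered features, and averaging over folds (as the paper does) would simply repeat this per fold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder feature matrix with named columns.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

model = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1,
                      eval_metric="auc").fit(X, y)

# Gain score per feature: total loss reduction contributed by splits on it.
gain = model.get_booster().get_score(importance_type="gain")
ranking = pd.Series(gain).sort_values(ascending=False)
print(ranking.head(10))   # global ranking; a profile ranking would refit on that profile only
```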
All the features we generated are numeric. To visualize their correlations with the class labels, we discretize their values using the method in [9]. To avoid generating too many small bins, we set the minimum number of instances in a bin to 5000. Figure 5 reports, for the top features in the entity profiles, the relative frequency of each bin (blue bars, with the frequency given by the left y-axis) and the proportion of positive instances therein (green bars, with the proportion given by the right y-axis). Figure 6 reports the top features in the interaction profiles, and Figure 7 shows the statistics of the top globally ranked features. The x-axis in these figures gives the value ranges of the discretized bins. The proportion of positive instances increases with the feature values in most cases. None of the features is a strong indicator of class labels: the maximal information gain of all features is only 0.00868 after discretization. Features U_monthly_click_merchant_num_std (rank 13) and U_buy_merchant_ratio (rank 20) are not shown in Figure 7, because they have only one bin after discretization.

[Figure 5: Top features in entity profiles.]

[Figure 6: Top features in interaction profiles.]

[Figure 7: Features with high global ranking.]

Feature user_seller_store_visit_day_count_MDP is set to -999 if a user did not visit the merchant from May to October. This feature is discretized into two bins, (-inf, -999] and (-999, +inf), and the second bin has a higher proportion of positive classes. It indicates that if a user visited a merchant before November, then the user is more likely to buy from the merchant again after Double 11.

Some user aggregation features and repeat buyer features capture similar characteristics of entities or interactions from different aspects, and they are highly correlated. For example, feature MB_repeat_buy_day_ratio (Figure 7(a)) and feature MB_user_buy_day_num_avg (Figure 7(c)) are both merchant-brand features, and they show similar patterns. The former is the proportion of buyers who bought the brand from the merchant on at least two different days among the users who bought the brand from the merchant at least once; the latter is the average number of days that users bought the brand from the merchant. A larger MB_repeat_buy_day_ratio value often implies a larger MB_user_buy_day_num_avg value, and vice versa. The same correlation is observed for feature MC_user_buy_day_num_avg (Figure 7(f)) and feature MC_repeat_buy_day_ratio (Figure 7(j)), both of which are merchant-category features.

Several feature types, including monthly action ratio, Double 11, latest-one-week, latest-one-month, penetration, PCA and LDA, do not occur in the global top-20 feature list (Table 7). Table 8 reports the top features in these feature groups based on global ranking, and Figure 8 shows their statistics.

[Table 8: Features with the highest global ranking among features in the remaining feature groups.]

[Figure 8: Top features in other feature type groups.]

The bin statistics plotted in Figures 5 to 8 can be reproduced with a simple binning pass over a feature column, as sketched below.
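The sketch below computes per-bin relative frequency and positive-instance proportion for one feature. It uses plain quantile binning as a stand-in for the MDL-based discretization of [9] that the paper actually applies, and the feature and label columns are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature and label columns for illustration.
n = 50_000
feature = rng.gamma(shape=2.0, scale=3.0, size=n)
label = (rng.random(n) < 0.04 + 0.01 * feature / feature.max()).astype(int)

df = pd.DataFrame({"feature": feature, "label": label})
df["bin"] = pd.qcut(df["feature"], q=10, duplicates="drop")   # about 5000 instances per bin

bin_stats = (df.groupby("bin", observed=True)["label"]
               .agg(frequency="size", positive_rate="mean"))
bin_stats["frequency"] /= len(df)     # relative frequency, as plotted in Figures 5 and 6
print(bin_stats)
```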
5.3 Performance on testing data

Sections 5.1 and 5.2 evaluate the feature importance on the training data by five-fold cross validation. In this subsection, we evaluate the importance of features on the testing data. We set the parameters of XGBoost as follows: eta=0.01, rounds=2000, max_depth=7, min_child_weight=200 and subsample=0.8. When all the features are used, the testing AUC score is 0.702508, as shown in the second row of Table 9. (In the competition, we used a smaller learning rate and more rounds to achieve the slightly better AUC score of 0.70282.) The last column of the table is the percentage of AUC score drop when only subsets of features, as specified in the second and third columns of the table, are used.

[Table 9: AUC on testing data, listing feature subsets, their AUC scores and the percentage of AUC drop relative to the full feature set.]

Feature set 2 contains 354 features from nine feature types and three profiles. Its AUC score is only 1.08% lower than the AUC score obtained when all the 1364 features (i.e., feature set 1) are used. When more feature types and/or profiles are added (top-down in Table 9), the AUC score increases only marginally. The results again imply that we can use a smaller number of features to train predictive models without decreasing the AUC score significantly.

6. CONCLUSION

In this paper, we presented our winning solution for the repeat buyer prediction competition hosted at the IJCAI 2015 conference. We generated a large number of features to capture the preferences and behaviors of users, the characteristics of merchants, brands, categories and items, and the interactions among them. Our study shows that none of the generated features is a strong indicator of class labels, so we need hundreds of features to achieve a relatively high AUC score. We hope our winning solution, along with the concrete analysis of feature engineering, will serve as a solid stepping stone for practitioners solving future e-commerce problems. It is a tedious task to generate and manage a large number of features. As our next step, we will explore how to automate the feature generation and selection process for e-commerce prediction tasks.

7. REFERENCES

[1] Fitting generalized linear models. Available at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/glm.html
[2] Generalized linear models. Available at http://scikit-learn.org/stable/modules/linear_model.html
[3] H. Abdi and L. J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433-459, 2010.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5):993-1022, 2003.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[6] T. Chen and T. He. XGBoost: extreme gradient boosting. Available at https://github.com/dmlc/xgboost
[7] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(1):131-156, 1997.
[8] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87, 2012.
[9] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the International Joint Conference on Artificial Intelligence, pages 1022-1027, 1993.
[10] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189-1232, 2001.
[11] Y.-C. Juan, W.-S. Chin, and Y. Zhuang. Field-aware factorization machines. Available at https://github.com/guestwalk/libffm
[12] S. Lê, J. Josse, and F. Husson. FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1):1-18, 2008.
[13] L. C. Molina, L. Belanche, and A. Nebot. Feature selection algorithms: A survey and experimental evaluation. In ICDM, pages 306-313, 2002.
[14] S. Rendle. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), 2012.
[15] K.-Q. Shen, C.-J. Ong, X.-P. Li, and E. Wilder-Smith. Feature selection via sensitivity analysis of SVM probabilistic outputs. Machine Learning, 70(1):1-20, 2008.
[16] J.-B. Yang and C.-J. Ong. An effective feature selection method via mutual information estimation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(6):1550-1559, 2012.
