Introduction to Data Mining (English edition, 2nd edition), Pang-Ning Tan, 2019, Part 2 of 3 - CSDN Library

Uploaded 2023-03-05 17:17:10 · PDF · 12.19 MB

Randomforestshavebeenempiricallyfoundtoprovidesignificant

improvementsingeneralizationperformancethatareoftencomparable,ifnot

superior,totheimprovementsprovidedbytheAdaBoostalgorithm.Random

forestsarealsomorerobusttooverfittingandrunmuchfasterthanthe

AdaBoostalgorithm.

4.10.7EmpiricalComparisonamong

EnsembleMethods

Table4.5 showstheempiricalresultsobtainedwhencomparingthe

performanceofadecisiontreeclassifieragainstbagging,boosting,and

randomforest.Thebaseclassifiersusedineachensemblemethodconsistof

50decisiontrees.Theclassificationaccuraciesreportedinthistableare

obtainedfromtenfoldcross-validation.Noticethattheensembleclassifiers

generallyoutperformasingledecisiontreeclassifieronmanyofthedatasets.

Table4.5.Comparingtheaccuracyofadecisiontreeclassifieragainst

threeensemblemethods.

DataSet Numberof(Attributes,

Classes,Instances)

Decision

Tree(%)

Bagging(%) Boosting(%) RF(%)

Anneal (39,6,898) 92.09 94.43 95.43 95.43

Australia (15,2,690) 85.51 87.10 85.22 85.80

Auto (26,7,205) 81.95 85.37 85.37 84.39

Breast (11,2,699) 95.14 96.42 97.28 96.14

Cleve (14,2,303) 76.24 81.52 82.18 82.18

4.11ClassImbalanceProblem

Inmanydatasetsthereareadisproportionatenumberofinstancesthat

belongtodifferentclasses,apropertyknownasskeworclass

imbalance.Forexample,considerahealth-careapplicationwherediagnostic

reportsareusedtodecidewhetherapersonhasararedisease.Becauseof

theinfrequentnatureofthedisease,wecanexpecttoobserveasmaller

numberofsubjectswhoarepositivelydiagnosed.Similarly,increditcard

frauddetection,fraudulenttransactionsaregreatlyoutnumberedbylegitimate

transactions.

Thedegreeofimbalancebetweentheclassesvariesacrossdifferent

applicationsandevenacrossdifferentdatasetsfromthesameapplication.

Forexample,theriskforararediseasemayvaryacrossdifferentpopulations

ofsubjectsdependingontheirdietaryandlifestylechoices.However,despite

theirinfrequentoccurrences,acorrectclassificationoftherareclassoftenhas

greatervaluethanacorrectclassificationofthemajorityclass.Forexample,it

maybemoredangeroustoignoreapatientsufferingfromadiseasethanto

misdiagnoseahealthyperson.

Moregenerally,classimbalanceposestwochallengesforclassification.First,

itcanbedifficulttofindsufficientlymanylabeledsamplesofarareclass.Note

thatmanyoftheclassificationmethodsdiscussedsofarworkwellonlywhen

thetrainingsethasabalancedrepresentationofbothclasses.Althoughsome

classifiersaremoreeffectiveathandlingimbalanceinthetrainingdatathan

others,e.g.,rule-basedclassifiersandk-NN,theyareallimpactedifthe

minorityclassisnotwell-representedinthetrainingset.Ingeneral,aclassifier

trainedoveranimbalanceddatasetshowsabiastowardimprovingits

performanceoverthemajorityclass,whichisoftennotthedesiredbehavior.

Asaresult,manyexistingclassificationmodels,whentrainedonan

imbalanceddataset,maynoteffectivelydetectinstancesoftherareclass.

Second,accuracy,whichisthetraditionalmeasureforevaluating

classificationperformance,isnotwell-suitedforevaluatingmodelsinthe

presenceofclassimbalanceinthetestdata.Forexample,if1%ofthecredit

cardtransactionsarefraudulent,thenatrivialmodelthatpredictsevery

transactionaslegitimatewillhaveanaccuracyof99%eventhoughitfailsto

detectanyofthefraudulentactivities.Thus,thereisaneedtousealternative

evaluationmetricsthataresensitivetotheskewandcancapturedifferent

criteriaofperformancethanaccuracy.

Inthissection,wefirstpresentsomeofthegenericmethodsforbuilding

classifierswhenthereisclassimbalanceinthetrainingset.Wethendiscuss

methodsforevaluatingclassificationperformanceandadaptingclassification

decisionsinthepresenceofaskewedtestset.Intheremainderofthis

section,wewillconsiderbinaryclassificationproblemsforsimplicity,where

theminorityclassisreferredasthepositive classwhilethemajorityclass

isreferredasthenegative class.

4.11.1BuildingClassifierswithClass

Imbalance

Therearetwoprimaryconsiderationsforbuildingclassifiersinthepresenceof

classimbalanceinthetrainingset.First,weneedtoensurethatthelearning

algorithmistrainedoveradatasetthathasadequaterepresentationofboth

themajorityaswellastheminorityclasses.Somecommonapproachesfor

ensuringthisincludesthemethodologiesofoversamplingandundersampling

(+)

(−)

thetrainingset.Second,havinglearnedaclassificationmodel,weneedaway

toadaptitsclassificationdecisions(andthuscreateanappropriatelytuned

classifier)tobestmatchtherequirementsoftheimbalancedtestset.Thisis

typicallydonebyconvertingtheoutputsoftheclassificationmodeltoreal-

valuedscores,andthenselectingasuitablethresholdontheclassification

scoretomatchtheneedsofatestset.Boththeseconsiderationsare

discussedindetailinthefollowing.

OversamplingandUndersampling

Thefirststepinlearningwithimbalanceddataistotransformthetrainingset

toabalancedtrainingset,wherebothclasseshavenearlyequal

representation.Thebalancedtrainingsetcanthenbeusedwithanyofthe

existingclassificationtechniques(withoutmakinganymodificationsinthe

learningalgorithm)tolearnamodelthatgivesequalemphasistoboth

classes.Inthefollowing,wepresentsomeofthecommontechniquesfor

transforminganimbalancedtrainingsettoabalancedone.

Abasicapproachforcreatingbalancedtrainingsetsistogenerateasample

oftraininginstanceswheretherareclasshasadequaterepresentation.There

aretwotypesofsamplingmethodsthatcanbeusedtoenhancethe

representationoftheminorityclass:(a)undersampling,wherethefrequency

ofthemajorityclassisreducedtomatchthefrequencyoftheminorityclass,

and(b)oversampling,whereartificialexamplesoftheminorityclassare

createdtomakethemequalinproportiontothenumberofnegativeinstances.

Toillustrateundersampling,consideratrainingsetthatcontains100positive

examplesand1000negativeexamples.Toovercometheskewamongthe

classes,wecanselectarandomsampleof100examplesfromthenegative

classandusethemwiththe100positiveexamplestocreateabalanced

trainingset.Aclassifierbuiltovertheresultantbalancedsetwillthenbe
