MI.rar_choosebestfeature_featureselection resource - CSDN Library

72 views · Uploaded 2022-09-14 22:58:53 · 2KB RAR

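Feature selection is a key step in machine learning and data analysis: it picks, from the original feature set, the subset that is most relevant to the task, with the aim of improving model performance and interpretability. The title "MI.rar_choose best feature_feature selection" indicates that the topic is feature selection based on mutual information (MI). MI measures how strongly a feature and the target variable depend on each other, that is, how much knowing the feature reduces uncertainty about the target. The information gain used by decision trees is the same quantity computed between a feature and the class label: ID3 splits on information gain, C4.5 on the normalized gain ratio, while CART typically uses Gini impurity instead.

The description "this is the best choose for feature selection" suggests that the uploader regards MI-based selection as an effective method. Plausible reasons are that MI captures nonlinear dependence between a feature and the target, is not restricted to binary classification, and applies to both discrete and continuous features, although continuous features must first be discretized or have their densities estimated. The tags "choose_best_feature" and "feature_selection" further stress that the goal is to find the best feature subset.

Feature selection strategies fall into three groups: filter, wrapper, and embedded. Filter methods are fast but may ignore interactions between features. Wrapper methods evaluate candidate subsets with the model itself, which accounts for interactions but is computationally expensive. Embedded methods perform selection during model training and balance efficiency against effectiveness.

MI-based selection is a filter method. It computes the MI between each feature and the target variable and sorts the features by these values. The final subset is formed either by applying a threshold or by keeping the k features with the highest MI. The computation is relatively simple and scales to large datasets. The drawback is that each feature is scored on its own, so interactions between features and redundancy among the selected features go undetected.

The archive contains a single file, MI.m, a MATLAB function that estimates the MI between each feature and the class labels and ranks the features. It quantizes every feature into Q levels (12 by default) using bins that hold approximately equal numbers of samples, computes the MI between each quantized feature and the labels, and returns the feature indices together with their MI values in descending order. Choosing a threshold or the top k features is left to the caller.

In practice, feature selection is also shaped by the need to prevent overfitting, by limits on computing resources, and by the demand for model interpretability. Choosing a feature subset is a trade-off that must be adjusted to the problem and the data. For example, with a small dataset a wrapper method may be affordable and can search for a better subset, although it does not guarantee a global optimum unless the search is exhaustive. With large-scale data, a filter method such as MI-based selection is usually more practical.

MI-based feature selection is an effective data preprocessing technique that helps retain the most informative part of the raw data and can improve both the predictive power and the interpretability of a model. The selection strategy should be chosen according to project requirements and resource limits, and its effect should be verified with appropriate evaluation metrics.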
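The filter-style ranking described above can be illustrated with a short Python sketch. It is not a port of MI.m: it assumes the features are already discrete, uses a plain plug-in estimate of I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y))), and the names `mutual_information` and `rank_features` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels, k=None):
    """Return (feature index, MI) pairs in descending MI order; keep the top k if given."""
    scores = [(i, mutual_information(col, labels)) for i, col in enumerate(columns)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores if k is None else scores[:k]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 0, 0, 1, 1]
    columns = [
        [0, 1, 0, 1, 0, 1, 0, 1],  # statistically independent of the labels
        [0, 0, 1, 1, 0, 0, 1, 1],  # identical to the labels
        [0, 0, 1, 1, 0, 0, 1, 0],  # labels with the last entry flipped
    ]
    for index, score in rank_features(columns, labels, k=2):
        print(index, round(score, 3))
```

Because every feature is scored independently, two redundant copies of the same informative feature would both rank near the top, which is the limitation of filter methods noted above.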

MI.rar (1 file)

MI.m 4KB

function [features,weights] = MI(features,labels,Q)
% function [features,weights] = MI(features,labels,Q)
% Estimates the mutual information between features and associated class
% labels using a quantized feature space.
%
% Inputs:
%   features: N x F sized matrix of features, where N is the number of
%             samples and F is the number of features
%   labels:   N x 1 sized vector of class labels corresponding to each sample
%   Q:        the number of quantization levels used for the features
%             (default = 12)
%
% Outputs:
%   features: F x 1 sized vector of feature indices in the
%             descending order of relevance.
%   weights:  F x 1 sized vector of feature relevances (MIs) in the
%             descending order.
%
% Author: Okko Rasanen, 2013. Mail: okko.rasanen@aalto.fi
%
% The algorithm can be freely used for research purposes.
%
% Please see J. Pohjalainen, O. Rasanen & S. Kadioglu: "Feature Selection Methods and
% Their Combinations in High-Dimensional Classification of Speaker Likability,
% Intelligibility and Personality Traits", Computer Speech and Language, 2015, for more details.

if nargin <3
    Q = 12;
end

edges = zeros(size(features,2),Q+1);

% Compute feature-specific quantization bins so that each bin has approximately equal number of
% samples in the training set
for k = 1:size(features,2)
    minval = min(features(:,k));
    maxval = max(features(:,k));
    if minval==maxval
        continue;
    end
    quantlevels = minval:(maxval-minval)/500:maxval;
    N = histc(features(:,k),quantlevels);
    totsamples = size(features,1);
    N_cum = cumsum(N);
    edges(k,1) = -Inf;
    stepsize = totsamples/Q;
    for j = 1:Q-1
        a = find(N_cum > j.*stepsize,1);
        edges(k,j+1) = quantlevels(a);
    end
    edges(k,j+2) = Inf;
end

% Quantize data according to the obtained bins
S = zeros(size(features));
for k = 1:size(S,2)
    S(:,k) = quantize(features(:,k),edges(k,:))+1;
end

% Compute mutual information (MI) between the quantized features and
% the class labels
I = zeros(size(features,2),1);
for k = 1:size(features,2)
    I(k) = computeMI(S(:,k),labels,0);
end

% Sort features into descending order
[weights,features] = sort(I,'descend');

%% EOF

function [I,M,SP] = computeMI(seq1,seq2,lag)
% function [I,M,SP] = computeMI(seq1,seq2,lag)
% Computes the mutual information (MI) between seq1 and seq2 at the
% given delay (lag) between the sequences.
%
% Inputs:
%
%   seq1: a discrete sequence of length N
%   seq2: a discrete sequence of length N
%   lag:  the number of elements that seq1 is delayed with respect to
%         seq2 (a positive or negative integer).
%         Default = 0;

if nargin <3
    lag = 0;
end

if(length(seq1) ~= length(seq2))
    error('Input sequences are of different length');
end

% Count the frequency and probability of each symbol in seq1
lambda1 = max(seq1);
symbol_count1 = zeros(lambda1,1);
for k = 1:lambda1
    symbol_count1(k) = sum(seq1 == k);
end
symbol_prob1 = symbol_count1./sum(symbol_count1)+0.000001;

% Count the frequency and probability of each symbol in seq2
lambda2 = max(seq2);
symbol_count2 = zeros(lambda2,1);
for k = 1:lambda2
    symbol_count2(k) = sum(seq2 == k);
end
symbol_prob2 = symbol_count2./sum(symbol_count2)+0.000001;

% Compute the joint occurrence frequencies of symbol pairs at the given lag
M = zeros(lambda1,lambda2);
if(lag > 0)
    for k = 1:length(seq1)-lag
        loc1 = seq1(k);
        loc2 = seq2(k+lag);
        M(loc1,loc2) = M(loc1,loc2)+1;
    end
else
    for k = abs(lag)+1:length(seq1)
        loc1 = seq1(k);
        loc2 = seq2(k+lag);
        M(loc1,loc2) = M(loc1,loc2)+1;
    end
end

% Product of individual state probabilities as a matrix
SP = symbol_prob1*symbol_prob2';

% Pair joint probability
M = M./sum(M(:))+0.000001;

% Compute MI
I = sum(sum(M.*log2(M./SP)));

function y = quantize(x, q)
x = x(:);
nx = length(x);
nq = length(q);
y = sum(repmat(x,1,nq)>repmat(q,nx,1),2);
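The equal-frequency quantization step at the start of the listing can be sketched in Python for readers who want to experiment outside MATLAB. This is a simplified illustration under stated assumptions, not a line-by-line port: MI.m approximates the cut points on a 501-point histogram grid between the minimum and maximum, whereas this sketch reads them directly from the sorted samples. The function names and the 1-based bin numbering are choices made here for illustration only.

```python
from bisect import bisect_left


def equal_frequency_edges(values, q=12):
    """Return q-1 interior cut points so that the q bins hold roughly equal counts.

    Each cut point is the largest sample of the lower bin, which matches the
    strict "value > edge" comparison used by quantize() below.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(0, (j * n) // q - 1)] for j in range(1, q)]


def quantize(values, edges):
    """Map each value to a 1-based bin index: 1 + number of edges strictly below it."""
    return [bisect_left(edges, v) + 1 for v in values]


if __name__ == "__main__":
    samples = [0.2, 3.1, 0.9, 7.5, 2.2, 5.0, 1.4, 9.8]
    edges = equal_frequency_edges(samples, q=4)
    print(edges)
    print(quantize(samples, edges))
```

With heavily tied values several cut points can coincide, so some bins stay empty. The original function handles the extreme case of a constant feature by skipping the edge computation for that column.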
