Date: 2021-05-10
WANG L, LIU Y, LIU Z Z, et al. Decision tree construction algorithm based on attribute dispersion and feature measurement[J]. Journal of Henan Polytechnic University (Natural Science), 2021, 40(3): 127-133.
Decision tree construction algorithm based on attribute dispersion and feature measurement
WANG Lei, LIU Yu, LIU Zhizhong, QI Junyan
College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, Henan, China
Abstract: Decision tree algorithms based on information entropy suffer from multi-valued attribute bias, poor handling of continuous attributes and high time complexity. To address these problems, a decision tree feature measurement method based on the concept of the dispersion ratio was proposed. First, the K-means clustering algorithm was used to discretize continuous numerical attributes. Second, the dispersion ratio of each attribute was computed from the ratio between its weight within each class and its weight over the entire set of condition attributes, which avoided the costly logarithmic operations required by entropy calculation. Finally, the topology among the feature attributes was determined according to the magnitude of their dispersion ratios, completing the construction of the tree. The experimental results showed that, compared with two improved decision tree algorithms, K_C4.5 and Id3_improved, the algorithm based on dispersion-ratio attribute splitting resolved multi-valued attribute bias more effectively, reduced time complexity, and further improved classification on continuous data sets generated in practice.
Key words: decision tree; attribute dispersion; dispersion ratio; K-means
doi: 10.16186/j.cnki.1673-9787.2020040086
Funding: National Natural Science Foundation of China (61872126); Key Science and Technology Research Project of Henan Province (192102210123)
Received: 2019/04/23
Revised: 2019/05/29
Published: 2021/05/15
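
The discretization step named in the abstract can be illustrated with a minimal sketch. It applies one-dimensional K-means (Lloyd's algorithm) to a single continuous attribute and replaces each value with the index of its cluster. The function name `kmeans_discretize`, the choice of `k`, and the evenly spaced initialization are assumptions made for illustration and are not taken from the paper. The dispersion-ratio measure is not sketched because the abstract does not state its formula.

```python
from typing import List, Tuple


def _nearest(value: float, centers: List[float]) -> int:
    """Index of the center closest to value (lowest index wins ties)."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))


def kmeans_discretize(
    values: List[float], k: int, max_iter: int = 100
) -> Tuple[List[int], List[float]]:
    """Discretize one continuous attribute with 1-D K-means.

    Returns (labels, centers), where labels[i] is the interval index of
    values[i] and intervals are numbered by ascending center.
    """
    distinct = sorted(set(values))
    n = len(distinct)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and the number of distinct values")

    # Assumption: deterministic start from evenly spaced distinct values.
    if k == 1:
        centers = [distinct[n // 2]]
    else:
        centers = [distinct[(i * (n - 1)) // (k - 1)] for i in range(k)]

    for _ in range(max_iter):
        # Assignment step: attach every value to its nearest center.
        labels = [_nearest(v, centers) for v in values]
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers

    # Number the intervals by ascending center and assign final labels.
    centers = sorted(centers)
    labels = [_nearest(v, centers) for v in values]
    return labels, centers


if __name__ == "__main__":
    attribute = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
    labels, centers = kmeans_discretize(attribute, k=3)
    print(labels)   # discrete interval index for each original value
    print(centers)  # representative value of each interval
```

The returned labels can then be treated as an ordinary categorical attribute when a splitting measure is evaluated. In practice `k` would have to be chosen per attribute, for example by inspecting the data, which this sketch leaves to the caller.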