Evaluation metrics for imbalanced datasets
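
The notes do not list specific metrics here. As a hedged sketch, the function below computes a few that are commonly used when classes are imbalanced (precision, recall, F1, balanced accuracy) next to plain accuracy. The name `imbalance_metrics` and the sample labels are illustrative assumptions, not from the notes.

```python
def imbalance_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for a binary problem (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Made-up data: 2 positives, 8 negatives, and a classifier that always
# predicts the majority class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
metrics = imbalance_metrics(y_true, y_pred)
print(metrics)
```

On this made-up data accuracy is 0.8 even though no minority sample is found, while recall is 0.0 and balanced accuracy is 0.5, which is why accuracy alone is a poor guide on imbalanced data.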

Why do feature selection? To reduce dimensionality.
Entropy measures the uncertainty of a system.
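
A minimal sketch of the definition, assuming Shannon entropy in bits, H = sum_i p_i * log2(1 / p_i), with the probabilities estimated from label frequencies. The function name `entropy` and the sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) is the contribution of each observed label.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(entropy(["a", "a", "b", "b"]))  # 1.0: two equally likely outcomes
print(entropy(["a", "a", "a", "a"]))  # 0.0: no uncertainty at all
```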

Branch and bound

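PCA (Principal Component Analysis): PCA does not take class labels into account.
LDA (Linear Discriminant Analysis): the goal of LDA is to reduce dimensionality while preserving as much of the separation between classes as possible.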


Maximize the distance between classes and minimize the scatter within each class.
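
This objective is usually written as the Fisher criterion. The formulation below is the standard two-class form, given as a sketch. The symbols (w, m_0, m_1, S_B, S_W, D_c) are notation introduced here, not taken from the notes.

```latex
% Two classes with means m_0, m_1 and sample sets D_0, D_1.
S_B = (m_1 - m_0)(m_1 - m_0)^{\top}, \qquad
S_W = \sum_{c \in \{0,1\}} \sum_{x \in D_c} (x - m_c)(x - m_c)^{\top}

% Fisher criterion: between-class scatter over within-class scatter
% along the projection direction w.
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}, \qquad
w^{*} \propto S_W^{-1} (m_1 - m_0)
```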

PCA worked example
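
The notes keep only the heading for this example. As a hedged stand-in, here is a minimal NumPy sketch of the usual computation: center the data, form the covariance matrix, take its eigendecomposition, and project onto the leading eigenvectors. The function name `pca` and the 2-D data are made up for illustration.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # features are columns
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # columns are directions
    return X_centered @ components, eigvals[order]


# Made-up 2-D sample with strongly correlated features.
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])
Z, variances = pca(X, n_components=1)
print(Z.shape)      # one coordinate per sample
print(variances)    # variance captured by the kept component
```

Class labels never enter the computation, which matches the remark that PCA ignores classification.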

LDA worked example
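
The notes keep only the heading for this example as well. A minimal two-class sketch, assuming the standard Fisher solution w ∝ S_W^{-1}(m_1 - m_0). The function name `fisher_lda_direction` and the sample points are illustrative assumptions.

```python
import numpy as np


def fisher_lda_direction(X0, X1):
    """Unit projection direction for two-class Fisher LDA (sketch)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Solve S_w w = (m1 - m0) instead of inverting S_w explicitly.
    w = np.linalg.solve(S_w, m1 - m0)
    return w / np.linalg.norm(w)


# Made-up 2-D samples for two classes.
X0 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X1 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

w = fisher_lda_direction(X0, X1)
print(w)         # projection direction
print(X0 @ w)    # 1-D projections of class 0
print(X1 @ w)    # 1-D projections of class 1
```

On this data the projected classes do not overlap: every class-0 projection is smaller than every class-1 projection.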


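Limitations of LDA: for a C-class problem, LDA can reduce the data to at most C-1 dimensions, because the between-class scatter matrix has rank at most C-1. It also fails to separate two groups well when their means are close.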
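
The C-1 limit can be checked numerically: the between-class scatter matrix is a sum of C rank-one terms whose weighted mean offsets sum to zero, so its rank cannot exceed C-1. A small sketch on random data, where the helper name `between_class_scatter` and the data are assumptions for illustration:

```python
import numpy as np


def between_class_scatter(X, y):
    """S_B = sum_c n_c (m_c - m)(m_c - m)^T (illustrative helper)."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        diff = (Xc.mean(axis=0) - overall_mean).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)
    return S_b


# Made-up data: 3 classes, 30 samples each, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = np.repeat([0, 1, 2], 30)

rank = np.linalg.matrix_rank(between_class_scatter(X, y))
print(rank)  # at most C - 1 = 2, although the data has 5 features
```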
Paper reading
- Data cleaning
M. A. Hernandez and S. J. Stolfo, “Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem,” Data Mining and Knowledge Discovery, vol. 2, pp. 9–37, 1998.
- Missing data
A. Donders, G. van der Heijden, T. Stijnen, and K. Moons, “Review: A Gentle Introduction to Imputation of Missing Values,” Journal of Clinical Epidemiology, vol. 59, pp. 1087–1091, 2006.
- Imbalanced datasets
N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol. 6, pp. 429–449, 2002.
N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-Sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
- Data visualization
D. Keim, “Information Visualization and Visual Data Mining,” IEEE Transactions on Visualization and Computer Graphics, vol. 8, pp. 1–8, 2002.