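Convolutional autoencoder deep clustering for preserving the local structure of data - Henan Polytechnic University Publishing Center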



Date: 2025-06-16


LI S Y, XING Y M, XU R. Convolutional autoencoder deep clustering for preserving the local structure of data[J]. Journal of Henan Polytechnic University (Natural Science), doi:10.16186/j.cnki.1673-9787.2025010018.

Convolutional autoencoder deep clustering for preserving the local structure of data (online first)

LI Shunyong¹,², XING Yuman¹, XU Rui¹

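(1. School of Mathematics and Statistics, Shanxi University, Taiyuan 030006, Shanxi, China; 2. Key Laboratory of Complex Systems and Data Science of Ministry of Education, Shanxi University, Taiyuan 030006, Shanxi, China)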

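Abstract: Objectives: Deep clustering algorithms are increasingly outperforming traditional clustering methods. This study proposes a new deep clustering algorithm that handles high-dimensional data more efficiently and explores its underlying manifold structure in order to improve clustering quality. Methods: A convolutional autoencoder deep clustering algorithm for preserving the local structure of data (CADC) is proposed. The method uses a convolutional autoencoder to learn a low-dimensional embedding of the original high-dimensional data while preserving the local structure of the data, and then performs manifold learning on that low-dimensional embedding. On this basis, a Gaussian mixture model (GMM) is used to cluster the data on the underlying manifold. Unlike other deep clustering algorithms, CADC does not require additional training of a clustering network, which reduces the complexity of the algorithm. The key parameters of the uniform manifold approximation and projection (UMAP) algorithm, n_neighbors and min_dist, are also analyzed: their influence on clustering performance is examined, and their best values are determined experimentally. Results: Experiments were carried out on four real-world datasets: MNIST, Fashion-MNIST, USPS, and Pendigits. The results show that CADC significantly outperforms traditional clustering algorithms and some existing deep clustering algorithms. The parameter analysis shows that the n_neighbors and min_dist settings used in UMAP dimensionality reduction strongly affect clustering performance, and that setting n_neighbors to 20 and min_dist to 0 gives relatively good clustering results. Conclusions: By combining a convolutional autoencoder with local manifold learning, the proposed CADC algorithm effectively improves clustering performance without additional training of a clustering network. The algorithm offers a new methodological option for deep clustering and has potentially broad application value in clustering complex high-dimensional data.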

Key words: deep clustering; convolutional autoencoder; UMAP manifold; GMM clustering; feature extraction; dimensionality reduction

doi:10.16186/j.cnki.1673-9787.2025010018

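Funding: National Natural Science Foundation of China (82274360); Shanxi Province Basic Research Program (202303021221054); Shanxi Province Research Funding Project for Returned Overseas Scholars (2024-002)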

Received: 2025-01-12

Revised: 2025-03-25

Published online: 2025-06-16

Convolutional autoencoder deep clustering for preserving the local structure of data

LI Shunyong¹,², XING Yuman¹, XU Rui¹

(1. School of Mathematics and Statistics, Shanxi University, Taiyuan 030006, Shanxi, China; 2. Key Laboratory of Complex Systems and Data Science of Ministry of Education, Shanxi University, Taiyuan 030006, Shanxi, China)

Abstract: Objectives: Deep clustering algorithms are increasingly surpassing traditional clustering methods in clustering performance. This study aims to propose a novel deep clustering algorithm that efficiently handles high-dimensional data and explores its underlying manifold structure, thereby enhancing clustering quality. Methods: A convolutional autoencoder deep clustering algorithm for preserving the local structure of data (CADC) was proposed in this paper. A convolutional autoencoder was used to learn low-dimensional embedding representations of the original high-dimensional data, with the local structure of the data being preserved. Manifold learning was further performed on the low-dimensional embeddings. Based on this, a Gaussian mixture model (GMM) was employed to cluster the data on the underlying manifold. Unlike other deep clustering algorithms, CADC did not require additional training of the clustering network, which reduced the complexity of the algorithm. Additionally, the key parameters (n_neighbors and min_dist) of the UMAP dimensionality reduction method were analyzed, their impact on clustering performance was investigated, and their optimal values were determined through experiments. Results: Experiments were conducted on four real-world datasets: MNIST, Fashion-MNIST, USPS, and Pendigits. The CADC algorithm significantly outperformed traditional clustering algorithms and some existing deep clustering algorithms. Parameter analysis revealed that the settings of the n_neighbors and min_dist parameters in UMAP had a significant impact on clustering performance. Specifically, the most favorable clustering results were obtained when n_neighbors was set to 20 and min_dist was set to 0. Conclusions: The CADC algorithm, which utilizes convolutional autoencoders and local manifold learning, can effectively improve clustering performance without additional training of the clustering network. This algorithm provides a new methodological option for deep clustering and holds great potential for applications in clustering complex high-dimensional data.

Key words: deep clustering; convolutional auto-encoder; UMAP manifold; GMM clustering; feature extraction; dimensionality reduction
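To make the final stage of the pipeline described in the abstract concrete, the following is a minimal sketch of GMM clustering on low-dimensional points, fitted by expectation-maximization. It is not the authors' implementation: the function name `fit_diag_gmm`, the diagonal covariance, the farthest-point initialization, and the synthetic two-dimensional data standing in for the manifold embeddings are all assumptions made for illustration. The convolutional autoencoder and UMAP stages are not reproduced here.

```python
import math
import random


def fit_diag_gmm(points, k, n_iter=30):
    """Fit a k-component Gaussian mixture with diagonal covariances by EM.

    Hypothetical helper for illustration only. `points` is a list of
    equal-length tuples, e.g. low-dimensional embeddings.
    Returns (weights, means, variances, labels).
    """
    n, d = len(points), len(points[0])

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from the means chosen so far.
    means = [list(points[0])]
    while len(means) < k:
        far = max(points, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(far))
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k

    def log_density(x, mean, var):
        # Log density of a diagonal-covariance Gaussian at x.
        return sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )

    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point,
        # normalized in log space to avoid overflow.
        resp = []
        for x in points:
            logs = [
                math.log(weights[j]) + log_density(x, means[j], variances[j])
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            for t in range(d):
                mu = sum(r[j] * x[t] for r, x in zip(resp, points)) / nj
                var = sum(r[j] * (x[t] - mu) ** 2 for r, x in zip(resp, points)) / nj
                means[j][t] = mu
                variances[j][t] = var + 1e-6  # small floor keeps variances positive
    # Hard assignment: each point goes to its most responsible component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Synthetic stand-in for 2-D embeddings: two well-separated blobs.
rng = random.Random(0)
points = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(30)]
points += [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(30)]
weights, means, variances, labels = fit_diag_gmm(points, k=2)
```

Because the mixture is fitted directly on the embedded points, this step involves no clustering network, which matches the abstract's remark that CADC needs no additional training for clustering. In practice a library GMM with full covariances would likely be used on the UMAP output instead of this hand-written version.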
