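Abstract: With the rapid development of network and sensor technologies, a large number of smart terminals capable of generating and transmitting data have emerged. Every day they produce enormous volumes of data at an astonishing rate, and this mass of data poses difficulties and challenges for data storage and data processing. Traditional data mining techniques still cannot cope with these new challenges.
The core problem of the big data era is to use new data mining techniques to pick out the pearls in the ocean of data. Relying on cloud computing technology has become the mainstream research direction. Hadoop is an open-source software project from Apache. It provides the basic architecture of a cloud computing software platform, including HDFS and MapReduce, on which databases, data warehouses, and a range of other components can be deployed, and it has become one of the most popular platforms for data mining.
This thesis focuses on the core architecture and operating mechanisms of the HDFS, MapReduce, and HBase components of the Hadoop framework, and builds a highly robust Hadoop cluster. By studying the characteristics and limitations of traditional classification and clustering algorithms, it proposes a design for a cloud-computing-based data mining system, and optimizes, verifies, and analyzes the algorithms. Experiments show that the improved data mining system can meet the big-data-era challenges of data storage, data transmission, and data mining, and that it is extensible, customizable, and highly robust.
Thesis keywords: Hadoop; Naive Bayes; K-means; classification; clustering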
Research and Implementation of Hadoop-Based Data Mining Algorithms
Abstract: With the rapid development of network and sensor technologies, many smart terminals that can generate and transmit data have appeared. They produce massive amounts of data at an astonishing rate every day, and this data brings difficulties and challenges for data storage and data processing. Traditional data mining techniques still cannot cope with these new challenges.
The core problem in the age of big data is to pick out the valuable items from massive data with the help of new data mining techniques. Relying on cloud computing has become the mainstream research direction. Hadoop is an open-source software project from Apache. It provides the basic architecture of a cloud computing software platform, including HDFS and MapReduce, on which components such as databases and data warehouses can be deployed. Hadoop has become one of the most popular platforms for data mining.
This thesis focuses on the core architecture and operating mechanisms of the HDFS, MapReduce, and HBase components of the Hadoop framework, and builds a highly robust Hadoop cluster. By studying the features and limitations of traditional classification and clustering algorithms, it puts forward a design for a cloud-computing-based data mining system, and optimizes, verifies, and analyzes the algorithms. Experiments show that the improved data mining system can meet the challenges of data storage, data transmission, and data mining in the age of big data, and that it is extensible, customizable, and highly robust.
Keywords: Hadoop; K-means; Naive Bayes; classification; clustering
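Contents
Abstract (Chinese)
Abstract (English)
Contents
1 Introduction
1.1 Background and significance of the topic
1.2 Current state of cloud computing research
1.3 Main work of the thesis
2 Overview of Hadoop and data mining
2.1 The Hadoop cloud computing platform
2.1.1 Architecture of the Hadoop ecosystem
2.1.2 HDFS and MapReduce
2.1.3 Data mining with Hive
2.1.4 The HBase distributed database
2.1.5 Parallel data migration with Sqoop
2.1.6 Chapter summary
2.2 Overview of data mining
2.2.1 The concept of data mining
2.2.2 The data mining process
2.3 Overview of classification algorithms
2.3.1 Decision trees
2.3.2 Bayesian classification
2.3.3 Artificial neural networks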