Research on Massive Text Classification Algorithms Based on Hadoop


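Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this data consists mainly of structured or semi-structured text files. Quickly and effectively finding the data a user actually needs in such volumes, and improving retrieval accuracy, has therefore become a major challenge. Retrieving text data first requires classifying it accurately and efficiently, which makes text classification the main difficulty in text data processing. The aim of this thesis is therefore to study efficient classification algorithms for massive text collections on existing hardware.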

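This thesis uses Hadoop to study the storage and classification of massive text collections. First, it designs and implements a distributed, highly reliable, highly available data storage module to address the difficulty of storing massive text. Next, it proposes a distributed parallel Chinese word segmentation algorithm based on MapReduce that changes how MapReduce's InputFormat reads data, which addresses Hadoop's low efficiency on small files. Compared with Chinese word segmentation under the default MapReduce setup, segmentation efficiency improves by a factor of 52, which makes segmenting massive text practical. Finally, it studies web text classification on the MapReduce distributed computing framework, builds a Bayesian text classification model, and validates it experimentally: accuracy and recall on previously unseen text reach 97%.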

Thesis keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation
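The abstract does not say how the InputFormat was changed. One common way to reduce the small-file overhead is to pack many small files into a single input split, so that one map task handles a batch of files instead of one file each (Hadoop's own CombineFileInputFormat follows a similar idea). The sketch below illustrates only that packing step in plain Python; the function name `pack_small_files`, the file names, and the size limit are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List


def pack_small_files(file_sizes: Dict[str, int], max_split_bytes: int) -> List[List[str]]:
    """Greedily group files into splits whose total size stays within max_split_bytes.

    A file larger than max_split_bytes ends up in a split of its own.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    # Sort by name so the grouping is deterministic.
    for name, size in sorted(file_sizes.items()):
        # Close the current split if adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Hypothetical file sizes in bytes.
sizes = {"a.txt": 40, "b.txt": 30, "c.txt": 50, "d.txt": 10}
splits = pack_small_files(sizes, max_split_bytes=100)
print(splits)
```

With four files and a 100-byte limit, the job would launch two map tasks rather than four. A real Hadoop implementation would also have to consider block locality, which this sketch ignores.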

Research on Massive Text Classification Algorithms Based on Hadoop

Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this data consists mainly of structured or semi-structured text documents. Efficiently finding the data that users need within such massive volumes, and improving search accuracy, has therefore become a great challenge. Finding text data requires accurate and efficient classification of that data, so text classification becomes the main difficulty of text data processing. The purpose of this thesis is therefore to study efficient classification algorithms for massive text on existing hardware.

This thesis studies the storage and classification of massive text based on Hadoop. First, we design and implement a distributed, highly reliable, and highly available data storage module, which addresses the problem of storing massive text. Then, we propose a distributed parallel Chinese word segmentation algorithm based on MapReduce that improves the way the MapReduce InputFormat reads data, in order to address Hadoop's low efficiency with small files. Compared with default MapReduce Chinese word segmentation, it improves segmentation efficiency by a factor of 52 and makes the segmentation of massive text feasible. Finally, we study a web text classification algorithm for massive data on the MapReduce distributed computing framework, build the classification model, and verify it experimentally. The accuracy of the proposed algorithm on unknown text reaches 97%.

Keywords: HDFS; Hadoop; MapReduce; Text categorization; Chinese word segmentation
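To make the classification step more concrete, here is a minimal sketch of a multinomial naive Bayes classifier whose training is written as a map step (emit one count per label and token) and a reduce step (sum the counts), mirroring how such counting is usually expressed in MapReduce. It runs in a single Python process and assumes the text has already been segmented into tokens. The choice of the multinomial variant, the Laplace smoothing, the function names, and the toy data are all assumptions for illustration; the abstract only states that a Bayesian model was built.

```python
import math
from collections import Counter


def map_doc(label, tokens):
    """Map step: emit ((label, token), 1) for every token in one document."""
    for token in tokens:
        yield (label, token), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (label, token) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def train(docs):
    """Build the model from (label, tokens) pairs."""
    pairs = [kv for label, tokens in docs for kv in map_doc(label, tokens)]
    token_counts = reduce_counts(pairs)
    doc_counts = Counter(label for label, _ in docs)
    vocab = {token for _, token in token_counts}
    return token_counts, doc_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    token_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    label_totals = Counter()
    for (label, _), count in token_counts.items():
        label_totals[label] += count
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokens:
            numerator = token_counts[(label, token)] + 1
            denominator = label_totals[label] + len(vocab)
            score += math.log(numerator / denominator)  # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, already-segmented training documents (hypothetical).
docs = [
    ("sports", ["ball", "team", "win"]),
    ("sports", ["team", "score"]),
    ("tech", ["code", "data", "hadoop"]),
    ("tech", ["data", "cluster"]),
]
model = train(docs)
print(classify(["team", "win"], model))
print(classify(["hadoop", "data"], model))
```

In an actual Hadoop job the map and reduce functions would run on different nodes and the counts would be written to HDFS; only the counting logic carries over from this sketch.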

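1 Introduction

1.1 Research background

1.2 Research status in China and abroad

1.2.1 Research status of big data in China and abroad

1.2.2 Research status of text classification

1.3 Main work

1.4 Thesis organization

2 Research on the Hadoop big data technology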
