Research on Massive Text Classification Algorithms Based on Hadoop
Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and this data consists mainly of structured or semi-structured text files. Quickly and accurately finding the data a user needs within such massive collections has therefore become a major challenge. Retrieving text data first requires classifying it accurately and efficiently, which makes text classification a central difficulty in text data processing. The aim of this thesis is therefore to study efficient classification algorithms for massive text collections on existing hardware.
This thesis studies the storage and classification of massive text on Hadoop. First, a distributed, highly reliable, and highly available data storage module is designed and implemented to address the difficulty of storing massive text. Next, a distributed parallel Chinese word segmentation algorithm based on MapReduce is proposed; it modifies the way MapReduce's InputFormat reads data in order to address Hadoop's inefficiency when processing small files. Compared with Chinese word segmentation under default MapReduce, it improves segmentation efficiency by a factor of 52 and makes segmentation of massive text practical. Finally, a classification algorithm for massive web text is developed on the MapReduce distributed computing framework, a Bayesian text classification model is built, and experiments are carried out to validate it; on previously unseen text, the algorithm reaches accuracy and recall of up to 97%.
Keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation
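As an illustration of the kind of Bayesian text classifier the abstract refers to, the following is a minimal in-memory sketch, not the thesis's implementation. It assumes documents are already segmented into tokens, imitates the map and reduce steps with plain Python functions instead of Hadoop jobs, and uses a multinomial model with add-one smoothing. All function names (map_phase, reduce_phase, train, classify) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_phase(docs):
    """Map step (simulated): emit a count of 1 per document and per token."""
    for label, tokens in docs:
        yield (label, None), 1      # None marks a per-class document count
        for tok in tokens:
            yield (label, tok), 1   # per-class token occurrence


def reduce_phase(pairs):
    """Reduce step (simulated): sum the counts emitted for each key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def train(docs):
    """Build class priors and per-class token counts from labeled documents."""
    totals = reduce_phase(map_phase(docs))
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for (label, tok), n in totals.items():
        if tok is None:
            doc_counts[label] = n
        else:
            word_counts[label][tok] = n
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing over the vocabulary
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (assumed, pre-segmented tokens)
training_docs = [
    ("sports", ["football", "match", "goal"]),
    ("sports", ["match", "team"]),
    ("tech", ["hadoop", "cluster"]),
    ("tech", ["mapreduce", "hadoop", "job"]),
]
model = train(training_docs)
predicted = classify(model, ["hadoop", "job"])
```

In a real Hadoop job the counting in map_phase and reduce_phase would run as distributed mapper and reducer tasks over HDFS input; here they are ordinary generators and a Counter so the sketch stays self-contained.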
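Contents
1 Introduction
1.1 Research Background
1.2.1 Current State of Big Data Research in China and Abroad
1.2.2 Current State of Text Classification Research
1.3 Main Work
1.4 Thesis Organization
2 A Study of Hadoop as a Big Data Technology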