Research on Automatic Extraction and Classification of Website Content for News Synchronization

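Abstract: This thesis studies the automatic extraction and classification of website content for news synchronization. The technologies involved include databases, Chinese word segmentation, automatic web page crawling, and information classification. After explaining the relevant technologies and concepts, the thesis builds a piece of software that simulates the automatic extraction and classification of news page content. In the information age, the volume of web content is growing explosively, and traditional techniques for gathering and classifying information are inefficient. The work presented here suggests that automatic extraction and classification of website content is both efficient and accurate, which makes it well suited to today's very large volume of online information.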

Keywords: automatic extraction; automatic classification; Chinese word segmentation

Graduation Thesis: Foreign-Language Abstract

Title: Research on Automatic Extraction and Classification of Website Content for News Synchronization

Abstract: This thesis focuses on the automatic extraction and classification of website content for news synchronization. It draws on database technology, Chinese word segmentation, automatic web page crawling, and information classification. Building on an explanation of these technologies and concepts, it implements software that simulates the automatic extraction and classification of news page content. In the information age, the amount of web content is growing explosively, and traditional methods of capturing and classifying information are inefficient. This study indicates that automatic extraction and classification of website content offers high efficiency and high accuracy, and is therefore well suited to the vast amount of information on today's networks.

Keywords: automatic extraction; automatic classification; Chinese word segmentation

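1 Overview

1.1 Research Background

1.2 Research Significance

2 Related Technologies

2.1 Automatic Web Content Crawling

2.2 Chinese Word Segmentation

2.3 Automatic Classification

3 Requirements Analysis

3.1 Basic Functions

3.2 Flowchart

4 System Design

4.1 Database Design

4.2 Crawling Algorithm

4.3 Classification Algorithm

5 System Implementation

5.1 System User Interface

5.2 Database Interface

5.3 System Operation

Conclusion

Acknowledgements

References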

1 Overview

1.1 Research Background

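Research on information extraction grew out of practical needs. Information extraction began as a topic within natural language processing, concerned with extracting grammatically and semantically relevant structured information from natural-language documents: for example, identifying the main grammatical components of a sentence such as verbs and nouns, or recognizing named entities and the relations between them. With the spread and rapid growth of the World Wide Web, the number of websites worldwide has long since passed one hundred million, and the number of web pages runs to hundreds of billions. Information on the web takes many forms and varies widely in quality, and online news is one of its main and most heavily used categories. Ever since online news appeared, reading it has been a routine part of how people use the internet. Because of its variety of content and the speed with which it is published, advantages that earlier media cannot match, online news has become the way people keep up with the latest developments. No single website can carry all the news or meet the varied needs of every user. Fortunately, the openness of the internet makes building a website far cheaper than running a traditional media outlet, and a vast number of news sites have emerged as a result. These sites publish new information continuously, producing hundreds of millions of news pages.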