Development of a Web Crawler Program for Internet Information Collection


Abstract: The Internet information collection program provides a download mechanism based on custom templates. It locates the required information precisely within a page's HTML tree, parses the HTML file of the specified page into an XML tree, and uses flexible web crawler techniques to download the content at the specified location, so that information is extracted accurately and efficiently. The templates are defined by the user. The program can also track the relevant sites or pages automatically on a regular schedule and carry out comparison, filtering, extraction and download, and structured storage in a database. In this way it performs targeted collection of Internet information and greatly improves the efficiency of obtaining information from the web. The system can collect different kinds of Internet information by category according to user needs; it is full-featured and suitable for everyday use by non-specialists.

Keywords: targeted collection; template; parsing; XML structure

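1. Introduction

1.1 Research background

1.2 Research status at home and abroad

1.3 Research content of this thesis

2. Related concepts and technologies

2.1 Introduction to Spider technology

2.2 Introduction to regular expressions

2.3 Introduction to XML technology

3. Requirements analysis and detailed design of the Internet information collection system

3.1 Feasibility analysis

3.2 Requirements analysis

3.3 Detailed design

4. Implementation analysis of the Internet information collection system

4.1 Design and implementation of the settings module

4.2 Design and implementation of the collection module

4.3 Implementation of the database mapping

4.4 System testing

Conclusion

Acknowledgements

References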

1. Introduction

1.1 Research background

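With the rapid development of networks, the Internet has become a carrier of massive amounts of information. It has now spread to every industry and every age group. The resources that users can share are growing in both quantity and variety. The Internet has penetrated every part of people's lives, study, and work, offers an ever wider range of resources, and has brought great convenience. Its service models and its channels of distribution are both becoming more diverse: traditional and emerging services such as portal sites, forums, and blogs exist side by side, and multimedia forms of publishing such as Internet television and Internet radio are also growing rapidly. How to extract and use this huge volume of information effectively has become a new challenge.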

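Information on the Internet is abundant, but little of it is useful, it is distributed without order, and it changes constantly. The heterogeneity of information sources is the central reason why web information is hard to collect, organize, and reuse. In recent years there has been a great deal of research on the use of Web information, most of it focused on search engine technology, which aims to use advanced systems and artificial intelligence techniques to gather, discover, understand, and organize information on the Internet according to some strategy and then offer users retrieval services for web pages, images, software, and the like.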