1 Purpose of the Research
With the rapid development of computers and the Internet, more and more information is generated and stored on the Internet. How can accurate and useful information be found? The search engine is an effective tool for obtaining useful information, and it has become the most popular on-line service after e-mail.
The working process of a general search engine can be described as follows. First, the network robot, also called a spider, traverses the Internet and collects the URLs of web pages together with the information contained in those pages; the spider stores this information in an index database. The search utility then builds a result page listing links to the URLs in the index that match the visitor's search keywords.
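As an illustration only (it is not part of the original paper), the following Python sketch shows this crawl-index-query pipeline with a toy in-memory inverted index; the URLs and page texts are hypothetical.

# A minimal sketch of the crawl -> index -> query pipeline described above,
# using a simple in-memory inverted index over already-fetched pages.
from collections import defaultdict

# hypothetical pages the spider has already fetched: URL -> page text
crawled_pages = {
    "http://example.com/a": "wholesale electronics price list",
    "http://example.com/b": "travel photos and diary",
}

# build the inverted index: term -> set of URLs whose text contains the term
index = defaultdict(set)
for url, text in crawled_pages.items():
    for term in text.lower().split():
        index[term].add(url)

def search(keywords):
    """Return the URLs whose indexed text matches every query keyword."""
    results = None
    for term in keywords.lower().split():
        urls = index.get(term, set())
        results = urls if results is None else results & urls
    return sorted(results or [])

print(search("electronics price"))   # -> ['http://example.com/a']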
However, the result pages contain a great deal of irrelevant information, so people pay more and more attention to vertical search within a specific domain.
Business information is only a small part of the information on the network. To search for business information, it would take far more time and effort to download everything a general spider program finds and then judge whether or not it is business information. Therefore, the study of an efficient commerce-oriented spider program is necessary and of real value. This paper introduces a method for implementing a commerce-oriented search engine.
2 Realization Process
The network robot always starts from one web page or several pages and then traverses all the pages it can reach. First, the spider analyzes the HTML code of a
web page and extracts the hyperlinks it contains; it then visits all the linked pages using either a recursive or a non-recursive algorithm. Recursion is an algorithm in which a routine calls itself. It is simple, but it does not work well with multi-threading, so it cannot be adopted in an efficient spider program. With the non-recursive method, the spider program puts the hyperlinks it finds into a waiting
queue instead of following them immediately. When the spider program has finished scanning the current web page, it takes the next URL from the queue according to the scheduling algorithm.
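A minimal Python sketch of this non-recursive, queue-based crawling strategy is given below; the helper class, the page limit, and the timeout are illustrative assumptions rather than the paper's actual implementation.

# A sketch of queue-based (non-recursive) crawling: discovered hyperlinks are
# put into a waiting queue instead of being followed by a recursive call.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)        # the waiting queue of URLs to visit
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()       # take the next URL in FIFO order
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                # skip pages that cannot be fetched
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            # a commerce-relevance check (described below) would filter links here
            queue.append(urljoin(url, link))
    return visited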
Before a hyperlink is added to the queue, the commerce-oriented spider judges whether or not it is related to commerce. This is achieved as follows:
1. Collect some typical commerce-related documents and convert them into text files to serve as the initial training texts;
2. Use LSA theory to build an entry-text matrix from the training texts. In the LSA model, a text set can be denoted as an r×m entry matrix D, where m is the number of texts in the set and r is the number of distinct entries in the set. That is, each distinct entry corresponds to a row of the matrix D, and each text file corresponds to a column of the matrix D. Writing D = [d_ij]_{r×m}, d_ij is the weight of entry i in text j. Many formulas exist for calculating this weight in the traditional vector representation; a commonly used weighting formula follows:
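The source text breaks off before stating its weighting formula. As an assumption only, the Python sketch below uses the familiar TF-IDF weighting to build the r×m entry-text matrix D described above; the function name and the sample texts are hypothetical.

# A sketch of building the entry-text matrix D = [d_ij] (r rows of entries,
# m columns of texts), assuming TF-IDF weights since the exact formula is
# not reproduced in this excerpt.
import math
from collections import Counter

def build_entry_text_matrix(texts):
    """texts: list of m token lists; returns (entries, D) with D of size r x m."""
    entries = sorted({term for text in texts for term in text})   # the r distinct entries
    m = len(texts)
    counts = [Counter(text) for text in texts]
    # document frequency: number of texts that contain each entry
    df = {e: sum(1 for c in counts if e in c) for e in entries}
    D = []
    for e in entries:                       # one row per entry
        row = []
        for j in range(m):                  # one column per text
            tf = counts[j][e]               # term frequency of entry e in text j
            idf = math.log(m / df[e])       # inverse document frequency
            row.append(tf * idf)            # d_ij = tf * idf
        D.append(row)
    return entries, D

# example with three tiny hypothetical training texts
entries, D = build_entry_text_matrix([
    ["price", "order", "invoice"],
    ["order", "shipment"],
    ["holiday", "photos"],
])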