Research Issues in Web Data Mining
SANJAY MADRIA, SOURAV S. BHOWMICK, W.-K. NG, E. P. LIM
Center for Advanced Information Systems, School of Applied Science
Nanyang Technological University, Singapore 639798
{askumar, p517026, awkng, aseplim}@ntu.edu.sg
Abstract. In this paper, we present an overview of research issues in web mining. We discuss
mining with respect to web data, referred to here as web data mining. In particular, our focus is on
web data mining research in the context of our web warehousing project called WHOWEDA
(Warehouse of Web Data). We have categorized web data mining into three areas: web content
mining, web structure mining and web usage mining. We have highlighted and discussed various
research issues involved in each of these web data mining categories. We believe that web data
mining will be a topic of exploratory research in the near future.
1 Introduction
The advent of the World Wide Web has caused a dramatic increase in the usage of the Internet. The World
Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost.
Information on the WWW is important not only to individual users but also to business organizations,
especially where critical decision making is concerned. Most users obtain WWW information using a
combination of search engines and browsers; however, these two types of retrieval mechanisms do not
necessarily address all of a user’s information needs. This is particularly true in the case of business
organizations, which currently lack suitable tools to systematically harness strategic information from the
web and analyze these data to discover useful knowledge to support decision making. A recent study
provides a comprehensive and comparative evaluation of the most popular search engines [1]. A more
recent survey of web query processing has appeared in [23].
The resulting growth in on-line information, combined with the almost unstructured nature of web data,
necessitates the development of powerful yet computationally efficient web data mining tools. Web data
mining can be defined as the discovery and analysis of useful information from WWW data. The web
involves three types of data: data on the WWW, web log data regarding the users who browsed the
web pages, and web structure data. Thus, WWW data mining should focus on three issues: web
structure mining, web content mining [8] and web usage mining [2,10,13]. Web structure mining involves
mining the structures and links of web documents. In [24], some insight is given on mining structural
information on the web. Our initial study [5] has shown that web structure mining is very useful in
generating information such as visible web documents, luminous web documents and luminous paths, that is, paths
common to most of the results returned. In this paper, we discuss some applications in web data
mining and E-commerce where we can use these types of knowledge. Web content mining describes the
automatic search of information resources available on-line. Web usage mining covers the data from
server access logs, user registrations or profiles, user sessions or transactions, etc. A survey of some of the
emerging tools and techniques for web usage mining is given in [2]. In our discussion here, we focus on
the research issues in web data mining with respect to the web warehousing project called WHOWEDA
(Warehouse of Web Data).
The key objective of WHOWEDA at the Centre for Advanced Information Systems at Nanyang
Technological University, Singapore, is to design and implement a web warehouse that materializes and
manages useful information from the web to support strategic decision making. We are building a web
warehouse [7] using the database approach to manage strategic information
coupled from the web, which may also inter-operate with conventional data warehouses. One of the
important areas of our work involves the development of techniques for mining useful information from
the web. We will integrate WHOWEDA with intelligent tools for information retrieval and extend
data mining techniques to provide a higher level of data organization for the unstructured data available on
the web.
With respect to our web data mining approach, we argue that extracting information from a very
small subset of all HTML web pages is also an instance of web data mining. In WHOWEDA, we focus on
mining a subset of web pages stored in one or more web tables because we believe that, due to the
complexity and vastness of the web, mining information from a subset of the web stored in web tables is
a more feasible option. Our web warehousing approach allows us to do this effectively, as we materialize
only the results returned in response to a user's query graph.
2 WHOWEDA
In WHOWEDA, we introduced our web data model. It consists of a hierarchy of web objects. The
fundamental objects are Nodes and Links, where nodes correspond to HTML text documents and links
correspond to hyperlinks interconnecting the documents in the WWW. These objects consist of a set of
attributes as follows: Node = [url, title, format, size, date, text] and Link = [source-url, target-url, label,
link-type]. In our web warehouse, the Web Information Coupling System (WICS) [9] is a database system for
managing and manipulating coupled information extracted from the web. We have defined a set of
coupling operators to manipulate the web tables and correlate additional useful and related information
[9].
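The node and link attributes listed above can be pictured as simple records. The following Python sketch is only an illustration and not the WICS implementation: the class names, the field types, the `WebTuple` container, the `dangling_links` helper and the example.org URLs are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """An HTML document with the attributes named in the text."""
    url: str
    title: str
    format: str
    size: int   # assumed to be a byte count
    date: str   # assumed to be a date string
    text: str


@dataclass(frozen=True)
class Link:
    """A hyperlink from one document to another."""
    source_url: str
    target_url: str
    label: str
    link_type: str  # the value "local" used below is an assumption


@dataclass(frozen=True)
class WebTuple:
    """Hypothetical container: a directed graph of nodes and links."""
    nodes: Tuple[Node, ...]
    links: Tuple[Link, ...]

    def dangling_links(self) -> Tuple[Link, ...]:
        """Return links whose endpoints are not both nodes of this tuple."""
        urls = {node.url for node in self.nodes}
        return tuple(
            link for link in self.links
            if link.source_url not in urls or link.target_url not in urls
        )


# Hypothetical documents on a placeholder host.
faculty = Node("http://example.org/faculty.html", "Faculty", "html", 0, "", "")
member = Node("http://example.org/member.html", "Member", "html", 0, "",
              "AI or database")
link = Link(faculty.url, member.url, "", "local")
web_tuple = WebTuple((faculty, member), (link,))
```

Here `web_tuple.dangling_links()` returns an empty tuple, since the only link connects two documents that belong to the tuple.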
We materialize web data as web tuples stored in web tables. Web tuples, representing directed
connected graphs, are composed of web objects (Nodes and Links). We associate with each web table a
web schema. A web schema contains the meta-data that binds
a set of web tuples to a web table in the form of connectivities and predicates defined on node and link
variables. Connectivities represent structural properties of web tuples by describing possible paths
between node variables. Predicates, on the other hand, specify the additional conditions that must be
satisfied by each tuple to be included in the web table. In WICS, a user expresses a web query in the form
of a query graph consisting of some nodes and links representing web documents and hyperlinks in those
documents, respectively. Each of these nodes and links can have some keywords imposed on it to
represent those web documents that contain the given keywords in the documents and/or hyperlinks.
When the query graph is posted over the WWW, a set of web tuples, each satisfying the query graph, is
harnessed from the WWW. Thus, the web schema of a table resembles the query graph used to derive the
web tuples stored in the web table. The results are returned as web tuples. Note that some nodes and
links in the query graph may not have keywords imposed; these are called unbound nodes and links,
respectively.
Consider a query to find all data mining related publications by the computer science faculty at
Stanford University, starting with the web page
http://www.cs.stanford.edu/people/faculty.html.
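The query above may be expressed as follows:
[Figure: query graph. Node x, bound to http://www.cs.stanford.edu/people/faculty.html, connects through an unbound link to node y (keywords "AI or database"); node y connects through link e (label "publications") to node z (keywords "data mining").]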
The above query graph is assigned as the schema of the web table generated in response to the above
query. The schema corresponding to the above query graph can be formally expressed as $\langle X_n, X_l, C, P \rangle$,
where $X_n$ is the set of node variables ($x$, $y$, $z$ in the example above), $X_l$ is the set of link variables ($-$, the unbound
link, and $e$ in the example), $C$ is the set of connectivities $k_1 \wedge k_2$, where $k_1 = x\langle - \rangle y$ and $k_2 = y\langle e \rangle z$, and $P$ is the set
of predicates $p_1 \wedge p_2 \wedge p_3 \wedge p_4$ such that
$p_1(x)$ = [x.url EQUALS http://www.cs.stanford.edu/people/faculty.html],
$p_2(e)$ = [e.label CONTAINS "publications"],
$p_3(y)$ = [y.text CONTAINS "AI or database"],
$p_4(z)$ = [z.text CONTAINS "data mining"].
The query returns all web tuples satisfying the web schema given above. Each of these web tuples
contains the faculty page, a faculty member’s page that should contain the words "AI or database",
and the respective publications page if it contains the words "data mining". Thus, many instances of the
query graph shown above will be returned as web tuples. We show one instance of the above query
graph below.
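[Figure: one instance of the query graph. The page http://www.cs.stanford.edu/people/faculty.html links to Widom's page (containing "active database"), which links through a "Research publications" hyperlink to a page containing "web data mining".]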
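To make the role of connectivities and predicates concrete, the following Python sketch checks whether a candidate web tuple satisfies the example schema. It is not the WICS algorithm. It assumes that each connectivity is realized by a single direct hyperlink, that CONTAINS is a case-sensitive substring test, and that "AI or database" stands for either keyword (a reading suggested by the sample instance); the function name, the dictionary layout and the example.org URLs are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Attrs = Dict[str, str]
# (source node variable, link variable or None for an unbound link, target)
Connectivity = Tuple[str, Optional[str], str]

FACULTY_URL = "http://www.cs.stanford.edu/people/faculty.html"

# C = k1 AND k2, with k1 = x<->y (unbound link) and k2 = y<e>z.
CONNECTIVITIES: List[Connectivity] = [("x", None, "y"), ("y", "e", "z")]

# P = p1 AND p2 AND p3 AND p4, split into node and link predicates.
NODE_PREDICATES: Dict[str, Callable[[Attrs], bool]] = {
    "x": lambda n: n["url"] == FACULTY_URL,
    # Assumption: "AI or database" is read as a disjunction of keywords.
    "y": lambda n: "AI" in n["text"] or "database" in n["text"],
    "z": lambda n: "data mining" in n["text"],
}
LINK_PREDICATES: Dict[Optional[str], Callable[[Attrs], bool]] = {
    "e": lambda l: "publications" in l["label"],
}


def satisfies_schema(nodes: Dict[str, Attrs], links: List[Attrs]) -> bool:
    """Check a candidate binding of node variables against the schema."""
    node_vars = {v for s, _, t in CONNECTIVITIES for v in (s, t)}
    if not (node_vars | set(NODE_PREDICATES)) <= set(nodes):
        return False
    if not all(pred(nodes[var]) for var, pred in NODE_PREDICATES.items()):
        return False
    for source, link_var, target in CONNECTIVITIES:
        # Unbound links carry no predicate, so any link is acceptable.
        link_pred = LINK_PREDICATES.get(link_var, lambda _link: True)
        if not any(
            l["source_url"] == nodes[source]["url"]
            and l["target_url"] == nodes[target]["url"]
            and link_pred(l)
            for l in links
        ):
            return False
    return True


# Candidate modelled on the instance above; member URLs are placeholders.
MEMBER_URL = "http://example.org/member.html"
PUBS_URL = "http://example.org/publications.html"
candidate_nodes = {
    "x": {"url": FACULTY_URL, "text": "faculty list"},
    "y": {"url": MEMBER_URL, "text": "active database"},
    "z": {"url": PUBS_URL, "text": "web data mining"},
}
candidate_links = [
    {"source_url": FACULTY_URL, "target_url": MEMBER_URL, "label": "Widom"},
    {"source_url": MEMBER_URL, "target_url": PUBS_URL,
     "label": "Research publications"},
]
matches = satisfies_schema(candidate_nodes, candidate_links)
```

With this candidate, `matches` is `True`; dropping the second link makes the check fail because connectivity k2 can no longer be satisfied.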
3 Web Structure Mining
Web information retrieval tools make use of only the text on pages, ignoring valuable information
contained in links. Web structure mining aims to generate a structural summary of web sites and web
pages. The focus of structure mining is therefore on link information, which is an important aspect of web
data. Given a collection of interconnected web documents, interesting and informative facts describing
their connectivity in the web subset can be discovered. We are interested in generating the following
structural information from the web tuples stored in the web tables.
• Measuring the frequency of local links in the web tuples in a web table. Local links connect
different web documents residing on the same server. This identifies the web tuples (connected
documents) in the web table that have more information about inter-related documents existing on the
same server. It also measures the completeness of the web sites, in the sense that most of the closely