Research Issues in Web Data Mining

SANJAY MADRIA, SOURAV S BHOWMICK, W. -K NG, E. P. LIM

Center for Advanced Information Systems, School of Applied Science

Nanyang Technological University, Singapore 639798

{askumar, p517026, awkng, aseplim}@ntu.edu.sg

Abstract. In this paper, we present an overview of research issues in web mining. We discuss
mining with respect to web data, referred to here as web data mining. In particular, our focus is on
web data mining research in the context of our web warehousing project called WHOWEDA
(Warehouse of Web Data). We have categorized web data mining into three areas: web content
mining, web structure mining and web usage mining. We have highlighted and discussed various
research issues involved in each of these web data mining categories. We believe that web data
mining will be a topic of exploratory research in the near future.

1 Introduction

The advent of the World Wide Web has caused a dramatic increase in the usage of the Internet. The World
Wide Web is a broadcast medium where a wide range of information can be obtained at low cost.
Information on the WWW is important not only to individual users but also to business organizations,
especially where critical decision making is concerned. Most users obtain WWW information using a
combination of search engines and browsers; however, these two retrieval mechanisms do not
necessarily address all of a user's information needs. This is particularly true of business
organizations, which currently lack suitable tools to systematically harness strategic information from the
web and analyze these data to discover useful knowledge to support decision making. A recent study
provides a comprehensive and comparative evaluation of the most popular search engines [1]. A more
recent survey of web query processing has appeared in [23].

The resulting growth in on-line information, combined with the almost unstructured nature of web data,
necessitates the development of powerful yet computationally efficient web data mining tools. Web data
mining can be defined as the discovery and analysis of useful information from WWW data. The web
involves three types of data: data on the WWW (content), web log data regarding the users who browsed the
web pages, and web structure data. Thus, WWW data mining should focus on three issues: web
structure mining, web content mining [8] and web usage mining [2,10,13]. Web structure mining involves
mining the structures and links of web documents. In [24], some insight is given on mining structural
information on the web. Our initial study [5] has shown that web structure mining is very useful in
generating information such as visible web documents, luminous web documents and luminous paths (a
luminous path being a path common to most of the results returned). In this paper, we discuss some
applications in web data mining and E-commerce where these types of knowledge can be used. Web content
mining describes the automatic search of information resources available on-line. Web usage mining draws
on data from server access logs, user registrations or profiles, user sessions or transactions, etc. A
survey of some of the emerging tools and techniques for web usage mining is given in [2]. In our
discussion here, we focus on the research issues in web data mining with respect to the web warehousing
project called WHOWEDA (Warehouse of Web Data).

The key objective of WHOWEDA at the Center for Advanced Information Systems in Nanyang
Technological University, Singapore, is to design and implement a web warehouse that materializes and
manages useful information from the web to support strategic decision making. We are building a web
warehouse [7] using the database approach: the warehouse contains strategic information
coupled from the web and may also inter-operate with conventional data warehouses. One of the
important areas of our work involves the development of techniques for mining useful information from
the web. We will integrate WHOWEDA with intelligent tools for information retrieval and extend
data mining techniques to provide a higher level of data organization for the unstructured data available
on the web.

With respect to our web data mining approach, we argue that extracting information from a very
small subset of all HTML web pages is also an instance of web data mining. In WHOWEDA, we focus on
mining a subset of web pages stored in one or more web tables because we believe that, due to the
complexity and vastness of the web, mining information from a subset of the web stored in web tables is
a more feasible option. Our web warehousing approach allows us to do this effectively, as we materialize
only the results returned in response to a user's query graph.

2 WHOWEDA

In WHOWEDA, we introduced our web data model, which consists of a hierarchy of web objects. The
fundamental objects are Nodes and Links, where nodes correspond to HTML text documents and links
correspond to the hyperlinks interconnecting the documents in the WWW. These objects consist of a set of
attributes as follows: Node = [url, title, format, size, date, text] and Link = [source-url, target-url, label,
link-type]. In our web warehouse, the Web Information Coupling System (WICS) [9] is a database system for
managing and manipulating coupled information extracted from the web. We have defined a set of
coupling operators to manipulate the web tables and correlate additional useful and related information
[9].
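
As a rough illustration of the two fundamental objects, the sketch below models them as Python dataclasses. Only the attribute names come from the text; the field types, the example `link_type` value in the comment, and the helper `outgoing_links` are assumptions made for this example and are not part of WHOWEDA or WICS.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """An HTML text document: Node = [url, title, format, size, date, text]."""
    url: str
    title: str
    format: str   # assumed to be a short format tag such as "html"
    size: int     # assumed to be a size in bytes
    date: str     # assumed to be a date string; the text does not fix a type
    text: str


@dataclass(frozen=True)
class Link:
    """A hyperlink: Link = [source-url, target-url, label, link-type]."""
    source_url: str
    target_url: str
    label: str
    link_type: str  # values such as "local" are hypothetical here


def outgoing_links(node: Node, links: List[Link]) -> List[Link]:
    """Hypothetical helper: the links whose source is the given node."""
    return [link for link in links if link.source_url == node.url]
```

Keeping both objects immutable (`frozen=True`) lets them be used as dictionary keys or set members, which is convenient when the same document appears in several web tuples.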

We materialize web data as web tuples stored in web tables. Web tuples, representing directed
connected graphs, are composed of web objects (Nodes and Links). We associate with each web table a
web schema, which contains the meta-data that binds a set of web tuples to the web table in the form of
connectivities and predicates defined on node and link variables. Connectivities represent structural
properties of web tuples by describing possible paths between node variables. Predicates, on the other
hand, specify the additional conditions that each tuple must satisfy to be included in the web table.
In WICS, a user expresses a web query in the form of a query graph consisting of nodes and links that
represent web documents and the hyperlinks in those documents, respectively. Keywords can be imposed on
each of these nodes and links to select the web documents whose text and/or hyperlinks contain the given
keywords. When the query graph is posted over the WWW, a set of web tuples, each satisfying the query
graph, is harnessed from the WWW and returned as the result. Thus, the web schema of a web table
resembles the query graph used to derive the web tuples stored in that table. Note that some nodes and
links in the query graph may not have keywords imposed on them; they are called unbound nodes and
unbound links, respectively.

Consider a query to find all publications related to data mining by the computer science faculty at
Stanford University, starting with the web page

http://www.cs.stanford.edu/people/faculty.html.

This query may be expressed as the following query graph:

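[Figure: query graph. Node x (http://www.cs.stanford.edu/people/faculty.html) is connected by an unbound
link to node y (keywords "AI or database"), which is connected by link e (label "publications") to node z
(keywords "data mining").]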

The above query graph is assigned as the schema of the web table generated in response to the
query. The schema corresponding to the query graph can be formally expressed as
$\langle X_n, X_l, C, P \rangle$, where $X_n$ is the set of node variables ($x$, $y$, $z$ in the example
above); $X_l$ is the set of link variables ($-$, the unbound link, and $e$ in the example); $C$ is the set
of connectivities $k_1 \wedge k_2$, where $k_1 = x\langle - \rangle y$ and $k_2 = y\langle e \rangle z$;
and $P$ is the set of predicates $p_1 \wedge p_2 \wedge p_3 \wedge p_4$ such that
$p_1(x)$ = [x.url EQUALS http://www.cs.stanford.edu/people/faculty.html],
$p_2(e)$ = [e.label CONTAINS "publications"],
$p_3(y)$ = [y.text CONTAINS "AI or database"],
$p_4(z)$ = [z.text CONTAINS "data mining"].
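
To make the roles of connectivities and predicates concrete, the following is a minimal sketch of how a candidate web tuple could be checked against the example schema. It is not the WICS implementation: the `WebSchema` class, the `accepts` method, the dictionary representation of nodes and links, and the reading of CONTAINS as a case-insensitive substring test are all assumptions, as is treating "AI or database" as a disjunction of two keywords.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Attrs = Dict[str, str]  # attribute name -> value for one node or link


def contains(attrs: Attrs, field: str, keyword: str) -> bool:
    """Assumed meaning of CONTAINS: case-insensitive substring match."""
    return keyword.lower() in attrs.get(field, "").lower()


@dataclass
class WebSchema:
    node_vars: List[str]                                    # node variables
    link_vars: List[str]                                    # link variables
    connectivities: List[Tuple[str, str, str]]              # (source, link, target)
    predicates: List[Tuple[str, Callable[[Attrs], bool]]]   # (variable, condition)

    def accepts(self, binding: Dict[str, Attrs]) -> bool:
        """True if the binding of variables to objects satisfies C and P."""
        for src, link, dst in self.connectivities:
            # The link must start at the source node and end at the target node.
            if binding[link]["source_url"] != binding[src]["url"]:
                return False
            if binding[link]["target_url"] != binding[dst]["url"]:
                return False
        return all(pred(binding[var]) for var, pred in self.predicates)


FACULTY_URL = "http://www.cs.stanford.edu/people/faculty.html"

# The example schema: x -(unbound link "-")-> y -(e)-> z.
schema = WebSchema(
    node_vars=["x", "y", "z"],
    link_vars=["-", "e"],
    connectivities=[("x", "-", "y"), ("y", "e", "z")],
    predicates=[
        ("x", lambda a: a["url"] == FACULTY_URL),
        ("e", lambda a: contains(a, "label", "publications")),
        ("y", lambda a: contains(a, "text", "AI") or contains(a, "text", "database")),
        ("z", lambda a: contains(a, "text", "data mining")),
    ],
)
```

A substring test is crude (for instance, "AI" also matches inside longer words), so a real system would more likely match on tokens; the sketch only shows how the structural and keyword conditions combine.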

The query returns all web tuples satisfying the web schema given above. Each of these web tuples
contains the faculty page, a faculty member's page that contains the keywords "AI or database",
and the respective publications page if it contains the keywords "data mining". Thus, many instances of
the query graph shown above will be returned as web tuples. We show one instance of the above query
graph below.

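[Figure: one instance of the query graph, with node x = http://www.cs.stanford.edu/people/faculty.html,
node y labeled "Widom" and matching "active database", link e labeled "Research publications", and node z
matching "web data mining".]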

3 Web Structure Mining

Web information retrieval tools make use of only the text on pages, ignoring valuable information
contained in links. Web structure mining aims to generate a structural summary of web sites and web
pages. The focus of structure mining is therefore on link information, which is an important aspect of web
data. Given a collection of interconnected web documents, interesting and informative facts describing
their connectivity in the web subset can be discovered. We are interested in generating the following
structural information from the web tuples stored in the web tables.

• Measuring the frequency of local links in the web tuples of a web table. Local links connect
different web documents residing on the same server. This frequency indicates which web tuples (connected
documents) in the web table carry more information about inter-related documents existing on the
same server. This also measures the completeness of the web sites, in the sense that most of the closely
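
The local-link measure described in this item can be sketched as follows. This is only one possible reading: "frequency" is taken to mean the fraction of links in each web tuple whose source and target share a host name, and a web tuple is reduced to a list of (source URL, target URL) pairs. The function names `is_local` and `local_link_frequency` are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

LinkPair = Tuple[str, str]  # (source URL, target URL)


def is_local(source_url: str, target_url: str) -> bool:
    """Assumed test for a local link: both endpoints have the same host name."""
    return urlsplit(source_url).hostname == urlsplit(target_url).hostname


def local_link_frequency(web_table: List[List[LinkPair]]) -> List[float]:
    """For each web tuple (a list of links), return the fraction of local links."""
    frequencies = []
    for links in web_table:
        if not links:
            # A tuple without links has no local links to count.
            frequencies.append(0.0)
            continue
        local = sum(1 for src, dst in links if is_local(src, dst))
        frequencies.append(local / len(links))
    return frequencies
```

Tuples with a high fraction would be those whose documents mostly stay on one server, which matches the intuition in the text that such tuples say more about inter-related documents at that server.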
