Representing Web Applications As Knowledge Graphs
Yogesh Chandrasekharuni
Skil Inc, USA
yogesh@skil.ai
Abstract—Traditional methods for crawling and parsing web
applications predominantly rely on extracting hyperlinks from
initial pages and recursively following linked resources. This
approach constructs a graph where nodes represent unstructured
data from web pages, and edges signify transitions between them.
However, these techniques are limited in capturing the dynamic
and interactive behaviors inherent to modern web applications. In
contrast, the proposed method models each node as a structured
representation of the application’s current state, with edges
reflecting user-initiated actions or transitions. This structured
representation enables a more comprehensive and functional
understanding of web applications, offering valuable insights
for downstream tasks such as automated testing and behavior
analysis.
I. INTRODUCTION
Web applications require rich data representation for downstream tasks such as automation testing, user behavior analysis, and functional verification. Traditional web parsers operate through a structured yet simplistic algorithm (a code sketch follows the list):
1) Initialize a queue with the starting page.
2) Set a maximum depth (if applicable) and initialize the
current depth to zero.
3) While the queue is not empty and the maximum depth
is not exceeded:
a) Dequeue the next page from the queue.
b) If the page has not been visited:
i) Navigate to the page.
ii) Extract the desired data and store it as a node.
iii) Extract all hyperlinks from the page.
iv) Add all unseen and unvisited hyperlinks to the
queue.
v) Mark the current page as visited.
c) Increment the depth if moving to a new level.
4) Stop when all pages are visited or the maximum depth
is reached.
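A minimal Python sketch of this traditional breadth-first crawl is given below, assuming the requests and beautifulsoup4 packages are available; the function name crawl, the node store, and the max_depth default are illustrative choices, not taken from the paper.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=2):
    """Breadth-first hyperlink crawl; returns {url: extracted page text}."""
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    visited = set()
    nodes = {}

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        # Navigate to the page and store its data as a node.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        nodes[url] = soup.get_text()

        # Extract all hyperlinks and enqueue the unseen ones.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                queue.append((link, depth + 1))

    return nodes


# Example usage with a placeholder starting page.
pages = crawl("https://example.com", max_depth=1)
```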
While this approach effectively scrapes static web applications, it falls short in handling dynamic applications, where
significant portions of the application are unreachable through
simple hyperlink navigation. Modern web applications often
follow structured user flows, which involve interaction beyond
hyperlinks. For instance, in an e-commerce site, reaching the
checkout page might require several actions: searching for a
product, adding it to the cart, entering a delivery location,
and only then accessing the checkout. Traditional parsers,
which rely solely on clicking hyperlinks, cannot capture such
dynamic flows and are limited in their ability to represent the
application’s state accurately.
Additionally, many web applications exhibit variability at the same endpoint depending on the user’s context. For example, a checkout page may display “Ready to purchase” for one user and “Item cannot be delivered to your location” for another, based on the delivery address provided.
In this work, the proposed solution overcomes these limitations by representing each unique state of a web application
as a node, with edges defined by specific actions taken within
the application. This method captures the full complexity of
user flows, allowing for a more accurate and interpretable
knowledge representation of web applications.
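To make this representation concrete, here is a minimal sketch using a networkx-style directed graph; the state fields (URL, visible elements, form values) and the action labels are hypothetical choices made for illustration, not the paper's exact schema.

```python
import networkx as nx


def state_key(url, visible_elements, form_values):
    """Fingerprint one application state from its observable content."""
    return (url, tuple(sorted(visible_elements)), tuple(sorted(form_values.items())))


graph = nx.DiGraph()

# Two states at different points of a hypothetical e-commerce flow.
home = state_key("/", {"search_box"}, {})
cart = state_key("/cart", {"checkout_button"}, {"item": "widget"})

graph.add_node(home)
graph.add_node(cart)

# The edge is labelled with the user action that caused the transition.
graph.add_edge(home, cart, action="click:add_to_cart")
```

Keying nodes on observable state rather than on the URL alone is what allows two differently behaving visits to the same endpoint, such as the checkout example above, to appear as distinct nodes.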
II. BACKGROUND
Early web crawlers, such as World Wide Web Wanderer
(1993) [1], were primarily designed to map the size of the
web by collecting basic HTML from static websites [2]. As
the web expanded, tools like JumpStation emerged, becoming
the first search engine to use crawlers for indexing web content
[3]. These early systems, however, were limited to handling
static web content, as dynamic web pages driven by JavaScript
and AJAX had not yet become widespread.
The emergence of dynamic content significantly complicated the process of web scraping for traditional parsers.
Frameworks such as Beautiful Soup (2004) were introduced
to facilitate the extraction of structured data from increasingly
complex web pages. Although effective for parsing static
HTML content, these tools were inherently limited in their
capacity to handle dynamic, JavaScript-driven web elements
or to interact with user-initiated events. As modern web applications began to rely heavily on dynamic content loading and
client-side interactions, more advanced methodologies became
necessary to accurately capture these behaviors. Several tools
have been developed to address these challenges. Selenium [4]
is widely used for automating browser interactions, allowing
developers to simulate user actions such as clicking, typing,
and submitting forms.
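As an illustration of this kind of browser automation, a minimal Selenium sketch follows; the target URL, element name, and CSS selector are placeholders chosen for the example.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://shop.example.com")  # placeholder URL

# Simulate typing into a search field and submitting the form.
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("wireless mouse")
search_box.submit()

# Simulate clicking a dynamically rendered button.
driver.find_element(By.CSS_SELECTOR, "button.add-to-cart").click()

driver.quit()
```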
To address these limitations, visual web scraping tools like
Octoparse [5] emerged, offering user-friendly interfaces that
allowed non-programmers to automate the extraction of both
static and dynamic website data. These tools simulate user
behavior, such as clicks and form submissions, to capture data.
However, tools like Octoparse lack self-exploration capabilities and are unable to reason through or autonomously navigate