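Classification level:
Master's Thesis
Thesis title: Design and Implementation of an Intelligent Service Wrapper
System for Web Applications
Author: Naibo Wang (王乃博)
Supervisor: Prof. Jianwei Yin (尹建伟)
Discipline (major): Computer Science and Technology
College: College of Computer Science and Technology
Submission date: March 30, 2020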
A Dissertation Submitted to Zhejiang
University for the Degree of
Master of Engineering
TITLE: Design and Implementation of
Intelligent Service Wrapper System
for Web Application
Author: Naibo Wang
Supervisor: Prof. Jianwei Yin
Subject: Computer Science and Technology
College: Computer Science and Technology
Submitted Date: 30 March, 2020
Zhejiang University Master's Thesis: Abstract
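Abstract
The development of Internet and Internet of Things technologies has changed how
things are connected. Servitization has become a development trend across industries
and an important model for enterprises to open up their business. IBM has gone further
and proposed an API economy strategy, which requires that resources such as hardware,
software, data, and human labor be wrapped as Web services and offered to third parties
through APIs, thereby supporting the incubation of an ecosystem. Web applications are
an important software and data resource; with the rapid development and widespread use
of the Internet, the volume of Web application data and resources has grown explosively.
Studying intelligent service wrapping methods for Web application resources, so that
Web data collection and processing tasks can be carried out through service requests,
will reduce users' workload and promote the incubation of industrial ecosystems.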
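To address the low generation efficiency and high technical requirements involved in
wrapping Web applications as Web services, this thesis designs an intelligent service
wrapper system for Web applications (OKAPI), which enables visual, programming-free
collection of Web application data and effectively serves users in industries such as
data analysis, finance and fiscal affairs, and IT.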
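Drawing on related research in Web data extraction and Web page segmentation, and
building on service-oriented architecture and Chrome extension development technology,
this thesis analyzes the Web application data collection requirements of the system's
three types of users (service providers, service requesters, and the service registry)
and proposes the overall design architecture of the system.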
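For the simple data collection scenario, in which the Web page has a clear structure
and the user obtains Web data after submitting a form, this thesis studies algorithms
for Web page form detection, Web page segmentation, and block sorting in order to
generate and invoke Web application data collection services. For the complex data
collection scenario, in which users configure their own operation rules to collect
diverse Web data, this thesis studies algorithms for matching elements of the same type
and for defining and executing service workflows, thereby meeting the need for
customized data collection from Web pages.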
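Tests of the customized Web application data collection services generated by the
simple and complex data collection subsystems show that the generation success rate and
reachability of the services generated by the intelligent service wrapper system both
exceed 99%, and that service response times are all below 500 ms. Compared with Web
data extraction systems such as XWRAP, the system improves the efficiency of generating
Web data collection tasks by more than a factor of 10. In addition, the Web page
segmentation algorithm achieves an accuracy above 96%, the data collection rate exceeds
600 records per minute, and the stability of the system when executing large-scale data
collection tasks exceeds 95%. These results verify the robustness of the services in
the system and the accuracy and efficiency of service execution.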
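Finally, the 128 high-resolution related services wrapped by the system on the
High-Resolution Service Grid Platform, a major national project, have been invoked more
than 10,000 times and retrieved more than 1,000 times in total, which demonstrates the
system's supporting role for major national special projects and verifies its
practicality.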
Keywords: Web application, service wrapping, Web data extraction, Web page segmentation, workflow management
Abstract
The development of Internet and Internet of Things technologies has changed the way
things are connected. Service orientation has become a development trend across
industries and an important model for enterprises to open up their business. IBM has
gone further and proposed an API economy strategy, under which resources such as
hardware, software, data, and human labor are wrapped as web services and offered to
third parties through APIs to support the incubation of an ecosystem. Web applications
are an important software and data resource. With the rapid development and widespread
use of the Internet, the volume of web application data and resources has grown
explosively. Research on intelligent service wrapping methods for web application
resources, which allow web data collection and processing tasks to be executed through
service requests, will reduce users' workload and promote the incubation of industrial
ecosystems.
To address the low generation efficiency and high technical requirements of wrapping
web applications as web services, this thesis designs an intelligent service wrapper
system for web applications (OKAPI). The system enables visual, programming-free
collection of web application data and serves users in industries such as data
analysis, finance, and IT.
Drawing on related research in web data extraction and web page segmentation, and
building on service-oriented architecture and Chrome extension development technology,
this thesis analyzes the web application data collection requirements of the three
roles in the system: service provider, service requester, and service registry. It then
proposes the overall design architecture of the system.
For the simple data collection scenario, in which the web page has a clear structure
and users obtain web data after submitting a form, this thesis studies web page form
detection, web page segmentation, and block sorting algorithms to generate and invoke
web application data collection services. For the complex data collection scenario, in
which users configure their own operation rules to collect diverse web data, this
thesis studies algorithms for matching elements of the same type and for defining and
executing service processes, to meet the need for customized data collection from web
pages.
Tests of the customized web application data collection services generated by the
simple and complex data collection subsystems show that the generation success rate and
reachability of the services generated by the intelligent service wrapper system both
exceed 99%, and that service response times are all below 500 ms. Compared with XWRAP
and other web data extraction systems, the system improves the efficiency of generating
web data collection tasks by more than a factor of 10. In addition, the accuracy of the
web page segmentation algorithm is above 96%, the data collection rate exceeds 600
records per minute, and the stability of the system in performing large-scale data
collection tasks exceeds 95%. These results verify the robustness of the services in
the system and the accuracy and efficiency of service execution.
Finally, the 128 high-resolution related services wrapped by the system on the
high-resolution service grid platform, a major national project, have been invoked more
than 10,000 times and retrieved more than 1,000 times in total. This reflects the
system's supporting role for major national projects and verifies its practicality.
Keywords: web application, service wrapper, web data extraction, web page
segmentation, flow management