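## Resolving hrefs in the literature search list into accessible URLs
When building an automated crawler, we need to collect the href of every entry in the literature list. Some href values cannot be opened directly as URLs and have to be parsed first. This article describes how to turn the collected href values into accessible URLs.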
### Resolving article detail URLs
**Journal articles**: for example, [TransPath:一种基于深度迁移强化学习的知识推理方法](https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CAPJ&dbname=CAPJLAST&filename=XXWX2021031700R)
The link "https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CAPJ&dbname=CAPJLAST&filename=XXWX2021031700R" opens the article detail page.
The href value obtained through Selenium is "https://kns.cnki.net/KNS8/Detailsfield=fn&QueryID=0&CurRec=1&recid=&FileName=XXWX2021031700R&DbName=CAPJLAST&DbCode=CAPJ&yx=Y&pr=&URLID=21.1106.TP.20210319.1034.020", but opening it does not lead to the article details.
Comparing the two URLs shows that it is enough to extract the `FileName=XXWX2021031700R&DbName=CAPJLAST&DbCode=CAPJ` part with a regular expression and append it to the detail page path, giving "https://kns.cnki.net/kcms/detail/detail.aspx?FileName=XXWX2021031700R&DbName=CAPJLAST&DbCode=CAPJ", which also opens the article detail page. The parameter names differ from the working link only in letter case.
Regular expression: `FileName=(.*?)&DbName=(.*?)&DbCode=(.*?)&`
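
The extraction and concatenation described above can be sketched as follows. This is a minimal sketch, not the project's actual code: the function name `article_href_to_url` and the constant names are hypothetical, and returning `None` for a non-matching href is an assumption. Note that the regular expression expects another parameter to follow `DbCode`, as in the sample href.

```python
import re

# Fixed prefix of the accessible detail page (taken from the working link above).
DETAIL_BASE = 'https://kns.cnki.net/kcms/detail/detail.aspx?'

# The regular expression given in the text; it relies on a '&' after DbCode.
ARTICLE_PATTERN = re.compile(r'FileName=(.*?)&DbName=(.*?)&DbCode=(.*?)&')


def article_href_to_url(href):
    """Rebuild an accessible detail URL from a journal-article href.

    Returns None when the href does not contain the expected parameters
    (hypothetical behavior chosen for this sketch).
    """
    match = ARTICLE_PATTERN.search(href)
    if match is None:
        return None
    filename, dbname, dbcode = match.groups()
    return '{}FileName={}&DbName={}&DbCode={}'.format(
        DETAIL_BASE, filename, dbname, dbcode)


if __name__ == '__main__':
    sample = ('https://kns.cnki.net/KNS8/Detailsfield=fn&QueryID=0&CurRec=1'
              '&recid=&FileName=XXWX2021031700R&DbName=CAPJLAST&DbCode=CAPJ'
              '&yx=Y&pr=&URLID=21.1106.TP.20210319.1034.020')
    print(article_href_to_url(sample))
```

Returning `None` lets the caller decide what to do with an href that needs no rewriting, for example keeping it unchanged.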
**Foreign-language journals**: for example, [An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities](https://kns.cnki.net/KNS8/Detail/RedirectScholar?flag=TitleLink&tablename=SJESLAST&filename=SJES2F9E9C8E8C8C9961EF1F032D1ACD3037)
The URL obtained through Selenium can be used directly to open the article detail page.
**Theses**: for example, [基于大数据的智能辅助诊疗全流程管理系统的研究与实现](https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CMFD&dbname=CMFDTEMP&filename=1020431527.nh)
The URL is derived in much the same way as for **journal articles**.
### Resolving author detail URLs
For example, [崔员宁](https://kns.cnki.net/kcms/detail/knetsearch.aspx?dbcode=CAPJ&sfield=au&skey=%e5%b4%94%e5%91%98%e5%ae%81&code=43931005)
The link that opens the author detail page is "https://kns.cnki.net/kcms/detail/knetsearch.aspx?dbcode=CAPJ&sfield=au&skey=%e5%b4%94%e5%91%98%e5%ae%81&code=43931005"
The URL obtained through Selenium is "https://kns.cnki.net/KNS8/Detail?sdb=CAPJ&sfield=%e4%bd%9c%e8%80%85&skey=%e5%b4%94%e5%91%98%e5%ae%81&scode=43931005&acode=43931005"
This link takes a bit more work to parse, because the target URL is assembled from several pieces:
- Fixed part: `https://kns.cnki.net/kcms/detail/knetsearch.aspx?`
- dbcode: `dbcode=` + the `sdb` value from the href
- skey: `&sfield=au&skey=` + the `skey` value from the href
- code: `&code=` + the `acode` value from the href
```python
import re


def href_to_url(href):
    """Convert an author href captured by Selenium into an accessible URL."""
    base_url = 'https://kns.cnki.net/kcms/detail/knetsearch.aspx?'
    # group(1) holds only the parameter value, without its name or the trailing '&'.
    dbcode = re.search(r'sdb=(.*?)&', href).group(1)
    skey = re.search(r'skey=(.*?)&', href).group(1)
    # acode is the last parameter in the sample href, so stop at '&' or end of string.
    code = re.search(r'acode=([^&]*)', href).group(1)
    return '{}dbcode={}&sfield=au&skey={}&code={}'.format(base_url, dbcode, skey, code)
```
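
As an alternative to regular expressions, the same mapping can be sketched with the standard library's `urllib.parse`, which does not depend on the order of the query parameters. This is only a sketch under assumptions: the function name `author_href_to_url` is hypothetical, and the href is assumed to always carry `sdb`, `skey`, and `acode`. Because `parse_qs` decodes the percent-encoded name and `urlencode` re-encodes it, the hex digits come out in upper case; percent-encoding is case-insensitive, so the URL is equivalent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Fixed prefix of the accessible author page (taken from the text above).
AUTHOR_BASE = 'https://kns.cnki.net/kcms/detail/knetsearch.aspx?'


def author_href_to_url(href):
    """Map the sdb/skey/acode parameters of an author href to the target URL."""
    query = parse_qs(urlsplit(href).query)
    params = {
        'dbcode': query['sdb'][0],   # dbcode comes from sdb
        'sfield': 'au',              # constant in the target URL
        'skey': query['skey'][0],    # author name, decoded by parse_qs
        'code': query['acode'][0],   # code comes from acode
    }
    # urlencode percent-encodes the author name again (UTF-8).
    return AUTHOR_BASE + urlencode(params)


if __name__ == '__main__':
    sample = ('https://kns.cnki.net/KNS8/Detail?sdb=CAPJ'
              '&sfield=%e4%bd%9c%e8%80%85&skey=%e5%b4%94%e5%91%98%e5%ae%81'
              '&scode=43931005&acode=43931005')
    print(author_href_to_url(sample))
```

A missing parameter raises `KeyError` here, which makes an unexpected href format visible instead of silently producing a broken URL.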
### Source URLs
The href obtained through Selenium opens the source detail page directly, so no modification is needed.
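## Refactoring
Previously the crawled data was staged in CSV files, and the overall project design turned out to be poorly structured. This time the data is stored in MySQL first. The refactored E-R diagram uses 5 entity tables and 4 relationship tables.
- Entity tables: articles, authors, schools, thesis schools, journal organizations
- Relationship tables: article-author, advisor-student, article-source, author-school
The E-R diagram is shown below: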
![](https://gitee.com/eternidad33/picbed/raw/master/img/24ad65wd2a23s1d.png)
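
To illustrate the split between entity tables and relationship tables, here is a minimal sketch covering only two entities and one relationship. Everything in it is an assumption made for illustration: the table and column names are hypothetical, the sample rows are placeholders, and `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3

# sqlite3 is used only as a stand-in for MySQL in this sketch.
conn = sqlite3.connect(':memory:')
conn.executescript('''
-- Entity tables (hypothetical columns).
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT NOT NULL, url TEXT);
CREATE TABLE author  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL, url TEXT);
-- Relationship table: one row per (article, author) pair.
CREATE TABLE article_author (
    article_id INTEGER NOT NULL REFERENCES article(id),
    author_id  INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (article_id, author_id)
);
''')

# Placeholder rows, not real crawled data.
conn.execute("INSERT INTO article (id, title) VALUES (1, 'Example article')")
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Example author')")
conn.execute("INSERT INTO article_author (article_id, author_id) VALUES (1, 1)")


def authors_of(article_id):
    """Return the names of all authors linked to the given article."""
    rows = conn.execute(
        'SELECT author.name FROM author '
        'JOIN article_author ON author.id = article_author.author_id '
        'WHERE article_author.article_id = ?',
        (article_id,))
    return [name for (name,) in rows]


if __name__ == '__main__':
    print(authors_of(1))
```

Keeping the many-to-many link in its own table means an author is stored once, however many articles reference them, which is harder to guarantee with flat CSV files.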
**2021-3-25: the CNKI crawler is finished, but the IP has been restricted**
![](https://gitee.com/eternidad33/picbed/raw/master/img/QQ截图20210325165828.jpg)
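## Knowledge-graph-based academic information retrieval system
Built on knowledge graph technology, this project implements an academic information retrieval system. Its main features are scheduled crawling of academic information, academic information updates, academic relationship search, and a knowledge visualization interface. It serves two kinds of users, server side and client side: administrators manage the system through the website back end, and users search freely on the client through the web interface.
**Features**
1. Server side: administrators can add, view, modify, or delete crawled information, the graph database, and so on;
2. Client modules: academic information search; advisor-student relationship queries; domain knowledge search; research project queries; academic forum; academic information management.
```powershell
Folder overview
website   code
resource  resource files
document  related documents
```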