# oag-cs dataset
## Raw data
[Open Academic Graph 2.1](https://www.aminer.cn/oag-2-1)
This dataset uses the Microsoft Academic Graph (MAG) portion, 169 GB in total.
| Type | File | Count |
| --- | --- | --- |
| author | mag_authors_{0-1}.zip | 243477150 |
| paper | mag_papers_{0-16}.zip | 240255240 |
| venue | mag_venues.zip | 53422 |
| affiliation | mag_affiliations.zip | 25776 |
## Field analysis
Assuming the raw zip files are located in data/oag/mag/:
```shell
python -m gnnrec.kgrec.data.preprocess.analyze author data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze paper data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze venue data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze affiliation data/oag/mag/
```
```
Data type: venue
Total: 53422
Maximal field set: {'JournalId', 'NormalizedName', 'id', 'ConferenceId', 'DisplayName'}
Minimal field set: {'NormalizedName', 'DisplayName', 'id'}
Field occurrence ratio: {'id': 1.0, 'JournalId': 0.9162891692561117, 'DisplayName': 1.0, 'NormalizedName': 1.0, 'ConferenceId': 0.08371083074388828}
Example: {'id': 2898614270, 'JournalId': 2898614270, 'DisplayName': 'Revista de Psiquiatría y Salud Mental', 'NormalizedName': 'revista de psiquiatria y salud mental'}
```
```
Data type: affiliation
Total: 25776
Maximal field set: {'id', 'NormalizedName', 'url', 'Latitude', 'Longitude', 'WikiPage', 'DisplayName'}
Minimal field set: {'id', 'NormalizedName', 'Latitude', 'Longitude', 'DisplayName'}
Field occurrence ratio: {'id': 1.0, 'DisplayName': 1.0, 'NormalizedName': 1.0, 'WikiPage': 0.9887880198634389, 'Latitude': 1.0, 'Longitude': 1.0, 'url': 0.6649984481688392}
Example: {'id': 3032752892, 'DisplayName': 'Universidad Internacional de La Rioja', 'NormalizedName': 'universidad internacional de la rioja', 'WikiPage': 'https://en.wikipedia.org/wiki/International_University_of_La_Rioja', 'Latitude': '42.46270', 'Longitude': '2.45500', 'url': 'https://en.unir.net/'}
```
```
Data type: author
Total: 243477150
Maximal field set: {'normalized_name', 'name', 'pubs', 'n_pubs', 'n_citation', 'last_known_aff_id', 'id'}
Minimal field set: {'normalized_name', 'name', 'n_pubs', 'pubs', 'id'}
Field occurrence ratio: {'id': 1.0, 'name': 1.0, 'normalized_name': 1.0, 'last_known_aff_id': 0.17816547055853085, 'pubs': 1.0, 'n_pubs': 1.0, 'n_citation': 0.39566894470384595}
Example: {'id': 3040689058, 'name': 'Jeong Hoe Heo', 'normalized_name': 'jeong hoe heo', 'last_known_aff_id': '59412607', 'pubs': [{'i': 2770054759, 'r': 10}], 'n_pubs': 1, 'n_citation': 44}
```
```
Data type: paper
Total: 240255240
Maximal field set: {'issue', 'authors', 'page_start', 'publisher', 'doc_type', 'title', 'id', 'doi', 'references', 'volume', 'fos', 'n_citation', 'venue', 'page_end', 'year', 'indexed_abstract', 'url'}
Minimal field set: {'id'}
Field occurrence ratio: {'id': 1.0, 'title': 0.9999999958377599, 'authors': 0.9998381970774082, 'venue': 0.5978255167296247, 'year': 0.9999750931550963, 'page_start': 0.5085962370685443, 'page_end': 0.4468983111460961, 'publisher': 0.5283799512551735, 'issue': 0.41517357124031923, 'url': 0.9414517743712895, 'doi': 0.37333226530251745, 'indexed_abstract': 0.5832887141192009, 'fos': 0.8758779954185391, 'n_citation': 0.3795505812901313, 'doc_type': 0.6272126634990355, 'volume': 0.43235134434528877, 'references': 0.3283648464857624}
Example: {
'id': 2507145174,
'title': 'Structure-Activity Relationships and Kinetic Studies of Peptidic Antagonists of CBX Chromodomains.',
'authors': [{'name': 'Jacob I. Stuckey', 'id': 2277886111, 'org': 'Center for Integrative Chemical Biology and Drug Discovery, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill , Chapel Hill, North Carolina 27599, United States.\r', 'org_id': 114027177}, {'name': 'Catherine Simpson', 'id': 2098592917, 'org': 'Center for Integrative Chemical Biology and Drug Discovery, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill , Chapel Hill, North Carolina 27599, United States.\r', 'org_id': 114027177}, ...],
'venue': {'name': 'Journal of Medicinal Chemistry', 'id': 162030435},
'year': 2016, 'n_citation': 13, 'page_start': '8913', 'page_end': '8923', 'doc_type': 'Journal', 'publisher': 'American Chemical Society', 'volume': '59', 'issue': '19', 'doi': '10.1021/ACS.JMEDCHEM.6B00801',
'references': [1976962550, 1982791788, 1988515229, 2000127174, 2002698073, 2025496265, 2032915605, 2050256263, 2059999434, 2076333986, 2077957449, 2082815186, 2105928678, 2116982909, 2120121380, 2146641795, 2149566960, 2156518222, 2160723017, 2170079272, 2207535250, 2270756322, 2326025506, 2327795699, 2332365177, 2346619380, 2466657786],
'indexed_abstract': '{"IndexLength":108,"InvertedIndex":{"To":[0],"better":[1],"understand":[2],"the":[3,19,54,70,80,95],"contribution":[4],"of":[5,21,31,47,56,82,90,98],"methyl-lysine":[6],"(Kme)":[7],"binding":[8,33,96],"proteins":[9],"to":[10,79],"various":[11],"disease":[12],"states,":[13],"we":[14,68],"recently":[15],"developed":[16],"and":[17,36,43,63,73,84],"reported":[18],"discovery":[20,46],"1":[22,48,83],"(UNC3866),":[23],"a":[24],"chemical":[25],"probe":[26],"that":[27,77],"targets":[28],"two":[29],"families":[30],"Kme":[32],"proteins,":[34],"CBX":[35],"CDY":[37],"chromodomains,":[38],"with":[39,61,101],"selectivity":[40],"for":[41,87],"CBX4":[42],"-7.":[44],"The":[45],"was":[49],"enabled":[50],"in":[51],"part":[52],"by":[53,93,105],"use":[55],"molecular":[57],"dynamics":[58],"simulations":[59],"performed":[60],"CBX7":[62,102],"its":[64],"endogenous":[65],"substrate.":[66],"Herein,":[67],"describe":[69],"design,":[71],"synthesis,":[72],"structure–activity":[74],"relationship":[75],"studies":[76],"led":[78],"development":[81],"provide":[85],"support":[86],"our":[88,99],"model":[89],"CBX7–ligand":[91],"recognition":[92],"examining":[94],"kinetics":[97],"antagonists":[100],"as":[103],"determined":[104],"surface-plasmon":[106],"resonance.":[107]}}',
'fos': [{'name': 'chemistry', 'w': 0.36301}, {'name': 'chemical probe', 'w': 0.0}, {'name': 'receptor ligand kinetics', 'w': 0.46173}, {'name': 'dna binding protein', 'w': 0.42292}, {'name': 'biochemistry', 'w': 0.39304}],
'url': ['https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.6b00801', 'https://www.ncbi.nlm.nih.gov/pubmed/27571219', 'http://pubsdc3.acs.org/doi/abs/10.1021/acs.jmedchem.6b00801']
}
```
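
The per-type statistics printed above (total count, largest and smallest field sets, per-field occurrence ratios) could be computed roughly as follows. This is a minimal sketch, not the actual `analyze.py`; the function name `analyze_fields` and the toy records are hypothetical. In the venue output, the `JournalId` and `ConferenceId` ratios sum to 1 while the smallest set contains neither, which is consistent with the largest set being the union of all observed fields and the smallest being their intersection; the sketch assumes that reading.

```python
from collections import Counter


def analyze_fields(records):
    """Summarize which fields occur across an iterable of dict records."""
    total = 0
    counts = Counter()
    union, common = set(), None
    for record in records:
        total += 1
        fields = set(record)
        counts.update(fields)
        union |= fields  # every field seen in at least one record
        common = fields if common is None else common & fields  # fields seen in all records
    return {
        "total": total,
        "max_fields": union,
        "min_fields": common or set(),
        "ratios": {name: n / total for name, n in counts.items()},
    }


# Toy records shaped like the venue example (values are made up).
sample = [
    {"id": 1, "DisplayName": "A", "NormalizedName": "a", "JournalId": 1},
    {"id": 2, "DisplayName": "B", "NormalizedName": "b", "ConferenceId": 2},
    {"id": 3, "DisplayName": "C", "NormalizedName": "c"},
]
stats = analyze_fields(sample)
```

A single streaming pass keeps memory use independent of the record count, which matters for the author and paper files with hundreds of millions of records.
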
## Step 1: Extract the computer science subset
```shell
python -m gnnrec.kgrec.data.preprocess.extract_cs data/oag/mag/
```
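This step selects computer science papers from the last 10 years. The 34 second-level fields under computer science, crawled from Microsoft Academic, serve as the filter on the field-of-study (fos) field, and papers whose main fields are empty are dropped.
Second-level field list: [CS_FIELD_L2](config.py)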
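
The filter described above might look like the sketch below. It is not the code in `extract_cs.py`: the names `is_cs_paper`, `CS_FIELDS`, `REQUIRED_FIELDS`, and `min_year` are hypothetical, the three field names stand in for the real 34-entry `CS_FIELD_L2` list, and the choice of which fields count as "main" is an assumption, since this README does not list them.

```python
# Hypothetical stand-in for CS_FIELD_L2 in config.py; the real list has 34 entries.
CS_FIELDS = {"machine learning", "computer vision", "data mining"}
# Assumption: which fields count as "main" is not stated in this README.
REQUIRED_FIELDS = ("title", "authors", "year", "fos")


def is_cs_paper(paper, min_year, cs_fields=CS_FIELDS, required=REQUIRED_FIELDS):
    """Return True if a raw MAG paper record passes the assumed Step 1 filter."""
    # Drop papers where any required field is missing or empty.
    if any(not paper.get(name) for name in required):
        return False
    # Keep only recent papers.
    if paper["year"] < min_year:
        return False
    # Raw fos entries look like {'name': 'chemistry', 'w': 0.36301}.
    return any(f.get("name") in cs_fields for f in paper["fos"])


# Toy records (made up) covering each rejection reason.
papers = [
    {"id": 1, "title": "A", "authors": [{"id": 10}], "year": 2018,
     "fos": [{"name": "machine learning", "w": 0.5}]},
    {"id": 2, "title": "B", "authors": [{"id": 11}], "year": 2005,
     "fos": [{"name": "machine learning", "w": 0.5}]},  # too old
    {"id": 3, "title": "C", "authors": [{"id": 12}], "year": 2018,
     "fos": [{"name": "chemistry", "w": 0.4}]},  # not a CS field
    {"id": 4, "authors": [{"id": 13}], "year": 2018,
     "fos": [{"name": "data mining", "w": 0.3}]},  # missing title
]
kept = [p["id"] for p in papers if is_cs_paper(p, min_year=2010)]
```
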
The step outputs 5 files:
(1) Authors: mag_authors.txt
`{"id": aid, "name": "author name", "org": oid}`
(2) Papers: mag_papers.txt
```
{
"id": pid,
"title": "paper title",
"authors": [aid],
"venue": vid,
"year": year,
"abstract": "abstract",
"fos": ["field"],
"references": [pid],
"n_citation": n_citation
}
```
(3) Venues: mag_venues.txt
`{"id": vid, "name": "venue name"}`
(4) Institutions: mag_institutions.txt
`{"id": oid, "name": "org name"}`
(5) Fields: mag_fields.txt
`{"id": fid, "name": "field name"}`
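
The raw paper records store the abstract as an inverted index (`indexed_abstract`), while the output paper format has a plain-text `abstract`. Presumably the text is rebuilt from the index; the README does not say how, so the sketch below is only one plausible way to do it, and `decode_abstract` is a hypothetical name.

```python
import json


def decode_abstract(indexed_abstract):
    """Rebuild plain text from a MAG-style inverted-index JSON string."""
    index = json.loads(indexed_abstract)
    words = [""] * index["IndexLength"]
    for word, positions in index["InvertedIndex"].items():
        for pos in positions:
            # Ignore positions outside the declared length instead of failing.
            if 0 <= pos < len(words):
                words[pos] = word
    # Skip slots that no word claimed.
    return " ".join(w for w in words if w)


raw = '{"IndexLength":5,"InvertedIndex":{"the":[0,3],"cat":[1],"saw":[2],"dog":[4]}}'
text = decode_abstract(raw)
```
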
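## Step 2: Pretrain paper and field embeddings
Fine-tune the pretrained SciBERT model via **contrastive learning** between paper titles and keywords, then use the 128-dimensional hidden-layer output vectors as input features for the paper and field vertices.
The pretrained SciBERT model comes from Transformers: [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased)
Note: the raw data contains no keywords, so the research fields (the fos field) are used as keywords.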
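
To make the contrastive objective concrete, the sketch below computes an in-batch InfoNCE-style loss over aligned title and keyword vectors, along with the fraction of titles whose best-matching keyword vector is their own. The loss form, the `temperature` value, and the names `cosine` and `info_nce` are all assumptions; this README does not specify the exact loss used in `fine_tune.py`. Plain Python lists are used only to keep the example self-contained; actual training would operate on SciBERT output tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(title_vecs, keyword_vecs, temperature=0.1):
    """Return (mean loss, accuracy) for a batch where row i of each list is a positive pair."""
    n = len(title_vecs)
    loss, correct = 0.0, 0
    for i, t in enumerate(title_vecs):
        # Similarity of title i to every keyword vector in the batch.
        logits = [cosine(t, k) / temperature for k in keyword_vecs]
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching keyword (index i) as the target class.
        loss += log_z - logits[i]
        correct += int(max(range(n), key=logits.__getitem__) == i)
    return loss / n, correct / n


# Toy 2-dimensional vectors (made up); real vectors would be 128-dimensional.
titles = [[1.0, 0.0], [0.0, 1.0]]
keywords = [[0.9, 0.1], [0.2, 0.8]]
loss, acc = info_nce(titles, keywords)
```

The other papers in the batch act as negatives, so no explicit negative sampling is needed in this formulation.
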
1. fine-tune
```shell
python -m gnnrec.kgrec.data.preprocess.fine_tune train
```
```
Epoch 0 | Loss 0.3470 | Train Acc 0.9105 | Va
```