GNNRecom/gnnrec/kgrec/data/readme.md

158 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# oag-cs数据集
## 原始数据
[Open Academic Graph 2.1](https://www.aminer.cn/oag-2-1)
使用其中的微软学术(MAG)数据总大小169 GB
| 类型 | 文件 | 总量 |
| --- | --- | --- |
| author | mag_authors_{0-1}.zip | 243477150 |
| paper | mag_papers_{0-16}.zip | 240255240 |
| venue | mag_venues.zip | 53422 |
| affiliation | mag_affiliations.zip | 25776 |
## 字段分析
假设原始zip文件所在目录为data/oag/mag/
```shell
python -m gnnrec.kgrec.data.preprocess.analyze author data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze paper data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze venue data/oag/mag/
python -m gnnrec.kgrec.data.preprocess.analyze affiliation data/oag/mag/
```
```
数据类型: venue
总量: 53422
最大字段集合: {'JournalId', 'NormalizedName', 'id', 'ConferenceId', 'DisplayName'}
最小字段集合: {'NormalizedName', 'DisplayName', 'id'}
字段出现比例: {'id': 1.0, 'JournalId': 0.9162891692561117, 'DisplayName': 1.0, 'NormalizedName': 1.0, 'ConferenceId': 0.08371083074388828}
示例: {'id': 2898614270, 'JournalId': 2898614270, 'DisplayName': 'Revista de Psiquiatría y Salud Mental', 'NormalizedName': 'revista de psiquiatria y salud mental'}
```
```
数据类型: affiliation
总量: 25776
最大字段集合: {'id', 'NormalizedName', 'url', 'Latitude', 'Longitude', 'WikiPage', 'DisplayName'}
最小字段集合: {'id', 'NormalizedName', 'Latitude', 'Longitude', 'DisplayName'}
字段出现比例: {'id': 1.0, 'DisplayName': 1.0, 'NormalizedName': 1.0, 'WikiPage': 0.9887880198634389, 'Latitude': 1.0, 'Longitude': 1.0, 'url': 0.6649984481688392}
示例: {'id': 3032752892, 'DisplayName': 'Universidad Internacional de La Rioja', 'NormalizedName': 'universidad internacional de la rioja', 'WikiPage': 'https://en.wikipedia.org/wiki/International_University_of_La_Rioja', 'Latitude': '42.46270', 'Longitude': '2.45500', 'url': 'https://en.unir.net/'}
```
```
数据类型: author
总量: 243477150
最大字段集合: {'normalized_name', 'name', 'pubs', 'n_pubs', 'n_citation', 'last_known_aff_id', 'id'}
最小字段集合: {'normalized_name', 'name', 'n_pubs', 'pubs', 'id'}
字段出现比例: {'id': 1.0, 'name': 1.0, 'normalized_name': 1.0, 'last_known_aff_id': 0.17816547055853085, 'pubs': 1.0, 'n_pubs': 1.0, 'n_citation': 0.39566894470384595}
示例: {'id': 3040689058, 'name': 'Jeong Hoe Heo', 'normalized_name': 'jeong hoe heo', 'last_known_aff_id': '59412607', 'pubs': [{'i': 2770054759, 'r': 10}], 'n_pubs': 1, 'n_citation': 44}
```
```
数据类型: paper
总量: 240255240
最大字段集合: {'issue', 'authors', 'page_start', 'publisher', 'doc_type', 'title', 'id', 'doi', 'references', 'volume', 'fos', 'n_citation', 'venue', 'page_end', 'year', 'indexed_abstract', 'url'}
最小字段集合: {'id'}
字段出现比例: {'id': 1.0, 'title': 0.9999999958377599, 'authors': 0.9998381970774082, 'venue': 0.5978255167296247, 'year': 0.9999750931550963, 'page_start': 0.5085962370685443, 'page_end': 0.4468983111460961, 'publisher': 0.5283799512551735, 'issue': 0.41517357124031923, 'url': 0.9414517743712895, 'doi': 0.37333226530251745, 'indexed_abstract': 0.5832887141192009, 'fos': 0.8758779954185391, 'n_citation': 0.3795505812901313, 'doc_type': 0.6272126634990355, 'volume': 0.43235134434528877, 'references': 0.3283648464857624}
示例: {
'id': 2507145174,
'title': 'Structure-Activity Relationships and Kinetic Studies of Peptidic Antagonists of CBX Chromodomains.',
'authors': [{'name': 'Jacob I. Stuckey', 'id': 2277886111, 'org': 'Center for Integrative Chemical Biology and Drug Discovery, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill , Chapel Hill, North Carolina 27599, United States.\r', 'org_id': 114027177}, {'name': 'Catherine Simpson', 'id': 2098592917, 'org': 'Center for Integrative Chemical Biology and Drug Discovery, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill , Chapel Hill, North Carolina 27599, United States.\r', 'org_id': 114027177}, ...],
'venue': {'name': 'Journal of Medicinal Chemistry', 'id': 162030435},
'year': 2016, 'n_citation': 13, 'page_start': '8913', 'page_end': '8923', 'doc_type': 'Journal', 'publisher': 'American Chemical Society', 'volume': '59', 'issue': '19', 'doi': '10.1021/ACS.JMEDCHEM.6B00801',
'references': [1976962550, 1982791788, 1988515229, 2000127174, 2002698073, 2025496265, 2032915605, 2050256263, 2059999434, 2076333986, 2077957449, 2082815186, 2105928678, 2116982909, 2120121380, 2146641795, 2149566960, 2156518222, 2160723017, 2170079272, 2207535250, 2270756322, 2326025506, 2327795699, 2332365177, 2346619380, 2466657786],
'indexed_abstract': '{"IndexLength":108,"InvertedIndex":{"To":[0],"better":[1],"understand":[2],"the":[3,19,54,70,80,95],"contribution":[4],"of":[5,21,31,47,56,82,90,98],"methyl-lysine":[6],"(Kme)":[7],"binding":[8,33,96],"proteins":[9],"to":[10,79],"various":[11],"disease":[12],"states,":[13],"we":[14,68],"recently":[15],"developed":[16],"and":[17,36,43,63,73,84],"reported":[18],"discovery":[20,46],"1":[22,48,83],"(UNC3866),":[23],"a":[24],"chemical":[25],"probe":[26],"that":[27,77],"targets":[28],"two":[29],"families":[30],"Kme":[32],"proteins,":[34],"CBX":[35],"CDY":[37],"chromodomains,":[38],"with":[39,61,101],"selectivity":[40],"for":[41,87],"CBX4":[42],"-7.":[44],"The":[45],"was":[49],"enabled":[50],"in":[51],"part":[52],"by":[53,93,105],"use":[55],"molecular":[57],"dynamics":[58],"simulations":[59],"performed":[60],"CBX7":[62,102],"its":[64],"endogenous":[65],"substrate.":[66],"Herein,":[67],"describe":[69],"design,":[71],"synthesis,":[72],"structureactivity":[74],"relationship":[75],"studies":[76],"led":[78],"development":[81],"provide":[85],"support":[86],"our":[88,99],"model":[89],"CBX7ligand":[91],"recognition":[92],"examining":[94],"kinetics":[97],"antagonists":[100],"as":[103],"determined":[104],"surface-plasmon":[106],"resonance.":[107]}}',
'fos': [{'name': 'chemistry', 'w': 0.36301}, {'name': 'chemical probe', 'w': 0.0}, {'name': 'receptor ligand kinetics', 'w': 0.46173}, {'name': 'dna binding protein', 'w': 0.42292}, {'name': 'biochemistry', 'w': 0.39304}],
'url': ['https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.6b00801', 'https://www.ncbi.nlm.nih.gov/pubmed/27571219', 'http://pubsdc3.acs.org/doi/abs/10.1021/acs.jmedchem.6b00801']
}
```
## 第1步抽取计算机领域的子集
```shell
python -m gnnrec.kgrec.data.preprocess.extract_cs data/oag/mag/
```
筛选近10年计算机领域的论文从微软学术抓取了计算机科学下的34个二级领域作为领域字段过滤条件过滤掉主要字段为空的论文
二级领域列表:[CS_FIELD_L2](config.py)
输出5个文件
1学者mag_authors.txt
`{"id": aid, "name": "author name", "org": oid}`
2论文mag_papers.txt
```
{
"id": pid,
"title": "paper title",
"authors": [aid],
"venue": vid,
"year": year,
"abstract": "abstract",
"fos": ["field"],
"references": [pid],
"n_citation": n_citation
}
```
3期刊mag_venues.txt
`{"id": vid, "name": "venue name"}`
4机构mag_institutions.txt
`{"id": oid, "name": "org name"}`
5领域mag_fields.txt
`{"id": fid, "name": "field name"}`
## 第2步预训练论文和领域向量
通过论文标题和关键词的**对比学习**对预训练的SciBERT模型进行fine-tune之后将隐藏层输出的128维向量作为paper和field顶点的输入特征
预训练的SciBERT模型来自Transformers [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased)
由于原始数据不包含关键词因此使用研究领域fos字段作为关键词
1. fine-tune
```shell
python -m gnnrec.kgrec.data.preprocess.fine_tune train
```
```
Epoch 0 | Loss 0.3470 | Train Acc 0.9105 | Val Acc 0.9426
Epoch 1 | Loss 0.1609 | Train Acc 0.9599 | Val Acc 0.9535
Epoch 2 | Loss 0.1065 | Train Acc 0.9753 | Val Acc 0.9573
Epoch 3 | Loss 0.0741 | Train Acc 0.9846 | Val Acc 0.9606
Epoch 4 | Loss 0.0551 | Train Acc 0.9898 | Val Acc 0.9614
```
2. 推断
```shell
python -m gnnrec.kgrec.data.preprocess.fine_tune infer
```
预训练的论文和领域向量分别保存到paper_feat.pkl和field_feat.pkl文件已归一化
该向量既可用于GNN模型的输入特征也可用于计算相似度召回论文
## 第3步构造图数据集
将以上5个txt和2个pkl文件压缩为oag-cs.zip得到oag-cs数据集的原始数据
将oag-cs.zip文件放到`$DGL_DOWNLOAD_DIR`目录下(环境变量`DGL_DOWNLOAD_DIR`默认为`~/.dgl/`
```python
from gnnrec.kgrec.data import OAGCSDataset
data = OAGCSDataset()
g = data[0]
```
统计数据见 [OAGCSDataset](oagcs.py) 的文档字符串
## 下载地址
下载地址:<https://pan.baidu.com/s/1ayH3tQxsiDDnqPoXhR0Ekg>提取码2ylp
大小1.91 GB解压后大小3.93 GB