Knowledge Graph Datasets

Reposted on 程序员文章站, 2022-06-12 19:50:13

Original article: https://blog.csdn.net/qq_21097885/article/details/104562276

DBpedia

Website: https://wiki.dbpedia.org/

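Introduction:
DBpedia is a distinctive example of a Semantic Web application. It extracts structured data from Wikipedia entries to enhance Wikipedia's search capabilities, and it links other datasets to Wikipedia. This semantic technology has enabled many innovative and interesting applications of Wikipedia's vast, heterogeneous information, such as mobile versions, map integration, faceted search, relationship queries, and document classification and annotation. DBpedia is also one of the world's largest multi-domain ontologies and a part of Linked Data. The US technology outlet ReadWriteWeb selected DBpedia as a top Semantic Web application of 2009.

The DBpedia 2014 release describes more than 4.58 million things, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases. Its data is used by the BBC, Reuters, and The New York Times, and it is also indexed by search engines such as Google and Yahoo.

The 2016 release contains 9.5 billion RDF triples: 1.3 billion extracted from the English edition of Wikipedia, 5.0 billion from other language editions, and another 3.2 billion from DBpedia Commons and Wikidata.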
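To make "RDF triples" concrete, here is a minimal sketch of reading triples serialized one per line in N-Triples syntax. The regular expression is a deliberate simplification (blank nodes and escaped quotes inside literals are not handled), the function name `parse_ntriple` is hypothetical, and the two sample lines are illustrative rather than copied from a DBpedia dump.

```python
import re

# Simplified N-Triples pattern: <subject> <predicate> object .
# The object is either an IRI in angle brackets or a quoted literal with an
# optional language tag or datatype. Assumption: no blank nodes and no
# escaped quotes, which a real parser would have to handle.
TRIPLE_RE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|".*"(?:@[A-Za-z-]+|\^\^<[^>]*>)?)\s*\.\s*$'
)

def parse_ntriple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if obj.startswith("<"):
        obj = obj[1:-1]  # strip the angle brackets from an IRI object
    return subject, predicate, obj

# Illustrative sample lines in N-Triples syntax.
sample = [
    '<http://dbpedia.org/resource/Berlin> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://dbpedia.org/ontology/City> .',
    '<http://dbpedia.org/resource/Berlin> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
]
triples = [parse_ntriple(line) for line in sample]
for triple in triples:
    print(triple)
```

For real dumps, a dedicated RDF library is a safer choice than a hand-written pattern; the sketch only shows the subject, predicate, object shape of each record.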

Reference:

@article{DBLP:journals/ws/BizerLKABCH09,
  author    = {Christian Bizer and
               Jens Lehmann and
               Georgi Kobilarov and
               S{\"{o}}ren Auer and
               Christian Becker and
               Richard Cyganiak and
               Sebastian Hellmann},
  title     = {DBpedia - {A} crystallization point for the Web of Data},
  journal   = {J. Web Semant.},
  volume    = {7},
  number    = {3},
  pages     = {154--165},
  year      = {2009},
  url       = {https://doi.org/10.1016/j.websem.2009.07.002},
  doi       = {10.1016/j.websem.2009.07.002},
  timestamp = {Fri, 27 Dec 2019 21:12:44 +0100},
  biburl    = {https://dblp.org/rec/journals/ws/BizerLKABCH09.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Yago

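Website: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/

Summary:
Yago is an open-source dataset whose data is automatically extracted from several sources, including Wikipedia, WordNet, and GeoNames. As of 2012, it contained more than 10 million entities and 120 million facts.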

Description:

YAGO (Yet Another Great Ontology) is an open source knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken. It is automatically extracted from Wikipedia and other sources.

As of 2012, YAGO3 has knowledge of more than 10 million entities and contains more than 120 million facts about these entities. The information in YAGO is extracted from Wikipedia (e.g., categories, redirects, infoboxes), WordNet (e.g., synsets, hyponymy), and GeoNames. The accuracy of YAGO was manually evaluated to be above 95% on a sample of facts. To integrate it into the linked data cloud, YAGO has been linked to the DBpedia ontology and to the SUMO ontology.

YAGO3 is provided in Turtle and tsv formats. Dumps of the whole database are available, as well as thematic and specialized dumps. It can also be queried through various online browsers and through a SPARQL endpoint hosted by OpenLink Software. The source code of YAGO3 is available on GitHub.

YAGO has been used in the Watson artificial intelligence system.
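As a rough illustration of working with a tab-separated fact dump like the tsv format mentioned above, the sketch below groups facts by subject. The five-column layout (fact id, subject, predicate, object, optional value) is an assumption to check against the actual files, and the sample rows and the function name `read_facts` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED five-column layout:
# fact id, subject, predicate, object, optional literal value.
SAMPLE_TSV = (
    "<id_1>\t<Albert_Einstein>\t<wasBornIn>\t<Ulm>\t\n"
    "<id_2>\t<Albert_Einstein>\t<hasWonPrize>\t<Nobel_Prize_in_Physics>\t\n"
    "<id_3>\t<Ulm>\t<isLocatedIn>\t<Germany>\t\n"
)

def read_facts(handle):
    """Yield (subject, predicate, object) tuples from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        yield row[1], row[2], row[3]

# Group facts by subject so all statements about one entity sit together.
facts_by_subject = defaultdict(list)
for subj, pred, obj in read_facts(io.StringIO(SAMPLE_TSV)):
    facts_by_subject[subj].append((pred, obj))

for subj, facts in facts_by_subject.items():
    print(subj, facts)
```

To read a downloaded dump instead of the in-memory sample, pass an opened file object (for example `open(path, encoding="utf-8", newline="")`) to `read_facts`.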

Reference:

@inproceedings{DBLP:conf/www/SuchanekKW07,
  author    = {Fabian M. Suchanek and
               Gjergji Kasneci and
               Gerhard Weikum},
  editor    = {Carey L. Williamson and
               Mary Ellen Zurko and
               Peter F. Patel{-}Schneider and
               Prashant J. Shenoy},
  title     = {Yago: a core of semantic knowledge},
  booktitle = {Proceedings of the 16th International Conference on World Wide Web,
               {WWW} 2007, Banff, Alberta, Canada, May 8-12, 2007},
  pages     = {697--706},
  publisher = {{ACM}},
  year      = {2007},
  url       = {https://doi.org/10.1145/1242572.1242667},
  doi       = {10.1145/1242572.1242667},
  timestamp = {Wed, 14 Nov 2018 10:55:41 +0100},
  biburl    = {https://dblp.org/rec/conf/www/SuchanekKW07.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Freebase

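Website: http://www.freebase.be/

Introduction:
Similar to Wikipedia, Freebase consists of structured knowledge contributed by community members. In addition to manual input, Freebase also actively imports structured knowledge from sources such as Wikipedia.
Freebase has since been acquired by Google.

Papers commonly use its subset FB13; see: https://blog.csdn.net/qq_21097885/article/details/103519703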
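Benchmark subsets of this kind are typically distributed as plain-text files with one (head, relation, tail) triple per line. The sketch below assumes tab-separated fields in that order, which should be verified against the specific download. It maps entity and relation strings to integer ids, a common first step before training an embedding model. The sample triples and the function name `build_vocab` are hypothetical.

```python
def build_vocab(triples):
    """Map every entity and relation string to a contiguous integer id."""
    entity_ids, relation_ids = {}, {}
    for head, relation, tail in triples:
        for entity in (head, tail):
            # setdefault assigns the next free id only on first sight.
            entity_ids.setdefault(entity, len(entity_ids))
        relation_ids.setdefault(relation, len(relation_ids))
    return entity_ids, relation_ids

# Hypothetical file contents; ASSUMED layout: head<TAB>relation<TAB>tail.
RAW = (
    "marie_curie\tprofession\tphysicist\n"
    "marie_curie\tnationality\tpoland\n"
    "pierre_curie\tprofession\tphysicist\n"
)

triples = [tuple(line.split("\t")) for line in RAW.splitlines() if line]
entity_ids, relation_ids = build_vocab(triples)

# Replace strings with ids so the triples can index embedding tables.
encoded = [(entity_ids[h], relation_ids[r], entity_ids[t]) for h, r, t in triples]
print(encoded)
```

Ids are assigned in order of first appearance, so the mapping is reproducible for a fixed file.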

Reference:

@inproceedings{DBLP:conf/sigmod/BollackerEPST08,
  author    = {Kurt D. Bollacker and
               Colin Evans and
               Praveen Paritosh and
               Tim Sturge and
               Jamie Taylor},
  editor    = {Jason Tsong{-}Li Wang},
  title     = {Freebase: a collaboratively created graph database for structuring
               human knowledge},
  booktitle = {Proceedings of the {ACM} {SIGMOD} International Conference on Management
               of Data, {SIGMOD} 2008, Vancouver, BC, Canada, June 10-12, 2008},
  pages     = {1247--1250},
  publisher = {{ACM}},
  year      = {2008},
  url       = {https://doi.org/10.1145/1376616.1376746},
  doi       = {10.1145/1376616.1376746},
  timestamp = {Tue, 27 Nov 2018 10:40:37 +0100},
  biburl    = {https://dblp.org/rec/conf/sigmod/BollackerEPST08.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

WordNet

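Website: https://wordnet.princeton.edu/

Summary:
WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets, each representing a distinct concept. Synsets are linked to one another by conceptual-semantic and lexical relations. WordNet is a commonly used tool in computational linguistics and natural language processing.
For Chinese, a comparable resource is HowNet (知网).

Papers commonly use its subset WN11 (see: https://blog.csdn.net/qq_21097885/article/details/103519635)
and WN18 (see: https://blog.csdn.net/qq_21097885/article/details/103519750).

Description: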
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms (strings of letters) but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.
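The two distinctions above (links between senses rather than word forms, and labeled relations) can be sketched with a toy data structure. This is not WordNet's API or data: the `Synset` class, the helper functions, the identifiers, and the glosses are all made up for illustration and only mimic the style of synset names.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    name: str        # sense identifier (toy naming, not real WordNet ids)
    lemmas: list     # word forms that share this sense
    gloss: str       # short definition
    relations: dict = field(default_factory=dict)  # relation label -> synset names

# Toy inventory: the word form "bank" appears in two distinct senses.
SYNSETS = {
    "bank.n.01": Synset("bank.n.01", ["bank"], "sloping land beside water",
                        {"hypernym": ["slope.n.01"]}),
    "bank.n.02": Synset("bank.n.02", ["bank", "banking_company"],
                        "a financial institution",
                        {"hypernym": ["institution.n.01"]}),
    "slope.n.01": Synset("slope.n.01", ["slope", "incline"], "an elevated surface"),
    "institution.n.01": Synset("institution.n.01", ["institution"],
                               "an established organization"),
}

def senses(word):
    """All synsets in which the given word form participates."""
    return [s for s in SYNSETS.values() if word in s.lemmas]

def related(name, label):
    """Follow one labeled relation from a specific sense."""
    return [SYNSETS[n] for n in SYNSETS[name].relations.get(label, [])]

for sense in senses("bank"):
    parents = [p.name for p in related(sense.name, "hypernym")]
    print(sense.name, "->", parents)
```

Because relations attach to senses, the two meanings of "bank" lead to different neighbors, which is the disambiguation the paragraph describes.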

Reference:

@article{DBLP:journals/cacm/Miller95,
  author    = {George A. Miller},
  title     = {WordNet: {A} Lexical Database for English},
  journal   = {Commun. {ACM}},
  volume    = {38},
  number    = {11},
  pages     = {39--41},
  year      = {1995},
  url       = {http://doi.acm.org/10.1145/219717.219748},
  doi       = {10.1145/219717.219748},
  timestamp = {Wed, 14 Nov 2018 10:22:30 +0100},
  biburl    = {https://dblp.org/rec/journals/cacm/Miller95.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

PDD

Website: http://pdd.wangmengsd.com/

Summary:
PDD, short for Patient-Disease-Drug, is a medical dataset that captures the links between patients, diseases, and drugs.

Description:
What is PDD Graph (Patient-Disease-Drug Graph):

Electronic medical records (EMRs) contain multi-format electronic medical data that hold an abundance of medical knowledge. Faced with patients' symptoms, experienced caregivers make the right medical decisions based on professional knowledge that accurately grasps the relationships between symptoms, diagnoses, and treatments. We aim to capture these relationships by constructing a large and high-quality heterogeneous graph linking patients, diseases, and drugs (PDD) in EMRs.

Specifically, we extract important medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) and automatically link them with the existing biomedical knowledge graphs, including ICD-9 ontology and DrugBank. The PDD graph presented is accessible on the Web via the SPARQL endpoint, and provides a pathway for medical discovery and applications, such as effective treatment recommendations.

Reference:

@inproceedings{DBLP:conf/semweb/WangZLHWLL17,
  author    = {Meng Wang and
               Jiaheng Zhang and
               Jun Liu and
               Wei Hu and
               Sen Wang and
               Xue Li and
               Wenqiang Liu},
  editor    = {Claudia d'Amato and
               Miriam Fern{\'{a}}ndez and
               Valentina A. M. Tamma and
               Freddy L{\'{e}}cu{\'{e}} and
               Philippe Cudr{\'{e}}{-}Mauroux and
               Juan F. Sequeda and
               Christoph Lange and
               Jeff Heflin},
  title     = {{PDD} Graph: Bridging Electronic Medical Records and Biomedical Knowledge
               Graphs via Entity Linking},
  booktitle = {The Semantic Web - {ISWC} 2017 - 16th International Semantic Web Conference,
               Vienna, Austria, October 21-25, 2017, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {10588},
  pages     = {219--227},
  publisher = {Springer},
  year      = {2017},
  url       = {https://doi.org/10.1007/978-3-319-68204-4\_23},
  doi       = {10.1007/978-3-319-68204-4\_23},
  timestamp = {Tue, 14 May 2019 10:00:53 +0200},
  biburl    = {https://dblp.org/rec/conf/semweb/WangZLHWLL17.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

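In recent years, Chinese institutions have also released knowledge graphs focused primarily on Chinese, such as Tsinghua University's XLore, Shanghai Jiao Tong University's zhishi.me, and Fudan University's CN-DBpedia.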

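Tsinghua University's XLore

Website: https://xlore.org/

Introduction:
XLORE is a multilingual knowledge graph built by structuring and cross-lingually linking encyclopedic knowledge from Chinese and English Wikipedia, French Wikipedia, and Baidu Baike. It is a large-scale multilingual knowledge graph in which the volumes of Chinese and English knowledge are relatively balanced. XLORE contains 16,284,901 instances, 2,466,956 concepts, 446,236 properties, and rich semantic relations.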

Reference:

@inproceedings{DBLP:conf/semweb/WangLWLLZSLZT13,
  author    = {Zhigang Wang and
               Juanzi Li and
               Zhichun Wang and
               Shuangjie Li and
               Mingyang Li and
               Dongsheng Zhang and
               Yao Shi and
               Yongbin Liu and
               Peng Zhang and
               Jie Tang},
  editor    = {Eva Blomqvist and
               Tudor Groza},
  title     = {XLore: {A} Large-scale English-Chinese Bilingual Knowledge Graph},
  booktitle = {Proceedings of the {ISWC} 2013 Posters {\&} Demonstrations Track,
               Sydney, Australia, October 23, 2013},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {1035},
  pages     = {121--124},
  publisher = {CEUR-WS.org},
  year      = {2013},
  url       = {http://ceur-ws.org/Vol-1035/iswc2013\_demo\_31.pdf},
  timestamp = {Wed, 12 Feb 2020 16:44:51 +0100},
  biburl    = {https://dblp.org/rec/conf/semweb/WangLWLLZSLZT13.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

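Shanghai Jiao Tong University's zhishi.me

Website: none

Introduction:
Zhishi.me is a first attempt at building a general-purpose Chinese knowledge graph by extracting structured data from open encyclopedias. It currently integrates data from three major Chinese encyclopedias: Baidu Baike, Hudong Baike, and Chinese Wikipedia.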

Reference:

@inproceedings{DBLP:conf/semweb/NiuSWRQY11,
  author    = {Xing Niu and
               Xinruo Sun and
               Haofen Wang and
               Shu Rong and
               Guilin Qi and
               Yong Yu},
  editor    = {Lora Aroyo and
               Chris Welty and
               Harith Alani and
               Jamie Taylor and
               Abraham Bernstein and
               Lalana Kagal and
               Natasha Fridman Noy and
               Eva Blomqvist},
  title     = {Zhishi.me - Weaving Chinese Linking Open Data},
  booktitle = {The Semantic Web - {ISWC} 2011 - 10th International Semantic Web Conference,
               Bonn, Germany, October 23-27, 2011, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {7032},
  pages     = {205--220},
  publisher = {Springer},
  year      = {2011},
  url       = {https://doi.org/10.1007/978-3-642-25093-4\_14},
  doi       = {10.1007/978-3-642-25093-4\_14},
  timestamp = {Thu, 28 Nov 2019 10:44:37 +0100},
  biburl    = {https://dblp.org/rec/conf/semweb/NiuSWRQY11.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

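Fudan University's CN-DBpedia

Website: http://kw.fudan.edu.cn/cndbpedia/intro/

Introduction:
CN-DBpedia treats the accumulation of general encyclopedic knowledge as its main line and the accumulation of in-depth, vertical-domain graphs as a secondary line. It aims to provide rich background knowledge for machine semantic understanding and the necessary support for machine language cognition.
CN-DBpedia has expanded from the encyclopedia domain to more than ten vertical domains, including law, business registration, finance, entertainment, technology, military, education, and healthcare. It provides supporting knowledge services for intelligent applications across industries and is currently used by nearly one hundred organizations.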

Reference:

@inproceedings{DBLP:conf/ieaaie/XuXLXLCX17,
  author    = {Bo Xu and
               Yong Xu and
               Jiaqing Liang and
               Chenhao Xie and
               Bin Liang and
               Wanyun Cui and
               Yanghua Xiao},
  editor    = {Salem Benferhat and
               Karim Tabia and
               Moonis Ali},
  title     = {CN-DBpedia: {A} Never-Ending Chinese Knowledge Extraction System},
  booktitle = {Advances in Artificial Intelligence: From Theory to Practice - 30th
               International Conference on Industrial Engineering and Other Applications
               of Applied Intelligent Systems, {IEA/AIE} 2017, Arras, France, June
               27-30, 2017, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {10351},
  pages     = {428--438},
  publisher = {Springer},
  year      = {2017},
  url       = {https://doi.org/10.1007/978-3-319-60045-1\_44},
  doi       = {10.1007/978-3-319-60045-1\_44},
  timestamp = {Tue, 14 May 2019 10:00:37 +0200},
  biburl    = {https://dblp.org/rec/conf/ieaaie/XuXLXLCX17.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
