An Extensible Schema For Building Large Weakly-labeled Semantic Corpora
Abstract
In NLP, data drives research, as evidenced by the frequency with which seminal works of database engineering such as the Penn Treebank have been employed as a basis for experimentation. Traditionally, large-scale, expertly annotated corpora have been expensive and time-consuming to produce. This cost drove researchers to adopt automated methods for generating labelled data from available resources such as Freebase, DBpedia, and the "infoboxes" found on Wikipedia pages. These knowledge bases have been, or are in the process of being, subsumed by Wikidata, an initiative to concentrate such disparate data repositories in an organized, machine-readable format. This resource is an important research tool. In this paper, we review our experience using Wikidata to construct a large annotated corpus under distant supervision. We also make our materials, including the code used to generate our annotations, freely available to all interested parties.
References
Intxaurrondo, Ander, Mihai Surdeanu, Oier López de Lacalle, and Eneko Agirre. 2013. Removing noisy mentions for distant supervision. Procesamiento del Lenguaje Natural 51, 41-48.
Hoffmann, Raphael, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. Association for Computational Linguistics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. 541-550.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 55-60.
Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Cambridge University Press, Cambridge, UK. Computational Linguistics 19.2, 313-330.
Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. Association for Computational Linguistics. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 2.
Riedel, Sebastian, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. Springer Berlin Heidelberg. Machine Learning and Knowledge Discovery in Databases. 148-163.
Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL-HLT. Linguistic Data Consortium, Philadelphia. 74-84.
Sandhaus, Evan. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia.
Schoenmackers, Stefan, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. Association for Computational Linguistics. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 1088-1098.
Surdeanu, Mihai, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. Association for Computational Linguistics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 455-465.
Vrandečić, Denny, and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57, no. 10. 78-85.
Erxleben, Fredo, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić. 2014. Introducing Wikidata to the linked data web. Springer International Publishing. In The Semantic Web - ISWC 2014. 50-65.