An Extensible Schema For Building Large Weakly-labeled Semantic Corpora

S. Matthew English

doi:10.61850/allj.v22i2.368

Published: May 30, 2016

Keywords:

Wikidata - Semantic Corpora

S. Matthew English

The University of Hong Kong

Abstract

In NLP, data drives research, as evidenced by the frequency with which seminal works of database engineering such as the Penn Treebank have been employed as a basis for experimentation. Traditionally, large-scale, expertly annotated corpora have been expensive and time-consuming to produce. This constraint drove researchers to adopt automated methods for generating labelled data from available resources such as Freebase, DBpedia, and the "infoboxes" found on Wikipedia pages. These knowledge bases have been, or are in the process of being, subsumed by Wikidata, an initiative to consolidate such disparate data repositories in an organized, machine-readable format. This resource is an important research tool. In this paper, we review our experience using Wikidata to construct a large annotated corpus under distant supervision. We also make the materials, including the code used to generate our annotations, freely available to all interested parties.
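As a rough illustration of what annotating "under distant supervision" means in general, the sketch below weakly labels a sentence with a knowledge-base relation whenever both entities of a triple appear in it. It is not the author's released code: the function name `label_sentences`, the toy triples, and the relation names are all hypothetical, and the string matching is deliberately naive.

```python
from typing import List, Tuple

# Hypothetical toy data. In a real pipeline the triples would be drawn from a
# knowledge base such as Wikidata; these names are illustrative assumptions.
Triple = Tuple[str, str, str]  # (subject, relation, object)

KB: List[Triple] = [
    ("Douglas Adams", "place_of_birth", "Cambridge"),
    ("Douglas Adams", "educated_at", "St John's College"),
]

SENTENCES: List[str] = [
    "Douglas Adams was born in Cambridge in 1952.",
    "Douglas Adams often visited Cambridge later in life.",
    "The Penn Treebank is an annotated corpus.",
]


def label_sentences(
    sentences: List[str], triples: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Attach a relation label to every sentence that mentions both the
    subject and the object of a knowledge-base triple (case-insensitive)."""
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for subj, rel, obj in triples:
            if subj.lower() in lowered and obj.lower() in lowered:
                labeled.append((sentence, subj, rel, obj))
    return labeled


if __name__ == "__main__":
    for sentence, subj, rel, obj in label_sentences(SENTENCES, KB):
        print(f"{rel}({subj}, {obj}) <- {sentence}")
```

Note that the second toy sentence receives the `place_of_birth` label even though it does not express that relation. This kind of noise is why corpora built this way are described as weakly labeled.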

How to Cite

English, S. (2016). An Extensible Schema For Building Large Weakly-labeled Semantic Corpora. AL-Lisaniyyat, 22(2), 18-22. https://doi.org/10.61850/allj.v22i2.368

Issue

Vol. 22 No. 2 (2016): v22i22016

Section

Articles

In accordance with its open access publishing policy, AL-Lisaniyyat acknowledges and guarantees authors the full and exclusive ownership of copyright and intellectual property rights related to their scholarly contributions.

The publication of an article in the journal does not result in any transfer, assignment, or limitation of these rights. Authors retain full rights over their works, without the requirement to obtain prior written authorization from the journal.

