Building a text corpus for automatic biographical facts extraction from Russian texts

A.V. Glazkova

Abstract

Tasks in computational linguistics and machine learning that involve natural language processing (NLP) often require text corpora. A text corpus is a specially prepared collection of documents with markup that carries morphological, syntactic, semantic, or other information. Data drawn from text corpora are used in supervised machine learning to build classifiers for natural-language texts and in other NLP and computational linguistics tasks. The kind of information represented in a corpus, as well as the type of texts it contains, is determined by the aims and tasks of the particular study. This article presents a tool for building a corpus of biographical texts in Russian. Building the corpus involves two stages: collecting the texts and marking them up. In the first stage, we collected texts suitable for markup: the corpus includes biographical articles that are freely available on Wikipedia. For this purpose, we developed an automatic parser based on open-source Python libraries. The second stage is the semantic markup of sentences and the selection of biographical facts; it was carried out in a semi-automatic mode. The article describes the specifics of building a corpus of biographical facts, the taxonomy of biographical facts used in our work, the software implementation for text collection and markup, the representation of texts in the corpus, and the characteristics of the resulting corpus.
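The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the libraries, the markup format, or the taxonomy, so the sketch uses only the Python standard library, a naive sentence splitter, and two hypothetical labels (`birth`, `education`) with hypothetical cue words. It works on an inline HTML string instead of downloading a Wikipedia page, and it models the semi-automatic step as an automatic label suggestion that an annotator would later confirm or correct.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            # Join text pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Stage 1 (collection): return paragraph texts found in the HTML."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Naive boundary: sentence-final punctuation, whitespace, then a capital
# letter (Latin or Cyrillic). Abbreviations will cause false splits.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])")


def split_sentences(paragraph):
    return [s for s in SENTENCE_BOUNDARY.split(paragraph) if s]


# Hypothetical labels and cue words, for illustration only; they are
# NOT the taxonomy of biographical facts used in the article.
HYPOTHETICAL_CUES = {
    "birth": ("родился", "родилась"),
    "education": ("окончил", "окончила"),
}


def suggest_label(sentence):
    """Return the first label whose cue word occurs in the sentence."""
    lowered = sentence.lower()
    for label, cues in HYPOTHETICAL_CUES.items():
        if any(cue in lowered for cue in cues):
            return label
    return None


def build_records(html):
    """Stage 2 (markup): one record per sentence with a suggested label.

    'confirmed' stays False until a human annotator reviews the record.
    """
    records = []
    for paragraph in extract_paragraphs(html):
        for sentence in split_sentences(paragraph):
            records.append(
                {
                    "sentence": sentence,
                    "suggested_label": suggest_label(sentence),
                    "confirmed": False,
                }
            )
    return records


SAMPLE_HTML = (
    "<html><body><h1>Чехов</h1>"
    "<p>Антон Павлович Чехов родился в 1860 году в Таганроге. "
    'Он окончил <a href="#">Московский университет</a>.</p>'
    "<p>Писал рассказы и пьесы.</p>"
    "</body></html>"
)

records = build_records(SAMPLE_HTML)
```

Keeping the suggestion and the `confirmed` flag in separate fields lets automatically proposed labels be told apart from reviewed ones, which is one plausible way to organize a semi-automatic workflow.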

PAGES

pp. 97 - 103

ISSUE

Volume: 7 Issue: 1 Part: 0 (2019)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY