Developing free morphological data for Polish

Adam Radziszewski; Marek Maziarz

doi:10.11649/cs.2011.012

Authors

Adam Radziszewski Politechnika Wrocławska [Wrocław University of Technology], Wrocław
Marek Maziarz Politechnika Wrocławska [Wrocław University of Technology], Wrocław

DOI:

https://doi.org/10.11649/cs.2011.012

Keywords:

Natural Language Processing, language corpora

Abstract

A limiting factor in the construction of Natural Language Processing (NLP) systems is often the availability of morphological resources. This is the case for Polish: the freely available corpus with manual morpho-syntactic annotation (part of the IPI PAN Corpus) is not coupled with any free morphological analyser. A very large morphological dictionary of Polish, Morfologik, is available under a free licence. Unfortunately, its tagset differs significantly from the tagset of the corpus and, what is more, its morphological description lacks the desired rigour. We remedy this situation by performing a large-scale conversion of the dictionary into a tagset compliant with the corpus. The conversion results in a free dictionary containing entries for almost 3.5 million different word forms. In this article we report on our methodology, discuss some morphological and syntactic issues related to both tagsets, and present the characteristics of the resulting dictionary.
