Testing word embeddings for Polish

Agnieszka Mykowiecka; Małgorzata Marciniak; Piotr Rychlik

doi:10.11649/cs.1468

Authors

Agnieszka Mykowiecka, Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warsaw
Małgorzata Marciniak, Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warsaw
Piotr Rychlik, Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warsaw

DOI:

https://doi.org/10.11649/cs.1468

Keywords:

distributional semantics, word embeddings, model evaluation, synonymy, analogy

Abstract

Distributional semantics represents word meaning as numeric vectors derived from the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of lemma-based and word-form-based models built with the Continuous Bag of Words (CBOW) and skip-gram approaches on different Polish corpora. The models are evaluated on two typical tasks solved with the help of distributional semantics: synonymy recognition and analogy recognition. The results show that no single approach to vector creation works best across tasks. The quality and size of the data matter most, but different strategy choices can also lead to significantly different results.
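
As a rough illustration of the two evaluation tasks named in the abstract, the sketch below shows one common way to pose them over word vectors: synonymy as choosing the candidate closest to a target word by cosine similarity, and analogy as a vector-offset query (b - a + c). The toy 4-dimensional vectors, the helper names `cosine`, `pick_synonym`, and `solve_analogy`, and the offset formulation itself are assumptions made for this example; they are not taken from the paper, whose models were trained on Polish corpora.

```python
import math

# Hypothetical toy vectors chosen by hand for illustration only.
# Real embeddings would be learned from a corpus (e.g. with CBOW or skip-gram).
VECTORS = {
    "mężczyzna": [1.0, 0.0, 0.0, 0.0],
    "kobieta":   [1.0, 1.0, 0.0, 0.0],
    "król":      [3.0, 0.0, 0.0, 0.0],
    "królowa":   [3.0, 1.0, 0.0, 0.0],
    "samochód":  [0.0, 0.0, 0.9, 0.1],
    "auto":      [0.0, 0.0, 0.8, 0.2],
    "rower":     [0.0, 0.0, 0.4, 0.6],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_synonym(target, candidates, vectors):
    """Return the candidate whose vector is most similar to the target's."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset query b - a + c.

    The three input words are excluded from the answer pool.
    """
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    pool = [w for w in vectors if w not in (a, b, c)]
    return max(pool, key=lambda w: cosine(query, vectors[w]))


print(pick_synonym("samochód", ["rower", "auto", "królowa"], VECTORS))
print(solve_analogy("mężczyzna", "kobieta", "król", VECTORS))
```

With these toy vectors the two calls select `auto` and `królowa`. A test set of such questions, scored by the share of correct answers, would give one accuracy figure per model, which is the kind of number that lets lemma-based and form-based models be compared.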

