Testing word embeddings for Polish
DOI: https://doi.org/10.11649/cs.1468
Keywords: distributional semantics, word embeddings, model evaluation, synonymy, analogy
Abstract
Distributional semantics represents word meaning as numeric vectors derived from the contexts in which words occur in large text corpora. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models built on lemmas and on word forms, trained with the Continuous Bag of Words (CBOW) and skip-gram approaches on different Polish corpora. The comparison uses two typical tasks solved with the help of distributional semantics: synonymy recognition and analogy recognition. The results show that no single approach to vector creation works best across tasks. The quality and size of the data matter most, but different strategy choices can also lead to significantly different results.
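As a rough illustration of the two evaluation tasks named in the abstract, the sketch below picks a synonym by cosine similarity and solves an analogy with the vector offset method that is common in the word2vec literature. It is not taken from the paper: the four-dimensional vectors, the word list, and the helper names `pick_synonym` and `solve_analogy` are all hypothetical, and the paper's actual test procedure may differ.

```python
import math

# Hypothetical toy vectors (NOT trained embeddings); the dimensions loosely
# stand for [royalty, male, female, building] so the arithmetic is easy to follow.
VECTORS = {
    "król":      [1.0, 1.0, 0.0, 0.0],   # king
    "królowa":   [1.0, 0.0, 1.0, 0.0],   # queen
    "mężczyzna": [0.0, 1.0, 0.0, 0.0],   # man
    "kobieta":   [0.0, 0.0, 1.0, 0.0],   # woman
    "dom":       [0.0, 0.0, 0.0, 1.0],   # house
    "budynek":   [0.1, 0.0, 0.0, 0.9],   # building
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_synonym(target, candidates, vectors=VECTORS):
    """Synonymy task: return the candidate whose vector is closest to the target."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))


def solve_analogy(a, b, c, vectors=VECTORS):
    """Analogy task 'a is to b as c is to ?', using the offset b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the answer pool, as is usual for this test.
    pool = [w for w in vectors if w not in (a, b, c)]
    return max(pool, key=lambda w: cosine(query, vectors[w]))


if __name__ == "__main__":
    print(pick_synonym("dom", ["budynek", "kobieta", "król"]))
    print(solve_analogy("mężczyzna", "król", "kobieta"))
```

With real models, the dictionary would be replaced by vectors trained on lemmas or on word forms, which is exactly the choice the paper compares.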
License
Copyright (c) 2017 Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik

This work is licensed under a Creative Commons Attribution 3.0 Unported License.



