Application of Topic Modelling for the Construction of Semantic Frames for Named Rivers

Juan Rojas García

Authors

Juan Rojas García, Universidad de Granada, http://orcid.org/0000-0002-7611-1386

Abstract

EcoLexicon is a terminological knowledge base on environmental science whose design permits the geographic contextualization of data. To contextualize concepts related to named landforms geographically, this paper presents a semi-automatic method for extracting terms associated with named rivers (e.g., Salinas River). Terms were extracted from a specialized corpus on Coastal Engineering in which named rivers were automatically identified. Statistical procedures were applied to select both terms and rivers in distributional semantic models, in order to construct the conceptual structures underlying the usage of named rivers. Rivers sharing associated terms were also clustered and represented in the same conceptual network. The results showed that the method successfully described the semantic frames of named rivers with explanatory adequacy, according to the premises of Frame-based Terminology. Furthermore, the semantic networks revealed that the named rivers were thematically related to sediment concentration in rivers, sediment discharge into bays, and the negative effects of decreased sediment supply on coastal erosion.
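The general idea of grouping rivers by the terms they share can be illustrated with a minimal sketch. It is not the procedure used in the paper, which relies on topic modelling and statistical selection of terms and rivers. Here a plain sentence-level co-occurrence count, cosine similarity, and a fixed threshold stand in for those steps. The sentences, the river names "Alpha River" and "Beta River", the stopword list, and the 0.5 threshold are all invented for illustration. Only "Salinas River" appears in the abstract.

```python
import math
from collections import Counter
from itertools import combinations

# Toy sentences invented for this sketch (NOT taken from the paper's corpus).
SENTENCES = [
    "sediment discharge from the Salinas River reaches the bay",
    "reduced sediment supply from the Salinas River increases coastal erosion",
    "sediment discharge from the Alpha River reaches the bay",
    "ship traffic on the Beta River passes several locks",
]
# "Alpha River" and "Beta River" are hypothetical names.
RIVERS = ["Salinas River", "Alpha River", "Beta River"]
STOPWORDS = {"the", "from", "on"}  # assumed minimal stopword list


def term_vectors(sentences, rivers, stopwords):
    """Count the terms that co-occur with each river name in a sentence."""
    vectors = {river: Counter() for river in rivers}
    for sentence in sentences:
        lowered = sentence.lower()
        for river in rivers:
            name = river.lower()
            if name in lowered:
                # Drop the river name itself, keep the remaining content words.
                context = lowered.replace(name, " ")
                tokens = [t for t in context.split() if t not in stopwords]
                vectors[river].update(tokens)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Merge rivers whose similarity reaches the threshold (single link)."""
    clusters = [{river} for river in vectors]
    for r1, r2 in combinations(vectors, 2):
        if cosine(vectors[r1], vectors[r2]) >= threshold:
            c1 = next(c for c in clusters if r1 in c)
            c2 = next(c for c in clusters if r2 in c)
            if c1 is not c2:
                c1 |= c2
                clusters = [c for c in clusters if c is not c2]
    return clusters


VECTORS = term_vectors(SENTENCES, RIVERS, STOPWORDS)
CLUSTERS = cluster(VECTORS, threshold=0.5)  # threshold is an assumption
print(CLUSTERS)
```

On this toy input, the two rivers described with sediment-related terms fall into one group and the third stays alone. With a real corpus, the raw counts would normally be replaced by a weighted association measure and the threshold by a proper clustering procedure.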

Author biography

Juan Rojas García, Universidad de Granada

SPECIAL ISSUE

(Department of Translation and Interpreting, University of Granada, Spain)
