The shape of word embeddings: Quantifying non-isometry with topological data analysis

Draganov O, Skiena S. 2024. The shape of word embeddings: Quantifying non-isometry with topological data analysis. Findings of the Association for Computational Linguistics: EMNLP 2024. EMNLP: Conference on Empirical Methods in Natural Language Processing, 12080–12099.

Download
2024_EMNLP_Draganov.pdf (Open Access, 1.31 MB) [Published Version]

Conference Paper | Published | English

Scopus indexed
Author
Draganov, Ondrej (ISTA); Skiena, Steven

Corresponding author has ISTA affiliation

Abstract
Word embeddings represent language vocabularies as clouds of d-dimensional points. We investigate how information is conveyed by the general shape of these clouds, rather than by the representation of each token's semantic meaning. Specifically, we use persistent homology, a tool from topological data analysis (TDA), to measure the distance between each pair of languages from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To determine whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically significant similarities to the reference tree.
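As a rough illustration of the idea in the abstract, the sketch below compares two small point clouds through their 0-dimensional persistent homology. It is not the authors' pipeline: the function names (`h0_death_times`, `shape_distance`, `rotate`), the toy 2D clouds, and the choice of comparing sorted death times are all assumptions made for this example. It relies on the standard fact that, for a Vietoris-Rips filtration, connected components merge exactly at the edge lengths of a minimum spanning tree.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Scales at which connected components merge in a Vietoris-Rips
    filtration, computed as minimum-spanning-tree edge lengths (Kruskal).
    The scale is taken to be the edge length itself."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 values in ascending order


def shape_distance(cloud_a, cloud_b):
    """Toy dissimilarity (an assumption, not the paper's metric):
    largest gap between sorted H0 death times of equal-size clouds."""
    da, db = h0_death_times(cloud_a), h0_death_times(cloud_b)
    if len(da) != len(db):
        raise ValueError("this sketch requires clouds of equal size")
    return max((abs(x - y) for x, y in zip(da, db)), default=0.0)


def rotate(points, angle):
    """Rotate 2D points about the origin (an isometry)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rotated = rotate(square, 0.7)                    # isometric copy
stretched = [(3.0 * x, y) for x, y in square]    # non-isometric copy

print(shape_distance(square, rotated))
print(shape_distance(square, stretched))
```

The rotated copy yields a value that is zero up to floating-point error, because the computation depends only on pairwise distances, while the stretched copy yields a clearly positive value. This is the sense in which such a distance measures non-isometry without needing labels on the points.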
Publishing Year
2024
Date Published
2024-11-01
Proceedings Title
Findings of the Association for Computational Linguistics: EMNLP 2024
Publisher
Association for Computational Linguistics
Page
12080-12099
Conference
EMNLP: Conference on Empirical Methods in Natural Language Processing
Conference Location
Miami, FL, United States
Conference Date
2024-11-12 – 2024-11-16
Cite this

Draganov O, Skiena S. The shape of word embeddings: Quantifying non-isometry with topological data analysis. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics; 2024:12080-12099. doi:10.18653/v1/2024.findings-emnlp.705
Draganov, O., & Skiena, S. (2024). The shape of word embeddings: Quantifying non-isometry with topological data analysis. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 12080–12099). Miami, FL, United States: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-emnlp.705
Draganov, Ondrej, and Steven Skiena. “The Shape of Word Embeddings: Quantifying Non-Isometry with Topological Data Analysis.” In Findings of the Association for Computational Linguistics: EMNLP 2024, 12080–99. Association for Computational Linguistics, 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.705.
O. Draganov and S. Skiena, “The shape of word embeddings: Quantifying non-isometry with topological data analysis,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, United States, 2024, pp. 12080–12099.
Draganov, Ondrej, and Steven Skiena. “The Shape of Word Embeddings: Quantifying Non-Isometry with Topological Data Analysis.” Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, 2024, pp. 12080–99, doi:10.18653/v1/2024.findings-emnlp.705.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC BY 4.0)
Main File(s)
File Name
2024_EMNLP_Draganov.pdf
Access Level
Open Access
Date Uploaded
2025-02-10
MD5 Checksum
f4416a5962194f0181ab0dc7f9ef93c0


Sources

arXiv 2404.00500
