---
OA_place: publisher
OA_type: gold
_id: '18998'
abstract:
- lang: eng
  text: Word embeddings represent language vocabularies as clouds of d-dimensional
    points. We investigate how information is conveyed by the general shape of these
    clouds, instead of representing the semantic meaning of each token. Specifically,
    we use the notion of persistent homology from topological data analysis (TDA)
    to measure the distances between language pairs from the shape of their unlabeled
    embeddings. These distances quantify the degree of non-isometry of the embeddings.
    To distinguish whether these differences are random training errors or capture
    real information about the languages, we use the computed distance matrices to
    construct language phylogenetic trees over 81 Indo-European languages. Careful
    evaluation shows that our reconstructed trees exhibit strong and statistically significant
    similarities to the reference.
article_processing_charge: No
arxiv: 1
author:
- first_name: Ondrej
  full_name: Draganov, Ondrej
  id: 2B23F01E-F248-11E8-B48F-1D18A9856A87
  last_name: Draganov
  orcid: 0000-0003-0464-3823
- first_name: Steven
  full_name: Skiena, Steven
  last_name: Skiena
citation:
  ama: 'Draganov O, Skiena S. The shape of word embeddings: Quantifying non-isometry
    with topological data analysis. In: <i>Findings of the Association for Computational
    Linguistics: EMNLP 2024</i>. Association for Computational Linguistics; 2024:12080-12099.
    doi:<a href="https://doi.org/10.18653/v1/2024.findings-emnlp.705">10.18653/v1/2024.findings-emnlp.705</a>'
  apa: 'Draganov, O., &#38; Skiena, S. (2024). The shape of word embeddings: Quantifying
    non-isometry with topological data analysis. In <i>Findings of the Association
    for Computational Linguistics: EMNLP 2024</i> (pp. 12080–12099). Miami, FL, United
    States: Association for Computational Linguistics. <a href="https://doi.org/10.18653/v1/2024.findings-emnlp.705">https://doi.org/10.18653/v1/2024.findings-emnlp.705</a>'
  chicago: 'Draganov, Ondrej, and Steven Skiena. “The Shape of Word Embeddings: Quantifying
    Non-Isometry with Topological Data Analysis.” In <i>Findings of the Association
    for Computational Linguistics: EMNLP 2024</i>, 12080–99. Association for Computational
    Linguistics, 2024. <a href="https://doi.org/10.18653/v1/2024.findings-emnlp.705">https://doi.org/10.18653/v1/2024.findings-emnlp.705</a>.'
  ieee: 'O. Draganov and S. Skiena, “The shape of word embeddings: Quantifying non-isometry
    with topological data analysis,” in <i>Findings of the Association for Computational
    Linguistics: EMNLP 2024</i>, Miami, FL, United States, 2024, pp. 12080–12099.'
  ista: 'Draganov O, Skiena S. 2024. The shape of word embeddings: Quantifying non-isometry
    with topological data analysis. Findings of the Association for Computational
    Linguistics: EMNLP 2024. EMNLP: Conference on Empirical Methods in Natural Language
    Processing, 12080–12099.'
  mla: 'Draganov, Ondrej, and Steven Skiena. “The Shape of Word Embeddings: Quantifying
    Non-Isometry with Topological Data Analysis.” <i>Findings of the Association for
    Computational Linguistics: EMNLP 2024</i>, Association for Computational Linguistics,
    2024, pp. 12080–99, doi:<a href="https://doi.org/10.18653/v1/2024.findings-emnlp.705">10.18653/v1/2024.findings-emnlp.705</a>.'
  short: 'O. Draganov, S. Skiena, in: Findings of the Association for Computational
    Linguistics: EMNLP 2024, Association for Computational Linguistics, 2024, pp.
    12080–12099.'
conference:
  end_date: 2024-11-16
  location: Miami, FL, United States
  name: 'EMNLP: Conference on Empirical Methods in Natural Language Processing'
  start_date: 2024-11-12
corr_author: '1'
date_created: 2025-02-04T16:19:28Z
date_published: 2024-11-01T00:00:00Z
date_updated: 2025-02-10T08:21:37Z
day: '01'
ddc:
- '500'
department:
- _id: GradSch
- _id: HeEd
doi: 10.18653/v1/2024.findings-emnlp.705
external_id:
  arxiv:
  - '2404.00500'
file:
- access_level: open_access
  checksum: f4416a5962194f0181ab0dc7f9ef93c0
  content_type: application/pdf
  creator: dernst
  date_created: 2025-02-10T08:20:34Z
  date_updated: 2025-02-10T08:20:34Z
  file_id: '19016'
  file_name: 2024_EMNLP_Draganov.pdf
  file_size: 1312638
  relation: main_file
  success: 1
file_date_updated: 2025-02-10T08:20:34Z
has_accepted_license: '1'
language:
- iso: eng
license: https://creativecommons.org/licenses/by/4.0/
month: '11'
oa: 1
oa_version: Published Version
page: 12080-12099
publication: 'Findings of the Association for Computational Linguistics: EMNLP 2024'
publication_status: published
publisher: Association for Computational Linguistics
quality_controlled: '1'
scopus_import: '1'
status: public
title: 'The shape of word embeddings: Quantifying non-isometry with topological data
  analysis'
tmp:
  image: /images/cc_by.png
  legal_code_url: https://creativecommons.org/licenses/by/4.0/legalcode
  name: Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
  short: CC BY (4.0)
type: conference
user_id: 2DF688A6-F248-11E8-B48F-1D18A9856A87
year: '2024'
...
