Session: Leveraging LLMs in Interdisciplinary Study of Speech and Language

Moderator:

Prof. Tan Lee, School of Data Science, The Chinese University of Hong Kong, Shenzhen

Speakers:

Jiahong Yuan
Interdisciplinary Research Center for Linguistic Sciences
University of Science and Technology of China
https://fusep.ustc.edu.cn/2025/01/15/jiahong-yuan/

Title: Normalization through Fine-tuning: Understanding Wav2vec2.0 Embeddings for Phonetic Analysis

Phonetic normalization is essential in speech recognition and analysis, ensuring the comparability of features extracted from raw audio data. In the current paradigm of fine-tuning pre-trained large transformer models, however, it is not treated as a necessary preprocessing step; rather, it is implicitly carried out within the models themselves. This study examines how normalization emerges in transformer-based speech models, with a focus on Wav2vec2.0. Understanding this process is a crucial step toward effectively leveraging model embeddings as phonetic features for speech analysis. Our results show that fine-tuning Wav2vec2.0 can achieve phonetic normalization by selectively suppressing irrelevant information, such as speaker sex, while enhancing task-relevant information, such as tones and finals. In multi-task learning, however, the embeddings of fine-tuned models retain information for each task without compromising performance, suggesting that suppressing task-irrelevant information is not strictly required for effective classification. These results demonstrate that phonetic normalization can be flexibly achieved in speech models, providing insights into potential parallels in human speech perception.

Mark Liberman
Department of Linguistics
University of Pennsylvania, USA
https://www.ling.upenn.edu/~myl/

Title: Speech-To-Text in Clinical and Educational Applications

Automatic Speech-To-Text systems now work very well, at least for certain kinds of speech and certain kinds of applications. Since today's systems are mostly trained on read or rehearsed speech, and are meant to produce readable transcripts, they don't do well with disfluencies, mispronunciations, and other features that are frequent in spontaneous speech or non-fluent reading. In some clinical and educational applications, correct recognition and classification of these features is crucial. This talk will describe the issues and suggest next steps.

Chi-Chun Lee
Department of Electrical Engineering
National Tsing Hua University
https://biic.ee.nthu.edu.tw/biicers.php#

Title: Enabling Internationalization of Affective Speech Technology using LLMs

Affective speech technology aspires to equip machines with the ability to sense, interpret, and generate emotionally expressive speech, enabling empathetic assistants, social robots, and digital health companions. Large Audio/Speech Language Models (LALMs/SpeechLMs) now dominate this space: a single model can perform speech recognition, affect detection, and emotion-controlled synthesis, achieving impressive zero-shot generalization. However, we argue that LALMs are not yet internationalized: culturally grounded affect is misread when training data are skewed, leading to mis-recognition of affect, culturally inappropriate responses, and uneven user experiences. This talk surveys the current state of affective speech processing with LALMs, cataloging leading models, their sensing-to-synthesis capabilities, and the databases and metrics used for evaluation. We identify the key obstacle to responsible deployment: the heterogeneity of human vocal expression across cultures, which manifests as data scarcity, model bias, and evaluation blind spots. To address this gap, we propose a research agenda comprising: (i) systematic analysis of cultural variation in vocal affect, (ii) computational strategies for contextualizing LALMs toward culturally sensitive emotion processing, and (iii) benchmarks featuring balanced corpora and culture-aware metrics. By charting these directions, we aim to advance affective speech technology that is globally robust, socially responsible, and truly inclusive.

Emmanuele Chersoni
Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University, Hong Kong
https://research.polyu.edu.hk/en/persons/emmanuele-chersoni

Title: Can Large Language Models Help in Psycholinguistic Data Collection?

In linguistics and Natural Language Processing (NLP), it is common practice to collect linguistic data from speakers via surveys and interviews. The recent wave of interest in the performance of Large Language Models (LLMs) has sparked a debate about whether such data can be replaced with the automatic generations of machine participants (Kim and Lee, 2023; Kuzman et al., 2023; Pavlovic and Poesio, 2024; Kapania et al., 2025), using prompts that closely resemble the questions asked to human participants.

In our contribution, we discuss the use of LLMs to collect a type of annotation commonly used in linguistic research: psycholinguistic norms. If norms can be automatically acquired via LLM prompting, this could spare researchers the need for extensive data collection and simplify the acquisition of new datasets for low-resource languages. We will describe a few recent studies that reported somewhat contrasting results (Brysbaert et al., 2024; Xu et al., 2025; Peng et al., 2025), and we will try to explain the contrast in light of the current debate about the advantages and limitations of LLMs.