Grand Challenge

Title: City and Time-Aware Semi-supervised Acoustic Scene Classification

Host organization

  1. Xi’an University of Posts & Telecommunications
  2. Xi’an Lianfeng Acoustic Technologies Co., Ltd., China
  3. Institute of Acoustics, Chinese Academy of Sciences, China
  4. University of Surrey, UK
  5. Northwestern Polytechnical University, China
  6. Singapore Institute of Technology, Singapore
  7. Nanyang Technological University, Singapore

Organizer

  • Dr. Jisheng Bai 1,2,∗ (baijs@xupt.edu.cn)
  • Bin Xiang 2
  • Dr. Mou Wang 3
  • Haohe Liu 4
  • Prof. Ying Liu 1
  • Prof. Jianfeng Chen 5
  • Prof. Dongyuan Shi 5
  • Prof. Mark D. Plumbley 4
  • Prof. Susanto Rahardja 6
  • Prof. Woon-Seng Gan 7

Website and Baseline

The official website is now available at: https://ascchallenge.xshengyun.com/
The baseline system has been updated on GitHub: https://github.com/JishengBai/APSIPA2025GC-ASC.

Challenge Schedule

Challenge Description

Acoustic scene classification (ASC) is a crucial research problem in computational audition that aims to recognize the unique acoustic characteristics of an environment. Despite substantial progress in ASC technology, existing approaches often treat acoustic scenes as static environments, ignoring the significant variations that occur across different cities and temporal contexts (time of day, day of week).

Acoustic characteristics of the same scene category can vary dramatically between different cities and times. For example, a public square can sound very different on a weekday morning than on a weekend evening, or in cities with distinct cultural characteristics. Current ASC systems that ignore these contextual factors struggle to generalize across such variations.

While the ICME 2024 challenge “Semi-supervised Acoustic Scene Classification under Domain Shift” made progress in addressing geographic domain shift, it did not explicitly incorporate city and time information. In this APSIPA ASC 2025 challenge, we provide participants with city-level location data and precise timestamps for each audio sample, encouraging the development of classification approaches that leverage this additional contextual information to improve accuracy.

This challenge maintains the semi-supervised learning approach from the previous challenge, as real-world applications often have access to abundant unlabeled data alongside limited labeled examples. Participants are encouraged to develop innovative methods that effectively utilize both labeled and unlabeled audio data in conjunction with their associated city and temporal metadata.

Overview of the APSIPA ASC 2025 Grand Challenge framework. The approach combines pre-training on existing datasets to learn general acoustic features, followed by semi-supervised learning that leverages both labeled and unlabeled data along with spatiotemporal metadata to improve classification performance across different spatial and temporal contexts.
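As an illustration of the semi-supervised stage described above, the sketch below shows one common self-training scheme: fit a classifier on the labeled data, pseudo-label the unlabeled pool, keep only high-confidence predictions, and refit. The nearest-centroid "classifier" and all function names here are simplifications for illustration, not part of the official baseline; a real entry would use an audio backbone and could additionally condition on the city/time metadata.

```python
import numpy as np

def centroids(X, y, n_classes):
    """Per-class mean feature vectors (toy stand-in for a trained classifier)."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict(C, X):
    """Nearest-centroid prediction with a softmax-over-negative-distance confidence."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1)   # (N, n_classes)
    p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    return p.argmax(axis=1), p.max(axis=1)

def pseudo_label_round(X_lab, y_lab, X_unl, n_classes, threshold=0.6):
    """One self-training round: add only confident pseudo-labels, then refit."""
    C = centroids(X_lab, y_lab, n_classes)
    y_hat, conf = predict(C, X_unl)
    keep = conf >= threshold
    X_new = np.concatenate([X_lab, X_unl[keep]])
    y_new = np.concatenate([y_lab, y_hat[keep]])
    return centroids(X_new, y_new, n_classes)

# Toy 2-class example with well-separated feature clusters
rng = np.random.default_rng(0)
X_lab = np.concatenate([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.concatenate([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
C = pseudo_label_round(X_lab, y_lab, X_unl, n_classes=2)
```

The confidence threshold controls the trade-off between using more unlabeled data and admitting noisy pseudo-labels; many entries in the previous challenge tuned exactly this kind of knob.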

Expected Impact and Significance

The challenge addresses a critical gap in current ASC research by introducing city and temporal information into the classification process. This advancement is particularly significant for developing more adaptable environmental sound analysis systems. This challenge explores how acoustic scenes vary across different Chinese cities and time periods, which are crucial factors often overlooked in real-world applications. The incorporation of city-level location data and timestamp metadata not only promotes novel semi-supervised learning approaches but also advances domain adaptation methods for handling acoustic variations across diverse urban environments and temporal contexts. From an industrial perspective, the outcomes of this challenge will directly benefit various real-world applications, including smart city monitoring, intelligent devices, and urban planning tools.

The challenge’s emphasis on semi-supervised learning addresses the practical constraint of limited labeled data, making the developed solutions more applicable to industry deployment. Moreover, by considering both city-specific characteristics and temporal patterns, the resulting models will be better equipped to handle the dynamic nature of acoustic environments, leading to more reliable and context-aware acoustic monitoring systems that can adapt to different cities and times of day.

Dataset

For the APSIPA ASC 2025 grand challenge “City and Time-Aware Semi-supervised Acoustic Scene Classification”, we provide a development dataset comprising approximately 24 hours of audio recordings from the Chinese Acoustic Scene (CAS) 2023 collection. This challenge introduces previously unutilized contextual metadata that accompanies each recording:

City information: Identification of the recording location among 22 diverse Chinese cities (e.g., Xi’an, Beijing, Shanghai)

Timestamp information: Precise recording time accurate to year, month, day, hour, minute, and second

The CAS 2023 dataset was systematically collected from April to September 2023 across cities that span diverse geographical regions, urban scales, and cultural characteristics of China. It includes recordings from 10 distinct acoustic scene categories: Bus, Airport, Metro, Restaurant, Shopping mall, Public square, Urban park, Traffic street, Construction site, and Bar. While the audio files are the same as those used in the ICME 2024 challenge, the city and time metadata creates new opportunities for participants to develop models that better understand contextual variations in acoustic environments. Following a semi-supervised learning paradigm, the development dataset contains a limited amount of labeled data (approximately 4 hours) and a larger portion of unlabeled data (approximately 20 hours).
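One simple way to expose this metadata to a model (illustrative only; the function name, city list, and timestamp format below are assumptions, not challenge-provided tooling) is to one-hot encode the city and map the timestamp onto cyclical sine/cosine features, so that 23:00 and 00:00, or Sunday and Monday, end up close in feature space:

```python
import math
from datetime import datetime

CITIES = ["Xi'an", "Beijing", "Shanghai"]  # in practice, all 22 cities

def encode_metadata(city: str, timestamp: str) -> list[float]:
    """City one-hot plus cyclical encodings of hour-of-day and day-of-week."""
    city_vec = [1.0 if c == city else 0.0 for c in CITIES]
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")  # assumed format
    hour = t.hour + t.minute / 60.0
    hour_vec = [math.sin(2 * math.pi * hour / 24),
                math.cos(2 * math.pi * hour / 24)]
    dow = t.weekday()  # 0 = Monday
    dow_vec = [math.sin(2 * math.pi * dow / 7),
               math.cos(2 * math.pi * dow / 7)]
    return city_vec + hour_vec + dow_vec

feat = encode_metadata("Beijing", "2023-06-17 18:30:00")
```

Such a vector can be concatenated with, or used to condition, the acoustic embedding; the cyclical encoding is what lets a model treat a weekend evening differently from a weekday morning without a hard discontinuity at midnight.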

Additional Information

  1. Pre-training dataset restriction: Only the TAU Urban Acoustic Scenes 2020 Mobile Development dataset and the CochlScene dataset are allowed for model pre-training. This ensures fair comparison between approaches by standardizing the external data sources and prevents participants from using proprietary or unreleased datasets.
  2. No model ensembles: Model ensembles are NOT allowed in this competition. This focuses the challenge on developing individual models that effectively incorporate city and time information rather than boosting performance through ensemble techniques.
  3. No large audio and audio-language models: Large pre-trained models such as Qwen-Audio, Whisper, LTU, etc., are NOT allowed. This ensures that improvements come from the effective use of city and time information rather than leveraging massive pre-trained models.