Tutorial #3

Title: From Detection to Direction: An Overview of Sound Event Localization and Detection

Presenters: Jun Wei Yeow and Ee-Leng TAN

Part I: Overview of Sound Event Localization and Detection (SELD) (30 minutes)

  • Introduction to SELD and its applications
  • History of SELD and its component tasks (Sound Event Detection and Sound Source Localization)
  • Recent advances and challenges in SELD
  • Publicly available SELD datasets

Part II: Core Technical Components of SELD (60 minutes)

  • Spatial audio formats used for SELD, including first-order Ambisonics (FOA), microphone array signals, and binaural recordings.
  • Contemporary feature extraction techniques that capture spatiotemporal cues needed for robust event detection and localization.
  • Deep learning architectures designed for SELD, including convolutional recurrent networks (CRNNs), transformer-based models, and multi-branch or multi-task setups.
  • Training strategies, such as multi-task learning (joint DOA and event classification), data augmentation for spatial audio, and domain adaptation techniques.
  • Benchmark datasets and metrics, including a deep dive into the DCASE Challenge series as well as evaluation criteria such as localization errors, detection accuracies, and combined SELD scores.

Coffee Break (30 minutes)

Part III: Advanced and Emerging Topics (60 minutes)

  • Semi-supervised and weakly labelled learning approaches.
  • Robustness to reverberation, overlapping events, and unseen acoustic scenes.
  • Multi-modal SELD systems that integrate complementary modalities, such as video recordings or motion sensors.
  • Using acoustic scene classification (ASC) to complement SELD performance.

Coffee Break (30 minutes)

Part IV: Real-Time Implementation of SELD (40 minutes)

  • Real-time constraints and considerations
  • Lightweight models suitable for real-time and edge applications
  • Discussion and Q&A session

Venue: Hibiscus III