Tutorial #3

Title: From Detection to Direction: An Overview of Sound Event Localization and Detection

Presenters: Jun Wei Yeow and Ee-Leng TAN

Part I: Overview of Sound Event Localization and Detection (SELD) (30 minutes)

Introduction to SELD and its applications
History of SELD and its component tasks (Sound Event Detection and Sound Source Localization)
Recent advances and challenges in SELD
Publicly available SELD datasets

Part II: Core Technical Components of SELD (60 minutes)

Spatial audio formats used for SELD, including First Order Ambisonics, microphone array signals, and binaural recordings.
Contemporary feature extraction techniques that capture spatiotemporal cues needed for robust event detection and localization.
Deep learning architectures designed for SELD, including convolutional recurrent networks (CRNNs), transformer-based models, and multi-branch or multi-task setups.
Training strategies, such as multi-task learning (joint DOA and event classification), data augmentation for spatial audio, and domain adaptation techniques.
Benchmark datasets and metrics, including a deep dive into the DCASE Challenge series as well as evaluation criteria such as localization errors, detection accuracies, and combined SELD scores.
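
As a small illustration of the localization error mentioned in the last item, the sketch below computes the angle between a reference and an estimated direction of arrival (DOA). It is a minimal example under stated assumptions, not the official DCASE evaluation code: the function names are hypothetical, and it assumes DOAs are given as (azimuth, elevation) in degrees, with azimuth measured counter-clockwise in the horizontal plane and elevation measured upward from it.

```python
import numpy as np


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) pair in degrees to a 3-D unit vector.

    Assumed convention (not taken from the tutorial): x points to the front,
    y to the left, z up.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    return np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])


def angular_error_deg(reference, estimate):
    """Return the angle in degrees between two DOAs given as (azimuth, elevation)."""
    u = doa_to_unit_vector(*reference)
    v = doa_to_unit_vector(*estimate)
    # Clip guards against rounding pushing the dot product slightly outside [-1, 1].
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.rad2deg(np.arccos(cos_angle)))


if __name__ == "__main__":
    # A source straight ahead, estimated 10 degrees to the left.
    print(angular_error_deg((0.0, 0.0), (10.0, 0.0)))
```

Measuring the angle between unit vectors, rather than differencing azimuth and elevation separately, avoids wrap-around problems at ±180 degrees and the distortion of azimuth differences near the poles.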

Coffee Break (30 minutes)

Part III: Advanced and Emerging Topics (60 minutes)

Semi-supervised and weakly labelled learning approaches.
Robustness to reverberation, overlapping events, and unseen acoustic scenes.
Multi-modal SELD systems that integrate complementary modalities, such as video recordings or motion sensors.
Using acoustic scene classification (ASC) as a complementary task to improve SELD performance.

Coffee Break (30 minutes)

Part IV: Real-Time Implementation of SELD (40 minutes)