M2 Internship - On-device Speaker Diarization @ Ava France

Introduction

In a conversation with several people, deaf and hearing-impaired people struggle to follow who is saying what. As a result, they often feel excluded from everyday social interactions, whether in a work meeting with colleagues (in person or remote) or in a bar with friends.

Ava aims to help 450M deaf and hearing-impaired people live a fully accessible life. We provide an app that shows, in real time, who is speaking and what they are saying. To do this, the app relies on a hybrid combination of AI and audio signal processing, which lets it work in a variety of situations.

The core of the app is a speaker diarization system, i.e. a system that determines who is speaking and when, followed by a speech-to-text step that provides the transcription to the user. The diarization system currently requires heavy computation, and we would like to study how quantizing and distilling the machine learning models that power it affect both the diarization error rate and the inference speed.

About this Internship

Internship topic - On-device Speaker Diarization

Recent advances in hardware acceleration have made it possible to run machine learning models on mobile and edge devices. Accordingly, the popular machine learning frameworks TensorFlow and PyTorch have added support for running inference on such devices. Demand is also growing for running machine learning models that handle sensitive data (images, speech, etc.) on edge and mobile devices, driven in particular by internet users' rising concerns about privacy. Moreover, running AI models on edge devices can improve processing latency and overall energy consumption.

Most state-of-the-art online speaker diarization systems (see, e.g., [1, 2]), which tell “who spoke when” among multiple conversation participants, are resource-intensive: they can be run in the cloud, for example, but not on low-resource edge and mobile devices. This internship aims to bring such systems to mobile devices. Running machine learning models on mobile devices usually requires quantizing and/or pruning [3] the models in order to speed up computation on the device’s CPU or GPU without draining the battery. This internship will explore different quantization and pruning techniques and the trade-offs they introduce between model accuracy and inference speed. The tasks will be a mix of engineering and research, and the student will mainly:

Quantize and prune already-trained segmentation and embedding neural networks [4, 5] (post-training static and dynamic quantization) and deploy them to mobile devices.
Train quantized and pruned versions of the segmentation and embedding neural networks from scratch (quantization-aware training).
Explore knowledge-distillation-based training of so-called slimmable neural networks [6, 7], which let on-device models instantly trade off resources against performance.
Bonus: If time permits and the student is interested, they will write primitives for running diarization on mobile devices (Android or iOS, depending on the student's programming skills).
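
To give a feel for the accuracy side of these trade-offs, here is a minimal sketch in pure Python of symmetric per-tensor int8 quantization and magnitude-based pruning applied to a toy weight vector. It is an illustration only, not Ava's pipeline: the function names, the weight values and the 50% sparsity level are all assumptions made for the example, and real models would go through a framework's own quantization and pruning tooling.

```python
# Illustrative sketch only: toy weights and hypothetical helper names.

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Toy weight vector (assumed values, for illustration).
weights = [0.82, -0.05, 0.31, -1.27, 0.002, 0.64]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding to the nearest level bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

pruned = prune_by_magnitude(weights, sparsity=0.5)
```

In this toy setting the quantization error per weight stays within half a quantization step, while pruning removes the smallest weights outright. How such perturbations translate into diarization error rate is precisely what the internship would measure.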

References:

[1] Coria, Juan M., Hervé Bredin, Sahar Ghannay, and Sophie Rosset. "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation". arXiv:2109.06483

[2] Bredin, Hervé, and Antoine Laurent. "End-to-end speaker segmentation for overlap-aware resegmentation". arXiv:2104.04045