In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent developments in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.
Nonetheless, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it is not clear who said what. During the Made By Google event this year, we announced the “speaker labels” feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., “Speaker 1”, “Speaker 2”, etc.) in real time during the recording. It significantly improves the readability and usability of the recording transcripts. This feature is powered by Google’s new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.
Left: Recorder transcript without speaker labels. Right: Recorder transcript with speaker labels.
System Architecture
Our speaker diarization system leverages several highly optimized machine learning models and algorithms to diarize hours of audio in a real-time streaming fashion with the limited computational resources of mobile devices. The system consists mainly of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that assigns speaker labels to each speaker turn in a highly efficient way. All components run fully on the device.
Architecture of the Turn-to-Diarize system.
Detecting Speaker Turns
The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts the acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike previous customized systems that use role-specific tokens (e.g., <doctor> and <patient>) for conversations, this model is more generic and can be trained on and deployed to various application domains.
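To make the role of the <st> token concrete, here is a minimal sketch of how a token stream augmented with <st> could be grouped into speaker turns. The function name `split_into_turns` and the sample transcript are hypothetical, chosen only for illustration; the real model emits tokens from acoustic features rather than from a ready-made string.

```python
ST = "<st>"  # the speaker turn token named in the text


def split_into_turns(tokens):
    """Group a flat token sequence into speaker turns, splitting at each <st>.

    Hypothetical helper: it only illustrates what the <st> token encodes.
    """
    turns, current = [], []
    for token in tokens:
        if token == ST:
            if current:  # ignore empty turns from leading or repeated <st>
                turns.append(current)
            current = []
        else:
            current.append(token)
    if current:
        turns.append(current)
    return turns


# Made-up transcript with two speaker changes.
tokens = "how are you <st> fine thanks <st> good to hear".split()
turns = split_into_turns(tokens)
```

Note that the token marks only a change of speaker, not who is speaking, which is why a separate clustering step is still needed to assign labels.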
In most applications, the output of a diarization system is not shown directly to users, but is combined with a separate automatic speech recognition (ASR) system that is trained to have a lower word error rate. Therefore, for the diarization system, we are relatively more tolerant of word token errors than of errors on the <st> token. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high accuracy on predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.
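The intuition that <st> errors matter more than word errors can be illustrated with a weighted edit distance. The sketch below is not the loss function from the paper; it is an assumed, simplified stand-in in which any edit touching <st> costs more than an edit touching an ordinary word. The function name and the cost values (`st_cost=5.0`, `word_cost=1.0`) are hypothetical.

```python
def weighted_edit_distance(ref, hyp, st_cost=5.0, word_cost=1.0, st="<st>"):
    """Edit distance where errors involving the <st> token cost more.

    Illustrative only: the costs are assumptions, not values from the source.
    """
    def cost(token):
        return st_cost if token == st else word_cost

    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(ref[i - 1])  # delete a reference token
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(hyp[j - 1])  # insert a hypothesis token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                substitute = d[i - 1][j - 1]
            else:
                # A substitution is charged at the higher of the two token costs.
                substitute = d[i - 1][j - 1] + max(cost(ref[i - 1]), cost(hyp[j - 1]))
            delete = d[i - 1][j] + cost(ref[i - 1])
            insert = d[i][j - 1] + cost(hyp[j - 1])
            d[i][j] = min(substitute, delete, insert)
    return d[m][n]


reference = "hi there <st> hello".split()
word_error = weighted_edit_distance(reference, "hi their <st> hello".split())
turn_error = weighted_edit_distance(reference, "hi there hello".split())
```

Under these assumed costs, a hypothesis that misrecognizes one word is penalized less than one that misses the speaker turn, which is the ordering the paragraph above describes.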
Extracting Voice Characteristics
Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work that extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment containing speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signal from the speaker. It also reduces the total number of embeddings to be clustered, thus making the clustering step less expensive. These embeddings are processed entirely on-device until speaker labeling of the transcript is completed, and then deleted.
Multi-Stage Clustering
After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors and assign a speaker label to each. However, since audio recordings from the Recorder app can be as short as a few seconds or as long as 18 hours, it is critical for the clustering algorithm to handle sequences of drastically different lengths.
For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and use the eigen-gap criterion for accurate speaker count estimation. For long sequences, we reduce computational cost by using AHC to pre-cluster the sequence before feeding it to the main algorithm. During streaming, we keep a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the entire system with constant time and space complexity.
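The stage selection and the eigen-gap idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the production logic: the thresholds `short_max` and `medium_max`, the cap `max_speakers`, and both function names are hypothetical, and the eigen-gap variant shown (largest difference between consecutive eigenvalues sorted in descending order) is one common formulation that may differ from the one used in the system.

```python
def choose_strategy(num_turns, multiple_speakers, short_max=40, medium_max=300):
    """Pick a clustering stage from the sequence length.

    The thresholds are hypothetical; the source does not state them.
    """
    if not multiple_speakers:
        return "single_speaker"  # no speaker change detected, nothing to cluster
    if num_turns <= short_max:
        return "ahc"  # fallback algorithm for short sequences
    if num_turns <= medium_max:
        return "spectral"  # main algorithm for medium-length sequences
    return "ahc_precluster_then_spectral"  # compress first, then run the main algorithm


def estimate_speaker_count(eigenvalues, max_speakers=8):
    """Estimate the speaker count as the position of the largest eigen-gap.

    Assumes the eigenvalues come from an affinity matrix, so large values
    correspond to well-separated clusters.
    """
    values = sorted(eigenvalues, reverse=True)
    limit = min(max_speakers, len(values) - 1)
    if limit < 1:
        return 1
    gaps = [values[k - 1] - values[k] for k in range(1, limit + 1)]
    return gaps.index(max(gaps)) + 1


strategy = choose_strategy(num_turns=120, multiple_speakers=True)
speaker_count = estimate_speaker_count([3.1, 2.9, 0.4, 0.3, 0.2])
```

In the made-up eigenvalue list, the drop from 2.9 to 0.4 is by far the largest, so the estimate is two speakers.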
This multi-stage clustering strategy is a critical optimization for on-device applications where the budget for CPU, memory, and battery is very small, and it allows the system to run in a low-power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.
Diagram of the multi-stage clustering strategy.
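The configurable upper bound can be pictured as a cap on how many centroids are handed to the main clustering algorithm. The sketch below is an assumed, simplified version of AHC pre-clustering: it repeatedly merges the two most similar centroids (by cosine similarity) until at most `max_size` remain. The function names, the count-weighted centroid merge, and the naive pairwise search are illustrative choices rather than details from the source, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def precluster(embeddings, max_size):
    """Merge the closest centroids until at most max_size remain.

    Returns a list of (centroid, member_count) pairs. max_size plays the
    role of the configurable upper bound (hypothetical parameter name).
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    clusters = [(list(e), 1) for e in embeddings]
    while len(clusters) > max_size:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                similarity = cosine(clusters[i][0], clusters[j][0])
                if best is None or similarity > best[0]:
                    best = (similarity, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the member-count-weighted mean of the pair.
        merged = [(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)]
        clusters[i] = (merged, ni + nj)
        del clusters[j]
    return clusters


# Four made-up 2-D embeddings forming two obvious groups.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
compressed = precluster(embeddings, max_size=2)
```

Because the output size never exceeds `max_size`, the cost of the subsequent clustering call stays bounded no matter how long the recording is; a smaller cap trades some quality for less computation.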
Correction and Customization
In our real-time streaming speaker diarization system, as the model consumes more audio input, it accumulates confidence in its predicted speaker labels, and may occasionally correct previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.
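As a rough illustration of such corrections, the sketch below compares the labels from the previous clustering call with the latest ones and returns only the turns whose label is new or has changed, which is what a UI would need to refresh. The function name `label_updates` and the data layout (one integer label per speaker turn) are assumptions made for this example, not a description of the Recorder implementation.

```python
def label_updates(previous, latest):
    """Return {turn_index: label} for turns that are new or were relabeled.

    Hypothetical helper; both arguments hold one speaker label per turn.
    """
    updates = {}
    for index, label in enumerate(latest):
        if index >= len(previous) or previous[index] != label:
            updates[index] = label
    return updates


previous_labels = [1, 1, 2]    # labels shown so far
latest_labels = [1, 2, 2, 1]   # labels after consuming more audio
updates = label_updates(previous_labels, latest_labels)
```

Here the second turn is corrected from Speaker 1 to Speaker 2 and a fourth turn is appended, so only those two entries need to be redrawn.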
At the same time, the Recorder app’s UI allows the user to rename the anonymous speaker labels (e.g., “Speaker 2”) to customized labels (e.g., “car dealer”), which makes each recording easier to read and remember.
Recorder allows the user to rename the speaker labels for better readability.
Future Work
Currently, our diarization system mostly runs on the CPU block of Google Tensor, Google’s custom-built chip that powers more recent Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another direction for future work is to leverage the multilingual capabilities of speaker encoder and speech recognition models to expand this feature to more languages.
Acknowledgments
The work described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.