Notes on the Diart paper: https://arxiv.org/abs/2109.06483

Context

Speaker diarization is the task of determining “who speaks when”: given an audio recording, produce time segments labeled by speaker (e.g. speaker A from 0.0 s to 2.5 s, speaker B from 2.1 s to 4.0 s).

multi-stage vs end-to-end

  1. multi-stage (a minimal pipeline sketch follows this list)
    • voice activity detection to discard non-speech segments
    • speaker embedding extraction to obtain a vector representation of each speech segment
    • clustering to group the embeddings by speaker
  2. end-to-end
    • BLSTM (bi-directional LSTM: two passes over the sequence, one from past to future and one from future to past)
    • self-attention mechanism
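
To make the contrast concrete, here is a minimal sketch of the multi-stage pipeline. The VAD and embedding functions are toy stand-ins (real systems use neural models such as x-vectors); the clustering step uses scikit-learn (>= 1.2 for the `metric` argument).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def detect_speech(audio, sr, frame_s=0.5, thresh=0.01):
    """Toy energy-based VAD: keep 0.5 s frames whose RMS exceeds a threshold."""
    hop = int(frame_s * sr)
    return [(i / sr, (i + hop) / sr)
            for i in range(0, len(audio) - hop, hop)
            if np.sqrt(np.mean(audio[i:i + hop] ** 2)) > thresh]

def embed_segment(audio, sr, start, end):
    """Toy spectral embedding; a real system would use a neural x-vector/d-vector."""
    seg = audio[int(start * sr):int(end * sr)]
    spec = np.abs(np.fft.rfft(seg, n=512))[:128]
    return spec / (np.linalg.norm(spec) + 1e-8)

def multi_stage_diarization(audio, sr, n_speakers):
    segments = detect_speech(audio, sr)                      # 1. VAD
    embeddings = np.stack([embed_segment(audio, sr, s, e)    # 2. embeddings
                           for s, e in segments])
    labels = AgglomerativeClustering(n_clusters=n_speakers,  # 3. clustering
                                     metric="cosine",
                                     linkage="average").fit_predict(embeddings)
    return list(zip(segments, labels))
```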

The main challenge for the multi-stage approach is overlapping speech: embeddings extracted from overlapping regions mix multiple speakers, and clustering assigns each segment to exactly one of them. The main challenges for the end-to-end approach are handling an unknown number of speakers and scaling to long recordings.

Diart

The Diart paper takes a hybrid approach that combines the advantages of both multi-stage and end-to-end. It leans toward the multi-stage design, but treats overlapping speech as a first-class citizen.

  1. split the audio into small chunks and run an end-to-end model within each chunk.
  2. apply global constrained clustering on top of the resulting local speakers (a usage sketch of the accompanying library follows this list).
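
The paper’s companion diart library implements this streaming hybrid pipeline end to end. A minimal usage sketch following the project README (class names such as `SpeakerDiarization` and `StreamingInference` are from recent releases and may differ in older versions):

```python
from diart import SpeakerDiarization
from diart.inference import StreamingInference
from diart.sources import MicrophoneAudioSource

# Steps 1 and 2 (chunk-wise end-to-end segmentation + incremental clustering)
# both happen inside the pipeline object.
pipeline = SpeakerDiarization()
mic = MicrophoneAudioSource()
inference = StreamingInference(pipeline, mic)
prediction = inference()  # blocks until the stream ends; returns the global diarization
```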

Segmentation

Every few hundred milliseconds, the segmentation module performs fine-grained, overlap-aware speaker diarization on a short rolling window. It is an end-to-end model.
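
Computationally, this amounts to sliding a short window over the stream and running the end-to-end model at each position. A rough sketch, where `model` is a hypothetical stand-in for the pre-trained segmentation network (the paper builds on pyannote’s end-to-end segmentation model):

```python
import numpy as np

def segment_stream(audio, sr, model, window_s=5.0, step_s=0.5):
    """Slide a 5 s window with a 500 ms hop. `model` is a hypothetical
    end-to-end segmentation network mapping a chunk to an array of shape
    (num_frames, num_local_speakers) of speech probabilities, so several
    speakers can be active in the same frame (overlap-aware)."""
    win, hop = int(window_s * sr), int(step_s * sr)
    for start in range(0, len(audio) - win + 1, hop):
        yield start / sr, model(audio[start:start + win])
```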

Incremental Clustering

The local diarization output is then ingested by the incremental clustering module:

  1. A TDNN (time-delay neural network) is used to extract the speaker embeddings.
  2. Given the initial content of the rolling buffer, the segmentation and embedding steps are combined to extract one embedding per speaker from the first 5 seconds of audio.
  3. Every time the rolling buffer is updated, the incremental clustering module is triggered to update the speaker embeddings and the speaker cluster assignments.
  4. New speakers are detected via hyper-parameters that control whether a local speaker is matched to a returning speaker or registered as a new one (a simplified sketch follows this list).
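
A simplified sketch of that matching logic: one running centroid per global speaker, nearest-centroid assignment by cosine distance, and a threshold (called delta_new in the diart library) that decides between “returning speaker” and “new speaker”. This illustrates the idea, not the paper’s exact constrained-clustering algorithm:

```python
import numpy as np

class IncrementalClustering:
    """Match each local speaker embedding to its nearest global centroid,
    or open a new cluster when the best cosine distance exceeds delta_new."""

    def __init__(self, delta_new=0.5):
        self.delta_new = delta_new
        self.centroids = []  # one running (unnormalized) centroid per global speaker

    @staticmethod
    def _cosine_dist(a, b):
        return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def assign(self, embedding):
        if self.centroids:
            dists = [self._cosine_dist(embedding, c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] <= self.delta_new:      # returning speaker
                self.centroids[best] += embedding  # update the running centroid
                return best
        self.centroids.append(embedding.copy())    # new speaker
        return len(self.centroids) - 1
```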