# UIS-RNN

**Repository Path**: mirrors/UIS-RNN

## Basic Information

- **Project Name**: UIS-RNN
- **Description**: Google's AI research division has made new progress in speech recognition: the ability to distinguish voices in noisy environments
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: https://www.oschina.net/p/uis-rnn
- **GVP Project**: No

## Statistics

- **Stars**: 8
- **Forks**: 6
- **Created**: 2018-11-13
- **Last Updated**: 2025-12-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# UIS-RNN

[![Python application](https://github.com/google/uis-rnn/workflows/Python%20application/badge.svg)](https://github.com/google/uis-rnn/actions/workflows/pythonapp.yml)
[![PyPI Version](https://img.shields.io/pypi/v/uisrnn.svg)](https://pypi.python.org/pypi/uisrnn)
[![Python Versions](https://img.shields.io/pypi/pyversions/uisrnn.svg)](https://pypi.org/project/uisrnn)
[![Downloads](https://static.pepy.tech/badge/uisrnn)](https://pepy.tech/project/uisrnn)
[![codecov](https://codecov.io/gh/google/uis-rnn/branch/master/graph/badge.svg)](https://codecov.io/gh/google/uis-rnn)
[![Documentation](https://img.shields.io/badge/api-documentation-blue.svg)](https://google.github.io/uis-rnn)

## Overview

This is the library for the *Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN)* algorithm. UIS-RNN solves the problem of segmenting and clustering sequential data by learning from examples.

This algorithm was originally proposed in the paper [Fully Supervised Speaker Diarization](https://arxiv.org/abs/1810.04719), and was introduced on the [Google AI Blog](https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html).

![gif](https://raw.githubusercontent.com/google/uis-rnn/master/resources/uisrnn.gif)

## Disclaimer

This open source implementation is slightly different from the internal one which we used to produce the results in the [paper](https://arxiv.org/abs/1810.04719), due to dependencies on some internal libraries.

We CANNOT share the data, code, or model for the speaker recognition system ([d-vector embeddings](https://google.github.io/speaker-id/publications/GE2E/)) used in the paper, since the speaker recognition system heavily depends on Google's internal infrastructure and proprietary data.

**This library is NOT an official Google product.**

We welcome community contributions ([guidelines](CONTRIBUTING.md)) to the [`uisrnn/contrib`](uisrnn/contrib) folder. However, we are not responsible for the correctness of any community contributions.

## Dependencies

This library depends on:

* python 3.5+
* numpy 1.15.1
* pytorch 1.3.0
* scipy 1.1.0 (for evaluation only)

## Getting Started

[![YouTube](resources/YouTube_lecture_screenshot.png)](https://www.youtube.com/watch?v=pGkqwRPzx9U)

### Install the package

Without downloading the repository, you can install the [package](https://pypi.org/project/uisrnn/) by:

```
pip3 install uisrnn
```

or

```
python3 -m pip install uisrnn
```

### Run the demo

To get started, simply run this command:

```bash
python3 demo.py --train_iteration=1000 -l=0.001
```

This will train a UIS-RNN model using `data/toy_training_data.npz`, then store the model on disk, perform inference on `data/toy_testing_data.npz`, print the inference results, and save the averaged accuracy in a text file.

P.S. The files under `data/` are manually generated *toy data*, for demonstration purposes only. This data is very simple, so the model is expected to achieve 100% accuracy on the testing data.
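For reference, here is a rough sketch of the flow implemented by `demo.py`, using the core APIs described later in this README. The `.npz` key names and the saved-model file name are assumptions based on the demo script, not a documented interface:

```python
import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
training_args.train_iteration = 1000  # mirrors --train_iteration=1000
training_args.learning_rate = 0.001   # mirrors -l=0.001

# Key names below are assumed to match what demo.py reads.
train_data = np.load('data/toy_training_data.npz', allow_pickle=True)
test_data = np.load('data/toy_testing_data.npz', allow_pickle=True)
train_sequence = train_data['train_sequence']      # 2-dim float array
train_cluster_id = train_data['train_cluster_id']  # 1-dim array of string labels

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
model.save('saved_uisrnn_model.uisrnn')  # hypothetical file name

# Run inference on the first toy test utterance.
test_sequences = test_data['test_sequences'].tolist()
predicted_cluster_ids = model.predict(test_sequences[0], inference_args)
print(predicted_cluster_ids)  # a list of integers, one per observation
```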
### Run the tests

You can also verify the correctness of this library by running:

```bash
bash run_tests.sh
```

If you fork this library and make local changes, be sure to use these tests as a sanity check. These tests are also great examples for learning the APIs, especially `tests/integration_test.py`.

## Core APIs

### Glossary

| General Machine Learning | Speaker Diarization |
|--------------------------|---------------------|
| Sequence | Utterance |
| Observation / Feature | Embedding / d-vector |
| Label / Cluster ID | Speaker |

### Arguments

In your main script, call this function to get the arguments:

```python
model_args, training_args, inference_args = uisrnn.parse_arguments()
```

### Model construction

All algorithms are implemented as the `UISRNN` class. First, construct a `UISRNN` object by:

```python
model = uisrnn.UISRNN(args)
```

The definitions of the args are described in `uisrnn/arguments.py`. See `model_parser`.

### Training

Next, train the model by calling the `fit()` function:

```python
model.fit(train_sequences, train_cluster_ids, args)
```

The definitions of the args are described in `uisrnn/arguments.py`. See `training_parser`.

The `fit()` function accepts two types of input, as described below. A code sketch illustrating both formats appears at the end of this section.

#### Input as list of sequences (recommended)

Here, `train_sequences` is a list of observation sequences. Each observation sequence is a 2-dim numpy array of type `float`.

* The first dimension is the length of this sequence, which can vary from one sequence to another.
* The second dimension is the size of each observation, which must be consistent among all sequences.

For speaker diarization, the observation could be the [d-vector embeddings](https://google.github.io/speaker-id/publications/GE2E/).

`train_cluster_ids` is also a list, with the same length as `train_sequences`. Each element of `train_cluster_ids` is a 1-dim list or numpy array of strings, containing the ground truth labels for the corresponding sequence in `train_sequences`. For speaker diarization, these labels are the speaker identifiers for each observation.

When calling `fit()` in this way, please be very careful with the argument `--enforce_cluster_id_uniqueness`. For example, assume:

```python
train_cluster_ids = [['a', 'b'], ['a', 'c']]
```

If the label `'a'` from the two sequences refers to the same cluster across the entire dataset, then we should have `enforce_cluster_id_uniqueness=False`; otherwise, if `'a'` is only a local indicator used to distinguish it from `'b'` in the 1st sequence, and from `'c'` in the 2nd sequence, then we should have `enforce_cluster_id_uniqueness=True`.

Also, please note that when calling `fit()` in this way, we concatenate all sequences and all cluster IDs internally, and delegate to the logic described in the next section.

#### Input as single concatenated sequence

Here, `train_sequences` should be a single 2-dim numpy array of type `float`, for the **concatenated** observation sequences.

For example, suppose you have *M* training utterances, each utterance is a sequence of *L* embeddings, and each embedding is a vector of *D* numbers. Then the shape of `train_sequences` is *N * D*, where *N = M * L*.

`train_cluster_ids` is a 1-dim list or numpy array of strings, of length *N*. It is the **concatenated** ground truth labels of all training data.

Since we are concatenating observation sequences, it is important that the ground truth labels in `train_cluster_ids` are **globally unique** across different sequences.

For example, if the set of labels in the first sequence is `{'A', 'B', 'C'}`, and the set of labels in the second sequence is `{'B', 'C', 'D'}`, then before concatenation we should rename them to something like `{'1_A', '1_B', '1_C'}` and `{'2_B', '2_C', '2_D'}`, unless `'B'` and `'C'` in the two sequences are meaningfully identical (in speaker diarization, this means they are the same speakers across utterances). This renaming is handled automatically by the argument `--enforce_cluster_id_uniqueness` when you use the list-of-sequences input from the previous section.

The reason we concatenate all training sequences is that we resample and *block-wise* shuffle the training data as a **data augmentation** process, which yields a robust model even when the number of training sequences is insufficient.

#### Training on large datasets

For large datasets, it is usually not possible to load all the data into memory at once. In such cases, the `fit()` function needs to be called multiple times. Here we provide a few guidelines as our suggestions:

1. Do not feed different datasets into different calls of `fit()`. Instead, the input of each call to `fit()` should cover sequences from different datasets.
2. Make the input size roughly the same across calls to `fit()`, and do not make the input size too small.
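To make the two input formats concrete, here is a minimal, hypothetical sketch. The random arrays, label strings, and tiny iteration count are placeholders for real training data; `observation_dim` is one of the model args defined in `uisrnn/arguments.py`:

```python
import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
training_args.train_iteration = 10  # tiny value, smoke test only

# --- Input as list of sequences (recommended) ---
# Two hypothetical utterances; lengths may differ, but the observation
# size must equal model_args.observation_dim for every sequence.
dim = model_args.observation_dim
train_sequences = [
    np.random.rand(12, dim).astype(float),
    np.random.rand(9, dim).astype(float),
]
# One string label per observation. Here 'a' is only a local indicator
# within each sequence, so enforce_cluster_id_uniqueness must be True.
train_cluster_ids = [
    ['a'] * 6 + ['b'] * 6,
    ['a'] * 5 + ['c'] * 4,
]
model = uisrnn.UISRNN(model_args)
model.fit(train_sequences, train_cluster_ids, training_args)

# --- Equivalent input as a single concatenated sequence ---
# One (N, dim) array plus N labels made globally unique by hand:
concatenated_sequence = np.concatenate(train_sequences, axis=0)
concatenated_ids = ['1_a'] * 6 + ['1_b'] * 6 + ['2_a'] * 5 + ['2_c'] * 4
# model.fit(concatenated_sequence, concatenated_ids, training_args)
```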
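Following these guidelines, a chunked training loop might look like the sketch below; `load_mixed_chunks()` is a hypothetical helper standing in for your own storage format:

```python
import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
training_args.train_iteration = 10  # tiny value, smoke test only
dim = model_args.observation_dim
model = uisrnn.UISRNN(model_args)

def load_mixed_chunks(num_chunks=3):
    """Hypothetical loader: each chunk mixes sequences from all datasets
    (guideline 1) and all chunks are similar in size (guideline 2).
    Random data stands in for real embeddings here."""
    for _ in range(num_chunks):
        sequences = [np.random.rand(20, dim).astype(float) for _ in range(4)]
        cluster_ids = [['spk1'] * 10 + ['spk2'] * 10 for _ in range(4)]
        yield sequences, cluster_ids

for train_sequences, train_cluster_ids in load_mixed_chunks():
    # Each call of fit() sees a similarly-sized, dataset-mixed batch.
    model.fit(train_sequences, train_cluster_ids, training_args)
```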
### Prediction

Once we are done with training, we can run the trained model to perform inference on new sequences by calling the `predict()` function:

```python
predicted_cluster_ids = model.predict(test_sequences, args)
```

Here `test_sequences` should be a list of 2-dim numpy arrays of type `float`, corresponding to the observation sequences for testing.

The returned `predicted_cluster_ids` is a list of the same size as `test_sequences`. Each element of `predicted_cluster_ids` is a list of integers, with the same length as the corresponding test sequence.

You can also pass a single test sequence as `test_sequences`. In that case, the returned `predicted_cluster_ids` will also be a single list of integers.

The definitions of the args are described in `uisrnn/arguments.py`. See `inference_parser`.

## Citations

Our paper is cited as:

```
@inproceedings{zhang2019fully,
  title={Fully supervised speaker diarization},
  author={Zhang, Aonan and Wang, Quan and Zhu, Zhenyao and Paisley, John and Wang, Chong},
  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6301--6305},
  year={2019},
  organization={IEEE}
}
```

## References

### Baseline diarization system

To learn more about our baseline diarization system based on *unsupervised clustering* algorithms, check out [this site](https://google.github.io/speaker-id/publications/LstmDiarization/).

A Python re-implementation of the *spectral clustering* algorithm used in this paper is available [here](https://github.com/wq2012/SpectralCluster).

The ground truth labels for the [NIST SRE 2000](https://catalog.ldc.upenn.edu/LDC2001S97) dataset (Disk6 and Disk8) can be found [here](https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000).

For more public resources on speaker diarization, check out [awesome-diarization](https://github.com/wq2012/awesome-diarization).

### Speaker recognizer/encoder

To learn more about our speaker embedding system, check out [this site](https://google.github.io/speaker-id/publications/GE2E/).
We are aware of several third-party implementations of this work:

* [Resemblyzer: PyTorch implementation by resemble-ai](https://github.com/resemble-ai/Resemblyzer)
* [TensorFlow implementation by Janghyun1230](https://github.com/Janghyun1230/Speaker_Verification)
* [PyTorch implementation by HarryVolek](https://github.com/HarryVolek/PyTorch_Speaker_Verification) - with UIS-RNN integration
* [PyTorch implementation as part of SV2TTS](https://github.com/CorentinJ/Real-Time-Voice-Cloning)

Please use your own judgement to decide whether you want to use these implementations. **We are NOT responsible for the correctness of any third-party implementations.**

## Variants

Here we list repositories that are based on UIS-RNN, but integrate other technologies or add improvements.

| Link | Description |
| ---- | ----------- |
| [taylorlu/Speaker-Diarization](https://github.com/taylorlu/Speaker-Diarization) ![GitHub stars](https://img.shields.io/github/stars/taylorlu/Speaker-Diarization?style=social) | Speaker diarization using UIS-RNN and GhostVLAD, providing an easier way to support open-set speakers. |
| [DonkeyShot21/uis-rnn-sml](https://github.com/DonkeyShot21/uis-rnn-sml) ![GitHub stars](https://img.shields.io/github/stars/DonkeyShot21/uis-rnn-sml?style=social) | A variant of UIS-RNN, for the paper *Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data*. |