# D-vector

This is a PyTorch implementation of speaker embedding trained with GE2E loss. The original paper on GE2E loss can be found here: [Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)

## Usage

```python
import torch
import torchaudio

wav2mel = torch.jit.load("wav2mel.pt")
dvector = torch.jit.load("dvector.pt").eval()

wav_tensor, sample_rate = torchaudio.load("example.wav")
mel_tensor = wav2mel(wav_tensor, sample_rate)  # shape: (frames, mel_dim)
emb_tensor = dvector.embed_utterance(mel_tensor)  # shape: (emb_dim)
```

You can also embed multiple utterances of a speaker at once:

```python
emb_tensor = dvector.embed_utterances([mel_tensor_1, mel_tensor_2])  # shape: (emb_dim)
```

There are 2 modules in this example:

- `wav2mel.pt` is the preprocessing module, which is composed of 2 modules:
  - `sox_effects.pt` is used to normalize volume, remove silence, resample audio to 16 kHz, 16 bits, and remix all channels to a single channel
  - `log_melspectrogram.pt` is used to transform waveforms to log mel spectrograms
- `dvector.pt` is the speaker encoder

Since all the modules are compiled with [TorchScript](https://pytorch.org/docs/stable/jit.html), you can simply load them and use them anywhere **without any dependencies**.

### Pretrained models & preprocessing modules

You can download them from the [*Releases*](https://github.com/yistLin/dvector/releases) page.

## Evaluate model performance

You can evaluate the model's performance in terms of equal error rate. For example, download the official test splits (`veri_test.txt` and `veri_test2.txt`) from [The VoxCeleb1 Dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) and run the following command:

```bash
python equal_error_rate.py VoxCeleb1/test VoxCeleb1/test/veri_test.txt -w wav2mel.pt -c dvector.pt
```

So far, the released checkpoint was trained only on VoxCeleb1 without any data augmentation. Its performance on the official test splits of VoxCeleb1 is as follows:

| Test Split | Equal Error Rate | Threshold |
| :-: | :-: | :-: |
| veri_test.txt | 12.0% | 0.222 |
| veri_test2.txt | 11.9% | 0.223 |
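
For reference, the sketch below illustrates roughly what an equal-error-rate evaluation involves: embed both utterances of each trial pair with the released TorchScript modules, score them with cosine similarity, and find the threshold where the false-acceptance and false-rejection rates meet. The cosine scoring, the trial-list format (`label path1 path2`), and the helper names are assumptions made for illustration; `equal_error_rate.py` is the authoritative implementation.

```python
import os

import numpy as np
import torch
import torch.nn.functional as F
import torchaudio
from sklearn.metrics import roc_curve

wav2mel = torch.jit.load("wav2mel.pt")
dvector = torch.jit.load("dvector.pt").eval()


def embed(wav_path):
    """Turn a waveform file into a d-vector with the released TorchScript modules."""
    wav_tensor, sample_rate = torchaudio.load(wav_path)
    with torch.no_grad():
        return dvector.embed_utterance(wav2mel(wav_tensor, sample_rate))


def equal_error_rate(labels, scores):
    """EER is the error rate at the threshold where FPR and FNR are (nearly) equal."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2, thresholds[idx]


test_dir = "VoxCeleb1/test"
labels, scores = [], []
with open(os.path.join(test_dir, "veri_test.txt")) as trial_file:
    for line in trial_file:
        label, path1, path2 = line.split()  # assumed trial format: "label path1 path2"
        score = F.cosine_similarity(
            embed(os.path.join(test_dir, path1)),
            embed(os.path.join(test_dir, path2)),
            dim=-1,
        ).item()
        labels.append(int(label))
        scores.append(score)

eer, threshold = equal_error_rate(labels, scores)
print(f"EER: {eer:.1%} at threshold {threshold:.3f}")
```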
## Train from scratch

### Preprocess training data

To use the script provided here, you have to organize your raw data in this way:

- all utterances from a speaker should be put under a directory (**speaker directory**)
- all speaker directories should be put under a directory (**root directory**)
- a **speaker directory** can have subdirectories, and utterances can be placed under those subdirectories

You can extract utterances from multiple **root directories**, e.g.

```bash
python preprocess.py VoxCeleb1/dev LibriSpeech/train-clean-360 -o preprocessed
```

If you need to modify some audio preprocessing hyperparameters, directly modify `data/wav2mel.py`.

After preprocessing, 3 preprocessing modules will be saved in the output directory:

1. `wav2mel.pt`
2. `sox_effects.pt`
3. `log_melspectrogram.pt`

> The first module `wav2mel.pt` is composed of the second and the third modules.
> These modules were compiled with TorchScript and can be used anywhere to preprocess audio data.

### Train a model

You have to specify where to store checkpoints and logs, e.g.

```bash
python train.py preprocessed
```

During training, logs will be put under `/logs` and checkpoints will be placed under `/checkpoints`. For more details, check the usage with `python train.py -h`.

### Use different speaker encoders

By default, I'm using a 3-layer LSTM with attentive pooling as the speaker encoder, but you can use speaker encoders with different architectures. For more information, please take a look at `modules/dvector.py`.

## Visualize speaker embeddings

You can visualize speaker embeddings using a trained d-vector. Note that you have to structure the speakers' directories in the same way as for preprocessing, e.g.

```bash
python visualize.py LibriSpeech/dev-clean -w wav2mel.pt -c dvector.pt -o tsne.jpg
```

The following plot is the dimensionality-reduction result (using t-SNE) of some utterances from LibriSpeech.

![TSNE result](images/tsne.png)
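
`visualize.py` already produces this plot for you; as a rough sketch of what such a visualization involves, the snippet below embeds every utterance per speaker directory and projects the d-vectors to 2-D. The use of scikit-learn's t-SNE and matplotlib, as well as the `*.flac` glob pattern, are assumptions made for illustration rather than a description of `visualize.py` itself.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import torch
import torchaudio
from sklearn.manifold import TSNE

wav2mel = torch.jit.load("wav2mel.pt")
dvector = torch.jit.load("dvector.pt").eval()

embeddings, speakers = [], []
root = Path("LibriSpeech/dev-clean")
for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    # LibriSpeech ships FLAC files; adjust the glob pattern for other corpora.
    for wav_path in sorted(speaker_dir.rglob("*.flac")):
        wav_tensor, sample_rate = torchaudio.load(str(wav_path))
        with torch.no_grad():
            emb = dvector.embed_utterance(wav2mel(wav_tensor, sample_rate))
        embeddings.append(emb.numpy())
        speakers.append(speaker_dir.name)

# Project the d-vectors to 2-D and color each point by its speaker.
points = TSNE(n_components=2).fit_transform(np.stack(embeddings))
for speaker in sorted(set(speakers)):
    mask = np.array([s == speaker for s in speakers])
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=speaker)
plt.savefig("tsne.jpg")
```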