Deep Learning & 3D Convolutional Neural Networks for Speaker Verification

Last update: Dec 17, 2022

Overview

TensorFlow implementation of 3D Convolutional Neural Networks for Speaker Verification - Official Project Page - Pytorch Implementation

https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat

https://badges.frapsoft.com/os/v2/open-source.svg?v=102

https://img.shields.io/twitter/follow/amirsinatorfi.svg?label=Follow&style=social

This repository contains the code release for our paper titled as "Text-Independent Speaker Verification Using 3D Convolutional Neural Networks". The link to the paper is provided as well.

The code has been developed using TensorFlow. The input pipeline must be prepared by the users. This code is aimed to provide the implementation for Speaker Verification (SR) by using 3D convolutional neural networks following the SR protocol.

Citation

If you used this code, please kindly consider citing the following paper:

@article{torfi2017text,
  title={Text-independent speaker verification using 3d convolutional neural networks},
  author={Torfi, Amirsina and Nasrabadi, Nasser M and Dawson, Jeremy},
  journal={arXiv preprint arXiv:1705.09422},
  year={2017}
}

DEMO

For running a demo, after forking the repository, run the following scrit:

./run.sh

General View

We leveraged 3D convolutional architecture for creating the speaker model in order to simultaneously capturing the speech-related and temporal information from the speakers' utterances.

Speaker Verification Protocol(SVP)

In this work, a 3D Convolutional Neural Network (3D-CNN) architecture has been utilized for text-independent speaker verification in three phases.

1. At the development phase, a CNN is trained to classify speakers at the utterance-level.

2. In the enrollment stage, the trained network is utilized to directly create a speaker model for each speaker based on the extracted features.

3. Finally, in the evaluation phase, the extracted features from the test utterance will be compared to the stored speaker model to verify the claimed identity.

The aforementioned three phases are usually considered as the SV protocol. One of the main challenges is the creation of the speaker models. Previously-reported approaches create speaker models based on averaging the extracted features from utterances of the speaker, which is known as the d-vector system.

How to leverage 3D Convolutional Neural Networks?

In our paper, we propose the implementation of 3D-CNNs for direct speaker model creation in which, for both development and enrollment phases, an identical number of speaker utterances is fed to the network for representing the spoken utterances and creation of the speaker model. This leads to simultaneously capturing the speaker-related information and building a more robust system to cope with within-speaker variation. We demonstrate that the proposed method significantly outperforms the d-vector verification system.

Code Implementation

The input pipeline must be provided by the user. Please refer to ``code/0-input/input_feature.py`` for having an idea about how the input pipeline works.

Input Pipeline for this work

The MFCC features can be used as the data representation of the spoken utterances at the frame level. However, a drawback is their non-local characteristics due to the last DCT 1 operation for generating MFCCs. This operation disturbs the locality property and is in contrast with the local characteristics of the convolutional operations. The employed approach in this work is to use the log-energies, which we call MFECs. The extraction of MFECs is similar to MFCCs by discarding the DCT operation. The temporal features are overlapping 20ms windows with the stride of 10ms, which are used for the generation of spectrum features. From a 0.8- second sound sample, 80 temporal feature sets (each forms a 40 MFEC features) can be obtained which form the input speech feature map. Each input feature map has the dimen- sionality of ζ × 80 × 40 which is formed from 80 input frames and their corresponding spectral features, where ζ is the number of utterances used in modeling the speaker during the development and enrollment stages.

The speech features have been extracted using [SpeechPy] package.

Implementation of 3D Convolutional Operation

The Slim high-level API made our life very easy. The following script has been used for our implementation:

net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11')
net = PReLU(net, 'conv11_activation')
net = slim.conv2d(net, 16, [3, 9, 1], stride=[1, 2, 1], scope='conv12')
net = PReLU(net, 'conv12_activation')
net = tf.nn.max_pool3d(net, strides=[1, 1, 1, 2, 1], ksize=[1, 1, 1, 2, 1], padding='VALID', name='pool1')

############ Conv-2 ###############
############ Conv-1 ###############
net = slim.conv2d(net, 32, [3, 1, 4], stride=[1, 1, 1], scope='conv21')
net = PReLU(net, 'conv21_activation')
net = slim.conv2d(net, 32, [3, 8, 1], stride=[1, 2, 1], scope='conv22')
net = PReLU(net, 'conv22_activation')
net = tf.nn.max_pool3d(net, strides=[1, 1, 1, 2, 1], ksize=[1, 1, 1, 2, 1], padding='VALID', name='pool2')

############ Conv-3 ###############
############ Conv-1 ###############
net = slim.conv2d(net, 64, [3, 1, 3], stride=[1, 1, 1], scope='conv31')
net = PReLU(net, 'conv31_activation')
net = slim.conv2d(net, 64, [3, 7, 1], stride=[1, 1, 1], scope='conv32')
net = PReLU(net, 'conv32_activation')
# net = slim.max_pool2d(net, [1, 1], stride=[4, 1], scope='pool1')

############ Conv-4 ###############
net = slim.conv2d(net, 128, [3, 1, 3], stride=[1, 1, 1], scope='conv41')
net = PReLU(net, 'conv41_activation')
net = slim.conv2d(net, 128, [3, 7, 1], stride=[1, 1, 1], scope='conv42')
net = PReLU(net, 'conv42_activation')
# net = slim.max_pool2d(net, [1, 1], stride=[4, 1], scope='pool1')

############ Conv-5 ###############
net = slim.conv2d(net, 128, [4, 3, 3], stride=[1, 1, 1], normalizer_fn=None, scope='conv51')
net = PReLU(net, 'conv51_activation')

# net = slim.conv2d(net, 256, [1, 1], stride=[1, 1], scope='conv52')
# net = PReLU(net, 'conv52_activation')

# Last layer which is the logits for classes
logits = tf.contrib.layers.conv2d(net, num_classes, [1, 1, 1], activation_fn=None, scope='fc')

As it can be seen, slim.conv2d has been used. However, simply by using 3D kernels as [k_x, k_y, k_z] and stride=[a, b, c] it can be turned into a 3D-conv operation. The base of the slim.conv2d is tf.contrib.layers.conv2d. Please refer to official Documentation for further details.

Disclaimer

The code architecture part has been heavily inspired by Slim and Slim image classification library. Please refer to this link for further details.

Citation

If you used this code please kindly cite the following paper:

@article{torfi2017text,
  title={Text-Independent Speaker Verification Using 3D Convolutional Neural Networks},
  author={Torfi, Amirsina and Nasrabadi, Nasser M and Dawson, Jeremy},
  journal={arXiv preprint arXiv:1705.09422},
  year={2017}
}

License

The license is as follows:

APPENDIX: How to apply the Apache License to your work.

   To apply the Apache License to your work, attach the following
   boilerplate notice, with the fields enclosed by brackets "{}"
   replaced with your own identifying information. (Don't include the brackets!)  The text should be enclosed in the appropriate
   comment syntax for the file format. We also recommend that a
   file or class name and description of purpose be included on the
   same "printed page" as the copyright notice for easier
   identification within third-party archives.

Copyright {2017} {Amirsina Torfi}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Please refer to LICENSE file for further detail.

Contribution

We are looking forward to your kind feedback. Please help us to improve the code and make our work better. For contribution, please create the pull request and we will investigate it promptly. Once again, we appreciate your feedback and code inspections.

references

[SpeechPy]

Amirsina Torfi. 2017. astorfi/speech_feature_extraction: SpeechPy. Zenodo. doi:10.5281/zenodo.810392.

Comments

ValueError: Convolution expects input with rank 4, got 5

When I run the run.sh, it shows something wrong:

/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
Train data shape: (12, 80, 40, 20)Train label shape: (12,)Test data shape: (12, 80, 40, 20)
Test label shape: (12,)
Traceback (most recent call last):  File "./code/1-development/train_softmax.py", line 602, in <module>    tf.app.run()
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./code/1-development/train_softmax.py", line 414, in main
    logits, end_points_speech = model_speech_fn(batch_speech[i * step: (i + 1) * step])  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/code/1-development/nets/nets_factory.py", line 59, in network_fn
    return func(images, num_classes, is_training=is_training)
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/code/1-development/nets/cnn_speech.py", line 118, in speech_cnn
    net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11')
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
    conv_dims=2)
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution
    (conv_dims + 2, input_rank))
ValueError: Convolution expects input with rank 4, got 5
Closing remaining open files:data/development_sample_dataset_speaker.hdf5...done
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
Enrollment data shape: (108, 80, 40, 1)
Enrollment label shape: (108,)
Evaluation data shape: (12, 80, 40, 1)
Evaluation label shape: (12,)
Traceback (most recent call last):
  File "./code/2-enrollment/enrollment.py", line 330, in <module>
    tf.app.run()
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./code/2-enrollment/enrollment.py", line 201, in main
    for i in xrange(FLAGS.num_clones):
NameError: name 'xrange' is not defined
Closing remaining open files:data/development_sample_dataset_speaker.hdf5...donedata/enrollment-evaluation_sample_dataset.hdf5...done
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
Enrollment data shape: (108, 80, 40, 1)
Enrollment label shape: (108,)
Evaluation data shape: (12, 80, 40, 1)
Evaluation label shape: (12,)
Traceback (most recent call last):
  File "./code/3-evaluation/evaluation.py", line 380, in <module>
    tf.app.run()
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./code/3-evaluation/evaluation.py", line 202, in main
    for i in xrange(FLAGS.num_clones):
NameError: name 'xrange' is not defined
Closing remaining open files:data/enrollment-evaluation_sample_dataset.hdf5...donedata/development_sample_dataset_speaker.hdf5...done
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
Traceback (most recent call last):
  File "./code/4-ROC_PR_curve/calculate_roc.py", line 23, in <module>
    score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
Traceback (most recent call last):
  File "./code/4-ROC_PR_curve/PlotROC.py", line 73, in <module>
    score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
Traceback (most recent call last):
  File "./code/4-ROC_PR_curve/PlotPR.py", line 58, in <module>
    score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
  DeprecationWarning)
Traceback (most recent call last):
  File "./code/4-ROC_PR_curve/PlotHIST.py", line 53, in <module>
    score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
  File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'

opened by 8rV1n 42

Problem with evaluation.

Hi @astorfi ,thank for your great work, i also use all the same settings but use hdf5 to store training data instead of Audio Dataset. However, my evaluation result is low, EER is up to 40%. I think there is something wrong with my work. Do you have any idea to fix this? I use VoxCeleb dataset for background model and only use 1 sample per speaker. 50 people for enrollment, 50 for un-enrollment (reject). 4 samples for evaluation.

Thank for your help.

opened by duynguyen5896 16
Regarding input data

Hi @astorfi I have gone through your code. While extracting mfcc features for sample audio file it contains shape (420,40) here, 420 is number of frames and 40 is number of features.But In sample data of your code youre applying mfec feature file contains shape (3209,40,3). As per my understanding 3209 is Number of Frames,40 is Number of Features,3 Is number of Channels. I didn't understand the number of channels usage.can you please suggest how to create Feature_mfec.npy file in your format.

opened by abhishekkritarth 14
How to generate data

Hello,

I have a dataset of voices. I want to generate development and enrollment hdf5 file. The input_feature.py file seams to generate development files (nx80x40x20). How can I generate the enrollment file?

opened by xav12358 13
Retrain on new and own dataset

Hi,

First, that is a great job and ver well done :) Now I am trying to use your source code and maybe contribute to it, I am working on a speaker recognition problem to detect if a teacher tutorial is recorded by his voic. I have about 10 hours of historical recordings for 6 teachers. First I used the speechpy to get 3d npy from wav files and used your create_development.py to create the hdf5 files for train and eval. Is that correct? Specially I got 13 instead of 40 regarding the features vector length in the npy files! I ran the run.bash file and it gave me also error saying something like that: ValueError: Negative dimension size caused by subtracting 2 from 1 for 'MaxPool_7' (op: 'MaxPool') with input shapes: [?,1,112,128].

opened by Ahmed-Abouzeid 10
Convolution expects input with rank 4, got 5

Traceback (most recent call last): File "./code/1-development/train_softmax.py", line 602, in tf.app.run() File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "./code/1-development/train_softmax.py", line 414, in main logits, end_points_speech = model_speech_fn(batch_speech[i * step: (i + 1) * step]) File "/opt/speaker-recognition/code/1-development/nets/nets_factory.py", line 59, in network_fn return func(images, num_classes, is_training=is_training) File "/opt/speaker-recognition/code/1-development/nets/cnn_speech.py", line 118, in speech_cnn net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11') File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, **current_args) File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d conv_dims=2) File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, **current_args) File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution (conv_dims + 2, input_rank)) ValueError: Convolution expects input with rank 4, got 5

opened by hungryquiter 9
loss = 0,train acc = 0

Hi astorfi, I'm trying to use your code to train a model with 31 labels, 60 samples for each label. However, when i use train_softmax.py, last minibatches return loss = 0 while train acc = 0. Do you have any idea to fix it?

Thank you.

opened by duynguyen5896 8
shape mismatch problem in input_feature.py

Hello! We are trying to make our own input pipeline. However, when we follow the getitem method in Audioset (with the setting that cube_shape is (20,80,40)), there is a shape mismatch when the model tries to feed data for batch_speech (placeholder with the shape of (20,80,.40,1)).

After carefully review the code in train_softmax.py, we find that the input shape will conflict with the transpose operation in following code:

speech_train = np.transpose(speech_train[None, :, :, :, :], axes=(1, 4, 2, 3, 0))

What is the solution? Could you give us any help?

opened by yangalan123 8
prediction difference between batch=1 and batch=16
Any ideas why I'm receiving different prediction values when running with batch_size=1,16? find code below: Thanks!

def predict(self,speech_input): labels = np.empty(0, int) labels = np.append(labels, range(speech_input.shape[0]), axis=0) feature,logits,_ = self.session.run( [self.features,self.logits,self.end_points_speech], feed_dict={self.is_training: False, self.batch_dynamic: labels.shape[0], self.margin_imp_tensor: 50, self.batch_speech: speech_input}) #self.batch_labels: labels.reshape([labels.shape[0], 1])})

# Extracting the associated numpy array. #print (feature[0]) return feature,logits
opened by alanbekker 8
.wav inputs specifics

Hi guys, I have a question regarding the input wav files used for training. What are the audio format specifications? I used voxceleb ( http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ ) as dataset, but it is giving me some troubles. Do you know about any other usable dataset?

Thank you ;)

opened by loregagliard 6
Run time error in the demo

When I ran the run.sh, the execution terminated saying: FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'

Where do i get this score file from? Do I need to create one? I just ran the run.sh for demo.

Can you please help?

Regards!

opened by sivagururaman 6
　All Dependency of This Project

Hi, astorfi First of all, thanks for your contribution to this project. I have some doubts: 　１. Is it necessary to isolate all dependencies of this project so that all human cloning repository 　　do not need to install various dependencies separately？ 2. Is it necessary to further describe the operation steps of the demo　to facilitate others to repair 　　and match their own needs？ Looking forward to your early reply and best wish！ Yours, Vickey

opened by Vickey-ZWQ 1
How to deal with .hdfs5 files ?
Hi, astorfi First of all, thanks for your contribution to this project. I have some questions:

Are acoustic features or voice files stored in HDFS files?

How to write the header file of the HDFS file?

Looking forward to your reply！ Yours, yy
opened by yy835055664 2
Does anyone know the EER of this repo?

Hi. I am looking for some speaker verification repo that gives decent EER. I have tried some open source speaker verifications but I haven't been able to find anything that gives below 10 % EER with VoxCeleb1 DB. The speaker verification models availablie online are usually trained with smaller and cleaner DBs and tend not to give reasonable EER on VoxCeleb1. Can anybody please tell me if there is any decent one? I would really appreciate it :D

opened by hash2430 1
Demo video recording link is broken

The demo video recording link seems to be broken. Would help to view things in action before trying it out locally.

Link attached to the demo video: https://asciinema.org/a/yfy6FryUAWWMl1vgylrRagMdw

opened by JudeVJoseph 0
where is score_vector.npy

fid = open(os_fspath(file), "rb") FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy' Traceback (most recent call last): File "./code/4-ROC_PR_curve/PlotROC.py", line 73, in

opened by azuryl 2

Releases(1.1)

1.1(Aug 9, 2017)

This project is aimed to provide the implementation for Speaker Verification (SR) by using 3D convolutional neural networks following the SR protocol.
Source code(tar.gz)
Source code(zip)
1.0(Jul 30, 2017)

This project is aimed to provide the implementation for Speaker Verification (SR) by using 3D convolutional neural networks following the SR protocol.
Source code(tar.gz)
Source code(zip)

Owner

Amirsina Torfi

PhD & Developer working on Deep Learning, Computer Vision & NLP

GitHub Repository

Range Image-based LiDAR Localization for Autonomous Vehicles Using Mesh Maps

Range Image-based 3D LiDAR Localization This repo contains the code for our ICRA2021 paper: Range Image-based LiDAR Localization for Autonomous Vehicl

208 Dec 15, 2022

This repository contains code to train and render Mixture of Volumetric Primitives (MVP) models

Mixture of Volumetric Primitives -- Training and Evaluation This repository contains code to train and render Mixture of Volumetric Primitives (MVP) m

125 Dec 29, 2022

Image Processing, Image Smoothing, Edge Detection and Transforms

opevcvdl-hw1 This project uses openCV and Qt to achieve the requirements. Version Python 3.7 opencv-contrib-python 3.4.2.17 Matplotlib 3.1.1 pyqt5 5.1

3 Aug 17, 2022

Deeprl - Standard DQN and dueling network for simple games

DeepRL This code implements the standard deep Q-learning and dueling network with experience replay (memory buffer) for playing simple games. DQN algo

6 Apr 12, 2020

Surrogate-Assisted Genetic Algorithm for Wrapper Feature Selection

SAGA Surrogate-Assisted Genetic Algorithm for Wrapper Feature Selection Please refer to the Jupyter notebook (Example.ipynb) for an example of using t

9 Dec 28, 2022

This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

About Repository This repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.' About Code

1 Nov 09, 2021

Multi-task yolov5 with detection and segmentation based on yolov5

YOLOv5DS Multi-task yolov5 with detection and segmentation based on yolov5(branch v6.0) decoupled head anchor free segmentation head README中文 Ablation

150 Dec 30, 2022

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

ByteTrack ByteTrack is a simple, fast and strong multi-object tracker. ByteTrack: Multi-Object Tracking by Associating Every Detection Box Yifu Zhang,

2.9k Jan 04, 2023

The materials used in the SaxonJS tutorial presented at Declarative Amsterdam, 2021

SaxonJS-Tutorial-2021, version 1.0.4 Last updated on 4 November, 2021. Table of contents Background Prerequisites Starting a web server Running a Java

11 Oct 23, 2022

It is the assignment for COMP 576 in Rice University

COMP-576 It is the assignment for COMP 576 in Rice University There are two programming assignments and one Final Project. Assignment 1: It is a MLP a

1 Nov 25, 2021

Scalable machine learning based time series forecasting

mlforecast Scalable machine learning based time series forecasting. Install PyPI pip install mlforecast Optional dependencies If you want more functio

145 Dec 24, 2022

Open source Python implementation of the HDR+ photography pipeline

hdrplus-python Open source Python implementation of the HDR+ photography pipeline, originally developped by Google and presented in a 2016 article. Th

77 Jan 05, 2023

Research on Event Accumulator Settings for Event-Based SLAM

Research on Event Accumulator Settings for Event-Based SLAM This is the source code for paper "Research on Event Accumulator Settings for Event-Based

26 Dec 21, 2022

3D ResNets for Action Recognition (CVPR 2018)

3D ResNets for Action Recognition Update (2020/4/13) We published a paper on arXiv. Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara, and Yutaka Satoh,

3.5k Jan 06, 2023

Streamlit App For Product Analysis - Streamlit App For Product Analysis

Streamlit_App_For_Product_Analysis Здравствуйте! Перед вами дашборд, позволяющий

1 Jan 10, 2022

Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper] This is Official PyTorch implementatio

42 Nov 01, 2022

Deep Learning & 3D Convolutional Neural Networks for Speaker Verification

Related tags

Overview

TensorFlow implementation of 3D Convolutional Neural Networks for Speaker Verification - Official Project Page - Pytorch Implementation

Citation

DEMO

General View

Speaker Verification Protocol(SVP)

How to leverage 3D Convolutional Neural Networks?

Code Implementation

Input Pipeline for this work

Implementation of 3D Convolutional Operation

Disclaimer

Citation

License

Contribution

Comments

Releases(1.1)

1.1(Aug 9, 2017)

1.0(Jul 30, 2017)

Owner

Amirsina Torfi

Range Image-based LiDAR Localization for Autonomous Vehicles Using Mesh Maps

This repository contains code to train and render Mixture of Volumetric Primitives (MVP) models

Image Processing, Image Smoothing, Edge Detection and Transforms

Deeprl - Standard DQN and dueling network for simple games

Surrogate-Assisted Genetic Algorithm for Wrapper Feature Selection

This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

Multi-task yolov5 with detection and segmentation based on yolov5

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

The materials used in the SaxonJS tutorial presented at Declarative Amsterdam, 2021

It is the assignment for COMP 576 in Rice University

Scalable machine learning based time series forecasting

Open source Python implementation of the HDR+ photography pipeline

Research on Event Accumulator Settings for Event-Based SLAM

3D ResNets for Action Recognition (CVPR 2018)

Streamlit App For Product Analysis - Streamlit App For Product Analysis

Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[ICCV 2021] Focal Frequency Loss for Image Reconstruction and Synthesis

Vignette is a face tracking software for characters using osu!framework.

Pretrained Cost Model for Distributed Constraint Optimization Problems

FinEAS: Financial Embedding Analysis of Sentiment 📈