Deep Learning & 3D Convolutional Neural Networks for Speaker Verification

Overview

TensorFlow implementation of 3D Convolutional Neural Networks for Speaker Verification - Official Project Page - Pytorch Implementation

This repository contains the code release for our paper titled "Text-Independent Speaker Verification Using 3D Convolutional Neural Networks". The link to the paper is provided as well.

The code has been developed using TensorFlow. The input pipeline must be prepared by the users. This code aims to provide the implementation for Speaker Verification (SV) by using 3D convolutional neural networks following the SV protocol.

readme_images/conv_gif.gif

Citation

If you use this code, please consider citing the following paper:

@article{torfi2017text,
  title={Text-independent speaker verification using 3d convolutional neural networks},
  author={Torfi, Amirsina and Nasrabadi, Nasser M and Dawson, Jeremy},
  journal={arXiv preprint arXiv:1705.09422},
  year={2017}
}

DEMO

To run a demo, fork the repository and then run the following script:

./run.sh

General View

We leveraged a 3D convolutional architecture to create the speaker model, in order to simultaneously capture the speech-related and temporal information from the speakers' utterances.

Speaker Verification Protocol (SVP)

In this work, a 3D Convolutional Neural Network (3D-CNN) architecture has been utilized for text-independent speaker verification in three phases.

1. At the development phase, a CNN is trained to classify speakers at the utterance-level.

2. In the enrollment stage, the trained network is utilized to directly create a speaker model for each speaker based on the extracted features.

3. Finally, in the evaluation phase, the extracted features from the test utterance will be compared to the stored speaker model to verify the claimed identity.

The aforementioned three phases are usually considered as the SV protocol. One of the main challenges is the creation of the speaker models. Previously-reported approaches create speaker models based on averaging the extracted features from utterances of the speaker, which is known as the d-vector system.
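
The enrollment and evaluation phases of the d-vector style baseline mentioned above can be sketched roughly as follows. This is only an illustration, not the code used in this repository: the function names, the cosine similarity score, and the threshold value are assumptions introduced for the example.

```python
import numpy as np

def enroll_speaker(utterance_features):
    """Average per-utterance feature vectors into one speaker model (d-vector style)."""
    feats = np.asarray(utterance_features, dtype=np.float64)
    return feats.mean(axis=0)

def cosine_score(speaker_model, test_feature):
    """Cosine similarity between a stored speaker model and a test feature vector."""
    a = np.asarray(speaker_model, dtype=np.float64)
    b = np.asarray(test_feature, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(speaker_model, test_feature, threshold=0.5):
    """Accept the claimed identity if the score reaches the (hypothetical) threshold."""
    return cosine_score(speaker_model, test_feature) >= threshold

# Toy usage with made-up 2-dimensional features.
model = enroll_speaker([[1.0, 0.0], [3.0, 0.0]])
accepted = verify(model, [2.0, 0.1])
rejected = verify(model, [0.0, 1.0])
```

In practice the feature vectors would come from the trained network, and the threshold would be tuned on held-out data rather than fixed in advance.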

How to leverage 3D Convolutional Neural Networks?

In our paper, we propose the implementation of 3D-CNNs for direct speaker model creation in which, for both development and enrollment phases, an identical number of speaker utterances is fed to the network for representing the spoken utterances and creation of the speaker model. This leads to simultaneously capturing the speaker-related information and building a more robust system to cope with within-speaker variation. We demonstrate that the proposed method significantly outperforms the d-vector verification system.

Code Implementation

The input pipeline must be provided by the user. Please refer to ``code/0-input/input_feature.py`` to get an idea of how the input pipeline works.

Input Pipeline for this work

readme_images/Speech_GIF.gif

The MFCC features can be used as the data representation of the spoken utterances at the frame level. However, a drawback is their non-local characteristics due to the last DCT operation for generating MFCCs. This operation disturbs the locality property and is in contrast with the local characteristics of the convolutional operations. The approach employed in this work is to use the log-energies, which we call MFECs. The extraction of MFECs is similar to that of MFCCs, except that the DCT operation is discarded. The temporal features are overlapping 20 ms windows with a stride of 10 ms, which are used for the generation of spectrum features. From a 0.8-second sound sample, 80 temporal feature sets (each forming 40 MFEC features) can be obtained, which form the input speech feature map. Each input feature map has the dimensionality of ζ × 80 × 40, which is formed from 80 input frames and their corresponding spectral features, where ζ is the number of utterances used in modeling the speaker during the development and enrollment stages.
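
The locality argument can be illustrated with a small numerical sketch. It assumes an unnormalized DCT-II over all 40 filterbank channels and uses made-up log-energy values; it is not the feature extraction code of this repository. Changing a single filterbank channel changes one MFEC value, but it changes every DCT coefficient.

```python
import numpy as np

def dct_ii_matrix(n):
    """Unnormalized DCT-II basis: entry (k, m) is cos(pi * k * (2m + 1) / (2n))."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n))

num_filters = 40
log_energies = np.zeros(num_filters)   # stand-in MFEC frame (hypothetical values)
bumped = log_energies.copy()
bumped[10] += 1.0                      # perturb a single filterbank channel

dct = dct_ii_matrix(num_filters)
# Number of entries that differ before the DCT (MFEC) and after it (MFCC-like).
mfec_changed = np.count_nonzero(bumped - log_energies)
mfcc_changed = np.count_nonzero(np.abs(dct @ bumped - dct @ log_energies) > 1e-12)
```

Here `mfec_changed` is 1 while `mfcc_changed` is 40, which is the sense in which the DCT spreads a local spectral change over the whole coefficient vector.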

The speech features have been extracted using [SpeechPy] package.
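
As a rough sketch of how a ζ × 80 × 40 feature map could be assembled, the snippet below stacks ζ windows of 80 frames taken from a matrix of per-frame features. The function name, the random window selection, and the value ζ = 20 are assumptions for illustration; the random matrix stands in for real MFEC features, and the windows could equally come from different utterances of the same speaker.

```python
import numpy as np

def build_feature_cube(features, num_utterances=20, num_frames=80, seed=0):
    """Stack `num_utterances` windows of `num_frames` frames into one cube.

    `features` has shape (total_frames, num_coefficients); the result has
    shape (num_utterances, num_frames, num_coefficients).
    """
    features = np.asarray(features)
    total_frames = features.shape[0]
    if total_frames < num_frames:
        raise ValueError("need at least num_frames frames")
    rng = np.random.default_rng(seed)
    # Random start index for each window; the upper bound is exclusive.
    starts = rng.integers(0, total_frames - num_frames + 1, size=num_utterances)
    return np.stack([features[s:s + num_frames] for s in starts])

# Stand-in for a (frames, 40) MFEC matrix; the values are random placeholders.
fake_mfec = np.random.default_rng(1).normal(size=(300, 40))
cube = build_feature_cube(fake_mfec)
batch = cube[None, ..., None]   # add batch and channel axes
```

With these defaults `cube` has shape (20, 80, 40), and `batch` has shape (1, 20, 80, 40, 1).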

Implementation of 3D Convolutional Operation

The Slim high-level API made our life very easy. The following script has been used for our implementation:

############ Conv-1 ###############
net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11')
net = PReLU(net, 'conv11_activation')
net = slim.conv2d(net, 16, [3, 9, 1], stride=[1, 2, 1], scope='conv12')
net = PReLU(net, 'conv12_activation')
net = tf.nn.max_pool3d(net, strides=[1, 1, 1, 2, 1], ksize=[1, 1, 1, 2, 1], padding='VALID', name='pool1')

############ Conv-2 ###############
net = slim.conv2d(net, 32, [3, 1, 4], stride=[1, 1, 1], scope='conv21')
net = PReLU(net, 'conv21_activation')
net = slim.conv2d(net, 32, [3, 8, 1], stride=[1, 2, 1], scope='conv22')
net = PReLU(net, 'conv22_activation')
net = tf.nn.max_pool3d(net, strides=[1, 1, 1, 2, 1], ksize=[1, 1, 1, 2, 1], padding='VALID', name='pool2')

############ Conv-3 ###############
net = slim.conv2d(net, 64, [3, 1, 3], stride=[1, 1, 1], scope='conv31')
net = PReLU(net, 'conv31_activation')
net = slim.conv2d(net, 64, [3, 7, 1], stride=[1, 1, 1], scope='conv32')
net = PReLU(net, 'conv32_activation')
# net = slim.max_pool2d(net, [1, 1], stride=[4, 1], scope='pool1')

############ Conv-4 ###############
net = slim.conv2d(net, 128, [3, 1, 3], stride=[1, 1, 1], scope='conv41')
net = PReLU(net, 'conv41_activation')
net = slim.conv2d(net, 128, [3, 7, 1], stride=[1, 1, 1], scope='conv42')
net = PReLU(net, 'conv42_activation')
# net = slim.max_pool2d(net, [1, 1], stride=[4, 1], scope='pool1')

############ Conv-5 ###############
net = slim.conv2d(net, 128, [4, 3, 3], stride=[1, 1, 1], normalizer_fn=None, scope='conv51')
net = PReLU(net, 'conv51_activation')

# net = slim.conv2d(net, 256, [1, 1], stride=[1, 1], scope='conv52')
# net = PReLU(net, 'conv52_activation')

# Last layer which is the logits for classes
logits = tf.contrib.layers.conv2d(net, num_classes, [1, 1, 1], activation_fn=None, scope='fc')

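As can be seen, slim.conv2d has been used. However, simply by passing 3D kernels as [k_x, k_y, k_z] and stride=[a, b, c], it can be turned into a 3D convolution operation. slim.conv2d is based on tf.contrib.layers.conv2d. Note that this depends on the TensorFlow version in use: as the tracebacks in the comments show, versions in which convolution2d enforces conv_dims=2 reject the rank-5 input with "Convolution expects input with rank 4, got 5". Please refer to the official documentation for further details.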
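
To see how the kernel and stride choices shrink the input, the following sketch traces the three spatial dimensions through the layers listed above. It assumes VALID (no) padding for the convolutions, as is explicit for the pooling layers, and an input of 20 × 80 × 40 (ζ = 20, consistent with the data shapes reported in the comments); both are assumptions rather than statements about the actual configuration.

```python
def valid_out(size, kernel, stride):
    """Output length along one axis for a VALID convolution or pooling."""
    return (size - kernel) // stride + 1

# (name, kernel, stride) over the three spatial axes of the 5-D input tensor.
LAYERS = [
    ("conv11", (3, 1, 5), (1, 1, 1)),
    ("conv12", (3, 9, 1), (1, 2, 1)),
    ("pool1",  (1, 1, 2), (1, 1, 2)),
    ("conv21", (3, 1, 4), (1, 1, 1)),
    ("conv22", (3, 8, 1), (1, 2, 1)),
    ("pool2",  (1, 1, 2), (1, 1, 2)),
    ("conv31", (3, 1, 3), (1, 1, 1)),
    ("conv32", (3, 7, 1), (1, 1, 1)),
    ("conv41", (3, 1, 3), (1, 1, 1)),
    ("conv42", (3, 7, 1), (1, 1, 1)),
    ("conv51", (4, 3, 3), (1, 1, 1)),
]

def trace_shapes(shape, layers=LAYERS):
    """Return a list of (layer name, spatial shape after that layer)."""
    shapes = []
    for name, kernel, stride in layers:
        shape = tuple(valid_out(s, k, st) for s, k, st in zip(shape, kernel, stride))
        shapes.append((name, shape))
    return shapes

shapes = trace_shapes((20, 80, 40))
for name, shape in shapes:
    print(name, shape)
```

Under these assumptions the last layer, conv51, reduces the feature map to 1 × 1 × 1, so the final [1, 1, 1] convolution acts like a fully connected layer producing the logits.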

Disclaimer

The code architecture has been heavily inspired by Slim and the Slim image classification library. Please refer to this link for further details.

License

The license is as follows:

APPENDIX: How to apply the Apache License to your work.

   To apply the Apache License to your work, attach the following
   boilerplate notice, with the fields enclosed by brackets "{}"
   replaced with your own identifying information. (Don't include
   the brackets!)  The text should be enclosed in the appropriate
   comment syntax for the file format. We also recommend that a
   file or class name and description of purpose be included on the
   same "printed page" as the copyright notice for easier
   identification within third-party archives.

Copyright {2017} {Amirsina Torfi}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Please refer to the LICENSE file for further details.

Contribution

We are looking forward to your feedback. Please help us improve the code and make our work better. To contribute, please open a pull request and we will review it promptly. Once again, we appreciate your feedback and code inspections.

References

[SpeechPy] Amirsina Torfi. 2017. astorfi/speech_feature_extraction: SpeechPy. Zenodo. doi:10.5281/zenodo.810392.
Comments
  • ValueError: Convolution expects input with rank 4, got 5

    When I run the run.sh, it shows something wrong:

    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
      DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
      DeprecationWarning)
    Train data shape: (12, 80, 40, 20)
    Train label shape: (12,)
    Test data shape: (12, 80, 40, 20)
    Test label shape: (12,)
    Traceback (most recent call last):
      File "./code/1-development/train_softmax.py", line 602, in <module>
        tf.app.run()
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "./code/1-development/train_softmax.py", line 414, in main
        logits, end_points_speech = model_speech_fn(batch_speech[i * step: (i + 1) * step])
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/code/1-development/nets/nets_factory.py", line 59, in network_fn
        return func(images, num_classes, is_training=is_training)
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/code/1-development/nets/cnn_speech.py", line 118, in speech_cnn
        net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11')
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
        conv_dims=2)
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution
        (conv_dims + 2, input_rank))
    ValueError: Convolution expects input with rank 4, got 5
    Closing remaining open files:data/development_sample_dataset_speaker.hdf5...done
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    Enrollment data shape: (108, 80, 40, 1)
    Enrollment label shape: (108,)
    Evaluation data shape: (12, 80, 40, 1)
    Evaluation label shape: (12,)
    Traceback (most recent call last):
      File "./code/2-enrollment/enrollment.py", line 330, in <module>
        tf.app.run()
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "./code/2-enrollment/enrollment.py", line 201, in main
        for i in xrange(FLAGS.num_clones):
    NameError: name 'xrange' is not defined
    Closing remaining open files:data/development_sample_dataset_speaker.hdf5...donedata/enrollment-evaluation_sample_dataset.hdf5...done
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    Enrollment data shape: (108, 80, 40, 1)
    Enrollment label shape: (108,)
    Evaluation data shape: (12, 80, 40, 1)
    Evaluation label shape: (12,)
    Traceback (most recent call last):
      File "./code/3-evaluation/evaluation.py", line 380, in <module>
        tf.app.run()
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "./code/3-evaluation/evaluation.py", line 202, in main
        for i in xrange(FLAGS.num_clones):
    NameError: name 'xrange' is not defined
    Closing remaining open files:data/enrollment-evaluation_sample_dataset.hdf5...donedata/development_sample_dataset_speaker.hdf5...done
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
      DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
      DeprecationWarning)
    Traceback (most recent call last):
      File "./code/4-ROC_PR_curve/calculate_roc.py", line 23, in <module>
        score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
        fid = open(file, "rb")
    FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
      DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
      DeprecationWarning)
    Traceback (most recent call last):
      File "./code/4-ROC_PR_curve/PlotROC.py", line 73, in <module>
        score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
        fid = open(file, "rb")
    FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
      DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
      DeprecationWarning)
    Traceback (most recent call last):
      File "./code/4-ROC_PR_curve/PlotPR.py", line 58, in <module>
        score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
        fid = open(file, "rb")
    FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
      return f(*args, **kwds)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
      DeprecationWarning)
    /home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/sklearn/learning_curve.py:22: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the functions are moved. This module will be removed in 0.20
      DeprecationWarning)
    Traceback (most recent call last):
      File "./code/4-ROC_PR_curve/PlotHIST.py", line 53, in <module>
        score = np.load(os.path.join(FLAGS.evaluation_dir,'score_vector.npy'))
      File "/home/jovyan/Documents/git/3D-convolutional-speaker-recognition/speacker-rec-py35/lib/python3.5/site-packages/numpy/lib/npyio.py", line 384, in load
        fid = open(file, "rb")
    FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
    
    opened by 8rV1n 42
  • Problem with evaluation.

    Hi @astorfi, thanks for your great work. I use all the same settings, but store the training data in hdf5 instead of using Audio Dataset. However, my evaluation result is poor: the EER is up to 40%. I think there is something wrong with my work. Do you have any idea how to fix this? I use the VoxCeleb dataset for the background model and only use 1 sample per speaker: 50 people for enrollment, 50 for un-enrollment (reject), and 4 samples for evaluation.

    Thanks for your help.

    opened by duynguyen5896 16
  • Regarding input data

    Hi @astorfi, I have gone through your code. When extracting MFCC features for a sample audio file, the result has shape (420, 40), where 420 is the number of frames and 40 is the number of features. But in the sample data of your code, the MFEC feature file has shape (3209, 40, 3). As I understand it, 3209 is the number of frames, 40 is the number of features, and 3 is the number of channels. I didn't understand how the channels are used. Can you please suggest how to create the Feature_mfec.npy file in your format?

    opened by abhishekkritarth 14
  • How to generate data

    Hello,

    I have a dataset of voices. I want to generate the development and enrollment hdf5 files. The input_feature.py file seems to generate development files (nx80x40x20). How can I generate the enrollment file?

    opened by xav12358 13
  • Retrain on new and own dataset

    Hi,

    First, that is a great job and very well done :) Now I am trying to use your source code and maybe contribute to it. I am working on a speaker recognition problem to detect whether a teacher's tutorial is recorded in his own voice. I have about 10 hours of historical recordings for 6 teachers. First I used speechpy to get 3D npy files from the wav files, and then used your create_development.py to create the hdf5 files for training and evaluation. Is that correct? In particular, I got 13 instead of 40 for the feature vector length in the npy files! I also ran the run.bash file and it gave me an error like this: ValueError: Negative dimension size caused by subtracting 2 from 1 for 'MaxPool_7' (op: 'MaxPool') with input shapes: [?,1,112,128].

    opened by Ahmed-Abouzeid 10
  • Convolution expects input with rank 4, got 5

    Traceback (most recent call last):
      File "./code/1-development/train_softmax.py", line 602, in <module>
        tf.app.run()
      File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "./code/1-development/train_softmax.py", line 414, in main
        logits, end_points_speech = model_speech_fn(batch_speech[i * step: (i + 1) * step])
      File "/opt/speaker-recognition/code/1-development/nets/nets_factory.py", line 59, in network_fn
        return func(images, num_classes, is_training=is_training)
      File "/opt/speaker-recognition/code/1-development/nets/cnn_speech.py", line 118, in speech_cnn
        net = slim.conv2d(inputs, 16, [3, 1, 5], stride=[1, 1, 1], scope='conv11')
      File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
      File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
        conv_dims=2)
      File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
        return func(*args, **current_args)
      File "/opt/tensorflow/python2.7/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1025, in convolution
        (conv_dims + 2, input_rank))
    ValueError: Convolution expects input with rank 4, got 5

    opened by hungryquiter 9
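The traceback shows a 2D convolution (`conv_dims=2`) receiving a rank-5 tensor, and the check that fails compares the input rank against `conv_dims + 2`. The sketch below reproduces only that rank rule in plain Python so it can be tested without TensorFlow. The example shape `(None, 20, 80, 40, 1)` is an assumption borrowed from the cube shape mentioned in another report, and the function names are hypothetical.

```python
def required_rank(conv_dims):
    """Rank a convolution input needs: batch axis + spatial axes + channel axis."""
    return conv_dims + 2


def check_conv_input(shape, conv_dims):
    """Raise the same kind of error as the traceback when the rank does not match."""
    rank = len(shape)
    if rank != required_rank(conv_dims):
        raise ValueError(
            "Convolution expects input with rank %d, got %d"
            % (required_rank(conv_dims), rank))
    return True


if __name__ == "__main__":
    # Assumed input layout: (batch, utterances, frames, features, channels).
    shape = (None, 20, 80, 40, 1)
    print(check_conv_input(shape, conv_dims=3))  # a 3D convolution accepts rank 5
    try:
        check_conv_input(shape, conv_dims=2)     # a 2D convolution does not
    except ValueError as err:
        print(err)
```

The three-element kernel `[3, 1, 5]` in the failing call suggests that a 3D convolution was intended at that layer, although the report itself does not confirm which change resolved the error.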
  • loss = 0,train acc = 0


    Hi astorfi, I'm trying to use your code to train a model with 31 labels and 60 samples per label. However, when I use train_softmax.py, the last minibatches return loss = 0 while train acc = 0. Do you have any idea how to fix it?


    Thank you.

    opened by duynguyen5896 8
  • shape mismatch problem in input_feature.py


    Hello! We are trying to build our own input pipeline. However, when we follow the getitem method in Audioset (with cube_shape set to (20,80,40)), there is a shape mismatch when the model tries to feed data to batch_speech (a placeholder with the shape (20,80,40,1)).

    After carefully reviewing the code in train_softmax.py, we found that the input shape conflicts with the transpose operation in the following code:

    speech_train = np.transpose(speech_train[None, :, :, :, :], axes=(1, 4, 2, 3, 0))

    What is the solution? Could you give us any help?

    opened by yangalan123 8
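To see what the quoted transpose does to the axes, the sketch below applies the same expression to zero-filled arrays. It suggests that the expression yields a `(batch, 20, 80, 40, 1)` tensor only when each sample is laid out as `(80, 40, 20)`. The helper `reorder_cubes` is hypothetical, and the assumption that `(20, 80, 40)` cubes should simply have their first axis moved to the end is not confirmed by the source.

```python
import numpy as np


def to_network_layout(speech):
    """The transpose quoted in the report, applied to a (batch, a, b, c) array."""
    return np.transpose(speech[None, :, :, :, :], axes=(1, 4, 2, 3, 0))


def reorder_cubes(cubes):
    """Hypothetical helper: (batch, 20, 80, 40) -> (batch, 80, 40, 20)."""
    return np.transpose(cubes, axes=(0, 2, 3, 1))


if __name__ == "__main__":
    batch = 3
    frames_first = np.zeros((batch, 80, 40, 20))
    cubes_first = np.zeros((batch, 20, 80, 40))

    print(to_network_layout(frames_first).shape)  # (3, 20, 80, 40, 1)
    print(to_network_layout(cubes_first).shape)   # (3, 40, 20, 80, 1)
    print(to_network_layout(reorder_cubes(cubes_first)).shape)  # (3, 20, 80, 40, 1)
```

Printing shapes this way before feeding the placeholder makes the mismatch visible without running the training script.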
  • prediction difference between batch=1 and batch=16


    Any ideas why I'm receiving different prediction values when running with batch_size=1 versus batch_size=16? The code is below. Thanks!

    def predict(self, speech_input):
        labels = np.empty(0, int)
        labels = np.append(labels, range(speech_input.shape[0]), axis=0)
        feature, logits, _ = self.session.run(
            [self.features, self.logits, self.end_points_speech],
            feed_dict={self.is_training: False,
                       self.batch_dynamic: labels.shape[0],
                       self.margin_imp_tensor: 50,
                       self.batch_speech: speech_input})
                       # self.batch_labels: labels.reshape([labels.shape[0], 1])})

        # Extracting the associated numpy array.
        # print(feature[0])

        return feature, logits
    
    opened by alanbekker 8
  • .wav inputs specifics


    Hi guys, I have a question regarding the input wav files used for training. What are the audio format specifications? I used VoxCeleb ( http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ ) as the dataset, but it is giving me some trouble. Do you know of any other usable dataset?

    Thank you ;)

    opened by loregagliard 6
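The source does not state which audio format the pipeline expects, so a reasonable first step is to report what a given file actually contains. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files. The function name `describe_wav` and the file name in the usage line are hypothetical.

```python
import wave


def describe_wav(path):
    """Return the basic format parameters of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": wav.getnframes() / float(rate),
        }


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python describe_wav.py sample.wav
    for wav_path in sys.argv[1:]:
        print(wav_path, describe_wav(wav_path))
```

Comparing these values across a dataset is a quick way to spot files with an unexpected sample rate or channel count before feature extraction.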
  • Run time error in the demo


    When I ran run.sh, the execution terminated with: FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'

    Where do I get this score file from? Do I need to create one? I just ran run.sh for the demo.

    Can you please help?

    Regards!

    opened by sivagururaman 6
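A plotting step that loads a score file fails with exactly this error when the file has not been written yet. The sketch below checks for the file up front and reports what is missing. The path comes from the error message above; the assumption that an earlier stage of the pipeline is supposed to produce the file is not confirmed by the source, and `missing_inputs` is a hypothetical helper.

```python
import os

# Path taken from the error message in the report above.
SCORE_PATH = os.path.join("results", "SCORES", "score_vector.npy")


def missing_inputs(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    for path in missing_inputs([SCORE_PATH]):
        # Assumption: an earlier stage must run successfully before plotting.
        print("missing input: %s" % path)
```

Running a check like this before the plotting script turns a traceback into a direct statement of which file has to be generated first.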
  • All Dependencies of This Project

    Hi astorfi, first of all, thanks for your contribution to this project. I have some questions:

    1. Is it necessary to isolate all dependencies of this project so that everyone cloning the repository does not need to install the various dependencies separately?
    2. Is it necessary to describe the steps of the demo in more detail so that others can fix problems and adapt it to their own needs?

    Looking forward to your early reply, and best wishes! Yours, Vickey

    opened by Vickey-ZWQ 1
  • How to deal with .hdf5 files?

    Hi astorfi, first of all, thanks for your contribution to this project. I have some questions:

    1. Are acoustic features or voice files stored in the HDF5 files?
    2. How do I write the header of an HDF5 file?

    Looking forward to your reply! Yours, yy

    opened by yy835055664 2
  • Does anyone know the EER of this repo?


    Hi. I am looking for a speaker verification repo that gives a decent EER. I have tried some open source speaker verification projects, but I haven't been able to find anything that gives below 10% EER on the VoxCeleb1 DB. The speaker verification models available online are usually trained on smaller and cleaner DBs and tend not to give a reasonable EER on VoxCeleb1. Can anybody please tell me if there is a decent one? I would really appreciate it :D

    opened by hash2430 1
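For readers comparing systems on this metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a minimal NumPy estimate from a list of trial scores and genuine/impostor labels; it is not the evaluation code of this repository, and the function name and the convention that higher scores mean "same speaker" are assumptions.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER from trial scores; labels are 1 (genuine) or 0 (impostor).

    A trial is accepted when its score is >= the threshold. Both classes
    must be present in `labels`.
    """
    scores = np.asarray(scores, dtype=float)
    genuine = np.asarray(labels).astype(bool)
    thresholds = np.unique(scores)

    # False acceptance: impostor trials that are accepted.
    far = np.array([(scores[~genuine] >= t).mean() for t in thresholds])
    # False rejection: genuine trials that are rejected.
    frr = np.array([(scores[genuine] < t).mean() for t in thresholds])

    # Take the threshold where the two rates are closest and average them.
    best = np.argmin(np.abs(far - frr))
    return (far[best] + frr[best]) / 2.0


if __name__ == "__main__":
    print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
    print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.5
```

With few trials the estimate is coarse, because the rates can only change in steps of one trial; larger trial lists give a smoother value.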
  • Demo video recording link is broken


    The demo video recording link seems to be broken. Would help to view things in action before trying it out locally.

    Link attached to the demo video: https://asciinema.org/a/yfy6FryUAWWMl1vgylrRagMdw

    opened by JudeVJoseph 0
  • where is score_vector.npy


    fid = open(os_fspath(file), "rb")
    FileNotFoundError: [Errno 2] No such file or directory: 'results/SCORES/score_vector.npy'
    Traceback (most recent call last): File "./code/4-ROC_PR_curve/PlotROC.py", line 73, in

    opened by azuryl 2
Releases(1.1)
  • 1.1(Aug 9, 2017)

    This project is aimed to provide the implementation for Speaker Verification (SR) by using 3D convolutional neural networks following the SR protocol.

    Source code(tar.gz)
    Source code(zip)
  • 1.0(Jul 30, 2017)

    This project is aimed to provide the implementation for Speaker Verification (SR) by using 3D convolutional neural networks following the SR protocol.

    Source code(tar.gz)
    Source code(zip)
Owner
Amirsina Torfi
PhD & Developer working on Deep Learning, Computer Vision & NLP