symjax.data

Utilities

symjax.data.patchify_1d(x, window_length, stride) extract patches from a numpy array
symjax.data.patchify_2d(x, window_length, stride)
symjax.data.train_test_split(*args[, …]) split given data into two non-overlapping sets
symjax.data.batchify(*args, batch_size[, …])
symjax.data.resample_images(images, target_shape)
symjax.data.download_dataset(path, dataset, …) dataset downloading utility
symjax.data.extract_file(filename, target)

Images

symjax.data.mnist.load([path]) The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples.
symjax.data.emnist.load([option, path]) Grayscale digit/letter classification.
symjax.data.fashionmnist.load([path]) Grayscale image classification
symjax.data.dsprites.load([path]) greyscale image classification and disentanglement
symjax.data.svhn.load([path]) Street number classification.
symjax.data.cifar10.load([path]) Image classification.
symjax.data.cifar100.load([path]) Image classification.
symjax.data.celebA.load
symjax.data.ibeans.load
symjax.data.cassava.load
symjax.data.stl10.load([path]) Image classification with extra unlabeled images.
symjax.data.tinyimagenet.load

Audio

symjax.data.audiomnist.load([path]) digit recognition
symjax.data.univariate_timeseries.load
symjax.data.dcase_2019_task4.load([path]) synthetic data for polyphonic event detection
symjax.data.groove_MIDI.load([path]) The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming.
symjax.data.speech_commands.load([path])
symjax.data.picidae.load([path]) path: default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
symjax.data.esc.load([path]) ESC-10/50: Environmental Sound Classification
symjax.data.warblr.load
symjax.data.gtzan.load([path]) music genre classification
symjax.data.dclde.load
symjax.data.irmas.load([path]) music instrument classification
symjax.data.vocalset.load
symjax.data.freefield1010.load([path]) Audio binary classification, presence or absence of bird songs.
symjax.data.birdvox_70k.load([path]) a dataset for avian flight call detection in half-second clips
symjax.data.birdvox_dcase_20k.load
symjax.data.seizures_neonatal.load
symjax.data.sonycust.load([path]) multilabel urban sound classification
symjax.data.FSDKaggle2018.load
symjax.data.TUTacousticscenes2017.load([path]) Acoustic Scene classification

Detailed description

symjax.data.patchify_1d(x, window_length, stride)[source]

extract patches from a numpy array

Parameters:
  • x (array-like) – the input data to extract patches from, any shape, the last dimension is the one being patched
  • window_length (int) – the length of the patches
  • stride (int) – the stride (number of bins separating two consecutive patches)
Returns:

x_patches – the extracted patches; the number of patches is placed in the second-to-last dimension (-2)

Return type:

array-like
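
The shape convention described above (patches taken along the last dimension, patch count placed in dimension -2) can be sketched with plain NumPy. This is a minimal illustration, not the symjax implementation: the name patchify_1d_sketch is hypothetical, and it assumes no padding, so trailing samples that do not fill a whole window are dropped.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def patchify_1d_sketch(x, window_length, stride):
    """Hypothetical stand-in for illustration, not the symjax implementation."""
    x = np.asarray(x)
    # All windows over the last axis: shape (..., T - window_length + 1, window_length).
    windows = sliding_window_view(x, window_length, axis=-1)
    # Keep every `stride`-th window; the patch count stays in dimension -2.
    return windows[..., ::stride, :]


x = np.arange(20).reshape(2, 10)
patches = patchify_1d_sketch(x, window_length=4, stride=2)
print(patches.shape)  # (2, 4, 4)
print(patches[0, 1])  # [2 3 4 5]
```

The result is a read-only view of the input, so copy it before modifying patches in place.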

symjax.data.patchify_2d(x, window_length, stride)[source]
symjax.data.train_test_split(*args, train_size=0.8, stratify=None, seed=None)[source]

split given data into two non-overlapping sets

Parameters:
  • *args (inputs) – the sets to be split by the function
  • train_size (scalar) – the amount of data to put in the first set: either an integer giving the actual number of samples to keep, or a ratio between 0 and 1
  • stratify (array (optional)) – the optional stratification guide used to split the arrays such that the class proportions given by the stratify array are kept in both sets
  • seed (integer (optional)) – the seed for the random number generator for reproducibility
Returns:

  • train_set (list) – returns the train data, the list has the members of *args split
  • test_set (list) – returns the test data, the list has the members of *args split

Example

import numpy
from symjax.data import train_test_split

x = numpy.random.randn(100, 4)
y = numpy.random.randn(100)

train, test = train_test_split(x, y, train_size=0.5)
print(train[0].shape, train[1].shape)
# (50, 4) (50,)
print(test[0].shape, test[1].shape)
# (50, 4) (50,)
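
To make the stratify and train_size descriptions more concrete, here is a rough NumPy sketch of a class-proportional split. It is an assumption about how such a split can be done, not the code used by symjax: stratified_split_indices is a hypothetical helper, and with an integer train_size the per-class rounding may not add up to exactly that count.

```python
import numpy as np


def stratified_split_indices(stratify, train_size=0.8, seed=None):
    """Hypothetical helper: indices for a class-proportional train/test split."""
    stratify = np.asarray(stratify)
    rng = np.random.default_rng(seed)
    # A float is read as a ratio; an integer is read as a number of samples.
    if isinstance(train_size, float):
        ratio = train_size
    else:
        ratio = train_size / len(stratify)
    train_idx, test_idx = [], []
    for label in np.unique(stratify):
        # Shuffle the members of this class, then cut them at the same ratio.
        members = rng.permutation(np.flatnonzero(stratify == label))
        n_train = int(round(ratio * len(members)))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return np.array(train_idx), np.array(test_idx)


labels = np.array([0] * 80 + [1] * 20)
train_idx, test_idx = stratified_split_indices(labels, train_size=0.5, seed=0)
print(np.bincount(labels[train_idx]))  # [40 10]
print(np.bincount(labels[test_idx]))   # [40 10]
```

Both sets keep the 80/20 class balance of the original labels, and no index appears in both.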
class symjax.data.batchify(*args, batch_size, option='random', load_func=None, extra_process=0, n_batches=None)[source]
symjax.data.resample_images(images, target_shape, ratio='same', order=1, mode='nearest', data_format='channels_first')[source]
symjax.data.download_dataset(path, dataset, urls_names, baseurl='', extract=False)[source]

dataset downloading utility

Args:

path: string
the path where the dataset should be downloaded
dataset: string
the name of the dataset, used as the folder name
urls_names: dict
dictionary mapping urls to filenames. If the urls have a common root, it can be omitted from this variable and put into the baseurl argument
baseurl: string
the common url to prepend onto each url in urls_names
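
The relationship between urls_names and baseurl described above can be sketched as follows. This only illustrates the documented "prepend" semantics and performs no download; resolve_downloads is a hypothetical helper, and the host and file names are placeholders.

```python
def resolve_downloads(urls_names, baseurl=""):
    """Hypothetical helper: pair each full URL with its target filename."""
    # Prepend the common root to every key, as the baseurl description states.
    return {baseurl + url: filename for url, filename in urls_names.items()}


# Placeholder host and file names, for illustration only.
urls_names = {"train.npz": "train.npz", "test.npz": "test.npz"}
resolved = resolve_downloads(urls_names, baseurl="https://example.org/data/")
for url, filename in resolved.items():
    print(url, "->", filename)
```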
symjax.data.extract_file(filename, target)[source]
symjax.data.mnist.load(path=None)[source]

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • train_images (array)
  • train_labels (array)
  • valid_images (array)
  • valid_labels (array)
  • test_images (array)
  • test_labels (array)
symjax.data.emnist.load(option='byclass', path=None)[source]

Grayscale digit/letter classification.

The EMNIST Dataset

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik

The MARCS Institute for Brain, Behaviour and Development Western Sydney University Penrith, Australia 2751

Email: g.cohen@westernsydney.edu.au

The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 (https://www.nist.gov/srd/nist-special-database-19) and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset (http://yann.lecun.com/exdb/mnist/). Further information on the dataset contents and conversion process can be found in the paper available at https://arxiv.org/abs/1702.05373v1.

The dataset is provided in two file formats. Both versions of the dataset contain identical information, and are provided entirely for the sake of convenience. The first dataset is provided in a Matlab format that is accessible through both Matlab and Python (using the scipy.io.loadmat function). The second version of the dataset is provided in the same binary format as the original MNIST dataset as outlined in http://yann.lecun.com/exdb/mnist/
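
For readers working with the second format, the sketch below decodes the IDX layout used by the original MNIST files: a 4-byte header (two zero bytes, a data-type code, the number of dimensions), one big-endian 32-bit size per dimension, then the raw values. parse_idx is a hypothetical helper that handles unsigned-byte data only, and the payload is synthetic rather than a real EMNIST file.

```python
import gzip
import struct

import numpy as np


def parse_idx(raw):
    """Decode an IDX byte string (unsigned-byte data only) into an array."""
    # Header: two zero bytes, a data-type code, and the number of dimensions.
    zero1, zero2, dtype_code, ndim = struct.unpack(">BBBB", raw[:4])
    if (zero1, zero2) != (0, 0) or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)


# Synthetic stand-in for a gzipped images file: two 2x3 "images".
header = struct.pack(">BBBB", 0, 0, 8, 3) + struct.pack(">III", 2, 2, 3)
payload = header + bytes(range(12))
images = parse_idx(gzip.decompress(gzip.compress(payload)))
print(images.shape)  # (2, 2, 3)
```

With a real file, the bytes would come from something like gzip.open(filename, "rb").read() instead of the synthetic payload.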

There are six different splits provided in this dataset. A short summary of the dataset is provided below:

  • EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
  • EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
  • EMNIST Balanced: 131,600 characters. 47 balanced classes.
  • EMNIST Letters: 145,600 characters. 26 balanced classes.
  • EMNIST Digits: 280,000 characters. 10 balanced classes.
  • EMNIST MNIST: 70,000 characters. 10 balanced classes.

The full complement of the NIST Special Database 19 is available in the ByClass and ByMerge splits. The EMNIST Balanced dataset contains a set of characters with an equal number of samples per class. The EMNIST Letters dataset merges a balanced set of the uppercase and lowercase letters into a single 26-class task. The EMNIST Digits and EMNIST MNIST dataset provide balanced handwritten digit datasets directly compatible with the original MNIST dataset.

Please refer to the EMNIST paper (available at https://arxiv.org/abs/1702.05373v1) for further details of the dataset structure.

Please cite the following paper when using or referencing the dataset:

Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from http://arxiv.org/abs/1702.05373

The dataset consists of the following files:

.
+-- gzip.zip
|   +-- emnist-balanced-mapping.txt
|   +-- emnist-balanced-test-images-idx3-ubyte.gz
|   +-- emnist-balanced-test-labels-idx1-ubyte.gz
|   +-- emnist-balanced-train-images-idx3-ubyte.gz
|   +-- emnist-balanced-train-labels-idx1-ubyte.gz
|   +-- emnist-byclass-mapping.txt
|   +-- emnist-byclass-test-images-idx3-ubyte.gz
|   +-- emnist-byclass-test-labels-idx1-ubyte.gz
|   +-- emnist-byclass-train-images-idx3-ubyte.gz
|   +-- emnist-byclass-train-labels-idx1-ubyte.gz
|   +-- emnist-bymerge-mapping.txt
|   +-- emnist-bymerge-test-images-idx3-ubyte.gz
|   +-- emnist-bymerge-test-labels-idx1-ubyte.gz
|   +-- emnist-bymerge-train-images-idx3-ubyte.gz
|   +-- emnist-bymerge-train-labels-idx1-ubyte.gz
|   +-- emnist-digits-mapping.txt
|   +-- emnist-digits-test-images-idx3-ubyte.gz
|   +-- emnist-digits-test-labels-idx1-ubyte.gz
|   +-- emnist-digits-train-images-idx3-ubyte.gz
|   +-- emnist-digits-train-labels-idx1-ubyte.gz
|   +-- emnist-letters-mapping.txt
|   +-- emnist-letters-test-images-idx3-ubyte.gz
|   +-- emnist-letters-test-labels-idx1-ubyte.gz
|   +-- emnist-letters-train-images-idx3-ubyte.gz
|   +-- emnist-letters-train-labels-idx1-ubyte.gz
|   +-- emnist-mnist-mapping.txt
|   +-- emnist-mnist-test-images-idx3-ubyte.gz
|   +-- emnist-mnist-test-labels-idx1-ubyte.gz
|   +-- emnist-mnist-train-images-idx3-ubyte.gz
|   +-- emnist-mnist-train-labels-idx1-ubyte.gz
+-- matlab.zip
|   +-- emnist-balanced.mat
|   +-- emnist-byclass.mat
|   +-- emnist-bymerge.mat
|   +-- emnist-digits.mat
|   +-- emnist-letters.mat
|   +-- emnist-mnist.mat
+-- Readme.txt
symjax.data.fashionmnist.load(path=None)[source]

Grayscale image classification

Zalando's article image classification. Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

symjax.data.dsprites.load(path=None)[source]

greyscale image classification and disentanglement

dSprites is a dataset of 737,280 images of 2D shapes, procedurally generated from 6 ground truth independent latent factors: color, shape, scale, rotation, and the x and y positions of a sprite. Color takes a single value (white), so only 5 of these factors actually vary. This data can be used to assess the disentanglement properties of unsupervised learning methods.

All possible combinations of these latents are present exactly once, generating N = 737280 total images.

https://github.com/deepmind/dsprites-dataset
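
The total of 737280 images follows from multiplying the number of values each latent factor takes. The per-factor counts below are taken from the dSprites repository README rather than from this page, so treat them as an outside assumption.

```python
import math

# Number of values per latent factor, as listed in the dSprites repository README.
factor_sizes = {
    "color": 1,
    "shape": 3,
    "scale": 6,
    "orientation": 40,
    "position_x": 32,
    "position_y": 32,
}

n_images = math.prod(factor_sizes.values())
print(n_images)  # 737280
```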

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • images (array)
  • latent (array)
  • classes (array)

symjax.data.svhn.load(path=None)[source]

Street number classification.

The SVHN dataset is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

Parameters:path (str (optional)) – default $DATASET_PATH, the path to look for the data and where the data will be downloaded if not present
Returns:
  • train_images (array)
  • train_labels (array)
  • test_images (array)
  • test_labels (array)
symjax.data.cifar10.load(path=None)[source]

Image classification.

The CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

Parameters:path (str (optional)) – default ($DATASET_PATH), the path to look for the data and where the data will be downloaded if not present
Returns:
  • train_images (array)
  • train_labels (array)
  • test_images (array)
  • test_labels (array)
symjax.data.cifar100.load(path=None)[source]

Image classification.

The CIFAR-100 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).

symjax.data.stl10.load(path=None)[source]

Image classification with extra unlabeled images.

The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training. The primary challenge is to make use of the unlabeled data (which comes from a similar but different distribution from the labeled data) to build a useful prior. We also expect that the higher resolution of this dataset (96x96) will make it a challenging benchmark for developing more scalable unsupervised learning methods.

Parameters:path (str (optional)) – the path to look for the data and where it will be downloaded if not present
Returns:
  • train_images (array) – the training images
  • train_labels (array) – the training labels
  • test_images (array) – the test images
  • test_labels (array) – the test labels
  • extra_images (array) – the unlabeled additional images
