Dataset

A dataset (or data set) is a collection of data that is used for training a machine learning model.

Machine learning typically works with three datasets:

  • Training dataset

    The actual dataset that we use to train the model. The model learns weights and parameters from this data.

  • Validation dataset

    The validation set is used to evaluate a given model during the training process. It helps machine learning engineers fine-tune hyperparameters during model development. The model doesn't learn from the validation dataset, and the validation dataset is optional.

  • Test dataset

    The test dataset provides the gold standard used to evaluate the model. It is used only once a model is completely trained. The test dataset should accurately reflect how the model will perform on new data.

See Jason Brownlee’s article for more detail.
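
In DJL, these roles correspond to the Dataset.Usage enum (TRAIN, VALIDATION, TEST) accepted by most built-in dataset builders, and a RandomAccessDataset can be divided into subsets with randomSplit. The following is only a minimal sketch, assuming a recent DJL version and an already-built RandomAccessDataset; the helper name SplitExample and the 8:1:1 ratio are arbitrary:

```java
import ai.djl.training.dataset.RandomAccessDataset;

public final class SplitExample {

    /** Split one prepared dataset into training, validation, and test subsets. */
    public static RandomAccessDataset[] split(RandomAccessDataset dataset) throws Exception {
        // The 8:1:1 ratio is just an example; adjust it to your data.
        RandomAccessDataset[] subsets = dataset.randomSplit(8, 1, 1);
        RandomAccessDataset trainingSet = subsets[0];   // the model learns its weights from this
        RandomAccessDataset validationSet = subsets[1]; // used to tune hyperparameters during development
        RandomAccessDataset testSet = subsets[2];       // evaluated only once, after training is complete
        return new RandomAccessDataset[] {trainingSet, validationSet, testSet};
    }
}
```

Datasets that ship with predefined splits, such as MNIST below, instead expose those splits directly through optUsage on their builders.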

Basic Dataset

DJL provides a number of built-in basic and standard datasets. These datasets are used to train deep learning models. This module contains the following datasets:

CV

Image Classification

  • MNIST - A small and fast handwritten digits dataset
  • Fashion MNIST - A small and fast clothing type detection dataset
  • CIFAR10 - A dataset consisting of 60,000 32x32 color images in 10 classes
  • ImageNet - An image database organized according to the WordNet hierarchy

    Note: You have to manually download the ImageNet dataset due to licensing requirements.
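
As a quick illustration of how these built-in datasets are used, the sketch below loads the MNIST training split and iterates over its batches. The batch size of 32 and the class name MnistExample are arbitrary; the data is downloaded and cached automatically on first use:

```java
import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.Batch;
import ai.djl.training.dataset.Dataset;
import ai.djl.training.util.ProgressBar;

public final class MnistExample {

    public static void main(String[] args) throws Exception {
        // Build the MNIST training split with shuffled batches of 32 images.
        Mnist mnist = Mnist.builder()
                .optUsage(Dataset.Usage.TRAIN)
                .setSampling(32, true)
                .build();
        mnist.prepare(new ProgressBar()); // downloads and caches the data on first use

        // Iterate over the batches; each Batch pairs image data with digit labels.
        try (NDManager manager = NDManager.newBaseManager()) {
            for (Batch batch : mnist.getData(manager)) {
                // batch.getData().head() holds the images, batch.getLabels().head() the labels
                batch.close();
            }
        }
    }
}
```

The other image classification datasets listed above follow the same builder pattern.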

Object Detection

  • Pikachu - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
  • Banana Detection - A small dataset for testing single object detection

Other CV

  • Captcha - A dataset for a grayscale 6-digit CAPTCHA task
  • Coco - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances

    Note: You have to manually add the com.twelvemonkeys.imageio:imageio-jpeg:3.11.0 dependency to your project.

NLP

Text Classification and Sentiment Analysis

  • AmazonReview - A sentiment analysis dataset of Amazon Reviews with their ratings
  • Stanford Movie Review - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
  • GoEmotions - A dataset classifying 50k curated Reddit comments into 27 emotion categories or neutral

Unlabeled Text

  • Penn Treebank Text - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
  • WikiText2 - A collection of over 100 million tokens extracted from Good and Featured articles on Wikipedia

Other NLP

Tabular

Time Series