Research Guides: Business Analytics: Large Data Sets

Autonomous Vehicle Datasets

Waymo Open Dataset
The Waymo Open Dataset is a high-quality multimodal sensor dataset for autonomous driving. It comprises high-resolution sensor data collected by Waymo self-driving vehicles. The dataset covers a wide variety of environments, from dense urban centers to suburban landscapes, and includes data collected during day and night, at dawn and dusk, in sunshine and rain.
ArgoAI Argoverse
Argoverse is a public repository of self-driving-car development data, including high-definition maps. Argoverse includes 3D tracking annotations for 113 scenes and 324,557 interesting vehicle trajectories for motion forecasting.
nuScenes
The nuScenes dataset is a public large-scale dataset for autonomous driving developed by Aptiv Autonomous Mobility (formerly nuTonomy). It covers 1000 driving scenes in Boston and Singapore, two cities that are known for their dense traffic and highly challenging driving situations. The 20-second scenes are manually selected to show a diverse and interesting set of driving maneuvers, traffic situations and unexpected behaviors. The full dataset includes approximately 1.4M camera images, 390k LIDAR sweeps, 1.4M RADAR sweeps and 1.4M object bounding boxes in 40k keyframes.
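The approximate figures quoted for nuScenes can be cross-checked with simple arithmetic. The sketch below uses only the rounded counts from the description; the variable names are illustrative and are not part of any nuScenes tooling.

```python
# Approximate figures taken from the nuScenes description above.
NUM_SCENES = 1000
SCENE_SECONDS = 20
NUM_KEYFRAMES = 40_000
NUM_BOXES = 1_400_000

# Total recorded driving time across all scenes.
total_seconds = NUM_SCENES * SCENE_SECONDS

# Derived estimates (only as precise as the rounded inputs).
keyframes_per_second = NUM_KEYFRAMES / total_seconds
boxes_per_keyframe = NUM_BOXES / NUM_KEYFRAMES

print(f"Total driving time: {total_seconds / 3600:.1f} hours")
print(f"Keyframes per second: {keyframes_per_second:.1f}")
print(f"Average boxes per keyframe: {boxes_per_keyframe:.0f}")
```

With these rounded inputs the estimates come to roughly 5.6 hours of driving, 2 keyframes per second, and 35 bounding boxes per keyframe.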
The KITTI Vision Benchmark Suite - Andreas Geiger
Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth prediction, depth map completion, 2D and 3D object detection and object tracking. In addition, several raw data recordings are provided. The datasets are captured by driving around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30 pedestrians are visible per image.

Collections of Large Datasets

Open Data on AWS Powered by AWS Cloud Computing
This registry exists to help people discover and share datasets that are available via AWS resources. Datasets include: Landsat 8, IRS 990 Filings, Terrain Tiles, NEXRAD, SpaceNet, Global Database of Events, Language and Tone (GDELT), New York City Taxi and Limousine Commission (TLC) Trip Record Data, Amazon Customer Reviews Dataset, etc.
Stanford Large Network Dataset Collection
Datasets include: Social networks, Networks with ground-truth communities, Communication networks, Citation networks, Collaboration networks, Amazon networks, Internet networks, Road networks, Autonomous systems, Signed networks, Location-based online social networks, Wikipedia networks, articles, and metadata, Twitter and Memetracker, etc.
Kaggle
Kaggle offers a no-setup, customizable, Jupyter Notebooks environment. Access free GPUs and a huge repository of community published data & code. Kaggle provides all the code & data you need to do your data science work. Use over 19,000 public datasets and 200,000 public notebooks to conquer any analysis in no time.
Arizona State University Network Data Collection
UC Irvine Network Data Repository
Library of datasets personally compiled by Tore Opsahl
Library of datasets personally compiled by Mark Newman

Image Large Datasets

MNIST
MNIST is one of the most popular deep learning datasets. It is a dataset of handwritten digits and contains a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for trying learning techniques and pattern recognition methods on real-world data while spending minimal time and effort on data preprocessing.
MS-COCO
COCO is a large-scale object detection, segmentation, and captioning dataset. It has several features: object segmentation, recognition in context, superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints.
ImageNet
ImageNet is a dataset of images that are organized according to the WordNet hierarchy. WordNet contains approximately 100,000 phrases and ImageNet has provided around 1000 images on average to illustrate each phrase.
Open Images Dataset
Open Images is a dataset of almost 9 million URLs for images. These images have been annotated with image-level labels and bounding boxes spanning thousands of classes. The dataset contains a training set of 9,011,219 images, a validation set of 41,260 images and a test set of 125,436 images.
VisualQA
VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language.
The Street View House Numbers (SVHN)
This is a real-world image dataset for developing object detection algorithms. It requires minimal data preprocessing. It is similar to the MNIST dataset mentioned in this list, but has more labelled data (over 600,000 images). The data has been collected from house numbers viewed in Google Street View.
CIFAR-10
This dataset is another one for image classification. It consists of 60,000 images in 10 classes. In total, there are 50,000 training images and 10,000 test images. The dataset is divided into 6 parts: 5 training batches and 1 test batch. Each batch has 10,000 images.
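The CIFAR-10 split can be verified with a short calculation. The sketch below uses only the counts from the description; the names are illustrative, and the per-class figure assumes the classes are equally sized, which the description does not state.

```python
# Counts taken from the CIFAR-10 description above.
TOTAL_IMAGES = 60_000
NUM_CLASSES = 10
TRAIN_BATCHES = 5
TEST_BATCHES = 1
IMAGES_PER_BATCH = 10_000

train_images = TRAIN_BATCHES * IMAGES_PER_BATCH
test_images = TEST_BATCHES * IMAGES_PER_BATCH

# The six batches should account for every image in the dataset.
assert train_images + test_images == TOTAL_IMAGES

# Assumption: images are spread evenly across the classes.
images_per_class = TOTAL_IMAGES // NUM_CLASSES

print(f"Training images: {train_images}")
print(f"Test images: {test_images}")
print(f"Images per class (if balanced): {images_per_class}")
```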
Fashion-MNIST
Fashion-MNIST consists of 60,000 training images and 10,000 test images. It is an MNIST-like fashion product database. The developers believe MNIST has been overused, so they created this as a direct replacement for that dataset. Each image is in greyscale and associated with a label from 10 classes.

Natural Language Processing Large Datasets

IMDB Reviews
This is a dream dataset for movie lovers. It is meant for binary sentiment classification and has far more data than any previous datasets in this field. Apart from the training and test review examples, there is further unlabeled data for use as well. Raw text and preprocessed bag of words formats have also been included.
Sentiment140
Sentiment140 is a dataset that can be used for sentiment analysis. A popular dataset, it is a good starting point for NLP. Emoticons have been pre-removed from the data. The final dataset has the following six features: polarity of the tweet, id of the tweet, date of the tweet, the query, username of the tweeter, text of the tweet.
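The six features listed for Sentiment140 map naturally onto a small record type. The sketch below is a minimal illustration that assumes the data is a headerless CSV with the columns in the order given above; the `Tweet` class, the `parse_rows` helper, and the sample row are all hypothetical and not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Tweet:
    """One record with the six features named in the description."""
    polarity: str
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


# Made-up sample row for illustration only; not real dataset content.
SAMPLE = (
    '"0","1","Mon Jan 01 00:00:00 UTC 2024",'
    '"NO_QUERY","example_user","stuck in traffic again"\n'
)


def parse_rows(handle):
    """Yield a Tweet for every row that has exactly six columns."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # skip malformed rows rather than guessing
        yield Tweet(*row)


tweets = list(parse_rows(io.StringIO(SAMPLE)))
print(tweets[0].user, "->", tweets[0].text)
```

To read a real file instead of the sample string, pass an open file handle to `parse_rows`; the file's actual encoding and column order should be checked against the dataset's own documentation first.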
Yelp Reviews
This is an open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes and over 200,000 pictures from multiple metropolitan areas. This is a very commonly used dataset for NLP challenges globally.
The Wikipedia Corpus
This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. What makes this a powerful NLP dataset is that you can search by word, phrase or part of a paragraph.
The Blog Authorship Corpus
This dataset consists of blog posts collected from thousands of bloggers and has been gathered from blogger.com. Each blog is provided as a separate file. Each blog contains a minimum of 200 occurrences of commonly used English words.

Audio/Speech Datasets

Free Spoken Digit Dataset
The Free Spoken Digit Dataset was created to solve the task of identifying spoken digits in audio samples. It is an open dataset, so the hope is that it will keep growing as people contribute more samples. Currently, it has the following characteristics: 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations.
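The recording count follows directly from the other figures: 3 speakers, 10 digits, and 50 recordings of each digit per speaker. The sketch below enumerates that grid to confirm the total; the integer speaker labels and the tuple layout are illustrative assumptions, not the dataset's actual naming scheme.

```python
from itertools import product

# Figures taken from the Free Spoken Digit Dataset description above.
NUM_SPEAKERS = 3
DIGITS = range(10)           # spoken digits 0 through 9
RECORDINGS_PER_DIGIT = 50    # per speaker

# Enumerate every (speaker, digit, take) combination.
# Speaker indices are placeholders, not real speaker names.
recordings = list(
    product(range(NUM_SPEAKERS), DIGITS, range(RECORDINGS_PER_DIGIT))
)

print(f"Total recordings: {len(recordings)}")
```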
Free Music Archive (FMA)
FMA is a dataset for music analysis. The dataset consists of full-length and HQ audio, pre-computed features, and track and user-level metadata. It is an open dataset created for evaluating several tasks in music information retrieval (MIR).
Million Song Dataset
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The dataset does not include any audio, only the derived features. The sample audio can be fetched from services like 7digital, using code provided by Columbia University.
LibriSpeech
This dataset is a large-scale corpus of around 1000 hours of English speech. The data has been sourced from audiobooks from the LibriVox project. It has been segmented and aligned. If you're looking for a starting point, check out the prepared acoustic models trained on this dataset at kaldi-asr.org and the language models, suitable for evaluation, at http://www.openslr.org/11/.
VoxCeleb
VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males make up 55%). The celebrities span a diverse range of accents, professions and ages. There is no overlap between the development and test sets. It is an intriguing use case for isolating and identifying which celebrity a voice belongs to.

Analytics Vidhya Practice Problems

Twitter Sentiment Analysis
Hate speech in the form of racism and sexism has become a nuisance on Twitter, and it is important to separate these sorts of tweets from the rest. In this practice problem, we provide Twitter data that has both normal and hate tweets. Your task as a Data Scientist is to identify which tweets are hate tweets and which are not.
Urban Sound Classification
This dataset consists of more than 8000 sound excerpts of urban sounds from 10 classes. This practice problem is meant to introduce you to audio processing in the usual classification scenario.