Sample datasets for AI, ML, and DL development
Datasets play a key role in AI, ML, and DL projects. In the past, it was difficult to find datasets for AI projects because organizations were reluctant to share data that might be used for the wrong purpose. In recent years, many communities and organizations have made datasets available online, free for anyone to use.
With so many datasets now available, finding the right one can be difficult. A dataset finder helps narrow the search. Some popular ones are:
· Kaggle, founded in 2010 and now a subsidiary of Google LLC, is one of the most popular dataset locations. Each dataset has a small community that discusses the specifics of the data and any accompanying code. A search box with filters for size, file type, license, tags, and last update makes it easy to find the required datasets.
· UCI Machine Learning Repository was created by the University of California, Irvine, School of Information and Computer Sciences. It is one of the oldest sources of datasets available online and has some interesting datasets.
· KDnuggets was started as a newsletter in 1993 by Gregory Piatetsky-Shapiro. It is a trusted site in the business and scientific communities and lists datasets from government agencies, exchanges, and research centers.
· Google Dataset Search was launched in 2018 and indexes around 25 million datasets. It lets you search for datasets by name and find links to where the data is located.
· Microsoft Datasets, created by Microsoft and the external research community, offers many curated datasets that can be used for research studies.
· VisualData contains datasets for building computer vision models and allows searching by category. You can search by computer vision task, such as segmentation, image captioning, image generation, object detection, face recognition, and so on.
The datasets you select should be clean, relevant to the problem, and large enough for the task. Some popular datasets are:
· ImageNet is one of the most popular datasets for image processing. Its widely used ILSVRC subset has around 1.2 million training images, with roughly 1,000 images for each of 1,000 classes.
· Breast Cancer Wisconsin (Diagnostic) Dataset is a breast cancer diagnostic dataset that can be used for classifier models. Its features are computed from digitized images of a fine needle aspirate (FNA) of a breast mass.
· Twitter Sentiment Analysis Dataset is a popular dataset for natural language processing models.
· MNIST Dataset is a classic dataset of handwritten digits, with around 70,000 grayscale images of 28x28 pixels.
· Spam SMS Classifier Dataset is a collection of around 5,500 SMS messages that can be used for spam detection, spam classification, and spam analysis.
· YouTube Dataset (YouTube-8M), provided by Google, has around 8M classified YouTube videos and their IDs. It can be used for video classification and analysis.
· Boston Housing Price Dataset has 506 cases with 14 attributes each, based on Boston real estate, and is well suited to regression problems.
· Pima Indians Diabetes Dataset contains diabetes-related records of female patients who are at least 21 years old. It has 768 data points with several medical predictor variables such as the number of pregnancies, BMI, insulin level, and age.
· MS-COCO Dataset has 330K images, 80 object categories, and 5 captions per image, with around 25GB of data. It can be used for building image captioning, object detection, and segmentation models.
· VisualQA Dataset has around 265K images, with at least 3 questions per image and 10 ground-truth answers per question. It can be used for building visual Q&A models that answer open-ended questions about images.
· Fashion-MNIST Dataset has around 70K images in 10 classes and can be used in the same way as MNIST for building fashion-related models.
· IMDB Reviews Dataset has around 50K highly polar movie reviews that can be used for sentiment classification and analysis.
· The Wikipedia Corpus Dataset has more than 4 million articles with around 1.9 billion words, a huge dataset that can be used for natural language processing problems.
· Free Music Archive (FMA) has around 100K tracks and the Million Song Dataset has around a million songs. Both are good datasets for building audio analysis models.
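MNIST, mentioned above, is distributed in a simple binary layout usually called IDX. The sketch below shows one way such an image file could be decoded with only the Python standard library. It assumes the commonly documented layout: four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function name parse_idx_images and the tiny synthetic payload are made up for illustration and are not part of any library.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions


def parse_idx_images(raw):
    """Decode uncompressed IDX image bytes into nested lists of pixel values."""
    # Header: magic number, image count, rows, columns (big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number: {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel payload does not match the header")
    images = []
    for i in range(count):
        start = i * rows * cols
        # Slice the flat byte string into one list of ints per image row.
        image = [
            list(pixels[start + r * cols : start + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
    return images


# Synthetic stand-in for a real file: two 2x2 images with made-up pixels.
sample = struct.pack(">IIII", IDX_IMAGE_MAGIC, 2, 2, 2) + bytes(
    [0, 255, 128, 64, 1, 2, 3, 4]
)
images = parse_idx_images(sample)
print(len(images), images[0])
```

The downloadable MNIST files are typically gzip-compressed, so in practice the bytes would first be read with something like gzip.open(path, "rb").read() before being passed to a parser of this kind.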
The following are links to some of the dataset finders and datasets mentioned above:
· https://www.kaggle.com/datasets
· http://yann.lecun.com/exdb/mnist/
· https://archive.ics.uci.edu/ml/datasets.php
· https://www.kdnuggets.com/datasets/index.html
· https://datasetsearch.research.google.com/
· https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
· https://www.kaggle.com/kazanova/sentiment140
· https://www.kaggle.com/uciml/sms-spam-collection-dataset
· https://www.kaggle.com/datasnaek/youtube-new
· https://www.kaggle.com/kumargh/pimaindiansdiabetescsv
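Since a selected dataset should be clean, it can help to run a quick check on a CSV file after downloading it from one of the links above. The following is a rough sketch using only the Python standard library. It counts rows, empty cells per column, and how often each label occurs. The function name summarize_csv, the column names, and the sample rows are all hypothetical and are not taken from any of the datasets listed.

```python
import csv
import io
from collections import Counter


def summarize_csv(text, label_column):
    """Count rows, empty cells per column, and label frequencies in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = Counter()
    labels = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
        labels[row[label_column]] += 1
    return {"rows": row_count, "missing": dict(missing), "labels": dict(labels)}


# Made-up rows for illustration only; not taken from any real dataset.
sample_csv = """age,bmi,outcome
50,33.6,1
31,,0
32,23.3,1
"""

summary = summarize_csv(sample_csv, "outcome")
print(summary)
```

For a real file, the text could be obtained with open(path, newline="").read(). A high count of missing cells or a heavily skewed label distribution is a hint that the dataset needs cleaning or rebalancing before training.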
Hope this was helpful.