Some additional NLP datasets:
20 Newsgroups: Approximately 20,000 documents from 20 different newsgroups. The content covers a variety of topics, some of them closely related to one another. There are three versions: the original, one with dates removed, and one with duplicates removed.
The WikiQA Corpus: Contains question and sentence pairs compiled from Bing query logs. There are over 3,000 questions and over 29,000 candidate sentences, of which just under 1,500 are labeled as answer sentences.
European Parliament Proceedings Parallel Corpus: Sentence pairs from European Parliament proceedings. It covers 21 European languages, including some that are less common in ML corpora.
Jeopardy: Over 200,000 questions from the famed TV show. Each entry includes the category and value, along with other fields such as the question, the answer, and the round.
Legal Case Reports Dataset: Text summaries of over 4,000 legal cases. It is well suited to training automatic text summarization models.
COVID-19 Twitter dataset: https://github.com/thepanacealab/covid19_twitter
Museum of Modern Art (MoMA) collection/artists dataset (images not included): https://github.com/MuseumofModernArt/collection
The Artists dataset contains 15,668 records, representing all the artists who have work in MoMA's collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, death year, Wiki QID, and Getty ULAN ID.
At this time, both datasets are available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly. The datasets are also available in JSON.
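The UTF-8 to UTF-16 conversion mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not MoMA's own tooling: the function name convert_utf8_csv_to_utf16, the file names, and the sample row are all hypothetical, and whether a given Excel version then imports the result cleanly is an assumption you should check on your own machine.

```python
import csv
import tempfile
from pathlib import Path


def convert_utf8_csv_to_utf16(src: Path, dst: Path) -> int:
    """Re-encode a UTF-8 CSV file as UTF-16 and return the number of rows written.

    Hypothetical helper for illustration; it is not part of the MoMA repository.
    """
    rows = 0
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte order mark.
    # Python's "utf-16" codec writes a byte order mark at the start of the output.
    with open(src, "r", encoding="utf-8-sig", newline="") as fin, \
            open(dst, "w", encoding="utf-16", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)
            rows += 1
    return rows


if __name__ == "__main__":
    # Made-up sample data, used only to exercise the function.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "artists_utf8.csv"
        dst = Path(tmp) / "artists_utf16.csv"
        src.write_text("name,nationality\nJoan Miró,Spanish\n", encoding="utf-8")
        count = convert_utf8_csv_to_utf16(src, dst)
        print(f"Wrote {count} rows to {dst.name}")
```

Going through the csv module rather than re-encoding raw bytes keeps quoted fields that contain commas or line breaks intact. If Excel still misreads the columns, a tab delimiter (csv.writer(fout, delimiter="\t")) is a common variation to try.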
Chinese and English OCR dataset: https://github.com/WenmuZhou/OCR_DataSet
Sound separation dataset and deep learning model released by Google:
dataset: (link missing)
model: (link missing)
"Google Open-Sources FUSS: The Free Universal Sound Separation Dataset | MarkTechPost" (link missing)
Holopix50k: a large-scale in-the-wild stereo image dataset:
Holopix50k is a large-scale dataset of 49,368 (~50k) stereoscopic image pairs collected from the popular Lightfield social media app Holopix™. This is the largest dataset of stereoscopic image pairs ever released publicly that contain in-the-wild scenarios captured from mobile phones. For context, the second-largest dataset in this category consists of only 1024 stereoscopic image pairs — almost 50 times less! The dataset is available for download immediately on the project page and also has an associated research paper.
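Yoga: fine-grained human pose classification dataset: https://arxiv.org/abs/2004.10362
Astropy: package for online astronomical data access: https://github.com/astropy/astroquery
The British Museum has put 1.9 million works of art online, free to use (Creative Commons 4.0 license): http://www.openculture.com/2020/04/the-british-museum-puts-1-9-million-works-of-art-online.html
Commonly used datasets for knowledge embedding: https://github.com/simonepri/datasets-knowledge-embedding
HybridQA: a large-scale multi-hop question answering dataset over tabular and textual data, focused on reasoning over heterogeneous information: https://github.com/wenhuchen/HybridQA
c4: a very large-scale web page dataset released by Google: https://www.tensorflow.org/datasets/catalog/c4
Dataset for source separation in noisy environments: https://github.com/JorisCos/LibriMix
SmoothNLP financial text datasets (public), Public Financial Datasets for NLP Researches Only: https://github.com/smoothnlp/FinancialDatasets
Online browser for NLP datasets and benchmark tasks: https://huggingface.co/nlp/viewer/
Drone detection/tracking image and video dataset: https://github.com/VisDrone/VisDrone-Dataset
MSeg: a composite dataset for multi-domain semantic segmentation: http://vladlen.info/papers/MSeg.pdf
Holistic large-scale video understanding dataset: https://github.com/holistic-video-understanding/HVU-Dataset
MetFaces Dataset: a dataset of faces from works of art: https://github.com/NVlabs/metfaces-dataset
3D60: an indoor spherical 3D panorama dataset: https://github.com/VCL3D/3D60
LEVEL 5 self-driving prediction dataset released by Lyft: https://self-driving.lyft.com/level5/prediction/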