What is the largest dataset in terms of number of tokens today?

Based on the available sources, the largest public dataset by token count appears to be Red Pajama 2, which was announced as having around 30 trillion tokens^[5]. Other large datasets include Nemotron-CC (6.3 trillion tokens)^[2], Dolma (3 trillion tokens)^[3], the Common Corpus (over 2 trillion tokens)^[4], Zyda (1.3 trillion tokens)^[1], and MINT-1T (one trillion tokens)^[6]. At a reported 30 trillion tokens, Red Pajama 2 is nearly five times the size of the next largest dataset listed here, Nemotron-CC.

Curated by Joan
