What is the largest dataset in terms of number of tokens today?

Based on the available sources, the largest public dataset by token count appears to be RedPajama-V2, which was announced as having around 30 trillion tokens[5]. Other large datasets include Nemotron-CC (6.3 trillion tokens)[2], Dolma (3 trillion tokens)[3], the Common Corpus (over 2 trillion tokens)[4], Zyda (1.3 trillion tokens)[1], and MINT-1T (one trillion tokens)[6]. Among these, RedPajama-V2 is the largest by a wide margin: its reported 30 trillion tokens is nearly five times the size of the next largest, Nemotron-CC.
