Based on the available sources, the largest public dataset by token count appears to be Red Pajama 2, which was announced as having around 30 trillion tokens[5]. Other large datasets, in descending order of size, include Nemotron‑CC (6.3 trillion tokens)[2], Dolma (3 trillion tokens)[3], the Common Corpus (over 2 trillion tokens)[4], Zyda (1.3 trillion tokens)[1], and MINT‑1T (one trillion tokens)[6]. Red Pajama 2's reported figure is nearly five times that of the next largest, Nemotron‑CC.