
FineWeb-C: A Community-Built Dataset For Improving Language Models In ALL Languages
FineWeb2 is a significant advance in multilingual pretraining datasets, covering over 1000 languages with high-quality data. The dataset comprises approximately 8 terabytes of compressed text and contains nearly 3 trillion words, sourced from 96 CommonCrawl snapshots between […]