CatchTheBull
NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

By WebDesk, May 7, 2025


Joerg Hiller
May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. This innovative pipeline optimizes data quality and quantity for superior AI model training.





NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, giving developers a way to curate high-quality datasets for large language models (LLMs). The Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl, and NVIDIA says it significantly improves LLM accuracy.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation: heuristic filtering often discards potentially useful data. By combining classifier ensembling with synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data and recovers up to 90% of the content lost to filtering.

Innovative Pipeline Features

The curation process begins with HTML-to-text extraction using tools such as jusText, followed by language identification with FastText. It then deduplicates the data, using NVIDIA RAPIDS libraries for efficient processing. Next, 28 heuristic filters screen for data quality, and a PerplexityFilter module refines the result further.
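The deduplication and heuristic-filtering stages can be pictured with a small standalone sketch. This is not the NeMo Curator API: the function names, the two toy filters, and their thresholds are all assumptions chosen for illustration, and the hashing runs on the CPU with the Python standard library rather than on RAPIDS.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Two hypothetical heuristic filters; the real pipeline is described as using 28.
def long_enough(doc: str, min_words: int = 5) -> bool:
    """Reject very short documents (threshold is an illustrative assumption)."""
    return len(doc.split()) >= min_words


def mostly_alphabetic(doc: str, min_ratio: float = 0.6) -> bool:
    """Reject documents dominated by digits or symbols (threshold is assumed)."""
    if not doc:
        return False
    kept = sum(ch.isalpha() or ch.isspace() for ch in doc)
    return kept / len(doc) >= min_ratio


HEURISTIC_FILTERS = [long_enough, mostly_alphabetic]


def curate(docs):
    """Deduplicate first, then keep only documents that pass every filter."""
    return [d for d in exact_dedup(docs) if all(f(d) for f in HEURISTIC_FILTERS)]


docs = [
    "Common Crawl pages vary widely in quality and length.",
    "common crawl pages   vary widely in quality and length.",
    "Buy now!!!",
    "$$$ 1234 5678 #### 9999 @@@@ 0000 %%%%",
]
curated = curate(docs)
```

Running deduplication before filtering means each filter only sees one copy of a repeated page, which is the same ordering the article describes.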

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
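One way to picture ensemble-based quality labeling is the sketch below. The article says only that an ensemble of classifiers sorts documents into quality levels; combining scores by taking the maximum, using five levels, and the two stand-in scoring functions are assumptions made for illustration, not NVIDIA's implementation.

```python
from typing import Callable, Sequence

Classifier = Callable[[str], float]  # returns a quality score in [0, 1]


def ensemble_score(doc: str, classifiers: Sequence[Classifier]) -> float:
    """Combine classifier scores with max (assumed rule): any single
    classifier that rates a document highly can promote it."""
    return max(clf(doc) for clf in classifiers)


def quality_level(score: float, num_levels: int = 5) -> int:
    """Map a score in [0, 1] to an integer level from 0 to num_levels - 1."""
    clamped = min(max(score, 0.0), 1.0)
    return min(int(clamped * num_levels), num_levels - 1)


def label(doc: str, classifiers: Sequence[Classifier]) -> int:
    """Quality level of a document under the ensemble."""
    return quality_level(ensemble_score(doc, classifiers))


# Hypothetical stand-ins for trained quality classifiers.
def length_signal(doc: str) -> float:
    """Longer documents score higher, saturating at 20 words."""
    return min(len(doc.split()) / 20.0, 1.0)


def punctuation_signal(doc: str) -> float:
    """Documents that end like a finished sentence get a moderate score."""
    return 0.5 if doc.rstrip().endswith((".", "?", "!")) else 0.0


classifiers = [length_signal, punctuation_signal]
levels = {doc: label(doc, classifiers) for doc in ["click here", "This is a sentence."]}
```

Downstream steps could then branch on the level, for example by routing low-level documents to rephrasing and high-level documents to QA-pair generation.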

Impact on LLM Training

Training LLMs on the Nemotron-CC dataset yields significant improvements, according to NVIDIA. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC scored 5.6 points higher on MMLU than models trained on traditional datasets. Models trained over a long token horizon on data that includes Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available for developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to optimize the pipeline for specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.
