News BlockFin

NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset that enhances pretraining for large language models through innovative data curation techniques.





NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). The dataset, derived from Common Crawl, aims to improve the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models such as Meta’s Llama series have been trained on datasets comprising up to 15 trillion tokens, the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Traditional datasets often sacrifice up to 90% of data to improve benchmark accuracies, limiting their utility for extensive training. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, one on which trained models surpass even Llama 3.1 8B, through advanced methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC’s efficacy is evidenced by its performance on various benchmarks. When training 8B-parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets such as DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B on multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset’s quality without compromising accuracy.
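
As a rough illustration of the ensembling idea, the sketch below scores each document with several quality classifiers and keeps the maximum score, so a document rated highly by any one classifier is retained. The two toy scorers, the max rule, and the 0.8 threshold are assumptions made for this example, not NVIDIA's published configuration.

```python
from typing import Callable, List

# Hypothetical stand-ins for model-based quality classifiers.
# Each maps a document to a score in [0, 1].
Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy scorer: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy scorer: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Combine classifiers by taking the highest individual score (assumed rule)."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(docs: List[str], scorers: List[Scorer],
                        threshold: float = 0.8) -> List[str]:
    """Keep documents whose ensemble score meets the (assumed) threshold."""
    return [d for d in docs if ensemble_score(d, scorers) >= threshold]


docs = ["spam spam spam spam", "a short but varied sentence", ""]
selected = select_high_quality(docs, [length_scorer, vocabulary_scorer])
```

In this toy run, the length scorer alone would select nothing, while the ensemble keeps the second document because the vocabulary scorer rates it highly. That is the sense in which an ensemble can admit a broader set of documents than any single classifier.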

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed roughly two trillion tokens to the dataset.
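
The three curation stages named above (language filtering, deduplication, quality classification) can be sketched in plain Python as follows. This is a minimal illustration and does not use the NeMo Curator API. The ASCII-ratio language check, the exact-match hashing, and the word-count quality label are simplified assumptions standing in for the model-based components a real pipeline would use.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Toy language filter based on the share of ASCII characters.
    A real pipeline would use a language-identification model."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Exact deduplication on case- and whitespace-normalized text."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_label(text: str) -> str:
    """Toy quality classifier: label by word count (assumed rule)."""
    return "high" if len(text.split()) >= 5 else "low"


def curate(docs: Iterable[str]) -> List[Tuple[str, str]]:
    """Run the three stages in order and attach a quality label."""
    english = (d for d in docs if looks_english(d))
    unique = deduplicate(english)
    return [(d, quality_label(d)) for d in unique]


raw = [
    "The quick brown fox jumps.",
    "the  quick brown fox jumps.",  # duplicate after normalization
    "Привет мир",                   # removed by the language filter
    "Too short",
]
curated = curate(raw)
```

Here the second document is dropped as a duplicate of the first, the third is removed by the language filter, and the remaining two are labeled "high" and "low" respectively.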

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains such as mathematics, to further enhance LLM capabilities.

Image source: Shutterstock



Copyright © 2024 News BlockFin.
News BlockFin is not responsible for the content of external sites.
