Wikipedia Releases New Dataset to Train AI Model, Fight Bot Menace

The dataset, now hosted on Google-owned Kaggle, includes structured Wikipedia content in English and French as of April 15

Wikipedia, the world’s most popular free online encyclopedia, has announced the release of a dataset designed specifically for training AI models. The goal is to provide high-quality, machine-readable data directly to developers, deterring bots from aggressively scraping the site.

Hosted on Google-owned Kaggle, the dataset includes structured Wikipedia content in English and French as of April 15. It features openly licensed research summaries, short descriptions, infobox data, image links, and article sections—excluding references and markdown formatting.

This structure makes the dataset well suited to AI-related tasks such as modeling, fine-tuning, alignment, and benchmarking. While Wikimedia already maintains content partnerships with Google and the Internet Archive, this new release is geared toward making Wikipedia’s data more accessible to smaller AI teams and independent researchers.

"The Wikimedia Foundation is the organisation that manages the data from wikipedia.org, the internet’s free encyclopedia. This data documents and describes the world in real time, with a foundational commitment to open access to data and information," Google said in a blog post.

Wikipedia has recently faced mounting pressure from AI bots that scrape its content to train generative AI models. This surge in automated traffic has led to increased server costs and slower load times for users.