Wikipedia Releases New Dataset to Train AI Model, Fight Bot Menace

The dataset, now hosted on Google-owned Kaggle, includes structured Wikipedia content in English and French as of April 15

Wikipedia, the world’s most popular free online encyclopedia, has announced the release of a dataset designed specifically for training AI models. The goal is to provide high-quality, machine-readable data directly to developers, deterring bots from aggressively scraping the site.

Hosted on Google-owned Kaggle, the dataset includes structured Wikipedia content in English and French as of April 15. It features openly licensed research summaries, short descriptions, infobox data, image links, and article sections—excluding references and markdown formatting.

This structure makes the dataset well suited to AI-related tasks such as modeling, fine-tuning, alignment, and benchmarking. While Wikimedia already maintains content partnerships with Google and the Internet Archive, this new release is geared toward making Wikipedia’s data more accessible to smaller AI teams and independent researchers.

"The Wikimedia Foundation is the organisation that manages the data from wikipedia.org, the internet’s free encyclopedia. This data documents and describes the world in real time, with a foundational commitment to open access to data and information," Google said in a blog post.

Wikipedia has recently faced mounting pressure from AI bots that scrape its content to train generative AI models. This surge in automated traffic has led to increased server costs and slower load times for users.