Wikipedia Dataset

Wikimedia Enterprise has released an early beta dataset on Hugging Face that the public can use freely and comment on to guide future improvements. The dataset is sourced from the Snapshot API, which delivers bulk database dumps, or “snapshots,” of Wikimedia projects. This release covers English and French Wikipedia articles. It is built on the new Structured Contents beta, which returns more machine-readable output so that users do not have to parse large article bodies themselves.

What’s in the dataset and how to use it

The dataset on Hugging Face includes all articles from the English and French Wikipedia editions, pre-parsed into structured JSON files with a consistent schema and compressed as zip files. Each JSON object represents one full Wikipedia article, with markup and non-prose sections (e.g., references) stripped out. The fields include article title, ID, abstract, version information, editor signals, revision scores, URLs, creation and modification timestamps, and more (see the dataset fields section).
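
As a rough illustration of working with the downloaded files, the sketch below iterates over the articles in one zip archive using only the Python standard library. It assumes each archive member is a JSON Lines file (one article object per line) encoded as UTF-8; the function name iter_articles is hypothetical, and the actual file layout should be checked against the dataset card.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_articles(zip_path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty line of every file in the archive.

    Assumption: members are UTF-8 JSON Lines files (one JSON object per line).
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # Wrap the binary stream so lines are decoded one at a time
                # instead of loading the whole member into memory.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, a caller can stop after the first few articles without decompressing the rest of the archive.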

The structured format suits every phase of model development: pre-training, fine-tuning, retrieval-augmented generation (RAG), testing, and benchmarking.

Dataset structure and fields

Each line/article in the dataset includes:

  • name: Article title
  • identifier: Article ID
  • abstract: Lead section summarizing the article
  • version: Metadata of the latest article revision
  • version.editor: Editor-specific signals for contextualizing revisions
  • version.scores: ML model-generated likelihood of a revision being reverted
  • url: Article URL
  • date_created: Timestamp of article creation or first revision
  • date_modified: Timestamp of the latest revision
  • main_entity: Wikidata QID associated with the article
  • is_part_of: Wikimedia project the article belongs to
  • additional_entities: Array of Wikidata entities in the article
  • in_language: Language in which the article is written
  • image: Main image of the article’s subject
  • license: Relevant licenses for content reuse
  • description: One-sentence summary of the article
  • infobox: Parsed information from the article’s side panel (infobox)
  • sections: Parsed sections of the article, including links

Note: The dataset excludes other media, images, lists, tables, and references. A detailed schema and data dictionary are available for reference.
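
To make the field list concrete, here is a minimal sketch that pulls a few of the fields above out of one article record. It assumes the dotted names (version.editor, version.scores) denote objects nested under version, and that any field may be absent from a given record. The helpers get_path and summarize are hypothetical, and the sample values are invented placeholders rather than real data.

```python
from typing import Any


def get_path(record: dict, dotted: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'version.editor' through nested dicts."""
    current: Any = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def summarize(record: dict) -> dict:
    """Collect a few fields, tolerating records where some are missing."""
    return {
        "title": record.get("name"),
        "id": record.get("identifier"),
        "modified": record.get("date_modified"),
        "editor": get_path(record, "version.editor"),
        "scores": get_path(record, "version.scores"),
        "has_infobox": bool(record.get("infobox")),
    }


# Hypothetical record: field names come from the list above, values are invented.
sample = {
    "name": "Example article",
    "identifier": 12345,
    "abstract": "An invented abstract.",
    "date_modified": "2024-01-01T00:00:00Z",
    "version": {"editor": {"name": "ExampleUser"}},
}
print(summarize(sample))
```

Returning a default for missing paths keeps a long batch job from failing on the first record that lacks, say, an infobox or revision scores.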

Data licensing and attribution

The dataset’s original text is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some content may be licensed solely under Creative Commons; the specific terms are set out in the Wikimedia Terms of Use. Proper attribution supports Wikipedia’s sustainability, because it encourages new editors and donors to keep contributing. Full attribution requirements are detailed on Hugging Face.

Accessing the dataset

The official Wikimedia Enterprise beta dataset is available on Hugging Face at huggingface.co/datasets/wikimedia/structured-wikipedia. Users can explore, download, or integrate the dataset directly into their machine learning workflows via Hugging Face tools and APIs.
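
One possible way to integrate the dataset is through the Hugging Face datasets library, sketched below. The repository ID comes from the URL above; everything else is an assumption to verify against the dataset card: the configuration naming scheme (snapshot date plus language code, e.g. "20240916.en"), the split name "train", and the helper names config_name and stream_titles, which are hypothetical.

```python
from typing import Iterator

REPO_ID = "wikimedia/structured-wikipedia"  # from the dataset URL


def config_name(snapshot_date: str, language: str) -> str:
    """Build a configuration name such as '20240916.en'.

    Assumption: configurations are named '<snapshot date>.<language code>'.
    """
    return f"{snapshot_date}.{language}"


def stream_titles(config: str, limit: int = 5) -> Iterator[str]:
    """Yield the first `limit` article titles without downloading the full set."""
    # Imported lazily so the module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True reads records on demand; split="train" is an assumption.
    dataset = load_dataset(REPO_ID, config, split="train", streaming=True)
    for index, article in enumerate(dataset):
        if index >= limit:
            break
        yield article["name"]


# Example (needs network access and the `datasets` package):
#     for title in stream_titles(config_name("20240916", "en")):
#         print(title)
```

Streaming is used here so that a quick inspection does not require downloading the complete archive first.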

Conclusion

This beta release on Hugging Face makes Wikimedia content more accessible for AI and machine learning applications. By offering structured data from the Snapshot API, Wikimedia Enterprise gives researchers, developers, and data scientists a pre-parsed knowledge resource, and their feedback will shape future versions of the dataset.