Name: CIA World Factbook Archive 1990-2025
Creator: Central Intelligence Agency
License: https://creativecommons.org/publicdomain/mark/1.0/

Disclosure
This archive presents public-domain data from the CIA World Factbook. All analytic frameworks, regional groupings, and classification standards referenced below originate from official U.S. Government publications and internationally recognized bodies. This project is not affiliated with the Central Intelligence Agency or the U.S. Government.

01 Primary Data Source

CIA World Factbook

All country data in this archive is sourced from the CIA World Factbook, a public-domain publication of the Central Intelligence Agency. The Factbook provides intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 266 world entities. Published annually since 1962; this archive covers the 1990–2025 editions (36 years).

Publisher: Central Intelligence Agency, United States Government
Classification: OPEN SOURCE // PUBLIC DOMAIN
URL: https://www.cia.gov/the-world-factbook/

02 Analytic Standards

ICD 203 — Analytic Standards

Intelligence Community Directive 203, "Analytic Standards," establishes the standards for analytic products produced by the Intelligence Community. This archive formats its dossier pages following ICD 203 principles including sourcing transparency, distinguishing between underlying intelligence and analytic judgment, and incorporating analysis of alternatives.

Issuing Authority: Director of National Intelligence (DNI)
Effective: 2 January 2015
Reference: https://www.dni.gov/files/documents/ICD/ICD-203.pdf

ICD 208 — Maximizing the Utility of Analytic Products

Intelligence Community Directive 208 provides guidance on the write-for-maximum-utility standard, influencing how intelligence assessments are structured and presented. Dossier formatting in this archive follows ICD 208 principles for clear, structured presentation.

Issuing Authority: Director of National Intelligence (DNI)
Effective: 17 December 2008
Reference: https://www.dni.gov/files/documents/ICD/ICD-208.pdf

Confidence Levels

This archive assigns confidence levels to data fields based on recency, following the general framework of ICD 203 confidence assessments adapted for structured data:
• HIGH: data from the current assessment year (0–1 years old)
• MODERATE: data 2–3 years older than the assessment year
• LOW: data 4 or more years older than the assessment year

Note: These confidence levels reflect data freshness only, not analytic confidence in the traditional intelligence sense. The CIA World Factbook updates fields on varying schedules; some fields may carry older data in recent editions.
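The recency rule can be sketched as a small function. This is an illustrative sketch rather than the archive's actual code: the function name and the choice to treat an age of exactly four years as LOW are assumptions.

```python
def confidence_level(data_year: int, assessment_year: int) -> str:
    """Map the age of a data point to a freshness badge.

    Assumption: ages are whole years, and anything 4 or more years
    old is LOW.
    """
    age = assessment_year - data_year
    if age < 0:
        raise ValueError("data year is later than the assessment year")
    if age <= 1:
        return "HIGH"
    if age <= 3:
        return "MODERATE"
    return "LOW"
```

For example, a field dated 2022 in the 2025 edition would be badged MODERATE under this sketch.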

03 Regional Framework — Unified Command Plan

DoD Unified Command Plan (UCP)

This archive organizes countries by U.S. Department of Defense Combatant Command (COCOM) areas of responsibility as defined in the Unified Command Plan. The UCP is a classified document signed by the President; however, COCOM area-of-responsibility boundaries are publicly available through DoD publications and official command websites.

The six geographic combatant commands used in this archive (sovereign states + territories):
• EUCOM: U.S. European Command (53 entities, Europe & Eurasia)
• CENTCOM: U.S. Central Command (21 entities, Middle East & Central/South Asia)
• INDOPACOM: U.S. Indo-Pacific Command (49 entities, Asia-Pacific & Indian Ocean)
• AFRICOM: U.S. Africa Command (57 entities, Africa)
• SOUTHCOM: U.S. Southern Command (46 entities, Central & South America, Caribbean)
• NORTHCOM: U.S. Northern Command (6 entities, North America)

Issuing Authority: President of the United States / Secretary of Defense
Current Version: 2024 (most recent publicly acknowledged revision)
Reference: https://www.defense.gov/Spotlights/Unified-Command-Plan/

04 Country Groupings — Time Series Presets

P5 (UNSC Permanent Members)

United States, China, Russia, United Kingdom, France

The five permanent members of the United Nations Security Council, designated by Article 23 of the UN Charter (1945), each of which holds veto power. Sometimes referred to as the "Permanent Five" or "P5."

Source: Charter of the United Nations, Chapter V, Article 23
Reference: https://www.un.org/en/about-us/un-charter/chapter-5

G7 (Group of Seven)

United States, United Kingdom, Germany, France, Japan, Italy, Canada

An intergovernmental political and economic forum of seven major advanced economies. The G7 does not have a formal charter; membership has been stable since Canada joined in 1976 (Russia participated as G8 from 1997 to 2014).

Source: G7 official communications
Reference: https://www.g7germany.de/g7-en/g7-and-g20/what-is-the-g7-

BRICS

China, India, Brazil, Russia, South Africa

Originally coined as "BRIC" by Goldman Sachs economist Jim O'Neill (2001) to describe four emerging economies. South Africa joined in 2010. The grouping formalized through annual summits beginning in 2009. In 2024, the grouping expanded (BRICS+) to include Egypt, Ethiopia, Iran, Saudi Arabia, and the UAE, though this archive uses the original five for the preset.

Source: BRICS Joint Statistical Publication; Johannesburg II Declaration (2023)
Reference: https://brics2024.gob.ru/en

NATO Select

United States, United Kingdom, Germany, France, Turkey, Poland, Norway

A representative sample of NATO member states selected by military expenditure, geographic coverage, and strategic importance. NATO has 32 member states as of 2024 (after Finland and Sweden accession). The full membership list is publicly available.

Source: North Atlantic Treaty (Washington Treaty, 1949); NATO official member list
Reference: https://www.nato.int/cps/en/natohq/nato_countries.htm

Near-Peer Competitors

China, Russia

Terminology from the U.S. National Defense Strategy (NDS). The 2022 NDS identifies the People's Republic of China as "the most consequential strategic competitor" and the Russian Federation as an "acute threat." The 2018 NDS used the term "great power competition" to describe strategic rivalry with both nations.

Source: 2022 National Defense Strategy of the United States of America
Issuing Authority: U.S. Department of Defense
Reference: https://www.defense.gov/National-Defense-Strategy/

Regional Powers

India, Brazil, Turkey, Saudi Arabia, Iran, Indonesia

States with significant military, economic, and political influence within their respective regions but not classified as global "great powers" in U.S. strategic documents. This grouping draws from international relations scholarship and aligns with how these states are discussed in the National Security Strategy and Defense Intelligence Agency threat assessments.

Sources:
• National Security Strategy of the United States (2022)
• DIA Worldwide Threat Assessment
• Academic consensus in international relations (e.g., Buzan & Wæver, "Regions and Powers," Cambridge University Press, 2003)

Africa Top 5 (by Population)

Nigeria, Ethiopia, Egypt, Democratic Republic of the Congo, Tanzania

The five most populous countries on the African continent, as reported in the CIA World Factbook. This grouping is derived directly from Factbook population data.

Indo-Pacific

Japan, South Korea, Australia, India, Indonesia, Thailand, Philippines

Key partner nations and allies within the U.S. Indo-Pacific Command (INDOPACOM) area of responsibility. Selection reflects the nations most frequently referenced in the Indo-Pacific Strategy of the United States (2022).

Source: Indo-Pacific Strategy of the United States (February 2022)
Issuing Authority: The White House
Reference: https://www.whitehouse.gov/briefing-room/speeches-remarks/2022/02/11/fact-sheet-indo-pacific-strategy-of-the-united-states/

05 Archive Construction Methodology

Overview
The CIA World Factbook has been published in different formats over 36 years. Building a unified archive required acquiring data from three distinct sources, writing format-specific parsers for each era, and then normalizing the results into a single relational schema. The ETL pipeline ran in seven sequential steps, each building on the previous.

Step 1 — HTML Editions (2000–2020)

Source: CIA World Factbook zip archives retrieved from the Internet Archive Wayback Machine. Each annual edition was published as a downloadable .zip containing one HTML file per country. The HTML structure took five distinct forms across these 21 years, each requiring its own parser:
• 2000: Classic format, <b>FieldName:</b> followed by plain text (parse_classic)
• 2001–2008: Table-based format, <td class="FieldLabel"> cells in nested tables (parse_table_format)
• 2009–2014: CollapsiblePanel divs with JavaScript show/hide sections (parse_collapsiblepanel_format)
• 2015–2017: Expand/collapse h2 sections with anchor-based navigation (parse_expandcollapse_format)
• 2018–2020: Modern field-anchor divs with structured class names (parse_modern_format)

Each parser uses BeautifulSoup to extract country name, FIPS code, categories, field names, and content text. HTML entities and tags are stripped from content. A known-good Wayback Machine timestamp was identified for each year to ensure consistent, complete snapshots.
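A minimal sketch of how an edition year could be routed to its parser, together with a toy extractor for the classic `<b>FieldName:</b>` layout. The parser names come from the list above; everything else is an assumption for illustration. In particular, the real parsers use BeautifulSoup, whereas this sketch swaps in the standard-library `re` and `html` modules so that it runs without dependencies, and the sample markup is invented.

```python
import re
from html import unescape

# (first_year, last_year, parser name) taken from the era list above.
HTML_PARSERS = [
    (2000, 2000, "parse_classic"),
    (2001, 2008, "parse_table_format"),
    (2009, 2014, "parse_collapsiblepanel_format"),
    (2015, 2017, "parse_expandcollapse_format"),
    (2018, 2020, "parse_modern_format"),
]

def parser_name_for_year(year: int) -> str:
    """Return the name of the parser responsible for an HTML edition."""
    for first, last, name in HTML_PARSERS:
        if first <= year <= last:
            return name
    raise ValueError(f"no HTML parser covers {year}")

# A bold label ending in a colon, then everything up to the next label.
CLASSIC_FIELD = re.compile(
    r"<b>([^<:]+):</b>(.*?)(?=<b>[^<:]+:</b>|\Z)", re.S
)

def parse_classic_sketch(html: str) -> dict[str, str]:
    """Toy stand-in for parse_classic: field name -> cleaned text."""
    fields = {}
    for name, body in CLASSIC_FIELD.findall(html):
        text = re.sub(r"<[^>]+>", " ", body)      # strip tags
        text = " ".join(unescape(text).split())   # decode entities, tidy spaces
        fields[name.strip()] = text
    return fields

# Invented sample markup, not taken from a real edition.
SAMPLE = "<b>Capital:</b> Example City<br><b>Area:</b> 1,000 sq&nbsp;km"
```

Calling `parse_classic_sketch(SAMPLE)` yields one entry per bold label, with tags and entities removed.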

Step 2 — Text Editions (1990–2001)

Source: CIA World Factbook plain-text files from Project Gutenberg (public domain ebooks). The text editions used four markup conventions across the decade, with a fifth fallback format for 2001:
• 1990: "Old" format, with country names on bare lines, sections marked with " - ", and fields as "Field: value"
• 1991–1992: "Tagged" format, with _@_ country delimiters, _*_ section markers, and _#_ field markers
• 1993–1994: "Asterisk" format, with *Country Name headers and section names on standalone lines
• 1995–2000: "At-sign" format, with @Country Name delimiters and inline field: value pairs
• 2001: "Equals" format, a fallback text parser for the 2001 edition where HTML was incomplete

Each text is downloaded from Project Gutenberg by ebook number (e.g., 1990 = Ebook #14, 1994 = Ebook #180). The parser splits the monolithic text file into country blocks, then extracts categories, field names, and content using format-specific regex patterns. The 2000 and 2001 overlap years (available in both HTML and text) allow cross-validation.
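The block-splitting idea can be illustrated with a heavily simplified version of the "At-sign" convention. This is a sketch under assumptions: the function name and sample text are hypothetical, and real editions contain section headers, wrapped lines, and other details that this toy parser ignores.

```python
import re

FIELD_LINE = re.compile(r"([^:]+):\s*(.*)")

def split_at_sign_blocks(text: str) -> dict[str, dict[str, str]]:
    """Split a monolithic text into {country: {field: value}}.

    Assumes each country starts with a line beginning with '@' and
    that every field sits on a single 'Field: value' line.
    """
    countries: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            # New country block; later fields attach to this dict.
            current = countries.setdefault(line[1:].strip(), {})
        elif current is not None:
            match = FIELD_LINE.match(line)
            if match:
                current[match.group(1).strip()] = match.group(2).strip()
    return countries

# Invented sample, not an excerpt from a real edition.
SAMPLE_TEXT = """@Testland
Capital: Example City
Area: 1,000 sq km

@Otherland
Capital: Sample Town
"""
```

Lines that appear before the first `@` delimiter are skipped, which is one simple way to drop the ebook preamble.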

Step 3 — JSON Editions (2021–2025)

Source: github.com/factbook/cache.factbook.json — a community-maintained mirror that cached the CIA Factbook API as structured JSON files. The repository was auto-updated weekly (every Thursday) from August 2021 until the CIA discontinued the online Factbook in February 2026. To obtain year-specific snapshots rather than a single point-in-time dump, the ETL uses git history: for each target year, it checks out the last commit before January 1 of the following year (e.g., the 2023 snapshot uses the last commit before 2024-01-01). Each JSON file contains one country with categories and fields as nested objects. HTML tags embedded in JSON values are stripped during loading.
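The snapshot rule (last commit before January 1 of the following year) can be sketched as a pure function over a commit list. The function names and the commit hashes are hypothetical, and using UTC for the cutoff is an assumption; against a real clone, a command such as `git rev-list -n 1 --before=2024-01-01 HEAD` performs roughly the same selection.

```python
from datetime import datetime, timezone
from typing import Optional

def snapshot_cutoff(year: int) -> datetime:
    """January 1 of the following year (UTC assumed for this sketch)."""
    return datetime(year + 1, 1, 1, tzinfo=timezone.utc)

def pick_snapshot(commits: list[tuple[datetime, str]], year: int) -> Optional[str]:
    """Return the hash of the last commit strictly before the cutoff."""
    cutoff = snapshot_cutoff(year)
    eligible = [(when, sha) for when, sha in commits if when < cutoff]
    if not eligible:
        return None
    # Tuples sort by timestamp first, so max() is the latest commit.
    return max(eligible)[1]

UTC = timezone.utc
# Hypothetical weekly commits around a year boundary.
COMMITS = [
    (datetime(2023, 12, 21, tzinfo=UTC), "aaa111"),
    (datetime(2023, 12, 28, tzinfo=UTC), "bbb222"),
    (datetime(2024, 1, 4, tzinfo=UTC), "ccc333"),
]
```

With these sample commits, the 2023 snapshot resolves to the 28 December commit, and a year with no earlier commits returns `None`.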

Step 4 — Country Identity Standardization (MasterCountries)

The CIA World Factbook uses FIPS 10-4 country codes (a U.S. Government standard), not the internationally used ISO 3166-1 codes. Many FIPS codes differ from their ISO equivalents (e.g., FIPS "CH" = China, but ISO "CH" = Switzerland). Country names also changed across editions (e.g., "Burma" vs. "Myanmar," "Zaire" vs. "Democratic Republic of the Congo"). The standardization process built a MasterCountries table that serves as the single source of identity for all 281 entities across all 36 years:
1. Name cleanup: corrected garbage names from HTML parsing failures and updated historical names to modern official names.
2. Code deduplication: merged duplicate FIPS codes where the same country appeared under different codes across editions.
3. FIPS-to-ISO crosswalk: added ISO Alpha-2 codes using the NGA Geopolitical Entities and Codes (GEC) standard, sourced from the authoritative crosswalk at github.com/mysociety/gaze (derived from NGA GEC data). This maps all 281 FIPS codes to their ISO 3166-1 equivalents where one exists.
4. Identity linking: every year-specific country record in the Countries table is linked to its MasterCountryID via foreign key, enabling cross-year queries (e.g., track China's GDP across all 36 years regardless of which code or name variant was used that year).

The result is a schema where:
• CanonicalCode = original FIPS 10-4 code (preserved for provenance)
• ISOAlpha2 = ISO 3166-1 Alpha-2 code (used for international compatibility)
• CanonicalName = standardized modern name
• MasterCountryID = stable integer key linking all years together
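A small in-memory SQLite sketch of identity linking. The MasterCountries column names come from the description above; the Countries columns (CountryID, Year, Code, Name), the helper name, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MasterCountries (
    MasterCountryID INTEGER PRIMARY KEY,
    CanonicalCode   TEXT,   -- FIPS 10-4
    ISOAlpha2       TEXT,   -- ISO 3166-1 Alpha-2
    CanonicalName   TEXT
);
CREATE TABLE Countries (        -- column names here are assumed
    CountryID       INTEGER PRIMARY KEY,
    Year            INTEGER,
    Code            TEXT,
    Name            TEXT,
    MasterCountryID INTEGER REFERENCES MasterCountries(MasterCountryID)
);
""")

# Illustrative rows: one entity whose name changed, one that did not.
conn.executemany("INSERT INTO MasterCountries VALUES (?, ?, ?, ?)", [
    (1, "CG", "CD", "Democratic Republic of the Congo"),
    (2, "CH", "CN", "China"),
])
conn.executemany("INSERT INTO Countries VALUES (?, ?, ?, ?, ?)", [
    (1, 1995, "CG", "Zaire", 1),
    (2, 2020, "CG", "Congo, Democratic Republic of the", 1),
    (3, 2020, "CH", "China", 2),
])

def names_by_year(db: sqlite3.Connection, iso_alpha2: str) -> list[tuple[int, str]]:
    """Follow MasterCountryID to list every yearly record of one entity."""
    return db.execute(
        """SELECT c.Year, c.Name
           FROM Countries AS c
           JOIN MasterCountries AS m
             ON m.MasterCountryID = c.MasterCountryID
           WHERE m.ISOAlpha2 = ?
           ORDER BY c.Year""",
        (iso_alpha2,),
    ).fetchall()
```

Querying by ISO code returns both the "Zaire" and the later-named record because they share one MasterCountryID.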

Step 5 — Entity Classification

The Factbook covers 281 entities, not all of which are sovereign nations. An automated classifier reads each entity's "Dependency status" and "Government type" fields across all 36 years to assign a type:
• sovereign: Independent state (195 entities)
• territory: Dependency, overseas territory, unincorporated territory
• disputed: Disputed sovereignty (Kosovo, West Bank, Gaza Strip, etc.)
• freely_associated: Self-governing in free association with another state
• special_admin: Special Administrative Region (Hong Kong, Macau)
• crown_dependency: Crown dependency of the UK (Jersey, Guernsey, Isle of Man)
• antarctic: Antarctic territory or claim
• misc: Oceans, World entry, European Union
• dissolved: Historical entity that no longer exists

A set of hardcoded overrides handles entities where auto-classification fails (e.g., oceans, disputed territories). The classifier runs in read-only mode first for manual review, then writes to the database with an --apply flag.
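A rough sketch of the classify-then-apply flow. The keyword rules, override entries, and function names below are assumptions chosen for illustration and cover only some of the types; the archive's actual rules are not reproduced here.

```python
from typing import Optional

# Hypothetical overrides for entities that keyword rules cannot handle.
OVERRIDES = {"Indian Ocean": "misc", "Kosovo": "disputed"}

def classify_entity(name: str,
                    dependency_status: Optional[str],
                    government_type: Optional[str]) -> str:
    """Assign an entity type from two Factbook fields (assumed rules)."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    dep = (dependency_status or "").lower()
    gov = (government_type or "").lower()
    if "special administrative region" in dep:
        return "special_admin"
    if "crown dependency" in dep:
        return "crown_dependency"
    if "free association" in dep or "free association" in gov:
        return "freely_associated"
    if dep:
        return "territory"      # any other dependency status
    if gov:
        return "sovereign"      # has a government, no dependency status
    return "misc"

def classify_all(entities, store: dict, apply: bool = False) -> dict:
    """Compute proposed types; write them to `store` only when apply=True."""
    proposed = {n: classify_entity(n, d, g) for n, d, g in entities}
    if apply:
        store.update(proposed)
    return proposed

ENTITIES = [
    ("Hong Kong", "special administrative region of the People's Republic of China", "presidential limited democracy"),
    ("Jersey", "British crown dependency", "parliamentary democracy"),
    ("Testland", None, "parliamentary republic"),   # invented entity
    ("Indian Ocean", None, None),
]
```

Running `classify_all(ENTITIES, store)` without `apply=True` mirrors the read-only review pass: it returns the proposed types but leaves `store` untouched.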

Step 6 — Field Name Canonicalization

Across 36 years, the CIA used 1,090 distinct field name variants. Many refer to the same concept (e.g., "GDP (purchasing power parity)" vs. "GDP - purchasing power parity" vs. "National product"). The field mapping pipeline normalizes these to 414 canonical names using three rule layers:
• Rule 1 (exact match): the field already matches a modern canonical name, so no change is needed
• Rule 2 (normalization): strip punctuation and formatting differences (parentheses vs. dashes)
• Rule 3 (known CIA renames): a lookup table of ~120 historical-to-modern name mappings

An IsNoise flag marks 43 metadata and formatting artifacts (e.g., "Header", "Definition", "note") that should be excluded from search results and analysis. Original field names are preserved in the database; mappings live in a separate FieldNameMappings lookup table.
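The three rule layers can be sketched as a lookup cascade. The tiny tables below stand in for the real 414 canonical names and ~120 renames; the function names and the exact normalization (lowercasing and collapsing punctuation to spaces) are assumptions.

```python
import re

# Tiny stand-ins for the real lookup tables.
CANONICAL = {"GDP (purchasing power parity)", "Population"}
KNOWN_RENAMES = {"National product": "GDP (purchasing power parity)"}
NOISE = {"Header", "Definition", "note"}

def _key(name: str) -> str:
    """Lowercase and collapse every run of punctuation/space to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

_CANONICAL_BY_KEY = {_key(n): n for n in CANONICAL}

def canonicalize(name: str) -> tuple[str, str]:
    """Return (canonical_name, rule_that_matched)."""
    if name in CANONICAL:                       # Rule 1: exact match
        return name, "exact"
    key = _key(name)
    if key in _CANONICAL_BY_KEY:                # Rule 2: normalization
        return _CANONICAL_BY_KEY[key], "normalized"
    if name in KNOWN_RENAMES:                   # Rule 3: known renames
        return KNOWN_RENAMES[name], "rename"
    return name, "unmapped"

def is_noise(name: str) -> bool:
    """True for metadata or formatting artifacts to exclude from analysis."""
    return name in NOISE
```

Returning the matched rule alongside the name makes it easy to audit how each variant was mapped.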

Step 7 — SQLite Export

The canonical source database is SQL Server. For the web application, the entire schema is exported to a single SQLite file (data/factbook.db). The export script copies all five tables (MasterCountries, Countries, CountryCategories, CountryFields, FieldNameMappings) with identical structure. A full-text search index (FTS5) is built on CountryFields.Content for fast keyword search across 1,071,213 field entries. The final SQLite database is ~314 MB.
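A minimal sketch of building and querying an FTS5 index over CountryFields.Content. It assumes a Python build whose SQLite library includes FTS5; the index table name (CountryFieldsFTS), the FieldID and FieldName columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CountryFields (
    FieldID   INTEGER PRIMARY KEY,   -- assumed column name
    FieldName TEXT,                  -- assumed column name
    Content   TEXT
);
-- External-content index: the text stays in CountryFields,
-- the FTS table stores only the search index.
CREATE VIRTUAL TABLE CountryFieldsFTS USING fts5(
    Content,
    content='CountryFields',
    content_rowid='FieldID'
);
""")

# Invented sample rows.
conn.executemany("INSERT INTO CountryFields VALUES (?, ?, ?)", [
    (1, "Geography - note", "landlocked; mountainous terrain"),
    (2, "Coastline", "1,500 km"),
])

# Populate the index from the content table in one pass.
conn.execute("INSERT INTO CountryFieldsFTS(CountryFieldsFTS) VALUES ('rebuild')")

def search(db: sqlite3.Connection, term: str) -> list[int]:
    """Return the FieldID of every row whose Content matches the term."""
    rows = db.execute(
        "SELECT rowid FROM CountryFieldsFTS "
        "WHERE CountryFieldsFTS MATCH ? ORDER BY rowid",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]
```

The external-content layout is one way to avoid storing the text twice, which may matter for a database of this size.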

06 Data Processing & Numeric Extraction

Numeric Extraction

The CIA World Factbook publishes data as natural-language text. This archive uses pattern-based parsers to extract numeric values for visualization and comparison:
• Population: regex extraction of numbers with 5+ digits from text
• GDP (PPP): dollar-amount parsing handling "$X trillion" and "$X billion" formats
• Military expenditure: percentage-of-GDP extraction from "X% of GDP" patterns
• GDP per capita: dollar-amount extraction from "$X,XXX" patterns
• Life expectancy: extraction from "total population: X.X years" pattern
• Growth rates: signed percentage extraction from "X.X%" patterns

All parsers return NULL for fields where no valid number can be extracted, rather than imputing or estimating values. Visualization tools display "—" for missing data.
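A few of these extractors can be sketched with regular expressions. The function names and exact patterns are assumptions, not the archive's code; each function returns `None` (the Python counterpart of NULL) when nothing parses.

```python
import re
from typing import Optional

def parse_population(text: str) -> Optional[int]:
    """First number with at least 5 digits, ignoring thousands separators."""
    for match in re.finditer(r"\d[\d,]*\d", text):
        digits = match.group(0).replace(",", "")
        if len(digits) >= 5:
            return int(digits)
    return None

_SCALE = {"trillion": 1e12, "billion": 1e9}

def parse_dollars(text: str) -> Optional[float]:
    """'$1.5 trillion', '$800 billion', or a plain '$12,345' amount."""
    match = re.search(
        r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(trillion|billion)?", text, re.I
    )
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    scale = match.group(2)
    return value * _SCALE[scale.lower()] if scale else value

def parse_percent_of_gdp(text: str) -> Optional[float]:
    """Number immediately preceding '% of GDP'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%\s+of\s+GDP", text, re.I)
    return float(match.group(1)) if match else None

def parse_life_expectancy(text: str) -> Optional[float]:
    """Value from a 'total population: X.X years' phrase."""
    match = re.search(r"total population:\s*(\d+(?:\.\d+)?)\s*years", text, re.I)
    return float(match.group(1)) if match else None

def parse_growth_rate(text: str) -> Optional[float]:
    """First signed percentage in the text."""
    match = re.search(r"([+-]?\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) if match else None
```

In `parse_population`, the 5-digit threshold is what lets the parser skip four-digit estimate years such as "(2021 est.)".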

Entity Deduplication

The Factbook uses varying country codes and names across editions. This archive maintains a MasterCountries table that maps all historical variants to canonical entities. Where ISO Alpha-2 codes are shared by multiple entities (e.g., GB is used by United Kingdom, Guernsey, Jersey, and Isle of Man), queries filter by EntityType = 'sovereign' for country-level analysis.
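A short sketch of the shared-code case. It assumes EntityType is stored as a column of MasterCountries, and the rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE MasterCountries (
    MasterCountryID INTEGER PRIMARY KEY,
    CanonicalName   TEXT,
    ISOAlpha2       TEXT,
    EntityType      TEXT    -- assumed to live on this table
)""")
conn.executemany(
    "INSERT INTO MasterCountries (CanonicalName, ISOAlpha2, EntityType) "
    "VALUES (?, ?, ?)",
    [
        ("United Kingdom", "GB", "sovereign"),
        ("Guernsey", "GB", "crown_dependency"),
        ("Jersey", "GB", "crown_dependency"),
        ("Isle of Man", "GB", "crown_dependency"),
    ],
)

def sovereign_for_iso(db: sqlite3.Connection, iso_alpha2: str) -> list[str]:
    """Names of sovereign entities carrying the given ISO Alpha-2 code."""
    rows = db.execute(
        "SELECT CanonicalName FROM MasterCountries "
        "WHERE ISOAlpha2 = ? AND EntityType = 'sovereign' "
        "ORDER BY CanonicalName",
        (iso_alpha2,),
    ).fetchall()
    return [r[0] for r in rows]
```

Without the EntityType filter, a lookup on GB would return four rows instead of one.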

Field Name Canonicalization

Factbook field names vary across editions (e.g., "GDP (purchasing power parity)" vs "GDP - purchasing power parity"). A FieldNameMappings table maps all historical field names to canonical names, with an IsNoise flag to exclude metadata and formatting artifacts.

07 Classification & Disclaimer

Classification
OPEN SOURCE. All data in this archive originates from the CIA World Factbook, a public-domain U.S. Government publication. No classified or controlled unclassified information (CUI) is contained in this archive. The intelligence community formatting (ICD 203/208 structure, COCOM organization, confidence badges) is used for presentation purposes and does not imply access to classified sources or methods.

08 Data Repository

The complete archive dataset is available as open-source data on GitHub. The repository contains the full SQL schema, INSERT scripts for all five tables (split by year for CountryFields), field mapping definitions, and the ETL pipeline used to parse the original CIA publications from 1990 to 2025.

