REFERENCE // SOURCES & METHODOLOGY
Sources & Methodology
External references, classification standards, and analytic framework used in this archive
Disclosure
This archive presents public-domain data from the CIA World Factbook. All analytic frameworks, regional groupings, and classification standards referenced below originate from official U.S. Government publications and internationally recognized bodies. This project is not affiliated with the Central Intelligence Agency or the U.S. Government.
01
Primary Data Source
CIA World Factbook
All country data in this archive is sourced from the CIA World Factbook, a public-domain
publication of the Central Intelligence Agency. The Factbook provides intelligence on the
history, people, government, economy, energy, geography, environment, communications,
transportation, military, terrorism, and transnational issues for 266 world entities.
Published annually since 1962; this archive covers the 1990–2025 editions (36 years).
Publisher: Central Intelligence Agency, United States Government
Classification: OPEN SOURCE // PUBLIC DOMAIN
URL: https://www.cia.gov/the-world-factbook/
02
Analytic Standards
ICD 203 — Analytic Standards
Intelligence Community Directive 203, "Analytic Standards," establishes the standards
for analytic products produced by the Intelligence Community. This archive formats its
dossier pages following ICD 203 principles including sourcing transparency, distinguishing
between underlying intelligence and analytic judgment, and incorporating analysis of
alternatives.
Issuing Authority: Director of National Intelligence (DNI)
Effective: 2 January 2015
Reference: https://www.dni.gov/files/documents/ICD/ICD-203.pdf
ICD 208 — Maximizing the Utility of Analytic Products
Intelligence Community Directive 208 provides guidance on the write-for-maximum-utility
standard, influencing how intelligence assessments are structured and presented. Dossier
formatting in this archive follows ICD 208 principles for clear, structured presentation.
Issuing Authority: Director of National Intelligence (DNI)
Effective: 17 December 2008
Reference: https://www.dni.gov/files/documents/ICD/ICD-208.pdf
Confidence Levels
This archive assigns confidence levels to data fields based on recency, following
the general framework of ICD 203 confidence assessments adapted for structured data:
HIGH — Data from the current assessment year (0–1 years old)
MODERATE — Data within 2–3 years of the assessment year
LOW — Data 4 or more years older than the assessment year
Note: These confidence levels reflect data freshness only, not analytic confidence
in the traditional intelligence sense. The CIA World Factbook updates fields on
varying schedules; some fields may carry older data in recent editions.
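The recency thresholds above can be expressed as a small lookup. The following is a minimal Python sketch, not the archive's actual code: the function name confidence_level and its parameters are hypothetical, and it assumes that data exactly 4 years old falls under LOW.

```python
def confidence_level(data_year: int, assessment_year: int) -> str:
    """Map the age of a data point to a freshness label (HIGH / MODERATE / LOW)."""
    age = assessment_year - data_year
    if age < 0:
        # A data year later than the edition year is treated as an input error.
        raise ValueError("data_year is later than assessment_year")
    if age <= 1:
        return "HIGH"       # 0-1 years old
    if age <= 3:
        return "MODERATE"   # 2-3 years old
    return "LOW"            # assumption: 4 or more years old
```

For example, a 2021 estimate shown in the 2025 edition would be labeled LOW, while a 2024 estimate in the same edition would be labeled HIGH.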
03
Regional Framework — Unified Command Plan
DoD Unified Command Plan (UCP)
This archive organizes countries by U.S. Department of Defense Combatant Command (COCOM)
areas of responsibility as defined in the Unified Command Plan. The UCP is a classified
document signed by the President; however, COCOM area-of-responsibility boundaries are
publicly available through DoD publications and official command websites.
The six geographic combatant commands used in this archive (sovereign states + territories):
EUCOM — U.S. European Command (53 entities, Europe & Eurasia)
CENTCOM — U.S. Central Command (21 entities, Middle East & Central/South Asia)
INDOPACOM — U.S. Indo-Pacific Command (49 entities, Asia-Pacific & Indian Ocean)
AFRICOM — U.S. Africa Command (57 entities, Africa)
SOUTHCOM — U.S. Southern Command (46 entities, Central & South America, Caribbean)
NORTHCOM — U.S. Northern Command (6 entities, North America)
Issuing Authority: President of the United States / Secretary of Defense
Current Version: 2024 (most recent publicly acknowledged revision)
Reference: https://www.defense.gov/Spotlights/Unified-Command-Plan/
04
Country Groupings — Time Series Presets
P5 (UNSC Permanent Members)
United States, China, Russia, United Kingdom, France
The five permanent members of the United Nations Security Council. Permanent membership
is established by Article 23 of the UN Charter (1945); the veto power derives from the
voting rules in Article 27. Sometimes referred to as the "Permanent Five" or "P5."
Source: Charter of the United Nations, Chapter V, Article 23
Reference: https://www.un.org/en/about-us/un-charter/chapter-5
G7 (Group of Seven)
United States, United Kingdom, Germany, France, Japan, Italy, Canada
An intergovernmental political and economic forum of seven major advanced economies.
The G7 does not have a formal charter; membership has been stable since Canada joined
in 1976 (Russia participated as G8 from 1997 to 2014).
Source: G7 official communications
Reference: https://www.g7germany.de/g7-en/g7-and-g20/what-is-the-g7-
BRICS
China, India, Brazil, Russia, South Africa
Originally coined as "BRIC" by Goldman Sachs economist Jim O'Neill (2001) to describe
four emerging economies. South Africa joined in 2010. The grouping formalized through
annual summits beginning in 2009. In 2024, the grouping expanded (BRICS+) to include
Egypt, Ethiopia, Iran, Saudi Arabia, and the UAE, though this archive uses the original
five for the preset.
Source: BRICS Joint Statistical Publication; Johannesburg II Declaration (2023)
Reference: https://brics2024.gob.ru/en
NATO Select
United States, United Kingdom, Germany, France, Turkey, Poland, Norway
A representative sample of NATO member states selected by military expenditure,
geographic coverage, and strategic importance. NATO has 32 member states as of 2024
(following the accession of Finland in 2023 and Sweden in 2024). The full membership
list is publicly available.
Source: North Atlantic Treaty (Washington Treaty, 1949); NATO official member list
Reference: https://www.nato.int/cps/en/natohq/nato_countries.htm
Near-Peer Competitors
China, Russia
Terminology from the U.S. National Defense Strategy (NDS). The 2022 NDS identifies
the People's Republic of China as "the most consequential strategic competitor" and
the Russian Federation as an "acute threat." The 2018 NDS used the term "great power
competition" to describe strategic rivalry with both nations.
Source: 2022 National Defense Strategy of the United States of America
Issuing Authority: U.S. Department of Defense
Reference: https://www.defense.gov/National-Defense-Strategy/
Regional Powers
India, Brazil, Turkey, Saudi Arabia, Iran, Indonesia
States with significant military, economic, and political influence within their
respective regions but not classified as global "great powers" in U.S. strategic
documents. This grouping draws from international relations scholarship and aligns
with how these states are discussed in the National Security Strategy and Defense
Intelligence Agency threat assessments.
Sources:
• National Security Strategy of the United States (2022)
• DIA Worldwide Threat Assessment
• Academic consensus in international relations (e.g., Buzan & Wæver,
"Regions and Powers," Cambridge University Press, 2003)
Africa Top 5 (by Population)
Nigeria, Ethiopia, Egypt, Democratic Republic of the Congo, Tanzania
The five most populous countries on the African continent, as reported in the
CIA World Factbook. This grouping is derived directly from Factbook population data.
Indo-Pacific
Japan, South Korea, Australia, India, Indonesia, Thailand, Philippines
Key partner nations and allies within the U.S. Indo-Pacific Command (INDOPACOM) area
of responsibility. Selection reflects the nations most frequently referenced in the
Indo-Pacific Strategy of the United States (2022).
Source: Indo-Pacific Strategy of the United States (February 2022)
Issuing Authority: The White House
Reference: https://www.whitehouse.gov/briefing-room/speeches-remarks/2022/02/11/fact-sheet-indo-pacific-strategy-of-the-united-states/
05
Archive Construction Methodology
Overview
The CIA World Factbook has been published in different formats over 36 years. Building a unified archive required acquiring data from three distinct sources, writing format-specific parsers for each era, and then normalizing the results into a single relational schema. The ETL pipeline ran in seven sequential steps, each building on the previous.
Step 1 — HTML Editions (2000–2020)
Source: CIA World Factbook zip archives retrieved from the Internet Archive Wayback Machine.
Each annual edition was published as a downloadable .zip containing one HTML file per country.
The CIA used five different HTML structures across these 21 years, requiring five distinct parsers:
2000 — Classic format: <b>FieldName:</b> followed by plain text (parse_classic)
2001–2008 — Table-based format: <td class="FieldLabel"> cells in nested tables (parse_table_format)
2009–2014 — CollapsiblePanel divs with JavaScript show/hide sections (parse_collapsiblepanel_format)
2015–2017 — Expand/collapse h2 sections with anchor-based navigation (parse_expandcollapse_format)
2018–2020 — Modern field-anchor divs with structured class names (parse_modern_format)
Each parser uses BeautifulSoup to extract country name, FIPS code, categories, field names, and
content text. HTML entities and tags are stripped from content. A known-good Wayback Machine
timestamp was identified for each year to ensure consistent, complete snapshots.
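The year-to-parser mapping above can be sketched as a lookup table. The Python below is illustrative only. The parser names come from the list above, but select_parser and parse_classic_sketch are hypothetical helpers, and the sketch uses the standard-library re module rather than BeautifulSoup so that it stays self-contained. The regex assumes the 2000 markup is exactly <b>FieldName:</b> followed by text, which real pages may not follow so cleanly.

```python
import re
from bisect import bisect_right
from html import unescape

# First year of each HTML era and the parser named for it in the list above.
PARSER_ERAS = [
    (2000, "parse_classic"),
    (2001, "parse_table_format"),
    (2009, "parse_collapsiblepanel_format"),
    (2015, "parse_expandcollapse_format"),
    (2018, "parse_modern_format"),
]
ERA_STARTS = [start for start, _ in PARSER_ERAS]
FIRST_HTML_YEAR, LAST_HTML_YEAR = 2000, 2020


def select_parser(year: int) -> str:
    """Return the name of the parser that handles a given edition year."""
    if not FIRST_HTML_YEAR <= year <= LAST_HTML_YEAR:
        raise ValueError(f"no HTML edition for {year}")
    # bisect_right finds the last era whose first year is <= the requested year.
    return PARSER_ERAS[bisect_right(ERA_STARTS, year) - 1][1]


# Assumed shape of the 2000 "classic" markup: <b>FieldName:</b> then plain text.
FIELD_RE = re.compile(r"<b>([^<:]+):</b>(.*?)(?=<b>[^<:]+:</b>|\Z)", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def parse_classic_sketch(html_text: str) -> dict:
    """Extract {field name: text} pairs, stripping tags and HTML entities."""
    fields = {}
    for name, raw in FIELD_RE.findall(html_text):
        text = unescape(TAG_RE.sub(" ", raw))
        fields[name.strip()] = " ".join(text.split())  # collapse whitespace
    return fields
```

Keeping the era boundaries in one table means a newly discovered format change only requires adding a row, not touching the dispatch logic.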
Step 2 — Text Editions (1990–2001)
Source: CIA World Factbook plain-text files from Project Gutenberg (public domain ebooks).
The text editions used five markup conventions across these years:
1990 — "Old" format: country names on bare lines, sections marked with " - ", fields with "Field: value"
1991–1992 — "Tagged" format: _@_ country delimiters, _*_ section markers, _#_ field markers
1993–1994 — "Asterisk" format: *Country Name headers, section names on standalone lines
1995–2000 — "At-sign" format: @Country Name delimiters with inline field: value pairs
2001 — "Equals" format: fallback text parser for the 2001 edition where HTML was incomplete
Each text is downloaded from Project Gutenberg by ebook number (e.g., 1990 = Ebook #14,
1994 = Ebook #180). The parser splits the monolithic text file into country blocks, then
extracts categories, field names, and content using format-specific regex patterns. The 2000
and 2001 overlap years (available in both HTML and text) allow cross-validation.
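As an illustration of the split-then-extract approach, here is a minimal Python sketch for the "At-sign" format. It assumes each country starts on a line beginning with "@" and that fields appear as "Field: value" lines, with unlabeled lines continuing the previous field. The real files are likely messier, and the function names and regexes here are hypothetical.

```python
import re

# Assumption: a country block starts at a line of the form "@Country Name".
COUNTRY_RE = re.compile(r"^@(.+)$", re.M)
# Assumption: a field line looks like "Field name: value".
FIELD_RE = re.compile(r"^\s*([A-Z][^:\n]{0,60}):\s*(.*)$")


def split_countries(text: str) -> dict:
    """Split a monolithic edition into {country name: block text}."""
    matches = list(COUNTRY_RE.finditer(text))
    blocks = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        blocks[m.group(1).strip()] = text[m.end():end].strip()
    return blocks


def parse_fields(block: str) -> dict:
    """Extract {field name: content} from one country block."""
    fields = {}
    current = None
    for line in block.splitlines():
        m = FIELD_RE.match(line)
        if m:
            current = m.group(1).strip()
            fields[current] = m.group(2).strip()
        elif current and line.strip():
            # Lines without a "Field:" prefix continue the previous field.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields
```

The other text formats would follow the same two-stage shape with different delimiter patterns.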
Step 3 — JSON Editions (2021–2025)
Source: github.com/factbook/cache.factbook.json — a community-maintained mirror that cached
the CIA Factbook API as structured JSON files. The repository was auto-updated weekly (every
Thursday) from August 2021 until the CIA discontinued the online Factbook in February 2026.
To obtain year-specific snapshots rather than a single point-in-time dump, the ETL uses git
history: for each target year, it checks out the last commit before January 1 of the following
year (e.g., the 2023 snapshot uses the last commit before 2024-01-01). Each JSON file contains
one country with categories and fields as nested objects. HTML tags embedded in JSON values are
stripped during loading.
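The per-year checkout can be sketched with git rev-list, which accepts a --before cutoff. The Python below is an assumption-laden sketch, not the archive's ETL code: the helper names and the branch name "master" are hypothetical, and the cutoff is interpreted in the local time zone of the machine running git. An explicit time of day is included because git otherwise fills in the current time when only a date is given.

```python
import subprocess


def rev_list_command(year: int, branch: str = "master") -> list:
    """Build the git command that finds the last commit before Jan 1 of year + 1."""
    cutoff = f"{year + 1}-01-01 00:00:00"
    return ["git", "rev-list", "-n", "1", f"--before={cutoff}", branch]


def snapshot_commit(repo_dir: str, year: int, branch: str = "master") -> str:
    """Return the commit hash to check out for a given edition year."""
    result = subprocess.run(
        rev_list_command(year, branch),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails (e.g. not a repository)
    )
    return result.stdout.strip()
```

The returned hash can then be passed to git checkout before loading that year's JSON files.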
Step 4 — Country Identity Standardization (MasterCountries)
The CIA World Factbook uses FIPS 10-4 country codes (a U.S. Government standard), not the
internationally used ISO 3166-1 codes. Many FIPS codes differ from their ISO equivalents
(e.g., FIPS "CH" = China, but ISO "CH" = Switzerland). Country names also changed across
editions (e.g., "Burma" vs. "Myanmar," "Zaire" vs. "Democratic Republic of the Congo").
The standardization process built a MasterCountries table that serves as the single source
of identity for all 281 entities across all 36 years:
1. Name cleanup — Corrected garbage names from HTML parsing failures and updated
historical names to modern official names
2. Code deduplication — Merged duplicate FIPS codes where the same country appeared
under different codes across editions
3. FIPS-to-ISO crosswalk — Added ISO Alpha-2 codes using the NGA Geopolitical Entities
and Codes (GEC) standard, sourced from the authoritative crosswalk at
github.com/mysociety/gaze (derived from NGA GEC data). This maps all 281 FIPS codes
to their ISO 3166-1 equivalents where one exists
4. Identity linking — Every year-specific country record in the Countries table is linked
to its MasterCountryID via foreign key, enabling cross-year queries (e.g., track China's
GDP across all 36 years regardless of which code or name variant was used that year)
The result is a schema where:
CanonicalCode = original FIPS 10-4 code (preserved for provenance)
ISOAlpha2 = ISO 3166-1 Alpha-2 code (used for international compatibility)
CanonicalName = standardized modern name
MasterCountryID = stable integer key linking all years together
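To make the identity linking concrete, the following Python sketch builds a reduced version of the two tables in an in-memory SQLite database and runs a cross-year query. Only the four MasterCountries column names come from the schema above. The Countries columns other than MasterCountryID, the helper names, and the sample rows are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE MasterCountries (
    MasterCountryID INTEGER PRIMARY KEY,
    CanonicalCode   TEXT NOT NULL,   -- original FIPS 10-4 code
    ISOAlpha2       TEXT,            -- NULL where no ISO equivalent exists
    CanonicalName   TEXT NOT NULL
);
CREATE TABLE Countries (             -- one row per country per edition year
    CountryID       INTEGER PRIMARY KEY,
    Year            INTEGER NOT NULL,
    Code            TEXT NOT NULL,
    Name            TEXT NOT NULL,
    MasterCountryID INTEGER NOT NULL REFERENCES MasterCountries(MasterCountryID)
);
"""


def build_demo() -> sqlite3.Connection:
    """Create the reduced schema with two illustrative edition rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO MasterCountries VALUES (1, 'CG', 'CD', "
        "'Congo, Democratic Republic of the')"
    )
    conn.executemany(
        "INSERT INTO Countries (Year, Code, Name, MasterCountryID) VALUES (?, ?, ?, ?)",
        [(1995, "CG", "Zaire", 1),
         (2020, "CG", "Congo, Democratic Republic of the", 1)],
    )
    return conn


def names_by_year(conn: sqlite3.Connection, iso_alpha2: str) -> list:
    """Follow the foreign key to list every name variant used for one entity."""
    rows = conn.execute(
        """SELECT c.Year, c.Name
           FROM Countries AS c
           JOIN MasterCountries AS m ON m.MasterCountryID = c.MasterCountryID
           WHERE m.ISOAlpha2 = ?
           ORDER BY c.Year""",
        (iso_alpha2,),
    )
    return rows.fetchall()
```

Calling names_by_year(build_demo(), "CD") returns both the "Zaire" row and the later-name row, since both link to the same MasterCountryID.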
Step 5 — Entity Classification
The Factbook covers 281 entities, not all of which are sovereign nations. An automated
classifier reads each entity's "Dependency status" and "Government type" fields across all
36 years to assign a type:
sovereign — Independent state (195 entities)
territory — Dependency, overseas territory, unincorporated territory
disputed — Disputed sovereignty (Kosovo, West Bank, Gaza Strip, etc.)
freely_associated — Self-governing in free association with another state
special_admin — Special Administrative Region (Hong Kong, Macau)
crown_dependency — Crown dependency of the UK (Jersey, Guernsey, Isle of Man)
antarctic — Antarctic territory or claim
misc — Oceans, World entry, European Union
dissolved — Historical entity that no longer exists
A set of hardcoded overrides handles entities where auto-classification fails (e.g., oceans,
disputed territories). The classifier runs in read-only mode first for manual review, then
writes to the database with an --apply flag.
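A classifier of this kind, with overrides and a dry-run default, might look like the Python sketch below. The keyword rules, the override entries, and all function names are assumptions made for illustration. The archive's actual rules are not given here, and the sketch covers only a subset of the entity types.

```python
import argparse

# Hypothetical overrides for entities that keyword rules would misclassify.
OVERRIDES = {"Atlantic Ocean": "misc", "Kosovo": "disputed"}


def classify(name, dependency_status, government_type):
    """Assign an entity type from Factbook text fields (illustrative rules only)."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    dep = (dependency_status or "").lower()
    gov = (government_type or "").lower()
    if "crown dependency" in dep:
        return "crown_dependency"
    if "special administrative region" in dep + " " + gov:
        return "special_admin"
    if "free association" in dep + " " + gov:
        return "freely_associated"
    if dep:
        return "territory"   # any other non-empty dependency status
    return "sovereign"


def run(rows, apply=False, write=print):
    """Classify rows; call write(name, label) only when apply is set."""
    results = [(name, classify(name, dep, gov)) for name, dep, gov in rows]
    if apply:
        for name, label in results:
            write(name, label)   # stand-in for a database UPDATE
    return results


def parse_args(argv):
    """Parse the command line; without --apply the run is read-only."""
    parser = argparse.ArgumentParser(description="Classify Factbook entities")
    parser.add_argument("--apply", action="store_true",
                        help="write results instead of only reporting them")
    return parser.parse_args(argv)
```

Making the read-only path the default means a mistyped command cannot modify the database.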
Step 6 — Field Name Canonicalization
Across 36 years, the CIA used 1,090 distinct field name variants. Many refer to the same
concept (e.g., "GDP (purchasing power parity)" vs. "GDP - purchasing power parity" vs.
"National product"). The field mapping pipeline normalizes these to 414 canonical names
using three rule layers:
Rule 1 — Exact match: field already matches a modern canonical name (no change needed)
Rule 2 — Normalization: strip punctuation and formatting differences (parentheses vs. dashes)
Rule 3 — Known CIA renames: a lookup table of ~120 historical-to-modern name mappings
An IsNoise flag marks 43 metadata and formatting artifacts (e.g., "Header", "Definition",
"note") that should be excluded from search results and analysis. Original field names are
preserved in the database; mappings live in a separate FieldNameMappings lookup table.
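The three rule layers can be sketched as a cascade that stops at the first match. In the Python below, the canonical set and rename table are tiny stand-ins (the real tables hold 414 canonical names and roughly 120 renames), and the function names and normalization regex are assumptions, not the pipeline's actual code.

```python
import re

# Stand-in for the canonical field names.
CANONICAL = {"GDP (purchasing power parity)", "Population"}

# Stand-in for the lookup table of known CIA renames (example taken from the text).
KNOWN_RENAMES = {"National product": "GDP (purchasing power parity)"}


def _key(name: str) -> str:
    """Normalize a field name: lowercase, punctuation runs collapsed to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


CANONICAL_BY_KEY = {_key(n): n for n in CANONICAL}


def canonicalize(name: str):
    """Return (canonical name or None, rule that matched)."""
    name = name.strip()
    if name in CANONICAL:                 # Rule 1: exact match
        return name, "exact"
    key = _key(name)
    if key in CANONICAL_BY_KEY:           # Rule 2: formatting differences only
        return CANONICAL_BY_KEY[key], "normalized"
    if name in KNOWN_RENAMES:             # Rule 3: known historical rename
        return KNOWN_RENAMES[name], "rename"
    return None, "unmapped"
```

Returning the rule name alongside the result makes it possible to audit how each of the historical variants was mapped.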
Step 7 — SQLite Export
The canonical source database is SQL Server. For the web application, the entire schema
is exported to a single SQLite file (data/factbook.db). The export script copies all five
tables (MasterCountries, Countries, CountryCategories, CountryFields, FieldNameMappings)
with identical structure. A full-text search index (FTS5) is built on CountryFields.Content
for fast keyword search across 1,071,213 field entries. The final SQLite database is ~314 MB.
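For readers unfamiliar with FTS5, the following Python sketch shows one way to index a Content column and query it. It assumes the Python build's SQLite includes the FTS5 extension, and it uses an external-content index so the text is not stored twice. The FieldID column, the CountryFieldsFts table name, and the helper names are hypothetical. The archive's export script may be structured differently.

```python
import sqlite3


def build_search_index(conn: sqlite3.Connection) -> None:
    """Create an FTS5 index over CountryFields.Content and populate it."""
    conn.execute(
        "CREATE VIRTUAL TABLE CountryFieldsFts USING fts5("
        "Content, content='CountryFields', content_rowid='FieldID')"
    )
    # 'rebuild' re-reads every row from the external content table.
    conn.execute("INSERT INTO CountryFieldsFts(CountryFieldsFts) VALUES ('rebuild')")


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return (FieldID, Content) rows whose text matches the FTS5 query."""
    rows = conn.execute(
        """SELECT f.FieldID, f.Content
           FROM CountryFieldsFts
           JOIN CountryFields AS f ON f.FieldID = CountryFieldsFts.rowid
           WHERE CountryFieldsFts MATCH ?
           ORDER BY f.FieldID""",
        (query,),
    )
    return rows.fetchall()


def build_demo() -> sqlite3.Connection:
    """Two illustrative rows standing in for the full CountryFields table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CountryFields (FieldID INTEGER PRIMARY KEY, Content TEXT)")
    conn.executemany(
        "INSERT INTO CountryFields (FieldID, Content) VALUES (?, ?)",
        [(1, "petroleum and petroleum products, chemicals"),
         (2, "coffee, tea, horticultural products")],
    )
    build_search_index(conn)
    return conn
```

With the default tokenizer, matching is case-insensitive, so a search for "Petroleum" and "petroleum" returns the same rows.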
06
Data Processing & Numeric Extraction
Numeric Extraction
The CIA World Factbook publishes data as natural-language text. This archive uses
pattern-based parsers to extract numeric values for visualization and comparison:
• Population: regex extraction of numbers with 5+ digits from text
• GDP (PPP): dollar-amount parsing handling "$X trillion" and "$X billion" formats
• Military expenditure: percentage-of-GDP extraction from "X% of GDP" patterns
• GDP per capita: dollar-amount extraction from "$X,XXX" patterns
• Life expectancy: extraction from "total population: X.X years" pattern
• Growth rates: signed percentage extraction from "X.X%" patterns
All parsers return NULL for fields where no valid number can be extracted, rather
than imputing or estimating values. Visualization tools display "—" for missing data.
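A few of these extractors can be sketched in Python as follows. The regexes are assumptions chosen to match the patterns named above. They are not the archive's actual expressions, and the function names are hypothetical. Each function returns None (the Python counterpart of NULL) when no valid number is found.

```python
import re

SCALE = {"trillion": 1e12, "billion": 1e9}


def parse_population(text):
    """Return the first number with 5 or more digits, ignoring comma separators."""
    for m in re.finditer(r"\d[\d,]*", text or ""):
        digits = m.group(0).replace(",", "")
        if len(digits) >= 5:
            return int(digits)
    return None


def parse_gdp_ppp(text):
    """Parse "$X trillion" or "$X billion" into a dollar amount."""
    m = re.search(r"\$\s*([\d,]+(?:\.\d+)?)\s*(trillion|billion)", text or "", re.I)
    if not m:
        return None
    return float(m.group(1).replace(",", "")) * SCALE[m.group(2).lower()]


def parse_pct_of_gdp(text):
    """Parse "X% of GDP" into a float percentage."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*%\s*of\s+GDP", text or "", re.I)
    return float(m.group(1)) if m else None


def parse_growth_rate(text):
    """Parse a signed percentage such as "-2.1%"."""
    m = re.search(r"([+-]?\d+(?:\.\d+)?)\s*%", text or "")
    return float(m.group(1)) if m else None
```

Because every extractor returns None on failure, entries such as "NA" pass through as missing values and are never replaced with estimates.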
Entity Deduplication
The Factbook uses varying country codes and names across editions. This archive
maintains a MasterCountries table that maps all historical variants to canonical
entities. Where ISO Alpha-2 codes are shared by multiple entities (e.g., GB is used
by United Kingdom, Guernsey, Jersey, and Isle of Man), queries filter by
EntityType = 'sovereign' for country-level analysis.
Field Name Canonicalization
Factbook field names vary across editions (e.g., "GDP (purchasing power parity)" vs
"GDP - purchasing power parity"). A FieldNameMappings table maps all historical field
names to canonical names, with an IsNoise flag to exclude metadata and formatting artifacts.
07
Classification & Disclaimer
Classification
OPEN SOURCE. All data in this archive originates from the CIA World Factbook, a public-domain U.S. Government publication. No classified or controlled unclassified information (CUI) is contained in this archive. The intelligence community formatting (ICD 203/208 structure, COCOM organization, confidence badges) is used for presentation purposes and does not imply access to classified sources or methods.
08
Data Repository
The complete archive dataset is available as open-source data on GitHub. The repository contains the full SQL schema, INSERT scripts for all five tables (split by year for CountryFields), field mapping definitions, and the ETL pipeline used to parse the original CIA publications from 1990 to 2025.