PatentWorld

Reference

Data Dictionary & Methodology

Definitions, measurement choices, and data sources behind PatentWorld's analysis of 9.36 million US patents (1976–2025, through September).

Patent Universe

PatentWorld covers 9.36 million patents granted by the United States Patent and Trademark Office (USPTO), encompassing all patent types: utility, design, plant, and reissue. Utility patents account for over 90% of all grants and represent the primary unit of analysis throughout most chapters.

When chapters restrict analysis to utility patents only, the universe is approximately 8.45 million patents. Design patents (covering ornamental appearance), plant patents, and reissue patents are included in aggregate counts but are excluded from most quality-metric analyses because their claim structures, citation patterns, and examination processes differ materially from utility patents.

All types vs. utility only

  • All types: 9.36 million patents (utility + design + plant + reissue)
  • Utility only: ~8.45 million patents
  • Difference: ~910,000 patents, predominantly design patents

Temporal Coverage

The dataset spans January 1976 through September 2025. The start date corresponds to the earliest systematically available digital patent records in the PatentsView database. All time-series analyses use grant year unless otherwise noted.

2025 is a partial year. Data for 2025 covers only grants issued through September 2025. Annual totals for 2025 should not be directly compared with full-year figures from prior years without appropriate adjustment. Where feasible, chapters note this truncation explicitly.

Several analyses also present data by filing year (the year of the earliest US application in the patent family). Filing-year counts for recent years are subject to right-truncation bias because many applications filed after 2022 remain pending and have not yet been granted. Chapters that use filing-year data note this limitation.

Grant year vs. filing year

  • Grant year: The year the USPTO issued the patent. Default time axis throughout PatentWorld.
  • Filing year: The year the earliest US application was filed. Typically 2–3 years before grant. Subject to right-truncation for recent years.
  • Implication: Trends by filing year lead grant-year trends by the average pendency period. Filing-year counts for 2023–2025 are incomplete.

Geography Basis

Geographic analyses use two distinct bases, depending on the chapter and research question:

Inventor country
The country of residence of the patent's inventor(s), as recorded in PatentsView's disambiguated inventor-location data. This basis captures where inventive work occurs and is used in the domestic geography, international geography, and geographic mechanics chapters.
Assignee country
The country associated with the patent's assignee (the entity that owns the patent rights). This basis captures where IP ownership resides and is used in the organizational composition chapter and in the Act 6 deep-dive chapters when analyzing geographic patterns by assignee.

These two bases can yield different results. A patent invented in Germany but assigned to a US corporation would be counted under Germany (inventor basis) or the United States (assignee basis). Chapters specify which basis is used.

Counting Methods

Fractional counting
When a patent has multiple inventors in different countries, each country receives equal fractional credit (1/n for n countries). Used for geographic analyses of inventor location.
Whole counting
Each country with at least one inventor on a patent receives full credit (count of 1). Used when measuring co-invention and international collaboration, where the focus is on participation rather than proportional attribution.
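
The difference between the two counting methods can be illustrated with a minimal sketch. The input shape (a mapping from patent ID to a list of inventor countries) and the function name are hypothetical, chosen for illustration only; this is not PatentWorld's pipeline code.

```python
from collections import defaultdict

def count_by_country(patents, fractional=True):
    """Tally country credit across patents.

    `patents` maps a patent ID to the list of its inventors' countries
    (one entry per inventor). This input shape is an assumption.
    """
    totals = defaultdict(float)
    for countries in patents.values():
        distinct = set(countries)
        if not distinct:
            continue  # no inventor location recorded
        # Fractional: 1/n per country for n distinct countries.
        # Whole: every participating country receives a full count of 1.
        credit = 1.0 / len(distinct) if fractional else 1.0
        for country in distinct:
            totals[country] += credit
    return dict(totals)

patents = {
    "P1": ["US", "US", "DE"],  # two distinct countries
    "P2": ["JP"],
}
fractional = count_by_country(patents, fractional=True)
whole = count_by_country(patents, fractional=False)
```

Under fractional counting the credits sum to the number of patents, so country totals can be added without double-counting; under whole counting they sum to the number of patent-country participations.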

Field Classification

PatentWorld uses several complementary approaches to classify patents by technology area:

CPC Sections

The Cooperative Patent Classification (CPC) system is a hierarchical classification jointly managed by the USPTO and the European Patent Office (EPO). PatentWorld uses CPC section as the primary technology grouping throughout Acts 1–5. The CPC was formally introduced in 2013; historical patents were retrospectively reclassified.

Section  Name
A        Human Necessities
B        Performing Operations; Transporting
C        Chemistry; Metallurgy
D        Textiles; Paper
E        Fixed Constructions
F        Mechanical Engineering; Lighting; Heating; Weapons
G        Physics
H        Electricity
Y        Cross-Sectional Technologies

Section Y is a tagging section applied alongside primary A–H classifications. Y02 and Y04S codes are used to identify green patents. Because Y codes are secondary tags, they are excluded from multi-section convergence analyses to avoid double-counting.

WIPO Technology Sectors

The World Intellectual Property Organization (WIPO) groups patents into 35 technology fields across 5 sectors: Electrical Engineering, Instruments, Chemistry, Mechanical Engineering, and Other Fields. PatentWorld uses WIPO sectors primarily in the gender and generalist-specialist chapters.

NMF Topic Model

The Language of Innovation chapter applies Non-Negative Matrix Factorization (NMF) topic modeling to 8.45 million patent abstracts to discover 25 data-driven technology themes. Unlike CPC, which relies on examiner-assigned codes, the NMF topic model extracts thematic structure directly from patent text using TF-IDF vectorization followed by matrix decomposition.

Act 6 Domain Definitions

Act 6 deep-dive chapters define domain-specific patent universes using curated sets of CPC subclasses. The specific codes used are documented in each chapter's measurement details panel. Two domains are defined here because they appear across multiple chapters:

Green patents
Patents classified under CPC codes Y02 (technologies for climate change mitigation) or Y04S (smart grids). Sub-categories: Y02E (energy generation), Y02T (transportation/EVs), Y02C (carbon capture), Y02B (buildings), Y02P (industrial production), Y02W (waste management). See the green innovation chapter.
AI patents
Patents classified under CPC subclass G06N (computing arrangements based on specific computational models, including neural networks, genetic algorithms, and knowledge-based systems) plus additional AI-related codes: G06F18 (pattern recognition), G06V (image/video recognition), G10L15 (speech recognition), and G06F40 (natural language processing). See the AI patents chapter.
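
As a rough illustration, the two domain definitions above could be applied to a patent's CPC codes as follows. This sketch assumes that a patent belongs to a domain if any of its assigned codes matches (the text does not say whether only the primary code is considered), and the helper names are hypothetical.

```python
# Prefixes taken from the definitions above.
GREEN_PREFIXES = ("Y02", "Y04S")
AI_PREFIXES = ("G06N", "G06F18", "G06V", "G10L15", "G06F40")

def _matches(code, prefix):
    """True if a CPC code such as 'G06F18/24' falls under the prefix."""
    head = code.split("/")[0]        # e.g. 'G06F18' from 'G06F18/24'
    if len(prefix) <= 4:             # class- or subclass-level prefix
        return head.startswith(prefix)
    return head == prefix            # main-group prefix must match exactly

def domains_for(cpc_codes):
    """Return the set of domains ('green', 'ai') for a list of CPC codes."""
    normalized = [c.replace(" ", "").upper() for c in cpc_codes]
    domains = set()
    if any(_matches(c, p) for c in normalized for p in GREEN_PREFIXES):
        domains.add("green")
    if any(_matches(c, p) for c in normalized for p in AI_PREFIXES):
        domains.add("ai")
    return domains
```

Matching main-group prefixes exactly (rather than with a plain string prefix) avoids accidentally capturing a different group whose number merely begins with the same digits.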

Data Processing

Raw data were obtained as tab-separated value (TSV) files from PatentsView's bulk data downloads. These files were processed using DuckDB, an analytical SQL database engine, to compute aggregated statistics for each visualization. The analysis encompasses all USPTO-granted patents from January 1976 through September 2025.

Processing Pipeline

1. Joining patent records with inventor, assignee, location, and classification tables
2. Aggregating by year, technology category, geography, and organization
3. Computing derived metrics: citation counts, team sizes, concentration ratios, diversity indices
4. Filtering to primary classifications (CPC sequence = 0) to avoid double-counting
5. Exporting pre-computed JSON files for each visualization (no backend server; all data are static)
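
The effect of the primary-classification filter on aggregation can be sketched in a few lines. The production pipeline runs in DuckDB; the plain-Python version below uses hypothetical in-memory rows and field names purely to show why filtering to sequence = 0 yields exactly one count per patent.

```python
from collections import Counter

# Hypothetical rows standing in for the joined patent/CPC tables.
cpc_rows = [
    {"patent_id": "P1", "sequence": 0, "section": "G"},  # primary
    {"patent_id": "P1", "sequence": 1, "section": "H"},  # secondary, skipped
    {"patent_id": "P2", "sequence": 0, "section": "A"},  # primary
]
grant_year = {"P1": 2019, "P2": 2020}

def count_by_year_and_section(cpc_rows, grant_year):
    """Count patents per (grant year, CPC section) using primary codes only."""
    counts = Counter()
    for row in cpc_rows:
        if row["sequence"] != 0:
            continue  # keep only the primary classification
        counts[(grant_year[row["patent_id"]], row["section"])] += 1
    return counts

counts = count_by_year_and_section(cpc_rows, grant_year)
```

Without the filter, P1 would be counted under both G and H, and section totals would exceed the number of patents.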

Counting Conventions

Primary classification (sequence = 0)
Each patent is assigned multiple CPC codes. The code with sequence = 0 is the examiner-designated primary classification. PatentWorld uses the primary classification for technology-field analyses to avoid counting a single patent multiple times across sections.
Filing route
Patents are classified by prosecution path: domestic (filed directly at the USPTO by US applicants), PCT (filed via the Patent Cooperation Treaty and entering the US national phase), or direct foreign (filed directly at the USPTO by foreign applicants without using the PCT). The organizational composition chapter analyzes trends by filing route.
Gender inference
Inventor gender is inferred from first names using PatentsView's gender_code field (M/F). This name-based inference does not capture non-binary identities and may misclassify individuals whose names are ambiguous or culturally variable.

Metric Definitions

The following metrics are used throughout PatentWorld. Definitions are standardized across chapters to enable cross-chapter comparisons.

Citation Metrics

Forward citations (5-year)
Number of subsequent patents citing a given patent within 5 years of grant. The 5-year window is standard in the patent analytics literature as a balance between measurement completeness and recency. Subject to truncation bias for patents granted after 2020, which have not yet accumulated a full 5-year citation window. In system-level chapter charts, a conservative visual threshold grays out data after 2018 to account for partial accumulation; firm-level analyses (company profiles) use a 2020 visual threshold.
Backward citations
Number of prior patents cited by a given patent. Reflects the extent to which the invention builds on existing prior art. Higher counts may indicate more incremental innovation or more thorough prosecution.
Self-citation rate
Fraction of a patent's backward citations directed to patents held by the same assignee. Higher values indicate greater internal knowledge reuse. In multi-assignee contexts, a citation is counted as a self-citation if any assignee on the citing patent matches any assignee on the cited patent.
Non-patent literature (NPL) citations
Citations to academic papers, technical standards, and other non-patent sources. Higher NPL citation rates indicate stronger ties to scientific research, and are particularly prevalent in biotechnology and pharmaceutical patents.
Cohort normalization
A patent's citation count divided by the mean citation count of all patents in the same grant-year × CPC section cohort. A value of 1.0 indicates average impact for that cohort; values above 1.0 indicate above-average impact. Cohort normalization controls for both temporal and field-specific citation patterns.
Exposure-time normalization
A patent's citation count divided by the number of years since grant (current year minus grant year). This simple normalization adjusts for the mechanical fact that older patents have had more time to accumulate citations. Available as an interactive toggle on selected charts; distinct from cohort normalization, which also controls for field-specific citation rates.
Citation half-life
The time it takes for a patent (or a firm's patent portfolio) to accumulate half of its total forward citations. Shorter half-lives indicate more immediately impactful inventions; longer half-lives suggest foundational work whose influence emerges gradually.
Dud rate (zero-citation rate)
The share of patents receiving zero forward citations within 5 years of grant. A high dud rate indicates a large proportion of patents that generate no measurable downstream impact. Used alongside the blockbuster rate to characterize the tails of a firm's quality distribution. Subject to the same 5-year truncation window as other forward-citation metrics.
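
Two of the definitions above, cohort normalization and the dud rate, can be sketched as follows. The record layout (`id`, `grant_year`, `section`, `citations_5y`) is a hypothetical shape for illustration, and returning `None` for cohorts with a zero mean is an assumption rather than documented behavior.

```python
from collections import defaultdict

def cohort_normalize(patents):
    """Divide each patent's 5-year citations by its cohort mean.

    A cohort is a (grant year, CPC section) pair.
    """
    sums = defaultdict(lambda: [0, 0])  # cohort -> [citation total, patent count]
    for p in patents:
        key = (p["grant_year"], p["section"])
        sums[key][0] += p["citations_5y"]
        sums[key][1] += 1
    normalized = {}
    for p in patents:
        total, n = sums[(p["grant_year"], p["section"])]
        mean = total / n
        # Assumption: undefined (None) when the whole cohort has zero citations.
        normalized[p["id"]] = p["citations_5y"] / mean if mean > 0 else None
    return normalized

def dud_rate(patents):
    """Share of patents with zero forward citations within 5 years."""
    return sum(1 for p in patents if p["citations_5y"] == 0) / len(patents)

patents = [
    {"id": "P1", "grant_year": 2015, "section": "G", "citations_5y": 0},
    {"id": "P2", "grant_year": 2015, "section": "G", "citations_5y": 4},
    {"id": "P3", "grant_year": 2015, "section": "G", "citations_5y": 8},
    {"id": "P4", "grant_year": 2015, "section": "A", "citations_5y": 3},
]
normalized = cohort_normalize(patents)
```

In this toy data the 2015 × G cohort has a mean of 4 citations, so a patent with 8 citations scores 2.0 (twice the cohort average).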

Quality & Scope

Claims per patent
Total number of independent and dependent claims in a patent. Measures the scope of legal protection sought. Higher claim counts generally indicate broader or more detailed protection.
Patent scope
Number of distinct CPC subclasses assigned to each patent. Measures the technological breadth of the invention. A patent classified under many subclasses spans a wider range of technical applications.
Grant lag (pendency)
Number of days between patent application filing date and grant date. Measures prosecution speed at the USPTO. Grant lag varies substantially by technology area, with software and biotechnology patents typically taking longer than mechanical inventions.
Team size
Number of disambiguated inventors listed on a patent. Used to measure the collaborative intensity of inventive activity. Average team size has increased from 1.7 inventors per patent in 1976 to 3.2 in 2024.
Blockbuster patent
A patent in the top 1% of 5-year forward citations within its grant-year × CPC section cohort. Under uniform quality, 1% of patents in any group would be blockbusters; rates above 1% indicate disproportionate high-impact output.
Quality bifurcation (top-decile share)
The share of patents within a technology domain that fall in the top decile of system-wide 5-year forward citations (within grant-year × CPC section cohorts). Tracks how the share of high-impact patents in a domain changes over time. Used in Act 6 deep-dive chapters to measure whether a domain is producing an increasing or decreasing concentration of high-quality output relative to the patent system as a whole.
Sleeping beauty patent
A patent that receives few or no citations for an extended dormancy period after grant before experiencing a sudden surge of recognition. Operationalized as a patent receiving fewer than 2 citations per year in its first 10 years followed by a burst of 10 or more citations within a 3-year window. A separate half-life analysis identifies patents whose citation half-life (time to accumulate 50% of total forward citations) exceeds 10 years. Relevant in the patent quality and organizational patent quality chapters.
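
The sleeping-beauty operationalization above can be sketched as a simple check over a patent's yearly citation series. This sketch assumes that "fewer than 2 citations per year" applies to each of the first 10 years individually (not to the average) and that the burst must occur after the dormancy period; the function and parameter names are hypothetical.

```python
def is_sleeping_beauty(citations_by_year, dormancy_years=10, max_per_year=2,
                       burst_window=3, burst_min=10):
    """citations_by_year[k] is the number of citations received in the k-th
    year after grant (k = 0 is the first year)."""
    if len(citations_by_year) <= dormancy_years:
        return False  # not enough history to observe a later burst
    dormant = citations_by_year[:dormancy_years]
    # Assumption: every dormant year must stay below the per-year cap.
    if any(c >= max_per_year for c in dormant):
        return False
    later = citations_by_year[dormancy_years:]
    # Slide a window over the post-dormancy years looking for a burst.
    for start in range(len(later)):
        if sum(later[start:start + burst_window]) >= burst_min:
            return True
    return False
```

For example, a series with at most one citation per year for ten years followed by 2, 5, and 6 citations qualifies, because the last three years sum to 13.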

Diversity & Concentration

Originality
Measures the breadth of a patent's backward citations across CPC sections. Computed as 1 minus the Herfindahl-Hirschman Index of the CPC section distribution among backward citations:
$\text{Originality} = 1 - \sum_i s_i^2$
where $s_i$ is the share of backward citations in CPC section $i$. Range: 0 (all citations from one section) to ~1 (citations spread evenly across sections). Higher values indicate the patent synthesizes knowledge from more diverse technological domains. Note: PatentWorld computes these indices over CPC sections (8 categories), which compresses the range (the attainable maximum is 1 − 1/8 = 0.875) relative to the finer-grained NBER technology subcategories (~36 categories) used in Trajtenberg, Henderson, and Jaffe (1997). The Hall, Jaffe, and Trajtenberg (2001) small-sample correction is not applied, which may introduce upward bias for patents with very few citations. In the dedicated originality/generality trend analysis, patents with fewer than 2 backward citations are excluded because the Herfindahl index is degenerate for a single citation. In per-dimension quality breakdowns (e.g., by team size or inventor rank), single-citation patents are assigned an originality of 0.0 rather than excluded, which may slightly depress group-level averages.
Generality
Measures how broadly a patent is cited across CPC sections. Computed identically to originality but using forward citations instead of backward citations. Range: 0 (all citing patents in one section) to ~1 (cited across many sections). Higher values indicate wider downstream influence. Patents with fewer than 2 forward citations are excluded. Because generality requires a complete 5-year forward-citation window, generality values are reported only for patents granted through 2020. The same caveats regarding CPC section granularity and the absence of small-sample correction apply.
Herfindahl-Hirschman Index (HHI)
A measure of concentration computed as:
$\text{HHI} = 10{,}000 \times \sum_i s_i^2$
where $s_i$ is the market share (or category share) of entity $i$, expressed as a decimal. The index ranges from near 0 (highly fragmented) to 10,000 (monopoly). The US DOJ/FTC (2010 Horizontal Merger Guidelines) classify markets as unconcentrated (<1,500), moderately concentrated (1,500–2,500), or highly concentrated (>2,500). In PatentWorld, HHI is used to measure assignee concentration within technology fields and geographic regions. Note: system-level analyses (Acts 1–5) report HHI on the standard 0–10,000 scale, while deep-dive domain analyses (Act 6) report HHI on a 0–1 decimal scale for consistency with normalized entropy metrics.
Shannon entropy
$H = -\sum_i p_i \log p_i$ over category shares $p_i$, where the logarithm base determines the unit. PatentWorld uses $\log_2$ (bits) for topic modeling, inventor specialization, and CPC subclass portfolio diversity analyses, and natural log (nats) for WIPO-level technology field diversity. Higher values indicate more even distribution across categories. The absolute value depends on both the number of categories and the log base used, so entropy values from different analyses are not directly comparable without normalization.
Gini coefficient
A measure of statistical dispersion ranging from 0 (perfect equality) to 1 (maximum inequality). In PatentWorld, used to measure concentration of citations across patents (via Lorenz curves) and quality distribution within technology fields and organizations. Values above 0.8 indicate high concentration.
Modified Gini coefficient (blockbuster concentration)
A directional concentration measure derived from Lorenz curves comparing the distribution of blockbuster patents (top 1% by forward citations within grant-year × CPC section cohorts, 5-year citation window) to the distribution of overall patents across organizations. Computed as:
$\text{Gini} = 1 - 2 \int_0^1 y \, dx$
where $x$ is the cumulative share of total patents per firm (sorted ascending) and $y$ is the cumulative share of blockbuster patents. Unlike the standard Gini (bounded [0, 1]), this measure can take negative values when the Lorenz curve lies above the 45° diagonal, indicating that blockbusters are distributed more evenly than overall patents. Positive values indicate that large patent holders capture a disproportionate share of blockbusters; negative values indicate smaller firms hold a higher blockbuster share than their overall patent output would predict. For example, the coefficient declined from 0.161 in 1976–1989 to −0.069 in 2010–2020, suggesting that high-impact innovation has become increasingly distributed across firm sizes.
Concentration ratios (CR4, CR10)
CR4 is the combined patent share of the four largest organizations in a domain; CR10 is the share of the ten largest. Used in Act 6 deep-dive chapters to measure organizational concentration within technology domains. Higher values indicate a more concentrated competitive landscape.
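
The concentration measures in this section reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the modified Gini integral is approximated with the trapezoid rule over firms sorted by total patent count (an assumption about how the Lorenz curve is traced), and inputs are plain lists of counts.

```python
def shares(counts):
    """Convert raw counts to shares that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def hhi(counts, scale=10_000):
    """Herfindahl-Hirschman Index; scale=1 gives the 0-1 decimal form."""
    return scale * sum(s * s for s in shares(counts))

def originality(section_counts):
    """1 minus the HHI (0-1 scale) of citations across CPC sections."""
    return 1.0 - hhi(section_counts, scale=1.0)

def concentration_ratio(counts, k):
    """Combined share of the k largest entities (CR4: k=4, CR10: k=10)."""
    return sum(sorted(shares(counts), reverse=True)[:k])

def modified_gini(firms):
    """firms: list of (total_patents, blockbuster_patents) per firm.

    Returns 1 - 2 * (area under the Lorenz curve), which is negative when
    the curve lies above the diagonal.
    """
    firms = sorted(firms)  # ascending by total patents (assumption)
    total_p = sum(p for p, _ in firms)
    total_b = sum(b for _, b in firms)
    area = x_prev = y_prev = 0.0
    cum_p = cum_b = 0
    for p, b in firms:
        cum_p += p
        cum_b += b
        x, y = cum_p / total_p, cum_b / total_b
        area += (x - x_prev) * (y + y_prev) / 2  # trapezoid slice
        x_prev, y_prev = x, y
    return 1.0 - 2.0 * area
```

With two firms holding 10 and 90 patents, `modified_gini` is zero when blockbusters are proportional to patent counts, positive when the larger firm holds all blockbusters, and negative when the smaller firm does.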

Innovation Strategy

Exploration composite
Measures the degree to which a patent represents exploratory (as opposed to exploitative) innovation. Equally weighted average of three normalized sub-scores:
  • Technology newness: Whether the patent uses CPC subclasses that are new to the assignee's historical portfolio.
  • Citation newness: Whether the patent cites prior art not previously cited by the assignee's other patents.
  • External sourcing: Whether the patent's backward citations come predominantly from patents held by other organizations rather than the assignee's own prior patents.

Each sub-score is normalized to a 0–1 scale. Patents scoring above 0.6 are classified as exploratory; those scoring 0.4–0.6 as ambidextrous; those below 0.4 as exploitative. These thresholds follow the tercile-approximation convention used in the organizational ambidexterity literature. Used in the organizational mechanics chapter and Act 6 deep dives.
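
A minimal sketch of the composite and its thresholds follows. It assumes the three sub-scores have already been normalized to 0–1; how each sub-score is derived is not specified here, and the function names are hypothetical.

```python
def exploration_composite(tech_newness, citation_newness, external_sourcing):
    """Equally weighted average of three sub-scores, each in [0, 1]."""
    scores = (tech_newness, citation_newness, external_sourcing)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("sub-scores must be normalized to the 0-1 range")
    return sum(scores) / 3

def classify(score):
    """Apply the thresholds stated above: >0.6, 0.4-0.6, <0.4."""
    if score > 0.6:
        return "exploratory"
    if score >= 0.4:
        return "ambidextrous"
    return "exploitative"
```

Note that a score of exactly 0.6 falls in the ambidextrous band under this reading, since only scores above 0.6 are exploratory.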

Ambidexterity index
Measures an organization's balance between exploration and exploitation. Computed as 1 minus the absolute deviation of the exploration share from 50%. Values near 1.0 indicate a balanced firm that pursues both exploratory and exploitative innovation in roughly equal measure; values near 0 indicate a specialist firm focused predominantly on one strategy. Used in the organizational mechanics chapter.
Inventor mobility (talent flow)
The movement of inventors between organizations, tracked by consecutive patent filings with different assignees. Net talent flow reveals which organizations are gaining versus losing inventive talent. A positive net flow indicates a firm is attracting more inventors than it is losing. Used in the inventor mechanics chapter.
Patent velocity
Total domain patents divided by active career span (last grant year minus first grant year plus one) for each organization, measured in patents per year. Used in Act 6 deep-dive chapters to compare the patenting intensity of organizations across entry decade cohorts. Higher values indicate more rapid patent accumulation relative to the time spent in the domain.
Subfield diversity
Normalized Shannon entropy of subfield patent distributions within a technology domain. Computed as H/ln(N), where H is the Shannon entropy over CPC subfield shares and N is the number of subfields. Ranges from 0 (all activity concentrated in one subfield) to 1 (perfectly even distribution across all subfields). Used in Act 6 deep-dive chapters to measure how evenly inventive activity is spread across domain subfields over time.
Cosine similarity
A measure of similarity between two vectors based on the angle between them. For non-negative vectors such as CPC share distributions, values range from 0 (completely different) to 1 (identical). Used in portfolio analysis to compare CPC-distribution vectors between companies, identifying competitive proximity and industry clusters.
Jensen-Shannon divergence (JSD)
A symmetric measure of the difference between two probability distributions, bounded between 0 (identical) and 1 (completely different) when computed with base-2 logarithms. Used to detect technology portfolio pivots by comparing a company's CPC distribution across consecutive time windows. Higher JSD values indicate a larger strategic shift.
Exploration index (patent-level)
The cosine distance of a patent's CPC distribution from the assignee's historical CPC centroid. Values near 1.0 indicate the patent is in entirely new technology areas for the firm; values near 0 indicate continuation of the existing portfolio. Distinct from the exploration composite, which averages three sub-scores. Used in the organizational mechanics chapter to show how individual patents' novelty decays over an organization's filing history.
Generalist vs. specialist inventors
Inventors are classified based on the technological diversity of their patent portfolios. Generalists hold patents spanning multiple CPC sections; specialists concentrate in one or few sections. The generalist-specialist chapter measures this using entropy-based diversity scores across each inventor's CPC section distribution.
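
The similarity and diversity measures in this section can be sketched with the standard library alone. The function names are hypothetical, inputs are assumed to be share vectors over the same ordered set of CPC categories, and the Jensen-Shannon divergence uses base-2 logarithms so that it is bounded by 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def entropy(p, base=2):
    """Shannon entropy of a share vector; zero shares contribute nothing."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

def jensen_shannon_divergence(p, q):
    """JSD in bits: H(M) - (H(P) + H(Q)) / 2, with M the midpoint distribution."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

def normalized_entropy(p):
    """Subfield diversity: H / ln(N) using natural logarithms."""
    n = len(p)
    return entropy(p, base=math.e) / math.log(n) if n > 1 else 0.0
```

The patent-level exploration index described above would then be `1 - cosine_similarity(patent_vector, centroid)`, where both arguments are CPC distribution vectors.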

Standard Quality Metrics Suite

Seven metrics used consistently across chapters

Chapters across multiple acts present a consistent set of seven patent quality metrics, computed for different grouping variables (by technology field, by assignee, by inventor category, by geography). These are: patent count, claims per patent, patent scope, forward citations, backward citations, self-citation rate, and grant lag. Each is defined individually above. When these seven metrics appear together, they enable direct cross-chapter comparisons of patent quality across different analytical dimensions.

Disambiguation Reliability

PatentsView uses machine learning algorithms to disambiguate inventor and assignee identities across patent records. This disambiguation is essential for analyses of inventor productivity, mobility, and organizational patenting patterns, but it introduces potential errors:

Splitting: A single inventor is assigned multiple disambiguated IDs, leading to undercounting of individual productivity. Most likely for inventors who change institutions or name spellings.
Lumping: Two distinct inventors are assigned the same ID, leading to overcounting of individual productivity. Most likely for inventors with common names.
Assignee: Corporate name changes, mergers, and subsidiaries create particular challenges. A single entity may appear under multiple names, or distinct entities may be incorrectly merged.

Chapters that rely heavily on disambiguated identities (including top inventors, serial vs. new inventors, inventor mechanics, and organizational mechanics) note this limitation in their measurement details panel.

Data Limitations

Granted patents only: The dataset includes only granted patents, not applications that were abandoned or rejected. This introduces survivorship bias: the analysis cannot measure inventive activity that does not result in a grant.

US patents only: The analysis covers patents granted by the USPTO. It does not include patents filed only at foreign patent offices (EPO, JPO, CNIPA, KIPO, etc.). Firms that patent primarily abroad may appear to have lower patent output in this dataset than their true inventive activity warrants.

Citation truncation: Recently granted patents have had less time to accumulate forward citations, creating a right-truncation bias in citation-based metrics. Five-year forward citation counts are unreliable for patents granted after 2020. System-level charts apply a conservative visual threshold at 2018 (graying out post-2018 data) to flag partially accumulated citations; firm-level analyses use a 2020 threshold. Cohort normalization mitigates but does not eliminate this bias.

Inventor disambiguation: PatentsView uses algorithmic disambiguation to link inventor records across patents. Both lumping errors (distinct inventors merged under one ID) and splitting errors (one inventor divided across several IDs) may occur.

Classification changes: The CPC system was introduced in 2013, replacing the earlier USPC system. Historical patents were retrospectively reclassified, but some inconsistencies may remain, particularly for patents from the 1970s–1980s.

Gender inference: Inventor gender is inferred from first names and may not reflect actual gender identity. Non-binary identities are not captured. Accuracy varies by cultural context and name ambiguity.

Partial year (2025): Data for 2025 covers grants through September only. Annual totals and rates for 2025 are not directly comparable to full-year figures without adjustment.

Domain count variations: Technology domain patent counts derived from annual per-year aggregations may differ slightly from cross-domain comparison totals due to patents lacking valid grant-year assignments or differences in CPC reclassification timing across pipeline stages. These variations are small (typically <1% of domain volume).

Assignee coverage: Not all patents have assignee records in PatentsView. Unassigned patents are excluded from organizational analyses, which may undercount individual inventor and small-entity patenting.

Citation category coverage: PatentsView provides a citation_category field that distinguishes "cited by examiner," "cited by applicant," "cited by third party," and other sources. However, this field is available only for citations from approximately 2001 onward and may have coverage gaps in earlier records.

Terminology Conventions

PatentWorld uses certain terms interchangeably in narrative text; this section defines each term precisely to avoid ambiguity.

Entity Terms

Assignee / organization / firm / company
Assignee is used in technical and data-processing contexts (the entity recorded in PatentsView's assignee table). Organization is the broadest term, encompassing corporations, universities, and government agencies. Firm and company are used when the discussion is restricted to corporate assignees.
Patent / grant / filing
Patent refers to a granted patent in most contexts. Grant emphasizes the issued status (vs. pending application). Filing refers to the application stage or, in country-of-origin analyses, to patents originating from a specific jurisdiction.

Technology Taxonomy

Domain / field / section / class / subclass
Domain refers to a broad technology area (e.g., “AI,” “green innovation”) typically defined by a curated set of CPC codes. Field is used for WIPO technology fields (35 categories). Section refers to CPC top-level sections (A–H, Y). Class and subclass refer to progressively finer levels of the CPC hierarchy (e.g., G06 is a class; G06N is a subclass). There are 8 primary CPC sections (A–H) plus the Y tagging section, approximately 130 classes, and approximately 670 subclasses in active use.
Patent share vs. market share
Patent share (preferred) refers to the fraction of patents held by an entity or group within a defined universe. PatentWorld avoids the term “market share” in the patent context because patents represent inventive output rather than product-market revenue.

URL Slug Convention

Chapter URLs use descriptive kebab-case slugs (e.g., /chapters/org-composition/ for “Assignee Composition”, /chapters/inv-gender/ for “Gender and Patenting”). Slugs are optimized for brevity and readability rather than literal title matching.

Data Source

All data are derived from PatentsView, a patent data platform supported by the United States Patent and Trademark Office (USPTO). The bulk data files were accessed in February 2026.

Table                     Description
g_patent                  9.36 million patent records with grant dates, types, and filing information
g_cpc_current             58 million CPC classification assignments
g_inventor_disambiguated  24 million disambiguated inventor records
g_us_patent_citation      151 million citation relationships
g_assignee_disambiguated  8.7 million assignee records
g_location_disambiguated  Geocoded inventor and assignee locations
g_gov_interest            Government interest statements identifying federally funded research

Data attribution: PatentsView (patentsview.org), USPTO. PatentsView is a tool built to increase the usability and transparency of US patent data. The database is derived from the USPTO examination and granting of patents.

For information about the author, chapter structure, and data sources, see the About page.