What do 9.36 million patents actually talk about? By applying NMF (non-negative matrix factorization, a matrix decomposition method that produces interpretable, additive topic representations) to every patent abstract filed with the USPTO since 1976, we can uncover the hidden themes of American innovation. TF-IDF (term frequency–inverse document frequency, a statistic that weights a word by how important it is to a document relative to the whole collection, so that common words are down-weighted) converts the raw text into numerical features, and NMF then discovers 25 latent topics. The analysis reveals which themes are rising, which are declining, and how they map onto the formal technology classification system.
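To make the TF-IDF step concrete, here is a minimal sketch in plain Python. It is not the pipeline used for this analysis: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and the three toy abstracts are all assumptions chosen for illustration, and it uses unigrams only rather than unigrams plus bigrams.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {}
        for term, count in tf.items():
            # Smoothed IDF (one common variant, assumed here).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            weights[term] = count * idf
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical toy abstracts, not taken from the patent data.
abstracts = [
    "semiconductor device with gate electrode",
    "wireless communication device with antenna",
    "polymer composition with catalyst",
]
vectors = tfidf(abstracts)
```

In the first toy abstract, "with" appears in every document and so receives the lowest weight, "device" appears in two and ranks in the middle, and "semiconductor" appears in only one and ranks highest.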
Emerging Themes
The stacked area chart below shows how the share of each topic has evolved over time. Computing, semiconductor, and communications topics have expanded dramatically, while traditional mechanical and chemical engineering topics have seen their relative share decline — though not necessarily in absolute volume.
Topic Prevalence Over Time
Share of patents belonging to each of 25 NMF-derived topics, 1976–2025. Topics sorted by total patent count.
The language of innovation has shifted decisively toward computing and digital technology over 50 years. Topics related to software, semiconductors, and wireless communications now dominate patent abstracts.
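The shares plotted above can be computed along the following lines. This is a sketch under a stated assumption: each patent is counted toward its single highest-weight topic, which the article does not confirm for this chart. The function names and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_topic(weights):
    """Index of the largest topic weight (assumed assignment rule)."""
    return max(range(len(weights)), key=weights.__getitem__)


def topic_shares_by_year(records):
    """records: iterable of (year, topic_weights).

    Returns {year: {topic_index: share_of_that_year's_patents}}.
    """
    counts = defaultdict(Counter)
    for year, weights in records:
        counts[year][dominant_topic(weights)] += 1
    shares = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[year] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Hypothetical toy data: three topics, two filing years.
records = [
    (1980, [0.7, 0.2, 0.1]),
    (1980, [0.6, 0.3, 0.1]),
    (1980, [0.1, 0.8, 0.1]),
    (1980, [0.2, 0.1, 0.7]),
    (2020, [0.1, 0.1, 0.8]),
    (2020, [0.2, 0.1, 0.7]),
    (2020, [0.0, 0.3, 0.7]),
    (2020, [0.9, 0.1, 0.0]),
]
shares = topic_shares_by_year(records)
```

Because each year is normalized separately, a topic's share can fall even while its absolute patent count rises, which is the caveat noted for mechanical and chemical topics.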
The Patent Landscape
To visualize the full semantic landscape of patents, we project a stratified sample of 15,000 patents from high-dimensional TF-IDF space into two dimensions using UMAP. Each dot represents a patent, colored by its dominant topic. Clusters reveal families of related inventions; overlapping regions reveal technology convergence.
Semantic Map of Patents (UMAP)
15,000 patents projected into 2D via UMAP on TF-IDF vectors (600 per topic, stratified). Each dot = one patent, colored by dominant topic.
The UMAP projection reveals clear topic clusters with meaningful spatial relationships — computing and electronics topics cluster together, while chemistry and biotech form their own neighborhood. Bridging patents between clusters often represent the most novel cross-domain inventions.
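The stratified sample behind the map (600 patents per topic across 25 topics, giving 15,000) could be drawn roughly as follows. The function name, the fixed seed, and the toy topic sizes are assumptions. The projection step itself requires a third-party UMAP implementation and is not shown here.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(dominant_topics, per_topic, seed=0):
    """Sample up to `per_topic` patent indices from each topic.

    dominant_topics[i] is the dominant topic id of patent i.
    Returns a sorted list of sampled patent indices.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for idx, topic in enumerate(dominant_topics):
        by_topic[topic].append(idx)
    sample = []
    for topic in sorted(by_topic):
        members = by_topic[topic]
        # Topics smaller than the quota contribute all their members.
        k = min(per_topic, len(members))
        sample.extend(rng.sample(members, k))
    return sorted(sample)


# Hypothetical toy corpus: 25 topics of unequal size.
dominant_topics = [t for t in range(25) for _ in range(700 + 10 * t)]
sample = stratified_sample(dominant_topics, per_topic=600)
per_topic_counts = Counter(dominant_topics[i] for i in sample)
```

Sampling an equal number per topic keeps small topics visible on the map, at the cost of no longer reflecting each topic's true share of the corpus.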
Topics and Technology
How do the discovered topics map onto the formal CPC classification system? The chart below cross-tabulates the top 8 most prevalent topics against CPC sections. Some topics align closely with a single CPC section (e.g., chemistry-related topics map to section C), while others — especially computing — span multiple sections.
Topic Distribution by CPC Section
Share (%) of patents in each CPC section belonging to each of the top 8 topics. Sections ordered A–H.
Topics related to computing and data processing appear across nearly all CPC sections, confirming that digital technology has become a general-purpose innovation platform that pervades every industry.
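A cross-tabulation of this kind can be sketched as below. Two simplifications are assumed: each patent has one primary CPC section and one dominant topic. In practice a patent may carry several CPC codes, and the article does not say how those cases were handled. The function name and toy data are hypothetical.

```python
from collections import Counter


def topic_share_by_section(patents, top_n=8):
    """patents: iterable of (cpc_section, dominant_topic) pairs.

    Returns {section: {topic: percent of that section's patents}},
    restricted to the top_n most prevalent topics overall.
    """
    patents = list(patents)
    topic_counts = Counter(topic for _, topic in patents)
    top_topics = [t for t, _ in topic_counts.most_common(top_n)]
    section_totals = Counter(section for section, _ in patents)
    cell_counts = Counter(patents)
    return {
        section: {
            t: 100.0 * cell_counts[(section, t)] / total
            for t in top_topics
        }
        for section, total in sorted(section_totals.items())
    }


# Hypothetical toy data with made-up topic labels.
patents = (
    [("C", "chemistry")] * 3
    + [("C", "computing")]
    + [("G", "computing")] * 3
    + [("H", "computing")] * 2
    + [("H", "semiconductors")] * 2
)
table = topic_share_by_section(patents, top_n=2)
```

In this toy table the chemistry topic is confined to section C, while the computing topic shows up in all three sections, mirroring the pattern described above.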
Novelty
How novel are today's patents compared to decades past? We measure novelty as the Shannon entropy of each patent's topic distribution: patents that draw roughly equally from many topics (high entropy) are more thematically diverse — and arguably more novel — than patents concentrated in a single topic (low entropy).
Patent Novelty Over Time
Median and average Shannon entropy of patent topic distributions by year. Higher entropy = more thematically diverse patents.
Patent novelty has risen steadily since the 1990s, suggesting that modern inventions increasingly combine ideas from multiple technology domains. This trend accelerated in the 2010s, coinciding with the rise of AI and other general-purpose technologies.
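The entropy measure used in this section can be illustrated with a short sketch. The article does not state the logarithm base or how topic weights are normalized, so the choices here (weights rescaled to sum to 1, base-2 logarithm) are assumptions, as is the function name.

```python
import math


def topic_entropy(weights):
    """Shannon entropy, in bits, of a non-negative topic weight vector."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    # Normalize to a probability distribution; zero weights contribute nothing.
    probs = [w / total for w in weights if w > 0]
    return sum(p * math.log2(1 / p) for p in probs)


# Hypothetical four-topic examples.
focused = [1.0, 0.0, 0.0, 0.0]     # all weight on one topic
mixed = [0.25, 0.25, 0.25, 0.25]   # weight spread evenly over four topics

focused_entropy = topic_entropy(focused)
mixed_entropy = topic_entropy(mixed)
```

Under these assumptions a patent concentrated in one topic scores 0, and with 25 topics the maximum possible score is log2(25), about 4.64 bits, reached only when all topics carry equal weight.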
With the hidden thematic structure of patent language uncovered, the final chapter zooms in to the company level, building interactive innovation profiles for 100 major patent filers. The topics and trends identified here provide the foundation for understanding how individual firms have navigated the evolving technology landscape.
This analysis uses TF-IDF vectorization (10,000 features, unigrams + bigrams) and NMF with 25 components on 9.36 million patent abstracts from 1976–2025. The UMAP (Uniform Manifold Approximation and Projection, a dimensionality reduction technique that preserves both local and global structure) projection uses a stratified sample of 15,000 patents (600 per topic) with cosine distance. Novelty is measured as Shannon entropy of the NMF topic weight vector. Topic names are auto-generated from the top-weighted terms and may not perfectly capture all nuances of each topic cluster. Source: PatentsView / USPTO.
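For readers curious how NMF factors a document-term matrix, the sketch below uses the classic multiplicative update rules for the squared-error objective. It is an illustration only: the article does not say which solver or library was used, and the helper names, iteration count, seed, and 4×4 toy matrix are assumptions. A real run at this scale would rely on an optimized sparse implementation.

```python
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for rv, rw in zip(v, wh)
               for x, y in zip(rv, rw)) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_terms)] for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_topics)] for i in range(n_docs)]
    return w, h


# Hypothetical toy matrix: two groups of documents using disjoint terms.
v = [[1, 1, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 3, 3]]
w, h = nmf(v, n_topics=2)
error = frobenius_error(v, w, h)
```

Each row of `w` is one document's topic weight vector, the quantity whose entropy defines novelty here, and each row of `h` lists the term weights from which topic names would be derived. Because the updates only multiply non-negative numbers, both factors stay non-negative throughout.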