Not everything that counts can be counted, and not everything that can be counted counts. We focus on free, publicly available sources to guarantee the reproducibility of our datasets.
Our API offers a machine-readable list of all the web resources that we remix to build our citation index and our knowledge graph, along with the exact URLs that we downloaded, the timestamps of the downloads, and the checksums of the downloaded files.
See our API documentation for more information.
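As an illustration of how the published checksums could support reproducibility, here is a minimal sketch that checks a locally downloaded file against one entry of the source list. The record layout (`url`, `downloaded_at`, `checksum`) and the choice of SHA-256 are assumptions made for this example, not the actual API schema; refer to the API documentation for the real field names and hash algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry of the source list. All field names are hypothetical."""
    url: str            # exact URL that was downloaded
    downloaded_at: str  # download timestamp (format assumed, e.g. ISO 8601)
    checksum: str       # hex digest of the downloaded file (SHA-256 assumed)


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_record(path: str, record: SourceRecord, algorithm: str = "sha256") -> bool:
    """Check whether a local file has the checksum listed in the record."""
    return file_digest(path, algorithm) == record.checksum.strip().lower()
```

A file that passes `matches_record` is byte-for-byte identical to the one listed in the record, under the assumed hash algorithm.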
Our citation index includes citations from the following datasets:
- CommonCrawl (coming soon to Cobaltmetrics):
- CourtListener:
  - Creators: U.S. state and federal courts, via the Free Law Project
  - License: CC0 1.0
  - Usage notes: Cobaltmetrics indexes citations from all court opinions that are not blocked
- Hypothesis (temporarily excluded from our citation index due to performance issues):
- Usenet (temporarily excluded from our citation index due to data quality issues):
  - Creators: the Usenet community, via the Internet Archive
  - License: TBC
  - Usage notes: Cobaltmetrics indexes citations from all posts and all newsgroups in the archive
- Wikimedia projects:
  - Creators: the Wikimedia community
  - Licenses: GFDL and/or CC BY-SA 3.0, or CC BY 2.5, depending on the project (see https://dumps.wikimedia.org/legal.html and https://en.wikipedia.org/wiki/Wikipedia:Copyrights)
  - Usage notes: Cobaltmetrics indexes citations from all projects with public data dumps (Wikibooks, Wikinews, Wikipedia, Wikiquote, Wikisource, Wikispecies, Wikiversity, Wikivoyage, Wiktionary, etc.), in all languages, except Wikidata, which is used to build the knowledge graph
Our knowledge graph is derived from the following resources. Unless stated otherwise in usage notes, Cobaltmetrics indexes all identifier mappings from each resource.
URI badges build upon additional resources to trigger URI alerts: