URI transmutation is the process of converting any URI into a set of equivalent URIs, equivalence being defined as directly or indirectly identifying the same resource.
All features presented here to normalize, transmute, or interpolate URIs are already present in our RESTful API. See our API documentation for details.
There are many types of unique identifiers on the web, and the acronyms can be confusing.
A Uniform Resource Identifier (URI) is a character string that unambiguously identifies a document on the web (a.k.a. a web resource).
A Uniform Resource Locator (URL) is also a URI. We think that the distinction between URIs and URLs is now obsolete, and we prefer the term URI. However, when we use the term URL, in line with common practice, we refer to URIs that use the
A Persistent Identifier (PID, a.k.a. a permalink) is a URI that provides a long-lasting reference to a document to prevent link rot. PIDs are an essential part of the knowledge graph that powers Cobaltmetrics, but so are other URIs, so we prefer the term URI.
The first step of the transmutation process is to normalize the input URI. URI normalization (a.k.a. canonicalization or standardization) is defined in part in RFC 3986, the RFC for the generic syntax of URIs, but most rules depend on the scheme of the input URI. For example, DOIs are case insensitive, and ORCID iDs should be hyphenated.
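The two scheme-specific rules mentioned above can be sketched as follows. This is a minimal illustration, not the Cobaltmetrics implementation: the function names are hypothetical, and choosing lowercase as the canonical form for DOIs is an assumption (the rule only says that case does not matter).

```python
import re


def normalize_doi(doi: str) -> str:
    """Return a DOI in one canonical case, with a 'doi:' prefix.

    Assumption: lowercase is the canonical form. DOIs are case
    insensitive, so any consistent choice would work.
    """
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "doi:" + doi.lower()


def normalize_orcid(orcid: str) -> str:
    """Return an ORCID iD hyphenated into four groups of four characters."""
    # Drop existing hyphens and whitespace, then validate the 16 characters:
    # 15 digits followed by a digit or 'X' as the final character.
    compact = re.sub(r"[\s-]", "", orcid).upper()
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        raise ValueError(f"not a 16-character ORCID iD: {orcid!r}")
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))
```

With this sketch, doi:10.1093/NAR/GKS1195 and doi:10.1093/nar/gks1195 normalize to the same string, and an unhyphenated ORCID iD is rewritten in its hyphenated form.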
See our page on URI schemes for the complete list of schemes that are currently supported in Cobaltmetrics.
Two URIs are considered to be equivalent if they identify the same resource.
For some URIs, equivalence can be computed on the fly. For others, equivalence needs to be learned from a database. For example, ORCID iDs are assigned from a well-documented, reserved block of ISNI identifiers, so all ORCID iDs are valid ISNI identifiers, and we know which ISNI identifiers can be converted to ORCID iDs. On the other hand, even if
doi:10.1093/NAR/GKS1195 refer to the same publication, the URIs have nothing in common and we must learn the relation from data made available by PubMed.
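Equivalence that can be computed on the fly might look like the sketch below for the ORCID/ISNI case. The function names are hypothetical, and the block boundaries in ASSUMED_ORCID_BLOCKS are an assumption included for illustration only; they should be checked against the ORCID documentation before being relied upon.

```python
import re

# Assumption: inclusive (low, high) bounds on the first 15 digits of the
# identifiers reserved for ORCID. Illustrative values, to be verified.
ASSUMED_ORCID_BLOCKS = [
    ("000000015000000", "000000035000000"),
]


def _compact(identifier: str) -> str:
    """Strip hyphens and spaces and check the 16-character shape."""
    compact = re.sub(r"[\s-]", "", identifier).upper()
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        raise ValueError(f"not a 16-character identifier: {identifier!r}")
    return compact


def orcid_to_isni(orcid: str) -> str:
    """Every ORCID iD is a valid ISNI: only the formatting changes."""
    return _compact(orcid)


def isni_to_orcid(isni: str, blocks=ASSUMED_ORCID_BLOCKS):
    """Return the hyphenated ORCID iD, or None if the ISNI falls outside
    the blocks assumed to be reserved for ORCID."""
    compact = _compact(isni)
    base = compact[:15]  # the last character is a check character
    # Equal-length digit strings compare in numeric order.
    if any(low <= base <= high for low, high in blocks):
        return "-".join(compact[i:i + 4] for i in range(0, 16, 4))
    return None
```

No database lookup is needed in this direction. Relations that cannot be derived from the identifiers themselves would instead require a table populated from an external data source.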
The URI transmutation API will soon include an option to disable interpolations. Please contact us if you are interested.
There is one small class of equivalence rules that requires special attention, because these rules make simplifying assumptions about URIs and URI equivalence. We call them interpolations because they allow us to simplify the transmutation process and improve the user experience, while not affecting any of the important conclusions drawn from the data.
The remainder of this section can be skipped by readers not interested in technical details.
By default, the transmutation API currently makes the following interpolations:
- An http URI is considered to be equivalent to the same URI with the https scheme, and vice versa;
- For https URIs, any URI whose host is prefixed with the www subdomain is considered to be equivalent to the same URI without the subdomain, and vice versa;
- For https URIs, any URI whose host is documented as an alias in Surveyor of Tiny Town is considered to be equivalent to the same URI with the canonical host for that warrior project;
- For https URIs, any URI whose path ends with a trailing forward slash (/) is considered to be equivalent to the same URI without the trailing slash, and vice versa;
- For https URIs, any URI whose query string contains two or more key-value pairs is considered to be equivalent to all URIs that can be built from permutations of the query string, other things held constant;
- For https URIs, any URI with a fragment identifier is considered to be equivalent to the same URI without the fragment;
- A domain URI is considered to be equivalent to the same URI with the
- A host URI is considered to be equivalent to the corresponding https URI with the domain or host as the authority component, a default path /, no query, and no fragment.
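To make the default rules more concrete, here is a minimal sketch that enumerates the variants of a single URI under the scheme, www, trailing-slash, query-permutation, and fragment rules. It is not the API's implementation: the interpolate function and its max_pairs guard are hypothetical, applying the rules to both http and https inputs is an assumption, and the host-alias and domain/host rules are left out because they depend on data not shown here.

```python
from itertools import permutations, product
from urllib.parse import urlsplit, urlunsplit


def interpolate(uri: str, max_pairs: int = 4) -> set[str]:
    """Return the input URI together with its interpolated variants."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        return {uri}  # this sketch only covers http and https URIs

    # Rule: http and https are interchangeable.
    schemes = ["http", "https"]

    # Rule: the www subdomain is optional (userinfo is ignored here).
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    hosts = [bare, "www." + bare]

    # Rule: a trailing slash on the path is optional.
    path = parts.path
    paths = [path, path[:-1]] if path.endswith("/") else [path, path + "/"]

    # Rule: key-value pairs may appear in any order. The max_pairs guard
    # is an assumption to keep the factorial growth under control.
    pairs = parts.query.split("&") if parts.query else []
    if 2 <= len(pairs) <= max_pairs:
        queries = {"&".join(p) for p in permutations(pairs)}
    else:
        queries = {parts.query}

    # Rule: the fragment may be dropped (but is never invented).
    fragments = {parts.fragment, ""}

    return {
        urlunsplit(combo)
        for combo in product(schemes, hosts, paths, queries, fragments)
    }
```

For an input such as http://www.example.org/a?x=1&y=2#top, the returned set contains https://example.org/a/?y=2&x=1 among others. The number of variants grows quickly with the number of query pairs, which is why the sketch caps permutations.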
These assumptions do not hold in all cases, but they hold in most cases and the impact of false positives is minimal.
The following URI interpolations are disabled by default, but can be enabled in the API:
For https URIs, any URI returned in the Link headers of an HTTP HEAD request is considered to be equivalent if it is typed with one of the following relation types:
working-copy-of. See our blog post on Signposting for more information.
Note that these interpolations are unstable over time and thus require the use of the X-Release: unstable header. See the API documentation for details.
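As an illustration of the Link header rule, the sketch below extracts the targets of a Link header value whose relation type is in an allowed set. It only parses a header string and does not perform the HEAD request. The function name is hypothetical, the parsing is deliberately simplified (it does not handle commas inside quoted parameter values), and the set of accepted relation types is passed in by the caller rather than assumed.

```python
import re


def equivalent_link_targets(link_header: str, allowed_rels: set[str]) -> list[str]:
    """Return target URIs typed with at least one allowed relation type."""
    targets = []
    # Split on commas that are followed by the next '<target>'.
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part, re.S)
        if not match:
            continue
        uri, params = match.groups()
        # The rel parameter may hold several space-separated relation types.
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params, re.I)
        rels = set(rel.group(1).lower().split()) if rel else set()
        if rels & allowed_rels:
            targets.append(uri)
    return targets
```

Given the header value `<https://example.org/paper>; rel="working-copy-of", <https://example.org/style.css>; rel="stylesheet"` and the allowed set {"working-copy-of"}, only https://example.org/paper is returned.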
The URI transmutation API contains various circuit breakers used to prevent operations from causing latency or errors. See the API documentation for more information.
See our page on data sources.