URI Transmutation

Last updated on June 07, 2019

URI transmutation is the process of converting any URI into a set of equivalent URIs, equivalence being defined as directly or indirectly identifying the same resource.

All features presented here to normalize, transmute, or interpolate URIs are already present in our RESTful API. See our API documentation for details.

URIs, URLs, and PIDs

There are many types of unique identifiers on the web, and the acronyms can be confusing.

A Uniform Resource Identifier (URI) is a character string that unambiguously identifies a document on the web (a.k.a. a web resource).

A Uniform Resource Locator (URL) is also a URI. We think that the distinction between URIs and URLs is now obsolete, and we prefer the term URI. However, when we use the term URL, in line with common practice, we refer to URIs that use the http, https, or ftp schemes.

A Persistent Identifier (PID, a.k.a. a permalink) is a URI that provides a long-lasting reference to a document to prevent link rot. PIDs are an essential part of the knowledge graph that powers Cobaltmetrics, but so are other URIs, so we prefer the term URI.

URI Normalization

The first step of the transmutation process is to normalize the input URI. URI normalization (a.k.a. canonicalization or standardization) is defined in part in RFC 3986, which specifies the generic syntax of URIs, but most rules depend on the scheme of the input URI. For example, DOIs are case insensitive, and ORCID iDs should be hyphenated.
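
The two scheme-specific rules mentioned above can be illustrated with a minimal Python sketch. It is not the Cobaltmetrics implementation: the function names are hypothetical, the orcid: prefix is assumed for illustration, and folding DOIs to upper case is an assumption made only to match the doi:10.1093/NAR/GKS1195 example used on this page.

```python
import re


def normalize_doi(uri: str) -> str:
    """Fold a doi: URI to a single case, since DOIs are case insensitive."""
    scheme, _, name = uri.strip().partition(":")
    if scheme.lower() != "doi" or not name:
        raise ValueError(f"not a doi URI: {uri!r}")
    # Assumption: upper case is chosen as the canonical form.
    return "doi:" + name.upper()


def normalize_orcid(uri: str) -> str:
    """Rewrite an orcid: URI in hyphenated groups of four characters."""
    scheme, _, value = uri.strip().partition(":")
    if scheme.lower() != "orcid":
        raise ValueError(f"not an orcid URI: {uri!r}")
    # Drop any existing hyphens or spaces before regrouping.
    compact = re.sub(r"[\s-]", "", value).upper()
    # An ORCID iD has 15 digits followed by a check character (digit or X).
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        raise ValueError(f"not a well-formed ORCID iD: {uri!r}")
    groups = [compact[i:i + 4] for i in range(0, 16, 4)]
    return "orcid:" + "-".join(groups)
```

For instance, normalize_orcid("orcid:0000000218250097") returns "orcid:0000-0002-1825-0097". The sketch only checks the shape of the identifier, not its check character.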

See our page on URI schemes for the complete list of schemes that are currently supported in Cobaltmetrics.

URI Equivalence

Two URIs are considered to be equivalent if they identify the same resource.

For some URIs, equivalence can be computed on the fly. For others, equivalence needs to be learned from a database. For example, ORCID iDs are assigned from a well-documented, reserved block of ISNI identifiers, so all ORCID iDs are valid ISNI identifiers, and we know which ISNI identifiers can be converted to ORCID iDs. On the other hand, even if pmid:23193287, pmcid:3531190, and doi:10.1093/NAR/GKS1195 refer to the same publication, the URIs have nothing in common and we must learn the relation from data made available by PubMed.
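
As an illustration of equivalence computed on the fly, the Python sketch below converts between ISNI identifiers and ORCID iDs. It rests on stated assumptions: the block boundaries are the ranges published by ORCID as we understand them and should be checked against ORCID's own documentation, the function names are hypothetical, and the check character is computed with ISO 7064 MOD 11-2, which both identifier systems use.

```python
import re

# Assumption: ORCID's reserved ISNI blocks, compared on the 15 base digits
# (check character excluded). Verify against ORCID documentation before use.
ORCID_BLOCKS = [
    (15_000_000, 35_000_000),            # 0000-0001-5000-000x .. 0000-0003-5000-000x
    (900_000_000_000, 900_100_000_000),  # 0009-0000-0000-000x .. 0009-0010-0000-000x
]


def check_character(base: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for digit in base:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def isni_to_orcid(isni: str):
    """Return the hyphenated ORCID iD for an ISNI, or None if not convertible."""
    compact = re.sub(r"[\s-]", "", isni).upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return None
    if check_character(compact[:15]) != compact[15]:
        return None  # invalid check character
    base = int(compact[:15])
    if not any(low <= base <= high for low, high in ORCID_BLOCKS):
        return None  # valid ISNI, but outside the blocks reserved for ORCID
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))


def orcid_to_isni(orcid: str) -> str:
    """Every ORCID iD is a valid ISNI: remove the hyphens."""
    return orcid.replace("-", "").upper()
```

No lookup table is needed in either direction, which is what distinguishes this case from the pmid, pmcid, and doi example, where the relation has to be learned from data.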

URI Interpolation

The URI transmutation API will soon include an option to disable interpolations. Please contact us if you are interested.

There is one small class of equivalence rules that requires special attention, because these rules make simplifying assumptions about URIs and URI equivalence. We call them interpolations because they allow us to simplify the transmutation process and improve the user experience, while not affecting any of the important conclusions drawn from the data.

The remainder of this section can be skipped by readers not interested in technical details.

Default URI Interpolations

By default, the transmutation API currently makes the following interpolations:

  • Any http URI is considered to be equivalent to the same URI with the https scheme, and vice versa;
  • For http and https URIs, any URI whose host is prefixed with the www subdomain is considered to be equivalent to the same URI without the subdomain, and vice versa;
  • For http and https URIs, any URI whose host is documented as an alias in Surveyor of Tiny Town is considered to be equivalent to the same URI with the canonical host for that warrior project;
  • For http and https URIs, any URI whose path ends with a trailing forward slash / is considered to be equivalent to the same URI without the trailing slash, and vice versa;
  • For http and https URIs, any URI whose query string contains two or more key-value pairs is considered to be equivalent to all URIs that can be built from permutations of the query string, other things held constant;
  • For http and https URIs, any URI with a fragment identifier is considered to be equivalent to the same URI without the fragment;
  • Any domain URI is considered to be equivalent to the same URI with the host scheme;
  • Any domain or host URI is considered to be equivalent to the corresponding https URI with the domain or host as the authority component, a default path /, no query, and no fragment.
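
Some of the rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API's implementation: the function name and the max_pairs limit are hypothetical, the host alias rule and the domain and host rules are left out, and the www check assumes a lower-case authority with no user information.

```python
from itertools import permutations, product
from typing import Set
from urllib.parse import urlsplit, urlunsplit


def interpolate(uri: str, max_pairs: int = 4) -> Set[str]:
    """Return the URIs considered equivalent to an http or https URI."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        return {uri}

    # http and https are interchangeable.
    schemes = ("http", "https")

    # With and without the www subdomain.
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    hosts = (bare, "www." + bare)

    # With and without a single trailing slash on the path.
    path = parts.path
    stripped = path[:-1] if path.endswith("/") else path
    paths = (stripped, stripped + "/")

    # All orderings of the query pairs. The number of permutations grows
    # factorially, so the hypothetical max_pairs limit skips long queries.
    pairs = parts.query.split("&") if parts.query else []
    if 2 <= len(pairs) <= max_pairs:
        queries = {"&".join(order) for order in permutations(pairs)}
    else:
        queries = {parts.query}

    # With the original fragment and without any fragment.
    fragments = {parts.fragment, ""}

    return {
        urlunsplit(combination)
        for combination in product(schemes, hosts, paths, queries, fragments)
    }
```

For http://www.example.org/a/?x=1&y=2#top, the sketch yields 32 URIs, one of which is https://example.org/a?y=2&x=1. The size of that set for a single short URI shows why limits on the expansion are useful.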

These assumptions do not hold in all cases, but they hold in most cases and the impact of false positives is minimal.

Advanced URI Interpolations

The following URI interpolations are disabled by default, but can be enabled in the API:

  • For http and https URIs, any URI returned in the Link headers of an HTTP HEAD request is considered to be equivalent if it is typed with one of the following relation types: alternate, bookmark, canonical, cite-as, duplicate, identifier, latest-version, memento, predecessor-version, self, successor-version, working-copy-of. See our blog post on Signposting for more information.

Note that these interpolations are unstable over time and thus require the use of the X-Release: unstable header. See the API documentation for details.
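
To make the Link header rule concrete, here is a minimal Python sketch that picks out the targets typed with one of the relation types listed above. The function name is hypothetical, and the parser is deliberately simple: it assumes that no parameter value contains a < character, so it is not a complete implementation of the Link header grammar.

```python
import re

# Relation types treated as equivalence links, copied from the list above.
EQUIVALENT_RELS = frozenset({
    "alternate", "bookmark", "canonical", "cite-as", "duplicate",
    "identifier", "latest-version", "memento", "predecessor-version",
    "self", "successor-version", "working-copy-of",
})

# A link value is <target> followed by its parameters, up to the next "<".
_LINK_VALUE = re.compile(r"<([^>]*)>([^<]*)")
# The rel parameter is either quoted (possibly several types) or a bare token.
_REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,]+))', re.IGNORECASE)


def equivalent_links(link_header: str) -> list:
    """Return the link targets whose rel includes an equivalence type."""
    found = []
    for target, params in _LINK_VALUE.findall(link_header):
        match = _REL_PARAM.search(params)
        if match is None:
            continue
        # A rel value may list several space-separated relation types.
        rels = (match.group(1) or match.group(2) or "").lower().split()
        if EQUIVALENT_RELS.intersection(rels):
            found.append(target)
    return found
```

The header value itself would come from the response to an HTTP HEAD request, which is omitted here so that the sketch can be run offline.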

Circuit Breakers

The URI transmutation API contains various circuit breakers used to prevent operations from causing latency or errors. See the API documentation for more information.

Data Sources

See our page on data sources.