URI Transmutation

This page is a work in progress. Please contact us if you have any questions.

URI transmutation is the process of converting a URI into the set of URIs equivalent to it, where two URIs are equivalent if they directly or indirectly identify the same resource.

URI normalization

The first step of the transmutation process is to normalize the input URI. URI normalization (also known as canonicalization or standardization) is defined in part by RFC 3986, the RFC for the generic syntax of URIs, but most rules depend on the scheme of the input URI. For example, DOIs are case-insensitive, and ORCID iDs should be hyphenated.
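The two scheme-specific rules mentioned above can be sketched as follows. This is only an illustration, not the API's implementation: the function names, the `doi:` and `orcid:` prefixes, and the choice of lowercase as the canonical case for DOIs are assumptions made for the example.

```python
import re


def normalize_doi(uri: str) -> str:
    """Fold a DOI URI to a single case, since DOIs are case-insensitive.

    Lowercase is an arbitrary choice for this sketch; any consistent case works.
    """
    scheme, _, name = uri.partition(":")
    return f"{scheme.lower()}:{name.lower()}"


def normalize_orcid(uri: str) -> str:
    """Rewrite an ORCID iD as four hyphen-separated groups of four characters."""
    scheme, _, name = uri.partition(":")
    # Drop existing hyphens and whitespace, then check the overall shape:
    # 15 digits followed by a digit or the check character X.
    chars = re.sub(r"[\s-]", "", name).upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", chars):
        raise ValueError(f"not an ORCID iD: {uri!r}")
    groups = [chars[i:i + 4] for i in range(0, 16, 4)]
    return f"{scheme.lower()}:{'-'.join(groups)}"


print(normalize_doi("DOI:10.1093/NAR/GKS1195"))
print(normalize_orcid("orcid:0000000218250097"))
```

Normalizing before any comparison means that later steps can rely on plain string equality.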

URI equivalence

Two URIs are considered to be equivalent if they identify the same resource.

For some URIs, equivalence can be computed on the fly. For others, equivalence needs to be learned from a database. For example, ORCID iDs are assigned from a well-documented, reserved block of ISNI identifiers, so all ORCID iDs are valid ISNI identifiers, and we know which ISNI identifiers can be converted to ORCID iDs. On the other hand, although pmid:23193287, pmcid:3531190, and doi:10.1093/NAR/GKS1195 refer to the same publication, the URIs have nothing in common, and we must learn the relation from data made available by PubMed.
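The difference between the two kinds of equivalence can be sketched as follows. The ORCID-to-ISNI direction is computed from the identifier alone, using the ISO 7064 MOD 11-2 check character that both identifier systems share, while the publication identifiers are looked up in a table. The function names, the `isni:` prefix, and the in-memory table are hypothetical stand-ins for this example; a real table would be built from the data source, and the sketch does not check the reserved ORCID block.

```python
import re


def check_character(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_to_isni(uri: str) -> str:
    """Computed equivalence: derive the ISNI form of an ORCID iD on the fly."""
    _, _, name = uri.partition(":")
    chars = name.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", chars):
        raise ValueError(f"not an ORCID iD: {uri!r}")
    if check_character(chars[:15]) != chars[15]:
        raise ValueError(f"bad check character: {uri!r}")
    return f"isni:{chars}"  # the "isni:" prefix is an assumption of this sketch


# Learned equivalence: a hypothetical table of groups of normalized URIs
# that identify the same resource, using the example from the text.
LEARNED_GROUPS = [
    frozenset({"pmid:23193287", "pmcid:3531190", "doi:10.1093/nar/gks1195"}),
]


def learned_equivalents(uri: str) -> set:
    """Return the other URIs recorded as equivalent to the given one."""
    for group in LEARNED_GROUPS:
        if uri in group:
            return set(group) - {uri}
    return set()


print(orcid_to_isni("orcid:0000-0002-1825-0097"))
print(sorted(learned_equivalents("pmid:23193287")))
```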

URI interpolation

The URI transmutation API will soon include an option to disable interpolations. Please contact us if you are interested.

There is one small class of equivalence rules that requires special attention, because these rules make simplifying assumptions about URIs and URI equivalence. We call them interpolations because they allow us to simplify the transmutation process and improve the user experience, while not affecting any of the important conclusions drawn from the data.

The transmutation API currently makes the following interpolations:

  • Any HTTP URL is considered to be equivalent to the same URL with the HTTPS scheme, and vice versa;
  • For HTTP and HTTPS URLs, any URL host prefixed with the www subdomain is considered to be equivalent to the same host without the prefix, and vice versa.

These assumptions do not hold in all cases, but they hold in most, and the impact of false positives is minimal.
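The two interpolations listed above can be sketched as a function that expands one URL into its assumed-equivalent variants. This is a minimal illustration, not the API's code: the function name is hypothetical, and the sketch assumes the input is already normalized and carries no userinfo component.

```python
from urllib.parse import urlsplit, urlunsplit


def interpolate(url: str) -> set:
    """Return the URL together with its HTTP/HTTPS and www/non-www variants."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        # The interpolations only apply to HTTP and HTTPS URLs.
        return {url}
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    variants = set()
    for scheme in ("http", "https"):
        for netloc in (bare, "www." + bare):
            variants.add(
                urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
            )
    return variants


for variant in sorted(interpolate("http://www.example.org/a?b=1")):
    print(variant)
```

Any HTTP or HTTPS URL yields four variants, including the input itself; URIs with other schemes are returned unchanged.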

Data sources

Coming soon.