September 10, 2020

Bring on the data citations!

In a recent blog post, we talked about the process of building preprint citation handling into Edifix. But that’s not all we’ve been working on!

Starting in September 2020, you’ll see two new behaviors in Edifix processing and Crossref linking: first, Edifix can now parse data citations; and, second, we’ve updated and improved how Edifix verifies DOIs.

What does Edifix do with data citations?

Edifix approaches data citations much like citations to preprints and conference proceedings: these types of entries are parsed but not restructured.

Here’s an example:



As you can see, Edifix has hyperlinked the DOI and changed it to URL format, but has not added, changed, or rearranged any other reference element. 
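
For readers curious what "changed it to URL format" means in practice, here is a minimal sketch. It assumes the DOI appears either bare, with a `doi:` label, or as an older resolver URL. The function name `doi_to_url`, the regular expression, and the sample reference are ours for illustration, not Edifix's actual implementation.

```python
import re

# Hypothetical pattern: a DOI, optionally preceded by "doi:" or a resolver URL.
DOI_PATTERN = re.compile(
    r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)",
    re.IGNORECASE,
)


def doi_to_url(reference: str) -> str:
    """Rewrite every DOI in a reference string as an https://doi.org/ URL."""

    def _replace(match: re.Match) -> str:
        raw = match.group(1)
        # Sentence punctuation that follows the DOI is not part of the DOI.
        doi = raw.rstrip(".,;")
        trailing = raw[len(doi):]
        return f"https://doi.org/{doi}{trailing}"

    return DOI_PATTERN.sub(_replace, reference)


# Made-up reference, used only to exercise the function.
sample = "Doe J. 2019. Data from: An example study. Dryad. doi:10.5061/dryad.abc123."
print(doi_to_url(sample))
```

Note that only the DOI is touched; every other element of the reference is left exactly as the author wrote it.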

How does it work?

In deciding whether an entry should be identified as a data citation, Edifix looks for the names of known data servers (e.g., Dryad Digital Repository, HEPData, Harvard Dataverse) and for DOI prefixes that we know to be associated with specific data servers.

If one or more of these elements is found in a reference entry, Edifix identifies it as a data citation.
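
The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Edifix's code: the server names come from the examples in this post, the DOI prefixes are ones commonly seen for those repositories, and the names `DATA_ONLY_SERVERS`, `DATA_ONLY_DOI_PREFIXES`, and `looks_like_data_citation` are hypothetical.

```python
import re

# Illustrative lists only; the lists Edifix actually uses are not given here.
DATA_ONLY_SERVERS = ("dryad digital repository", "hepdata", "harvard dataverse")
DATA_ONLY_DOI_PREFIXES = ("10.5061", "10.17182", "10.7910")

# Captures the registrant prefix (the part before the slash) of any DOI.
DOI_PREFIX_RE = re.compile(r"\b(10\.\d{4,9})/\S+")


def looks_like_data_citation(entry: str) -> bool:
    """Return True if the entry names a data-only server or uses one of its DOI prefixes."""
    lowered = entry.lower()
    if any(name in lowered for name in DATA_ONLY_SERVERS):
        return True
    return any(
        match.group(1) in DATA_ONLY_DOI_PREFIXES
        for match in DOI_PREFIX_RE.finditer(entry)
    )


# Made-up entries, used only to exercise the function.
print(looks_like_data_citation("Doe J. (2018). Data from: A study. Dryad Digital Repository."))
print(looks_like_data_citation("Doe J. (2018). An article. J Ex. 5:1-10. doi:10.1000/xyz"))
```

Either signal alone is enough in this sketch, which mirrors the "one or more of these elements" rule.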

→ JATS Note: When you export XML from Edifix, you’ll see that data sources are incorrectly tagged as <article-title> instead of <data-title>. This is because Edifix currently exports JATS 1.0, which does not support <data-title>. We are surveying Edifix customers who use JATS in preparation for an update to JATS 1.2.

When is a data citation not (necessarily) a data citation? 

Some data servers, such as those we mentioned above, host data sets exclusively; if a reference entry includes the name of one of these servers, we can be confident that the entry should be identified as a data citation and tagged as <data> in XML export. 

Other servers, including the widely used Zenodo and FigShare, host data sets and also a variety of other content types. This means that in order to identify a reference to a source hosted on one of these servers as a data set, Edifix needs more information: for example, a Zenodo entry that includes a URL or DOI + “data set” can confidently be identified as a data citation.

Edifix output:

JATS 1.0 XML export:

On the other hand, an entry that includes only the elements “Zenodo” + a URL or DOI, with no explicit “data set” label, will be identified as a website or online document.

Edifix output:

JATS 1.0 XML export:

As always, Edifix errs on the side of caution in making this decision. In the second example above, the presence of the word “Datasets” in the title of the cited work isn’t enough to reliably identify this as a data citation, since without the explicit identification “[Data set]”, this could be an article about data sets! 
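
The cautious rule for mixed-content servers can be sketched like this. It assumes the explicit label takes the bracketed form “[Data set]” (or “[Dataset]”). The function name, the return values, and the sample entries are hypothetical and only illustrate the decision, not how Edifix implements it.

```python
import re

# Servers that host data sets alongside other content types.
MIXED_CONTENT_SERVERS = ("zenodo", "figshare")

# Explicit type label, e.g. "[Data set]" or "[Dataset]" (assumed form).
DATA_LABEL_RE = re.compile(r"\[\s*data\s?set\s*\]", re.IGNORECASE)

# Any URL, or a bare DOI.
LOCATOR_RE = re.compile(r"https?://\S+|\b10\.\d{4,9}/\S+", re.IGNORECASE)


def classify_mixed_host_entry(entry: str) -> str:
    """Classify an entry that cites a mixed-content server.

    Returns "data" only when an explicit label is present; otherwise the
    entry falls back to "website-or-online-document".
    """
    lowered = entry.lower()
    on_mixed_host = any(name in lowered for name in MIXED_CONTENT_SERVERS)
    has_locator = LOCATOR_RE.search(entry) is not None
    if not (on_mixed_host and has_locator):
        return "unknown"  # this sketch only covers mixed-content servers
    if DATA_LABEL_RE.search(entry):
        return "data"
    return "website-or-online-document"


# Made-up entries, used only to exercise the function.
labelled = "Roe A. (2020). Survey results [Data set]. Zenodo. https://doi.org/10.5281/zenodo.0000000"
unlabelled = "Roe A. (2020). Benchmark Datasets for X. Zenodo. https://doi.org/10.5281/zenodo.0000000"
print(classify_mixed_host_entry(labelled))
print(classify_mixed_host_entry(unlabelled))
```

The word “Datasets” in the second title does not match the bracketed label, so that entry stays on the conservative side of the decision.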


Like everything in Edifix reference processing, recognizing and parsing data-set references is an ongoing project: we keep learning about new data archives and encountering new (and sometimes creative!) ways of citing data. If you see a data citation that is not handled correctly in the latest version of eXtyles, please email it to [email protected], and we’ll be happy to try to add support for it.