Earlier in the year we wrote about the idea of an Open Knowledge Base (OKB) for the Netherlands. There has been quite a lot of activity since then:
- An international meeting to discuss some of the ramifications of building an OKB.
- A workshop held by SURF to explore the next steps for a Dutch version of an OKB.
- This workshop led to a business plan, which is being submitted to the rectors of the Dutch universities. The business plan sets out various issues to be tackled and requests the funding to do so.
- The publication of the Seven Guiding Principles for Research Information (v2.1).
- An informal project group experimenting with building a Dutch OKB.
This blog post relates to the last bullet point.
The blog post last year considered the OKB in terms of an ecosystem such as Wikidata. Without denying the strength of Wikidata, the discussion in the summer workshop pushed the concept of a PID graph, as an easier way of getting things set up.
There's a short introduction to PID graphs here, but some of the key points are:
- The approach does not create a database but a data graph.
- A graph consists of (usually) millions of statements, known as triples. Each statement has the form subject-predicate-object.
- PID stands for persistent identifier. The subject and object parts of a statement should each refer to a persistent identifier.
- In the scholarly communications landscape, persistent identifiers could be things like ORCID iDs (for authors), ROR IDs (for institutions) and DOIs (for the actual scholarly content).
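As a rough illustration of the points above, the sketch below holds two subject-predicate-object statements in plain Python and looks them up by pattern. The identifiers are invented placeholders and the predicate names (`hasAuthor`, `affiliatedWith`) are hypothetical; a real PID graph would use an agreed vocabulary and a proper triple store.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Placeholder identifiers, invented for illustration only.
ARTICLE = "https://doi.org/10.5555/example-article"
AUTHOR = "https://orcid.org/0000-0000-0000-0000"
INSTITUTION = "https://ror.org/00example0"

# Hypothetical predicate names; not taken from any specific vocabulary.
graph = {
    Triple(ARTICLE, "hasAuthor", AUTHOR),
    Triple(AUTHOR, "affiliatedWith", INSTITUTION),
}


def match(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )
```

Calling `match(graph, predicate="hasAuthor")` returns the single statement that links the article to its author, while `match(graph)` returns every statement in the graph.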
A Dutch version of a PID graph would work by importing metadata, including persistent identifiers, from each university's current research information system (CRIS).
Some data transformation would then need to take place, converting the CRIS data into the triple statements needed for the data graph.
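The transformation step might look something like the sketch below. The record layout (`doi`, `authors`, `orcid`, `ror`) and the predicate names are assumptions made for illustration; real CRIS exports are structured differently and vary per system. Only entries that carry a persistent identifier end up in a statement.

```python
def record_to_triples(record):
    """Convert one hypothetical CRIS publication record into triples."""
    triples = []
    doi = record.get("doi")
    if not doi:
        # Without a PID for the publication there is nothing to anchor on.
        return triples
    publication = "https://doi.org/" + doi
    for author in record.get("authors", []):
        orcid = author.get("orcid")
        if not orcid:
            continue  # authors without an ORCID iD are skipped in this sketch
        person = "https://orcid.org/" + orcid
        triples.append((publication, "hasAuthor", person))
        ror = author.get("ror")
        if ror:
            triples.append((person, "affiliatedWith", "https://ror.org/" + ror))
    return triples


# Invented example record; all values are placeholders.
example_record = {
    "doi": "10.5555/example-article",
    "authors": [
        {"name": "Author One", "orcid": "0000-0000-0000-0000", "ror": "00example0"},
        {"name": "Author Two"},  # no identifiers recorded
    ],
}

example_triples = record_to_triples(example_record)
```

For the example record this yields two statements: one linking the publication to the first author, and one linking that author to an institution. The second author is left out because no persistent identifier is available.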
There are various tools for the combined task of managing, querying and visualising the PID graph. At the moment, the project group is experimenting with GraphDB lite, but others deserve attention.
Once the statements from the different Dutch universities have been gathered into a data graph, new connections can be inferred from them.
Two use cases have been proposed by the group to demonstrate the added value of a national approach:
1. Disambiguation: link instances of the same author, institution, or publication that appear under different names, and distinguish between different authors, institutions, or publications that have similar names.
2. ‘Single version of the truth’: provide a holistic view of research-related metadata in a consistent and non-redundant form, eliminating discrepancies between institutional versions.
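The first use case can be sketched in a few lines, assuming each author entry already carries an ORCID iD. The names, sources and identifiers below are invented, and the grouping is a deliberately simple stand-in for whatever matching the project group ends up using.

```python
from collections import defaultdict

# Invented author entries as they might arrive from two institutions.
entries = [
    {"source": "University A", "name": "J. de Vries", "orcid": "0000-0000-0000-0001"},
    {"source": "University B", "name": "Jan de Vries", "orcid": "0000-0000-0000-0001"},
    {"source": "University B", "name": "J. de Vries", "orcid": "0000-0000-0000-0002"},
]


def group_names_by_pid(author_entries):
    """Collect every name variant recorded under each persistent identifier."""
    groups = defaultdict(set)
    for entry in author_entries:
        groups[entry["orcid"]].add(entry["name"])
    return dict(groups)


names_by_pid = group_names_by_pid(entries)
```

Here the two spellings recorded under the first identifier are linked as one person, while the identically named entry under the second identifier stays a separate person.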
Together, these two use cases provide a more reliable and complete foundation, upon which we can develop the PID graph.
These are obvious examples. The informal project group working on this will experiment with others. They are making use of test datasets from Erasmus, VU Amsterdam, Delft and Groningen CRIS systems.
If the business case is approved at the UNL meeting in December, the project will develop into a formal pilot and continue with the experimentation outlined above.
Members of the informal project group
- Maurice Vanderfeesten (VU Amsterdam)
- Nick Veenstra (RUG)
- Jeffrey Sweeney (Rotterdam School of Management)
- Tung Tung Chan (Erasmus)
- Clifford Tatum (SURF)
- Gül Akcaova (SURF)
- Darco Jansen (UNL)
- John Doove (SURF)
- Wim Hugo (DANS)
- Armand Guicherit (TU Delft)
- Alastair Dunning (TU Delft)