Commit a1e30036 authored by Bronger, Torsten

Switch to message-queue-based approach

parent 8029f283
… publication. Then, this data is entered into the (institutional) research
data repository.
The pipe
--------
The *crawlers* find data publications in the wild and send them into Kafka.
The *curators* read them from there and enrich them with further metadata,
e.g. POF4 and institute. Then, they are sent to a different topic in Kafka.
From there, the *archivers* read them and put them into the data repository.
Through a third channel in Kafka, the archivers report to the curators which
publications have been successfully written.
The curators read from and write to a MongoDB collection called “findings”,
where the current state of each data publication is tracked.
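For illustration, a minimal sketch (in Python, using kafka-python) of how a
crawler could hand a finding to the curators.  The topic name
``findings-incoming``, the broker address, and all concrete values are
assumptions made for this sketch, not part of the actual setup::

    import json
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",                     # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode(),  # send JSON-encoded messages
    )

    finding = {
        "uris": ["https://doi.org/10.5281/zenodo.0000000"],  # placeholder PID
        "crawler": "zenodo",                                  # hypothetical extra field
    }
    producer.send("findings-incoming", finding)               # assumed topic name
    producer.flush()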
Architecture
------------
Database
........
At its heart, there is a MongoDB database “FindRD” with the following
collections:
findings
This is used only by the curators. It contains metadata of research data
publications. The only mandatory field is the prime PID (the primary key).
It contains publications from external sources, i.e. Zenodo, Pangaea,
datapub, and all text journals. It is ever growing, but existing documents
may be updated.
Amongst the fields are “retry_timestamp” (set to zero once the archiver
acknowledges the addition), a flag for whether the finding is a false positive
(i.e. not eligible for archiving), the timestamp (i.e. version) of the last
message to the archivers, etc.
conflicts
This is used only by the curators. It contains metadata documents just like
those in “findings”, but this collection does not have a primary key. It
stores findings that cannot be reliably mapped to one existing entry in
“findings” and therefore need human intervention. Typically, a human must
elect one entry in “findings” as the new prime, which must be reflected in
“aliases” as well. Every
conflict has a “retry_timestamp” after which it is re-fed into the incoming
Kafka channel.
aliases
This is used only by the curators. It maps alias PIDs to their prime PID,
and prime PIDs to themselves (see the sketch after this list).
emails
This is used only by the curators. It contains, for each email address, the
timestamp from which the next email may be sent to that address. It is meant
to prevent spamming people. The timestamp may be far in the future, which
effectively prevents mails to that person altogether.
crawlings
This is used only by the crawlers. It contains the timestamp of the last
crawling for each crawler.
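For illustration, a minimal sketch (in Python, using pymongo) of the prime-PID
lookup against the “aliases” collection mentioned above.  The connection URI
and the document field names ``alias`` and ``prime`` are assumptions::

    from pymongo import MongoClient

    db = MongoClient("mongodb://mongo:27017")["FindRD"]   # assumed connection URI

    def resolve_prime(pid: str) -> str | None:
        """Map a known PID (alias or prime) to its prime PID, or return None."""
        document = db["aliases"].find_one({"alias": pid})
        return document["prime"] if document else None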
JuSER crawler
.............
… them to the “findings” collection. They store their latest run in the
“crawlings” collection, so that the whole repository does not have to be
harvested every time.
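A sketch of how a crawler might read and update its entry in the “crawlings”
collection; the field names ``crawler`` and ``timestamp`` are assumptions::

    import datetime
    from pymongo import MongoClient

    db = MongoClient("mongodb://mongo:27017")["FindRD"]   # assumed connection URI

    def last_crawl(crawler_name: str) -> datetime.datetime | None:
        """Return the timestamp of the previous run, or None on the first run."""
        document = db["crawlings"].find_one({"crawler": crawler_name})
        return document["timestamp"] if document else None

    def record_crawl(crawler_name: str) -> None:
        """Store the timestamp of this crawler's latest run (upsert)."""
        db["crawlings"].update_one(
            {"crawler": crawler_name},
            {"$set": {"timestamp": datetime.datetime.now(datetime.timezone.utc)}},
            upsert=True,
        )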
Curator
.......
This is a deployment with possibly many replicas.
The curator takes all documents from the Kafka input stream.
It checks the PID/PIDs against the ``aliases`` collection. If no prime PID can
be found, it is created (avoiding races). If multiple PIDs in the input map to
at least two different prime PIDs, a mail is sent to the RDM team and the
document is stored in “conflicts” (see there).
If there is one prime PID, the document is merged with the existing one. If
there is no prime PID, the document is simply added to “findings”. In both
cases, the document is updated with a reasonable ``retry_timestamp`` and sent
through Kafka to the archiver.
There may be cases when the curator sets ``retry_timestamp`` but does not send
the document to the archiver. Instead, it sends a curation request mail to the
RDM team or to the authoring scientist.
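The following sketch (in Python) outlines the curator’s handling of a single
incoming finding as described above.  The helper callables and the concrete
retry delay are assumptions that only serve to make the flow explicit; the
curation-request branch is omitted::

    import datetime
    from typing import Callable, Optional

    RETRY_DELAY = datetime.timedelta(days=7)   # assumed; the text only asks for a "reasonable" value

    def curate(
        finding: dict,
        resolve_prime: Callable[[str], Optional[str]],          # lookup in "aliases"
        store_conflict: Callable[[dict], None],                 # insert into "conflicts"
        mail_rdm_team: Callable[[dict], None],
        upsert_finding: Callable[[dict, Optional[str]], dict],  # merge into or add to "findings"
        send_to_archiver: Callable[[dict], None],               # produce to the archiver topic
    ) -> None:
        primes = {resolve_prime(pid) for pid in finding["uris"]}
        primes.discard(None)

        if len(primes) > 1:
            # The PIDs map to at least two different prime PIDs:
            # store the finding in "conflicts" and alert the RDM team.
            store_conflict(finding)
            mail_rdm_team(finding)
            return

        prime = primes.pop() if primes else None   # None: a new prime PID gets created
        document = upsert_finding(finding, prime)
        document["retry_timestamp"] = (
            datetime.datetime.now(datetime.timezone.utc) + RETRY_DELAY
        ).timestamp()                              # stored as a Unix timestamp (assumption)
        send_to_archiver(document)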
Archiver
........
The archiver takes its input from Kafka and checks whether a record is suitable
for inclusion into Dataverse (is it an insert or an update? Are all mandatory
fields available?). If this is fine, the archiver applies the data to the data
repository and reports the success back through Kafka.
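A sketch of the archiver loop; the topic names, the set of mandatory fields,
and the ``apply_to_repository`` helper are assumptions::

    import json
    from kafka import KafkaConsumer, KafkaProducer  # kafka-python

    def apply_to_repository(record: dict) -> None:
        """Placeholder for the actual insert-or-update against the data repository."""
        ...

    consumer = KafkaConsumer(
        "findings-curated",                              # assumed topic name
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )

    MANDATORY_FIELDS = {"pid", "timestamp"}              # assumed set of mandatory fields

    for message in consumer:
        record = message.value
        if not MANDATORY_FIELDS <= record.keys():
            continue                                     # not (yet) suitable for the repository
        apply_to_repository(record)                      # insert or update in the repository
        producer.send(
            "archiver-acknowledgements",                 # assumed topic name
            {"pid": record["pid"], "timestamp": record["timestamp"]},
        )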
Janitor
.......
This is a cron job.
It scans for documents with a due ``retry_timestamp`` and re-feeds them into
Kafka just like a crawler.
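A sketch of that scan, assuming ``retry_timestamp`` is stored as a Unix
timestamp (zero meaning “acknowledged”) and that re-fed documents go to the
same topic the crawlers write to::

    import json, time
    from kafka import KafkaProducer
    from pymongo import MongoClient

    db = MongoClient("mongodb://mongo:27017")["FindRD"]   # assumed connection URI
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v, default=str).encode(),
    )

    due = db["findings"].find({
        "retry_timestamp": {"$gt": 0, "$lte": time.time()},   # due and not yet acknowledged
        "false_positive": {"$ne": True},
    })
    for document in due:
        document.pop("_id", None)                   # the ObjectId is not JSON-serialisable
        producer.send("findings-incoming", document)  # assumed topic name
    producer.flush()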
Documents in “findings”
-----------------------
The documents in the collection “findings” must have the following fields:
``uris``
This is a list of URIs/PIDs the data publication was found under on the
Internet.
``pid``
The prime PID for that entry. Before dealing with the “findings” collection,
the curator first checks in the “aliases” collection what the prime PID
actually is.
``timestamp``
The timestamp UUID of the last update. This is important to match
acknowledgements from the archiver correctly with the findings, i.e. to reset
the ``retry_timestamp`` of the correct document.
``dirty``
A boolean which is true at the start. If true, this record needs to be added
to or updated in the data repository.
``retry_timestamp``
A point in time when the record is to be re-evaluated to be sent to the
archivers. This is set when the record is being sent to the archivers. It
is set to zero if the archiver acknowledged the addition. Before the record
is re-sent, the curator checks whether further enrichment can be applied,
e.g. whether the author can now be found in the text publication database.
``false_positive``
If true, the curator ignores that entry for good.
Other fields are optional:
- ``institute``
- ``alias_pids``
- ``pof4_topic``
- ``email`` (of contact person)
- …
However, some of those fields might be required when adding a record to the
data repository.
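For illustration, a hypothetical “findings” document with the mandatory fields
and a few of the optional ones; every concrete value is a placeholder::

    finding = {
        # mandatory fields
        "pid": "10.5281/zenodo.0000000",                # prime PID (placeholder)
        "uris": [
            "https://doi.org/10.5281/zenodo.0000000",
            "https://zenodo.org/records/0000000",
        ],
        "timestamp": "00000000-0000-0000-0000-000000000000",  # timestamp UUID (placeholder)
        "dirty": True,
        "retry_timestamp": 1735689600,                  # Unix timestamp; 0 once acknowledged
        "false_positive": False,
        # some of the optional fields
        "institute": "Institute X",                     # placeholder
        "pof4_topic": "POF4 topic X",                   # placeholder
        "email": "contact@example.org",                 # contact person (placeholder)
    }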