Commit c47a41c6 authored by Bronger, Torsten

Switch to MongoDB-based architecture

parent a7f14da7
@@ -13,31 +13,90 @@ repository.
Architecture
------------
Database
........
At its heart, there is a MongoDB with the following collections:

source
   This contains metadata of research data publications. The only mandatory
   field is the PID. It contains publications from external sources,
   i.e. Zenodo, Pangaea, datapub, and all text journals. It is ever growing.
destination
   This contains metadata of research data publications. The only mandatory
   field is the PID. It contains publications from our institutional Dataverse.
   It is ever growing.
aliases
   This maps PID aliases to their prime PID. Only prime PIDs occur in the
   MongoDB documents of “source” and “destination”.
crawling
   This contains the timestamp of the last crawl for each crawler.
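
A minimal sketch of how these collections could be set up and filled with
pymongo; the database name, the indexes, and all field names apart from the PID
are assumptions::

   # Sketch only: the database name "findrd" and the example fields are made up.
   from datetime import datetime, timezone

   from pymongo import ASCENDING, MongoClient

   client = MongoClient("mongodb://localhost:27017")
   db = client["findrd"]

   # "source" and "destination": the PID is the only mandatory field.
   db["source"].create_index([("pid", ASCENDING)], unique=True)
   db["destination"].create_index([("pid", ASCENDING)], unique=True)
   # "aliases": maps alias PIDs to their prime PID.
   db["aliases"].create_index([("alias", ASCENDING)], unique=True)

   db["source"].insert_one({
       "pid": "10.5281/zenodo.1234567",      # illustrative PID
       "title": "Example research data set",
       "harvested_from": "Zenodo",
   })
   db["aliases"].insert_one({
       "alias": "10.5281/zenodo.1234568",    # e.g. a version DOI
       "prime_pid": "10.5281/zenodo.1234567",
   })
   db["crawling"].insert_one({
       "crawler": "zenodo",
       "last_run": datetime.now(timezone.utc),
   })
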
JuSER crawler
.............
This is a cron job.
This crawler fetches all eligible DOIs from the text publication repository.
At the same time, it creates a “text DOI” → “data PID” mapping from the data
repository. It then contacts a scraper (see below) which returns the
bibliographic metadata for the accompanying research data. In particular, it
returns a PID for the research data if one can be found. Finally, it checks
whether the research data should be added to the data repository and, if so,
adds it.
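
A hypothetical sketch of the crawler’s main loop; the helper functions, the
scraper URL, and the response format are assumptions, not the real interface::

   import requests

   SCRAPER_URL = "http://scraper:8080/scrape"   # assumed internal service URL


   def fetch_eligible_dois() -> list[str]:
       """Fetch all eligible DOIs from the text publication repository (stub)."""
       raise NotImplementedError


   def add_to_data_repository(metadata: dict) -> None:
       """Add the research data to the data repository (stub)."""
       raise NotImplementedError


   def run_crawler(text_doi_to_data_pid: dict[str, str]) -> None:
       """text_doi_to_data_pid is the mapping built from the data repository."""
       known_data_pids = set(text_doi_to_data_pid.values())
       for doi in fetch_eligible_dois():
           # The scraper resolves the DOI and returns metadata of the
           # accompanying research data, including its PID if one was found.
           response = requests.get(SCRAPER_URL, params={"doi": doi}, timeout=600)
           response.raise_for_status()
           metadata = response.json()
           data_pid = metadata.get("pid")
           if data_pid and data_pid not in known_data_pids:
               add_to_data_repository(metadata)
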
Scraper
.......
This is a deployment.
The scraper is stateless and works as “cattle”. Each instance listens for HTTP
GET requests that take a DOI as their parameter. The DOI is then resolved, and
the landing page is searched for metadata about the accompanying research data.
The result is returned in the same (possibly quite long-lasting) HTTP request.

The scraper has to resolve the DOI to a landing page. For this, a Redis
instance is used as a cache.
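
A minimal sketch of one scraper instance, assuming Flask for the HTTP part and
redis-py for the cache; the route, the parameter name, and the cache TTL are
assumptions::

   import redis
   import requests
   from flask import Flask, jsonify, request

   app = Flask(__name__)
   cache = redis.Redis(host="redis", port=6379, decode_responses=True)


   def resolve_doi(doi: str) -> str:
       """Resolve a DOI to its landing page URL, using Redis as a cache."""
       landing_page = cache.get(f"doi:{doi}")
       if landing_page is None:
           response = requests.get(f"https://doi.org/{doi}",
                                   allow_redirects=True, timeout=60)
           landing_page = response.url
           cache.setex(f"doi:{doi}", 24 * 3600, landing_page)
       return landing_page


   @app.route("/scrape")
   def scrape():
       doi = request.args["doi"]
       landing_page_url = resolve_doi(doi)
       # Searching the landing page for metadata about the research data is
       # the actual scraping work and is elided in this sketch.
       metadata = {"doi": doi, "landing_page": landing_page_url, "pid": None}
       return jsonify(metadata)

Because each instance only talks to Redis and the public DOI resolver, any
number of replicas can answer such requests in parallel.
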
Other crawlers
..............
These are cron jobs.
Other crawlers are more straightforward. They harvest repositories and add the
results to the “source” collection. They store the timestamp of their latest
run in the “crawling” collection *after* the run, so that nothing is lost if a
run fails, while avoiding that the whole repository is harvested every time.
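
This bookkeeping pattern could look roughly like the following sketch; the
``harvest`` callable and the field names are assumptions::

   from datetime import datetime, timezone

   from pymongo import MongoClient

   db = MongoClient("mongodb://localhost:27017")["findrd"]


   def run_harvest(crawler_name: str, harvest) -> None:
       """harvest(since=...) yields the records of one repository (stubbed here)."""
       entry = db["crawling"].find_one({"crawler": crawler_name})
       last_run = entry["last_run"] if entry else None
       started = datetime.now(timezone.utc)

       for record in harvest(since=last_run):
           db["source"].update_one(
               {"pid": record["pid"]}, {"$set": record}, upsert=True
           )

       # Only now, after a successful run, is the timestamp updated; a crashed
       # run is simply repeated the next time, so nothing is lost.
       db["crawling"].update_one(
           {"crawler": crawler_name},
           {"$set": {"last_run": started}},
           upsert=True,
       )
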
Dataverse crawler
.................
This is a cron job.
This crawler is special because it adds records to the “destination”
collection.
Steward
.......
This is a cron job.
The steward takes all PIDs from “source” that are not in “destination” and does
something about each of them (sketched below), e.g.:
- adds it to Dataverse
- sends an email to the authoring scientist
- sends an email to the FDM team
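
A sketch of the steward’s core loop; the handler is merely a placeholder for
the actions listed above::

   from pymongo import MongoClient

   db = MongoClient("mongodb://localhost:27017")["findrd"]


   def missing_pids() -> set[str]:
       """PIDs that occur in "source" but not in "destination"."""
       source_pids = {doc["pid"] for doc in db["source"].find({}, {"pid": 1})}
       destination_pids = {doc["pid"]
                           for doc in db["destination"].find({}, {"pid": 1})}
       return source_pids - destination_pids


   def handle_missing_publication(pid: str) -> None:
       """Placeholder: add to Dataverse, or notify the scientist or FDM team."""
       raise NotImplementedError


   def run_steward() -> None:
       for pid in sorted(missing_pids()):
           handle_missing_publication(pid)
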
Programming language
--------------------
Candidates are Python and Go. I will start with Python for both the crawler and
the scraper because I don’t need runtime efficiency. However, it may turn out
that Python’s concurrency model (async/await) is too awkward for the crawler.
Besides, I need Python’s ease of string processing only for the scraper.
.. LocalWords: FindRD