FindRD
======

Purpose
-------

FindRD crawls through an (institutional) text publication repository and visits
the respective journal’s website to find the corresponding research data for
each text publication.  This data is then entered into the (institutional)
research data repository.


Architecture
------------

Database
........

At its heart, there is a MongoDB database “FindRD” with the following collections:

findings
  This contains metadata of research data publications.  The only mandatory
  field is the PID.  It contains publications from external sources,
  i.e. Zenodo, Pangaea, datapub, and all text journals.  It is ever growing.

items
  This contains metadata of research data publications.  The only mandatory
  field is the PID.  It contains publications from our institutional Dataverse.
  It is ever growing.

aliases
  This maps alias PIDs to their prime PID.  The MongoDB documents in
  “findings” and “items” contain only prime PIDs.

crawlings
  This contains the timestamp of the last crawl for each crawler.
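
The collections above might look roughly like the following sketch.  All field
values here are hypothetical examples (only ``pid`` is mandatory, see below);
in production the documents live in MongoDB, e.g. accessed via pymongo.

```python
from datetime import datetime, timezone

# Hypothetical example documents for the four collections.
findings_doc = {
    "pid": "doi:10.5281/zenodo.1234567",   # mandatory prime PID
    "aliases": ["hdl:21.11101/abc"],       # optional alias PIDs
}
items_doc = {
    "pid": "doi:10.26165/JUELICH-DATA/XYZ",  # record in the institutional Dataverse
}
aliases_doc = {
    "alias": "hdl:21.11101/abc",
    "prime": "doi:10.5281/zenodo.1234567",
}
crawlings_doc = {
    "crawler": "zenodo",
    "last_run": datetime(2024, 1, 1, tzinfo=timezone.utc),
}

def prime_pid(pid, alias_map):
    """Map an alias PID to its prime PID; prime PIDs map to themselves."""
    return alias_map.get(pid, pid)

# The “aliases” collection, flattened to a dict for this sketch.
alias_map = {aliases_doc["alias"]: aliases_doc["prime"]}
```

Because “findings” and “items” hold only prime PIDs, every incoming PID is
first normalised through the alias mapping before any lookup.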

JuSER crawler
.............

This is a cron job.

This crawler fetches all eligible DOIs from the text publication repository.

At the same time, it creates a “text DOI” → “data PID” mapping from the data
repository.

It then contacts a scraper (see below) which returns the bibliographic metadata
for the accompanying research data.  In particular, it returns a PID for the
research data if it can be found.  Finally, it checks whether the research data
should be added to the data repository and if it should, it does so.
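
The core loop of the crawler could be sketched as follows.  This is a minimal
sketch; the function and field names are hypothetical, and the scraper call is
reduced to a plain callable.

```python
def run_juser_crawler(text_dois, known_data_pids, scrape):
    """For each text DOI, ask the scraper for research-data metadata and
    collect the records that still have to be added to the data repository.

    text_dois       -- eligible DOIs from the text publication repository
    known_data_pids -- data PIDs already present in the data repository
    scrape          -- callable: text DOI -> metadata dict with a "pid" key,
                       or None if no research data was found
    """
    to_add = []
    for doi in text_dois:
        metadata = scrape(doi)
        if metadata is None:
            continue          # no research data found for this publication
        if metadata["pid"] in known_data_pids:
            continue          # already in the data repository
        to_add.append(metadata)
    return to_add
```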


Scraper
,,,,,,,

This is a deployment.

The scraper is stateless (“cattle”, not “pets”).  Each instance listens for
HTTP GET requests that carry a DOI as a parameter.  The DOI is resolved, and
the landing page is searched for information about the research data.  The
result is returned in the same (possibly quite long-running) HTTP request.
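
A minimal sketch of such an instance, assuming the DOI arrives as a ``doi``
query parameter; ``scrape_landing_page`` is a hypothetical placeholder for the
actual scraping work.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def scrape_landing_page(doi):
    """Hypothetical: resolve the DOI, scrape the landing page, and return
    bibliographic metadata for the accompanying research data."""
    return {"doi": doi, "pid": None}   # placeholder result

class ScraperHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        doi = query.get("doi", [None])[0]
        if doi is None:
            self.send_error(400, "missing ?doi= parameter")
            return
        # This call may take a long time; the client keeps the request open.
        result = scrape_landing_page(doi)
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run one instance (blocks forever):
#     HTTPServer(("", 8000), ScraperHandler).serve_forever()
```

Since no state is kept between requests, any number of these instances can run
behind a load balancer.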

The scraper has to resolve the DOI to a landing page.  For this, a Redis
instance is used as a cache.
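
The cache lookup could follow the usual get-or-resolve pattern, sketched here
with a plain in-memory stand-in for Redis; with the real client, the
``get``/``setex`` calls would go to a ``redis.Redis`` connection, and the TTL
is an assumed value, not from this document.

```python
CACHE_TTL = 7 * 24 * 3600  # seconds; assumed expiry for cached resolutions

def resolve_doi(doi, cache, resolve):
    """Return the landing-page URL for a DOI, consulting the cache first.

    cache   -- object with get(key) and setex(key, ttl, value), like a Redis client
    resolve -- callable that actually follows the DOI redirect (slow)
    """
    url = cache.get(doi)
    if url is not None:
        return url
    url = resolve(doi)
    cache.setex(doi, CACHE_TTL, url)
    return url

class DictCache:
    """In-memory stand-in for Redis (ignores the TTL)."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def setex(self, key, ttl, value):
        self.data[key] = value
```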


Other crawlers
..............

These are cron jobs.

Other crawlers are more straightforward.  They harvest repositories and add
the records to the “findings” collection.  They store the timestamp of their
latest run in the “crawlings” collection *after* the run has finished, so that
nothing is lost even though the whole repository is not harvested every time.
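
The “harvest since the last run, commit the timestamp only afterwards” logic
could look like this; the collection accesses are reduced to hypothetical
callables for the sketch.

```python
from datetime import datetime, timezone

def run_incremental_crawler(name, get_last_run, set_last_run, harvest, store):
    """Harvest everything published since the last recorded run, then record
    the new timestamp.  Writing the timestamp *after* the harvest means a
    crash re-harvests some records (harmless) instead of losing some (not).
    """
    started = datetime.now(timezone.utc)
    since = get_last_run(name)          # None on the very first run
    for record in harvest(since):
        store(record)
    set_last_run(name, started)         # only committed after a full run
```

Recording the *start* time rather than the end time also covers records that
were published while the crawl was running.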


Dataverse crawler
.................

This is a cron job.

This crawler is special because it adds records to the “items” collection.


Steward
.......

This is a cron job.

The steward takes all PIDs from “findings” that are not in “items” and does
something about each of them, e.g.:

- adds it to Dataverse
- sends an email to the authoring scientist
- sends an email to the FDM team
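
The steward's selection step is essentially a set difference over prime PIDs.
A minimal sketch, with the action dispatch left out:

```python
def pids_needing_action(findings_pids, items_pids):
    """Return the prime PIDs known from external sources ("findings") but
    missing from the institutional Dataverse ("items"), in a stable order."""
    missing = set(findings_pids) - set(items_pids)
    return sorted(missing)
```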


Documents
---------

The documents in the collections “findings” and “items” must have a “``pid``”
field.  This is the prime PID.  The rest is optional:

- ``institute``
- ``pof4_topic``
- ``aliases`` (alias PIDs)
- ``email`` (of contact person)
- …

..  LocalWords:  FindRD