Commit 779b5d78 authored by Torsten Bronger

Extend README

parent 88d32022

@@ -8,3 +8,36 @@
FindRD crawls through an (institutional) text publication repository and visits
the respective journal’s website to find the corresponding data for each text
publication. Then, this data is entered into the (institutional) research data
repository.

Architecture
------------

Crawler
.......

The crawler is a single-instance program that fetches all eligible DOIs from
the text publication repository. It then contacts a scraper (see below),
which returns the bibliographic metadata for the accompanying research data.
In particular, the scraper returns a PID for the research data if one can be
found. Finally, the crawler checks whether the research data should be added
to the data repository and, if so, adds it.
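
For illustration, the crawler’s main loop might look like the following
Python sketch.  The repository API, the scraper address, and the helper
functions are assumptions of mine, not part of the project::

    import requests

    SCRAPER_URL = "http://localhost:8000/"     # hypothetical scraper instance
    REPO_API = "https://repo.example.org/api"  # hypothetical repository API

    def fetch_eligible_dois():
        """Fetches all eligible DOIs from the text publication repository
        (hypothetical endpoint and response format)."""
        response = requests.get(REPO_API + "/eligible_dois", timeout=30)
        response.raise_for_status()
        return response.json()

    def should_be_added(metadata):
        """Decides whether the research data belongs in the data repository;
        here simply: the scraper found a PID."""
        return bool(metadata.get("pid"))

    def add_to_data_repository(metadata):
        """Enters the research data into the (institutional) research data
        repository (hypothetical endpoint)."""
        requests.post(REPO_API + "/research_data", json=metadata,
                      timeout=30).raise_for_status()

    for doi in fetch_eligible_dois():
        # Ask a scraper for the bibliographic metadata of the accompanying
        # research data; this request may be quite long-lasting.
        metadata = requests.get(SCRAPER_URL, params={"doi": doi},
                                timeout=600).json()
        if should_be_added(metadata):
            add_to_data_repository(metadata)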

Scraper
.......

The scraper is stateless and run as “cattle”: interchangeable instances that
can be created and destroyed at will. Each instance listens for HTTP GET
requests that carry a DOI as a parameter. The DOI is resolved, and the
landing page is searched for data about the research data. The result is then
returned in the same (possibly quite long-lasting) HTTP request.
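
A minimal scraper instance might look like this sketch, which uses only the
standard library; the actual extraction of metadata from the landing page is
left as a placeholder::

    import json
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    def scrape_landing_page(doi):
        """Resolves the DOI and searches the landing page for data about the
        research data.  Placeholder: returns no PID yet."""
        with urllib.request.urlopen("https://doi.org/" + doi) as response:
            landing_page = response.read().decode(errors="replace")
            # … search landing_page for a data PID and bibliographic metadata …
            return {"doi": doi, "landing_page": response.url, "pid": None}

    class ScraperHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The DOI is passed as a query parameter, e.g. /?doi=10.1000/xyz
            query = parse_qs(urlparse(self.path).query)
            doi = query.get("doi", [None])[0]
            if doi is None:
                self.send_error(400, "missing ?doi= parameter")
                return
            body = json.dumps(scrape_landing_page(doi)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), ScraperHandler).serve_forever()

Because the handler keeps no state between requests, any number of such
instances can run in parallel behind a load balancer.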

Programming language
--------------------

Candidates are Python and Go. I will start with Python for both crawler and
scraper because I don’t need runtime efficiency. However, it may turn out
that Python’s concurrency model (async/await) is too awkward for the crawler.
Besides, I need Python’s ease of string processing only for the scraper,
which makes the crawler the more likely candidate for a later rewrite in Go.
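
To make that concern concrete, a concurrent fan-out over DOIs in this model
might look like the following sketch; aiohttp and the scraper address are my
own choices for illustration::

    import asyncio
    import aiohttp

    SCRAPER_URL = "http://localhost:8000/"  # hypothetical scraper instance

    async def scrape(session, semaphore, doi):
        # Limit concurrency so that the scrapers are not overwhelmed.
        async with semaphore:
            async with session.get(SCRAPER_URL, params={"doi": doi}) as response:
                return await response.json()

    async def crawl(dois):
        semaphore = asyncio.Semaphore(10)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(scrape(session, semaphore, doi) for doi in dois))

    results = asyncio.run(crawl(["10.1000/xyz"]))  # hypothetical DOI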