Kingfisher Collect#
Read the Kingfisher Collect documentation, which covers general usage.
Note
Is the service unresponsive or erroring? Follow these instructions.
Note
The collect user owns the deployment of Kingfisher Collect. Only automated scripts and system administrators should delete data and log files.
Review a new publication#
Create an issue to request a new spider in the kingfisher-collect repository.
Schedule a crawl, once the spider is written and deployed.
Review the data.
Access Scrapyd’s web interface#
One-time setup
Request a username and password from James or Yohanna. (They will add a key-value pair under the apache.sites.kingfisher-collect.htpasswd key in the pillar/private/kingfisher_main.sls file.)
Open https://collect.kingfisher.open-contracting.org to view the statuses and logs of crawls.
Collect data with Kingfisher Collect#
One-time setup
Create a ~/.netrc file, using the same credentials as Access Scrapyd’s web interface.
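For reference, a minimal ~/.netrc entry for this host looks like the following, with USERNAME and PASSWORD replaced by your credentials:
machine collect.kingfisher.open-contracting.org
login USERNAME
password PASSWORD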
First, read this section of the Kingfisher Collect documentation.
To schedule a crawl, replace spider_name with a spider’s name and NAME with your name (you can edit the note any way you like), and run, from your computer:
$ curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."
{"node_name": "ocp04", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
Kingfisher Collect, by default, instructs Kingfisher Process to run structural checks (slow) and create compiled releases. If you do not need to perform these steps:
Skip structural checks, by adding -d steps=compile to the command
Skip compiled releases, by adding -d steps=check to the command
Skip both, by adding -d steps= to the command
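For example, the schedule request above with both steps skipped:
$ curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME." -d steps=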
To cancel a crawl, replace JOBID with the job ID from the response or from Scrapyd’s jobs page:
$ curl -n https://collect.kingfisher.open-contracting.org/cancel.json -d project=kingfisher -d job=JOBID
{"node_name": "ocp04", "status": "ok", "prevstate": "running"}
The crawl won’t stop immediately. You can force an unclean shutdown by sending the request again; however, it’s preferred to allow the crawl to stop gracefully, so that the log file is completed.
Update spiders in Kingfisher Collect#
One-time setup
Create a ~/.netrc file, using the same credentials as Access Scrapyd’s web interface. Then, create a ~/.config/scrapy.cfg file, and set the url variable to https://collect.kingfisher.open-contracting.org/.
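A minimal sketch of that scrapy.cfg file, assuming the deploy target is named kingfisher to match the scrapyd-deploy kingfisher command below (the section name is an assumption based on scrapyd-client conventions):
# deploy target used by "scrapyd-deploy kingfisher" (section name is an assumption)
[deploy:kingfisher]
url = https://collect.kingfisher.open-contracting.org/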
Change to the directory containing your local repository.
Ensure your local repository and the GitHub repository are in sync:
git checkout main
git remote update
git status
The output should be exactly:
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
Activate a virtual environment in which scrapyd-client is installed, and deploy the spiders:
scrapyd-deploy kingfisher
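For example, if the virtual environment lives in a .ve directory inside the repository (the path is an assumption; adjust it to wherever scrapyd-client is installed):
# activate the virtual environment, then deploy the spiders to the kingfisher target
. .ve/bin/activate
scrapyd-deploy kingfisher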
Access Scrapy’s crawl logs#
See also
If using a browser, either:
Click on a “Log” link from Scrapyd’s jobs page.
Open the logs page for the kingfisher project.
If using the command-line:
Change to the logs directory for the kingfisher project:
cd ~collect/scrapyd/logs/kingfisher
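From there, you can list a spider’s log files, newest first. For example, for the colombia spider (any spider directory works the same way):
# list the most recent log files for one spider
ls -lt colombia | head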
Scrapy statistics are extracted from the end of each log file every hour on the hour, into a new file ending in .log.stats in the same directory as the log file. Access as above, or, from the jobs page:
Right-click on a “Log” link.
Select “Copy Link” or similar.
Paste the URL into the address bar.
Change .log at the end of the URL to .log.stats and press Enter.
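The same file can also be fetched from the command line, using the ~/.netrc credentials, assuming the URL copied from the jobs page follows Scrapyd’s usual /logs/kingfisher/spider_name/JOBID.log pattern:
$ curl -n https://collect.kingfisher.open-contracting.org/logs/kingfisher/spider_name/JOBID.log.stats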
If you can’t wait for the statistics to be extracted, you can connect to the server, replace spider_name/alpha-numeric-string, and run:
tac /home/collect/scrapyd/logs/kingfisher/spider_name/alpha-numeric-string.log | grep -B99 statscollectors | tac
If you are frequently running the above, create an issue to change the schedule.
Tip
The log file is named after the job’s ID, like 7df53218f37a11eb80dd0c9d92c523cb.log. If a crawl no longer appears on the jobs page, it can be difficult to find the crawl’s log file, because its filename is opaque. To address this, Kingfisher Collect writes the job’s ID to a scrapyd-job.txt file in the crawl’s directory. So, the log file will be at, for example:
cd ~collect/scrapyd
less logs/kingfisher/colombia/$(cat data/colombia/20210708_212020/scrapyd-job.txt).log
Data retention policy#
On the first day of each month, the following are deleted:
Crawl logs older than 90 days
Crawl directories containing exclusively files older than 90 days
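As a rough illustration of the first rule only (not the actual scheduled job), the effect is similar to running:
# delete crawl logs not modified in the last 90 days (illustration of the policy, not the real script)
find ~collect/scrapyd/logs/kingfisher -name '*.log' -mtime +90 -delete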