Kingfisher Collect

Read the Kingfisher Collect documentation, which covers general usage.

Note

Is the service unresponsive or returning errors? Follow these instructions.

Note

The collect user owns the deployment of Kingfisher Collect. Only automated scripts and system administrators should delete data and log files.

Review a new publication

  1. Create an issue to request a new spider in the kingfisher-collect repository.

  2. Schedule a crawl, once the spider is written and deployed.

  3. Wait for the crawl to finish.

  4. Review the crawl’s log file.

  5. Review the data.

Access Scrapyd’s web interface

One-time setup

Request a username and password from James or Yohanna. (They will add a key-value pair under the apache.sites.kingfisher-collect.htpasswd key in the pillar/private/kingfisher_main.sls file.)
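
For reference, the dotted key corresponds to nested YAML in the pillar file. A hypothetical sketch of the entry (the placeholder username and the password format are assumptions, not the actual file’s contents):

apache:
  sites:
    kingfisher-collect:
      htpasswd:
        yourname: yourpassword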

Open https://collect.kingfisher.open-contracting.org to view the statuses and logs of crawls.

Collect data with Kingfisher Collect

One-time setup

Create a ~/.netrc file, using the same credentials as Access Scrapyd’s web interface.
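
The file uses the standard netrc format. For example, replacing USERNAME and PASSWORD with the credentials you were given:

machine collect.kingfisher.open-contracting.org
login USERNAME
password PASSWORD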

First, read this section of the Kingfisher Collect documentation.

To schedule a crawl, replace spider_name with a spider’s name and NAME with your name (you can edit the note any way you like), and run, from your computer:

$ curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."
{"node_name": "ocp04", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}

By default, Kingfisher Collect instructs Kingfisher Process to only create compiled releases. If you need something different (see the example after this list):

  • Also perform structural checks (slow), by adding -d steps=check,compile to the command

  • Only perform structural checks (slow), by adding -d steps=check to the command

  • Skip both, by adding -d steps= to the command
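
For example, to also perform structural checks, the scheduling command from above gains the extra parameter (spider_name and NAME as before):

$ curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME." -d steps=check,compile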

To cancel a crawl, replace JOBID with the job ID from the response or from Scrapyd’s jobs page:

$ curl -n https://collect.kingfisher.open-contracting.org/cancel.json -d project=kingfisher -d job=JOBID
{"node_name": "ocp04", "status": "ok", "prevstate": "running"}

The crawl won’t stop immediately. You can force an unclean shutdown by sending the request again; however, it’s preferable to let the crawl stop gracefully, so that the log file is complete.

Update spiders in Kingfisher Collect

One-time setup

Create a ~/.netrc file, using the same credentials as Access Scrapyd’s web interface. Then, create a ~/.config/scrapy.cfg file, and set the url variable to https://collect.kingfisher.open-contracting.org/.
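
A minimal sketch of the ~/.config/scrapy.cfg file, assuming the deploy target is named kingfisher to match the scrapyd-deploy kingfisher command below (only the url value is prescribed by these instructions):

[deploy:kingfisher]
url = https://collect.kingfisher.open-contracting.org/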

  1. Change to the directory containing your local repository.

  2. Ensure your local repository and the GitHub repository are in sync:

    git checkout main
    git remote update
    git status
    

    The output should be exactly:

    On branch main
    Your branch is up to date with 'origin/main'.
    
    nothing to commit, working tree clean
    
  3. Activate a virtual environment in which scrapyd-client is installed, and deploy the spiders:

    scrapyd-deploy kingfisher
    

Access Scrapy’s crawl logs

If using a browser, either:

  • Click a crawl’s “Log” link on Scrapyd’s jobs page.

  • Browse to the crawl’s log file from Scrapyd’s logs listing.

If using the command line:

  1. Connect to the data support server.

  2. Change to the logs directory for the kingfisher project:

    cd ~collect/scrapyd/logs/kingfisher
    

Scrapy statistics are extracted from the end of each log file every hour, on the hour, into a new file ending in .log.stats in the same directory as the log file. Access it as above, via the command-line sketch after this list, or, from the jobs page:

  • Right-click on a “Log” link.

  • Select “Copy Link” or similar.

  • Paste the URL into the address bar.

  • Change .log at the end of the URL to .log.stats and press Enter.
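
Alternatively, from the command line, assuming Scrapyd’s usual /logs/<project>/<spider>/ URL layout (replace spider_name and JOBID):

$ curl -n https://collect.kingfisher.open-contracting.org/logs/kingfisher/spider_name/JOBID.log.stats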

If you can’t wait for the statistics to be extracted, connect to the server, replace spider_name and alpha-numeric-string in the path below, and run:

tac /home/collect/scrapyd/logs/kingfisher/spider_name/alpha-numeric-string.log | grep -B99 statscollectors | tac

If you are frequently running the above, create an issue to change the schedule.

Tip

The log file is named after the job’s ID, like 7df53218f37a11eb80dd0c9d92c523cb.log. If a crawl no longer appears on the jobs page, it can be difficult to find the crawl’s log file, because its filename is opaque. To address this, Kingfisher Collect writes the job’s ID to a scrapyd-job.txt file in the crawl’s directory. So, the log file will be at, for example:

cd ~collect/scrapyd
less logs/kingfisher/colombia/$(cat data/colombia/20210708_212020/scrapyd-job.txt).log

Data retention policy

On the first day of each month, the following are deleted (see the illustrative command after this list):

  • Crawl logs older than 90 days

  • Crawl directories containing exclusively files older than 90 days
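
The deletion is performed by automated scripts (see the note at the top of this page). Purely as an illustration of the first rule, a hypothetical command along these lines would list crawl logs older than 90 days; it is not the actual script:

find ~collect/scrapyd/logs/kingfisher -name '*.log' -mtime +90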