Kingfisher Collect#

Read the Kingfisher Collect documentation, which covers general usage.


Is the service unresponsive or erroring? Follow these instructions.

Review a new publication#

  1. Create an issue in the kingfisher-collect repository.

  2. Schedule a crawl, once the spider is written and deployed.

  3. Wait for the crawl to finish.

  4. Review the crawl’s log file.

  5. Review the data.

Access Scrapyd’s web interface#

One-time setup

Request a username and password from James or Yohanna. (They will add a key-value pair under the apache.sites.ocdskingfisherscrape.htpasswd key in the pillar/private/kingfisher_process.sls file.)

Open to view the statuses and logs of crawls.

Collect data with Kingfisher Collect#

One-time setup

Create a ~/.netrc file.

First, read this section of the Kingfisher Collect documentation.

To schedule a crawl, replace spider_name with the spider’s name and NAME with your name (you can word the note however you like), then run the following from your computer:

curl -n -d project=kingfisher -d spider=spider_name -d note="Started by NAME."

You should see a response like:

{"node_name": "process1", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
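When scripting around the schedule request, it can help to capture the job ID so that it can be reused in the cancel request. The following is a minimal sketch, assuming the response has the shape shown above. The response variable is a hard-coded copy of that example rather than live output, and the sed pattern is only an illustration, not part of Kingfisher Collect or Scrapyd.

```shell
# Hard-coded copy of the example response (an assumption for illustration;
# in practice this would be the output of the curl command).
response='{"node_name": "process1", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}'

# Extract the value of the "jobid" key using only sed.
jobid=$(printf '%s\n' "$response" | sed -n 's/.*"jobid": *"\([^"]*\)".*/\1/p')

echo "$jobid"
```

If jq is installed, piping the response through jq -r .jobid is a more robust way to do the same thing.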

To cancel a crawl, replace JOBID with the job ID from the response or from Scrapyd’s jobs page:

curl -n -d project=kingfisher -d job=JOBID

You should see a response like:

{"node_name": "process1", "status": "ok", "prevstate": "running"}

The crawl won’t stop immediately. You can force an unclean shutdown by sending the request again; however, it’s preferred to allow the crawl to stop gracefully, so that the log file is completed.

Update spiders in Kingfisher Collect#

One-time setup

Create a ~/.netrc file. Then, create a ~/.config/scrapy.cfg file, and set the url variable to

  1. Change to the directory containing your local repository.

  2. Ensure your local repository and the GitHub repository are in sync:

    git checkout main
    git remote update
    git status

    The output should be exactly:

    On branch main
    Your branch is up to date with 'origin/main'.
    nothing to commit, working tree clean

  3. Activate a virtual environment in which scrapyd-client is installed, and deploy the spiders:

    scrapyd-deploy kingfisher
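The sync check in step 2 can also be scripted, which is convenient if you deploy often. The following is a sketch only: repo_in_sync is a hypothetical helper name, and it approximates the expected git status output with three separate checks rather than comparing the text. It does not contact the remote, so run git remote update first.

```shell
# Hypothetical helper: succeeds only if the current repository is on main,
# has no uncommitted or untracked changes, and HEAD matches origin/main.
repo_in_sync() {
    # Check 1: the current branch is main.
    [ "$(git rev-parse --abbrev-ref HEAD 2>/dev/null)" = "main" ] || return 1
    # Check 2: the working tree is clean (no output from --porcelain).
    [ -z "$(git status --porcelain 2>/dev/null)" ] || return 1
    # Check 3: the local commit equals the last-fetched origin/main commit.
    [ "$(git rev-parse HEAD 2>/dev/null)" = "$(git rev-parse --verify -q origin/main)" ] || return 1
}

if repo_in_sync; then
    echo "in sync"
else
    echo "not in sync"
fi
```

A possible use is to guard the deploy step, for example repo_in_sync && scrapyd-deploy kingfisher.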

Access Scrapyd’s crawl logs#

From a browser, click on a “Log” link from Scrapyd’s jobs page, or open the logs page for the kingfisher project.

From the command-line, connect to the server, and change to the logs directory for the kingfisher project:

curl --silent --connect-timeout 1 || true
cd scrapyd/logs/kingfisher

Every hour, on the hour, Scrapy statistics are extracted from the end of each log file into a new file ending in .log.stats, in the same directory as the log file. Access these files as above or, from the jobs page:

  • Right-click on a “Log” link.

  • Select “Copy Link” or similar.

  • Paste the URL into the address bar.

  • Change .log at the end of the URL to .log.stats and press Enter.

If you can’t wait for the statistics to be extracted, you can connect to the server, replace spider_name/alpha-numeric-string with the spider’s name and the log file’s name, and run:

tac /home/ocdskfs/scrapyd/logs/kingfisher/spider_name/alpha-numeric-string.log | grep -B99 statscollectors | tac
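To see what this pipeline does, it can be tried on a small stand-in file. The first tac reverses the log, grep -B99 keeps the statscollectors line plus up to 99 lines before it in the reversed stream (that is, the lines that follow it in the original), and the second tac restores the original order. The log lines below are synthetic and abbreviated, written only for this sketch, and tac is assumed to be available (it is part of GNU coreutils).

```shell
# Synthetic, abbreviated log file (illustrative content, not real crawl output).
log=$(mktemp)
cat > "$log" <<'EOF'
2021-07-08 21:20:20 [scrapy.core.engine] INFO: Spider opened
2021-07-08 21:25:00 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-08 21:25:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 3,
 'finish_reason': 'finished'}
2021-07-08 21:25:00 [scrapy.core.engine] INFO: Spider closed (finished)
EOF

# Keep everything from the statscollectors line to the end of the file.
stats=$(tac "$log" | grep -B99 statscollectors | tac)
printf '%s\n' "$stats"

rm -f "$log"
```

The two lines that precede the statscollectors line are dropped, and the remaining four are printed in their original order.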

If you are frequently running the above, create an issue to change the schedule.


The log file is named after the job’s ID, like 7df53218f37a11eb80dd0c9d92c523cb.log. If a crawl no longer appears on the jobs page, it can be difficult to find the crawl’s log file, because its filename is opaque. To address this, Kingfisher Collect writes the job’s ID to a scrapyd-job.txt file in the crawl’s directory. For example, to open the log file for a crawl:

cd /home/ocdskfs/scrapyd
less logs/kingfisher/colombia/$(cat data/colombia/20210708_212020/scrapyd-job.txt).log
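The lookup relies on command substitution: the contents of scrapyd-job.txt are spliced into the log file’s path. The sketch below reproduces this in a throwaway directory so that it can be tried without server access. The directory layout mirrors the paths in the text, and the file contents are placeholders.

```shell
# Build a temporary tree that mimics the layout described above.
# (Everything is created under a temporary directory; the job ID is the
# example ID from the text.)
root=$(mktemp -d)
mkdir -p "$root/data/colombia/20210708_212020" "$root/logs/kingfisher/colombia"
echo 7df53218f37a11eb80dd0c9d92c523cb > "$root/data/colombia/20210708_212020/scrapyd-job.txt"
echo 'placeholder log line' > "$root/logs/kingfisher/colombia/7df53218f37a11eb80dd0c9d92c523cb.log"

# Resolve the log file's path from the job ID stored in the crawl's directory.
cd "$root"
log_path="logs/kingfisher/colombia/$(cat data/colombia/20210708_212020/scrapyd-job.txt).log"
echo "$log_path"
```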

Create a .netrc file#

To collect data with (and update spiders in) Kingfisher Collect, you need to send requests to it from your computer as described above, using the same username and password that you use to access Scrapyd’s web interface.

Instead of setting the username and password in multiple locations (on the command line and in scrapy.cfg files), set them in one location: in a .netrc file. In order to create (or append the Kingfisher Collect credentials to) a .netrc file, replace PASSWORD with the password, and run:

echo 'machine login scrape password PASSWORD' >> ~/.netrc

You must change the file’s permissions to be readable only by the owner:

chmod 600 ~/.netrc

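To check the permissions (this is the BSD/macOS form of stat; with GNU coreutils on Linux, use stat -c "%A" ~/.netrc instead):

stat -f "%Sp" ~/.netrc

The output should be -rw-------.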
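Because the stat options differ between BSD/macOS and GNU/Linux, a check based on ls may be easier to share across machines. The following is a sketch that uses a temporary file as a stand-in for ~/.netrc, so nothing real is modified. The variable names are illustrative.

```shell
# Temporary stand-in for ~/.netrc (an assumption, so no real file is touched).
netrc=$(mktemp)
chmod 644 "$netrc"   # start from looser permissions
chmod 600 "$netrc"   # restrict to owner read/write, as instructed above

# The first 10 characters of `ls -l` output are the file type and permission bits.
perms=$(ls -l "$netrc" | cut -c1-10)

if [ "$perms" = "-rw-------" ]; then
    echo "permissions OK"
else
    echo "unexpected permissions: $perms"
fi

rm -f "$netrc"
```

To check the real file, replace the temporary file with ~/.netrc and drop the chmod 644 and rm lines.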

If you run grep ~/.netrc, you should only see the single line you added with the correct password. If there are multiple lines or an incorrect password, you must correct the file in a text editor.

To test your configuration, run:

curl -n

You should see a response like:

{"node_name": "process1", "status": "ok", "projects": ["kingfisher"]}

Data retention policy#

On the first day of each month, the following are deleted:

  • Crawl logs older than 90 days

  • Crawl directories containing exclusively files older than 90 days
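To preview which files would fall outside a 90-day window, find can list them without deleting anything. This is a sketch under assumptions: the text does not describe how the server’s cleanup is implemented, so the sketch assumes age is measured by modification time, and it runs against a throwaway directory with stand-in files.

```shell
# Throwaway directory with one old and one new log file (stand-ins only).
dir=$(mktemp -d)
touch -t 202001010000 "$dir/old.log"   # modification time set to 2020-01-01 00:00
touch "$dir/new.log"                   # modification time is now

# List (without deleting) log files last modified more than 90 days ago.
old_files=$(find "$dir" -type f -name '*.log' -mtime +90)
echo "$old_files"
```

Only old.log is listed. On the server, the same find expression could be pointed at the logs directory to see which crawl logs are due for deletion.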