Kingfisher Collect

Read the Kingfisher Collect documentation, which covers general usage.

Note

Is the service unresponsive or erroring? Follow these instructions.

Review a new publication

  1. Create an issue in the kingfisher-collect repository. CDS and James will prioritize it.
  2. Schedule a crawl, once the spider is written and deployed by CDS.
  3. Wait for the crawl to finish.
  4. Review the crawl’s log file.
  5. Review the data.

Access Scrapyd’s web interface

One-time setup

Save the username (scrape) and the password (ask a colleague) in your password manager. (If you have access, the password is the value of the kingfisher_collect.web.password key in the pillar/private/kingfisher.sls file.)

Open https://collect.kingfisher.open-contracting.org to view the statuses and logs of crawls.
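If the web interface seems unresponsive, you can also check from the command line, using Scrapyd’s daemonstatus.json endpoint (this assumes the ~/.netrc file described below):

curl -n https://collect.kingfisher.open-contracting.org/daemonstatus.json

You should see a JSON response with "status": "ok" and counts of pending, running and finished jobs.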

Collect data with Kingfisher Collect

One-time setup

Create a ~/.netrc file.

First, read this section of the Kingfisher Collect documentation.

To schedule a crawl, replace spider_name with a spider’s name and NAME with your name (you can edit the note any way you like), and run, from your computer:

curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."

You should see a response like:

{"node_name": "process1", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
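Many spiders also accept arguments, like from_date (check the Kingfisher Collect documentation for the arguments and date formats each spider supports); with Scrapyd, any additional -d parameter is passed to the spider as an argument. For example, a date-limited crawl might look like:

curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME." -d from_date=2024-01-01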

To cancel a crawl, replace JOBID with the job ID from the response or from Scrapyd’s jobs page:

curl -n https://collect.kingfisher.open-contracting.org/cancel.json -d project=kingfisher -d job=JOBID

You should see a response like:

{"node_name": "process1", "status": "ok", "prevstate": "running"}

The crawl won’t stop immediately. You can force an unclean shutdown by sending the request again; however, it’s preferred to allow the crawl to stop gracefully, so that the log file is completed.
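To check whether the crawl has stopped, refresh Scrapyd’s jobs page, or query the listjobs.json endpoint and look for the job ID under "finished" rather than "running":

curl -n "https://collect.kingfisher.open-contracting.org/listjobs.json?project=kingfisher"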

Update spiders in Kingfisher Collect

One-time setup

Create a ~/.netrc file. Then, create a ~/.config/scrapy.cfg file, and set the url variable to https://collect.kingfisher.open-contracting.org/.
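For reference, a minimal ~/.config/scrapy.cfg might look like the following (the kingfisher target name must match the target passed to scrapyd-deploy below):

[deploy:kingfisher]
url = https://collect.kingfisher.open-contracting.org/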

  1. Change to the directory containing your local repository.

  2. Ensure your local repository and the GitHub repository are in sync:

    git checkout main
    git remote update
    git status
    

    The output should be exactly:

    On branch main
    Your branch is up to date with 'origin/main'.
    
    nothing to commit, working tree clean
    
  3. Activate a virtual environment in which scrapyd-client is installed, and deploy the spiders:

    scrapyd-deploy kingfisher
    
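    One way to confirm that the deployment succeeded is to query Scrapyd’s listspiders.json endpoint; the response should list the deployed spiders, including any you added or renamed:

    curl -n "https://collect.kingfisher.open-contracting.org/listspiders.json?project=kingfisher"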

Access Scrapyd’s crawl logs

From a browser, click on a “Log” link from Scrapyd’s jobs page, or open the logs page for the kingfisher project.

From the command line, connect to the server, and change to the logs directory for the kingfisher project:

curl --silent --connect-timeout 1 collect.kingfisher.open-contracting.org:8255 || true
ssh ocdskfs@collect.kingfisher.open-contracting.org
cd scrapyd/logs/kingfisher
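From there, to follow a running crawl’s log as it is written, replace spider_name and JOBID as above and run:

tail -f spider_name/JOBID.log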

Scrapy statistics are extracted from the end of each log file every hour on the hour, into a new file ending in .log.stats in the same directory as the log file. Access these files as above, or, from the jobs page (a command-line alternative is sketched after this list):

  • Right-click on a “Log” link.
  • Select “Copy Link” or similar.
  • Paste the URL into the address bar.
  • Change .log at the end of the URL to .log.stats and press Enter.
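From the command line, you can also fetch the statistics file directly (the /logs/ URL layout below is Scrapyd’s default; if it differs, copy the exact URL from the jobs page instead):

curl -n https://collect.kingfisher.open-contracting.org/logs/kingfisher/spider_name/JOBID.log.stats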

If you can’t wait for the statistics to be extracted, connect to the server, replace spider_name with the spider’s name and alpha-numeric-string with the job ID, and run:

tac /home/ocdskfs/scrapyd/logs/kingfisher/spider_name/alpha-numeric-string.log | grep -B99 statscollectors | tac

If you are frequently running the above, create an issue to change the schedule.

Create a .netrc file

To collect data with (and update spiders in) Kingfisher Collect, you need to send requests to it from your computer, as described above, using the same username (scrape) and password (ask a colleague) that you use to access https://collect.kingfisher.open-contracting.org in a web browser.

Instead of setting the username and password in multiple locations (on the command line and in scrapy.cfg files), set them in one location: a .netrc file. To create a .netrc file (or append the Kingfisher Collect credentials to an existing one), replace PASSWORD with the password, and run:

echo 'machine collect.kingfisher.open-contracting.org login scrape password PASSWORD' >> ~/.netrc

You must change the file’s permissions to be readable only by the owner:

chmod 600 ~/.netrc

To check the permissions (this is BSD/macOS stat syntax; on Linux, use stat -c "%A" ~/.netrc):

$ stat -f "%Sp" ~/.netrc
-rw-------

If you run grep collect.kingfisher.open-contracting.org ~/.netrc, you should only see the single line you added with the correct password. If there are multiple lines or an incorrect password, you must correct the file in a text editor.

To test your configuration, run:

curl -n https://collect.kingfisher.open-contracting.org/listprojects.json

You should see a response like:

{"node_name": "process1", "status": "ok", "projects": ["kingfisher"]}

Data retention policy

On the first day of each month, the following are deleted:

  • Crawl logs older than 90 days
  • Crawl directories in which all of the files are older than 90 days