Kingfisher tasks

Access Scrapyd’s web interface


Connect to the Kingfisher Scrape server

Connect to the server as the ocdskfs user:
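The server's hostname is not recorded on this page; assuming SSH access, the connection looks like the following, with HOST standing in for the actual hostname:

```shell
# HOST is a placeholder; substitute the Kingfisher Scrape server's hostname.
ssh ocdskfs@HOST
```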


Check if spiders are running

Access Scrapyd’s web interface, click “Jobs” and look under “Running”.
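Alternatively, Scrapyd's API can be queried from the server's command line; a minimal sketch, assuming the default port used elsewhere on this page:

```shell
# Lists pending, running and finished jobs for the kingfisher project.
curl 'http://localhost:6800/listjobs.json?project=kingfisher'
```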

Collect data with Kingfisher Scrape

Read its documentation, which covers general usage.

  1. Connect to the server

  2. Schedule a crawl, setting its note spider argument and any other spider arguments you need. For example, replace spider_name with a spider’s name and NAME with your name:

    curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."

Access Scrapyd’s crawl logs

From a browser, click on a “Log” link from the jobs page, or open Scrapyd’s logs page for the kingfisher project.

From the command-line, connect to the server as the ocdskfs user, and change to the logs directory for the kingfisher project:

cd scrapyd/logs/kingfisher
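To read the most recent crawl log from that directory, one approach is to sort log files by modification time; a sketch, assuming Scrapyd's default layout of one subdirectory per spider (replace spider_name with a spider's name, as above):

```shell
# Pick the most recently modified log file for the given spider.
latest="$(ls -t spider_name/*.log 2>/dev/null | head -n 1)"
if [ -n "$latest" ]; then
  tail -n 50 "$latest"   # show the end of the most recent crawl log
fi
```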

Scrapy statistics are extracted from the end of each log file every hour, on the hour, into a new file ending in _report.log in the same directory as the log file. Access these report files as above, or, from the jobs page:

  • Right-click on a “Log” link.
  • Select “Copy Link” or similar.
  • Paste the URL into the address bar.
  • Change .log at the end of the URL to _report.log and press Enter.
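The URL rewrite in the last step can also be done in the shell with parameter expansion; a minimal sketch (the URL below is a placeholder, not a real link):

```shell
# Placeholder URL; paste a real "Log" link copied from the jobs page.
url="http://example.com/logs/kingfisher/spider_name/0123abcd.log"

# Strip the trailing .log and append _report.log instead.
report_url="${url%.log}_report.log"
echo "$report_url"
```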

Update spiders in Kingfisher Scrape

  1. Merge your changes to the master branch of the kingfisher-scrape repository.

  2. Connect to the server as the ocdskfs user and change to the working directory:

    cd ocdskingfisherscrape
  3. Pull your changes into the local repository:

    git pull --rebase
  4. Activate the virtual environment and update the project’s requirements:

    source .ve/bin/activate
    pip install -r requirements.txt
  5. Deploy the spiders:
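
The deploy command is not recorded on this page. A sketch, assuming spiders are deployed with scrapyd-client and that scrapy.cfg defines a default [deploy] target for the kingfisher project (both assumptions; check before running):

```shell
# Assumes scrapyd-client is installed in the virtualenv and scrapy.cfg
# defines a [deploy] target pointing at the local Scrapyd.
scrapyd-deploy
```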


Deploy Kingfisher Process without losing Scrapy requests

This should match salt/ocdskingfisherprocess.sls (up-to-date as of 2019-12-19). You can git log salt/ocdskingfisherprocess.sls to see if there have been any relevant changes, and update this page accordingly.

This assumes that there have been no changes to requirements.txt. If you are adding an index, altering a column, updating many rows, or performing another operation that locks tables or rows for longer than uWSGI’s harakiri setting, this might interfere with an ongoing collection (until queues are fully implemented).

Below, the two key operations are reloading uWSGI with the new application code, and migrating the database.

It’s possible for requests to arrive after uWSGI reloads and before the database migrates. If the new application code is not backwards-compatible with the old database schema, the requests might error. If, on the other hand, your old application code is forwards-compatible with the new database schema, then reload uWSGI after migrating the database, instead of before.

service uwsgi reload runs /etc/init.d/uwsgi reload, which sends the SIGHUP signal to the master uWSGI process, causing it to gracefully reload without losing any requests from Scrapy.

  1. Get the deploy token.

  2. Connect to the server as the ocdskfp user and change to the working directory:

    cd ocdskingfisherprocess
  3. Check that you won’t deploy more commits than you intend, for example:

    git fetch
    # From
    #    d8736f4..173dcf2  master                                  -> origin/master
    git log d8736f4..173dcf2
  4. Update the code:

    git pull --rebase
  5. In a new terminal, connect to the server as the root user, reload uWSGI, then close your connection to the server:

    service uwsgi reload
  6. In the original terminal, open a terminal multiplexer, in case you lose your connection while migrating the database. You can re-attach to the session with tmux attach-session -t deploy:

    tmux new -s deploy
  7. If workers are likely to interfere with a migration (e.g. inserting new rows that meet the criteria for an update), comment out the lines that start them in the cron table and kill them:

    crontab -e
    pkill -f ocdskingfisher-process-cli
  8. Migrate the database (log the time, in case you need to retry). Alembic has no verbose mode for upgrades. To see the current queries, open another terminal, open a PostgreSQL shell, and run SELECT pid, state, wait_event_type, query FROM pg_stat_activity;. If a migration query has a wait_event_type of Lock, look for queries that block it (for example, long-running DELETE queries). To stop a query, run SELECT pg_cancel_backend(PID), where PID is the pid of the query.

    . .ve/bin/activate
    python ocdskingfisher-process-cli upgrade-database
  9. Uncomment the lines that start the workers in the cron table:

    crontab -e
  10. Close the session with Ctrl-D and close your connection to the server.

  11. Release the deploy token.