Kingfisher Process¶
You can read the Kingfisher Process documentation, which covers general usage.
Note
Is the service unresponsive or erroring? Follow these instructions.
Load local data¶
Determine the source ID to use.
If the data source is the same as for an existing spider, use the same source ID, like moldova.

Otherwise, if the data source has been loaded before, or if you don’t know, use a consistent source ID. From a SQL interface, you can list all source IDs with:
SELECT source_id FROM collection GROUP BY source_id ORDER BY source_id;
If you know the data source was loaded with a source ID containing “local”, get a shorter list of source IDs with:
SELECT source_id FROM collection WHERE source_id LIKE '%local%' GROUP BY source_id ORDER BY source_id;
If this is the first time loading the data source, use a distinct source ID that follows the pattern country[_region][_label], like moldova_covid19.
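If you want to sanity-check a proposed source ID against the country[_region][_label] pattern before using it, a rough sketch is below. The regular expression is an assumption for illustration, not an official validation rule:

```shell
# Hypothetical check that a proposed source ID follows the
# country[_region][_label] pattern: lowercase words joined by underscores.
# The regex is an illustration, not an official rule.
is_valid_source_id() {
  echo "$1" | grep -qE '^[a-z]+(_[a-z0-9]+){0,2}$'
}

is_valid_source_id moldova && echo "moldova: ok"
is_valid_source_id moldova_covid19 && echo "moldova_covid19: ok"
is_valid_source_id 'Moldova 2020' || echo "Moldova 2020: rejected"
```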
.Create a data directory in your
local-load
directory, following the patternsource-YYYY-MM-DD
:mkdir ~/local-load/moldova-2020-04-07
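If you are scripting this step, the date suffix can be generated with date +%F, which prints the YYYY-MM-DD form the pattern expects. In this sketch, moldova is a placeholder source ID and a temporary directory stands in for ~/local-load:

```shell
# Create a dated data directory following the source-YYYY-MM-DD pattern.
# A temporary base directory stands in for ~/local-load in this sketch.
base="$(mktemp -d)"
dir="$base/moldova-$(date +%F)"   # e.g. moldova-2020-04-07
mkdir -p "$dir"
ls "$base"
```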
Copy the files to load into the data directory.
If you need to download an archive file (e.g. ZIP) from a remote URL, prefer curl to wget, because wget sometimes writes unwanted files like wget-log.

If you need to copy files from your local machine, you can use rsync (fast) or scp (slow). For example, on your local machine:

rsync -avz file.json USER@collect.kingfisher.open-contracting.org:~/local-load/moldova-2020-04-07
Load the data. For example, to create compiled releases and run structural checks:
sudo -u deployer /opt/kingfisher-process/load.sh --source moldova --note "Added by NAME" --compile --check /home/USER/local-load/moldova-2020-04-07
If the data source is not the same as for an existing spider, add the --force flag.

If loading data from the Data Registry, omit the --compile and --upgrade flags.

If you don’t need structural checks, omit the --check flag.

For a description of all options, run:

sudo -u deployer /opt/kingfisher-process/load.sh --help
Note
Kingfisher Process can keep the collection open for more files to be added later, by using the --keep-open flag with the load command. To learn how to use the additional commands, run:

sudo -u deployer /opt/kingfisher-process/addfiles.sh --help
sudo -u deployer /opt/kingfisher-process/closecollection.sh --help
Delete the data directory once you’re satisfied that it loaded correctly.
Add structural checks¶
If you skipped structural checks in Kingfisher Collect or when loading local data, you can reschedule them in Kingfisher Process:
Add structural checks to a collection:
sudo -u deployer /opt/kingfisher-process/addchecks.sh 123
Remove a collection¶
Remove the collection:
sudo -u deployer /opt/kingfisher-process/deletecollection.sh 123
Check on progress¶
Using the command-line interface¶
Check the collection status, replacing the collection ID (123):

$ sudo -u deployer /opt/kingfisher-process/collectionstatus.sh 123
steps: check, compile
data_type: release package
store_end_at: 2023-06-28 22:13:00.067783
completed_at: 2023-06-28 23:29:37.825645
expected_files_count: 1
collection_files: 1
processing_steps: 0

Compiled collection
compilation_started: True
store_end_at: 2023-06-28 22:13:04.060873
completed_at: 2023-06-28 22:13:04.060873
collection_files: 277
processing_steps: 0
This output means processing is complete. To learn how to interpret the output, run:
sudo -u deployer /opt/kingfisher-process/collectionstatus.sh --help
Using RabbitMQ¶
Kingfisher Process uses a message broker, RabbitMQ, to organize its tasks into queues. You can log in to the RabbitMQ management interface to see the status of the queues and check that work isn’t stuck.
Open https://rabbitmq.kingfisher.open-contracting.org. Your username and password are the same as for Kingfisher Collect.
Click on the Queues tab.
Read the rows in which the Name starts with kingfisher_process_.

If the Messages are non-zero, then there is work to do. If zero, then work is done! (Everything except the checker is fast – don’t be surprised if it’s zero.)

If the Message rates are non-zero, then work is progressing. If zero, and if there is work to do, then it is stuck!
If you think work is stuck, notify James or Yohanna.
Export compiled releases from the database as record packages¶
Check the number of compiled releases to be exported. For example:
SELECT cached_compiled_releases_count FROM collection WHERE id = 123;
Change to the directory in which you want to write the files.
Tip
Large collections will take time to export, so run the commands below in a tmux session.
To export the compiled releases to a single JSONL file, run, for example:
psql "connection string" -c '\t' \
-c 'SELECT data FROM data INNER JOIN compiled_release r ON r.data_id = data.id WHERE collection_id = 123' \
-o myfilename.jsonl
To export the compiled releases to individual files, run, for example:
psql "connection string" -c '\t' \
-c 'SELECT data FROM data INNER JOIN compiled_release r ON r.data_id = data.id WHERE collection_id = 123' \
| split -l 1 -a 5 --additional-suffix=.json
The files will be named xaaaaa.json, xaaaab.json, etc. -a 5 is sufficient for 11M files (26⁵).
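To see how split names its output, you can try it on a small file first; the three-record input below is made up:

```shell
# Demonstrate split's suffix naming on a tiny JSON Lines file.
cd "$(mktemp -d)"
printf '{"ocid":"a"}\n{"ocid":"b"}\n{"ocid":"c"}\n' > input.jsonl
split -l 1 -a 5 --additional-suffix=.json input.jsonl
ls x*.json            # xaaaaa.json  xaaaab.json  xaaaac.json

# With 5 lowercase letters there are 26^5 possible suffixes:
echo $((26 ** 5))     # 11881376
```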
If you need to wrap each compiled release in a record package, modify the files in-place. For example:
echo *.json | xargs sed -i '1i {"records":[{"compiledRelease":'
for filename in *.json; do echo "}]}" >> "$filename"; done
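The sed/echo pair above can be checked end to end on a throwaway file before running it on a real export; the single-field compiled release here is made up:

```shell
# Wrap a single compiled release in a record package, in place,
# then confirm the result is still valid JSON.
cd "$(mktemp -d)"
echo '{"ocid": "ocds-213czf-1"}' > xaaaaa.json

# Prepend the record package opening (GNU sed's 1i inserts before line 1),
# then append the closing brackets.
echo *.json | xargs sed -i '1i {"records":[{"compiledRelease":'
for filename in *.json; do echo "}]}" >> "$filename"; done

python3 -m json.tool xaaaaa.json
```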
Data retention policy¶
On the first day of each month, the following are deleted:
Collections that ended over a year ago, while retaining one set of collections per source from over a year ago
Collections that never ended and started over 2 months ago
Collections that ended over 2 months ago and have no data