Kingfisher Collect#
Read the Kingfisher Collect documentation, which covers general usage.
Note
Is the service unresponsive or erroring? Follow these instructions.
Review a new publication#
Create an issue in the kingfisher-collect repository.
Schedule a crawl, once the spider is written and deployed.
Review the data.
Access Scrapyd’s web interface#
One-time setup
Request a username and password from James or Yohanna. (They will add a key-value pair under the apache.sites.ocdskingfisherscrape.htpasswd key in the pillar/private/kingfisher_process.sls file.)
Open https://collect.kingfisher.open-contracting.org to view the statuses and logs of crawls.
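If you only need to check whether the service is responding, you can also query Scrapyd's daemonstatus.json endpoint from the command line (this assumes the .netrc setup described below):
curl -n https://collect.kingfisher.open-contracting.org/daemonstatus.json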
Collect data with Kingfisher Collect#
One-time setup
First, read this section of the Kingfisher Collect documentation.
To schedule a crawl, replace spider_name with a spider's name and NAME with your name (you can edit the note any way you like), and run, from your computer:
curl -n https://collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."
You should see a response like:
{"node_name": "process1", "status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
To cancel a crawl, replace JOBID with the job ID from the response or from Scrapyd’s jobs page:
curl -n https://collect.kingfisher.open-contracting.org/cancel.json -d project=kingfisher -d job=JOBID
You should see a response like:
{"node_name": "process1", "status": "ok", "prevstate": "running"}
The crawl won’t stop immediately. You can force an unclean shutdown by sending the request again; however, it’s preferred to allow the crawl to stop gracefully, so that the log file is completed.
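If you don't have the job ID at hand, you can also list the project's pending, running and finished jobs via Scrapyd's listjobs.json endpoint:
curl -n "https://collect.kingfisher.open-contracting.org/listjobs.json?project=kingfisher"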
Update spiders in Kingfisher Collect#
One-time setup
Create a ~/.netrc file. Then, create a ~/.config/scrapy.cfg file, and set the url variable to https://collect.kingfisher.open-contracting.org/.
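For reference, a minimal sketch of what the ~/.config/scrapy.cfg file might contain, assuming the deploy target is named kingfisher to match the scrapyd-deploy command below:
[deploy:kingfisher]
url = https://collect.kingfisher.open-contracting.org/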
Change to the directory containing your local repository.
Ensure your local repository and the GitHub repository are in sync:
git checkout main
git remote update
git status
The output should be exactly:
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
Activate a virtual environment in which scrapyd-client is installed, and deploy the spiders:
scrapyd-deploy kingfisher
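To check that the deploy succeeded, you can list the spiders now available in the kingfisher project via Scrapyd's listspiders.json endpoint:
curl -n "https://collect.kingfisher.open-contracting.org/listspiders.json?project=kingfisher"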
Access Scrapyd’s crawl logs#
From a browser, click on a “Log” link from Scrapyd’s jobs page, or open the logs page for the kingfisher project.
From the command line, connect to the server, and change to the logs directory for the kingfisher project:
ssh ocdskfs@collect.kingfisher.open-contracting.org
cd scrapyd/logs/kingfisher
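For example, to see the most recent log files for a given spider (using the colombia spider from the tip below as an example), you might run:
ls -lt colombia | head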
Scrapy statistics are extracted from the end of each log file every hour, on the hour, into a new file ending in .log.stats in the same directory as the log file. Access it as above, or, from the jobs page:
Right-click on a “Log” link.
Select “Copy Link” or similar.
Paste the URL into the address bar.
Change .log at the end of the URL to .log.stats, and press Enter.
If you can’t wait for the statistics to be extracted, you can connect to the server, replace spider_name/alpha-numeric-string, and run:
tac /home/ocdskfs/scrapyd/logs/kingfisher/spider_name/alpha-numeric-string.log | grep -B99 statscollectors | tac
If you frequently need to run the above, create an issue to change the extraction schedule.
Tip
The log file is named after the job’s ID, like 7df53218f37a11eb80dd0c9d92c523cb.log. If a crawl no longer appears on the jobs page, it can be difficult to find the crawl’s log file, because its filename is opaque. To address this, Kingfisher Collect writes the job’s ID to a scrapyd-job.txt file in the crawl’s directory. So, the log file will be at, for example:
cd /home/ocdskfs/scrapyd
less logs/kingfisher/colombia/$(cat data/colombia/20210708_212020/scrapyd-job.txt).log
Create a .netrc file#
To collect data with (and update spiders in) Kingfisher Collect, you need to send requests to it from your computer, as described above, using the same username and password as for Access Scrapyd’s web interface.
Instead of setting the username and password in multiple locations (on the command line and in scrapy.cfg files), set them in one location: a .netrc file. To create a .netrc file (or append the Kingfisher Collect credentials to an existing one), replace PASSWORD with the password, and run:
echo 'machine collect.kingfisher.open-contracting.org login scrape password PASSWORD' >> ~/.netrc
You must change the file’s permissions to be readable only by the owner:
chmod 600 ~/.netrc
To check the permissions:
$ stat -f "%Sp" ~/.netrc
-rw-------
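The stat command above uses BSD/macOS syntax. On Linux (GNU coreutils), the equivalent check is:
$ stat -c "%A" ~/.netrc
-rw-------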
If you run grep collect.kingfisher.open-contracting.org ~/.netrc, you should see only the single line you added, with the correct password. If there are multiple lines or an incorrect password, correct the file in a text editor.
To test your configuration, run:
curl -n https://collect.kingfisher.open-contracting.org/listprojects.json
You should see a response like:
{"node_name": "process1", "status": "ok", "projects": ["kingfisher"]}
Data retention policy#
On the first day of each month, the following are deleted:
Crawl logs older than 90 days
Crawl directories containing exclusively files older than 90 days