Maintain PostgreSQL

Troubleshoot performance

Check the log file – for example, /var/log/postgresql/postgresql-15-main.log – if debugging an unscheduled restart of the postgres service, for example.

Find slow queries

Use the pg_stat_statements extension. For example:

SELECT
    usename,
    substring(query, 1, 80) AS short_query,
    round(total_time::numeric, 2) AS total_time,
    calls,
    round(mean_time::numeric, 2) AS mean,
    round((100 * total_time /
    sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements s
INNER JOIN pg_user u ON s.userid = u.usesysid
ORDER BY total_time DESC
LIMIT 20;

To display the full query, you might prefer to switch to unaligned output mode (\a):

SELECT
    usename,
    query,
    round(total_time::numeric, 2) AS total_time,
    calls,
    round(mean_time::numeric, 2) AS mean,
    round((100 * total_time /
    sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements s
INNER JOIN pg_user u ON s.userid = u.usesysid
ORDER BY total_time DESC
LIMIT 20;

To reset the statistics:

SELECT pg_stat_statements_reset();

Tip

pgBadger is also available.

Reference: Tracking down slow queries in PostgreSQL

Tune server settings

  1. SSH into a PostgreSQL server as the postgres user.

  2. Run the postgresqltuner.pl file:

    /var/lib/postgresql/postgresqltuner.pl --ssd
    
  3. Under Configuration advice, address HIGH and MEDIUM recommendations.

Reference: Tuning Your PostgreSQL Server, Slow Query Questions

Check autovacuum statistics

SELECT
    nspname,
    s.relname,
    reltuples,
    n_live_tup::real,
    n_dead_tup::real,
    TRUNC(n_dead_tup / GREATEST(reltuples::numeric, 1) * 100, 2) AS percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_all_tables s
JOIN pg_class c ON relid = c.oid
JOIN pg_namespace ON relnamespace = pg_namespace.oid
ORDER BY percent DESC, last_autovacuum;

See the pg_stat_all_tables table’s documentation.

To get the table related to a TOAST table, take the number after pg_toast_, and run, for example:

SELECT '16712'::regclass;

Troubleshoot backups

If cron sends an email like:

ERROR: [082]: WAL segment 00000001000009FD0000000D was not archived before the 60000ms timeout
       HINT: check the archive_command to ensure that all options are correct (especially --stanza).
       HINT: check the PostgreSQL server log for errors.
       HINT: run the 'start' command if the stanza was previously stopped.

(The same message will appear in a log file like /var/log/pgbackrest/STANZA-backup.log.)

This error occurs occasionally (for example, due to a network issue), and is not a concern. To confirm, after 24 hours, view recent backups with:

sudo -u postgres pgbackrest info --stanza=kingfisher-2023

You should see a “full backup” within one week of the error, an “incr backup” the day before the error, and another the day after. (In other words, an incremental backup failed.)

If a full backup failed, re-run the weekly full backup command listed in the /etc/cron.d/postgres_backups file.

Check usage

Explore databases

List databases:

\l

List schemas:

\dn

List tables, views and sequences in the public schema:

\d

List tables, indexes, views and sequences in the public schema:

\dtivs

To list tables, views and/or sequences in a specific schema, append, for example, reference.* – or *.* for all schema.

You can use the psql command’s -E (--echo-hidden) flag to echo the queries generated by the backslash commands.

Check drive usage

Get all database sizes:

\l+

Get all schema sizes:

SELECT
    schema_name,
    schema_size,
    pg_size_pretty(schema_size),
    TRUNC(schema_size::numeric / pg_database_size(current_database()) * 100, 2) AS percent
FROM (
    SELECT
        nspname AS schema_name,
        SUM(pg_relation_size(c.oid))::bigint AS schema_size
    FROM pg_class c
    JOIN pg_namespace n ON c.relnamespace = n.oid
    GROUP BY schema_name
) t
ORDER BY schema_size DESC;

Get relation sizes in the public schema:

\dtis+

To get relation sizes in a specific schema, append, for example, reference.* – or *.* for all schema.

See the Database Object Size Functions documentation.

Show active connections and running queries

Show running queries:

SELECT pid, client_addr, usename, application_name, state, wait_event_type, NOW() - query_start AS time, query
FROM pg_stat_activity
WHERE query <> ''
ORDER BY time DESC;

See the pg_stat_activity table’s documentation.

Note

A client_addr in the range 172.16.0.0 to 172.31.255.255 (private address space) is Docker.

application_name can suggest an administrator’s manual connection. For example, “psql” or “DataGrip”.

Tip

To see all connections, remove the WHERE clause and add backend_type to the SELECT clause. Typically, the backend_type with no query are:

  • autovacuum launcher

  • logical replication launcher

  • background writer

  • archiver

  • checkpointer

  • walwriter

Stop a query, replacing PID with the query’s pid:

SELECT pg_cancel_backend(PID)

Find unexpected schema or tables

Find unexpected schema:

SELECT nspname
FROM pg_namespace
WHERE
    nspname NOT LIKE 'pg_temp_%' AND
    nspname NOT LIKE 'pg_toast_temp_%' AND
    nspname NOT LIKE 'summary_%' AND
    nspname NOT IN (
        'information_schema',
        'pg_catalog',
        'pg_toast',
        'public',
        'reference',
        'summaries'
    );

Find unexpected tables in the public schema:

SELECT relname
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE
    nspname = 'public' AND
    -- Ignore sequences and indexes
    relkind NOT IN ('S', 'i') AND
    relname NOT IN (
        -- Kingfisher Process tables
        'collection',
        'collection_file',
        'collection_note',
        'compiled_release',
        'data',
        'package_data',
        'record',
        'record_check',
        'release',
        'release_check',
        -- To be removed in future versions
        'alembic_version',
        'record_check_error',
        'release_check_error',
        'transform_upgrade_1_0_to_1_1_status_record',
        'transform_upgrade_1_0_to_1_1_status_release',
        -- https://www.postgresql.org/docs/current/pgstatstatements.html
        'pg_stat_statements',
        -- https://www.postgresql.org/docs/current/tablefunc.html
        'tablefunc_crosstab_2',
        'tablefunc_crosstab_3',
        'tablefunc_crosstab_4'
    );

Control access

Each individual should have a personal account.

Add a user

  1. Add, in a private Pillar file, replacing PASSWORD with a strong password and USERNAME with a recognizable username (for people, use the lowercase first initial and the family name, like jdoe):

    postgres:
      users:
        # me@example.com
        USERNAME:
          password: "PASSWORD"
    
  2. Assign the user to groups. For example, the kingfisher-main target includes the groups:

    kingfisher_process_read

    SELECT on all tables in schema public

    kingfisher_summarize_read

    SELECT on all tables in schema created by Kingfisher Summarize

    postgres:
      users:
        # me@example.com
        USERNAME:
          password: "PASSWORD"
          groups:
            - kingfisher_process_read
            - kingfisher_summarize_read
    
  3. Deploy the server

Update a password

  1. Update the private Pillar file, for example:

    postgres:
      users:
        # me@example.com
        USERNAME:
          password: "PASSWORD"
    
  2. Deploy the server

  3. Notify the contact at the email address in the comment

Delete a user

  1. Delete the user from the private Pillar file.

  2. SSH into the main server as the root user.

  3. Attempt to drop the user as the postgres user, for example:

    su - postgres -c 'psql kingfisher_process -c "DROP ROLE role2drop;"'
    
  4. If you see a message like:

    ERROR:  role "role2drop" cannot be dropped because some objects depend on it
    DETAIL:  privileges for table …
    …
    and 1234 other objects (see server log for list)
    
  5. Open the server log, and search for the relevant DROP ROLE statement (after running the command below, press /, type DROP ROLE, press Enter, and press n until you find it). For example:

    less /var/log/postgresql/postgresql-15-main.log
    
  6. If all the objects listed after DETAIL: in the server log can be dropped (press Space to scroll forward), then press q to quit less and open a SQL terminal as the postgres user:

    su - postgres -c 'psql kingfisher_process'
    
  7. Finally, drop the user:

    REASSIGN OWNED BY role2drop TO anotheruser;
    DROP OWNED BY role2drop;
    DROP ROLE role2drop;
    

Check privileges

List users and groups:

\du

Find unexpected database CREATE privileges:

SELECT usename, string_agg(datname, ', ' ORDER BY datname)
FROM pg_user
CROSS JOIN pg_database
WHERE
    usename NOT IN ('postgres') AND
    has_database_privilege(usename, datname, 'CREATE') AND
    NOT (usename = 'kingfisher_summarize' AND datname = 'kingfisher_process')
GROUP BY usename
ORDER BY usename;

Find unexpected schema CREATE privileges:

SELECT usename, string_agg(nspname, ', ' ORDER BY nspname)
FROM pg_user
CROSS JOIN pg_namespace
WHERE
    usename NOT IN ('postgres') AND
    has_schema_privilege(usename, nspname, 'CREATE') AND
    NOT (usename = 'kingfisher_process' AND nspname = 'public') AND
    NOT (usename = 'kingfisher_summarize' AND nspname LIKE 'summary_%')
GROUP BY usename
ORDER BY usename;

Find unexpected schema USAGE privileges:

SELECT usename, string_agg(nspname, ', ' ORDER BY nspname)
FROM pg_user
CROSS JOIN pg_namespace
WHERE
    usename NOT IN ('postgres') AND
    nspname NOT IN ('information_schema', 'pg_catalog', 'reference', 'summaries') AND
    has_schema_privilege(usename, nspname, 'USAGE') AND
    NOT (usename = 'kingfisher_summarize' AND nspname LIKE 'summary_%') AND
    NOT (pg_has_role(usename, 'kingfisher_process_read', 'MEMBER') AND nspname = 'public') AND
    NOT (pg_has_role(usename, 'kingfisher_summarize_read', 'MEMBER') AND nspname LIKE 'summary_%')
GROUP BY usename
ORDER BY usename;

Find unexpected table non SELECT privileges:

SELECT usename, nspname, string_agg(relname, ', ' ORDER BY relname)
FROM pg_user
CROSS JOIN pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE
    usename NOT IN ('postgres') AND
    nspname NOT IN ('pg_toast') AND
    relname NOT IN ('pg_settings') AND
    has_table_privilege(usename, c.oid, 'INSERT,UPDATE,DELETE,TRUNCATE,REFERENCES,TRIGGER') AND
    NOT (usename = 'kingfisher_process' AND nspname = 'public') AND
    NOT (usename = 'kingfisher_summarize' AND nspname LIKE 'summary_%')
GROUP BY usename, nspname
ORDER BY usename, nspname;

Reference: System Information Functions for functions like has_schema_privilege

Restore from backup

pgBackRest

Get the stanza name from the postgres:backup:stanza key in the Pillar data, or from the /etc/pgbackrest/pgbackrest.conf file on the server.

View current backups:

pgbackrest info --stanza=example
  1. Stop PostgreSQL:

    systemctl stop postgres.service
    
  2. In most cases, restore using the --delta option, as it saves time by using hashes to restore only files that differ:

    sudo -u postgres pgbackrest restore --stanza=example --delta
    

    To restore from a backup other than the latest, for example:

    sudo -u postgres pgbackrest restore --stanza=example --delta --set=20210315-145357F_20210315-145459I
    

    To restore every file, for example:

    rm -rf /var/lib/postgresql/15/main
    mkdir /var/lib/postgresql/15/main
    sudo -u postgres pgbackrest restore --stanza=example
    
  3. Start PostgreSQL:

    systemctl start postgres.service
    

pg_dump and pg_restore

If pgBackRest isn’t an option, use pg_dump and pg_restore.

  1. Connect to the old server, and export the existing database. For example, to export spoonbill_web:

    su - postgres -c "/usr/bin/pg_dump -Ft 'spoonbill_web' | gzip > ~/spoonbill_web.tar.gz"
    
  2. Copy the database dump to your local machine. For example:

    rsync -avz root@ocp13.open-contracting.org:/var/lib/postgresql/spoonbill_web.tar.gz .
    
  3. Copy the database dump to the new server. For example:

    rsync -avz spoonbill_web.tar.gz root@ocp27.open-contracting.org:~/
    
  4. Restore the database. For example:

    gunzip ~/spoonbill_web.tar.gz
    sudo -u postgres pg_restore -cC --if-exists -v -U postgres -d spoonbill_web ~/spoonbill_web.tar
    

Recover the replica

If replication breaks or the replica server goes offline, you must recover the replica, in two stages: mitigate the downtime, and fix the replication.

Mitigate downtime

  1. Enable public access to the PostgreSQL service on the main server, by modifying its Pillar file:

    postgres:
      public_access: True
    

    For example, for the kingfisher-main target, modify the pillar/kingfisher_main.sls file.

  2. Deploy the main server

  3. Update DNS records:

    1. Login to GoDaddy

    2. If access was delegated, open Delegate Access and click the Access Now button

    3. Open DNS Management for open-contracting.org

    4. Update the replica’s CNAME record to point to the main server’s A record

    5. Wait for the changes to propagate, which depends on the original TTL value

Fix replication

  1. SSH into the replica server as the root user

  2. Stop PostgreSQL:

    systemctl stop postgres.service
    
  3. Download the latest database or a backup from a specific point in time, replacing kingfisher with the value set in pillar postgres:backup:stanza. pgBackRest is configured to restore the replication configuration (see, for example, /var/lib/postgresql/15/main/postgresql.conf).

    pgbackrest --stanza=kingfisher --type=standby --delta restore
    

    Note

    See Restore from backup for more information on the pgBackRest restore function.

  4. Start PostgreSQL and monitor:

    systemctl start postgres.service
    tail -f /var/log/postgresql/postgresql-15-main.log
    

    You should see messages about recovering from WAL files in the logs.

If all else fails, rebuild the replica.