Four databases. One arena. Expect chaos, comedy, and questionable benchmarking as PostgreSQL takes on the rest in this live, totally biased battle for DB supremacy.
Thursday, September 4 at 17:40 – 18:05
Thursday, September 4 at 18:10 – 19:30
Thursday, September 4 at 17:35 – 17:40
Thursday, September 4 at 17:05 – 17:35
Thursday, September 4 at 16:15 – 17:00
Understanding a project’s ecosystem is the main barrier to entry. I’ve held community roles for different projects (the Passenger app server, k6) and technology areas (Ruby, DevOps, Microsoft), but I found the Postgres ecosystem one of the most difficult to grasp.
However, the project and the people grew on me, and I believe that to keep up with growing demand as the most popular database according to Stack Overflow, Postgres needs to onboard new contributors, and it needs to do so quickly.
There are plenty of people who want to contribute but don’t know where to start. I think I do. Join me for a module I developed for in-house training at EDB, which I’ll have open sourced by the time PGDay Austria takes place, in line with the project’s ethos.
Thursday, September 4 at 16:15 – 17:00
In late 2023, the Java community started a challenge to find the most efficient way to process a file with 1 billion rows of data. Unsurprisingly, many database communities quickly took on the same challenge with varying results. Postgres, in many cases, performed the worst without close attention to settings and efficient resource utilization. But, with a little more effort, could it compete head-to-head?
In this session, we’ll look at the original challenge and how to approach it with vanilla Postgres beyond the basics. Next, we’ll explore how the increasingly popular in-memory analytics database, DuckDB, handles the same challenge. Finally, we’ll explore recent opportunities to integrate the two databases together to provide a powerful analytical engine with Postgres for the best of both worlds.
Postgres.sql script:
--CREATE DATABASE obrc;
SHOW shared_buffers;
CREATE UNLOGGED TABLE obrc
(
station_name VARCHAR(26),
measurement NUMERIC(3,1)
);
-- Use to override server default
ALTER TABLE obrc SET (parallel_workers=16);
-- Loading into a single table took ~3 minutes
COPY obrc(station_name, measurement)
FROM '/mnt/1brc/measurements.txt'
WITH
(
FORMAT CSV,
DELIMITER ';'
);
SELECT * FROM obrc LIMIT 20;
-- This will take a very long time to complete
SELECT count(*) FROM obrc;
ANALYZE obrc;
-- If ANALYZE is up to date
SELECT reltuples::numeric as count
FROM pg_class
WHERE relname='obrc';
/*
* ~34 seconds
*/
EXPLAIN ANALYZE
SELECT
station_name,
MIN(measurement) AS min_measurement,
ROUND(AVG(measurement),1) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM obrc
GROUP BY station_name
ORDER BY station_name;
/*
* ~34 seconds
*/
EXPLAIN ANALYZE
SELECT
'{' ||
STRING_AGG(station_name || '=' || min_measurement || '/' || mean_measurement || '/' || max_measurement, ', ' ORDER BY station_name) ||
'}' AS result
FROM
(SELECT station_name,
MIN(measurement) AS min_measurement,
ROUND(AVG(measurement), 1) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM obrc
GROUP BY station_name
) AS agg; -- the subquery needs an alias in Postgres
-- partitioned by station name
CREATE UNLOGGED TABLE obrc_partitioned
(
station_name VARCHAR(26),
measurement NUMERIC(3,1)
) PARTITION BY hash(station_name);
TRUNCATE obrc_partitioned;
CREATE UNLOGGED TABLE obrc_p0
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 0);
CREATE UNLOGGED TABLE obrc_p1
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 1);
CREATE UNLOGGED TABLE obrc_p2
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 2);
CREATE UNLOGGED TABLE obrc_p3
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 3);
CREATE UNLOGGED TABLE obrc_p4
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 4);
CREATE UNLOGGED TABLE obrc_p5
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 5);
CREATE UNLOGGED TABLE obrc_p6
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 6);
CREATE UNLOGGED TABLE obrc_p7
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 7);
CREATE UNLOGGED TABLE obrc_p8
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 8);
CREATE UNLOGGED TABLE obrc_p9
PARTITION OF obrc_partitioned
FOR VALUES WITH (modulus 10, remainder 9);
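The ten CREATE statements above can also be generated in a loop; a minimal sketch using a PL/pgSQL DO block, equivalent to the explicit DDL (assumes the parent table already exists):

```sql
-- Generate the ten hash partitions in a loop instead of spelling them out
DO $$
BEGIN
    FOR i IN 0..9 LOOP
        EXECUTE format(
            'CREATE UNLOGGED TABLE obrc_p%s PARTITION OF obrc_partitioned
             FOR VALUES WITH (modulus 10, remainder %s)', i, i);
    END LOOP;
END $$;
```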
-- This doesn't make a difference because COPY is
-- single-threaded
COPY obrc_partitioned(station_name, measurement)
FROM '/tmp/measurements.txt'
WITH
(
FORMAT CSV,
DELIMITER ';'
);
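Since COPY is single-threaded, a common workaround (not part of the original script; paths and database name are illustrative) is to split the input file and run several COPY sessions in parallel, so the partitions are loaded concurrently:

```shell
# Split the input into 8 chunks on line boundaries, then load each
# chunk in its own backend session so the loads run in parallel
split -n l/8 /tmp/measurements.txt /tmp/chunk_
for f in /tmp/chunk_*; do
    psql -d obrc -c "\copy obrc_partitioned FROM '$f' WITH (FORMAT CSV, DELIMITER ';')" &
done
wait
```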
EXPLAIN ANALYZE
SELECT
station_name,
MIN(measurement) AS min_measurement,
ROUND(AVG(measurement),1) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM obrc_partitioned
GROUP BY station_name
ORDER BY station_name;
EXPLAIN ANALYZE
SELECT
'{' ||
STRING_AGG(station_name || '=' || min_measurement || '/' || mean_measurement || '/' || max_measurement, ', ' ORDER BY station_name) ||
'}' AS result
FROM
(SELECT station_name,
MIN(measurement) AS min_measurement,
ROUND(AVG(measurement), 1) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM obrc_partitioned
GROUP BY station_name
) AS agg; -- the subquery needs an alias in Postgres
/*
* DuckDB
* Stop Postgres Container
*
*
*/
.duckdb/cli/latest/duckdb
.timer ON
SELECT * FROM READ_CSV_AUTO('/tmp/measurements.txt') LIMIT 5;
CREATE OR REPLACE TABLE measurements AS
SELECT * FROM READ_CSV('/tmp/measurements.txt',
header=false,
columns= {'station_name':'VARCHAR','measurement':'double'},
delim=';');
SELECT station_name,count(*) FROM measurements
GROUP BY 1;
SELECT station_name,
MIN(measurement),
AVG(measurement),
MAX(measurement)
FROM measurements
GROUP BY station_name
LIMIT 5;
WITH src AS (SELECT station_name,
MIN(measurement) AS min_measurement,
CAST(AVG(measurement) AS DECIMAL(8,1)) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM measurements
GROUP BY station_name
LIMIT 5)
SELECT station_name || '=' || CONCAT_WS('/',min_measurement, mean_measurement, max_measurement)
FROM src;
/*
* Stop duckdb
* Start pg_duckdb container
*/
--CALL duckdb.recycle_ddb();
--ALTER SYSTEM SET duckdb.max_memory='15GB';
--SELECT * FROM pg_available_extensions ORDER BY name;
--CREATE EXTENSION pg_duckdb;
--truncate obrc;
--SHOW duckdb.max_memory;
-- Can still use duckdb functions for reading data
SELECT * FROM read_csv('/mnt/1brc/measurements.txt') LIMIT 20;
SELECT count(*) FROM read_csv('/mnt/1brc/measurements.txt');
-- Not supported yet
INSERT INTO obrc
SELECT * FROM read_csv('/mnt/1brc/measurements.txt') r LIMIT 20;
CREATE TEMPORARY TABLE obrc_temp(
station_name text,
measurment NUMERIC(10,2)) USING duckdb;
/*
* also takes ~1.5 minutes
*/
COPY obrc_temp FROM '/mnt/1brc/measurements.txt';
SELECT count(*) FROM obrc_temp;
/*
* ~7 seconds
*/
SELECT station_name,count(*) FROM obrc_temp
GROUP BY 1;
/*
* On my Docker image this didn't have an impact
* for some reason.
*/
SET duckdb.max_workers_per_postgres_scan=12;
SET duckdb.worker_threads=12;
SET duckdb.force_execution=true;
/*
* We can utilize the native file formats
*/
COPY (SELECT * FROM obrc_temp) TO '/mnt/1brc/measurements.parquet' (format parquet);
/*
* pull station names out for later transformation and JOIN queries
*/
COPY (SELECT DISTINCT station_name FROM obrc_temp) TO '/mnt/1brc/stations.txt';
CREATE TABLE station(
id INT generated ALWAYS AS identity,
station_name text);
COPY station(station_name) FROM '/mnt/1brc/stations.txt';
--DROP TABLE station;
SELECT * FROM station;
/*
* 12 seconds
*/
--EXPLAIN analyze
SELECT
id, obrc_temp.station_name,
MIN(measurment) AS min_measurement,
ROUND(AVG(measurment),1) AS mean_measurement,
MAX(measurment) AS max_measurement
FROM obrc_temp JOIN station using (station_name)
GROUP BY id, station_name
ORDER BY station_name;
SELECT station_name, count(*) FROM obrc_temp
GROUP BY 1;
/*
* playing with timestamps
*/
CREATE TEMPORARY TABLE obrc_temp2(
ts timestamptz,
station_name text,
measurement numeric(10,2)
) USING duckdb;
WITH dates AS (
--SELECT o,t FROM generate_series('2024-01-01','2025-12-31','30 seconds'::interval) WITH ordinality AS g(t,o)
SELECT ROW_NUMBER() OVER() AS o, t FROM generate_series('2024-01-01','2025-12-31','30 seconds'::interval) AS g(t)
),
m AS (
SELECT row_number() OVER() AS o, station_name, measurment
FROM obrc_temp
WHERE station_name='Helsinki'
)
-- Insert into the timestamped table (obrc_temp has no timestamp column)
INSERT INTO obrc_temp2 SELECT t, station_name, measurment
FROM dates
JOIN m USING (o);
SELECT time_bucket('1 day', ts) AS bucket, station_name, avg(measurement)
FROM obrc_temp2
GROUP BY 1,2
ORDER BY 1;
/*
* OOM error trying to process
*/
SELECT row_number() OVER(PARTITION BY station_name), * FROM obrc_temp LIMIT 10;
/*
* Stop pg_duckdb container
* Start pg_mooncake container
*/
CREATE DATABASE obrc;
CREATE EXTENSION pg_mooncake;
DROP TABLE obrc;
CREATE TABLE obrc
(
station_name VARCHAR(26),
measurement NUMERIC(3,1)
);
CALL mooncake.create_table('obrc_iceberg', 'obrc');
-- not supported
ALTER TABLE obrc SET (parallel_workers=16);
-- Not supported yet
INSERT INTO obrc
SELECT station_name,measurement
FROM mooncake.read_csv('/mnt/obrc/measurements.txt')
AS (station_name text, measurement float4) LIMIT 20;
-- Works as normal
-- Back-of-napkin estimate: ~20% faster
COPY obrc(station_name, measurement)
FROM '/mnt/obrc/measurements.txt'
WITH
(
FORMAT CSV,
DELIMITER ';'
);
-- This didn't seem to work while COPY was happening.
SELECT * FROM pg_stat_progress_copy;
SELECT pg_size_pretty(pg_total_relation_size('obrc'));
SELECT count(*) FROM obrc;
SELECT * FROM obrc LIMIT 10;
--EXPLAIN analyze
SELECT
station_name,
MIN(measurement) AS min_measurement,
ROUND(AVG(measurement),1) AS mean_measurement,
MAX(measurement) AS max_measurement
FROM obrc_iceberg
GROUP BY station_name
ORDER BY station_name;
TRUNCATE obrc;
postgres_config.sql script:
ALTER SYSTEM SET max_worker_processes=16;
ALTER SYSTEM SET max_parallel_workers=8;
ALTER SYSTEM SET max_parallel_workers_per_gather=8;
ALTER SYSTEM SET work_mem='1GB';
ALTER SYSTEM SET shared_buffers='20GB';
ALTER SYSTEM SET random_page_cost=1.1;
ALTER SYSTEM SET jit=OFF;
Thursday, September 4 at 16:15 – 17:00
Picking up the topic of previous talks about using PostgreSQL to store many data, very many data, and absurdly many data, this talk looks at the problems and challenges you may run into when trying to shove all that data into PostgreSQL.
Many things that are simple with small amounts of data can become headache-inducing.
As data grows even a simple ‘select count(*)’ can be unpleasantly slow.
Taking a naive backup could easily take more than a day, and restoring such a backup is similarly time-consuming.
We will explore the limits of where PostgreSQL can go, the limits imposed by physics, hardware, and software, and ideas how we can go beyond those limitations.
We’ll also look at some performance problems and the many strategies we have to mitigate them.
Hopefully this talk will give you the confidence that PostgreSQL can grow with your data, and show you some of the options you have to improve performance when your data just keeps growing and growing.
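As a taste of one such mitigation, planner statistics can stand in for a full-scan count; a minimal sketch (`my_big_table` is a placeholder name, and the estimate assumes ANALYZE has run recently):

```sql
-- Approximate row count from planner statistics;
-- orders of magnitude faster than SELECT count(*) on a huge table
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'my_big_table';
```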
Thursday, September 4 at 15:50 – 16:10
Thursday, September 4 at 15:00 – 15:45
PostgreSQL’s autovacuum tuning in the cloud era is a delicate balancing act: it demands careful juggling of application performance, cloud costs, and maintenance (autovacuum) efficiency. Databases on bare metal enjoyed superior throughput and ultra-low latency. In contrast, the cloud era has introduced performance speed bumps, even as it simplifies management for DBAs.
This talk will unravel how to optimize autovacuum settings for peak efficiency across AWS, GCP, and Azure. Discover strategies to overcome throughput and resource limits without busting your cloud budget, keeping your cloud FinOps team smiling all along.
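For a flavor of the kind of tuning discussed, autovacuum can be adjusted per table rather than globally; a sketch with illustrative values (`orders` is a hypothetical hot table):

```sql
-- Trigger vacuum after ~2% dead tuples instead of the 20% default,
-- and let each vacuum run do more work per cost cycle
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_cost_limit   = 2000
);
```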
Thursday, September 4 at 15:00 – 15:45
Join this talk to learn how to secure your database in Postgres using streaming replication, from the very BASIC to the very ADVANCED. After all, even if your database is small, your data might still be precious. And as your project grows and gets bigger, your Postgres environment will need to be adapted to scale with it.
This talk will give you a beginner’s guide to streaming replication in Postgres and more specifically you will learn about WAL (write-ahead logging), how to prepare for DR, and how to use the `pg_receivewal` utility in a large setup with cascading replicas, continuous backup, etc. You’ll also learn about the concepts of RTO and RPO and how to scale your data protection architecture as your application grows.
Why attend this talk? Because the amazing power of streaming replication in PostgreSQL is too under-appreciated.
Drawing on over two decades of practical experience, this talk will guide you through some of the pitfalls and introduce you to the full range of Postgres disaster-resilience techniques.
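For reference, a minimal `pg_receivewal` invocation of the kind the talk covers (hostname, user, and archive path are placeholders):

```shell
# Stream WAL from the primary into a local archive directory as it is generated
pg_receivewal -h primary.example.com -U replicator -D /var/lib/pgsql/wal_archive
```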