
Queries in PostgreSQL: 3. Sequential scanning


In previous articles, I have covered the stages of query execution and statistics.

Now it’s time to look at the most important nodes that a plan can consist of. I’ll start with ways to access the data, and in this article I’ll talk about sequential scanning.

Last time, I showed how cardinality is calculated from statistics; in this article and the next ones, I will show how the cost of plan nodes is calculated. Not that the specific cost formulas are essential for understanding the planner in detail, but I want to show that all the figures are derived from statistics without any black magic involved.

Pluggable storage engines

The way PostgreSQL organizes data on disk is neither the only possible one nor the best for every type of workload. Following the idea of extensibility, since version 12 PostgreSQL allows you to create and plug in different table access methods (storage engines), although only one is currently available "out of the box":

SELECT amname, amhandler
FROM pg_am
WHERE amtype = 't';

 amname |      amhandler
--------+----------------------
 heap   | heap_tableam_handler
(1 row)

The engine name can be specified when creating a table (CREATE TABLE ... USING); otherwise, the engine defined by the default_table_access_method parameter is used.
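For instance, the engine can be specified explicitly like this (the test_heap table name is just an illustration; with only the built-in engine available, the result is the same as the default):

CREATE TABLE test_heap (id integer) USING heap;

SHOW default_table_access_method;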

For the core to be able to work with different engines uniformly, table access methods must implement a special interface. The function specified in the amhandler column returns an interface structure containing all the information the core needs.

Most core components remain common to all table access methods:

  • transaction manager, including support for ACID and snapshot-based isolation;

  • buffer manager;

  • I/O subsystem;

  • TOAST;

  • optimizer and query executor;

  • index support.

Not all of these components may be needed by the engine, but the possibility of using them remains.

In turn, the engine determines:

  • row version format and data layout;

  • table scan implementation;

  • implementation of inserts, deletes, updates, and locks;

  • row version visibility rules;

  • vacuuming and analysis procedures;

  • sequential scan cost estimation.

Historically, PostgreSQL used a single storage system built into the core without any clearly defined programming interface. That is why it is now so hard to come up with a good interface that takes into account all the established features of the standard engine while not getting in the way of other methods.

For example, write-ahead logging is still an open question. New access methods need to log operations of their own that the core knows nothing about. The existing generic WAL record mechanism is usually unsuitable because of its excessive overhead. Another interface could be introduced for plugging in new WAL record types, but then crash recovery reliability would depend on external code, which is highly undesirable. So far the only option is to patch the core for each particular engine.

Several engines are currently being developed, among them:

  • Zheap is designed to fight table bloat. It implements in-place row updates and moves the historical data needed to build snapshots into a separate undo storage. Such an engine will be useful for workloads with frequent data updates.

    The design of the engine will seem familiar to Oracle users, although there are some nuances (for example, the index access method interface does not allow creating indexes with their own versioning).

  • Zedstore implements columnar storage and should be effective for OLAP queries.

    The data is organized into a main B-tree of row version identifiers, and each column is stored in its own B-tree linked to the main one. In the future, the engine may be able to store several columns in one tree, resulting in hybrid storage.

Sequential scan

The storage engine determines the physical layout of table data and provides an access method to it, sequential scanning, in which the file (or files) of the table's main fork are read in full. On each page read, the visibility of each row version is checked; row versions that do not satisfy the query conditions are discarded.


Reading goes through the buffer cache; to keep large tables from evicting useful data, a small ring buffer is used for sequential scans. Other processes scanning the same table at the same time join the ring and thus save on disk reads. Because of this, a scan does not necessarily start at the beginning of the file.

A sequential scan is the most efficient way to read all or a significant portion of a table. In other words, sequential scanning works well when selectivity is low. (At high selectivity, when only a small fraction of rows are needed from the whole table, using an index is preferable.)

Example plan

In the execution plan, the sequential scan is represented by the Seq Scan node:

EXPLAIN SELECT * FROM flights;

                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

The estimate of the number of rows is a basic statistic:

SELECT reltuples FROM pg_class WHERE relname = 'flights';

 reltuples
-----------
    214867
(1 row)

The optimizer considers two components in the cost estimation: disk I/O and CPU resources.

I/O cost is estimated as the number of pages in the table multiplied by the cost of reading a single page, assuming the pages are read sequentially. When the buffer manager asks the operating system for a page of data, more data is actually read from disk at once, so the next several pages are very likely to already be in the operating system cache. Because of this, the cost of reading a single page sequentially (which the planner takes from the seq_page_cost parameter, 1 by default) is lower than the cost of random access (defined by the random_page_cost parameter, 4 by default).
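The current values of these two parameters can be checked with a simple query (my own illustration, not from the original article):

SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost');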

The default ratio is typical for HDDs; for SSDs it makes sense to lower random_page_cost considerably (seq_page_cost is usually left alone as the reference unit value). Since the values depend on hardware characteristics, the parameters are usually set at the tablespace level (ALTER TABLESPACE ... SET).
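For example, for a tablespace located on an SSD one could set something along these lines (the tablespace name fast_ssd and the value 1.1 are illustrative assumptions):

ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);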

SELECT relpages,
  current_setting('seq_page_cost') AS seq_page_cost,
  relpages * current_setting('seq_page_cost')::real AS total
FROM pg_class WHERE relname = 'flights';

 relpages | seq_page_cost | total
----------+---------------+-------
     2624 | 1             |  2624
(1 row)

The formula above clearly shows the consequences of table bloat caused by untimely vacuuming: the larger the table's main fork, the more pages have to be scanned, regardless of how many live row versions they contain.

The CPU cost estimate is made up of the cost of processing each row version (defined for the planner by the cpu_tuple_cost parameter, 0.01 by default):

SELECT reltuples,
  current_setting('cpu_tuple_cost') AS cpu_tuple_cost,
  reltuples * current_setting('cpu_tuple_cost')::real AS total
FROM pg_class WHERE relname = 'flights';

 reltuples | cpu_tuple_cost |  total
-----------+----------------+---------
    214867 | 0.01           | 2148.67
(1 row)

The sum of these two estimates gives the total cost of the plan. The startup cost is zero, since sequential scanning requires no preparatory steps.
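As a quick check, the two components indeed add up to the total cost shown in the Seq Scan node above:

SELECT 2624 + 2148.67 AS total_cost;  -- 4772.67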

If conditions are applied to the table being scanned, they are shown in the query plan below the Seq Scan node. The row count estimate takes the selectivity of the conditions into account, and the cost estimate includes the cost of evaluating them. The EXPLAIN ANALYZE command outputs both the actual number of rows returned and the number of rows filtered out by the conditions:

EXPLAIN (analyze, timing off, summary off)
SELECT * FROM flights WHERE status = 'Scheduled';

                   QUERY PLAN
------------------------------------------------
 Seq Scan on flights
   (cost=0.00..5309.84 rows=15383 width=63)
   (actual rows=15383 loops=1)
   Filter: ((status)::text = 'Scheduled'::text)
   Rows Removed by Filter: 199484
(5 rows)

Example plan with aggregation

Let’s look at a slightly more complex example of a query execution plan with aggregation:

EXPLAIN SELECT count(*) FROM seats;

                          QUERY PLAN
--------------------------------------------------------------
 Aggregate  (cost=24.74..24.75 rows=1 width=8)
   ->  Seq Scan on seats  (cost=0.00..21.39 rows=1339 width=0)
(2 rows)

The plan consists of two nodes: the upper Aggregate node, where the count function is computed, receives data from the lower Seq Scan node, which scans the table.

The startup cost of the Aggregate node includes the aggregation itself, since it cannot return anything before receiving all the rows from the lower node. It is estimated as one conditional operation, cpu_operator_cost (0.0025 by default), per input row:

SELECT reltuples,
  current_setting('cpu_operator_cost') AS cpu_operator_cost,
  round((reltuples * current_setting('cpu_operator_cost')::real)::numeric, 2) AS cpu_cost
FROM pg_class WHERE relname = 'seats';

 reltuples | cpu_operator_cost | cpu_cost
-----------+-------------------+----------
      1339 | 0.0025            |     3.35
(1 row)

The resulting estimate is added to the total cost of the underlying Seq Scan node.

The total cost of the Aggregate node additionally includes cpu_tuple_cost, the cost of processing one result row:

WITH t(cpu_cost) AS (
  SELECT round((reltuples * current_setting('cpu_operator_cost')::real)::numeric, 2)
  FROM pg_class WHERE relname = 'seats'
)
SELECT 21.39 + t.cpu_cost AS startup_cost,
  round((21.39 + t.cpu_cost +
    1 * current_setting('cpu_tuple_cost')::real)::numeric, 2) AS total_cost
FROM t;

 startup_cost | total_cost
--------------+------------
        24.74 |      24.75
(1 row)

Parallel execution plans

Since version 9.6, PostgreSQL has supported parallel query execution. The idea is that the master process executing the query spawns (via postmaster) several worker processes, which simultaneously execute the parallel part of the same plan. The results of this execution are passed to the master process, which gathers them in the Gather node. While it is not busy receiving data, the master process may also execute the parallel part of the plan itself.

If necessary, the master process can be excluded from executing the parallel part of the plan using the parallel_leader_participation parameter, which appeared in version 11.
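A minimal sketch of turning it off for the current session (illustrative only):

SET parallel_leader_participation = off;
SHOW parallel_leader_participation;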


Of course, starting processes and passing data between them is not free, so by no means every query should be executed in parallel.

In addition, even with parallel execution, not all steps of the query plan can be parallelized. Some of the operations can be performed by the master process alone, sequentially.

Parallel sequential scanning

An example of a node designed for parallel execution is Parallel Seq Scan.

The name sounds contradictory (after all, parallel or sequential?), but nevertheless it reflects the essence of the operation. In terms of file accesses, the table pages are read sequentially, in the same order as they would be read in a normal sequential scan. However, the reading is performed by several processes running in parallel. The processes are synchronized with each other using a dedicated area of shared memory so that the same page is not read twice.

The subtle point is that instead of an overall picture of a sequential scan, the operating system sees several processes performing random reads. Because of this, data prefetching, which normally speeds up sequential reads, works poorly. Therefore, starting with PostgreSQL 14, each process is allocated a set of consecutive pages to read instead of one.

Parallel scanning by itself does not make much sense, because to the usual cost of reading pages we add the overhead of passing data from process to process. But if the worker processes do some processing of the rows they read (aggregation, for example), the total query execution time can be considerably lower.

Example of parallel plan with aggregation

A plan for a simple query with aggregation over a large table uses parallelism:

EXPLAIN SELECT count(*) FROM bookings;

                          QUERY PLAN
--------------------------------------------------------------
 Finalize Aggregate  (cost=25442.58..25442.59 rows=1 width=8)
   ->  Gather  (cost=25442.36..25442.57 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate
               (cost=24442.36..24442.37 rows=1 width=8)
               ->  Parallel Seq Scan on bookings
                     (cost=0.00..22243.29 rows=879629 width=0)
(7 rows)

The nodes below the Gather node make up the parallel part of the plan. It is executed in each of the worker processes (two are planned here) and, possibly, in the master process (unless this is disabled by parallel_leader_participation). The Gather node itself and the nodes above it form the sequential part of the plan and are executed only in the master process.

The Parallel Seq Scan node represents the scan of the table in parallel mode. The rows field shows an estimate of the number of rows to be processed by one process. Two worker processes are planned, and the master process will do part of the work too, so the total number of rows in the table is divided by 2.4 rather than by 3 (the master's share shrinks as the number of workers grows).

SELECT reltuples::numeric, round(reltuples / 2.4) AS per_process
FROM pg_class WHERE relname = 'bookings';

 reltuples | per_process
-----------+-------------
   2111110 |      879629
(1 row)

The cost of the Parallel Seq Scan node is estimated almost the same way as for a plain sequential scan. The gain comes from each process handling fewer rows, but the I/O component is included in full, since the whole table still has to be read page by page:

SELECT round((
  relpages * current_setting('seq_page_cost')::real +
  reltuples / 2.4 * current_setting('cpu_tuple_cost')::real
)::numeric, 2)
FROM pg_class WHERE relname = 'bookings';

  round
----------
 22243.29
(1 row)

The next node, Partial Aggregate, aggregates the data produced by its worker process, that is, counts the rows in this case.

The aggregation cost is estimated in the way we already know and is added to the table scan estimate:

WITH t(startup_cost) AS (
  SELECT 22243.29 + round(
    (reltuples / 2.4 * current_setting('cpu_operator_cost')::real)::numeric, 2)
  FROM pg_class WHERE relname = 'bookings'
)
SELECT startup_cost,
  startup_cost + round(
    (1 * current_setting('cpu_tuple_cost')::real)::numeric, 2) AS total_cost
FROM t;

 startup_cost | total_cost
--------------+------------
     24442.36 |   24442.37
(1 row)

The next node, Gather, is executed by the master process. This node is responsible for starting up worker processes and getting data from them.

The cost of starting the processes (regardless of their number) is defined for the planner by the parallel_setup_cost parameter (1000 by default), and the cost of passing one row between processes by parallel_tuple_cost (0.1 by default). Here the startup cost (starting the processes) dominates; it is added to the startup cost of the Partial Aggregate node. The total cost also includes passing two rows; that value is added to the total cost of the Partial Aggregate node:

SELECT 24442.36 + round(
    current_setting('parallel_setup_cost')::numeric, 2) AS setup_cost,
  24442.37 + round(
    current_setting('parallel_setup_cost')::numeric +
    2 * current_setting('parallel_tuple_cost')::numeric, 2) AS total_cost;

 setup_cost | total_cost
------------+------------
   25442.36 |   25442.57
(1 row)

The last node, Finalize Aggregate, combines the partial aggregates received by the Gather node from the parallel processes. It is costed like an ordinary aggregate. The startup cost includes aggregating three rows; it is added to the total cost of the Gather node (since all rows are needed before the result can be computed). The total cost additionally includes the cost of returning one result row.

WITH t(startup_cost) AS (
  SELECT 25442.57 + round(
    (3 * current_setting('cpu_operator_cost')::real)::numeric, 2)
  FROM pg_class WHERE relname = 'bookings'
)
SELECT startup_cost,
  startup_cost + round(
    (1 * current_setting('cpu_tuple_cost')::real)::numeric, 2) AS total_cost
FROM t;

 startup_cost | total_cost
--------------+------------
     25442.58 |   25442.59
(1 row)

Limitations of parallel execution

Number of worker processes

In general, the background worker process mechanism is used for more than just parallel query execution. For example, background workers take part in logical replication and can be used by extensions. The total number of background worker processes running at the same time is limited by the max_worker_processes parameter (8 by default).

The number of background workers simultaneously engaged in parallel plans is limited by max_parallel_workers (also 8 by default).

The number of worker processes serving one master process is limited by max_parallel_workers_per_gather (2 by default).
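The current values of all three limits can be checked with a simple query (my own illustration, not from the original article):

SELECT name, setting
FROM pg_settings
WHERE name IN ('max_worker_processes',
               'max_parallel_workers',
               'max_parallel_workers_per_gather');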

It may make sense to change these parameters depending on:

  • hardware capabilities – the system must have free cores unoccupied by other tasks;

  • the presence of large tables in the database;

  • the workload – there should be queries that can potentially benefit from parallel execution.

In most cases these criteria are met by OLAP rather than OLTP systems.

The planner will not consider parallel scanning at all if the table is estimated to be smaller than min_parallel_table_scan_size (8MB by default).

Usually the number of processes is calculated by the formula:

number of processes = 1 + floor( log3( table size / min_parallel_table_scan_size ) )

This means that each tripling of the table size adds another parallel process. For example, with the default parameter values:

 table size, MB | number of processes
----------------+---------------------
              8 | 1
             24 | 2
             72 | 3
            216 | 4
            648 | 5
           1944 | 6
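The same formula can be sketched in SQL for a particular table. This is only a rough illustration of the planner's logic, not its actual code; the use of log(3, ...) and pg_size_bytes here is my own way of expressing it:

SELECT 1 + floor(log(3,
         pg_table_size('bookings')::numeric /
         pg_size_bytes(current_setting('min_parallel_table_scan_size'))
       )) AS planned_workers;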

The number of processes can also be specified explicitly in the table's parallel_workers storage parameter.
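For example (shown on the demo bookings table purely as an illustration):

ALTER TABLE bookings SET (parallel_workers = 4);
-- and back to the value computed by the formula:
ALTER TABLE bookings RESET (parallel_workers);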

In any case, the number of processes will not exceed the max_parallel_workers_per_gather limit.

If we query a small table of 19 MB, only one worker process will be planned and launched (Workers Planned and Workers Launched):

EXPLAIN (analyze, costs off, timing off)
SELECT count(*) FROM flights;

                              QUERY PLAN
---------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=2 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Partial Aggregate (actual rows=1 loops=2)
               ->  Parallel Seq Scan on flights (actual rows=107434 lo...
(6 rows)

When querying a 105 MB table, only two processes are planned because of the max_parallel_workers_per_gather limit:

EXPLAIN (analyze, costs off, timing off)
SELECT count(*) FROM bookings;

                              QUERY PLAN
---------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Seq Scan on bookings (actual rows=703703 l...
(6 rows)

If we raise this limit, we get the calculated three processes:

ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();

EXPLAIN (analyze, costs off, timing off)
SELECT count(*) FROM bookings;

                              QUERY PLAN
---------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=4 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Partial Aggregate (actual rows=1 loops=4)
               ->  Parallel Seq Scan on bookings (actual rows=527778 l...
(6 rows)

If the number of available slots is less than the planned number of processes, only the available number of worker processes will be started.

Non-parallelizable queries

Not every query can be executed in parallel mode. The following cannot be parallelized:

  • Queries that modify or lock data (UPDATE, DELETE, SELECT FOR UPDATE and the like).

    Starting with PostgreSQL 11, this restriction does not apply to CREATE TABLE AS, SELECT INTO and CREATE MATERIALIZED VIEW, and in version 14, REFRESH MATERIALIZED VIEW was added to the list.

    However, row insertion in all these cases is still performed sequentially.

  • Queries that can be paused. This applies to queries executed in cursors, including FOR loops in PL/pgSQL.

  • Queries that call unsafe functions, that is, functions marked PARALLEL UNSAFE. By default this includes all user-defined functions and a few standard ones. The list of unsafe functions can be obtained from the system catalog:

    SELECT * FROM pg_proc WHERE proparallel = 'u';

  • Queries inside functions that are themselves called from a parallelized query (to avoid a recursive explosion of the number of worker processes).

Some of these restrictions may be removed in future versions of PostgreSQL. For example, version 12 introduced the ability to parallelize queries at the Serializable isolation level.

A query may not be executed in parallel for one of several reasons:

  • this query cannot, in principle, be parallelized under any circumstances;

  • a parallel plan is forbidden by configuration parameter values (including the limit on table size);

  • a parallel plan has a higher cost than a serial plan.

To find out whether a query can be parallelized at all, you can temporarily enable the force_parallel_mode parameter. This makes the planner build parallel plans whenever possible:

EXPLAIN SELECT * FROM flights;

                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

SET force_parallel_mode = on;

EXPLAIN SELECT * FROM flights;

                              QUERY PLAN
---------------------------------------------------------------------
 Gather  (cost=1000.00..27259.37 rows=214867 width=63)
   Workers Planned: 1
   Single Copy: true
   ->  Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(4 rows)

Limitedly parallelizable queries

In general, the larger the part of the plan that can be executed in parallel, the greater the possible gain. However, a number of operations, while not preventing parallelization as such, can only be performed sequentially, in the master process. In other words, they cannot appear in the plan tree below the Gather node.

Non-expanded subqueries. The most obvious example of such an operation is reading the result of a materialized common table expression (the CTE Scan plan node):

EXPLAIN (costs off)
WITH t AS MATERIALIZED (
  SELECT * FROM flights
)
SELECT count(*) FROM t;

         QUERY PLAN
----------------------------
 Aggregate
   CTE t
     ->  Seq Scan on flights
   ->  CTE Scan on t
(4 rows)

If the common table expression is not materialized (which became possible in PostgreSQL 12), the plan contains no CTE Scan node and this restriction does not apply.
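For instance, one would expect the following variation of the example above (my own illustration) to be planned without a CTE Scan node, leaving the scan free to be parallelized:

EXPLAIN (costs off)
WITH t AS NOT MATERIALIZED (
  SELECT * FROM flights
)
SELECT count(*) FROM t;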

That said, the common table expression itself may well be computed in parallel mode if that is advantageous:

EXPLAIN (costs off)
WITH t AS MATERIALIZED (
  SELECT count(*) FROM flights
)
SELECT * FROM t;

                    QUERY PLAN
--------------------------------------------------
 CTE Scan on t
   CTE t
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 1
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on flights
(7 rows)

The second such operation is the use of a non-expanded subquery, represented in the plan by a SubPlan node:

EXPLAIN (costs off)
SELECT *
FROM flights f
WHERE f.scheduled_departure > ( -- SubPlan
  SELECT min(f2.scheduled_departure)
  FROM flights f2
  WHERE f2.aircraft_code = f.aircraft_code
);

                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on flights f
   Filter: (scheduled_departure > (SubPlan 1))
   SubPlan 1
     ->  Aggregate
           ->  Seq Scan on flights f2
                 Filter: (aircraft_code = f.aircraft_code)
(6 rows)

The first two lines show the plan of the main query: the flights table is scanned sequentially and each row is checked against the filter. The filter condition includes a subquery, whose plan starts on the third line. So the SubPlan node is executed multiple times, in this case once per row of the sequential scan.

The top Seq Scan node in this plan cannot participate in the parallel execution because it uses the results of the SubPlan node.

Finally, the third operation is the computation of a non-expanded subquery represented in the plan by an InitPlan node:

EXPLAIN (costs off)
SELECT *
FROM flights f
WHERE f.scheduled_departure > ( -- SubPlan
  SELECT min(f2.scheduled_departure)
  FROM flights f2
  WHERE EXISTS ( -- InitPlan
    SELECT *
    FROM ticket_flights tf
    WHERE tf.flight_id = f.flight_id
  )
);

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on flights f
   Filter: (scheduled_departure > (SubPlan 2))
   SubPlan 2
     ->  Finalize Aggregate
           InitPlan 1 (returns $1)
             ->  Seq Scan on ticket_flights tf
                   Filter: (flight_id = f.flight_id)
           ->  Gather
                 Workers Planned: 1
                 Params Evaluated: $1
                 ->  Partial Aggregate
                       ->  Result
                             One-Time Filter: $1
                             ->  Parallel Seq Scan on flights f2
(14 rows)

Unlike SubPlan, the InitPlan node is computed only once (in this example, once for each execution of the SubPlan 2 node).

The parent node of InitPlan cannot participate in parallel execution (but nodes that use the result of InitPlan evaluation can, as in this example).

Temporary tables. Temporary tables cannot be scanned in parallel, since they are accessible only to the master process:

CREATE TEMPORARY TABLE flights_tmp AS SELECT * FROM flights;

EXPLAIN (costs off)
SELECT count(*) FROM flights_tmp;

          QUERY PLAN
------------------------------
 Aggregate
   ->  Seq Scan on flights_tmp
(2 rows)

Limitedly parallelizable functions. Calls to functions marked PARALLEL RESTRICTED can only be executed in the sequential part of the plan. The list of such functions can be obtained from the system catalog with the query:

SELECT * FROM pg_proc WHERE proparallel = 'r';

Marking your own functions as PARALLEL RESTRICTED (let alone PARALLEL SAFE) should be done with great care, after carefully studying the applicable restrictions.
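For example, a simple function with no side effects could be declared parallel safe (a hypothetical illustration, not from the original article):

CREATE FUNCTION add_one(x integer) RETURNS integer
AS $$ SELECT x + 1 $$
LANGUAGE sql
IMMUTABLE
PARALLEL SAFE;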

To be continued.
