
Indexes in PostgreSQL – 8

by admin

We have already covered the PostgreSQL indexing engine, the access method interface, and all the basic access methods: hash indexes, B-trees, GiST, SP-GiST, and GIN. In this part, let's look at how gin turned into rum.

RUM

Even though the authors claim that gin is a powerful spirit, the drink theme won out in the end: the next-generation GIN was named RUM.
This access method builds on the ideas behind GIN and allows full-text search to be performed even faster. It is the only method in this series that is not included in standard PostgreSQL but is a third-party extension. There are several options for installing it:

  • Get the yum or apt package from the PGDG repository. For example, if you installed PostgreSQL from the postgresql-10 package, also install postgresql-10-rum (a quick way to check the result is shown after this list).
  • Build and install it yourself from the source code on GitHub (instructions are available there as well).
  • Use it as part of Postgres Pro Enterprise (or at least read the documentation there).
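Whichever installation option you choose, the result can be checked from psql. Here is a minimal sketch (the output will depend on the package you installed; the extension itself is created in a database a bit further down in this article):
postgres=# select name, default_version, installed_version
from pg_available_extensions
where name = 'rum';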

GIN restrictions

Which GIN index restrictions does RUM allow us to overcome?
First, the tsvector data type contains not only the lexemes themselves but also information about their positions within the document. As we saw last time, the GIN index does not store this information. Because of this, phrase search operations, introduced in version 9.6, are handled inefficiently by GIN: it has to go back to the original data for rechecking.
Second, search engines usually return results in order of relevance (whatever that means). The ranking functions ts_rank and ts_rank_cd can be used for this, but they have to be computed for every row of the result, which is, of course, slow.
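To make the cost visible: the ranking functions take the tsvector of the document itself as an argument, so the document data has to be read and processed for every row of the result. A minimal sketch (the returned number is not shown here; it depends on the text search configuration):
postgres=# select ts_rank(to_tsvector('Grandma for grandpa, grandpa for turnip...'),
to_tsquery('grandpa'));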
To a first approximation, the RUM access method can be thought of as GIN with positional information added, which also returns results in the desired order (similar to the way GiST can return nearest neighbors). Let's take these points in order.

Phrase search

A full-text search query can contain special constructions that take the distance between lexemes into account. For example, you can find documents in which grandma and grandpa are separated by exactly one other word:
postgres=# select to_tsvector('Grandma for grandpa, grandpa for turnip...') @@
to_tsquery('grandma <2> grandpa');
?column?
----------
t
(1 row)

Or specify that the words must immediately follow one another:
postgres=# select to_tsvector('Grandma for grandpa, grandpa for turnip...') @@
to_tsquery('grandpa <-> grandpa');
?column?
----------
t
(1 row)

A regular GIN index can find documents containing both lexemes, but the distance between them can only be checked by looking into the tsvector itself:
postgres=# select to_tsvector('Grandma for grandpa, grandpa for turnip...');
to_tsvector
------------------------------
'grandma':1 'grandpa':3,4 'turnip':6
(1 row)

In the RUM index, each lexeme references not just the table rows: together with each TID, the index also stores the list of positions at which the lexeme occurs in the document. Here is how you can picture an index created on the "white birch" table we are already familiar with (by default, the rum_tsvector_ops operator class is used for tsvector):
postgres=# create extension rum;
CREATE EXTENSION
postgres=# create index on ts using rum(doc_tsv);
CREATE INDEX

[Figure: structure of the RUM index on the doc_tsv column]
The gray squares in the figure represent the added positional information:
postgres=# select ctid, doc, doc_tsv from ts;
 ctid  |                 doc                   |             doc_tsv
-------+---------------------------------------+-----------------------------------
 (0,1) | There was a birch tree in the field   | 'birch':3 'gender':2 'standing':4
 (0,2) | There was a curly birch in the field  | 'curly':3 'floor':2 'standing':4
 (0,3) | luli, luli, standing                  | 'lul':1,2 'standing':3
 (0,4) | Lully, lully, stood                   | 'lul':1,2 'standing':3
 (1,1) | No one to break the birch             | 'birch':2 'break':3 'not':1
 (1,2) | There is no one to lock up            | 'lock up':3 'curly':2 'nec':1
 (1,3) | Lyuli, lyuli, zalomati                | 'zalomat':3 'lul':1,2
 (1,4) | Lulli, lulli, llomati                 | 'llomat':3 'lul':1,2
 (2,1) | I'll go for a walk                    | 'go for a walk':3 'go':2
 (2,2) | white birch I will break              | 'white':1 'birch':2 'break':3
 (2,3) | luli, luli, zalomayu                  | 'zaloma':3 'lul':1,2
 (2,4) | luli, luli, zalomayu                  | 'zaloma':3 'lul':1,2
(12 rows)

Incidentally, unlike GIN, which can defer insertions when the fastupdate parameter is enabled, RUM does not provide this functionality.
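For reference, this is what the GIN counterpart looks like: the fastupdate storage parameter controls delayed insertion for GIN, while a RUM index simply has no such parameter (a sketch using the ts table from above; the index name is made up):
postgres=# create index doc_tsv_gin on ts using gin(doc_tsv) with (fastupdate = on);
CREATE INDEX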
To see how the index works on real data, let's use the already familiar archive of the pgsql-hackers mailing list.
fts=# alter table mail_messages add column tsv tsvector;
ALTER TABLE
fts=# set default_text_search_config = default;
SET
fts=# update mail_messages
set tsv = to_tsvector(body_plain);

UPDATE 356125

Here is how a query using phrase search runs with a GIN index:
fts=# create index tsv_gin on mail_messages using gin(tsv);
CREATE INDEX
fts=# explain (costs off, analyze)
select * from mail_messages where tsv @@ to_tsquery('hello <-> hackers');
QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on mail_messages (actual time=2.490..18.088 rows=259 loops=1)
   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
   Rows Removed by Index Recheck: 1517
   Heap Blocks: exact=1503
   ->  Bitmap Index Scan on tsv_gin (actual time=2.204..2.204 rows=1776 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
 Planning time: 0.266 ms
 Execution time: 18.151 ms
(8 rows)

As you can see from the plan, the GIN index is used, but returns 1776 potential matches, of which 259 remain and 1517 are discarded in the recheck step.
Now let's drop the GIN index and build a RUM index instead.
fts=# drop index tsv_gin;
DROP INDEX
fts=# create index tsv_rum on mail_messages using rum(tsv);
CREATE INDEX

The index now contains all the necessary information, and the search is exact:
fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello <-> hackers');
QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on mail_messages (actual time=2.798..3.015 rows=259 loops=1)
   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
   Heap Blocks: exact=250
   ->  Bitmap Index Scan on tsv_rum (actual time=2.768..2.768 rows=259 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
 Planning time: 0.245 ms
 Execution time: 3.053 ms
(7 rows)

Sorting by relevancy

To return documents in the right order, the RUM index supports ordering operators, which we discussed in the part about GiST. The rum extension defines the <=> operator, which returns the distance between a document (tsvector) and a query (tsquery). For example:
fts=# select to_tsvector('Grandma for grandpa, grandpa for turnip...') <=> to_tsquery('turnip');
?column?
----------
16.4493
(1 row)
fts=# select to_tsvector('Grandma for grandpa, grandpa for turnip...') <=> to_tsquery('grandpa');
?column?
----------
13.1595
(1 row)

The document turned out to be more relevant to the second query than to the first (the distance is smaller): the more often a word occurs in a document, the more relevant the document is to a query for that word.
Let's again compare GIN and RUM on a relatively large amount of data by selecting the ten most relevant documents containing "hello" and "hackers" (with the tsv_gin index rebuilt for the purposes of the comparison).
fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello hackers')
order by ts_rank(tsv, to_tsquery('hello hackers'))
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit (actual time=27.076..27.078 rows=10 loops=1)
   ->  Sort (actual time=27.075..27.076 rows=10 loops=1)
         Sort Key: (ts_rank(tsv, to_tsquery('hello hackers'::text)))
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on mail_messages (actual ... rows=1776 loops=1)
               Recheck Cond: (tsv @@ to_tsquery('hello hackers'::text))
               Heap Blocks: exact=1503
               ->  Bitmap Index Scan on tsv_gin (actual ... rows=1776 loops=1)
                     Index Cond: (tsv @@ to_tsquery('hello hackers'::text))
 Planning time: 0.276 ms
 Execution time: 27.121 ms
(11 rows)

The GIN index returns 1,776 matches, which then have to be sorted separately to select the ten best.
With the RUM index, the query is executed as a simple index scan: no unnecessary documents are fetched, and no separate sorting is required:
fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello hackers')
order by tsv <=> to_tsquery('hello hackers')
limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit (actual time=5.083..5.171 rows=10 loops=1)
   ->  Index Scan using tsv_rum on mail_messages (actual ... rows=10 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello hackers'::text))
         Order By: (tsv <=> to_tsquery('hello hackers'::text))
 Planning time: 0.244 ms
 Execution time: 5.207 ms
(6 rows)

Additional information

A RUM index, like GIN, can be built on several fields. But while GIN stores the lexemes of different columns independently of each other, RUM allows the main field (tsvector in our case) to be "linked" with an additional one. To do this, the special operator class rum_tsvector_addon_ops is used:
fts=# create index on mail_messages using rum(tsv rum_tsvector_addon_ops, sent)
with (attach='sent', to='tsv');
CREATE INDEX

This index can be used to return results sorted by the additional field:
fts=# select id, sent, sent <=> '2017-01-01 15:00:00'
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
   id    |        sent         | ?column?
---------+---------------------+----------
 2298548 | 2017-01-01 15:03:22 |      202
 2298547 | 2017-01-01 14:53:13 |      407
 2298545 | 2017-01-01 13:28:12 |     5508
 2298554 | 2017-01-01 18:30:45 |    12645
 2298530 | 2016-12-31 20:28:48 |    66672
 2298587 | 2017-01-02 12:39:26 |    77966
 2298588 | 2017-01-02 12:43:22 |    78202
 2298597 | 2017-01-02 13:48:02 |    82082
 2298606 | 2017-01-02 15:50:50 |    89450
 2298628 | 2017-01-02 18:55:49 |   100549
(10 rows)

Here we are looking for matching rows that are as close as possible to the specified date, whether earlier or later. To get results strictly before (or after) the given date, the <=| (or |=>) operator has to be used.
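For example, to restrict the results to dates on the "earlier" side of the given one, the query above can presumably be rewritten with the left-distance operator (a sketch; the output is omitted):
fts=# select id, sent
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=| '2017-01-01 15:00:00'
limit 10;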
The <=> query above is, as we would expect, executed as a simple index scan:
fts=# explain (costs off)
select id, sent, sent <=> '2017-01-01 15:00:00'
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------
 Limit
   ->  Index Scan using mail_messages_tsv_sent_idx on mail_messages
         Index Cond: (tsv @@ to_tsquery('hello'::text))
         Order By: (sent <=> '2017-01-01 15:00:00'::timestamp without time zone)
(4 rows)

If we had created the index without the additional information linking the fields, all the results returned by the index would have to be sorted for a similar query.
Of course, data types other than dates can be added to a RUM index: almost all the basic types are supported. For example, an online store could quickly display products by newness (date), price (numeric), popularity or discount size (integer or floating point).
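For instance, a hypothetical goods table could link its tsvector with the date a product was added, in exactly the same way as mail_messages above (a sketch; the table, its columns, and the query are made up for illustration):
fts=# create table goods(id integer, descr text, descr_tsv tsvector,
added timestamp, price numeric);
CREATE TABLE
fts=# create index on goods using rum(descr_tsv rum_tsvector_addon_ops, added)
with (attach='added', to='descr_tsv');
CREATE INDEX
fts=# select id, added, price
from goods
where descr_tsv @@ to_tsquery('phone')
order by added <=> '2018-06-01 00:00:00'
limit 10;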

Other classes of operators

To complete the picture, it’s worth mentioning the other classes of operators available.
Let's start with rum_tsvector_hash_ops and rum_tsvector_hash_addon_ops. They are similar to rum_tsvector_ops and rum_tsvector_addon_ops discussed above, but the index stores the hash code of the lexeme rather than the lexeme itself. This may reduce the size of the index, but it makes the search less accurate and requires rechecking. Besides, the index no longer supports partial match search.
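Creating such an index looks the same; only the operator class is specified explicitly (a sketch: the index name is made up, and query plans would be expected to include a recheck step):
fts=# create index tsv_rum_hash on mail_messages using rum(tsv rum_tsvector_hash_ops);
CREATE INDEX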
The rum_tsquery_ops operator class is a curious one. It allows you to solve the "reverse" problem: finding queries that match a given document. Why would you need this? For example, to subscribe a user to new products according to their filter, or to automatically classify new documents. Here is a simple example:
fts=# create table categories(query tsquery, category text);
CREATE TABLE
fts=# insert into categories values
(to_tsquery('vacuum | autovacuum | freeze'), 'vacuum'),
(to_tsquery('xmin | xmax | snapshot | isolation'), 'mvcc'),
(to_tsquery('wal | (write & ahead & log) | durability'), 'wal');
INSERT 0 3
fts=# create index on categories using rum(query);
CREATE INDEX
fts=# select array_agg(category)
from categories
where to_tsvector(
'Hello hackers, the attached patch greatly improves performance of tuple
freezing and also reduces size of generated write-ahead logs.'
) @@ query;
array_agg
--------------
{vacuum,wal}
(1 row)

That leaves the rum_anyarray_ops and rum_anyarray_addon_ops operator classes; they are designed to work not with tsvector but with arrays. This was already covered last time for GIN, so there is no need to repeat it here.
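Just to show the shape of it, an array column is indexed like this (a sketch on a made-up table; the array operators covered last time for GIN, such as @> and &&, are expected to work here as well):
fts=# create table routes(source text, destinations text[]);
CREATE TABLE
fts=# create index on routes using rum(destinations rum_anyarray_ops);
CREATE INDEX
fts=# select * from routes where destinations @> array['Moscow'];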

Index size and write-ahead log

Clearly, since RUM stores more information than GIN, it will take more space. Last time we compared the sizes of different indexes; let's add RUM to that table:
  rum   |  gin   |  gist  | btree
--------+--------+--------+--------
 457 MB | 179 MB | 125 MB | 546 MB

As you can see, the size has grown quite noticeably; that is the price to pay for fast search.
Another not-so-obvious point worth noting is that RUM is an extension, that is, it can be installed without any changes to the core of the system. This became possible in version 9.6 thanks to a patch by Alexander Korotkov that added generic WAL records. One of the problems that had to be solved was the generation of write-ahead log entries. The logging mechanism must be absolutely reliable, so an extension cannot simply be let loose in this kitchen. Instead of allowing an extension to create its own types of WAL records, it works like this: the extension code declares its intention to modify a page, makes any changes to it, and signals that it is done; the core then compares the old and new versions of the page and generates the necessary generic WAL records itself.
The current algorithm compares the pages byte by byte, finds the changed fragments, and logs each such fragment along with its offset from the beginning of the page. This works well both when only a few bytes have changed and when the whole page has changed. But if a fragment is inserted into the page and the rest of the content is shifted down (or a fragment is removed and the content is shifted up), formally many more bytes change than were actually added or deleted.
Because of this, an actively changing RUM index can generate considerably more WAL than GIN (which, being part of the core rather than an extension, manages its own WAL records). The extent of this unpleasant effect depends heavily on the actual workload, but to get a feel for the problem, let's delete and add some rows several times, interspersed with vacuuming. The amount of WAL generated can be estimated as follows: remember the position in the log at the beginning and at the end using the pg_current_wal_lsn function (pg_current_xlog_location before version 10), and then look at the difference.
Of course, many factors have to be kept in mind here. We need to make sure that only one user is working with the system, otherwise "extra" records will be included in the count. Even then, we count not only RUM but also changes to the table itself and to the index that supports the primary key. The values of configuration parameters also matter (here the replica WAL level was used, without compression). But let's try anyway.
fts=# select pg_current_wal_lsn() as start_lsn \gset
fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv
from mail_messages where id % 100 = 0;
INSERT 0 3576
fts=# delete from mail_messages where id % 100 = 99;
DELETE 3590
fts=# vacuum mail_messages;
VACUUM
fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv
from mail_messages where id % 100 = 1;
INSERT 0 3605
fts=# delete from mail_messages where id % 100 = 98;
DELETE 3637
fts=# vacuum mail_messages;
VACUUM
fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
select parent_id, sent, subject, author, body_plain, tsv from mail_messages
where id % 100 = 2;
INSERT 0 3625
fts=# delete from mail_messages where id % 100 = 97;
DELETE 3668
fts=# vacuum mail_messages;
VACUUM
fts=# select pg_current_wal_lsn() as end_lsn \gset
fts=# select pg_size_pretty(:'end_lsn'::pg_lsn - :'start_lsn'::pg_lsn);
pg_size_pretty
----------------
3114 MB
(1 row)

So, we got about 3 GB. If the same experiment is repeated with a GIN index, the result is only about 700 MB.
So a different algorithm would be desirable, one that finds the minimal set of insert and delete operations needed to transform one page state into another, similar to how the diff utility works. Such an algorithm has already been implemented by Oleg Ivanov, and his patch is under discussion. In the example above, this patch reduces the amount of WAL by half, to 1900 MB, at the cost of a small slowdown.

Properties

Traditionally, let's look at the properties of the rum access method (the queries were shown earlier in the series), noting the differences from gin.
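As a reminder, the method-level properties can be queried roughly like this, following the pattern from the earlier parts (the index- and column-level tables below are obtained analogously with pg_index_has_property and pg_index_column_has_property):
fts=# select a.amname, p.name, pg_indexam_has_property(a.oid, p.name)
from pg_am a,
unnest(array['can_order','can_unique','can_multi_col','can_exclude']) p(name)
where a.amname = 'rum';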
Method properties:
amname | name | pg_indexam_has_property
--------+---------------+-------------------------
rum | can_order | f
rum | can_unique | f
rum | can_multi_col | t
rum | can_exclude | t -- f for gin

Index properties:
name | pg_index_has_property
---------------+-----------------------
clusterable | f
index_scan | t -- f for gin
bitmap_scan | t
backward_scan | f

Note that RUM, unlike GIN, supports index scans; otherwise it would be impossible to return exactly the required number of results in queries with a LIMIT clause. Accordingly, there is no need for an analogue of the gin_fuzzy_search_limit parameter. And, as a consequence, the index can be used to support exclusion constraints.
Column-level properties:
name | pg_index_column_has_property
--------------------+------------------------------
asc | f
desc | f
nulls_first | f
nulls_last | f
orderable | f
distance_orderable | t -- f for gin
returnable | f
search_array | f
search_nulls | f

The difference here is that RUM supports ordering operators, although not for all operator classes: for tsquery_ops, for example, it will be false.
To be continued.
