Home High Performance Text index by quotes in memory on Go

Text index by quotes in memory on Go

by admin

Recently needed to implement a search on the beginning of a string, essentially WHERE name LIKE 'beginning%' This was a search by stock symbol name (AAPL, AMZN, EUR/USD, etc.).I wanted the search to work quickly, and not to load the database unnecessarily. In the end I came to the realization of tree search in memory, and I’ll tell you about it.

Text index by quotes in memory on Go

In my case, the search is done on about 55000 short strings (stock characters). That is, the index on these data without any problems fully fit into RAM. You just need to form it carefully, so you can quickly find the data on request.

Below is a description of the implementation of the search. I should note right away that this is not a balanced BTree, but a prefix tree (also bor, ray, loaded tree, trie), where at each level is another string character. Implementation of this variant of the tree is very simple and efficient enough for many tasks, when you want to index not long strings.

In this article I have tried to describe the search algorithm without sample code. I’ll give you the link to the code at the end of the article.

Tree structure

In each node of the tree we store the children elements as a sorted hash table and the value of the node itself as a list.

Text index by quotes in memory on Go

The value of a node is stored exactly as a list, in order to make it easier to handle cases where the values may contain different data with the same key. In our case these are, for example, symbols in different financial markets :

  • AAA (BetaShares Australian High Interest Cash ETF, ASX),

  • AAA (All Active Asset Capital LTD, LSE).

Then we will write one entry in Index.Data, inside of which in Index.Data there will be a list of two AAA characters.

In SearchIndex.Data you can store data of any structure. Thus, you can index any data by a string key.

Build a tree for searching

The tree needs to be built once, then you can search by this ready-made index.

Before the tree is built, the keys are pre-processed as follows.

  1. All delimiters are replaced by spaces.

  2. Double spaces are replaced by single spaces.

  3. Crops whitespace around the edges of lines.

  4. Lines are converted to lower case.

  5. Similar characters are reduced to the same format, e.g., ã → a.

  6. Excludes stop words (optional).

The data is then sorted, by default, in ascending key order, in alphabetical order.

The sorted data is then grouped by key to put in Index.Data the finished list.

Everything is now ready to add the data to the index. Our task now is to form Index.Children so that on each level of the tree lies another piece of the key (in our case a piece of the key is a character of the string being indexed). For example, if we add the AAPLsymbol to the index, we form this tree structure (red line)

Text index by quotes in memory on Go

Here on the red line are the elements [A], [A], [P], [L] at each vertex of the tree. Square brackets denote the keys that are used at each vertex for indexing. Without brackets there are full keys, which are obtained by going from the root to the vertex.

Since we form a tree, it is convenient to add to the index recursively. We add intermediate nodes if there are none. And in the tree sheet we add the value we are indexing.

In the same way, we sequentially add all values to the index.

Tree search

When the tree is already built, searching through it is a fairly easy task. To do this, let’s follow three steps.

  1. Let us pre-process the string we are looking for using the same algorithm as we did for the keys before indexing (see above).

  2. Find the element in the tree where the key matches the string you are looking for.

  3. Then by consecutive search of child nodes we get the result we are looking for.

For example, if we enter AA in the search, then for the tree described above we expect to get the result of such strings in this sequence :

  • AA

  • AAA

  • AAAL

  • AAALF

  • AAAP

  • AAB

  • AAP

  • AAPJ

  • AAPL

  • AAPT

To get this result, in the first step of the search we move to the vertex AA. In the second step we sequentially search all child vertices from left to right from top to bottom.

Finding the right exchange symbol

The above algorithm is good for finding a similar WHERE name LIKE 'beginning%' This was not enough to make it easy to find the right symbol in the financial markets, and the following points had to be taken into account.

  • If you type "EUR" in search, then the output should include "EUR", "EUR/USD", "USD/EUR". That is, the search must work not only from the beginning of the string, but from the beginning of each word in the string.

  • The search should work not only by the name of the symbol, but also by the name of the company. For example, if you enter "APL" in the search, you should get "APL", "AAPL" (Apple) in the results.

  • The first characters to appear in searches are popular characters.

To get not only "EUR", "EUR/USD", but also "USD/EUR" for "EUR", I decided to index several instances of values with different keys: substrings starting from each word of the indexed string. For example, when indexing the string "USD/EUR", the index includes the following keys: "usd eur", "eur". When indexing the string "Grupo Financiero Galicia SA", the index includes the keys "Grupo Financiero Galicia SA", "Financiero Galicia SA", "Galicia SA", "SA".

Also, in order to take into account the nuances described above, it was necessary to perform a 4-step search.

  1. Search for characters by exact match with the string you are looking for.

  2. Find and add to the results the most popular characters, which have the same beginning of the string as the one you entered for search.

  3. Search for characters by company name, adding the results to the ones obtained above.

  4. Find characters which have only the beginning matching the search string, and add those results to the final list.

To get the desired result, I was able to use the index described in the previous sections. For this purpose, I created three indexes.

  1. Index SearchSymbolIndex on symbols from all financial markets.

  2. Index SearchPopularIndex only by popular characters (10% of all characters).

  3. Index SearchInstrumentIndex by company name.

The following is just a sequential search for each index, with a slight difference in the criteria.

var searchedData []searchindex.SearchDatasearchedData = r.SearchSymbolIndex.Search(searchindex.SearchParams{Text: key, OutputSize: outputSize, Matching: searchsymbol.Strict, })searchedData = r.SearchPopularIndex.Search(searchindex.SearchParams{Text: key, OutputSize: outputSize, Matching: searchsymbol.Beginning, StartValues: searchedData, })searchedData = r.SearchSymbolIndex.Search(searchindex.SearchParams{Text: key, OutputSize: outputSize, Matching: searchindex.Beginning, StartValues: searchedData, })searchedData = r.SearchInstrumentIndex.Search(searchindex.SearchParams{Text: key, OutputSize: outputSize, Matching: searchindex.Beginning, StartValues: searchedData, })

StartValues - values found in the previous step, they are passed on to the next step, so that duplicates are not added to the output, and unnecessary iterations of the search are not performed if there is already enough data ( OutputSize ).

searchindex.Strict - search for exact matches.

searchindex.Beginning - searching for matches at the beginning of a string.

Total

All in all, we have a compact implementation of a fast line search in RAM. The code currently takes less than 300 lines. And the index implementation is pretty universal, so that it can be used for various similar tasks.

I haven't done big benchmarks, but with my data of 55000 rows it takes about 2 seconds to create 3 indexes, including database fetching and extra actions. And 4 successive iterations of three indexes search takes 100-200 nanoseconds (if you exclude http request processing time and consider only search time), which is more than enough for my task.

Code as ready-made package

This search is now used in the financial api for developers, an example can be seen on this page , where you can try it live.

Text index by quotes in memory on Go

You may also like