There has already been published recently on the hubra description of the integration PostgreSQL and MSSQL.But, the details there is categorically lacking. Therefore, the purposes of this publication are as follows :
- expand and deepen the publicly available information about FDW for MSSQL called tds_fdw : talk about the difference in major versions and describe basic compatibility issues;
- describe the optimization capabilities of queries that use external tables;
- touch on the topic of caching external data in materialized views;
- say a few words about exotic approaches to integrating PostgreSQL and MSSQL.
TDS FDW Installation and Setup
The guys at PostgresPro have had enough said about this process, I won’t repeat myself. I’ll just leave a few links to the official PostgreSQL documentation and examples from tds_fdw:
- Create an extension in PG: offdocumentation ;
- server creation : offline documentation , example from tds_fdw ;
- creating a user mapping (user mapping): offdocumentation , example from tds_fdw ;
- creating a foreign table (foreign table): offdocumentation , example from tds_fdw
One more thing : please don’t do as stated in tds_fdw installation instructions
sudo make USE_PGXS=1 install
Save the kittens pack up the deb and enjoy life:
sudo USE_PGXS=1 checkinstall
Differences between major versions of TDS FDW
At the moment there are two current versions of FDW: stable 1.0.7 and 2.0.0-alpha, which is in fact the master branch and where everything interesting happens. Here is a small list of their differences :
- 2.0.0 finally has pushdown support for conditions in the WHERE block that relate directly to the outside table; however, it does not work well with the query inside the foreign_table declaration;
- support for tds 7.4 appeared (I’ll describe below why this is necessary and important);
- there are some problems with DISTINCT on an external table ( here is an issue on GitHub ), although the truth is not known: whether my hands grow from the wrong place or the bug is pretty tricky and occurs only under certain circumstances.
Until recently, tds_fdw did not work with tds versions above 7.3. But in the course of writing this article, support for version 7.4 had to to find So now, starting with commit 3a803c , tds_fdw supports all current versions of tds.
So why is support for this version so important? Personally for me it’s important because I need to work with MSSQL 2012. To make a long story short: Ruby on Rails uses the following library to connect to MSSQL activerecord-sqlserver-adapter which in turn uses the tiny_tds , which uses FreeTDS , which knows how to communicate with MSSQL. The trouble is, that for RoR 3 and corresponding major versions of libraries, using tds 7.1 version is nailed and you can change it via config only in 4+ version. At the same time version 7.1 works fine with MSSQL 2008, but when communicating with MSSQL 2012 the following errors appear :
DB-Library error: DB #: 20017, DB Msg: Unexpected EOF from the server
ActiveRecord::LostConnection: TinyTds::Error: closed connection: ...
TinyTds::Error: Adaptive Server connection failed
And the like.
I wanted to avoid them by switching to using FDW, because updating RoR is categorically longer and more expensive. But tds_fdw didn’t support the right version and had to do something about it.
As for the bugs, they all appear randomly and grow from the same place; they add some "variety" to the application, making it crash at random places at random times. The only cure for this mess is using the right version of tds. For MSSQL 2012 this is tds 7.4.
Here is the first problem: tds 7.4 support is implemented in FreeTDS since version 0.95. But out of the box in Ubuntu 14.04 and 16.04 come versions 0.91-5 and 0.91-6.1build1 respectively. And you can get a newer version of FreeTDS in two ways :
In the second case, there is a nuance: in this repository, there is a package only for Ubuntu 14.04 (which is trusty). For 16.04 (which is xenial), there is nothing there. But, in general, nothing is fatal and if you correct in 16.04 /etc/apt/sources.list.d/jamiewillis-freetds-trusty.list to something like this :
deb http://ppa.launchpad.net/jamiewillis/freetds/ubuntu trusty main
Then it will be possible to install the package in the latest Ubuntu (and yes, it works without problems).
If you have CentOS, under that you can easy to find FreeTDS up to and including version 0.95. Anything older will have to be built from source.
Temporary solution to compatibility problem
If error number 20017 and its derivatives really bother you and you can not use required tds version, you can handle exception thrown by PostgreSQL and restart block/method/etc that address MSSQL via FDW. In my case for a RoR application, it looked like this :
def retry_mssql_operation(tries = 5)beginyieldrescue ActiveRecord::StatementInvalid => eif e.message =~ /^PG::FdwUnableToCreateExecution/ tries > 0tries -= 1retryelseraiseendendend
Saves the day at first, but for a long-term solution is categorically unsuitable.
A little bit about pushdown and how FDW works "on your fingers"
Before we start with optimization questions I want to say some words about pushdown. Somehow this mechanism is not in demand in Russian-speaking resources (or I am not familiar with its correct translation, and Triceps push down on the block that’s obviously not the one). That is why I would like to tell you about it briefly.
In the simplest case, when we run a query in PG of the form :
SELECT column_nameFROM foreign_table WHERE column_id = 42;
In fact, the following happens in the database :
- From a table (or non-table) associated with a foreign_table on a foreign server, all the content is extracted to postgres;
- then, the retrieved data is filtered based on the conditions from the WHERE.
Not a very efficient scheme, especially if you want to get just one row out of a table with a few million rows. And this is where pushdown comes in. This mechanism allows you to reduce the number of rows that we get from the remote server. This is done by constructing query to external database considering what we want on PG side, i.e. considering what is specified in WHERE, JOIN, ORDER etc. In other words, FDW parses the original query in PotsgreSQL, select from it what can understand the remote data warehouse and build a new query, according to these conditions. An obvious consequence: pushdown is not applicable for all FDWs (e.g. for file_fdw pushdown is almost useless, but for postgres_fdw or tds_fdw – exactly the opposite).
Bottom line : pushdown is cool, it allows you to use external storage mechanisms, it reduces the amount of data circulating between PG and external storage, thereby speeding up queries, but it is a separate mechanism, so it needs to be implemented, maintained and it’s quite a non-trivial task.
Acceleration of queries
We are done with installation, configuration, and matey. Now, let’s start describing how we can quickly extract data from MSSQL.
This approach is useful for simple queries that are not burdened with JOINs and other SQL tricks. The latest version of tds_fdw (currently 2.0.0-alpha) has support for simple pushdown for WHERE.
Let’s look at the table for an example simple_table from MSSQL database. This table has two fields: id and data. The definition of the external table for it will be as follows :
CREATE FOREIGN TABLE mssql_table (id integer, custom_data varchar OPTIONS (column_name 'data'))SERVER mssql_svrOPTIONS (schema_name 'dbo', table_name 'simple_table', row_estimate_method 'showplan_all', match_column_names '1');
In this case, the first column has the same name in PostgreSQL and in MSSQL: id. The second column has different names in PG and in MSSQL, so the option column_name This parameter explicitly sets the mapping of columns from PostgreSQL to columns in MSSQL. Also, at the end there is the parameter match_column_name , which is responsible for implicit naming mapping of column names, that is, thanks to it the id column is mapped.
That’s it, now if you run the query :
SELECT custom_data FROM mssql_table WHERE id = 42;
FDW must handle the condition specified in WHERE and assemble the correct query in MSSQL. For example this :
SELECT data FROM simple_table WHERE id = 42;
In the case of tds_fdw version 1.0.7 and below the query in MSSQL will be different :
SELECT id, data FROM simple_table;
Once again : pushdown, at the moment, only works for WHERE;for JOIN, ORDER and other functions like MAX, LOWER, etc.it won’t take off.
One more thing: how do I know which query was actually executed on the MSSQL side? When I use FDW for example for MySQL, I get this line in the explain :
Remote query: SELECT `id`, `customer_id`, `order_date` FROM `data`.`orders`
And that’s handy.In tds_fdwthere is no such thing yet and you have to go a longer way through FreeTDS logs. By default, FreeTDS logs are disabled, but this is easily fixed by digging into /etc/freetds/freetds.conf There you can find lines like this :
; dump file = /tmp/freetds.log; debug flags = 0xffff
Which have to have a semicolon removed at the beginning.
Now for any query in MSSQL from PG, FreeTDS will log everything it can. This will slow down execution of all external queries and may create a lot of logs (in my case regular SELECT made ~300Mb log, and JOIN barely managed ~1.5Gb). But the logs will show what actually happened in MSSQL. Moreover, you can reduce logging volume by playing with `debug flags`. More about logging in FreeTDS write here , and the details about `debug flags` lie here
Materialized representation (hereinafter MV) is a regular view + table with data. This approach will help in case of complex queries with joins of external and internal tables, with functions, preference and courtesans.
The upside of MV is that it is a native PG object, which means that it works perfectly with the rest of PostgreSQL and can be indexed and parsed regardless of the data source that populated it. There are also disadvantages: MV needs to be updated. It can be updated by internal triggers, by external events, it can be completely recreated and so on. But, in any case, MV generates lag between PG and data source.
For the above external table MV can be created as follows :
CREATE MATERIALIZED VIEW materialized_mssql_table ASSELECT id, custom_dataFROM mssql_table;
Now all the data from MSSQL is in PostgreSQL, which means you can index it however you want ( B-tree , GIN and GiST etc.), it becomes available to them to statistics , you can see details about the query execution plan and many more nice things from PG.
You can update the MV with the standard INCERT/UPDATE/DELETE commands, or just recreate all the content with
REFRESH MATERIALIZED VIEW CONCURRENTLY materialized_mssql_table;
The CONCURRENTLY option allows you to update the MV without blocking competing read requests, but requires more time and resources. Also, the target MV must meet some requirements to be able to use CONCURRENTLY. These can be found at the corresponding documentation page
To be honest, the devil knows whether this approach can take off, maybe the esteemed public will tell something interesting on this subject. In any case, I think it should be said about it, because on profile resources most questions about integrating two databases are answered with "use FDW" and no variety is expected, even if you wish the strange.
So, when you might need this : in case all the above options did not work due to various constraints. For example :
- obscenely large external database and impossibility to clone it in PG;
- strict performance requirements and the availability of an optimal query to an external database;
- If you want to use parameterized query, which is analogous to query option for FDW, but with dynamic parameter, for example you want to use full-text search on MSSQL side via CONTAINS ;
- Anything else out of the ordinary.
What to use : dbi-link or dblink-tds These are analogues of dblink but with support for multiple DBMS: PostgreSQL, MySQL, MSSQL Server and Oracle in case of dbi-link and just TDS in case of dblink-tds.
How it works: as a kind of highly specialized analogue of FDW in the form of a PG function which collects internally the necessary query based on the arguments passed, executes it in the external database through the above mentioned tools, gets data, processes it and returns it to the PG as pipeline-function That is, hypothetically, you can execute exactly the query that you want and present its result in a form digestible for further processing in PG.
All of the above are pure theoretical considerations. If you have actual experience using these or similar tools, please share your knowledge with the world.
At the moment, there is only one simple and working solution for docking PostgreSQL and MSSQL. It’s tds_fdw It has a lot of flaws, but the project is developing, bugs are being fixed, features are being rolled out and that’s great. So tds_fdw may solve most of the problems related to fetching data from MSSQL via PG. For those who want faster, more optimized and with courtesans, PostgreSQL and its rich arsenal of optimization tools will help. And those who want very strange and want to do everything inside the database with a minimum of external services will have a hard time. The toolkit is ancient, there is no documentation, no support,
is populated by robots and nothing but reading source code will help you.