VTB recently replaced some hardware and software components of its document management system. The changes were too significant to go live without a full-scale stress test: any failure of the document management system threatens huge losses.
Intertrust specialists tested VTB's EDMS on Huawei equipment: a complex consisting of a server farm, a data network and solid-state storage. For the tests, we created an environment that reproduced real-world scenarios under the highest possible load. The results and conclusions are below.
Why a bank needs a document management system, and why test it
VTB’s EDMS is a complex software system supporting all the key management processes. The system provides common documentation services, electronic interaction and analytical functions. Properly organised document flow speeds up management decisions, ensures transparency and control over decision making, improves management quality and enhances the bank’s competitive edge.
An EDMS must ensure that decisions are executed precisely, in accordance with the adopted regulations. This requires high performance, fault tolerance and flexible scaling. The system faces demanding requirements for access control, the volume of simultaneously processed documents and the number of users: VTB's EDMS has 65,000 users, and this number keeps growing.
The system is constantly evolving: the architecture changes, and obsolete technologies are replaced with modern ones. Recently, some of the system components were replaced with domestically developed, import-independent software. Would the new EDMS architecture, based on CompanyMedia software and Huawei hardware, cope with the increased load? Only load testing could answer this question definitively, without waiting for such a situation to occur in production.
In addition to stress-testing the new software product, we had the following tasks:
- determine the exact horizontal and vertical sizing parameters of the servers under the bank's load profile;
- check the components for fault tolerance under high load;
- identify the entropy coefficient of inter-cluster interaction during horizontal scaling;
- try scaling read requests using the platform's load balancer;
- determine the horizontal scaling factor for all nodes and components of the system;
- determine the maximum feasible hardware parameters for servers of different roles (vertical scaling);
- investigate the application's load profile on the technical infrastructure and extrapolate the results for planning the information system's development;
- investigate how consolidating application data on a single storage system affects resource utilization, reliability and performance.
Methodology and equipment
Load testing of electronic document management systems is often conducted with simplified scenarios: they simulate quickly finding and opening document cards that are not linked to other files and have no lifecycle history. Such tests, as a rule, ignore access-rights checks and other resource-intensive factors characteristic of real-world environments.
Such tests, divorced from reality, are often run by the solution vendors themselves. This is understandable: the vendor wants to demonstrate high system performance and speed to the potential customer. No wonder simplified test models show record-breaking response times even when the number of users and documents grows significantly.
We needed to reproduce real operating conditions. So we first collected statistics for a whole month: we recorded user activity and observed the background operation of all services. The monitoring systems built into the EDMS were a great help here. Bank employees then corrected the resulting document-flow data, and we factored in the predicted growth of those flows.
The result was a testing methodology that simulated the processes running in the system, taking all real-world loads into account. We simulated the most common business operations and the most time-consuming queries, both individually and in various combinations. During the performance test, all components were placed under load: we calculated user access rights to system objects, opened documents with complex branched hierarchies and large numbers of links, searched across the system, and so on.
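The actual test harness is not public, but the approach can be sketched in Java: fire a batch of simulated user operations concurrently and collect per-operation latencies. The operation bodies and numbers here are placeholders, not the bank's real transactions:

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal load-generator sketch: run N simulated user operations in
// parallel and collect per-operation latencies in milliseconds.
public class LoadSketch {
    public static List<Long> run(int users, Callable<Void> operation) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < users; i++) {
            futures.add(pool.submit(() -> {
                long start = System.nanoTime();
                operation.call();                               // e.g. "open document card"
                return (System.nanoTime() - start) / 1_000_000; // elapsed ms
            }));
        }
        List<Long> latencies = new ArrayList<>();
        for (Future<Long> f : futures) latencies.add(f.get());
        pool.shutdown();
        return latencies;
    }

    public static void main(String[] args) throws Exception {
        // Simulate 50 concurrent users, each performing one 5 ms operation.
        List<Long> ms = run(50, () -> { Thread.sleep(5); return null; });
        ms.sort(null);
        System.out.println("p95 = " + ms.get((int) (ms.size() * 0.95 - 1)) + " ms");
    }
}
```

A real harness would, of course, mix operation types according to the measured profile rather than repeat one call.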
Load testing profile:
- registered users: 65,000, growing to 150,000;
- user logins (authentications): 50,000 per hour;
- users working simultaneously in the system: 10,000;
- documents registered: 10 million per year;
- file attachment volume: 1 TB per year;
- document approval processes: 1.5 million per year;
- approval participants' signatures (visas): 7.5 million per year;
- resolutions and assignments: 15 million per year;
- reports on resolutions and assignments: 15 million per year;
- custom tasks: 18 million per year.
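For a rough sense of scale, these yearly volumes can be converted into average hourly rates. Assuming 250 working days of 8 hours each (our assumption, not part of the published profile):

```java
public class ProfileRates {
    // Convert yearly volumes from the load profile into average hourly
    // rates, assuming 250 working days x 8 working hours per year.
    static long perHour(long perYear) {
        return perYear / (250L * 8L); // 2000 working hours per year
    }

    public static void main(String[] args) {
        System.out.println("documents registered/hour: " + perHour(10_000_000)); // 5000
        System.out.println("resolutions/hour:          " + perHour(15_000_000)); // 7500
        System.out.println("custom tasks/hour:         " + perHour(18_000_000)); // 9000
    }
}
```

Peak load is far above these averages, which is why the test profile also fixes 50,000 logins per hour and 10,000 concurrent users.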
The application data was consolidated on a single Huawei OceanStor Dorado 6000 V3 storage system with 117 SSD drives of 3.6 TB each, for a usable capacity of over 300 TB. Compute capacity was provided by a Huawei E9000 modular server system, and data was carried by a network built on Huawei CE-series data-center switches. During the test we monitored all system performance indicators around the clock. All results, including historical data, were recorded as graphs and tables for further analysis.
Load testing of the hardware infrastructure servers
Thanks to the high performance of the Huawei OceanStor Dorado 6000 V3 storage system, latency rarely exceeded 1 ms for any user request. This disk-subsystem speed inspired additional research: we decided to analyze the historical data to determine the impact of different workload types on the technical infrastructure. The results let us plan system development flexibly and accurately against hardware-platform requirements.
In terms of scaling, we checked:
- the vertical scaling limits of the application server (CMJ), with resources ranked by criticality: number and frequency of cores, amount of RAM;
- horizontal scaling support of the application server (CMJ) by duplicating functionally identical services and balancing between them;
- the vertical and horizontal scaling limits of the client server (Web GUI);
- the vertical scaling limits of the file storage (FS), with resources ranked by criticality: network bandwidth, disk speed;
- horizontal scaling support of the file storage (FS) via a distributed file system (Ceph, GlusterFS);
- the vertical scaling limits of the databases (PostgreSQL), with resources ranked by criticality: amount of RAM, disk speed, number and frequency of cores;
- horizontal scaling support of the databases (PostgreSQL): read load scaled across replica (slave) servers, write load scaled by splitting into functional modules;
- the vertical and horizontal scaling limits of the message broker (Apache Artemis);
- the vertical and horizontal scaling limits of the search servers (Apache Solr).
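The PostgreSQL read-scaling item boils down to routing: writes go to the primary, reads rotate across replicas. A minimal sketch of that routing logic (host names are illustrative; a production router would hand out JDBC DataSources and account for replica lag):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of read/write splitting: writes always hit the primary,
// reads are balanced round-robin across replica (slave) servers.
public class ReadWriteRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = replicas;
    }

    public String hostFor(boolean readOnly) {
        if (!readOnly || replicas.isEmpty()) return primary;
        // floorMod keeps the index valid even after the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}
```

Write scaling, by contrast, cannot be solved by balancing alone, which is why the plan splits the write load into functional modules instead.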
Problems and solutions
One of the main tasks was to identify possible problems with the functioning of the EDMS. In the course of the work we found the following bottlenecks and ways to eliminate them.
Locks on synchronous log writes. Logging in the default WildFly configuration is synchronous and severely affects performance. We decided to switch to an asynchronous logger and, at the same time, send logs to an ELK aggregation stack instead of writing them to disk.
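In WildFly, the switch to asynchronous logging can be expressed through the logging subsystem's async-handler. A hedged jboss-cli sketch (the handler names ASYNC and FILE are illustrative, and shipping to ELK is configured separately):

```
# jboss-cli sketch: wrap an existing file handler in an async handler
/subsystem=logging/async-handler=ASYNC:add(queue-length=512, overflow-action=BLOCK)
/subsystem=logging/async-handler=ASYNC:assign-subhandler(name=FILE)
/subsystem=logging/root-logger=ROOT:remove-handler(name=FILE)
/subsystem=logging/root-logger=ROOT:add-handler(name=ASYNC)
```

With this setup, application threads only enqueue log records; the actual I/O happens on the handler's own thread.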
Initialization of unnecessary sessions when working with the data storage. Every thread that fetched data from the repository initialized the SSO authentication session twice. This operation is resource-intensive: it greatly increases the execution time of a user request and reduces overall server throughput.
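A common fix for this pattern is to initialize the session lazily and reuse it. A sketch with hypothetical names (the real SSO client and session type are not shown in the article; the counter exists only to make the single initialization visible):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: initialize the (expensive) SSO session once per thread and
// reuse it on every repository call, instead of re-creating it each time.
public class SsoSessionCache {
    static final AtomicInteger initCount = new AtomicInteger(); // for illustration only

    static final ThreadLocal<String> SESSION = ThreadLocal.withInitial(() -> {
        initCount.incrementAndGet();
        // Stand-in for the expensive SSO handshake
        return "sso-session-" + Thread.currentThread().getId();
    });

    static String fetchFromRepository(String docId) {
        String session = SESSION.get(); // reused on every call in this thread
        return "doc:" + docId + " via " + session;
    }
}
```

However the fix is implemented, the point is the same: one session per thread (or per pool slot), not one per call, and certainly not two.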
Locks when working with application cache objects. The application used fairly heavyweight ReentrantLock locks, which hurt query execution speed. The lock type was changed to StampedLock (available since Java 8), which significantly reduced the time spent working with cache objects.
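The gain comes from StampedLock's optimistic read mode: readers take no lock on the happy path and only fall back to a pessimistic read lock when a concurrent write invalidates the stamp. A minimal cache-entry sketch (the cached `long` stands in for real cache objects):

```java
import java.util.concurrent.locks.StampedLock;

// Sketch of the cache-lock change: optimistic reads via StampedLock.
public class CacheEntry {
    private final StampedLock lock = new StampedLock();
    private long value;

    public long read() {
        long stamp = lock.tryOptimisticRead(); // no blocking on the hot path
        long current = value;
        if (!lock.validate(stamp)) {           // a write intervened; retry pessimistically
            stamp = lock.readLock();
            try {
                current = value;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return current;
    }

    public void write(long newValue) {
        long stamp = lock.writeLock();
        try {
            value = newValue;
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
```

Note that StampedLock is not reentrant, so this swap is only safe where the old code never re-acquired the same lock on one thread.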
After that we ran the load tests again to determine the average time of typical operations in the EDMS with relational storage under the bank's profile. The results:
- user authorization in the system – 400 ms;
- viewing execution progress – 2.5 s on average;
- creating a registration and control card – 1.4 s;
- registering the registration and control card – 600 ms;
- creating a resolution – 1 s.
In addition to identifying problems, the load testing also confirmed some of our assumptions.
- The system performs much better on the Linux family of operating systems.
- The stated resiliency principles work at the level of each component in "hot" mode.
- The key component, the business logic service (which accepts user requests), has "mirrored" horizontal scaling, with nearly linear throughput growth as the number of instances increases.
- Optimal sizing of the business logic service for 1,200 concurrent users is 8 vCPUs for the service and 1.5 vCPUs for the DBMS.
- Consolidation of application data on a single storage system significantly increases performance and reliability, and improves scalability.
We’ll be happy to answer your questions in the comments – maybe there are some aspects of the testing you’d like to know more about.