Let the deluge fall, but 1C must work! Agreeing with the business on DR

by admin

Imagine that you maintain the IT infrastructure of a large shopping mall. It starts raining in the city. Torrents of rain break through the roof, and water fills the retail space ankle-deep. We hope your server room is not in the basement, otherwise problems are unavoidable.
This story is not fantasy but a composite of a couple of events from 2020. Large companies keep a disaster recovery plan (DRP) on hand at all times. In corporations, business continuity specialists are responsible for it. In small and medium-sized companies, however, these tasks fall to the IT department. You have to work out the business logic yourself, understand what can fail and where, come up with protection, and implement it.
It’s great if an IT professional can negotiate with the business and discuss the need for protection. But more than once I have seen a company save money on a disaster recovery (DR) solution because it considered the solution unnecessary. When a disaster struck, the long recovery threatened losses, and the business was not prepared for it. You can say "I told you so" as much as you want, but it will still be up to the IT department to restore services.
Speaking as an architect, I will explain how to avoid this situation. In the first part of the article I will cover the preparatory work: how to discuss with the customer the three questions that determine the choice of protection tools:

  • What are we protecting?
  • Protecting against what?
  • How much protection do we need?

In the second part, we’ll talk about the options for answering the fourth question: what to protect with. I will give case examples of how different customers build their protection.

What we protect: figuring out critical business functions

It’s best to start the preparation by discussing the disaster recovery plan with the business customer. The biggest challenge here is finding a common language. The customer usually doesn’t care how the IT solution works. What matters to them is whether the service can perform its business functions and bring in money. For example, if the site is working but the payment system is down, no payments come in from customers, and the IT professionals are still the ones to blame.
The IT professional may have difficulty in these negotiations for several reasons:

  • IT is not fully aware of the role of the information system in the business. For example, if there is no available business process description or transparent business model.
  • Not the whole process depends on IT. For example, when contractors do some of the work and IT has no direct influence over them.

I would structure the conversation this way:

  1. Explain to the business that accidents happen to everyone and that recovery takes time. Better yet, show real situations: how they happen and what the consequences can be.
  2. Show that not everything depends on IT, but that you are willing to help with a plan of action in your area of responsibility.
  3. Ask the business customer: if the apocalypse happens, which process should be restored first? Who is involved in it, and how?
    You need a simple answer from the business, for example: the call center must keep registering requests 24/7.
  4. Ask one or two users of the system to describe this process in detail.
    It is better to enlist the help of an analyst, if your company has one.
    To begin with, the description may look like this: the call center receives requests by phone, by mail, and through messages from the site. Operators then enter them into 1C via a web interface, and from there production picks them up in such-and-such a way.
  5. Then look at which hardware and software solutions support the process. For comprehensive protection, consider three levels:
    • applications and systems within the site (software level),
    • the site itself, where the systems run (infrastructure level),
    • the network (which is often forgotten altogether).
  6. Identify possible points of failure: the system nodes on which the service depends. Note separately the nodes supported by other companies: telecom operators, hosting providers, data centers, and so on. With this list you can go back to the business customer for the next step.

    What are we protecting against?

    Next, we find out from the business customer which risks we are protecting against first. All risks can be divided into two groups:

    • lost time due to service downtime;
    • lost data due to physical damage, human error, and so on.

    Businesses are afraid of losing both data and time, because both lead to lost money. So once again we ask questions about each group of risks:

    • Can we estimate, for this process, how much lost data and lost time cost in money?
    • What data can’t we lose?
    • Where can we not allow downtime?
    • Which events are most likely and most threatening to us?

    After discussion, we’ll understand how to prioritize points of failure.
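One possible way to turn the answers into a priority list is to give each point of failure a rough likelihood and a rough loss, then sort by their product. This is only a sketch: the scoring rule (likelihood multiplied by loss), the `FailurePoint` structure, and all figures below are illustrative assumptions, not part of any method described in this article.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    level: str                # "application", "infrastructure" or "network"
    external: bool            # True if another company supports this node
    likelihood: float         # assumed chance of an incident per year, 0..1
    loss_per_incident: float  # assumed money lost per incident

    @property
    def expected_loss(self) -> float:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.loss_per_incident


def prioritize(points):
    """Return points of failure ordered from most to least costly."""
    return sorted(points, key=lambda p: p.expected_loss, reverse=True)


# Hypothetical figures for illustration only.
points = [
    FailurePoint("internet uplink", "network", True, 0.3, 50_000),
    FailurePoint("1C database server", "application", False, 0.2, 500_000),
    FailurePoint("office power supply", "infrastructure", True, 0.5, 100_000),
]

for p in prioritize(points):
    owner = "external" if p.external else "internal"
    print(f"{p.name} ({p.level}, {owner}): expected loss {p.expected_loss:,.0f}")
```

With these made-up numbers the database server comes first, even though it is the least likely to fail, because a single incident there costs the most.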

    How much we protect: RPO and RTO

    When the critical points of failure are clear, we calculate the RTO and RPO.
    As a reminder, RTO (recovery time objective) is the allowable time from the failure to the complete restoration of the service. In business terms, this is the allowable downtime. If we know how much money the process brings in, we can calculate the loss from each minute of downtime and, from that, the allowable loss.
    RPO (recovery point objective) is the acceptable data recovery point. It defines the period of time for which we can afford to lose data. From a business point of view, data loss can lead to fines, for example. Such losses can also be expressed in money.
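The arithmetic behind the allowable downtime can be sketched in a few lines. The function names and the revenue figures are hypothetical, and the sketch assumes that revenue is spread evenly over the working time, which is rarely exactly true in practice.

```python
def loss_per_minute(revenue_per_day: float, working_minutes_per_day: int = 24 * 60) -> float:
    """Money lost for each minute the process is down (even revenue assumed)."""
    return revenue_per_day / working_minutes_per_day


def allowable_downtime_minutes(
    revenue_per_day: float,
    acceptable_loss: float,
    working_minutes_per_day: int = 24 * 60,
) -> float:
    """Downtime the business can tolerate before losses exceed acceptable_loss."""
    return acceptable_loss / loss_per_minute(revenue_per_day, working_minutes_per_day)


# Hypothetical example: a 24/7 process bringing in 1,440,000 per day,
# and a business that accepts losing at most 60,000 per incident.
per_minute = loss_per_minute(1_440_000)
rto_minutes = allowable_downtime_minutes(1_440_000, 60_000)
print(f"Loss per minute: {per_minute:.0f}")
print(f"Allowable downtime (RTO): {rto_minutes:.0f} minutes")
```

The same approach applies to RPO: estimate what an hour of lost data costs, including possible fines, and divide the acceptable loss by it.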
    The recovery time must be calculated for the end user: how long it takes until they can log in and work again. So first we add up the recovery times of all the links in the chain. A common mistake here is to take the provider’s RTO from the SLA and forget about the other terms.

    Let’s look at a concrete example. A user logs into 1C, and the system opens with a database error. The user contacts the system administrator. The database is in the cloud, so the sysadmin reports the problem to the service provider. Let’s say all of this communication takes 15 minutes. In the cloud, a database of this size is restored from backup in an hour, so the RTO on the service provider’s side is one hour. But that is not the final figure: for the user, the 15 minutes spent reporting the problem are added to it.
    Next, the system administrator needs to check that the database is correct, connect it to 1C, and start the services. This takes another hour, so the RTO on the administrator’s side is already 2 hours and 15 minutes. The user needs another 15 minutes to log in and check that the necessary transactions are there. The total time to restore the service in this example is 2 hours and 30 minutes.
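The same calculation can be written down as a chain of steps whose durations are summed. The durations come from the example above; the step labels and the `end_to_end_rto` helper are just an illustrative way to lay it out.

```python
from datetime import timedelta

# Each link in the recovery chain and how long it takes (from the example above).
recovery_chain = [
    ("user reports the error, sysadmin contacts the provider", timedelta(minutes=15)),
    ("provider restores the database from backup", timedelta(hours=1)),
    ("sysadmin checks the database, connects it to 1C, starts services", timedelta(hours=1)),
    ("user logs in and checks the transactions", timedelta(minutes=15)),
]


def end_to_end_rto(chain):
    """Sum the durations of all steps: the RTO as the end user experiences it."""
    return sum((duration for _, duration in chain), timedelta())


elapsed = timedelta()
for step, duration in recovery_chain:
    elapsed += duration
    print(f"{str(elapsed):>8}  {step}")

total_rto = end_to_end_rto(recovery_chain)
print("End-to-end RTO:", total_rto)  # 2:30:00
```

Only the second line of the chain appears in the provider’s SLA; the rest is what gets forgotten when the SLA figure is quoted as the RTO.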

    These calculations show the business which external factors the recovery time depends on. For example, if an office is flooded, the leak must first be located and repaired. This takes time that does not depend on IT.

    Protecting with what: selecting tools for different risks

    After discussing all these points, the customer already understands what an outage costs the business. Now you can choose the tools and discuss the budget. Using client cases as examples, I will show which tools we offer for different tasks.
    Let’s start with the first group of risks: losses due to service downtime. The solution options for this task should provide a good RTO.

    1. Place the application in the cloud
      For starters, you can simply move to the cloud, where the provider has already thought through high availability. Virtualization hosts are clustered, power and network are redundant, data is stored on fault-tolerant storage, and the service provider is financially liable for downtime.
      For example, you can host a virtual machine with a database in the cloud. The application connects to the database from outside over the established channel, or from within the same cloud. If one of the cluster servers fails, the VM restarts on a neighboring server in less than 2 minutes. After that the DBMS comes up, and in a few minutes the database is available.
      RTO: measured in minutes. You can fix these times in your agreement with the provider.
      Cost: the cost of the cloud resources for your application.
      What it won’t protect you from: massive failures at the provider’s site, for example due to city-wide outages.
    2. Cluster the application
      If you want to improve the RTO, you can strengthen the previous option and place a clustered application in the cloud right away.
      The cluster can be implemented in active-passive or active-active mode. We create several VMs based on the vendor’s requirements. For greater reliability, we distribute them across different servers and storage systems. If the server hosting one of the databases fails, the standby node takes over the load within a few seconds.
      RTO: measured in seconds.
      Cost: slightly more expensive than a regular cloud, since clustering needs additional resources.
      What it won’t protect you from: massive site failures, as before. But local failures will not last as long.

      From practice: A retail company had several information systems and websites. All databases were hosted locally in the company office. Nobody thought about DR until the office lost power several times in a row. Customers were unhappy with the site outages.
      The availability problem was solved by moving to the cloud. In addition, we were able to optimize the load on the databases by balancing traffic between nodes.

    3. Move to a disaster-resistant cloud
      If you need to make sure that even a natural disaster at the main site does not interrupt operations, you can choose a disaster-resistant cloud. In this option, the provider stretches the virtualization cluster across 2 data centers. Constant synchronous one-to-one replication runs between the data centers. The channels between the data centers are redundant and follow different routes, so such a cluster is not affected by network problems.
      RTO: tends to 0.
      Cost: the most expensive option in the cloud.
      What it won’t protect you from: data corruption and human error, so it is recommended to make backups as well.

      From practice: One of our clients developed a comprehensive disaster recovery plan. This is the strategy they chose:

      • A disaster-resistant cloud protects the application from failures at the infrastructure level.
      • Two-level backup provides protection in case of human error. There are two types of backups: "cold" and "hot". A "cold" backup is powered off and takes time to deploy. A "hot" backup is ready to run and recovers faster. It is stored on a dedicated storage system. A third copy is written to tape and stored in another room.
      • Once a week, the client tests the protection and verifies that all backups, including those on tape, are usable. Once a year, the company tests the entire disaster-resistant cloud.

    4. Organize replication to another site
      Another way to avoid global problems at the main site is to provide geo-redundancy. In other words, create backup virtual machines at a site in another city. Dedicated DR solutions are suitable for this: in our company we use VMware vCloud Availability (vCAV). It can be used to set up protection between multiple cloud provider sites or to recover to the cloud from an on-premises site. I have covered the vCAV scheme in more detail in an earlier article.
      RPO and RTO: from 5 minutes.
      Cost: more expensive than the first option, but less expensive than hardware replication in a disaster-resistant cloud. The price consists of the vCAV license cost, the administration fee, the cost of cloud resources, and reserved resources under the PAYG model (10% of the cost of running resources for powered-off VMs).

      From practice: A client had 6 virtual machines with different databases in our cloud in Moscow. At first, protection was provided by backups: some were stored in the Moscow cloud, others at our site in St. Petersburg. Over time the databases grew, and recovery from backups began to take longer.
      Replication based on VMware vCloud Availability was added to the backups. Virtual machine replicas are stored at the backup site in St. Petersburg and are updated every 5 minutes. If the primary site fails, the client’s employees switch to the virtual machine replicas in St. Petersburg on their own and continue working with them.

      All of the above solutions provide high availability, but they do not save you from data loss caused by ransomware or an employee’s accidental mistake. For that, we need backups that provide the required RPO.
    5. Don’t forget about backups
      Everyone knows that you have to make backups, even if you have the coolest disaster-tolerant solution. So here is just a quick reminder of a few points.
      Strictly speaking, a backup is not DR. Here’s why:

      • It takes a long time. If the data is measured in terabytes, recovery will take more than an hour. You have to restore the data, assign a network, check that everything starts, and verify that the data is intact. So you can only achieve a good RTO if there is little data.
      • The data may not be recovered on the first attempt, and you need to allow time for a repeated recovery. For example, there are cases where we do not know the exact time of the data loss. Let’s say the loss is noticed at 3 p.m. and copies are made every hour. Starting from 3 p.m., we go back through the recovery points: 2 p.m., 1 p.m., and so on. If the system is important, we try to use the most recent restore point possible. But if the data we need is not in the latest backup, we take the previous restore point, and that costs extra time.
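The walk back through restore points, and the time it adds, can be sketched as follows. Everything here is an assumption for illustration: the `find_restore_point` helper, the 40-minute duration of one restore attempt, the date, and the moment the data was actually damaged are invented. Only the hourly schedule and the 3 p.m. detection time come from the example above.

```python
from datetime import datetime, timedelta


def find_restore_point(restore_points, noticed_at, is_good, time_per_attempt):
    """Try restore points from newest to oldest until one passes the check.

    Returns the chosen restore point (or None) and the total time spent.
    """
    elapsed = timedelta()
    candidates = sorted((p for p in restore_points if p <= noticed_at), reverse=True)
    for point in candidates:
        elapsed += time_per_attempt  # each attempt costs a full restore
        if is_good(point):
            return point, elapsed
    return None, elapsed


# Hourly copies from 10:00 to 14:00; the loss is noticed at 15:00.
day = datetime(2020, 7, 1)
restore_points = [day.replace(hour=h) for h in range(10, 15)]
noticed_at = day.replace(hour=15)

# Hypothetical: the data was actually damaged at 12:30, so only copies
# made before that moment still contain what we need.
damaged_at = day.replace(hour=12, minute=30)

chosen, spent = find_restore_point(
    restore_points,
    noticed_at,
    is_good=lambda point: point < damaged_at,
    time_per_attempt=timedelta(minutes=40),
)
print("Restore point used:", chosen.strftime("%H:%M"))
print("Time spent restoring:", spent)
print("Data lost since restore point:", noticed_at - chosen)
```

In this made-up scenario the 2 p.m. and 1 p.m. copies fail the check, so three restore attempts are needed and three hours of data are lost, although copies are made every hour.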

      At the same time, the backup schedule can provide the required RPO. For backups it is important to provide geo-redundancy in case of problems at the main site, so it is recommended to store some of the backups separately.
      The final disaster recovery plan should include at least 2 tools:

      • One of options 1-4, which protects systems from failures and outages.
      • Backups, which protect data from loss.

      It’s also worth arranging a backup communication link in case your primary ISP goes down. And voila: a bare-minimum DR setup is ready to go.
