
A big primer: distributed storage systems in practice for admins of medium and large businesses

by admin

Today's networks and data centers are moving briskly toward a fully software-defined scheme, where it no longer really matters what hardware you stuff inside: everything is driven by software. Cellular operators started down this road because they did not want to hang twenty antennas on every building (their nodes are reconfigured, changing frequencies and parameters, simply by pushing new configurations). Data centers began with server virtualization, which is now mainstream, and then moved on to storage virtualization.
But back to Russia in 2015. Below I'll show you how to "use the tools at hand" (x86 machines and whatever "storage" you have) to save money, increase reliability and solve a few other tasks typical for sysadmins of medium and large businesses.
This diagram shows both architectures in question: SDS is the two red controllers in the center with any backend behind them, from internal disks to FC shelves to clouds; the virtual SAN is shown in the hyper-converged storage part of the diagram.
Most importantly:

  • You don't care what the hardware actually is: disks, SSDs, a zoo of manufacturers, old and new models… it all goes to the orchestration software, which turns it into the virtual architecture you ultimately need. Roughly speaking, it combines everything into one volume, or lets you slice it up however you want.
  • You don't care what interfaces these systems have. SDS builds on top of them.
  • You don't care which functions your storage systems could or couldn't do (then again, now they can do what they need to: it's up to the software on top to decide).

Along the way, let's look at a couple of typical tasks with specific hardware and prices.

Who needs it and why

In fact, SDS software creates a management server (or cluster) that ties together different types of storage: disk shelves, server disks and RAM (as a cache), PCI-SSDs, SSD shelves, as well as standalone storage "cabinets" of different types and models from different vendors, with different disks and connection protocols.
From this point on, all of that space becomes shared, and the software understands that "fast" data should live here while slow archive data goes over there. You, as a sysadmin, roughly speaking, stop thinking in terms of "a RAID group on the array" and start thinking in terms of "here is a data set, it needs to go into the FAST profile". Of course, you agree with the wizard, or preset, which disks on which arrays back that FAST profile.
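As a rough illustration of that mental model (nothing here is a DataCore API; the profile names, pool names and the place_dataset helper are made up for the example), the mapping from profiles to physical pools might look like this:

```python
# Illustrative sketch only: an admin-defined mapping from storage profiles
# to the physical pools that back them, plus a naive placement routine.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str        # e.g. "pci-ssd-nodeA" or "old SATA shelf"
    free_gb: int

# The admin agrees up front which physical pools back each profile.
PROFILES = {
    "FAST":    ["pci-ssd-nodeA", "ssd-shelf-1"],
    "NORMAL":  ["fc-array-old"],
    "ARCHIVE": ["sata-jbod", "cloud-bucket"],
}

def place_dataset(size_gb: int, profile: str, pools: dict) -> Pool:
    """Pick the first backing pool of the requested profile with enough free space."""
    for pool_name in PROFILES[profile]:
        pool = pools[pool_name]
        if pool.free_gb >= size_gb:
            pool.free_gb -= size_gb
            return pool
    raise RuntimeError(f"no capacity left for profile {profile}")
```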
The same software uses server RAM (in the virtual storage controllers) as a cache: regular x86 RAM, up to 1 TB, serving both reads and writes, plus useful features like preventive read-ahead, block grouping, multithreading and a genuinely interesting Random Write Accelerator (more on that below).
The most frequent uses are as follows:

  • When an inquiring mind and/or unfortunate experience has led you to realize that you cannot do without honest synchronous replication between storage systems.
  • A trivial shortage of both performance and budget at the same time.
  • A lot of storage has accumulated: a big park, or rather a zoo. There is no unified standard, protocols keep changing, and at times it feels like everything has to be redone from scratch. It doesn't: virtualization is enough. The same zoo exists on the host side (a bunch of different OSes and two or even three different hypervisors), plus the desire, or more often the need, to use iSCSI and FC at the same time; the same applies to classic storage serving lots of different OSes.
  • You want to keep using old hardware rather than throw it away. As a rule, even 8-year-old servers handle node roles just fine: RAM speeds haven't changed much since then, and you don't need more, because the bottleneck is usually the network.
  • You have a lot of random writes, or you need to write many threads at once. Random writes turn into sequential ones if you use the cache and its features. You can even use an old, very old storage system as a fast "file dump" for the office.

What a Software-Defined Data Center is and how SDS fits into the SDDC philosophy

The difference between software-defined infrastructures and the conventional "static" kind is about the same as between the good old vacuum-tube circuits and the "new" transistors: very, very significant, but quite difficult to grasp at first. You need new approaches and a new understanding of the architecture.
I should note that there is nothing fundamentally new in the concept of Software-defined itself, and the basic principles were applied at least 15 years ago. It was just called differently and wasn’t found everywhere.
In this post we are discussing SDS (Software Defined Storage), just talking about storage, disk arrays and other storage devices and their interfaces.
I'm going to talk about technologies based on DataCore software. It's not the only vendor, but it covers almost all storage virtualization tasks completely.
Here are a few other vendors that solve storage problems with software-defined architectures:
EMC with their ScaleIO lets you consolidate any number of x86 servers with disk shelves into a single fast storage pool. Here's the theory, and here's the practice of building a fault-tolerant system from not-the-most-reliable domestic servers.
The domestic vendor RAIDIX; here's a write-up about them and their gear.
Their architecture replaces, for a number of specific tasks such as video editing, an 80-100 thousand dollar storage system with one costing 10-20 thousand.
Riverbed has a cool solution with which we united all of a bank's branches with its NAS in Moscow: each branch saw the NAS as if it were on its own city LAN, with quasi-synchronous replication running through the cache.
Servers in cities 1 and 2 address the storage in Moscow as if it were their own local disks, at LAN speeds. If necessary, they can work with it directly (case 3, office disaster recovery), but then the usual signal delays from the city to Moscow and back apply.
In addition, similar solutions are available from Citrix and a number of other vendors, but as a rule they are more tailored to those companies' own products.
Nutanix solves hyperconverged storage problems, but it is often expensive because they sell a combined hardware-software appliance, and the software is only decoupled from the hardware at very, very large volumes.
RED HAT offers the Ceph and Gluster products, but these guys, open-source as they seem, supported the sanctions.
I have the most experience with DataCore, so please forgive me in advance (and chime in) if I inadvertently pass over someone else's cool features.
Actually, here is what you need to know about the company: they are Americans (but did not join the sanctions, since they are not even listed on a stock exchange), have been on the market for 18 years, and all that time, under the same man who led them at the start, have been building a single product, software for constructing storage systems: SANsymphony-V, which I will call SSV for brevity. Since their CEO was an engineer, they geeked out over the technology and didn't think about marketing at all. As a result, hardly anyone knew them by name until last year; they earned their money by embedding their technology into partners' solutions under other brands.

About the symphony

SSV is software-defined storage. From the consumer (host) side, SSV looks like ordinary storage, in fact like a disk plugged directly into the server. In our practice it is usually a virtual multiport disk, two physical copies of which are accessible via two different DataCore nodes.
Hence, the first basic function of SSV is synchronous replication, and most of the actually used DataCore LUNs are fault-tolerant disks.
The software can be hosted on almost any x86 server, and almost any block device can be used as a resource: external storage (FC, iSCSI), internal disks (including PCI-SSD), DAS, JBOD, all the way up to attached cloud storage. "Almost", because there are hardware requirements.
SSV can present virtual disks to any host (with the exception of IBM i5/OS).
A simple application (virtualizer / FC/iSCSI target):
And here's where it gets more interesting:

Sweet functionality

SSV has a whole set of features: caching, load balancing, Auto-Tiering and a Random Write Accelerator.
Let's start with caching. The cache here is all the free RAM of the server where DataCore is installed; it works for both writes and reads, with a maximum size of 1 TB. ScaleIO and RAIDIX, by contrast, do not use RAM but load the disks or controllers of "their" servers, which makes the DataCore cache faster.
This DataCore architecture is all about speed and reliability; in my opinion, it is the fastest and most affordable cache available today for practical mid-sized business tasks.
The same RAM-based cache also powers the random write optimization feature, useful e.g. for OLTP workloads.
The optimizer's principle is very simple: the host writes random blocks of data (as SQL does) to the virtual disk; they land in the cache (RAM), which can absorb random blocks quickly, and the software sequences them there. Once enough sequential data has accumulated, it is destaged to the disk subsystem.
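To make the idea concrete, here is a toy sketch of write coalescing, not DataCore's actual implementation; the WriteCoalescer class, the backend object and its write_sequential method are assumptions for the example:

```python
# Toy sketch of the random-write accelerator idea: random host writes land in RAM,
# get sorted by logical block address, and are destaged as one mostly sequential pass.

class WriteCoalescer:
    def __init__(self, backend, flush_threshold=1024):
        self.backend = backend              # placeholder object with write_sequential(lba, data)
        self.pending = {}                   # lba -> data; RAM stands in for the node's cache
        self.flush_threshold = flush_threshold

    def write(self, lba: int, data: bytes) -> None:
        # The host gets its acknowledgement as soon as the block sits in RAM.
        self.pending[lba] = data
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Destage in LBA order so the disks see a sequential stream instead of random I/O.
        for lba in sorted(self.pending):
            self.backend.write_sequential(lba, self.pending[lba])
        self.pending.clear()
```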
In the same area there are preventive read-ahead, block grouping, multithreading, block consolidation, boot/login storm protection, and mitigation of the blender effect. If the control software understands what the host application is doing (e.g. reading a VDI image according to a standard pattern), the read can be performed before the host even requests the data, because the same data was read the last few times in the same situation. It makes sense to pull it into the cache the moment it becomes clear what the host is up to.
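A minimal sketch of such predictive read-ahead, under the assumption of a simple sequential-access detector (the class name, window size and backend interface are invented for illustration):

```python
# Hypothetical sketch of preventive read-ahead: when the host reads blocks
# sequentially, the next few blocks are fetched into the RAM cache in advance.

class ReadAheadCache:
    def __init__(self, backend, window=8):
        self.backend = backend     # placeholder object with read_block(lba)
        self.cache = {}            # lba -> data, held in RAM
        self.last_lba = None
        self.window = window

    def read(self, lba: int) -> bytes:
        if lba in self.cache:
            data = self.cache.pop(lba)          # hit: served straight from RAM
        else:
            data = self.backend.read_block(lba)
            # Sequential pattern detected: prefetch the next `window` blocks.
            if self.last_lba is not None and lba == self.last_lba + 1:
                for next_lba in range(lba + 1, lba + 1 + self.window):
                    self.cache[next_lba] = self.backend.read_block(next_lba)
        self.last_lba = lba
        return data
```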
Auto-Tiering means that every virtual disk is based on a pool that can include a variety of media, from PCI-SSD and FC storage to slow SATA and even external cloud storage. We assign each medium a level from 0 to 14, and the software automatically redistributes blocks between the media depending on how often each block is accessed. Archive data thus ends up on SATA and other slow media, while hot database fragments, for example, land on SSD. All available resources are used automatically and optimally; there is no manual shuffling of files.
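Conceptually, the tiering decision could look like the following sketch; the tier levels, thresholds and the retier function are illustrative assumptions, not DataCore internals:

```python
# Illustrative auto-tiering sketch: each tier has a level (0 = fastest, 14 = slowest),
# block access counters are evaluated periodically, hot blocks move up, cold blocks move down.
from collections import Counter

TIERS = {0: "pci-ssd", 3: "sas-15k", 8: "sata", 14: "cloud-archive"}
LEVELS = sorted(TIERS)

def retier(access_counts: Counter, current_tier: dict, hot=100, cold=5) -> dict:
    """Return block -> new tier level based on recent access counts."""
    moves = {}
    for block, tier in current_tier.items():
        hits = access_counts.get(block, 0)
        idx = LEVELS.index(tier)
        if hits >= hot and idx > 0:
            moves[block] = LEVELS[idx - 1]        # promote one tier up
        elif hits <= cold and idx < len(LEVELS) - 1:
            moves[block] = LEVELS[idx + 1]        # demote one tier down
    return moves
```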
Statistics are evaluated and blocks are moved once every 30 seconds by default, provided this does not delay current read and write operations. Load balancing is present as an analogue of RAID 0, i.e. striping across the physical media in a disk pool, as well as the ability to fully use both cluster nodes (active-active) as primary, which loads the adapters and the SAN more efficiently.
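For the striping part, the address math is the familiar RAID 0 round-robin; a tiny sketch, with the chunk size and disk names made up for the example:

```python
# RAID 0-style striping illustration: logical block addresses are spread
# round-robin across the physical media in the pool, one chunk at a time.
CHUNK_BLOCKS = 128          # stripe unit, in blocks (an assumption for the example)

def locate(lba: int, disks: list) -> tuple:
    """Map a logical block address to (physical disk, block offset on that disk)."""
    stripe = lba // CHUNK_BLOCKS            # which stripe unit the block falls into
    disk = disks[stripe % len(disks)]       # round-robin across pool members
    local_chunk = stripe // len(disks)      # chunks already placed on that disk
    return disk, local_chunk * CHUNK_BLOCKS + lba % CHUNK_BLOCKS

# Example: where does block 1000 live in a three-disk pool?
print(locate(1000, ["ssd0", "ssd1", "sata0"]))   # ('ssd1', 360)
```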
With SSV you can, for example, build a metro cluster between storage systems that do not support this capability themselves or that would require additional expensive equipment for it. And, given a fast channel between the nodes, you lose nothing; you actually gain performance and functionality, plus a reserve of performance.

Architecture

SSV has only two architectures.
The first is SDS, software-defined storage. A classic "heavy" SDS is, for example, a physical rack with RISC servers and an SSD fabric (or HDD arrays) inside. Beyond the actual price of the disks, the cost of such a rack is largely determined by architectural differences that matter greatly for high-reliability deployments (e.g. in banks). The price difference between a Chinese workhorse x86 box and a comparable-capacity array from EMC, HP or another big vendor is at least a factor of two with a similar set of disks, and about half of that difference is due to the architecture.
So, of course, it is possible to combine several x86 servers with disk shelves into one fast network and teach them to work as a cluster. There is special software for this, such as EMC ViPR. Or you can build a single x86 server stuffed with disks into a storage system.
SDS here is essentially just such a server. The only difference is that in 99% of practical cases there will be two nodes, and the backend can be anything you want.
Technically, these are two x86 servers running Windows and DataCore SSV, with a synchronization (block) link and a control (IP) link between them. The servers sit between the host (consumer) and the storage resources, for example a bunch of shelves with disks. The only limitation is that both nodes must have block access to those resources.
The clearest way to describe the architecture is the block write procedure. The virtual disk is presented to the host as a normal block device. The application writes a block to the disk; the block goes into the RAM of the first node (1), is copied to the RAM of the second node (2) over the synchronization channel, and is then destaged to disk on the first node (3) and on the second node (4).
As soon as the block exists in two copies, the application gets its write confirmation. The sizing of the DataCore platform and the backend depends only on the load requirements of the hosts; in a properly built system, performance is limited by the resources of the adapters and the SAN.
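A simplified sketch of that write path, with the node and channel objects as placeholders rather than the real DataCore interfaces:

```python
# Simplified mirrored-write sketch matching steps (1)-(4) described above.

def mirrored_write(block, node_a, node_b, sync_channel) -> bool:
    node_a.cache_put(block)               # (1) block lands in node A's RAM cache
    sync_channel.send(node_b, block)      # (2) copy pushed to node B's RAM over the sync link
    # The application receives its write confirmation here, before destaging:
    acknowledged = True
    node_a.destage_async(block)           # (3) node A flushes the block to its backend
    node_b.destage_async(block)           # (4) node B does the same on its side
    return acknowledged
```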
The second is Virtual SAN. DataCore SSV runs in a virtual machine with Windows Server, and DataCore gets the storage resources attached to that host (hypervisor): internal disks as well as external storage. There can be from 2 to 64 such nodes in the current version. DataCore pools the resources "under" all the hypervisors and allocates that volume dynamically.
As in the previous architecture, there are two physical copies of every block; in practice these are most often the servers' internal disks. A common practice is to build a fault-tolerant mini data center without any external storage: 2-5 nodes, to which more are added when new compute or storage resources are needed. This is a particular example of the now-fashionable idea of a hyperconverged environment, as used by Google, Amazon and others.
Simply put, you can build an Enterprise-grade environment, or you can take a pile of not-the-most-reliable and not-the-fastest x86 hardware, stuff the machines with disks, and fly off to take over the world on the cheap.
This is what the resulting system can do:

Two practical tasks

Task #1. Building a virtual SAN. There are three virtualization servers, hosted in the main data center (two servers) and the backup data center (one server).

It is necessary to integrate them under VMware vSphere 5.5 into a single geographically distributed virtualization cluster and provide fault tolerance and redundancy using:
– VMware High Availability technology;
– VMware DRS load balancing technology;
– Data link redundancy technology;
– virtual storage network technology.
Provide the following modes of operation:
1) Normal operation.
Normal operation is characterized by all VMs of the virtualization subsystem running in both the main and backup data centers.
2) Emergency operation.
Emergency operation is characterized by the following conditions:
a) all virtual machines continue to run in the main (or backup) data center in the event of:
– network isolation of a virtualization server within the main (or backup) data center;
– failure of the virtual storage network between the main and backup data centers;
– failure of the local network between the main and backup data centers.
b) all virtual machines are automatically restarted on other cluster nodes in the main (or backup) data center in the event of:
– failure of one or two virtualization servers;
– failure of an entire data center site in a disaster (failure of all its virtualization servers).

Characteristics of server equipment

Server #1
  • Server: HP DL380e Gen8
  • Processors: 2 x Intel Xeon Processor E5-2640 v2
  • RAM capacity: 128 GB
  • Drives: 10 x HP 300GB 6G SAS 15K 2.5in SC ENT HDD
  • Network interfaces: 2 x 10Gb, 4 x 1Gb

Server #2
  • Server: HP DL380e Gen8
  • Processors: 2 x Intel Xeon Processor E5-2650
  • RAM capacity: 120 GB
  • Drives: 8 x HP 300GB 6G SAS 15K 2.5in SC ENT HDD
  • Network interfaces: 2 x 10Gb, 4 x 1Gb

Server #3
  • Server: IBM x3690 X5
  • Processors: 2 x Intel Xeon Processor X7560 8C
  • RAM type: IBM 8GB PC3-8500 CL7 ECC DDR3 1066 MHz LP RDIMM
  • RAM capacity: 264 GB
  • Drives: 16 x IBM 146GB 6G SAS 15K 2.5in SFF SLIM HDD
  • Network interfaces: 2 x 10Gb, 2 x 1Gb

Solution:
– Use existing hardware to create a virtualization subsystem.
– Based on the same hardware and internal server storage devices, create a virtual storage network with synchronous volume replication using DataCore software.
On each virtualization server, a virtual machine is deployed as a DataCore node, and virtual disks created on the local disk resources of the virtualization servers are attached to the DataCore nodes. These disks are combined into disk resource pools, from which mirrored virtual disks are created. The disks are mirrored between two DataCore Virtual SAN nodes, so one node's disk pool holds the "original" and the other node's pool holds the "mirror copy". The virtual disks are then presented to the virtualization servers (hypervisors) or directly to virtual machines.
It is cheap and simple (colleagues tell me it is a price-competitive solution) and requires no additional hardware. Besides solving the immediate problem, the storage network gains a lot of useful extra functionality: higher performance, integration with VMware, snapshots of the entire volume, and so on. As you keep growing, you only need to add virtual cluster nodes or upgrade the existing ones.
Here's the diagram:
Task #2. Unified Storage System (NAS/SAN).
It all started with a Windows failover cluster for a file server. The customer needed a file share for document storage, with high availability, data backup and near-instant recovery. The decision was made to build a cluster.
The customer had two Supermicro servers (one of them connected to a SAS JBOD). The disk capacity of the two servers was more than enough (about 10 TB per server), but shared storage was needed to organize the cluster. A backup of the data was also planned, because a single storage system is a single point of failure, preferably with CDP covering a working week. The data had to be available at all times, with a maximum downtime of 30 minutes (or heads would roll). The standard solution would have meant buying a storage system plus another server for backups.
Solution:
– DataCore software is installed on each server.
In the DataCore architecture, a Windows failover cluster can be deployed without shared SAN storage (using the servers' internal disks), or using DAS, JBOD or external storage in the full DataCore Unified Storage (SAN & NAS) architecture, which takes full advantage of Windows Server 2012 with NFS and SMB (CIFS) and provides SAN services to external hosts. This was the architecture that was eventually deployed, and the disk capacity not used by the file server was presented as SAN storage to the ESXi hosts.
Compared to traditional solutions it was very cheap, plus:

  • Fault-tolerant storage resources (including in the context of Windows failover cluster operation): two mirror copies of the data are available at any given time.
  • Thanks to the virtualization features, the DataCore layer under the Windows cluster already provides a mirrored multiport virtual disk, which means the problem of a long chkdsk ceases to exist as such.
  • Further scaling of the system (e.g., growth of the data volume requires an additional disk array or an increase of the volume in the server itself) is an extremely simple process and does not stop the service.
  • Any maintenance work on the cluster nodes can be performed without stopping the file access service.
  • CDP in SANsymphony-V10 is a built-in feature and is limited only by available disk space.
  • Storage performance is boosted by DataCore's Adaptive Caching feature, which uses all available server RAM as a read/write cache for the attached storage; this matters more for the ESXi hosts than for the file shares.

Main Principle

The basic principle of storage virtualization is, on the one hand, to hide the entire backend from the consumer, and on the other, to give any backend a single, genuinely competitive set of functionality.

Important practical notes specifically on DataCore

  • In the second architecture, DataCore does almost exactly the same thing as, for example, ScaleIO.
  • If your task is to provide data storage for branches across Russia, but you don't want to tie it to your current storage and don't want to put new arrays in the branches, there is a dead-simple method: take three servers with 24 disks each and you get acceptable capacity and acceptable performance for most services.
  • If you need to tame a zoo, a cheaper and easier option will be hard to find. The timing goes like this: the last example took 8 days to implement, with about another week of design before that. We used existing hardware and bought nothing fundamentally new except extra memory for the cache. The DataCore license took a little over a week, and we saved the cost of two storage systems.
  • DataCore is licensed by volume, in bundles of 8 or 16 TB. If you are interested, I can send you an indicative street price (important: in practice everything is quite flexible, licensing can be per host or per TB, and you can pick individual functions, so actual prices vary widely; if you need a calculation for your case, also write to me directly).
    In one of our projects we had to set up the DMZ segment of a large company. To save on storage and optimize access to local storage, we simply put in DataCore. The shared SAN core could not be extended into the DMZ, so we either had to buy a SAN plus yet another piece of hardware, a firewall with all the trimmings for secure access inside, or get creative. We combined 3 virtualization hosts into a single space and optimized access. It turned out cheap and fast.
    In another project we had to optimize access for SQL. There we were able to reuse old, slow storage arrays: thanks to the DataCore cache, writes went ten times faster than writing to them directly.
    RAID striping is useful in some situations (you can build software RAID 0 and RAID 1 even if the attached DAS can't do it itself).
    Good reporting: visual and statistical tools that provide data on all sorts of performance parameters and bottlenecks.
    QoS Control lets DBMSs and critical applications run faster by setting I/O traffic limits for less important workloads (see the sketch after this list).
    I've been knocking out nodes in power tests (within reason, keeping the number of nodes needed to recover correctly), but I couldn't produce any corruption. Synchronous Mirroring with Auto Failover: there is synchronous replication with automatic, configurable failover and recovery. The self-healing of disk pools works as follows: if a physical disk in a pool fails or is marked for replacement, DataCore automatically rebuilds the pool onto the available resources. There is also a change log on disk with the ability to recover to any state or any point in time. Scheduled disk replacements, of course, happen without interruption: as usual, the data is spread across the other instances, and then, at the first opportunity, the volume "migrates" onto the new disk as well. Roughly the same principle is used for migration between data centers, only the "disk swapping" happens en masse.
    There is Virtual Disk Migration: simple and effective data migration from physical to virtual resources and back, without stopping applications.
  • Everyone can connect to a live installation and touch the console with their own hands.
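To illustrate the QoS point from the list above, here is a rough token-bucket sketch of capping the IOPS of a low-priority workload; the class and the numbers are assumptions for the example, not DataCore's implementation:

```python
# Rough token-bucket sketch of I/O QoS: the low-priority workload gets an IOPS
# budget, so the DBMS and other critical applications keep their headroom.
import time

class IopsLimiter:
    def __init__(self, iops_limit: int):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def allow_io(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at one second's worth.
        self.tokens = min(self.iops_limit, self.tokens + (now - self.last) * self.iops_limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True      # the low-priority I/O may proceed
        return False         # throttled: bandwidth stays with the critical workloads

# Example: cap a backup job at 500 IOPS while SQL traffic runs unthrottled.
backup_limiter = IopsLimiter(500)
```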
