
Implementing a singleton in a multithreaded application

by admin


Introduction

These days it is hard to imagine software running in a single thread. Of course, there are plenty of simple tasks for which one thread is more than sufficient, but this is not always the case, and most tasks of medium or high complexity use multithreading in one way or another. In this article I will talk about using singletons in a multithreaded environment. Despite its seeming simplicity, this topic contains many nuances and interesting issues, so I think it deserves an article of its own. This article won’t discuss why you should use singletons or how to use them correctly; for those questions, I recommend my previous articles on various singleton-related issues [1], [2], [3]. This article talks about the impact of multithreading on the implementation of singletons and discusses issues that come up during development.

Problem Statement

The following singleton implementation was considered in the previous articles:

template<typename T>
T& single()
{
    static T t;
    return t;
}

The idea of this function is quite simple: for any type T we can create an instance of that type on demand, i.e. “lazily”, and the number of instances created by this function never exceeds one. If we never need an instance, there are no problems at all, either from the point of view of multithreading or from the point of view of lifetime and other issues. But what happens if, in our multithreaded application, two or more threads call this function with the same type T at the same time?

Standard C++

Before answering this question from a practical point of view, I suggest getting acquainted with the theoretical side, i.e. with the C++ standard. There are two standards supported by compilers at the moment: 2003 and 2011.

$6.7.4, C++03
The zero-initialization (8.5) of all local objects with static storage duration (3.7.1) is performed before any other initialization takes place. A local object of POD type (3.9) with static storage duration initialized with constant-expressions is initialized before its block is first entered. An implementation is permitted to perform early initialization of other local objects with static storage duration under the same conditions that an implementation is permitted to statically initialize an object with static storage duration in namespace scope (3.6.2). Otherwise such an object is initialized the first time control passes through its declaration; such an object is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control re-enters the declaration (recursively) while the object is being initialized, the behavior is undefined.
$6.7.4, C++11
The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a block-scope entity with static storage duration, if applicable, is performed before its block is first entered. An implementation is permitted to perform early initialization of other block-scope variables with static or thread storage duration under the same conditions that an implementation is permitted to statically initialize a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.(*) If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined.
(*) The implementation must not introduce any deadlock around execution of the initializer

(emphasis mine)
In short, the new standard says that if, during the initialization of a variable (i.e. during instance creation), a second thread tries to access the same variable, that thread must wait until the initialization is complete, and the implementation must not allow a deadlock. The earlier standard, as you can see, doesn’t say a word about multithreading.
Now it remains to find out which compilers really support the new standard and which only pretend to. To find out, let’s run the following experiment.

Experiment

For the multithreading primitives I will use the Ultimate++ framework. It is quite lightweight and easy to use, and the specific choice is not crucial for this article (you could, for example, use boost instead).
For our experiment, let’s write a class that takes quite a long time to construct:

struct A
{
    A()
    {
        Cout() << '{';       // marker: start of class creation
        Thread::Sleep(10);   // wait 10 ms
        if (++x != 1)
            Cout() << '2';   // error marker: object re-initialized
        Cout() << '}';       // marker: end of class creation
    }
    ~A()
    {
        Cout() << '~';       // destroying the class
    }
    int x;
};

At the moment the class starts to be created, the value of x is 0, because we plan to use it only from the singleton, i.e. with the static keyword, which zero-initializes all POD types. Then we wait for some time, emulating a long-running operation. At the end we check for the expected value; if it differs from one, we print an error marker. I used single-character output here to show the sequence of operations more clearly; I purposely didn’t use full messages, because that would have required extra synchronization between the threads, which I wanted to avoid.
Next, let’s write the function that is called when new threads are started:

void threadFunction(int i)
{
    Cout() << char('a' + i); // function start: a lowercase letter, 'a' for the first thread
    A a = single<A>();       // call our singleton
    if (a.x == 0)
        Cout() << '0';       // error marker: the singleton is not initialized
    Cout() << char('A' + i); // function end: the corresponding capital letter
}

We will call threadFunction from 5 threads simultaneously, thereby emulating concurrent access to the singleton:

for (int i = 0; i < 5; ++i)
    Thread::Start(callback1(threadFunction, i));
Thread::ShutdownThreads();

For my experiment I chose two compilers that are quite popular today: MSVC 2010 and GCC 4.5. Testing was also done with MSVC 2012; the result was exactly the same as with 2010, so I won’t mention it separately below.
Startup result for GCC :

ab{cde}ABCDE~

Startup result for MSVC :

ab{0cB0dCe00DE}A~

Discussion of experimental results

Let’s discuss the results obtained. For GCC the following happens:

  1. threadFunction starts for thread 1
  2. threadFunction starts for thread 2
  3. singleton initialization starts
  4. threadFunction starts for thread 3
  5. threadFunction starts for thread 4
  6. threadFunction starts for thread 5
  7. singleton initialization completes
  8. threadFunction exits sequentially for all threads 1-5
  9. the program terminates and the singleton is destroyed

No surprises here: the singleton is initialized exactly once, and threadFunction finishes only after the singleton is initialized => GCC initializes the object correctly in a multithreaded environment.
The situation with MSVC is somewhat different:

  1. threadFunction starts for thread 1
  2. threadFunction starts for thread 2
  3. singleton initialization starts
  4. error: the singleton is not initialized
  5. threadFunction starts for thread 3
  6. threadFunction exits for thread 2
  7. error: the singleton is not initialized
  8. threadFunction exits for thread 5
  9. singleton initialization completes
  10. threadFunction exits for thread 1
  11. the program terminates and the singleton is destroyed

In this case the compiler starts initializing the singleton in the first thread, while to the other threads it returns an object that has not yet been initialized. Thus, MSVC does not work correctly in a multithreaded environment.

Analysis of experimental results

Let’s try to find out where the difference between the two compilers’ results comes from. To do this, let’s compile the code and disassemble it.
GCC :

5       T single()
   0x00418ad8 <+0>:    push %ebp
   0x00418ad9 <+1>:    mov  %esp, %ebp
   0x00418adb <+3>:    sub  $0x28, %ebp
6       {
7           static T t;
   0x00418ade <+6>:    cmpb $0x0, 0x48e070
   0x00418ae5 <+13>:   je   0x418af0 <single<A>()+24>
   0x00418ae7 <+15>:   mov  $0x49b780, %eax
   0x00418aec <+20>:   leave
   0x00418aed <+21>:   ret
   0x00418af0 <+24>:   movl $0x48e070, (%esp)
   0x00418af7 <+31>:   call 0x485470 <__cxa_guard_acquire>
   0x00418afc <+36>:   test %eax, %eax
   0x00418afe <+38>:   je   0x418ae7 <single<A>()+15>
   0x00418b00 <+40>:   movl $0x49b780, (%esp)
   0x00418b07 <+47>:   call 0x4195d8 <A::A()>
   0x00418b0c <+52>:   movl $0x48e070, (%esp)
   0x00418b13 <+59>:   call 0x4855cc <__cxa_guard_release>
   0x00418b18 <+64>:   movl $0x485f04, (%esp)
   0x00418b1f <+71>:   call 0x401000 <atexit>
8           return t;
9       }
   0x00418b24 <+76>:   mov  $0x49b780, %eax
   0x00418b29 <+81>:   leave
   0x00418b2a <+82>:   ret

You can see that around the call to the A constructor the compiler inserts calls to the guard functions __cxa_guard_acquire/__cxa_guard_release, which allow the single function to be entered safely from several threads while the singleton is being initialized.
MSVC :

T single()
{
00E51420  mov  eax, dword ptr fs:[00000000h]
00E51426  push 0FFFFFFFFh
00E51428  push offset __ehhandler$??$single@UA@@@@YAAAUA@@XZ (0EE128Eh)
00E5142D  push eax
    static T t;
00E5142E  mov  eax, 1
00E51433  mov  dword ptr fs:[0], esp
; initialization check
00E5143A  test byte ptr [`single<A>'::`2'::`local static guard' (0F23944h)], al
00E51440  jne  single<A>+47h (0E51467h)
; flag update to "initialized"
00E51442  or   dword ptr [`single<A>'::`2'::`local static guard' (0F23944h)], eax
00E51448  mov  ecx, offset t (0F23940h)
00E5144D  mov  dword ptr [esp+8], 0
; constructor call: object creation
00E51455  call A::A (0E51055h)
00E5145A  push offset `single<A>'::`2'::`dynamic atexit destructor for 't'' (0EED390h)
00E5145F  call atexit (0EA0AD1h)
00E51464  add  esp, 4
    return t;
}
00E51467  mov  ecx, dword ptr [esp]
00E5146A  mov  eax, offset t (0F23940h)
00E5146F  mov  dword ptr fs:[0], ecx
00E51476  add  esp, 0Ch
00E51479  ret

Here the compiler uses a variable at address 0x0F23944 to check for initialization. If the object has not yet been initialized, this value is set to one and then the singleton is initialized without further ado. You can see that no synchronization whatsoever is performed, which explains the result of our experiment.

Simple Solution

A rather simple solution to our problem is to use a mutex to synchronize access to the object while it is being created:

// class for automatic work with a global mutex
struct StaticLock : Mutex::Lock
{
    StaticLock() : Mutex::Lock(mutex)
    {
        Cout() << '+';
    }
    ~StaticLock()
    {
        Cout() << '-';
    }
private:
    static Mutex mutex;
};

Mutex StaticLock::mutex;

template<typename T>
T& single()
{
    StaticLock lock; // mutex.lock() is called first
    static T t;      // initialize the singleton
    return t;        // mutex.unlock() is called and the result is returned
}

Startup result :

ab+{cde}-A+-B+-C+-D+-E~

Sequence of operations:

  1. threadFunction starts for thread 1
  2. threadFunction starts for thread 2
  3. the global lock is taken: mutex.lock()
  4. singleton initialization starts
  5. threadFunction starts for thread 3
  6. threadFunction starts for thread 4
  7. threadFunction starts for thread 5
  8. singleton initialization completes
  9. the global lock is released: mutex.unlock()
  10. threadFunction exits for thread 1
  11. the global lock is taken: mutex.lock()
  12. the global lock is released: mutex.unlock()
  13. threadFunction exits for thread 2
  14. threadFunction exits for thread 5
  15. the program terminates and the singleton is destroyed

This implementation completely avoids the problem of returning an uninitialized object: mutex.lock() is called before initialization starts, and mutex.unlock() is called after it completes, so the other threads wait for initialization to finish before using the object. However, this approach has a significant disadvantage: the lock is taken on every call, whether or not the object is already initialized. To improve performance, we would like synchronization to be used only when accessing an object that has not yet been initialized (as GCC implements it).

Double-checked locking pattern

A commonly used approach to implementing this idea is the double-checked locking pattern (DCLP). Its essence is the following sequence of actions:

  1. check the condition: is the object initialized? If yes, return a reference to it
  2. take the lock
  3. check the condition a second time; if the object is already initialized, release the lock and return the reference
  4. initialize the singleton
  5. change the condition to “initialized”
  6. release the lock and return the reference

This sequence makes the origin of the name clear: we check the condition twice, first before taking the lock and then immediately after. The point is that the first check alone cannot be trusted: two threads may enter the function at the same time, both see the status “not initialized”, and then one takes the lock while the other waits on it. Without the second check, the waiting thread would re-initialize the singleton after acquiring the lock, which can lead to unfortunate consequences.
DCLP can be illustrated by the following example:

template<typename T>
T& single()
{
    static T* pt;
    if (pt == 0)         // first check, outside the mutex
    {
        StaticLock lock;
        if (pt == 0)     // second check, under the mutex
            pt = new T;
    }
    return *pt;
}

Here the pointer to our type serves as the condition: if it is null, the object must be initialized. It would seem that everything is fine: no performance problems, everything works. However, it turns out things are not so rosy. At one time DCLP was even considered not a pattern but an anti-pattern, i.e. something you shouldn’t use because it leads to hard-to-find errors. Let’s try to find out what is wrong here.
First of all, the singleton will never be deleted. This is not a big problem: the singleton’s lifetime matches the application’s lifetime, so the operating system will clean up after it (unless you need some non-trivial finalization, such as writing a log message or sending a query to a database to update the application’s state record).
The second, more serious problem is the following line:

pt = new T;

Let’s take a closer look at this line. It can be rewritten as follows (I omit exception handling for brevity):

pt = static_cast<T*>(operator new(sizeof(T))); // allocate memory for the object
new (pt) T;                                    // call the constructor on the already allocated memory

That is, first memory is allocated, and only then is the object initialized by calling its constructor. It may therefore happen that the memory has already been allocated and pt has been updated, but the object has not been constructed yet. If another thread performs the first check (outside the lock) at that moment, single will return a reference to memory that has been allocated but not initialized.
Now let’s try to fix both of the problems described above.

Proposed approach

Let’s introduce two functions for creating a singleton: one written as if we had a single-threaded application, and one for multithreaded use:

// function that is unsafe in a multithreaded environment
template<typename T>
T& singleUnsafe()
{
    static T t;
    return t;
}

// function for use in a multithreaded environment
template<typename T>
T& single()
{
    static T* volatile pt;
    if (pt == 0)
    {
        T* tmp;
        {
            StaticLock lock;
            tmp = &singleUnsafe<T>();
        }
        pt = tmp;
    }
    return *pt;
}

The idea is as follows. We know that our original implementation (now the singleUnsafe function) works fine in a single-threaded application; therefore, all we need is to serialize the calls, which is achieved by correct use of locks. In a sense there are two checks here as well, only the first check (outside the lock) uses our pointer, while the second uses the internal guard variable generated by the compiler. The volatile keyword is used to prevent the compiler from reordering the operations during optimization. It is also worth noting that the assignment to the pt pointer takes place outside the lock; this is done so that the operations are not reordered by the processor during execution.
The result of compiling this implementation is shown below:

template<typename T> T single()
{
; exception handling
00083B30  push 0FFFFFFFFh
00083B32  push offset __ehhandler$??$single@UA@@@@YAAAUA@@XZ (0A13B6h)
00083B37  mov  eax, dword ptr fs:[00000000h]
00083B3D  push eax
00083B3E  mov  dword ptr fs:[0], esp
00083B45  push ecx
    static T* volatile pt;
    if (pt == 0)
; first check, outside the lock
00083B46  mov  eax, dword ptr [pt (0E3950h)]
00083B4B  test eax, eax
00083B4D  jne  single<A>+7Dh (83BADh)
    {
        T* tmp;
        {
            StaticLock lock;
; call EnterCriticalSection to take the lock
00083B4F  push offset staticMutex (0E3954h)
00083B54  mov  dword ptr [esp+4], offset staticMutex (0E3954h)
00083B5C  call dword ptr [__imp__EnterCriticalSection@4 (0ED6A4h)]
            tmp = singleUnsafe<T>();
00083B62  mov  eax, 1
00083B67  mov  dword ptr [esp+0Ch], 0
; second check
00083B6F  test byte ptr [`singleUnsafe<A>'::`2'::`local static guard' (0E394Ch)], al
00083B75  jne  single<A>+68h (83B98h)
00083B77  or   dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (0E394Ch)], eax
00083B7D  mov  ecx, offset t (0E3948h)
00083B82  mov  byte ptr [esp+0Ch], al
; object initialization: constructor call
00083B86  call A::A (1105Fh)
00083B8B  push offset `singleUnsafe<A>'::`2'::`dynamic atexit destructor for 't'' (0AD4D0h)
00083B90  call atexit (60BB1h)
00083B95  add  esp, 4
        }
; call LeaveCriticalSection to release the lock
00083B98  push offset staticMutex (0E3954h)
00083B9D  call dword ptr [__imp__LeaveCriticalSection@4 (0ED6ACh)]
        pt = tmp;
; write the pointer to pt after all the locks
00083BA3  mov  dword ptr [pt (0E3950h)], offset t (0E3948h)
    }
    return *pt;
}
00083BAD  mov  ecx, dword ptr [esp+4]
; return the result in the eax register
00083BB1  mov  eax, dword ptr [pt (0E3950h)]
00083BB6  mov  dword ptr fs:[0], ecx
00083BBD  add  esp, 10h
00083BC0  ret

I’ve added comments to the assembly code to make it clear what’s going on. The exception handling code is worth noting: quite an impressive chunk. Compare it with the code generated by GCC, which uses unwind tables and so has zero overhead when no exception is thrown. If we look at the code generated by the x64 MSVC compiler, we see a somewhat different approach to exception handling:

template<typename T> T single()
{
000000013F401600  push rdi
000000013F401602  sub  rsp, 30h
000000013F401606  mov  qword ptr [rsp+20h], 0FFFFFFFFFFFFFFFEh
000000013F40160F  mov  qword ptr [rsp+48h], rbx
    static T* volatile pt;
    if (pt == 0)
000000013F401614  mov  rax, qword ptr [pt (13F4F5890h)]
000000013F40161B  test rax, rax
000000013F40161E  jne  single<A>+75h (13F401675h)
    {
        T* tmp;
        {
            StaticLock lock;
000000013F401620  lea  rbx, [staticMutex (13F4F58A0h)]
000000013F401627  mov  qword ptr [lock], rbx
000000013F40162C  mov  rcx, rbx
000000013F40162F  call qword ptr [__imp_EnterCriticalSection (13F50CCC0h)]
; nop !!!
000000013F401635  nop
            tmp = singleUnsafe<T>();
000000013F401636  mov  eax, dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (13F4F588Ch)]
000000013F40163C  lea  rdi, [t (13F4F5888h)]
000000013F401643  test al, 1
000000013F401645  jne  single<A>+65h (13F401665h)
000000013F401647  or   eax, 1
000000013F40164A  mov  dword ptr [`singleUnsafe<A>'::`2'::`local static guard' (13F4F588Ch)], eax
000000013F401650  mov  rcx, rdi
000000013F401653  call A::A (13F401087h)
000000013F401658  lea  rcx, [`singleUnsafe<A>'::`2'::`dynamic atexit destructor for 't'' (13F4A6FF0h)]
000000013F40165F  call atexit (13F456664h)
; nop !!!
000000013F401664  nop
        }
000000013F401665  mov  rcx, rbx
000000013F401668  call qword ptr [__imp_LeaveCriticalSection (13F50CCD0h)]
        pt = tmp;
000000013F40166E  mov  qword ptr [pt (13F4F5890h)], rdi
    }
    return *pt;
000000013F401675  mov  rax, qword ptr [pt (13F4F5890h)]
}
000000013F40167C  mov  rbx, qword ptr [rsp+48h]
000000013F401681  add  rsp, 30h
000000013F401685  pop  rdi
000000013F401686  ret

I have specifically marked the nop instructions. They serve as markers during stack unwinding when an exception is thrown. This approach also has zero execution overhead when no exception is thrown.

Conclusions

So, it’s time to draw some conclusions. The article shows that different compilers treat the new standard differently: GCC does its best to adapt to modern realities and correctly handles singleton initialization in a multithreaded environment, while MSVC lags behind, which is why the careful singleton implementation described in this article is required. The proposed approach is a versatile and efficient implementation without significant synchronization overhead.

P.S.

This article is an introduction to multithreading issues: it solves the access problem at the moment the singleton object is created. More serious issues arise when using the singleton’s data afterwards; they will be discussed in detail in the next article.

Update

Corrected the singleton implementation to reflect the comments on the article.

References

[1] Habrahabr: Using the singleton pattern
[2] Habrahabr: Singleton and the lifetime of an object
[3] Habrahabr: Dependency inversion and generative design patterns
[4] Final Committee Draft (FCD) of the C++0x standard
[5] C++ Standard — ANSI/ISO/IEC 14882:2003
[6] Ultimate++: a C++ cross-platform rapid application development framework
[7] Boost C++ libraries
[8] Wikipedia: Double-checked locking
[9] Meet the anti-pattern: double-checked locking
