lundi 23 avril 2012

MultiCore - Benchmarks - MemoryAllocation

When writing multi-threaded code, you face different challenges.


One of them is Memory Management.


0. A little story.



You need to protect your shared variables.
So you lock around them. With no deadlock, cause you are a hero.
Easy peasy.



Then you need better performances, because locking can kill you. Well, at least your perf.
So you compare different locking solutions. [cf previous post]
And you use the best available, according to your context.
Cool.



But you still need to share variables. And that costs a lot.
So you avoid shared variables as much as you can. You rethink your algorithms.
OK.



But you STILL need to share variables, at least some of them.
So you try atomic operations and lockfree code.
Great.

Not to mention waitfree code.
Duh!

And you try the nifty lock free list, among others.
Yeay.



But then you need to allocate/free nodes from multiple threads, and memory management in a multithread environment can kill your perf.
So you investigate a little more, and try different solutions, in different contexts, and get some numbers.
You love numbers.


Here are some of those tests...




1. Memory allocation methods


We will compare 5 different solutions.


a. Description of the solutions


Solution 1 : the default new / delete

  • That's it.

Solution 2 : a lockfree memory pool - the holy grail ?

  • An array of data blocks.
  • An array of integers to index the blocks : 0 is free, 1 is used; 
  • When we search a free block index, we start searching after the latest used index - round robin like.
  • Atomic operations are used to modify the indices, with loops in order to retry if the state of the index has been changed before we could complete the transaction - typical lockfree stuff.

Solution 3 : locking a naive memory pool - to compare the lockfree one

  • An array of data blocks.
  • An array of integers to index the blocks : 0 is free, 1 is used;
  • When we search a free block index, we start searching after the latest used index - round robin like.
  • We lock the array of indices when we access it (allocate/free)

Solution 4 : locking a memory pool - lower foot print, faster access

  • An array of data blocks.
  • An array of bitfields to index the blocks : a bitfield is an 32 bits integer that indexes 32 blocks; a bit at 0 is "used", 1 means "free".
  • When we search a free block index bit, we start searching after the latest fully used bitfield index - round robin like.
  • We lock the array of bitfields when we access it (allocate/free)

Solution 5 : locking a constant access time memory pool - fastest access ?

  • An array of data blocks.
  • An hierarchy of bitfields, stored in a flat array : each bitfield at a given level of the hierarchy indicates the state of its corresponding lower level, 0 means "used", 1 means "free or partially free"
  • The lowest level works just like in solution 4.
  • We lock the hierarchy of bitfields when we access it (allocate/free)
  • We lock the array of bitfields when we access it (allocate/free)



b. Pros (+) and cons (-) of these solutions


Under our lense : locking, memory footprint, searching a free index, fragmentation.


Solution 1 : the default new / delete

  • ... is here as a reference.

Solution 2 : lockfree memory pool

  • locking : (+) has no locks
    (-) but retries can be expensive 
  • memory footprint : (-) 32bits integer per index is required for atomic operations (where 1 bit is required to encode the free/used state)
  • searching a free index : (+) it can be fast if the "next indices" are free
    (-) but it can/will be slow otherwise, since we will have frequent cache misses while iterating over 32bits integers 
    • especially if the pool is almost full; worst case : we will go through all indices
  • fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan 
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 3 : locking a naive memory pool - to compare to the lockfree one

Just like Solution 2, except for the lockfree retries that are replaced by a raw "lock the whole indices array" that will work on the first try.
  • locking :   (+) no retries !(-) has locks
  • memory footprint : (-) 32bits integer per index (where 1 bit is required to encode the free/used state of 1 block)
  • searching a free index : (+) it can be fast if the "next indices" are free
    (-) but it can/will be slow otherwise, since we iterate over the 32bits indices, which triggers some frequent cache misses
    • especially if the pool is almost full; worst case : we will go through all indices
  • fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan 
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 4 : locking a memory pool - lower foot print, faster access

Just like Solution 3, except that we use 1 bit to encode the "free"/"used" state, leading to minimal footprint & minimized cache misses.
  • locking : (+) no retries !
    (-) has locks
  • memory footprint : (+) 1bit per index is optimal
  • searching a free index : (+) we search through 32 indices at the same time by looking at an 32bits integer (bitfield)
    (+) it can be fast if the "next indices" are free
    (-) but it can be can/slow otherwise, since we iterate over the 32bits bitfields, which triggers some cache misses
    • especially if the pool is almost full; worst case : we will go through all indices
  • fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan 
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 5 : locking a constant access time memory pool - fastest access ?

  • locking : (+) no retries !
    (-) has locks
  • memory footprint :  (+) > 1bit per index is great
  • searching a free index :
    (+) alwayse in constant time, whatever the state of the pool : we always have to go through the whole hierarchy of bitfields once.
    (-) the bigger the pool, the more levels in the hierarchy
    (-) constant time is great for the general case, but can still be slower than Solution 4 when "the pool would behave like a nicely balanced producer-consumer queue".
  • fragmentation :
    (+) it will keep the pool fragmentation to a minimum, each time reusing the first free index





2. Tests & results



System w/ 8 CPU cores, on Win7 64bits
Nb threads : 8
Nb allocations / frees per thread : 1 000 000
Allocated data block size : 12 bytes.

We will run the tests in 2 different "memory contexts" :

Context (F) : in a clean & fresh "memory environment" :
  • no allocations have been made prior to the test
Context  (P) : in a polluted context :
  • for memory pools, we set to 'used' half of the indices, interleaving a 'used' slot with a 'free' slot along the whole pool.
  • for the default allocator, we allocate half of the targeted max allocation count.



Test 1 : short life allocations - each thread : N times (allocate and free immediately)


Solution Total time (s)Time per task (s)
(F)S2 - lockfree1.1171340.000000
(P)S2 - lockfree1.5002970.000000
(P)S3 - lock naive pool3.0025920.000000
(F)S3 - lock naive pool3.0505320.000000
(F)S4 - lock 1bit pool3.0805930.000000
(P)S4 - lock 1bit pool3.1261270.000000
(P)S5 - lock constant access pool4.0813110.000001
(F)S5 - lock constant access pool4.1912930.000001
(F)S1 - default8.0337400.000001
(P)S1 - default14.3086950.000002







Test 2 : long life allocations - each thread : N times (allocate) then N times(free)


SolutionTotal time (s)Time per task (s)
(F) S2 - lockfree1.1581070.000000
(P) S2 - lockfree1.4307780.000000
(P) S3 - lock naive pool3.1205700.000000
(F) S3 - lock naive pool3.1555540.000000
(F) S4 - lock 1bit pool3.3767850.000000
(P) S4 - lock 1bit pool3.3826940.000000
(F) S5 - lock constant access pool4.1308110.000001
(P) S5 - lock constant access pool5.5456040.000001
(F) S1 - default249.5863650.000036
(P) S1 - default263.0467830.000038






Without the default allocator :




Test 3 : like Test 2 but on a single thread - no locking
- just because we might still have single thread context here and there !


Solution Total time (s)Time per task (s)
(F) S3 - naive pool0.0270420.000000
(F) S4 - 1bit pool0.0301310.000000
(P) S3 - naive pool0.0308180.000000
(P) S4 - 1bit pool0.0313710.000000
(F) S5 - constant access pool0.0704100.000000
(P) S5 - constant access pool0.0704720.000000
(F) S2 - lockfree0.0793830.000000
(P) S2 - lockfree0.0964210.000000
(F) S1 - default64.3395460.000009
(P) S1 - default67.7448810.000010









Without the default allocator :






3. Quick analyze




In the multi-threaded Tests, we have a few obvious facts :

  1. Interestingly, whatever the number of allocations tested, the results are pretty stable.
  2. The LockFreeMemoryPool outperformed any other solution, in all situations.
    And that's a great win \o/
  3. The default alloc/free is always the slower, by far - and that's normal, it fits a general purpose, so you can't expect it to compete with specific solutions like the various pool allocators
    But it also cannot handle huge amount of allocations properly : it totally sank with the 8 times 1 000 000 operations in polluted environment.
  4. The HBFIMemoryPool generally works with constant allocation times indeed, but it's always the worst pool solution in these tests.
    ... I guess the implementation could be optimized.


Moreover, the NaiveMemoryPool did pretty good. Usually better than the (1bit) MemoryPool. 
Since the MemoryPool and the HBFIMemoryPool both do bits manipulation, I think this can be the bottleneck.


Of course, had I provided another type of pollution, such as 'the first half of the pool is already used, and we start searching for free indices at the beginning', we could have seen different results.
In such case, there would be much more cache misses for some of the memory pools, and maybe the HBFI would have been a little more powerful.




When looking at the single threaded test, we clearly see the cost of the atomic operations involved in the LockFreeMemoryPool, especially when compared to its unsafe equivalent, the NaiveMemoryPool.
I'd wish the LockFreeMemoryPool had outperformed any other solutions, in all contexts, even when running on a single thread alone.
But that's not how it works.




It really proves, if that was still unclear to some of you, how important the context is to choose the right tool.
Always make a decision with the context in mind.
Always.
Period.


PS : I would like to try a TLS based solution some day... (who said hazard pointers ?)
PPS : Damn, this is a long post >.<


mardi 17 avril 2012

MultiCore -- Benchmark 01 - Increment shared variable



Quick test on a simple scenario.

Description 

I have a shared integer, that every thread tries to increment multiple times.
The information I need is the final value, sum of all these increments.

I prevent over subscription and under subscription by having as many software threads as logical cores.
I use basic thread functions, so there is no particular overhead (no task scheduling or whatsoever).

Proposed solutions 

solution 1 : use a CRITICAL_SECTION to protect the increment operation
solution 2 : use a SRWLOCK to protect the increment operation
solution 3 : use a custom SpinLock to protect the increment operation
solution 4 : use an atomic operation and no lock
solution 5 : use TLS (thread local storage) to have a local counter per thread, and sync the results with an atomic operation in the end

Results 

I ran the test multiple times, and always had about the same numbers.
Here are the results.

#######################################################################
 Infos summary :
        System w/ 8 CPU cores
        Nb 'tasks' : 80 000 000
        Nb (worker) threads : 8
        Average work load per thread : 10 000 000 tasks
=============================================================  
Results summary :
TEST_TLS_ATOMIC 0.295073 s
TEST_SRWLOCK 2.548695 s
TEST_ATOMIC 2.701041 s
TEST_CRITICAL_SECTION7.383443 s
TEST_SPINLOCK18.363033 s

#######################################################################

Explanations 

With no surprise, solution 5 (TLS) wins, since there is almost no contention.

On the contrary, I was quite surprised to see that the SWRLOCK solution was about the same speed as the atomic increment one (sometimes slightly faster, sometimes slightly slower).
The only explanation I have is that each atomic operation inccurs a cache update & flush, so here we have 80M of these.
On the other hand, using SRWLOCK might lead to fewer cache update & flush, e.g. if the same thread increments the shared value multiple times before another thread manages to grab the lock.

And the SpinLock... damn, this is so slow. I wonder if my (naive) implementation is right >.<.


jeudi 12 avril 2012

MultiCore -- CRITICAL_SECTION are damn fast !


When talking about multicore programming (or "threaded programming", if you prefer), one of the first things that comes to mind is "mutex", "semaphore", and such.
In other words, synchronization primitives.

I won't go into much details or explanations here. I will rather assume you have a decent knowledge of all that.

As I was working on my TaskManager, hoping to make it lockfree, I was really surprised at how efficient the CRITICAL_SECTION can be.

After watching a video conf from Intel where they claimed how Microsoft new CONDITION_VARIABLE and SWRLOCK were super efficient, I ran a few benchmarks with various "safety" mechanisms, some locking while others lock-free.
The hardware was an Intel 4-core i7 with hyperthreading. So basically, 8 hardware threads.


Obviously CRITICAL_SECTION are a lot faster than (the aging) Windows Mutex mechanism.

But CRITICAL_SECTION were also :

  • faster than Windows SWRLOCK in exclusive mode
    -- this is surprising considering the MSDN docs, and those tests http://nasutechtips.blogspot.fr/2010/11/slim-read-write-srw-locks.html
  • faster than my custom SpinLock
    -- yeah sure that can depend on the kind of operations you are running concurrently... but still.


Even more surprising :
  • the pair {std::list<int>, CRITICAL_SECTION}, with search in the list, was even faster than my LockFreeQueue<T> with no search.


God donut !  >.<


I'll try to gather the data and present them properly here - in an update.
I wish I could run some tests on other hardware & OS.



Welcome !

This blog is all about programming, and more specifically about game related programming.

Preferred topics :

  • graphics rendering
  • animation
  • multi-core programming
  • and various optimization stuff
Ok, here we go...