## Automatic parallelism and data dependency

After a whole year I have finally found the time to start working on some very interesting features for bp. I developed a data dependency parser that analyzes how data flows through your program and which values depend on which others.

This is the GraphViz output of what I call a DTree (short for Dependency Tree). The little test program had a few variables in the main function whose values depend on each other in a certain pattern. Note that whenever the tree splits into two subtrees, the compiler would be able to parallelize them automatically.

Let’s say the compiler checks the DTree and finds that a function calculates the values of 4 variables which do not depend on each other. Combined with the knowledge that the calculation of each of these 4 subtrees has no side effects, plus a “sequential time guess” to verify that the speedup is worth the overhead of multiple threads, the compiler can automatically parallelize such functions. I call this parallelizing algorithm “DTree splitting”.
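To make the idea concrete, here is a minimal Python sketch of what the transformed function would behave like. bp would of course do this at the compiler level; the four subtree functions below are hypothetical stand-ins for four independent, side-effect-free subtrees found in the DTree.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for four side-effect-free subtree
# computations that share no data with each other.
def subtree_a(): return sum(i * i for i in range(1000))
def subtree_b(): return sum(i * 3 for i in range(1000))
def subtree_c(): return max(i % 7 for i in range(1000))
def subtree_d(): return min(i % 5 + 1 for i in range(1000))

def dtree_split():
    # The DTree shows no edges between the four subtrees, so each
    # one can be evaluated on its own thread without any locking.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(f)
                   for f in (subtree_a, subtree_b, subtree_c, subtree_d)]
        a, b, c, d = (f.result() for f in futures)
    # The root node depends on all four subtrees, so it is computed
    # only after the parallel section has joined.
    return a + b + c + d
```

The result is identical to the sequential evaluation; only the schedule changes.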

Obviously this fine-grained parallelism (which gave very satisfying speedups on my dual-core laptop) is not scalable: the previous example does not scale beyond 4 processors. I therefore aim for a combination of fine-grained and coarse-grained parallelism, because the latter can easily provide what the former lacks: scalability.

```
for i = 0 to n - 1
    search("some text", dataBase[i])
```

The typical “search in a database” example is, in principle, infinitely scalable. Its sequential runtime $O(n)$ can turn into $O(\frac{n}{p})$ when the work is distributed across multiple threads, where $p$ stands for the number of processor cores. It is definitely possible to let a compiler figure out by itself whether it can safely replace this loop with a coarse-grained parallel routine (spawn p threads working on n/p array cells each). The compiler does not know whether n is large enough to justify spawning multiple worker threads, but there are two possible ways to estimate the runtime of the loop:
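The “spawn p threads working on n/p array cells each” transformation can be sketched in a few lines of Python. The function names and the substring-based search are illustrative, not bp output; the point is that each thread writes only to its own slot, so the search phase itself needs no synchronization.

```python
import threading

def parallel_search(needle, database, p=4):
    # Spawn p threads, each scanning a contiguous slice of ~n/p cells.
    n = len(database)
    chunk = (n + p - 1) // p
    results = [[] for _ in range(p)]

    def worker(t):
        lo, hi = t * chunk, min((t + 1) * chunk, n)
        # Each thread appends only to its own result list: no shared
        # mutable state, hence no locking while searching.
        results[t] = [i for i in range(lo, hi) if needle in database[i]]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(p)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # Concatenating the per-thread lists preserves the original order.
    return [i for part in results for i in part]
```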

• Add “real runtime in milliseconds” metadata to each instruction, updated whenever the program is run in Debug mode, so that we have a rough estimate of how much time the loop usually consumes
• Simply add an if n >= MINIMUM_ITERATIONS check to the generated code which switches to sequential computation when n is too low
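The second option amounts to a guard the compiler inserts in front of the loop. A small Python sketch, where the cutoff value and the stub for the parallel routine are both hypothetical:

```python
MINIMUM_ITERATIONS = 10_000  # hypothetical cutoff; in practice tuned per machine

def search_sequential(needle, database):
    return [i for i, row in enumerate(database) if needle in row]

def search_parallel(needle, database):
    # Stand-in for the coarse-grained routine
    # (p threads working on n/p cells each).
    return search_sequential(needle, database)

def search_auto(needle, database):
    # Compiler-inserted guard: fall back to the plain sequential loop
    # when n is too small to amortize the thread-spawning overhead.
    if len(database) >= MINIMUM_ITERATIONS:
        return search_parallel(needle, database)
    return search_sequential(needle, database)
```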

The only problem is that most current languages (Clojure being an exception) force you to write object-oriented code with mutable, stateful data structures. Modifying that state from multiple threads can be done safely via several techniques such as Software Transactional Memory, atomic operations, condition variables, barriers and locks; however, synchronization between threads is very costly and should be avoided in general. Note that the fine-grained parallelism resulting from an analysis of the DTree requires no data synchronization at all.

The automated coarse-grained parallelism, however, will most likely use one of the techniques mentioned above to ensure that loops operating on mutable data can still be multithreaded when run on a machine with multiple processors.

An alternative approach to automatic parallelism, one which involves quite a lot of synchronization overhead, is to start looking at the DTree not from the last node (the result of the calculation) but from its leaves:

1. Select all nodes in the whole program which do not depend on any other data (the leaves of the tree, i.e. nodes without children).
2. Calculate each of them in a separate thread.
3. Call this “Step X” and place a barrier that lets the calculation proceed only after all threads have finished their work.
4. Select all the variables (nodes) whose values depend on the variables computed in the previous step.
5. Repeat this algorithm from the second action on.

Instead of splitting the DTree, you define the steps the program has to go through, much like in any sequential algorithm, except that the instructions within one step are executed in parallel.

After each step, all threads have to wait for every other thread to finish its work. This synchronization can be realized via barriers (e.g. boost::barrier for all those C++ friends), which are themselves implemented on top of condition variables. The downside of what I call “step-based parallelism” is that its use is limited: the synchronization overhead is only justified if each thread has enough work to do per step. In contrast to DTree splitting, which can be used in any kind of program, this type of parallelism only fits Input -> Output style programs which do nothing but compute something and print the output on the console. It cannot be used for game development, because games are not programs that calculate something once and then terminate; they consist of a “main loop” which draws the current frame and keeps doing so until the player is satisfied.
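The five steps above can be sketched in Python with a level-by-level evaluation of a toy DTree. The tree and its node functions are made up for illustration; gathering all futures of a step before selecting the next one plays the role of the barrier.

```python
from concurrent.futures import ThreadPoolExecutor

# A toy DTree: node -> (set of dependencies, function of their values).
# Leaf nodes have no dependencies (step 1 of the algorithm).
dtree = {
    "a": (set(), lambda: 2),
    "b": (set(), lambda: 3),
    "c": ({"a", "b"}, lambda a, b: a + b),
    "d": ({"a"}, lambda a: a * 10),
    "result": ({"c", "d"}, lambda c, d: c * d),
}

def step_based_eval(tree):
    values, done = {}, set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tree):
            # Select every node whose dependencies were all computed
            # in earlier steps (steps 1 and 4).
            ready = [n for n, (deps, _) in tree.items()
                     if n not in done and deps <= done]
            # Calculate all nodes of this step in parallel (step 2).
            futures = {
                n: pool.submit(tree[n][1],
                               *(values[d] for d in sorted(tree[n][0])))
                for n in ready
            }
            # Waiting for every future acts as the barrier (step 3).
            for n, f in futures.items():
                values[n] = f.result()
            done |= set(ready)
    return values
```

Here step 1 computes a and b, step 2 computes c and d, and the final step computes result; dependencies are passed positionally in sorted name order, a simplification that a real compiler would not need.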

I like automated parallelism because you can still write code the old-fashioned sequential way and let it profit from being run on multiple processor cores. While programming, you only have to keep in mind that immutable data structures and functions without side effects will make parallelizing a lot easier for the compiler. It’s not a “Hey-I-Don’t-Care-About-You-I-Just-Write-Some-Sequential-Code-And-You-Parallelize-It-Okay?” laid-back style, because in the end only a few parts of your program can be parallelized, and your programming style will heavily determine how much of a performance boost you can gain from it. DTree splitting (fine-grained parallelism), which was shown earlier in this article, will be included in bp, and I am pretty sure that by combining it with coarse-grained parallelism and STM, blitzprog can achieve a remarkable performance.

Posted Tuesday, April 24th, 2012 under Posts.