Developing Parallel Tree-Based Operations for Applications That Use Large-Scale Tree Data Structures

Pranjayguleria
10 min read · Dec 6, 2022

Abstract

In today's era, the amount of available data is enormous, and with it comes the need to keep all this information organized. One of the most important data structures in algorithm design and programming is the balanced search tree, which is widely used for organizing data. Over time, large volumes of ever-changing data have thrown up many challenges, creating a need for parallelism, concurrency and persistence. In this blog we introduce techniques for supporting these capabilities on trees, including parallel algorithms, multi-versioning and concurrency. We focus on an algorithmic framework dedicated to parallel balanced binary trees that works across multiple balancing schemes, including red-black trees, weight-balanced trees, and treaps.

Introduction

In recent times, with developments in shared memory, multi-core machines have made it easy to perform heavy computations in parallel and in memory. As a result, it is highly necessary to have a data structure that is simple, efficient and parallel for organizing and processing data. One of the most famous such structures is the binary search tree (BST), used for organizing data and very helpful for maintaining abstract data types such as sets and maps. We will discuss methods and techniques for supporting parallelism and concurrency in BSTs for ordered map/set operations. We will introduce an algorithmic framework for parallel BSTs that builds everything on a single primitive, join. The balancing schemes to which this framework extends include red-black trees, AVL trees, treaps and weight-balanced trees, and all of the algorithms are generic across balancing schemes except join itself. In this blog we will demonstrate parallel solutions for bulk operations on trees, such as MAP, REDUCE, UNION and FILTER. All of these tree algorithms are work-efficient with poly-logarithmic parallel depth.

The parallel tree framework we are discussing is packaged into a C++ library named PAM (Parallel Augmented Maps). We will discuss how to use this library and framework to solve real-world problems. The tree structure is applicable in a variety of applications across different domains through an abstract data type called the augmented map. The library was implemented as a general-purpose parallel tree structure, and it has been applied directly in many applications, including rectangle search, inverted indices, 2D range queries and more. Using the library, each application requires only about a hundred lines of high-level code to obtain a highly optimized implementation.

In both theory and practice, the benefits of the library and the join framework lie in its efficiency, simplicity and generality.

  1. As discussed above, many real-world problems can be solved directly on top of the augmented map abstraction and the PAM library. This lets users adopt the tree structure as a black box for their own applications: they can work at a higher level of abstraction without worrying about implementation details, because the library already supports multi-versioning, concurrency and persistence.
  2. All of the algorithms except join apply across multiple balancing schemes. This reduces the workload of creating tree structures with special requirements when necessary, and it makes the framework resilient to other parallel algorithms and balancing schemes.

Background

For a while now, various parallelization methodologies have been developed. Our goal was to take the key components of a number of existing strategies and combine them into a well-engineered implementation that lets scientists build parallel tree-based applications in far less time than implementing them from scratch. The idea is to combine multiple parallelization strategies under a single roof that can serve a specific community effectively.

The main question is whether such a selective-deployment strategy can be extended from a specialized application into a more optimized, general-purpose tree library that is straightforward and scalable to use. We will also shed light on the pros and cons of the various parallelization techniques. The literature offers a broad range of methodologies for effective parallelization over tree data. Many compilers have been developed over the years to give programmers a clearer view of how the computation is performed. In the HPC arena, these were rarely implemented for shared-memory architectures, which are uncommon in HPC. Another serious issue is scaling beyond a thousand processors, as execution becomes dominated by the overhead of spawning and releasing computational threads.

Such a compiler works by starting execution along a single thread until it reaches instructions that can proceed in parallel. But the drawbacks are always there: spawning and synchronizing are expensive tasks, sometimes taking on the order of millions of CPU cycles on a distributed-memory machine. Some agenda-based parallel compilers provide higher-level capabilities. An example of this strategy is ZPL, which lets programmers manipulate n-dimensional shared arrays in a spatially aware manner. Unfortunately, not all applications can be expressed in this formalism, and we find ZPL too restrictive: the problem must be mapped onto the high-level instructions and structures provided by the compiler. ZPL shows that agenda-based parallelism scales well, provided each agenda item maps cleanly onto the computation.

Algorithms

We now present the join-based algorithmic framework. The join function operates on two balanced binary trees, T(L) and T(R), and an entry e, and produces a new balanced binary tree whose in-order traversal is the in-order traversal of T(L), then e, then the in-order traversal of T(R).

For search trees, the key of e must be larger than all keys in T(L) and smaller than all keys in T(R). Join alone handles rebalancing and rotations, so the other algorithms never need to deal with them. As a result, the algorithms are essentially identical across balancing schemes, and most of them are parallel.

Persistence and concurrency

Many tree structures support in-place concurrent updates, but they only enable atomicity for a single operation. Another option is transactions with multi-versioning, in which each concurrent thread works on its own version. The first approach uses version chains, where each tree node maintains its values and version history in a chain. The main drawback is that a read must check every version in the chain, which can be expensive. The second approach is more attractive: it leads to functional data structures, avoids updating the internal data of any existing tree node, and thereby prevents contention caused by concurrency. PAM uses path copying with a single writer, and it is also possible to batch updates and run them all in parallel.

In many cases, rather than using concurrent data structures, we can rely on batching bulk operations with the path-copying technique. The batch latency chosen for PAM is 50 ms, which is comparable to network latency and is therefore unlikely to dominate the cost. Compared with concurrent data structures, PAM with batching shows a large performance gain, especially for read-heavy workloads.

Explicit message passing

These libraries are beneficial because they provide generality and nearly limitless flexibility. They achieve this by requiring the developer to program at an even lower level than with a parallel compiler. Since the programmer is in control of the communication, they can use all their skill to maximize granularity and reduce network latency. For these reasons, most applications running on modern MPPs use message-passing libraries, and MPI is the most common one in use today. But everything has drawbacks, and with MPI the main one is that development is very time-consuming. Computation becomes more and more difficult when the thread domains are decomposed over structures more complex than grids, because it then becomes very hard to make use of MPI's collective communication facilities with ease. An interesting fact about MPI is that it enables highly asynchronous execution, yet the more synchronous an approach is, the easier it is to express in MPI. As one's algorithm approaches the ideal of few barriers, implementing it in MPI becomes very challenging.

Remote method invocation

ARMI is an inter-processor messaging paradigm implemented as a C++ library, organized around launching procedures remotely. MPI takes a data-centric view, in which information is transmitted from one point to another; with RMI, we instead pass an executable procedure, together with its data, between physical locations. Note, however, that the two approaches are interchangeable. One benefit ARMI has over MPI is that it can aggregate multiple remote calls into a single message, reducing communication overhead. But programming in ARMI does not by itself guarantee scalability: navigating a data structure is essentially a serial operation, in which one looks at a node or level and uses that information to advance to the next location. So message aggregation alone does not give us extreme scalability for trees. We will see how RMI provides a mechanism for an agenda-based approach that invokes individual functions on various processors, rather than relying on the ones provided by the compiler.

Conclusion

In conclusion, the PAM library, used with the augmented map, can be applied extensively across many different applications. Along with an efficient implementation, the code it requires is concise. We discussed 2D range queries using PAM and compared it with sequential libraries such as Boost and CGAL; PAM outperforms both, with speedups roughly in the range of 60–100.

The tree implementation is efficient and easy to use for serial computations. Ntropy, on the other hand, provides a seamless upgrade path for researchers, allowing them to move any application from their workstation onto a parallel supercomputer and reducing development time for data analysis. Ntropy also enables knowledge discovery on massive datasets of points. For load-balancing purposes, virtualization of divide-and-conquer tree walks can be highly useful, and these are implemented on top of a data-caching mechanism.

Finally, for programmers who want to implement their tree walks in this coarse-grained fashion, RMI is highly useful for creating an agenda-based approach to parallelism.
