I will start with a little note on large scale graph processing, as described in the paper “One Trillion Edges: Graph Processing at Facebook Scale”.
Graph processing tasks are very common in analyzing various kinds of data, such as network topology of interconnection of people. Social media imposes challenges for such systems and algorithms due to the sheer amount of data. For instance, Facebook has more than 1.39 billion registered users and each users can represented by a vertex in a graph, making the graph consist of billions or even a trillion edges. This explains Facebook’s high interest in large scale graph processing systems.
Researchers at Facebook investigated a number of existing frameworks and systems for graph processing on the large scale and realized that none of the existing approaches were capable of handling the graphs of the Facebook scale. Thus, they decided to take action and develop a system adequate for the needs of the company. The result of their work was a significant contribution to the Apache Giraph project, greatly improving its performance and scalability.
Apache Giraph was chosen as the basis of Facebook’s graph processing system for a number of reasons:
- Good initial performance compared to the competing products
- Giraph is written in Java and facilitates the use of existing MapReduce infrastructure
- Ease of debugging due to the underlying graph processing mode used
Facebook introduced many significant improvements to the Apache Giraph project both in terms of functionality and performance. Among the changes are the ability to use multiple sources for vertex and edge data, more fine-grained parallelism, memory utilization optimizations, addition of a more refined worker phases and introduction of centralized master computation.
Ability to use multiple sources for the input data is important for many large scale users, as the data required for a graph may originate from different sources or stored on various hardware and software stacks. Such feature, although not directly affecting the graph computation, can significantly reduce the time needed to load the graph structure by eliminating the need of data aggregation jobs that bring data from multiple sources into one storage infrastructure.
One of the performance improvements was in how the system handles parallelism. Since Apache Giraph is built on top of the MapReduce framework, the only way to provide more parallelism to the computation in early versions of the system was by increasing the number of worker nodes. Facebook changed Giraph to allow more fine grained parallelism on the work node level by allowing each worker execute tasks with multiple threads thus better utilizing multiple CPUs available on the worker machines. Such model introduces a second level of parallelism, so in addition to running multiple workers, each workers can handle its own parallel execution. Although such model allows for more fine-grained tune up of the system and improvements in hardware utilization, it reduces the overall maintainability of the system and makes development and debugging more complicated due to the existence of two distinct parallel execution mechanisms.
Apache Giraph is based on the Bulk Synchronous Processing (BSP) model of parallel computations. Although the model is widely used in many applications, Facebook introduced many extensions to it in the Apache Giraph Project. In particular, the addition of additional compute stages allows for a better code management and overall structure and allows for better optimization of the tasks executing in the system. Master computation enables users to perform certain centralized computation in predefined time intervals. Such master computation is useful when a certain piece of code needs to execute only once and does not need to run on the worker nodes, since master can communicate the results to the workers. In addition, master node computation allows for some synchronization between the nodes, because it relies on the clock of only the master machine to run. As such it allows to partially address, the clock skew problem in worker machines, although such synchronization is still very coarse.
Performance of the Apache Giraph was tested against Apache Hive tasks in the PageRank, label propagation and computing friends of friends score running on the same infrastructure. All of these algorithms showed significant improvements in CPU time and elapsed time as perceived by user when executed using on the Apache Giraph stack compared to Hive. Unfortunately, no comparison with other specialized graph processing systems was provided, so it is not clear how it fares against the major competition. But the fact that a PageRank algorithm was able to run over one trillion edges in less than 3 minutes per iteration on 200 machines sounds impressive.