Random Thoughts on Distributed and Grid Computing
Prelude
Recently, some discussions (#1, #2) about how to speed up the BackTrader backtesting framework caught my eye.
With more investigation (#1, #2, #3), a distributed framework like Dask seems to be the best answer to the discussion above, and a new term, Grid Computing, also showed up many times.
And all this technical reading brought back the memory of playing with Machine Learning models on the Spark framework back in college years ago.
Definitions
- Distributed computing uses a centralized resource manager and all nodes cooperatively work together as a single unified resource or a system. (e.g. Distributed Backtesting cluster built on Dask)
- Grid computing utilizes a structure where each node has its own resource manager and the system does not act as a single unit. (e.g. World Computer built on Ethereum)
Which Stack?
+-------------------------+
| Spark / Dask / Hadoop |
| | Layer 1
| Data Processing |
+-------------------------+
+-------------------------+
| Yarn / Mesos / Slurm |
| | Layer 2
| Resource Management |
+-------------------------+
+-------------------------+
| HDFS / S3 / GCS |
| | Layer 3
| Distributed Storage |
+-------------------------+
From my perspective, every distributed computing cluster needs at least three main layers:
- Data Processing Layer for creating data processing jobs.
- Resource Management Layer for actually scheduling and executing all the distributed jobs.
- Distributed Storage Layer for large-file scale-up and replication.
And there are two main stacks, (Spark + Yarn + HDFS) and (Dask + Slurm + HDFS), for finishing a distributed task.
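To make the layering concrete, here is a minimal sketch of wiring the Dask data processing layer to a Slurm resource management layer through the dask_jobqueue package. All resource numbers and the partition name are illustrative assumptions, not values taken from any real cluster.

```python
# Sketch: Dask (Layer 1, data processing) running on top of Slurm
# (Layer 2, resource management) via dask_jobqueue.
# Queue name, cores, memory, and walltime are made-up assumptions.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="compute",        # hypothetical Slurm partition
    cores=8,                # cores per Slurm job
    memory="16GB",          # memory per Slurm job
    walltime="01:00:00",
)
cluster.scale(jobs=4)       # ask Slurm to start 4 worker jobs

client = Client(cluster)    # connect the Dask scheduler to those workers

# Any Dask collection (array, dataframe, bag, delayed) submitted through
# this client is now scheduled by Slurm across the cluster.
```

The distributed storage layer (HDFS, S3, GCS) sits underneath and is simply whatever filesystem the workers read their input chunks from.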
Slurm versus Yarn
From Slurm Roadmap 2013:
- Work being performed by Intel
- Eliminates need for dedicated Hadoop cluster
- Better scalability
  - Launch: Hadoop/YARN (~N), Slurm (~log N)
  - Wireup: Hadoop/YARN (~N²), Slurm (~log N)
- No modifications to Hadoop
- Completely transparent to existing applications
Hmm, a tremendous difference. There is no doubt that Slurm had better performance, at least back in 2013, and intuitively, the JVM adds more overhead than a pure C application.
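A quick back-of-the-envelope calculation shows how large that gap gets at scale. The node count below is a hypothetical figure chosen only to illustrate the ~N versus ~log N claim quoted above.

```python
import math

# Illustrative comparison of the launch/wireup scaling quoted from the
# Slurm Roadmap 2013, for a hypothetical 10,000-node cluster.
n = 10_000

print(f"Launch - YARN  ~N     : {n:,} steps")            # 10,000 steps
print(f"Launch - Slurm ~log N : {math.log2(n):.1f} steps")  # ~13.3 steps
print(f"Wireup - YARN  ~N^2   : {n**2:,} steps")         # 100,000,000 steps
print(f"Wireup - Slurm ~log N : {math.log2(n):.1f} steps")  # ~13.3 steps
```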
Dask versus Spark
From A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines:
Overall, our results show no substantial performance difference between the engines. Interestingly, differences in engine overheads do not impact performance due to their impact on data transfer time: higher overheads are almost exactly compensated by a lower transfer time when data transfers saturate the bandwidth. These results suggest that future research should focus on strategies to reduce the impact of data transfers on applications.
The main bottleneck is bandwidth, and the two computation frameworks show roughly equal power.
But for the sake of flexibility and productivity, Dask offers more data models for users from different backgrounds, even Spark users, as the short sketch after the list below shows.
Dask provides:
- `dask.array` to embrace NumPy/XArray users
- `dask.dataframe` to embrace Pandas users
- `dask.bag` to embrace Spark/PySpark users
- `dask_ml` to embrace Scikit-Learn users
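A minimal sketch of how three of these collections mirror the APIs their users already know; the toy data is made up purely for illustration.

```python
# Each Dask collection mimics the API of the library it "embraces".
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import pandas as pd

# dask.array: NumPy-style chunked arrays
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# dask.dataframe: Pandas-style partitioned dataframes
pdf = pd.DataFrame({"ticker": ["AAPL", "MSFT"] * 3, "ret": range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("ticker").ret.mean().compute())

# dask.bag: Spark-RDD-style functional operations on generic Python objects
bag = db.from_sequence(range(10))
print(bag.map(lambda i: i * i).filter(lambda i: i % 2 == 0).sum().compute())
```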
Conclusion
After some non-rigorous research and code-playing, I think Dask + Slurm is the best fit for me to set up a distributed computing engine, especially for financial modelling.
If this stack is still not fast enough, then we can migrate from CPU to GPU, which is also even more expensive.
There are also some interesting stacks like Berkeley Ray and Son of Grid Engine that deserve my time to find out their shining parts, but, aye, that should be another story for later :)
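For the kind of workload that started this whole detour, the pattern is simply fanning a backtest parameter sweep out across the cluster. In the sketch below, `run_backtest` and the parameter grid are hypothetical placeholders, not BackTrader API; in practice that function would run a Cerebro instance and return a real performance metric.

```python
# Sketch: distributing a backtest parameter sweep over the Dask + Slurm
# cluster set up earlier. `run_backtest` is a made-up placeholder.
from dask.distributed import Client

client = Client()  # or Client(cluster) from the SLURMCluster sketch above

def run_backtest(fast, slow):
    # Placeholder: a real version would configure and run a BackTrader
    # strategy with these moving-average windows and return, e.g., Sharpe.
    return {"fast": fast, "slow": slow, "sharpe": (slow - fast) / slow}

params = [(f, s) for f in range(5, 30, 5) for s in range(30, 120, 30) if f < s]

futures = client.map(lambda p: run_backtest(*p), params)
results = client.gather(futures)
best = max(results, key=lambda r: r["sharpe"])
print(best)
```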