A Comparison of Computational Frameworks: Spark, Dask, Snowflake, more

by Enrico Rotundo , Associate Data Scientist

Winder.AI worked with Protocol.AI to evaluate general-purpose computation frameworks. A summary of this work includes:

  • Comprehensive presentation evaluating the workflows and performance of each tool
  • A GitHub repository with benchmarks and sample applications
  • Documentation and summary video for Bacalhau documentation website

The video below summarizes the work.

Compute Landscape

The traditional compute landscape counts several dozens of frameworks capable of processing generic workloads. Some are specifically designed to take advantage of data locality by bringing the computation close to where data lives. This landscape analysis reviewed a selection of these tools to summarize their pros and cons.

The full slide deck contains a detailed overview of the compute frameworks and include sample code snippets.

The Python data stack includes tools like Pandas and Dask that offer a very convenient data structure named Dataframe, particularly suitable for handling tabular data.

The database world offers a variety of choices optimized for different use cases (e.g. tabular data, real-time time series, etc.). This research looked at Postgres and Snowflake, a couple of fairly generic tools in this space.

Big data tools like Apache Spark and Hadoop are also part of this analysis. They are capable of processing structured and unstructured data in very large clusters. This category introduced first the concept of data-locality to avoid data transfers over the cluster network.

Last but not least some web3 tools are also part of this analysis. They aim at supporting distributed storage and computation. Note that at the time of writing they’re under heavy development. In many cases, it’s still unclear how they work and what direction they’ll take in the future.

Unfortunately, many of these systems are far from being easy to operate on your localhost or at scale. Traditional frameworks are plagued by significant operational overhead resulting in inefficient resource usage. Moreover, there’s often a significant setup burden even to running a getting started guide, setting a relatively high barrier to entry.

The table below summarizes their score in terms of different requirements. That rating is based on the experience of setting up and running the code described in the next section, find more details on the slides.

Why is Bachalau not on this list?

This analysis is not a direct comparison between Bacalhau and existing frameworks. Instead, this research aims at helping the Bacalhau community to learn the benefits and drawbacks of traditional systems.

Code repository

Sample code

A good starting point to navigate the compute waters is taking a look at the code repository where you’ll find working examples of embarrassingly parallel workloads (e.g. word count, dataset aggregation, etc.). Take a look at the dedicated folder for viewing the demos in a notebook format, no installation is needed. Alternatively, you find the collection of examples on the slides.

It’s informative to compare the verbosity and complexity between APIs. For example, implementing a simple word count job in Pandas is concise and can be achieved just by chaining methods, while the Hadoop implementation is far less intuitive, mainly because it’s bound to use the Map-Reduce paradigm.

Setup instructions guide you through the installation process in case you’d like to run the examples yourself, and please give it a try to get an idea of how a simple single-node setup works.

Benchmarks

The code repository ships also benchmark scripts that run a parallel workload on a large dataset, time its execution, and log resource usage. Explore the related section to familiarize yourself with the rough edges of the installation process.

You can choose to spawn either a single-node or multi-node cluster. Trying out both options is particoularly instructive for a firsthand experience with the local-to-cluster hurdles, as well as facing the complexities in installing a framework such as Hadoop.

The benchmarked task is a word count job processing a dataset containing +1.7B words. The plot below reports the benchmark running time for each framework, a missing bar implies that the tool doesn’t support a fully-fledged multi-node set-up (i.e can only scale vertically). Performance across the landscape can vary 10x, that’s expected because Pandas is not a big-data tool, and Hadoop was not designed to perform well on a single-node setup. However, it’s surprising that only Spark and Snowflake provide a quite easy setup combined with quick processing and very low resource usage.

Check out the slides for a complete report on the benchmark results or dive into the code repository to spin up a cluster and run the benchmarks yourself.

Key findings

General

  • Modern frameworks must be intuitive and can’t impose a byzantine approach like MapReduce (see Hadoop). High-level API is a must.
  • Installation is often a hurdle, must be stripped down to allow anyone to complete a getting-started.
  • Declarative pipelines are not quite a thing in these technologies.
  • Support for repeatable containerized jobs is needed. All frameworks don’t provide an effective way to package dependencies (see Spark/Hadoop).
  • The local to multi-node cluster journey is quite rough and requires additional installations, configurations and a different approach to writing your code. This must be overcome going forward.

Benchmark

In Multi-node setups, you may need to optimize the cluster size, which is a bummer (see Dask, Hadoop)

  • Tested vanilla configurations: tweaking knobs improve performance but add complexity. Modern tools must provide an out-of-the-box experience (see Snowflake setup).
  • Except for Spark, these tools have poor performance with unsharded large files (i.e. strive to parallelize), thus a real-life use case will require additional data preparation upstream.
  • Good to keep in mind these tools are optimized to tackle different use-cases and a different task may vary results. Anyway, some strive to be as generic as possible (see Spark, Dask).

Contact

If this work is of interest to your organization, then we’d love to talk to you. Please get in touch with the sales team at Winder.AI and we can chat about how we can help you.

More articles

Scaling StableAudio.com Generative Models Globally with NVIDIA Triton & Sagemaker

Learn from the trials and tribulations of scaling audio diffusion models with NVIDIA's Triton Inference Server and AWS Sagemaker.

Read more

Big Data in LLMs with Retrieval-Augmented Generation (RAG)

Explore how Retrieval-Augmented Generation (RAG) enhances Language Models by utilizing indexing, retrieval, and generation for up-to-date data access.

Read more
}