Nvidia GPUs for data science, analytics, and distributed machine learning using Python with Dask
Nvidia wants to extend the success of the GPU beyond graphics and deep learning to the full data science experience. Open source Python library Dask is the key to this.
Nvidia has been more than a hardware company for a long time. As its GPUs are broadly used to run machine learning workloads, machine learning has become a key priority for Nvidia. In its GTC event this week, Nvidia made a number of related points, aiming to build on machine learning and extend to data science and analytics.
Nvidia wants to “couple software and hardware to deliver the advances in computing power needed to transform data into insights and intelligence.” Jensen Huang, Nvidia CEO, emphasized the collaborative aspect between chip architecture, systems, algorithms and applications.
Therefore, Nvidia is focused on building a GPU developer ecosystem. According to Huang, the GPU developer ecosystem is growing fast: the number of developers has grown to more than 1.2 million from 800,000 last year. And what can you build a developer ecosystem on? Open source software.
Nvidia announced the CUDA-X AI SDK for GPU-Accelerated Data Science at GTC. Nvidia touts CUDA-X AI as an end-to-end platform for the acceleration of data science, covering many parts of the data science workflow. The goal is to use GPUs to parallelize those tasks to the degree possible, thus speeding them up.
A key part of CUDA-X AI is RAPIDS. RAPIDS is a suite of open-source software libraries for executing end-to-end data science and analytics pipelines entirely on GPUs. And a key part of RAPIDS is Dask. Dask is an open source framework whose goal is to natively scale Python.
As Python is the language of choice for most data science work, you can see why Nvidia chose to make this a key part of its strategy. ZDNet had a Q&A with Dask creator, Matthew Rocklin, who recently started working for Nvidia.
Rocklin said that Dask grew out of the PyData ecosystem (Pandas, NumPy, Scikit-Learn, Jupyter, etc). PyData is well loved by data scientists both because it’s fast (mostly written in C/C++) and easy for non-experts to use. The problem was that PyData generally didn’t scale well, which was a limitation as dataset sizes increased.
Dask was designed to scale the existing PyData projects and bring them into parallel computing, without requiring a complete rewrite. Rocklin started off with Dask while working at Anaconda Inc (called Continuum at the time), but Dask quickly grew to be a community project with contributors from most major PyData libraries.
Anaconda was invested in the project because they had many clients who loved Pandas, Scikit-Learn, and friends, but were struggling with larger data sizes. Rocklin said there was clearly both a community and a market need to make PyData handle larger scales, but there were also many technical and social challenges:
“We had to plug holes in the ecosystem that were missing on the big data side. There were a number of technologies in the big data space that weren’t really available in Python at the time: data formats like Parquet, Orc, and Avro, data storage systems like HDFS, GCS, S3, deployment solutions like YARN.”
He went on to add that in the Python ecosystem, many technologies had little to no support, rendering them not really useful in a performance-oriented context. Dask started out focusing on the single-node heavy workstation use case. The goal was to scale Python to all of the cores of a modern CPU and extend the data size limit from the amount of available RAM (~100GB) to the amount of available disk (~10TB) on single workstations.
At the time, this was viewed as being a sweet-spot of high-productivity. Rocklin said that “Dask was Python-native and easy to work with and lots of groups that had Numpy/Pandas/Scikit-learn style problems at this scale (which was most people) really liked it.”
He went on to add, however, that people’s ability to scale out to larger datasets seems to be insatiable, so after about a year Dask was extended to work in a distributed computing environment which has more or less the same scaling properties as a system like Spark.
Apache Spark is a hugely popular open source platform for data science and machine learning, commercially supported by Databricks. Spark supports in-memory processing and scales well via clustering. The problem with Spark for many Python users, however, is that support for Python is limited, and deploying and tuning clusters is not trivial.
Dask was developed to address this. So how do Dask and Spark compare? Rocklin said this is a question he gets often, so often a detailed answer for this exists. The takeaway is that Dask is smaller and more lightweight. Dask has fewer features than Spark and is intended to be used in conjunction with other Python libraries.
DASK HELPS ACCELERATE THE ENTIRE DATA SCIENTIST WORKFLOW WITH GPUS
Recently, Rocklin joined Nvidia in order to help scale out RAPIDS with Dask. He noted that Nvidia had already been doing this for a while, but wanted more firepower behind the Dask effort:
“NVIDIA has been a supportive player in the open source ecosystem recently, so I was excited about the opportunity. With RAPIDS, NVIDIA wants to extend the success of the GPU beyond graphics and deep learning to the full data science experience. This means building GPU-accelerated libraries for a broader set of tasks.”
The planned outcome is to be able to accelerate the entire data scientist workflow, not just the final training and prediction parts when deep learning is involved. Traditional machine learning algorithms, dataframe operations like groupby-aggregations, joins, and timeseries manipulation. Data ingestion like CSV and JSON parsing. And array computing like image processing, and graph analytics:
“When I first heard about this I was pretty skeptical. I had always thought of GPUs as being good only for computations that were highly regular and compute intensive, like the matrix multiplies that happen within deep learning.
However when I saw that NVIDIA had built a CSV parser (a highly irregular computation) that was a full 20x faster than the Pandas standard I was sold. The performance numbers coming out of these machines is pretty stunning.”
Rocklin said that today RAPIDS uses Dask (and MPI) to scale some of these libraries to multiple GPUs and multiple nodes. In the same way that Dask can manage many Pandas dataframes across a distributed cluster of CPUs it can also manage many GPU-accelerated Pandas dataframes across a distributed cluster of GPUs. This ends up being easier than one might expect for two reasons, he went on to add.
First, because Dask was designed to parallelize around other libraries, so it handles things like load balancing and resilience, but doesn’t impose any particular computational system, which suits NVIDIA’s needs well, given it’s desire to define its own on-node computation.
Second, Dask already has algorithms that work well for Numpy, Pandas, Scikit-Learn, and friends, and so many of them work out-of-the-box for the equivalent RAPIDS libraries, which copy those APIs as well.
DISTRIBUTED MACHINE LEARNING DO’S AND DON’TS
But can it really be that easy? Recently, Viacheslav Kovalevskyi, technical lead at Google Cloud AI, warned against the use of distribution unless absolutely necessary. We wondered what Rocklin’s view on this is, and what he sees as the greatest challenges and opportunities for distributed machine learning going forward.
Rocklin pointed out that computing efficiently in this space still requires quite a bit of expertise. Certainly there are many machine learning problems, like the parameter servers used in deep learning applications, for which you can suffer a huge performance penalty if you go distributed without being very careful.
He added that there are very real human and administrative costs to using a cluster. Clusters are complex, typically managed by another team outside of data science, they are difficult to debug, and generally place roadblocks between an analyst and insight.
Single large machines are highly productive, Rocklin thinks, both from a performance/hardware perspective and also from a human/administrative barrier perspective. If you can solve your problem on a non-distributed system then that’s almost always the right choice:
“However, Machine Learning is a broad term these days. From a computational perspective, Machine Learning has a lot of variety. Some things work well for some problems, others don’t. It’s difficult to make blanket statements.
There are tons of problems where scaling out is very natural, and provides almost no performance penalty. There are critical applications in our society that genuinely require large scaling to make progress (think genomics, climate change, imaging). Ideally the tools we use on small/medium data on a single machine can be the same tools we use on thousand-node clusters.”
CARS, ENGINES, AND SOCIAL DYNAMICS: DASK AS A SCHEDULER FOR THE PYTHON WORLD
Speaking of clusters, and to return to the comparison between Dask and Spark, there is another difference which becomes obvious when going through the Dask API: it’s a lower-level API. Things such as scheduling tasks, for example, are more nuanced to implement. It looked to us like Dask may be a “quick and dirty” way for Python users to distribute their code (and data), but the tradeoff is more fiddling with the nitty-gritty.
Once more, like in the case of PyTorch, the answer to this was that this is a feature, not a bug. Rocklin said that Dask’s original goal, to provide high-level parallel implementations of Numpy and Pandas, required a decently sophisticated task scheduler that would handle things like load balancing, resilience, task placement, network communication and so on.
What they found was that for roughly half users, that was great. They were doing pretty traditional business analytics or scientific workloads, and didn’t need anything more. However the other half were doing computations that were much more complex. For them, the parallel Numpy/Pandas implementations weren’t useful, but the underlying task scheduler was:
“It’s like we had built a fancy car, and people said “Great! I’ll just take the engine please”. That’s because they were building something far different than was anticipated, like a rocket ship or a submarine or a segway. They wanted the internal scheduler machinery, but wanted to plug that into their own highly customized solution. That’s where the lower-level APIs came from.
Today most users still start with Dask’s high level APIs, but they then hit a complexity wall and shift into the lower level APIs. Dask was designed to make it easy to switch back and forth, so people mix-and-match as they need.”
Rockin noted that a large fraction of the Python community today does work that is well outside of the traditional business intelligence space. They’re inventing new kinds of models, and building sophisticated systems that haven’t really been built before. The engine (task scheduler) was still useful, but the wheels, steering column, and transmission (dataframe algorithms) weren’t applicable to where they wanted to go.
That’s something to keep in mind wrapping up with Dask. As the Java world has its scheduler in YARN, and the cloud-native world has Kubernetes, the Python world seems to be using Dask for this, too. Rocklin said that working on Dask has been a great social and community building experiment:
“We and several other groups are trying to move the entire PyData ecosystem, which is both huge in terms of scope, and also fractured into hundreds of distinct groups. We spend a lot of time building consensus, figuring out what end-users need, what institutions are willing to participate, and who is on the ground that is able and willing to perform the work. It has been exciting seeing a resurgence in progressive activity within the stack over the last few years coming from a wide variety of different groups.”