Ian Griffiths

Tutorial

Boosting Apache Spark Performance with Small JSON Files in Microsoft Fabric. Learn how to achieve a 10x performance improvement when ingesting small JSON files in Apache Spark hosted on Microsoft Fabric.

Ian Griffiths, Technical Fellow at endjin, shares insights and techniques to overcome Spark's challenges with numerous small files, including parallelizing file discovery and optimizing data loading. Follow along for detailed steps and tips to significantly enhance your Spark data processing workflows using Apache Spark in Microsoft Fabric.

  • 00:00 Introduction to Performance Improvement in Apache Spark
  • 00:20 Understanding the Problem with Small Files in Spark
  • 00:38 Our Scenario: Performance Telemetry Collection
  • 01:20 Initial Approach and Disappointment
  • 01:40 Exploring the Root Cause
  • 05:27 Parallelization: The Key to Performance Boost
  • 08:51 Implementing the Solution in Spark
  • 12:43 Conclusion: Balancing Complexity and Performance

Transcript

I'm going to show you how we got a 10x performance improvement when ingesting lots of small JSON files in Apache Spark hosted in Microsoft Fabric. My name's Ian Griffiths, I'm a Technical Fellow at endjin, and if you find this video useful or interesting, please like and subscribe.

Apache Spark can analyze massive amounts of data at amazing speed. But if you listen to the received wisdom, it has a weak spot. It struggles if your data is spread across lots of small files. There is some truth in this idea. But it doesn't have to be as bad as people often think. I'll quickly describe our scenario.

We were collecting performance telemetry across a large number of services deployed to the cloud. We ran a whole load of instances to be able to collect data more quickly, but an upshot of this is that for a single test run, collecting about a hundred thousand data points, we end up with data that's spread across about 30,000 files, all scattered across many thousands of directories.

We used Apache Spark to aggregate and analyze this data, and as it happens, we use Microsoft Fabric to host it. But the principles here would apply to any Spark setup. Now these are fairly small volumes of data, so I was not expecting Spark to struggle. Sadly, Spark did struggle. I was taking the obvious approach.

I handed Spark a wildcard path that would find all the files, and I asked it to load them into a data frame. I was disappointed when this took around 45 minutes. And this isn't really what you could call big data. Spark shouldn't even break a sweat. So, I did what any developer would do. I searched the internet to see if anyone else had already solved this problem.

Instead, what I found was that lots of people had run into this kind of issue, and all had been met with a counsel of despair. Alas, alack, Spark is just bad at this kind of thing, if the self-appointed experts were to be believed. It's not entirely untrue. Spark is optimized to deal with data in fairly large chunks.

For my application, any single investigation involves about 100,000 records. Spark was designed on the assumption that each single input file would contain at least that many records, possibly 10 or even 100 times that number. So having 30,000 individual files, each containing on average three records, is not what Spark's designers had in mind.

The effect is similar to trying to ship 100,000 USB memory sticks by using 30,000 full size shipping containers and putting just a few items in each one. So it is absolutely true to say that this is not playing to Spark's strengths. But where the received wisdom seems to go wrong is to assume that the terrible performance you get in practice is as good as it can get.

This is unnecessarily defeatist. I approach this from a slightly different angle. Peter Norvig wrote an essay entitled Teach Yourself Programming in Ten Years, in which he shows a much-reproduced table listing various simple things people get computers to do, and roughly how long you can expect each one to take.

Norvig says that one essential ingredient for success as a programmer is to know how long it takes your computer to read consecutive words from disk and seek to a new location on disk. I think our Spark scenario is an excellent example of why it's important to internalize such things. Anyone with a deeply ingrained feel for how long computers take to do things will raise an eyebrow at the idea that it might take 45 minutes to discover and read 30,000 files.

Sure, it's a lot of files, but it's not that many files. So even if we aren't playing to Spark's strengths here, we still need to ask, how is Spark managing to do such a spectacularly bad job? Sure, we're not going to get the full benefits of Spark's superpower: its ability to chomp through enormous volumes of data at staggering speed.

When you package the data the way Spark wants, you'd expect to process hundreds of thousands of records a second. But even if we don't align everything in just the way Spark likes it, how has it managed to be so catastrophically much worse than a naive program doing things in a very straightforward way?

Sure, a high performance sports car in a traffic jam can't exploit its potential, but you wouldn't expect it to be ten times slower than a cheap hatchback. How long does the cheap hatchback take? Well, if we use Norvig's numbers as a guide, he tells us that reading a small amount of data from some random location on a storage device takes about 8 milliseconds.

With modern solid state storage, we might expect better. Then again, Norvig's figures are for storage devices that are part of your machine, whereas I'm talking to a data lake, which is a completely separate service in the cloud. But this gives us a rough sort of baseline. At eight milliseconds per operation, discovering 30,000 files would take about four minutes.
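That baseline is easy to reproduce as back-of-the-envelope arithmetic. The 8 ms figure is Norvig's ballpark for a random disk access, not a measurement against any particular data lake:

```python
# Back-of-the-envelope estimate: one sequential operation per file,
# at roughly 8 ms per random access (Norvig's ballpark figure).
seek_ms = 8
file_count = 30_000

total_minutes = file_count * seek_ms / 1000 / 60
print(f"about {total_minutes:.0f} minutes")  # about 4 minutes
```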

That's a lot better than the 45 minutes we were observing in practice, and happens to align with the four minutes I promised at the start of the talk. Just to check I hadn't miscalculated, I wrote a simple C# program to discover and fetch all 30,000 files from the data lake. Running this from a VM in the same cloud region as the lake took about 13 minutes.

And while that's considerably worse than the 4 minutes predicted using Norvig's numbers, it's a lot better than Spark managed. I then made a very simple improvement to parallelize the file download a little bit, and this got it down to 14 seconds. In fact, if I ran the same code from my desktop, which puts it at a major disadvantage because I'm not even in the same country as the data lake at that point, it was still able to discover and parse 100,000 records across 30,000 files in about 30 seconds.

So running a very simple program from my desktop outperforms Spark by a factor of 80. What on earth is going on? The performance boost I got from parallelization is the big clue here. The data lake is perfectly capable of serving up all of this data in well under a minute, but only if we parallelize. I took a look at the Spark source code to see if it was even attempting any similar parallelization, and I discovered that although it does have a mode for parallelizing file discovery, this only kicks in if you pass a sufficiently large list of files.

I was passing a single wildcarded path, and although that represented 30,000 files, Spark saw it as a single input and decided that there was no need for parallel execution. It does have a faster, parallelized execution path for file discovery, it just chose not to use it. Now, to be fair to Spark, there isn't a good, general way to parallelize the discovery of files in a data lake.

It doesn't know what files are there until it looks. The reason I was able to modify my code to discover and process all the files in under 15 seconds is that I exploited some knowledge about my folder structure. I've written an adapter here that lets me use the Reactive Extensions for .NET (Rx.NET), because that makes it really easy to express the structure of the parallelism I want.

I happen to know that for any single Azure subscription, I can dive straight down to a certain level in the folder structure to get a list of all the resource groups. Enumerating that one folder is a sequential process, but I'm then feeding the results into Rx's SelectMany operator, which enables processing of each subfolder to proceed in parallel.

So this folders variable is a stream of folders, but it's being generated by a large number of parallel operations. This code makes no attempt to bound the parallelism, but that's fine, because the client library I'm using to talk to the lake does provide a configurable limit on concurrent work. At a certain point in the hierarchy, I stop adding parallelism.

This last step here takes each folder from the previous stream and performs an exhaustive recursive walk, and it actually does that sequentially. But we can have loads of those sequential recursive walks all happening at the same time. And this gives me enough parallelism to get a useful performance boost, but enough sequential work that the program doesn't waste time by producing tasks that do too little work.
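The shape of that parallelism can be sketched in Python with a thread pool standing in for Rx.NET: a sequential enumeration of the top-level folders feeds a pool of parallel, individually sequential recursive walks. This is a local-filesystem sketch of the structure, not the actual data lake client code:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def walk_subtree(folder: str) -> list[str]:
    """Exhaustive recursive walk of one subtree, done sequentially."""
    found = []
    for root, _dirs, names in os.walk(folder):
        found.extend(os.path.join(root, name) for name in names)
    return found


def discover_files(root: str, max_workers: int = 16) -> list[str]:
    """Enumerate the top level sequentially, then walk each subtree in
    parallel. Coarse-grained tasks give a useful speed-up without
    spawning tasks that each do too little work."""
    subtrees = [entry.path for entry in os.scandir(root) if entry.is_dir()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [f for files in pool.map(walk_subtree, subtrees) for f in files]
```

In the real code the walk targets a data lake rather than a local disk, and the client library's own concurrency limit bounds the parallelism.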

And this, in practice, seems to be the most efficient way to use this data lake's API. And that's why this takes just 15 seconds instead of 45 minutes. The problem with this code is that I've adapted it to a very specific folder layout. There's no way Spark could have guessed that this would be the optimal strategy.

So we can't reasonably expect Spark to match this bespoke code's performance. But we can give Spark a nudge in the right direction. The trick is to supply Spark with a big enough list of starting points that it will use its parallelized execution path. I can exploit my knowledge of the folder structure just like I did with my C# code.

Here's the Python notebook I'm using to read all those JSON files. I'm using Microsoft Fabric here, but this would work for any environment offering Spark and Python notebooks. And this cell here is the trick. It's almost embarrassingly simple. But this little bit of code is what delivered the order-of-magnitude performance improvement.

All I'm doing is using Python's glob library to expand this wildcarded path into a list of folders. I'm exploiting exactly the same knowledge that I did in my C# example. I know that if I go to a certain depth in the folder structure, I'll end up with a list of folders that is reasonably large.
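A minimal sketch of that cell might look like the following. The path shown is a placeholder, not my actual folder layout; the point is simply that glob turns one wildcarded pattern into an explicit list, and a long enough list is what pushes Spark onto its parallel file-discovery path:

```python
import glob


def expand_wildcard(pattern: str) -> list[str]:
    """Expand a wildcarded path into an explicit list of folders, so
    Spark sees many inputs rather than one and parallelizes discovery."""
    folders = sorted(glob.glob(pattern))
    if not folders:
        raise FileNotFoundError(f"nothing matches {pattern!r}")
    return folders


# In the notebook (where `spark` is already defined), something like:
#   folders = expand_wildcard("/lakehouse/default/Files/telemetry/*/*")
#   df = spark.read.json(folders)
```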

This step is at a slight disadvantage: the C# version was able to exploit parallelism thanks to the Reactive Extensions for .NET, whereas Python's glob works sequentially. However, Data Lake file system access from a Python notebook in Microsoft Fabric seems to have some pretty effective caching, so this runs in under a second.

And at this point, we've got a list of paths that's long enough to convince Spark that it should be parallelizing this load operation. The result is that this takes only a few minutes to find and process all of the JSON files. It's still quite a lot slower than my simple little C# program, but now we really are up against the fact that Spark just wasn't designed to process data in this form.

The people crying alas and alack weren't entirely wrong. If your input data arrives in this form, with lots of files each containing very few records, Spark won't shine. But what the doom merchants all missed is that it doesn't have to be as bad as it first seemed.

Your mileage may vary with this approach. The way my data was split across folders happened to make this technique work very well. You might not get so lucky. You might need to get more creative in how you build up the list of files you pass to Spark. And in any case, this is still a long way from optimal.

If you're dealing with significantly larger volumes of data, it might be worth biting the bullet and writing a program that pre-processes the input into a form that suits Spark better. In this example, we could just concatenate all of the individual JSON files into a single file. Normally, you can't just concatenate JSON, but this is JSONL, where each line is a separate JSON object, so this very simple concatenation approach works.
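The original pre-processing step was a C# program, but the concatenation it performs can be sketched in Python (the file pattern and output path here are placeholders):

```python
import glob


def concatenate_jsonl(pattern: str, out_path: str) -> int:
    """Concatenate many JSONL files into one. This is only valid because
    each line is a complete JSON object, so joining files line-wise
    produces another valid JSONL file. Returns the number of files merged."""
    count = 0
    with open(out_path, "wb") as out:
        for path in sorted(glob.glob(pattern, recursive=True)):
            with open(path, "rb") as src:
                data = src.read()
            if data:
                # Normalize the trailing newline so records never run together.
                out.write(data.rstrip(b"\n") + b"\n")
                count += 1
    return count
```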

It only takes a fairly small modification to my C# program to do that, and it still runs in well under a minute. And for our application, a single investigation results in a file roughly 150 megabytes in size, and Spark is able to load that into a data frame in about four seconds. That's a pretty good illustration of how much better Spark performs when your inputs match its design expectations.

The obvious downside of this is that it requires an extra pre-processing step. I wrote a C# program to discover and concatenate all the JSON, but how would I incorporate that into a production data processing pipeline? For large volumes, it would probably be worth finding a way, but for this application, the extra complexity just wasn't justified.

The original 45 minute import time was intolerably slow, but for this application, 4 minutes was good enough. And the simplicity of being able to keep it entirely within Spark, driven by a Python notebook, meant that in this particular case, this was the right balance between complexity and performance.

So in conclusion, it is true that Spark copes much better with a small number of big files than with lots of tiny files. But it might not have to be as bad as you think. If you're passing a single wildcarded path that represents thousands of files, you should see whether partially expanding the wildcards before handing it off to Spark improves matters.

In our scenario, this unlocked Spark's ability to parallelize this work, which gave us an order of magnitude performance improvement.