Working locally with Spark dev containers
Feedback loops are vital. If we can't discover quickly when something is not right, it's going to cause problems. When we first started using Spark from Python notebooks at endjin, it seemed to compromise some important software development feedback loops, so over the past few years we have developed ways of working with Spark that fix this.
If you want to skip this introduction and go straight to the part where I show how to set up a dev container supporting local Spark development, then you could read the second post in this series. In this first part, I explain why you might want to do this.
Which feedback loops does Spark compromise?
The two most important feedback loops that notebook-driven Spark development causes problems for are the inner dev loop, and developer-driven testing.
The inner dev loop
When we write code, there's a recurring cycle. We have an idea for how to do something, we express that idea in code, and then we need to find out whether it works. This is sometimes called the inner dev loop, and it happens at the scale of a single developer (or two developers if you are pair programming).
A fast inner dev loop offers two major benefits. The most obvious is that sitting and waiting for the computer to do something is not a productive use of time. The less time it takes to find out whether our idea worked, the more productive we will be, and the more likely we are to enter, and remain in, a flow state. The second, more subtle benefit is that if iterative experimentation feels fast and easy, we will experiment more, which may help us find better solutions.
Python has traditionally been very strong on this front. If you just run the python command with no arguments, you can type Python expressions and it will immediately evaluate them. This feature is sometimes called a REPL (a Read, Eval, Print Loop), and it's common in programming systems that encourage experimentation. Those of us old enough to have experienced the home computer revolution of the 1980s are familiar with this interactive form of programming in which feedback is instant. It's very different from programming systems where you have to set up project files and build processes. Some languages make you wait for minutes to see the effects of your work.
It's easy to take instant feedback for granted. My father spent his whole career in computing, and when he got started, his inner dev loop was roughly as follows. If he wanted to try out an idea, he had to prepare a stack of punched cards with his source code, load them into the back of his car, and drive to the far side of London to reach the IBM data centre in Deptford, having pre-booked some time on their systems. He would carry the stack into the room with the card reader, hand it to the operator, wait until the operator loaded the job into the card reader, and then wait some more until the job had executed and the operator handed back his cards along with a printout of the results. He could repeat this cycle once a day, at best. The idea that you could just type code into a computer sitting on your desk and get a result in seconds was a revolutionary upshot of the advent of personal computers in the 1980s.
It's not just Python itself: Python notebooks extend this tradition. It takes just seconds to create a new notebook and try something. The feedback—discovering whether your idea works—comes very quickly.
The fast-iterating experimental approach enabled by this inner dev loop may well explain Python's popularity in data science and analytics. Widely used analytical libraries such as NumPy and pandas fit well with this experimentation-driven approach, empowering individuals to try out ideas quickly.
Working with Spark tends to be...slower.
Spark is more about throughput. Much like a freight train, it has phenomenal capacity, but it can feel like it takes forever to get moving.
Spark can process massive datasets because it can distribute work across large clusters of computers. Some organizations might have enough work to keep such clusters busy 24x7, or maybe they can afford to leave them running even when they are idle. But for some usage patterns, it will be more cost effective to host your Spark cluster in the cloud so that you only pay for the machines while you are using them.
And this often puts the first major dent in Python productivity. When you have an idea and want to try it out, instead of getting a result in seconds, you might now have to wait 10-15 minutes while a Spark cluster is provisioned and boots up.
Even if a cluster is already running, starting a new Spark session can sometimes take a minute or two. And it also reports that it's ready before it really is: the first time you ask a Spark session to do some actual work, there's typically a further delay (although that one's usually only a few more seconds).
In some ways it seems churlish to complain that after deciding you'd like a new cluster of servers to be set up and configured to your liking, you have to wait as long as 15 minutes before being able to use them. Not so long ago, you might have had to wait months while your department fought internal battles over budgets, and the procurement department haggled with server vendors.
The fact that we can provision a Spark cluster on a whim is a modern marvel. But if you're used to working with pandas, the 15-minute wait will be pretty frustrating, and this lengthening of the inner dev loop lowers productivity.
Feedback from automated tests
A series of fast 'try something, see the effect' iterations is great when we're exploring possible solutions. But there's another, inherently slower feedback loop of a different kind that is vital to long-term quality: testing. Whereas in the inner dev loop we're asking "Will this work?", with testing we're asking "Does this still work?"
There's a problem with the very interactive, experimental style of development that Python can enable: a lack of repeatability. You can get a notebook into a state where it seems to do what you want, but only because of something you did earlier. You can run some code that puts something into a variable, and then change or even delete that code, and the remaining code cells might be working only because of what's already in that variable. If you save the notebook and then come back to it some days later, it might no longer work.
(This is a very unusual case in which turning it off and back on again causes a computer to stop working correctly. Python notebooks seem to be violating a fundamental law of computing here.)
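The hidden-state problem can be reproduced without a notebook at all. The following is only a sketch: it mimics a notebook kernel with a plain dictionary, and the names run_cells, kernel, price, quantity and total are invented for illustration rather than taken from any notebook tool.

```python
def run_cells(cells, namespace):
    """Execute notebook-style cells in order against one shared namespace."""
    for source in cells:
        exec(source, namespace)


# First session: three cells run in order, sharing state like a live kernel.
kernel = {}
run_cells(["price = 5", "quantity = 3", "total = price * quantity"], kernel)

# Suppose the first cell is now deleted from the notebook. Re-running the
# last cell still succeeds, but only because `price` lingers in the kernel.
run_cells(["total = price * quantity"], kernel)
still_works = kernel["total"] == 15

# After a restart the kernel is empty, so the saved notebook no longer runs.
fresh_kernel = {}
try:
    run_cells(["quantity = 3", "total = price * quantity"], fresh_kernel)
    failed_after_restart = False
except NameError:
    failed_after_restart = True
```

Here still_works and failed_after_restart both end up True: the same remaining cells pass in the long-lived session and fail in the fresh one.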
This is where a more pedestrian approach has some advantages. Programming systems with a build process tend to behave consistently. If I clone and build a C# or C++ repository, I expect it to exhibit the same behaviour that its developers see. But if I obtain a copy of someone else's Python notebook, I expect to spend a long time tinkering with my environment to make it sufficiently similar to the original developer's machine to produce the same results.
Python's culture of fast iterating experimentation goes hand in hand with the "It Works on My Machine" problem.
There are ways to mitigate this. If a Python developer diligently uses virtual environments, or something like Poetry for all their work, and makes a requirements.txt (or pyproject.toml) available, I'm much more likely to be able to reproduce their results. (Although not always. I've found that the machine learning world seems to have a lot of problems with environmental dependencies that are beyond what virtual environments can help with.)
This is really a case of a Python developer choosing to use a more old-school-build style of working. And it tends to go hand in hand with increasing maturity. Development might begin with a slightly chaotic discovery phase, but as work continues, and the resulting system matures, producing high-quality, repeatable results means adopting certain processes.
One particularly important step in this direction is the addition of repeatable tests.
Some of the experiments in our inner dev loop are discardable: perhaps we're just trying to test whether pandas will be able to join certain kinds of data in the way we would like, and once we've got our answer, we can either proceed or, if it didn't work, try something else. However, some tests are different. If we do something that verifies that our results make sense—perhaps we're looking at some accounting data, and if the data is free from error the sum total of a particular set of numbers should be zero—that might be a test we shouldn't throw away. If we're testing something that sounds like it should always be true, it might be better to ensure we can run that test any time we want.
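As a rough sketch of what keeping such a check might look like, here is the zero-sum idea written as a function plus a test that can be re-run at any time. The ledger entries are made-up sample data, the names ledger_total and test_ledger_balances are hypothetical, and I've used the standard library's Decimal rather than pandas purely to keep the example self-contained.

```python
from decimal import Decimal


def ledger_total(entries):
    """Sum the amounts of (description, amount) ledger entries."""
    return sum((amount for _, amount in entries), Decimal("0"))


def test_ledger_balances():
    """Debits and credits in an error-free ledger should cancel out exactly."""
    # Illustrative sample data only.
    entries = [
        ("invoice 1001", Decimal("250.00")),
        ("payment 1001", Decimal("-250.00")),
        ("invoice 1002", Decimal("99.95")),
        ("payment 1002", Decimal("-99.95")),
    ]
    assert ledger_total(entries) == Decimal("0")
```

Because the function name starts with test_, a runner such as pytest would pick it up automatically, which is what turns a one-off experiment into a check we can repeat whenever we like.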
Automated tests have been described as a kind of ratchet for quality. A test ensures that a system behaves as required in some way. Failure to behave as required is a mark of low quality, which is why testing is crucial to quality. Testing once is better than never testing. But if you automate your tests, and if they run quickly enough, you can run them every time you change anything.
This gives you rapid feedback whenever you make a change that causes your system to misbehave. By making the test permanently available, and running it regularly, you ensure that having elevated your system's quality to the point where it passes that test, it will never sink back below that level again (hence the ratchet metaphor).
Automated test suites, executed regularly, provide a crucial feedback loop. Once a system reaches a certain level of complexity, it becomes impossible to hold all of it in your head. We can easily forget that one part of the system relies on some specific characteristic of another part of the system. We might fix a problem, but unwittingly break some other part of the system (perhaps one we haven't worked on for months) at the same time. Automated test suites enable you to discover this kind of mistake quickly.
It's easy to incorporate automated tests into traditional old-school build processes. For example, if you're writing in C#, the dotnet command line tool that we use to build projects (dotnet build) also knows how to find and execute all the automated tests in our projects (dotnet test).
What if critical logic executes in Spark? Your tests will need to be able to execute that logic if they are to be useful. But it's not always clear how to do that. If you're using something like Databricks or Azure Synapse to supply a Spark cluster, the easiest way to use Spark from Python is typically to write a notebook that is fully hosted by Databricks or Synapse. There is no build process—you just edit the notebook and run it in a web browser.
It's not at all obvious how to bring automated test suites into the mix in this world. It's not that it can't be done. It's more that the tools don't provide an obvious way to do it, and there doesn't seem to be a strong culture around this in the Spark development community.
Fixing Spark's feedback loops
So there are two problems when developing systems that use Spark: the slow inner dev loop, and the challenges with testing.
How might we fix these?
Run Spark locally to improve the inner dev loop
The single biggest speed reducer for the inner dev loop is often the time it takes to get access to a Spark cluster. But do we really need to wait?
We know that highly efficient inner dev loops for data analysis are possible. Technologies such as pandas make this possible by harnessing the absurd amounts of processing power available in a typical laptop or desktop computer. Spark is slower because it takes the much more old-school approach of submitting work to the data centre (although at least it doesn't require submissions to be in the form of a deck of punched cards).
What if we ran Spark locally?
As computers have become ever more powerful, problems that we once might have described as involving "big data" might now be handled easily by a single individual on a laptop. Even a fairly modestly-specced developer laptop is easily capable of running a local Spark service. And although our application might need a full Spark cluster when processing real production data, we might be able to do a lot of useful testing on smaller datasets that can fit comfortably on a developer's local machine.
At endjin, we've found that this can be a very effective way to improve the inner dev loop. I will talk more about this in a later post.
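To give a flavour of what running Spark locally can look like from Python, here is a minimal sketch. It assumes the pyspark package and a Java runtime are available on the machine; the helper names local_master_url and create_local_session are my own inventions, not part of Spark. The only Spark-specific idea is the "local[N]" master URL, which asks Spark to run inside a single process using N worker threads (or all cores with "local[*]") instead of connecting to a cluster.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for local mode.

    "local[*]" uses all available cores; "local[N]" uses N worker threads.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


def create_local_session(app_name="local-dev", cores=None):
    """Start (or reuse) a Spark session that runs on this machine only."""
    # Imported here so that importing this module does not require pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName(app_name)
        .getOrCreate()
    )
```

Calling create_local_session(cores=2) should hand back a session that can be used like one from a hosted cluster, for example via spark.createDataFrame(...), with no cluster to provision first.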
Enabling local testing through packaging
Automated test suites work best if they run quickly, and can be executed with a single, simple action (or, if possible, they run automatically, with no specific developer action required). There are two things we want to avoid here:
- Manual steps to set things up before we can even run the test
- Extra steps to get code under development from wherever the developer is working on it into an environment where tests can be run
The practical effect of getting either of these wrong is that developers won't run test suites regularly as a matter of habit. This lengthens the feedback loop—it will take longer to discover when a change has unintended consequences that break something. (And in the worst case, if tests are too hard to run, developers might simply ignore them, completely destroying this particular feedback loop.)
When critical application behaviour lives in a Python notebook in Databricks, or Synapse, or Fabric (or any other hosted Spark service), both of the problems described above tend to occur.
The solution we've developed is to move the behaviour we want to test into separate packages. We develop and test these locally, and then deploy these packages into the hosted Spark environments. Our notebooks become more like orchestrations: instead of containing application logic, they just load these packages and invoke methods on them. (This still requires testing of course—there could be orchestration bugs. But this can be handled with system testing, and by pushing all non-trivial logic down into the packages, we keep the notebooks themselves very simple.)
Of course, if the logic under test is using Spark to get things done, we'll need to provide this code with access to a Spark context when developing and testing locally.
So this also requires us to run Spark locally.
I'll talk about this package-based approach to enabling better and more frequent testing of Spark-based logic in a later post.
Local development is typically easier than working in a browser
Another benefit of running code outside of a hosted notebook is that the development tooling tends to be better. Hosted Spark environments tend not to offer very good debugging facilities for notebooks. But if you can run the code locally, tools such as VS Code can offer far better debugging.
A local development experience is also often just less frustrating. Hosted notebooks often have problems such as login timeouts, and connections dropping out. They can sometimes mysteriously run slowly or even hang (sometimes just because the web browser has too many tabs open). Local developer tooling tends to provide a smoother experience.
Dev containers
So it's looking like we might want to run Spark locally. However, installing Spark on a machine is not a trivial move. So it would make sense to put it in a container.
We want to minimize the overhead for individual developers. Ideally they should be able to clone a repository, open it up, and just start working. We don't want to make them work through a set of instructions on how to install and configure a container with a local Spark instance. Better yet, any VS Code extensions and other related tools for enhancing the development experience should ideally be installed automatically, so that developers can be maximally productive from the start.
This is where dev containers come in. It's possible for a repository to include a file that describes how to create and configure a container that provides whatever is needed to start work. And if you open such a repository in a dev-container-aware tool such as VS Code, it will find that file, and will offer to set up a container for you. The only prerequisite is that you must have a container provider such as Docker installed on your system.
I'll explain how in the next post in this series.
Conclusion
Feedback loops are vital if we want to build high-quality systems. Two of the most important feedback loops are the inner dev loop, and automated testing. The ability to run Spark locally can enhance both of these, so next time I'll show how to set that up.