Getting Started with Distributed Systems Using Python & Ray
How to get started building distributed Python applications without having to start from scratch
May 19, 2021
Distributed systems 101
You’ve no doubt come across the phrase “distributed systems” or “decentralized systems.” Most engineers have some experience with them, but all most of us remember is the constant headache of message brokers going down or getting out of sync. Constantly maintaining your system to ensure message integrity, while keeping your application performant under concurrency, is hard work, and distributed systems can get a bad rap because of it. But even if you haven’t built your own distributed system from scratch, at some point you’ve heard of parallelism or concurrency and know how amazing they can be for performance. So maybe learning the ins and outs of distributed systems isn’t all bad.
In this post we are going to cover what a distributed system is, the pros and cons of distributed systems, and the basics of building distributed systems in Python using Ray. By building a distributed app with Ray you can gain experience with distributed systems without toiling over network and messaging management! A win-win!
Important distributed system concepts
First and foremost, you need to understand what exactly a distributed system is and what its component parts are.
What is a distributed system?
At its most basic, a distributed system is any system whose components run on different machines, communicate via messages, and work together toward a single end. Building a system this way can improve performance, because it scales out easily and is concurrent by design, and it can also make an application more flexible and resilient, because a well-designed one has no single point of failure.
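To make that definition concrete, here is a minimal sketch that simulates the idea on a single machine. It is only an illustration under stated assumptions: threads stand in for separate machines, in-memory queues stand in for the network, and the names `worker` and `square_all` are hypothetical.

```python
import queue
import threading


def worker(name, inbox, outbox):
    """Simulated 'machine': handle messages until a None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:  # shutdown signal
            break
        # Reply with who did the work, the input, and the result.
        outbox.put((name, message, message * message))


def square_all(numbers, num_workers=3):
    """Send each number as a message and collect the squared replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"node-{i}", inbox, outbox))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for number in numbers:
        inbox.put(number)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    results = {}
    while not outbox.empty():
        _name, number, squared = outbox.get()
        results[number] = squared
    return results


print(square_all([1, 2, 3, 4]))
```

No component calls another directly; they only exchange messages, which is what would let you move each worker onto its own machine in a real system.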
The simplest, and most relatable, example of a distributed system would be... the Internet! It's a bunch of machines that send messages to each other to achieve the goal of being the Internet. Pretty sweet system. The same concept can be applied to a plethora of problems, such as machine learning and real-time streaming. Anywhere there is a need for large-scale compute, distributed systems are a must.
Another popular genre of distributed system is the blockchain. Blockchains are essentially large peer-to-peer systems that don’t rely on specialized machines to serve requests; instead, EVERY machine contributes to the system.
As you can see, distributed systems have a variety of use cases and incredible benefits but at a cost. How should we evaluate the pros and cons of distributed systems?
Pros of distributed systems
- Fault tolerance - Since the system’s tasks are distributed across multiple machines, when one machine fails the system can delegate its work to another. This helps avoid a single point of failure.
- Scalability - Distributed systems scale horizontally rather than vertically. As the load increases, you add more machines to the system to do the work rather than increasing the memory and CPU of the existing machines to handle the larger input.
- Efficiency - Since capacity comes from many machines rather than one large one, you can match resources to the workload, adding machines when load rises and removing them when it falls. In the cloud, this means your costs track your actual usage more closely.
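The fault tolerance and scalability points can be sketched in a few lines. This is a toy, single-process simulation rather than a real cluster; `WorkerDown`, `make_worker`, and `dispatch` are hypothetical names invented for the illustration.

```python
class WorkerDown(Exception):
    """Raised by a simulated worker that is unavailable."""


def make_worker(name, healthy=True):
    """Build a callable that stands in for one machine in the system."""
    def run(task):
        if not healthy:
            raise WorkerDown(name)
        return f"{name} finished {task}"
    return run


def dispatch(task, workers):
    """Try each worker in turn and return the first successful result."""
    for run in workers:
        try:
            return run(task)
        except WorkerDown:
            continue  # fail over to the next machine
    raise RuntimeError(f"no healthy worker available for {task!r}")


# node-a is down, so the task fails over to node-b.
pool = [make_worker("node-a", healthy=False), make_worker("node-b")]
print(dispatch("resize-image", pool))  # node-b finished resize-image

# Horizontal scaling: add capacity by adding a machine, not by resizing one.
pool.append(make_worker("node-c"))
```

A real system would also need health checks and timeouts to notice that a machine is down, which is part of the cost discussed in the next section.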
Cons of distributed systems
- Messaging - Since different machines are processing at the same time, keeping them synchronized, keeping every application running properly, and keeping data consistent across the network all add significant overhead. How much depends heavily on your overall design, but debugging is especially difficult when machines stop or fail seemingly at random.
- Reliability - I know I called out fault tolerance as a benefit of distributed systems, but it is worth noting that network management is often a behemoth task. The network setup tends to be tuned to your particular application, which makes the system sensitive to network changes or outages.
- Maintenance - Running an application where the number of machines scales requires new architectural and managerial considerations. Traffic needs to be directed correctly to machines via a load balancer of some kind and every machine needs its own application AND infrastructure monitoring, logging, delivery, and testing.
Resources to learn more about distributed systems
This is just scratching the surface of what distributed systems are and how they work. There are plenty of online courses and resources out there to learn more about distributed systems, but if you’re looking to take a deep dive into the more general world of distributed systems, here are some resources I personally recommend checking out:
- Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf is an incredible resource that walks through the history of messaging-based systems and distills the core patterns of distributed systems from over the decades.
- freeCodeCamp’s Intro to Distributed Systems. I’m a big fan of freeCodeCamp as a jumping off point because the resources are well vetted and well...free.
- Anyscale Ray Crash Course. Okay, I said general distributed systems, but you’re here for Python, so I wanted to highlight this video from the Ray documentation!
Distributed systems and Python
Now it turns out that Python specifically has performance issues when it comes to parallel and distributed work because of its Global Interpreter Lock (GIL). The GIL is basically the soft underbelly of CPython, the standard Python interpreter: it allows only one thread to execute Python bytecode at a time. Several standard libraries try to work around this limitation. threading and asyncio help with I/O-bound work, where tasks spend most of their time waiting, while multiprocessing sidesteps the GIL by running separate processes. Depending on your design these libraries can help, but making them hold up at scale brings a high bar of complexity and administrative work to ensure all of the threads and processes play nicely together. What's more, even when these libraries let you use multiple cores, they give you close to no visibility into how much the GIL is costing you unless you are running a highly specialized compute-only workload.
This often leaves Python developers with the complex task of building a distributed system from scratch, leveraging messaging tools like RabbitMQ or ZeroMQ to pass messages across machines. Building the application itself often isn’t too difficult, but the administrative overhead of a distributed system requires a lot of tuning, network management, and maintenance to keep the application resilient. This comes with all of the cons listed above, and it also takes a lot of infrastructure expertise to spin up, manage, and maintain well. It is definitely doable and educational, but for a proof of concept, or for a developer looking to create distributed applications swiftly, it would take a long time to get working reliably.
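One rough way to see the GIL's effect for yourself is to run the same CPU-bound function on a thread pool and on a process pool. The sketch below assumes a standard CPython build with the GIL enabled; `count_primes` and `timed_run` are hypothetical helper names, and the timings will vary by machine, so no expected numbers are shown.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_run(executor_cls, limits, max_workers=4):
    """Run count_primes over `limits` on the given executor and time it."""
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [50_000] * 4
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, seconds = timed_run(cls, limits)
        print(f"{cls.__name__}: {seconds:.2f}s, results={results}")
```

On a multi-core machine you would typically expect the process pool to finish faster, because each process has its own interpreter and its own GIL, while the threads take turns holding a single one.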
Example solution using Ray
Enter Ray! Ray is an open source distributed execution framework that came out of UC Berkeley’s RISELab, and the company that supports it is called Anyscale. You use Ray as an API from your Python code: it spins up a cluster of worker processes across the available cores (and machines), and each process has its own GIL. What this amounts to is almost an Infrastructure as a Service (IaaS) model, where you create your Python apps and “distributify” them.
Here is an example calculating pi with and without Ray from Anyscale’s GitHub repository:
First they create a Python example leveraging NumPy:
import numpy as np

def estimate_pi(num_samples):
    # Generate random samples for the x coordinate
    xs = np.random.uniform(low=-1.0, high=1.0, size=num_samples)
    # Generate random samples for the y coordinate
    ys = np.random.uniform(low=-1.0, high=1.0, size=num_samples)
    # Like Python's "zip(a,b)"; creates np.array([(x1,y1), (x2,y2), ...]).
    xys = np.stack((xs, ys), axis=-1)
    # Create a predicate over all the array elements
    inside = xs*xs + ys*ys <= 1.0
    # Select only those "zipped" array elements inside the circle
    xys_inside = xys[inside]
    # Count the number of elements inside the circle
    in_circle = xys_inside.shape[0]
    # The Pi estimate
    approx_pi = 4.0*in_circle/num_samples
    return approx_pi
The idea here is that you are using NumPy to pick random points in a square that has a circle inscribed in it. The fraction of points that land inside the circle approximates the ratio of the two areas, so the more random points you sample, the closer your approximation of pi. That being said, you need something like 10,000 points before you reliably get within a few hundredths of the actual value of pi. This can become a TEDIOUS and compute-intensive process.
Now what does it look like to add Ray to this Python function?
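For readers who want the arithmetic behind the factor of 4.0 in the code, here is a short worked derivation. The symbols are my own shorthand rather than anything from the original example: N corresponds to num_samples, N_in corresponds to in_circle, and the error estimate assumes the samples are independent and uniform.

```latex
\begin{align}
P(\text{point lands inside the circle})
  &= \frac{\text{area of unit circle}}{\text{area of square}}
   = \frac{\pi \cdot 1^2}{2 \cdot 2}
   = \frac{\pi}{4} \\
\pi &\approx 4 \cdot \frac{N_{\text{in}}}{N} \\
\text{standard error of the estimate}
  &\approx 4 \sqrt{\frac{p\,(1 - p)}{N}}
   \approx \frac{1.64}{\sqrt{N}},
  \qquad p = \frac{\pi}{4}
\end{align}
```

At N = 10,000 the standard error works out to roughly 0.016, and because it shrinks only with the square root of N, halving the error takes about four times as many samples.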
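# Wrap the NumPy function above in a Ray remote function
@ray.remote
def ray_estimate_pi(num_samples):
    return estimate_pi(num_samples)

refs = [ray_estimate_pi.remote(n) for n in [100, 1000, 10000]]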
That’s it. Simply add the @ray.remote decorator to a new function that wraps your code et voilà, you have an asynchronous, distributed execution of your function that, in Anyscale’s example, runs close to 25x faster! Each .remote() call returns immediately with a reference to a result that is still being computed, and you collect the actual values later with ray.get(refs). This is possible because of the way Ray defines and manages remote functions. You can learn more about Ray’s internals here!
At the end of the day, even if you aren’t looking for an SRE position, a working knowledge of distributed systems will no doubt come in handy, because the future is distributed! One of the biggest considerations in implementing them, however, is how much administrative overhead you’re willing to accept in exchange for the value you get from a distributed design. Regardless of how performant distributed systems are, the maxim “simple is best” holds true. In that vein, if Python is your main language, then Ray is a great choice for getting exposure to distributed systems quickly and easily without worrying about all the overhead.