Modern software systems are often composed of multiple independently operating components. It has become popular to split monolith services into microservices that live in different processes, containers, machines, or even different datacenters. Failure of individual components from time to time is all but guaranteed. Consistent uptime for the system as a whole depends on an application design that is resilient to these failures.
Precipice is a library designed to provide the building blocks for improving system resiliency. It offers composable metrics and back pressure mechanisms for isolating and handling the failure of individual tasks. Additionally, it offers tools for developing patterns of execution that avoid failures where possible and handle them when necessary.
There are a number of other libraries in the Java ecosystem that are designed to provide system resiliency. Precipice tends to be lower level than some of these alternatives. It does not impose any execution model upon your application, and it is not strictly coupled to threadpools, actors, communicating sequential processes, or any other concurrency model. Instead, Precipice is designed to be used in conjunction with, or as a part of, one of these higher level libraries.
Introducing the GuardRail
The basic abstraction provided by Precipice is the GuardRail. A GuardRail isolates the execution of tasks that have failure conditions.
A GuardRail is parameterized by two enum types: one defining the possible outcomes of execution, and another defining the reasons why execution might be rejected.
Additionally, it has five main attributes.
- Name - used for identification purposes.
- Result metrics - counts of the different task execution results.
- Rejection metrics - counts of the different execution rejection reasons.
- Latency metrics - optional latency metrics for the different execution results.
- Back pressure - zero or more back pressure mechanisms informed by these metrics.
A GuardRail can be constructed using the builder.
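The shape described above (two enum type parameters, a name, and per-enum counters) can be illustrated with a small self-contained sketch. Everything in it is hypothetical: `SketchGuardRail`, `Builder`, `Outcome`, and `Rejection` are illustrative names, not Precipice's classes, and the real builder's methods may differ. Latency metrics and back pressure are omitted for brevity.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLongArray;

final class GuardRailShapeSketch {
    // Hypothetical enums: one for execution outcomes, one for rejection reasons.
    enum Outcome { SUCCESS, ERROR, TIMEOUT }
    enum Rejection { MAX_CONCURRENCY_EXCEEDED, CIRCUIT_OPEN }

    /** Hypothetical stand-in for a guard: a name plus one counter per enum constant. */
    static final class SketchGuardRail<R extends Enum<R>, J extends Enum<J>> {
        final String name;
        final AtomicLongArray resultCounts;
        final AtomicLongArray rejectionCounts;

        private SketchGuardRail(Builder<R, J> builder) {
            this.name = builder.name;
            this.resultCounts = new AtomicLongArray(builder.resultType.getEnumConstants().length);
            this.rejectionCounts = new AtomicLongArray(builder.rejectionType.getEnumConstants().length);
        }
    }

    /** Hypothetical builder; it only illustrates the fluent construction style. */
    static final class Builder<R extends Enum<R>, J extends Enum<J>> {
        private final Class<R> resultType;
        private final Class<J> rejectionType;
        private String name;

        Builder(Class<R> resultType, Class<J> rejectionType) {
            this.resultType = resultType;
            this.rejectionType = rejectionType;
        }

        Builder<R, J> name(String name) {
            this.name = name;
            return this;
        }

        SketchGuardRail<R, J> build() {
            Objects.requireNonNull(name, "name is required");
            return new SketchGuardRail<>(this);
        }
    }

    public static void main(String[] args) {
        SketchGuardRail<Outcome, Rejection> rail =
                new Builder<>(Outcome.class, Rejection.class).name("user-service").build();
        System.out.println(rail.name + ": " + rail.resultCounts.length() + " result counters, "
                + rail.rejectionCounts.length() + " rejection counters");
    }
}
```

Keying the counters by enum ordinal keeps the metrics a fixed-size array, which is one plausible reason for parameterizing by enum types rather than arbitrary objects.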
Using a GuardRail
A GuardRail has semantics similar to a semaphore. When you are interested in accessing the code path isolated by the GuardRail, you must request permits. At this point, the GuardRail will consult with the provided back pressure mechanisms to determine whether the permits can be acquired or if the access must be rejected. If the access is rejected, the rejection metrics are updated.
When permits are successfully acquired, the system can safely proceed with execution. Upon completion, the permits must be released. This can be done manually, or through one of several contexts that Precipice provides to release permits automatically. The act of releasing the permits updates the result metrics and latency metrics (if present). It also informs the back pressure mechanisms that the permits have been released.
As mentioned, there are a number of completable contexts that will make this process less tedious. When complete or completeExceptionally is called in the example below, the permits will be released automatically.
In the example above, the completable can only be written to by a single thread. Precipice provides a threadsafe Eventual for usage across thread boundaries. The threadsafe version can be constructed using the Asynchronous factory.
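The acquire, execute, and release cycle described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Precipice's API: `SketchRail`, `SketchCompletable`, and the convention that `acquire` returns `null` on success are all hypothetical. The completable here is single-writer only, so it uses a plain boolean flag.

```java
import java.util.concurrent.atomic.AtomicLong;

final class PermitLifecycleSketch {
    enum Outcome { SUCCESS, ERROR }
    enum Rejection { MAX_CONCURRENCY_EXCEEDED }

    /** Hypothetical guard with semaphore-like semantics and simple counters. */
    static final class SketchRail {
        private final long maxPermits;
        private final AtomicLong inUse = new AtomicLong();
        final AtomicLong[] results = { new AtomicLong(), new AtomicLong() };
        final AtomicLong rejections = new AtomicLong();

        SketchRail(long maxPermits) { this.maxPermits = maxPermits; }

        /** Returns null when permits were acquired, otherwise the rejection reason. */
        Rejection acquire(long permits) {
            while (true) {
                long current = inUse.get();
                if (current + permits > maxPermits) {
                    rejections.incrementAndGet(); // rejected access updates rejection metrics
                    return Rejection.MAX_CONCURRENCY_EXCEEDED;
                }
                if (inUse.compareAndSet(current, current + permits)) {
                    return null;
                }
            }
        }

        /** Releasing records the outcome and hands the permits back. */
        void release(long permits, Outcome outcome) {
            results[outcome.ordinal()].incrementAndGet();
            inUse.addAndGet(-permits);
        }

        long permitsInUse() { return inUse.get(); }
    }

    /** Single-writer completion context: completing it releases the permits once. */
    static final class SketchCompletable<T> {
        private final SketchRail rail;
        private final long permits;
        private boolean done;
        T value;
        Throwable error;

        SketchCompletable(SketchRail rail, long permits) {
            this.rail = rail;
            this.permits = permits;
        }

        void complete(T result) {
            if (done) return;
            done = true;
            value = result;
            rail.release(permits, Outcome.SUCCESS);
        }

        void completeExceptionally(Throwable failure) {
            if (done) return;
            done = true;
            error = failure;
            rail.release(permits, Outcome.ERROR);
        }
    }

    public static void main(String[] args) {
        SketchRail rail = new SketchRail(1);
        if (rail.acquire(1) == null) {
            SketchCompletable<String> completable = new SketchCompletable<>(rail, 1);
            try {
                completable.complete("ok");
            } catch (RuntimeException e) {
                completable.completeExceptionally(e);
            }
        }
        System.out.println("permits in use: " + rail.permitsInUse());
    }
}
```

A version meant to cross thread boundaries would need an atomic completion flag and safe publication of the result rather than the plain fields used here.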
Integrating Specialized Execution Models
Often Precipice users may be interested in integrating a GuardRail with specialized execution logic, as opposed to using GuardRails in the ad hoc manner shown in the example above. For that case, the namesake Precipice interface can be implemented.
This interface merely indicates that the implementing class has a GuardRail. The actual acquiring and releasing of permits and execution must be implemented.
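As a rough illustration of that division of labor, the sketch below wraps `Callable` execution on the calling thread. The names `HasRail`, `SketchRail`, `SketchCallService`, and `RejectedException` are hypothetical stand-ins; the real interface and service classes may look different. The point is only that the implementing class exposes its guard and owns the acquire, execute, release sequence itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class CallServiceSketch {
    enum Outcome { SUCCESS, ERROR }

    /** Hypothetical minimal guard: a fixed concurrency limit plus result counts. */
    static final class SketchRail {
        private final int limit;
        private final AtomicInteger inUse = new AtomicInteger();
        final AtomicInteger[] counts = { new AtomicInteger(), new AtomicInteger() };

        SketchRail(int limit) { this.limit = limit; }

        boolean tryAcquire() {
            while (true) {
                int current = inUse.get();
                if (current >= limit) return false;
                if (inUse.compareAndSet(current, current + 1)) return true;
            }
        }

        void release(Outcome outcome) {
            counts[outcome.ordinal()].incrementAndGet();
            inUse.decrementAndGet();
        }
    }

    /** Mirrors the idea of the marker interface: the implementor exposes its guard. */
    interface HasRail {
        SketchRail rail();
    }

    static final class RejectedException extends RuntimeException {
        RejectedException(String message) { super(message); }
    }

    /** Runs Callables on the calling thread, guarded by the rail. */
    static final class SketchCallService implements HasRail {
        private final SketchRail rail;

        SketchCallService(SketchRail rail) { this.rail = rail; }

        @Override
        public SketchRail rail() { return rail; }

        <T> T call(Callable<T> task) throws Exception {
            if (!rail.tryAcquire()) {
                throw new RejectedException("no permits available");
            }
            Outcome outcome = Outcome.ERROR; // assume failure until the task returns
            try {
                T result = task.call();
                outcome = Outcome.SUCCESS;
                return result;
            } finally {
                rail.release(outcome); // always hand the permit back
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SketchCallService service = new SketchCallService(new SketchRail(2));
        System.out.println(service.call(() -> "hello"));
        System.out.println("successes: " + service.rail().counts[Outcome.SUCCESS.ordinal()].get());
    }
}
```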
There are a couple of provided examples for how this can be done:
- CallService - This class is in the core module and will isolate the call method on Callables with a GuardRail.
- ThreadPoolService - This class is similar to the CallService. However, it executes Callables on a threadpool rather than on the calling thread.
- HttpAsyncService - This class takes a URL string and an AsyncHttpClient upon construction. Calls to makeRequest take a RequestBuilder, apply the URL to the RequestBuilder, and execute the HTTP request using the provided async HTTP client.
The third example demonstrates two interesting properties of Precipice's design.
- By isolating one specific endpoint behind the GuardRail, you can still share a single Netty HTTP client between multiple HttpAsyncService instances for maximum efficiency. Under the hood, a single event loop group can handle all of your application’s IO. This is possible because Precipice does not mandate any specific threading model.
- Passing a RequestBuilder, as opposed to a fully built Request, allows you to utilize a Pattern to submit the request to different endpoints.
Highly available systems often mandate some degree of redundancy. Precipice provides a number of tools to build patterns to utilize these redundant services.
When you construct a Pattern you provide a collection of Precipice implementations and a PatternStrategy. You can call the getPrecipices(long permits) method to return a sequence of Precipices for which permits could be acquired.
The PatternStrategy defines which Precipices the Pattern attempts to acquire permits from.
The Pattern will call nextIndices(), which returns the indices of the Precipices for which to acquire permits. The Pattern continues until it has acquired permits for the number of Precipices defined by acquireCount() or until the indices have been exhausted. It then returns a sequence of the Precipices with acquired permits. If the sequence is empty, all acquire attempts failed.
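The selection loop just described can be sketched as follows. This is an illustration only: `Guarded`, `Strategy`, `select`, and `fixed` are hypothetical names, and representing the indices as an `int[]` is an assumption; only `nextIndices()` and `acquireCount()` come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PatternSketch {
    /** Hypothetical guarded service: reports whether permits could be acquired. */
    interface Guarded {
        boolean tryAcquire(long permits);
    }

    /** Hypothetical strategy mirroring the two methods named in the text. */
    interface Strategy {
        int[] nextIndices();  // order in which to try the services
        int acquireCount();   // how many services the caller wants permits for
    }

    /** Test helper: a strategy with a fixed order and acquire count. */
    static Strategy fixed(int count, int... order) {
        return new Strategy() {
            @Override public int[] nextIndices() { return order.clone(); }
            @Override public int acquireCount() { return count; }
        };
    }

    /** Tries services in strategy order until enough permits are held or indices run out. */
    static <G extends Guarded> List<G> select(List<G> services, Strategy strategy, long permits) {
        List<G> acquired = new ArrayList<>();
        for (int index : strategy.nextIndices()) {
            if (acquired.size() == strategy.acquireCount()) {
                break;
            }
            G candidate = services.get(index);
            if (candidate.tryAcquire(permits)) {
                acquired.add(candidate);
            }
        }
        return acquired; // empty means every acquire attempt was rejected
    }

    public static void main(String[] args) {
        Guarded down = permits -> false;
        Guarded up = permits -> true;
        List<Guarded> acquired = select(Arrays.asList(down, up, up), fixed(1, 0, 1, 2), 1L);
        System.out.println("acquired from " + acquired.size() + " service(s)");
    }
}
```

The caller is still responsible for executing against each returned service and releasing its permits afterwards.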
There are two examples provided in the core module.
- A LoadBalancer - this balances acquire calls to different Precipices. It defines an acquireCount() of one.
- A Shotgun - this randomly distributes acquire calls to different Precipices. It allows a configurable acquireCount(). An example of when this strategy might be useful would be duplicating an idempotent HTTP request to multiple endpoints and taking the first response.
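To make the difference between the two strategies concrete, here is a hedged sketch of each. The class names `RoundRobin` and `RandomOrder` are hypothetical, the `int[]` return type is an assumption, and round-robin rotation is only one possible way to balance calls; the text does not say which balancing scheme the LoadBalancer uses.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

final class StrategySketch {
    interface Strategy {
        int[] nextIndices();
        int acquireCount();
    }

    /** Load-balancer style: rotate the starting index, acquire from exactly one service. */
    static final class RoundRobin implements Strategy {
        private final int size;
        private final AtomicInteger next = new AtomicInteger();

        RoundRobin(int size) { this.size = size; }

        @Override
        public int[] nextIndices() {
            // floorMod keeps the start index valid even after the counter overflows.
            int start = Math.floorMod(next.getAndIncrement(), size);
            int[] order = new int[size];
            for (int i = 0; i < size; i++) {
                order[i] = (start + i) % size;
            }
            return order;
        }

        @Override
        public int acquireCount() { return 1; }
    }

    /** Shotgun style: random order, configurable number of services to acquire from. */
    static final class RandomOrder implements Strategy {
        private final int size;
        private final int count;

        RandomOrder(int size, int count) {
            this.size = size;
            this.count = count;
        }

        @Override
        public int[] nextIndices() {
            int[] order = new int[size];
            for (int i = 0; i < size; i++) {
                order[i] = i;
            }
            // Fisher-Yates shuffle for a uniformly random permutation.
            ThreadLocalRandom random = ThreadLocalRandom.current();
            for (int i = size - 1; i > 0; i--) {
                int j = random.nextInt(i + 1);
                int tmp = order[i];
                order[i] = order[j];
                order[j] = tmp;
            }
            return order;
        }

        @Override
        public int acquireCount() { return count; }
    }

    public static void main(String[] args) {
        RoundRobin balancer = new RoundRobin(3);
        System.out.println(Arrays.toString(balancer.nextIndices()));
        System.out.println(Arrays.toString(balancer.nextIndices()));
        System.out.println(Arrays.toString(new RandomOrder(3, 2).nextIndices()));
    }
}
```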
There are currently three provided back pressure mechanisms.
- Semaphore - limits the maximum number of permits that can be acquired at one point in time.
- RateLimiter - limits the maximum number of permits that can be acquired over a period of time.
- CircuitBreaker - starts rejecting permit acquisition attempts if failures are occurring.
Users can also implement their own mechanisms of back pressure using the BackPressure interface.
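A custom mechanism might look something like the sketch below. The `Mechanism` interface is a hypothetical stand-in for the BackPressure interface, whose actual method signatures are not shown here. The breaker trips after a run of consecutive failures and rejects until a cool-down passes; that policy is an assumption chosen for brevity, not a description of the provided CircuitBreaker.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class BackPressureSketch {
    enum Outcome { SUCCESS, ERROR }
    enum Rejection { CIRCUIT_OPEN }

    /** Hypothetical contract: consulted on acquire, informed on release. */
    interface Mechanism {
        Rejection acquire(long permits, long nanoTime); // null means "allowed"
        void release(long permits, Outcome outcome, long nanoTime);
    }

    /** Simplified breaker: opens after N consecutive failures, for a fixed cool-down. */
    static final class ConsecutiveFailureBreaker implements Mechanism {
        private final int threshold;
        private final long coolDownNanos;
        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private final AtomicLong openedAt = new AtomicLong();

        ConsecutiveFailureBreaker(int threshold, long coolDownNanos) {
            this.threshold = threshold;
            this.coolDownNanos = coolDownNanos;
        }

        @Override
        public Rejection acquire(long permits, long nanoTime) {
            if (consecutiveFailures.get() < threshold) {
                return null; // closed: allow
            }
            // Open: reject until the cool-down has elapsed, then let traffic probe again.
            return nanoTime - openedAt.get() < coolDownNanos ? Rejection.CIRCUIT_OPEN : null;
        }

        @Override
        public void release(long permits, Outcome outcome, long nanoTime) {
            if (outcome == Outcome.SUCCESS) {
                consecutiveFailures.set(0); // any success closes the breaker
            } else if (consecutiveFailures.incrementAndGet() >= threshold) {
                openedAt.set(nanoTime);     // (re)start the cool-down window
            }
        }
    }

    public static void main(String[] args) {
        Mechanism breaker = new ConsecutiveFailureBreaker(2, 1_000_000_000L);
        long now = System.nanoTime();
        breaker.release(1, Outcome.ERROR, now);
        breaker.release(1, Outcome.ERROR, now);
        System.out.println("after two failures: " + breaker.acquire(1, now));
    }
}
```

The sketch uses only atomic reads and writes, so concurrent callers never block each other; the occasional race between a success and a failure is tolerated rather than prevented.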
As (briefly) mentioned above, there is a Precipice implementation that isolates callable execution behind a GuardRail and on a threadpool. Out of the box, this should work for many different use cases.
There are multiple metric options provided. Some keep total counts for the entire application lifetime. Others are rolling, so you can query the metrics for specific time periods. I am also working on others that are only written to by a background thread. This would be a specialized case for users that demand very low latency on permit acquisition.
A significant point of emphasis in the design of Precipice is to provide back pressure when necessary. However, this is not necessarily the only use for Precipice. It is possible to configure a GuardRail with a rejected type of Unrejectable and no back pressure. The purpose of this usage would be to utilize only the result latency and count metrics components for monitoring execution.
Finally, the metrics allow configurable types. This allows you to create results that are further segmented than just success or failure. The HttpAsyncService provides a good example of a result type that defines both status code 200 and status code non-200 as successful results that we would like to monitor. Similarly, it segments failures as timeout vs. error. It would be possible to further split out error metrics by specific exception type.
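A segmented result type along these lines could be sketched as an enum whose constants each declare whether they count as a success. The names `HttpOutcome`, `Counts`, and `failures` are hypothetical and chosen for illustration; they are not taken from the HttpAsyncService source.

```java
import java.util.concurrent.atomic.LongAdder;

final class ResultTypeSketch {
    /** Hypothetical result enum: two kinds of success, two kinds of failure. */
    enum HttpOutcome {
        STATUS_200(true),
        STATUS_NON_200(true),
        TIMEOUT(false),
        ERROR(false);

        private final boolean success;

        HttpOutcome(boolean success) { this.success = success; }

        boolean isSuccess() { return success; }
    }

    /** One lock-free counter per enum constant, indexed by ordinal. */
    static final class Counts<T extends Enum<T>> {
        private final LongAdder[] counters;

        Counts(Class<T> type) {
            counters = new LongAdder[type.getEnumConstants().length];
            for (int i = 0; i < counters.length; i++) {
                counters[i] = new LongAdder();
            }
        }

        void record(T outcome) { counters[outcome.ordinal()].increment(); }

        long count(T outcome) { return counters[outcome.ordinal()].sum(); }
    }

    /** Aggregates every constant flagged as a failure. */
    static long failures(Counts<HttpOutcome> counts) {
        long total = 0;
        for (HttpOutcome outcome : HttpOutcome.values()) {
            if (!outcome.isSuccess()) {
                total += counts.count(outcome);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Counts<HttpOutcome> counts = new Counts<>(HttpOutcome.class);
        counts.record(HttpOutcome.STATUS_200);
        counts.record(HttpOutcome.STATUS_NON_200);
        counts.record(HttpOutcome.TIMEOUT);
        System.out.println("failures: " + failures(counts));
    }
}
```

Because the success flag lives on the enum, a failure-sensitive component can treat TIMEOUT and ERROR alike while dashboards still see them as separate series.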
An Emphasis on Performance
Precipice was designed with performance in mind. I am still in the process of optimizing it. However, right now it should easily meet the performance demands of most use cases.
There is no locking (or use of synchronized) on permit acquisition or release. All of the provided metrics and circuit breakers are updated and read in a lock-free manner.
Every method that uses nano time has an arity that accepts the nano time as an argument. As a result, task execution should need only one call to System.nanoTime() on acquisition and one call on release.
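The idea of paired arities can be sketched like this. `TimedGuard` and its methods are hypothetical; the sketch only shows how a convenience overload delegates to one that takes the clock reading, so a caller who already has a timestamp can pass it through instead of reading the clock again.

```java
import java.util.concurrent.atomic.LongAdder;

final class NanoTimeSketch {
    /** Hypothetical guard that tracks latency from caller-supplied clock readings. */
    static final class TimedGuard {
        private final LongAdder totalLatencyNanos = new LongAdder();
        private final LongAdder completions = new LongAdder();

        /** Convenience arity: reads the clock once on the caller's behalf. */
        long acquire() {
            return acquire(System.nanoTime());
        }

        /** Explicit arity: reuses a reading the caller already has; returns the start time. */
        long acquire(long nanoTime) {
            return nanoTime;
        }

        void release(long startNanos) {
            release(startNanos, System.nanoTime());
        }

        void release(long startNanos, long nanoTime) {
            totalLatencyNanos.add(nanoTime - startNanos);
            completions.increment();
        }

        long meanLatencyNanos() {
            long n = completions.sum();
            return n == 0 ? 0 : totalLatencyNanos.sum() / n;
        }
    }

    public static void main(String[] args) {
        TimedGuard guard = new TimedGuard();
        long start = guard.acquire(System.nanoTime()); // the only clock read on acquisition
        // ... guarded work would run here ...
        guard.release(start, System.nanoTime());       // the only clock read on release
        System.out.println("mean latency (ns): " + guard.meanLatencyNanos());
    }
}
```

Passing the time explicitly also makes the timing logic deterministic in tests, since a test can supply fixed values instead of real clock readings.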
I am also working on variants of the different components that can be written to by background threads. This would allow many of the volatile reads and writes to be removed from the acquisition and release calls.
Finally, there are options with the Pattern class to use thread local iterators when calling getPrecipices(long permits) to avoid object allocation. This should only be used if you are an advanced user and know what you are doing.
Precipice is currently used in production at Staples SparX. It is nearing 1.0. However, the API may still change between now and then.