Advanced Reactive Java

ConnectableObservables (part 3)


Introduction

In the previous post, we saw how to build a ConnectableObservable which publishes events to its child Subscribers only when all of them have requested some amount, making them go in lockstep.

In this blog post, I'm going to detail how one can build a replay-like ConnectableObservable: i.e., ReplayConnectableObservable. The internal structure is very similar to the PublishConnectableObservable but the request coordination is going to be more complicated.

Replay bounded or unbounded

When one wants to create a replay-like operator (or Subject), a decision has to be made whether to do bounded or unbounded replays. Unbounded replay means that from the time of the connect(), every value is essentially cached/buffered and every subscriber will receive values from the very beginning.

Bounded replay means that the cache will start losing data as time and values go by so a late subscriber will "skip" these early values and only get the newer ones.
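To make the difference concrete, here is a minimal, non-concurrent sketch (plain Java collections and a hypothetical helper name of my own, not the actual operator) of what a late subscriber would receive from an unbounded versus a size-bounded replay buffer:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BoundedReplayDemo {
    // what a late subscriber would receive from a size-bounded replay buffer
    static List<Integer> replayLast(List<Integer> source, int maxSize) {
        Deque<Integer> buffer = new ArrayDeque<>();
        for (Integer v : source) {
            if (buffer.size() == maxSize) {
                buffer.pollFirst();          // trim the oldest value
            }
            buffer.offerLast(v);
        }
        return new ArrayList<>(buffer);
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5);
        // unbounded: a latecomer sees every value from the beginning
        System.out.println(replayLast(source, Integer.MAX_VALUE)); // [1, 2, 3, 4, 5]
        // bounded to 2: the latecomer "skips" 1..3 and only gets the newest
        System.out.println(replayLast(source, 2)); // [4, 5]
    }
}
```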

However, the data structures supporting these modes are quite different. The unbounded buffer can be any list-like data structure, such as j.u.List or a hybrid linked-array list (to avoid copying when the list grows). The bounded buffer is going to be a linked-list like structure, but j.u.LinkedList can't work here; we need access to the individual nodes.

The reason is twofold: 1) we need a way to tell the "current start" of the buffer as time goes by and 2) we have to deal with child Subscribers who lag behind with their requests and can't be allowed to miss in-between values.

The right data structure is a singly-linked list where nodes hold the actual value. We then keep references to the head and tail of the list. The head indicates where the replay will start for newcomers and the tail indicates where to append new nodes containing values from the main source.

This structure has two implications: 1) due to the singly linked nature, if the head of the list is no longer referenced by the head or by any child Subscriber, it can be "automatically" garbage collected and 2) if we pin the head pointer and never move it, we get an unbounded replay buffer (although with more overhead due to pointer chasing).

For unbounded buffers, both head and tail are integers, head is zero and tail is the number of available values.

In addition, each subscriber (or its wrapper structure) has to track where it is in the replay: either via an index into the list or a node reference into the linked list.

Since we'd like to support both modes, which only differ in the buffer management, let's declare a basic interface that captures buffer operations.


interface ReplayBuffer<T> {
    void onNext(T value);
    void onError(Throwable e);
    void onCompleted();
    void replay(ReplayProducer<T> child);
}

The interface is straightforward: it takes the various events and allows replaying them to a specific child subscriber (described later).

Unbounded replay buffer

Now let's see the implementation for the unbounded replay buffer:


static final class UnboundedReplayBuffer<T> implements ReplayBuffer<T> {
    final List<Object> values = new ArrayList<>();
    volatile int size;
    final NotificationLite<T> nl = NotificationLite.instance();

    @Override
    public void onNext(T value) {
        values.add(nl.next(value));
        size++;
    }

    @Override
    public void onError(Throwable e) {
        values.add(nl.error(e));
        size++;
    }

    @Override
    public void onCompleted() {
        values.add(nl.completed());
        size++;
    }

    @Override
    public void replay(ReplayProducer<T> child) {
        if (child.wip.getAndIncrement() != 0) {
            return;
        }

        int missed = 1;

        for (;;) {

            // implement

            missed = child.wip.addAndGet(-missed);
            if (missed == 0) {
                break;
            }
        }
    }
}

We simply convert each event into a notification and add it to the list. Incrementing the volatile size field acts as a release (no need for an atomic increment because the callers of the onXXX methods are serialized); therefore, observing its value means the values list can be safely iterated up to that point (all resize-related operations have been committed). The replay method, so far, is the well-known queue-drain pattern: a single thread will enter and do whatever it can to emit values. Let's see the drain part of this method:
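The release/acquire pairing described above can be sketched in isolation: the single serialized writer publishes the list contents by writing the volatile size afterwards, and any reader reads size first and only iterates up to the value it observed (class and method names are mine, a simplified sketch of the pattern, not the operator code):

```java
import java.util.ArrayList;
import java.util.List;

public class SizePublish {
    final List<Object> values = new ArrayList<>();
    volatile int size; // the volatile write releases all prior list changes

    // called only by the (already serialized) producer side
    void add(Object v) {
        values.add(v);
        size = size + 1; // plain read-modify-write is fine: single writer
    }

    // safe on any reader thread: read size first, then iterate up to it
    List<Object> snapshot() {
        int s = size;    // acquire: everything written before 'size' is visible
        List<Object> out = new ArrayList<>(s);
        for (int i = 0; i < s; i++) {
            out.add(values.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        SizePublish p = new SizePublish();
        p.add("a");
        p.add("b");
        System.out.println(p.snapshot()); // [a, b]
    }
}
```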


// for (;;)

long r = child.requested.get();
boolean unbounded = r == Long.MAX_VALUE;
long e = 0;
int index = child.index;                 // (1)

while (r != 0L && index != size) {       // (2)
    if (child.isUnsubscribed()) {
        return;
    }

    Object v = values.get(index);        // (3)

    if (nl.accept(child.child, v)) {     // (4)
        return;
    }

    index++;
    r--;
    e--;                                 // (5)
}

if (e != 0L) {
    child.index = index;                 // (6)
    if (!unbounded) {
        child.requested.addAndGet(e);
    }
}
// missed = ...

This should also look familiar, let's see the reasoning behind certain lines:


  1. We retrieve the current child requested amount and the current child index. We remember if the request amount was Long.MAX_VALUE and have a counter for emitted values.
  2. We have to try emitting if the child can receive it and we haven't reached the end of the available values.
  3. If both requests and values are available, we get the next event by index.
  4. The NotificationLite.accept will convert the notification object into the proper onXXX call on the child Subscriber and return true if said event is a terminal event.
  5. We increment the index, decrement the remaining requested amount and decrement the emission counter e. The latter may look strange, but it saves us a negation when we update the child's requested amount in (6).
  6. Finally, if there was any emission, we save the new index and if the child request wasn't unbounded, subtract the emitted count from the child request amount.
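The downward-counting trick from (5) can be verified in isolation: since e ends up negative, it can be passed to addAndGet directly without a negation (a standalone sketch with a method name of my own, not the operator code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class NegativeCounterDemo {
    // simulates the drain loop's bookkeeping after 'emitted' values went out
    static long remainingAfter(long requestedAmount, int emitted) {
        AtomicLong requested = new AtomicLong(requestedAmount);
        long e = 0;
        for (int i = 0; i < emitted; i++) {
            e--; // count each emission downwards instead of upwards
        }
        // e is already negative, so no negation is needed here
        return requested.addAndGet(e);
    }

    public static void main(String[] args) {
        System.out.println(remainingAfter(5, 3)); // 2 requests left outstanding
    }
}
```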

Bounded replay buffer

Managing a bounded replay buffer is more involved. I'm going to show a size-bound version, but you should be able to derive your own custom bounding logic based on it. First, we need a Node type that will hold the actual value and the link to the next Node:


static final class Node {
    final Object value;
    final long id;
    volatile Node next;

    public Node(Object value, long id) {
        this.value = value;
        this.id = id;
    }
}

The node holds the actual value, a pointer to the next node and an id field. This field will help with the request coordination later on.

Now let's see the implementation of the BoundedReplayBuffer:


static final class BoundedReplayBuffer<T>
implements ReplayBuffer<T> {

    final NotificationLite<T> nl =
            NotificationLite.instance();

    volatile Node head;                       // (1)
    Node tail;

    int size;                                 // (2)
    final int maxSize;
    long id;

    public BoundedReplayBuffer(int maxSize) { // (3)
        this.maxSize = maxSize;
        tail = new Node(null, 0);
        head = tail;
    }

    void add(Object value) {                  // (4)
        Node n = new Node(value, ++id);
        Node t = tail;
        tail = n;
        t.next = n;
    }

    @Override
    public void onNext(T value) {
        add(nl.next(value));
        if (size == maxSize) {                // (5)
            Node h = head;
            head = h.next;
        } else {
            size++;
        }
    }

    @Override
    public void onError(Throwable e) {        // (6)
        add(nl.error(e));
    }

    @Override
    public void onCompleted() {
        add(nl.completed());
    }

    @Override
    public void replay(ReplayProducer<T> child) { // (7)
        if (child.wip.getAndIncrement() != 0) {
            return;
        }

        int missed = 1;

        for (;;) {

            // implement

            missed = child.wip.addAndGet(-missed);
            if (missed == 0) {
                break;
            }
        }
    }
}

This kind of buffer has to consider more state:


  1. We have to keep a pointer to the head and tail of the linked node structure. The head has to be volatile because we are going to read it when a child Subscriber subscribes to it; I call this pinning. The tail is only modified from the thread of the main source (already serialized) and is never accessed by the child Subscribers so no need for volatile there.
  2. Since we want to limit the number of values to be replayed, we have to know the current size (without walking the linked list all the time) and the maximum allowed count. In addition, we tag each node with a unique running identifier that will come into play during request coordination.
  3. In the constructor, we create our first empty node and assign it to both head and tail. This may seem a bit odd but has its reasons: it allows appending to the end of an empty buffer, otherwise an empty buffer would have null pointers and we'd get a discontinuity. There are two small drawbacks: a) the start value at any given time is head.next.value, behind an indirection, and b) as we will move the head pointer ahead in (5), it retains one extra value. In other terms, a replay(5) will keep 6 objects alive. This is true for RxJava's replay() and ReplaySubject too. If one really wants to avoid retaining this extra value, one has to apply reference counting to the nodes, which itself adds overhead for every value, both when added to the buffer and when replayed.
  4. We will add new nodes of notifications to the linked list via add. The operation is straightforward: create a new node with a new unique identifier, make it the tail and set the next field of the old tail to this new node. The order is important here because next is volatile and acts as a release operation to all changes made before.
  5. Whenever a normal value arrives, we add it to the list and see if we are already at the capacity limit. If not, we can increment the size counter freely. Otherwise, there is no need to change the size anymore as the plus 1 from the add and minus 1 from the remove operation cancels out. This remove operation is basically moving the head forward by one node: given the current head, make the new head the next pointer of the old head. Since the linked structure is guaranteed to have at least one node (due to add()), the new head won't be null and the continuity is preserved.
  6. Since the terminal events are (usually) not part of the size bound, we can simply add their node and not care about trimming the list.
  7. Again, the outer drain loop has the well known pattern.

Now let's see the inner parts of the drain loop of (7):


// for (;;) {

long r = child.requested.get();
boolean unbounded = r == Long.MAX_VALUE;
long e = 0;
Node index = child.node;

if (index == null) {                     // (1)
    index = head;
    child.node = index;

    child.addTotalRequested(index.id);   // (2)
}

while (r != 0L && index.next != null) {  // (3)
    if (child.isUnsubscribed()) {
        return;
    }

    Object v = index.next.value;

    if (nl.accept(child.child, v)) {
        return;
    }

    index = index.next;                  // (4)
    r--;
    e--;
}

if (e != 0L) {
    child.node = index;
    if (!unbounded) {
        child.requested.addAndGet(e);
    }
}
// missed = ...

At this point, it shouldn't come as a surprise that the implementation uses the same pattern as the UnboundedReplayBuffer, but there are a few differences:


  1. Since the nodes are object references, their default is null, so the first time replay is called we have to capture (pin) the current head of the buffer (and store it in case the requested amount is still zero).
  2. The addTotalRequested will get this first node's unique identifier. The reason will be explained in the request coordination section below.
  3. To see if we reached the end of the available values, we have to check the next field of index.
  4. If the index.next was not null, we have a value for emission and can move the current index ahead by one node.

Request coordination

So far, there shouldn't be anything overly complicated with the classes (apart from a few unexplained methods and the structure of ReplayProducer).

As stated in the previous blog post, generally there are two ways to coordinate requests: lock-stepping and max-requesting. Lock-stepping was quite suitable for the PublishConnectableObservable.

Let's think about lock-stepping in terms of the replay operation we want to implement. If we do unbounded buffering, requesting the minimum amount across all child subscribers doesn't really help: we retain all values anyway, so every subscriber gets its requested amount replayed regardless of the others. If there is some Subscriber that can take it all, why not fetch the values for it?

If we want to do bounded buffering, child Subscribers may come and go at different times, which means the current identifier inside the BoundedReplayBuffer is different for each one, and each Subscriber will essentially request values relative to this identifier. Here, there is no clear definition of a minimum request: a request of 5 in an earlier Subscriber and a request of 2 in a later Subscriber that arrives after the 2nd source value can't be meaningfully compared.

Based on this reasoning, what we will do is implement the request coordination so that it requests the maximum amount that any child has requested at any time and lets the queue-drain deal with the emission.

However, we still have the problem of non-comparable request amounts due to potential time differences. This is where the unique identifier and another structure come into play: keeping track of the total requested amount (along with the relative requested amount). Whenever a child Subscriber requests, we add this request amount to that particular Subscriber's totalRequested amount (ReplayProducer.totalRequested) and see if it is bigger than the total requested amount we have sent to the main source. If bigger, we request only the difference from upstream.

The unique identifier helps with latecomers in our total-requested scheme. Without it, a latecomer's total requested amount would be too low and wouldn't trigger an upstream request in certain situations. For example, let's assume we have a child Subscriber on range(1, 10).replay(1) that requests 2 elements and gets them. Then a new subscriber comes in and requests 2 as well. Clearly, it should receive the values (2, 3), but since its total requested amount is just 2, the replay operator won't request the extra value from upstream. The solution is the indexing of values: when the current Node is first captured, we use its index as the initial total requested amount for the child, as if the child had been there from the beginning but ignored all values up to that point.
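The latecomer scenario above can be replayed with plain numbers (a standalone sketch of the coordination arithmetic; the method name is mine and not part of the operator):

```java
public class LatecomerDemo {
    // how much extra must be requested from upstream, given the previously
    // known maximum and the current total-requested amounts of the children
    static long upstreamDiff(long maxChildRequested, long... childTotals) {
        long max = maxChildRequested;
        for (long t : childTotals) {
            max = Math.max(max, t);
        }
        return max - maxChildRequested;
    }

    public static void main(String[] args) {
        // subscriber A requested 2; the operator requested 2 from upstream
        long maxChildRequested = 2;

        // latecomer B pins the buffer head whose node id is 1 (it "missed"
        // one value), then requests 2: its total becomes 1 + 2 = 3
        System.out.println(upstreamDiff(maxChildRequested, 3)); // 1 extra value

        // without the id adjustment B's total would be only 2: no upstream
        // request, and B would hang after the single buffered value
        System.out.println(upstreamDiff(maxChildRequested, 2)); // 0
    }
}
```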

Note: this property was only recently discovered and, as such, RxJava didn't work correctly. PR #3454 fixes this for the 1.x series and I'll post a PR for 2.x later.

To make this more clear, let's see the implementation of the ReplayProducer.

static final class ReplayProducer<T>
implements Producer, Subscription {

    int index;
    Node node;                                // (1)

    final Subscriber<? super T> child;
    final AtomicLong requested;
    final AtomicInteger wip;
    final AtomicLong totalRequested;
    final AtomicBoolean once;                 // (2)

    Connection<T> connection;

    public ReplayProducer(
            Subscriber<? super T> child) {
        this.child = child;
        this.requested = new AtomicLong();
        this.totalRequested = new AtomicLong();
        this.wip = new AtomicInteger();
        this.once = new AtomicBoolean();
    }

    @Override
    public void request(long n) {
        if (n > 0) {
            BackpressureUtils
                .getAndAddRequest(requested, n);
            BackpressureUtils
                .getAndAddRequest(totalRequested, n); // (3)

            connection.manageRequests();      // (4)
        }
    }

    @Override
    public boolean isUnsubscribed() {
        return once.get();
    }

    @Override
    public void unsubscribe() {
        if (once.compareAndSet(false, true)) {
            connection.remove(this);          // (5)
        }
    }

    void addTotalRequested(long n) {          // (6)
        if (n > 0) {
            BackpressureUtils
                .getAndAddRequest(totalRequested, n);
        }
    }
}


Its purpose is to be set on a child Subscriber and mediate the requests and unsubscription for it:

  1. We want to use the same class for both the bounded and unbounded buffer mode so we have to store the current index/node in fields.
  2. We have the usual set of fields: the child Subscriber, the wip counter for the queue-drain serialization, the current requested amount and an AtomicBoolean field indicating an unsubscribed state. In addition we will track the total requested amount and will coordinate requesting from upstream with the help of it.
  3. Whenever the child requests, we update both the relative requested amount and the total requested amount with the common BackpressureUtils helper that will cap the amounts at Long.MAX_VALUE if necessary.
  4. Once set, we have to trigger a request management to determine if the upstream needs to be requested or not.
  5. When the child unsubscribes, we need to remove this ReplayProducer from the array of tracked ReplayProducers.
  6. Finally, the bounded buffer's replay requires updating the total requested amount before emission so that the request coordination works with latecomers as well.
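The capping behavior mentioned in (3) can be sketched as a CAS loop, along the lines of what BackpressureUtils.getAndAddRequest does (this is my simplified sketch, not the library source):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CappedAdd {
    // adds n to requested, capping the running total at Long.MAX_VALUE,
    // and returns the value before the addition
    static long getAndAddCapped(AtomicLong requested, long n) {
        for (;;) {
            long current = requested.get();
            if (current == Long.MAX_VALUE) {
                return current;             // already "unbounded", stay there
            }
            long next = current + n;
            if (next < 0) {
                next = Long.MAX_VALUE;      // overflow means cap at maximum
            }
            if (requested.compareAndSet(current, next)) {
                return current;
            }
        }
    }

    public static void main(String[] args) {
        AtomicLong r = new AtomicLong(10);
        getAndAddCapped(r, Long.MAX_VALUE);
        System.out.println(r.get() == Long.MAX_VALUE); // true: capped, no overflow
    }
}
```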

Before looking at the manageRequests() call, I have to show the skeleton of the Connection class (the equivalent class from PublishConnectableObservable):


@SuppressWarnings({ "unchecked", "rawtypes" })
static final class Connection<T> implements Observer<T> {

    final AtomicReference<ReplayProducer<T>[]> subscribers;
    final State<T> state;
    final AtomicBoolean connected;
    final AtomicInteger wip;

    final SourceSubscriber parent;

    final ReplayBuffer<T> buffer;             // (1)

    static final ReplayProducer[] EMPTY =
            new ReplayProducer[0];

    static final ReplayProducer[] TERMINATED =
            new ReplayProducer[0];

    long maxChildRequested;                   // (2)
    long maxUpstreamRequested;

    public Connection(State<T> state, int maxSize) {
        this.state = state;
        this.wip = new AtomicInteger();
        this.subscribers = new AtomicReference<>(EMPTY);
        this.connected = new AtomicBoolean();
        this.parent = createParent();

        ReplayBuffer b;                       // (3)
        if (maxSize == Integer.MAX_VALUE) {
            b = new UnboundedReplayBuffer<>();
        } else {
            b = new BoundedReplayBuffer<>(maxSize);
        }
        this.buffer = b;
    }

    SourceSubscriber createParent() {         // (4)
        SourceSubscriber parent =
                new SourceSubscriber<>(this);

        parent.add(Subscriptions.create(() -> {
            switch (state.strategy) {
            case SEND_COMPLETED:
                onCompleted();
                break;
            case SEND_ERROR:
                onError(new CancellationException(
                        "Disconnected"));
                break;
            default:
                parent.unsubscribe();
                subscribers.getAndSet(TERMINATED);
            }
        }));

        return parent;
    }

    boolean add(ReplayProducer<T> producer) {
        // omitted
    }

    void remove(ReplayProducer<T> producer) {
        // omitted
    }

    void onConnect(
            Action1<? super Subscription> disconnect) {
        // omitted
    }

    @Override
    public void onNext(T t) {                 // (5)
        ReplayBuffer<T> buffer = this.buffer;
        buffer.onNext(t);
        ReplayProducer<T>[] a = subscribers.get();
        for (ReplayProducer<T> rp : a) {
            buffer.replay(rp);
        }
    }

    @Override
    public void onError(Throwable e) {
        ReplayBuffer<T> buffer = this.buffer;
        buffer.onError(e);
        ReplayProducer<T>[] a = subscribers.getAndSet(TERMINATED);
        for (ReplayProducer<T> rp : a) {
            buffer.replay(rp);
        }
    }

    @Override
    public void onCompleted() {
        ReplayBuffer<T> buffer = this.buffer;
        buffer.onCompleted();
        ReplayProducer<T>[] a = subscribers.getAndSet(TERMINATED);
        for (ReplayProducer<T> rp : a) {
            buffer.replay(rp);
        }
    }

    void manageRequests() {                   // (6)
        if (wip.getAndIncrement() != 0) {
            return;
        }

        int missed = 1;

        for (;;) {

            // implement

            missed = wip.addAndGet(-missed);
            if (missed == 0) {
                break;
            }
        }
    }
}

The class looks much the same as PublishConnectableObservable.Connection; therefore, I've omitted the methods that are exactly the same. Let's see the rest:


  1. Instead of a bounded queue, we now have the common ReplayBuffer interface.
  2. We have to keep track of the maximum values of both child requests and requests issued to upstream. The latter is necessary because we can't know when the upstream's Producer arrives and we have to accumulate the coordinated request amount until it does.
  3. I treat Integer.MAX_VALUE as the indicator for the unbounded replay mode.
  4. The createParent is slightly changed. Instead of the disconnected flag, we now unsubscribe directly from upstream. The implementations add, remove and onConnect are the same as in the last post.
  5. The onXXX methods have the same pattern: call the appropriate method on the buffer instance and then call replay for all known ReplayProducer instances. Note that the terminal events also swap in the TERMINATED array atomically, indicating that subsequent Subscribers have to go to the next Connection object.
  6. Last but not least, we have to manage requests from all child Subscribers which may call the method concurrently and thus we have to do some serialization. Since we are going to calculate the maximum to request, the non-blocking serialization approach works here quite well. This method is called when the upstream producer finally arrives and when any child subscriber requests something.

Now let's dive into the request coordination logic.


// for (;;) {

ReplayProducer<T>[] a = subscribers.get();

if (a == TERMINATED) {
    return;
}

long ri = maxChildRequested;
long maxTotalRequests = ri;              // (1)

for (ReplayProducer<T> rp : a) {
    maxTotalRequests = Math.max(
            maxTotalRequests,
            rp.totalRequested.get());
}

long ur = maxUpstreamRequested;
Producer p = parent.producer;

long diff = maxTotalRequests - ri;       // (2)
if (diff != 0) {
    maxChildRequested = maxTotalRequests;
    if (p != null) {                     // (3)
        if (ur != 0L) {
            maxUpstreamRequested = 0L;
            p.request(ur + diff);        // (4)
        } else {
            p.request(diff);
        }
    } else {
        long u = ur + diff;
        if (u < 0) {
            u = Long.MAX_VALUE;
        }
        maxUpstreamRequested = u;        // (5)
    }
} else
if (ur != 0L && p != null) {             // (6)
    maxUpstreamRequested = 0L;
    p.request(ur);
}

// missed = ...

Let's see how it works:


  1. After retrieving the current array of Subscribers and checking for the disconnected/terminated state, we compute the maximum of the total requested amount of each subscriber (and the previously known maximum).
  2. We calculate the difference from the last known maximum. If the difference is non zero, we remember the new maximum in maxChildRequested.
  3. At this point, the upstream Producer may be still missing. 
  4. If the producer is already there, we take any missed amount and the current difference and request it.
  5. Otherwise, without a producer, all we can do is to accumulate all the missed differences.
  6. If the maximum didn't change, we may still have to request all the missed amounts if the Producer is there. As with (4), we have to "forget" the missed values so that the next time requests are coordinated, the upstream only receives the non-zero difference from then on.

In other terms, we collect how far each child subscriber wants to go and request from the upstream based on it.

As you may have noticed, this request coordination and the calls to it can become quite expensive if there are lots of child Subscribers requesting left and right. In fact, we only have to deal with a limited set of requesters at a time, not with everyone. To reduce the performance impact, we introduce a well-known pattern, the emitter-loop (or queue-drain), with the same serialization logic, but the method now receives a parameter indicating who wants to update the coordinated request amount. This way, when a single child requests and the others don't, only that child is evaluated instead of all of them.

There is, however, one thing to prepare for: the arrival of the upstream Producer in which case we still have to check all children. For this, we need to extend the Connection object with some extra fields:

List<ReplayProducer<T>> coordinationQueue;
boolean coordinateAll;
boolean emitting;
boolean missed;

You might have guessed what approach this will take: an emitter loop. We can drop the wip counter and replace it with emitting/missed.


void manageRequests(ReplayProducer<T> inner) {
    synchronized (this) {                     // (1)
        if (emitting) {
            if (inner != null) {
                List<ReplayProducer<T>> q =
                        coordinationQueue;
                if (q == null) {
                    q = new ArrayList<>();
                    coordinationQueue = q;
                }
                q.add(inner);
            } else {
                coordinateAll = true;
            }
            missed = true;
            return;
        }
        emitting = true;
    }

    long ri = maxChildRequested;
    long maxTotalRequested;

    if (inner != null) {                      // (2)
        maxTotalRequested = Math.max(
                ri, inner.totalRequested.get());
    } else {
        maxTotalRequested = ri;

        ReplayProducer<T>[] a = subscribers.get();
        for (ReplayProducer<T> rp : a) {
            maxTotalRequested = Math.max(
                    maxTotalRequested, rp.totalRequested.get());
        }
    }

    makeRequest(maxTotalRequested, ri);

    for (;;) {
        if (isUnsubscribed()) {
            return;
        }

        List<ReplayProducer<T>> q;
        boolean all;
        synchronized (this) {                 // (3)
            if (!missed) {
                emitting = false;
                return;
            }
            missed = false;
            q = coordinationQueue;
            coordinationQueue = null;
            all = coordinateAll;
            coordinateAll = false;
        }

        ri = maxChildRequested;               // (4)
        maxTotalRequested = ri;

        if (q != null) {
            for (ReplayProducer<T> rp : q) {
                maxTotalRequested = Math.max(
                        maxTotalRequested, rp.totalRequested.get());
            }
        }

        if (all) {
            ReplayProducer<T>[] a = subscribers.get();
            for (ReplayProducer<T> rp : a) {
                maxTotalRequested = Math.max(
                        maxTotalRequested, rp.totalRequested.get());
            }
        }

        makeRequest(maxTotalRequested, ri);
    }
}

It works as follows:


  1. First, we try to enter the emission loop. If it fails and the parameter to the method was null, we set the coordinateAll flag which will trigger a full sweep. Otherwise, we queue up the ReplayProducer and quit.
  2. If the current thread managed to get into the emission state, we either determine the maximum requested by using the single ReplayProducer the method was called with or do a full sweep if it was actually null.
  3. Next comes the loop part of the emitter-loop approach. We check if we missed some calls and get all the queued up ReplayProducers as well as the indicator for a full sweep.
  4. Given all previous inputs, we sweep the queued-up ReplayProducers for the maximum value and, if necessary, all the other known ReplayProducers as well. Note that both sweeps may have to run, since the queue may contain ReplayProducers not known at the time this method runs and vice versa.

Finally, the upstream requesting can be factored out into a common method:

void makeRequest(long maxTotalRequests,
        long previousTotalRequests) {
    long ur = maxUpstreamRequested;
    Producer p = producer;

    long diff = maxTotalRequests - previousTotalRequests;
    if (diff != 0) {
        maxChildRequested = maxTotalRequests;
        if (p != null) {
            if (ur != 0L) {
                maxUpstreamRequested = 0L;
                p.request(ur + diff);
            } else {
                p.request(diff);
            }
        } else {
            long u = ur + diff;
            if (u < 0) {
                u = Long.MAX_VALUE;
            }
            maxUpstreamRequested = u;
        }
    } else
    if (ur != 0L && p != null) {
        maxUpstreamRequested = 0L;
        // fire the accumulated requests
        p.request(ur);
    }
}

which is practically the same as the body of the original sweep-all manageRequests() method.

ReplayConnectableObservable

All that remains in this post is to show the SourceSubscriber class and the ReplayConnectableObservable itself.

Since we need the Producer from upstream, we use the SourceSubscriber to store it for us and grab it once it's ready. Note that we can't use Subscriber.request() here for two reasons: a) calls to request() don't accumulate until a Producer arrives and b) we can't tell whether a Producer has arrived or not.


static final class SourceSubscriber<T>
extends Subscriber<T> {

    final Connection<T> connection;

    volatile Producer producer;

    public SourceSubscriber(Connection<T> connection) {
        this.connection = connection;
    }

    @Override
    public void onNext(T t) {
        connection.onNext(t);
    }

    @Override
    public void onError(Throwable e) {
        connection.onError(e);
    }

    @Override
    public void onCompleted() {
        connection.onCompleted();
    }

    @Override
    public void setProducer(Producer p) {
        producer = p;
        connection.manageRequests();
    }
}

Nothing outstanding: we delegate everything to the Connection instance. Note the connection.manageRequests() call which triggers the request coordination to actually request the amount held in the maxUpstreamRequested field (i.e., the missed requests). If we use the more performant version, the call is manageRequests(null) instead.

The State class also changed a bit due to the indication of bounded buffering and the need to start replaying to a new Subscriber once it has successfully subscribed to the current connection.


static final class State<T> implements OnSubscribe<T> {
    final DisconnectStrategy strategy;
    final Observable<T> source;
    final int maxSize;                        // (1)

    final AtomicReference<Connection<T>> connection;

    public State(DisconnectStrategy strategy,
            Observable<T> source, int maxSize) {
        this.strategy = strategy;
        this.source = source;
        this.maxSize = maxSize;
        this.connection = new AtomicReference<>(
                new Connection<>(this, maxSize));
    }

    @Override
    public void call(Subscriber<? super T> s) {
        ReplayProducer<T> pp = new ReplayProducer<>(s);

        for (;;) {
            Connection<T> curr = this.connection.get();

            pp.connection = curr;

            if (curr.add(pp)) {
                if (pp.isUnsubscribed()) {
                    curr.remove(pp);
                } else {
                    curr.buffer.replay(pp);   // (2)

                    s.add(pp);
                    s.setProducer(pp);
                }

                break;
            }
        }
    }

    public void connect(
            Action1<? super Subscription> disconnect) {
        // same as before
    }

    public void replaceConnection(Connection<T> conn) { // (3)
        Connection<T> next =
                new Connection<>(this, maxSize);
        connection.compareAndSet(conn, next);
    }
}

There are some changes:

  1. We have to store the maxSize parameter because a reconnection has to recreate the appropriate ReplayBuffer instance as well.
  2. Once we create a ReplayProducer, we first try to add it to the current connection. If successful, we do a bare-bones replay call. Since the ReplayProducer has a requested amount of zero, this won't replay any value to the child Subscriber. What it does is capture (pin) the current head of the buffer's linked list (if the buffer is bounded) and make sure this ReplayProducer starts with the correct total requested amount. Only after this setup is the ReplayProducer added to the child as an unsubscription and request target.
  3. Note that the Connection now requires a maxSize parameter.

Note that the ordering in (2) only works because I've shown a replay implementation that replays terminal events only when requested. This is not a required behavior or expectation for terminal events, but it shouldn't cause any real-world problems as most Subscribers just keep requesting.

Finally, we still need factory methods to create instances of ReplayConnectableObservable:


public static <T> ReplayConnectableObservable<T> createUnbounded(
        Observable<T> source,
        DisconnectStrategy strategy) {
    return createBounded(source, strategy, Integer.MAX_VALUE);
}

public static <T> ReplayConnectableObservable<T> createBounded(
        Observable<T> source,
        DisconnectStrategy strategy, int maxSize) {
    State<T> state = new State<>(strategy, source, maxSize);
    return new ReplayConnectableObservable<>(state);
}


Conclusion

In this blog post, I've detailed the inner workings of a replay-like ConnectableObservable that can do both bounded and unbounded replays. The complexity is one level up from the PublishConnectableObservable of the last part; if you understood that, this shouldn't be too large a leap. The added complexity comes from the management of the buffer and the coordination of requests with the max strategy.

In the next part, I'm going to talk a bit about how to turn such ConnectableObservables into Subjects that will now perform request coordination, which may become mandatory for RxJava 2.0 Subjects and Reactive-Streams Processors, depending on how a certain discussion is resolved.

Comparison of Reactive-Streams implementations (part 1)


Introduction

The Reactive-Streams initiative is becoming more and more widely known in concurrency/parallelism circles and there are now several implementations of the specification, most notably Akka-Streams, Project Reactor and RxJava 2.0.

In this blog post, I'm going to look at how one can use each library to build up a couple of simple flows of values and, while I'm at it, benchmark them with JMH. For comparison and sanity checking, I'll also include the results of RxJava 1.0.14 and Java's j.u.stream.Stream.

In this part, I'm going to compare the synchronous behavior of the 4 libraries through the following tasks:

  1. Observe a range of integers from 1 to n (n = 1, 1,000 or 1,000,000) directly.
  2. Apply flatMap to the range of integers from (1) and transform each value into a single-value sequence.
  3. Apply flatMap to the range of integers from (1) and transform each value into a range of two elements.

The runtime environment:
  • Gradle 2.8
  • JMH 1.11.1
    • Threads: 1
    • Forks: 1
    • Mode: Throughput
    • Unit: ops/s
    • Warmup: 5, 1s each
    • Iterations: 5, 2s each
  • i7 4790 @ 3.5GHz stock settings CPU
  • 16GB DDR3 @ 1600MHz stock RAM
  • Windows 7 x64
  • Java 8 update 66 x64

RxJava

Let's start with the implementation of the tasks in RxJava. First, one has to include the library within the build.gradle file. For RxJava 1.x:

compile 'io.reactivex:rxjava:1.0.14'

For RxJava 2.x:

repositories {
    mavenCentral()

    maven { url 'https://oss.jfrog.org/libs-snapshot' }
}

compile 'io.reactivex:rxjava:2.0.0-DP0-SNAPSHOT'

Unfortunately, one can't really have multiple versions of the same ArtifactID so either we swap the compile ref or switch to my RxJava 2.x backport, which is under a different name and different package naming:

compile 'com.github.akarnokd:rxjava2-backport:2.0.0-RC1'

Once the libs are set up, let's see the flows:


@Param({"1", "1000", "1000000"})
int times;
//...

Observable<Integer> range = Observable.range(1, times);

Observable<Integer> rangeFlatMapJust = range
    .flatMap(Observable::just);

Observable<Integer> rangeFlatMapRange = range
    .flatMap(v -> Observable.range(v, 2));

The code looks the same for both versions, only the imports have to be changed. Nothing complicated.

Observation of the streams will generally be performed via the LatchedObserver instance which extends/implements Observer and will be reused for the other libraries as well:


public class LatchedObserver<T> implements Observer<T> {
    public CountDownLatch latch = new CountDownLatch(1);
    private final Blackhole bh;

    public LatchedObserver(Blackhole bh) {
        this.bh = bh;
    }

    @Override
    public void onComplete() {
        latch.countDown();
    }

    @Override
    public void onError(Throwable e) {
        latch.countDown();
    }

    @Override
    public void onNext(T t) {
        bh.consume(t);
    }
}

Since these flows are synchronous, we won't utilize the latch itself but simply subscribe to the flows:


@Benchmark
public void range(Blackhole bh) {
    range.subscribe(new LatchedObserver<Integer>(bh));
}

Let's run it for both 1.x and 2.x and see the benchmark results:


This is a screenshot of my JMH comparison tool; it can display colored comparison of throughput values: green is better than the baseline, red is worse. Lighter color means at least +/- 3%, stronger color means +/- 10% difference.

Here and all the subsequent images, a larger number is better. You may want to multiply the times with the measured value to get the number of events transmitted. Here, Range with times = 1000000 means that there were ~253 million numbers emitted.

It appears RxJava 2.x posts considerably better numbers, except in the two RangeFlatMapJust cases. What's going on? Let me explain.

The improvements come from the fact that RxJava 2.x has generally less subscribe() overhead than 1.x. In 1.x, when one creates a Subscriber, it will be wrapped into a SafeSubscriber instance, and when the Producer is set on it, there is a small arbitration happening inside setProducer(). As far as I can tell, the JIT in 1.x will do its best to remove the allocation and the synchronization, but the arbitration won't be removed, which means more instructions for the CPU to execute. In contrast, in 2.x there is no wrapping and no arbitration at all.

The lower performance of the RangeFlatMapJust comes from a single operator: just(). In 1.x, the operator just() immediately emits its value without bothering with Producers and requests which means it doesn't support or respect backpressure. In 2.x, however, just() has to consider backpressure requests which involves a mandatory atomic CAS (~10 ns or 35 cycles uncontended (!)). This is the cost of correctness.
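To make that cost concrete, here is a minimal, hypothetical sketch of what a backpressure-aware single-value emission has to do (class and method shapes are made up for illustration, not RxJava's actual internals): a compareAndSet guards against double emission, and that CAS is paid for every value even when uncontended.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical sketch: the single CAS a backpressure-aware just()
// pays per value to guard against emitting more than once.
class ScalarEmission<T> {
    private final T value;
    private final AtomicBoolean once = new AtomicBoolean();

    ScalarEmission(T value) {
        this.value = value;
    }

    // Emits the value at most once, no matter how many request() calls arrive.
    boolean request(long n, Consumer<? super T> onNext) {
        if (n > 0 && once.compareAndSet(false, true)) { // the mandatory CAS
            onNext.accept(value);
            return true;
        }
        return false;
    }
}

public class ScalarEmissionDemo {
    public static void main(String[] args) {
        ScalarEmission<Integer> just = new ScalarEmission<>(42);
        StringBuilder out = new StringBuilder();
        System.out.println(just.request(1, out::append)); // first request emits
        System.out.println(just.request(1, out::append)); // later requests are no-ops
        System.out.println(out);
    }
}
```

In 1.x's just(), the value is handed to onNext without any such atomic bookkeeping, which is exactly where the throughput gap comes from.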

Edit: (wrong explanation before)

The lower performance comes from the serialization approaches the two versions use: 1.x uses the synchronized-based emitter-loop and 2.x uses the atomics-based queue-drain approach. The former is elided by the JIT whereas the latter can't be and there is always a ~17 ns overhead per value. I'm planning a performance overhaul for 2.x anyways so this won't remain the case for too long.

In conclusion, I think RxJava does a good job both in terms of usability and performance. Why am I mentioning usability? Read on.


Project Reactor

Project Reactor is another library that supports the Reactive-Streams specification and provides a similar fluent API as RxJava.

I've briefly benchmarked one of its earlier versions (2.0.5.RELEASE) and posted a picture of the results, but here I'm going to use the latest snapshot. For this, we need to adjust our build.gradle file.

repositories {
    mavenCentral()

    maven { url 'http://repo.spring.io/libs-snapshot' }
}

compile 'io.projectreactor:reactor-stream:2.1.0.BUILD-SNAPSHOT'

This should make sure I'm using a version with the most performance enhancements possible.

The source code for the flows look quite similar:


Stream<Integer> range = Streams.range(1, times);

Stream<Integer> rangeFlatMapJust = range
    .flatMap(Streams::just);

Stream<Integer> rangeFlatMapRange = range
    .flatMap(v -> Streams.range(v, 2));

A small note on Streams.range() here. It appears the API has changed between 2.0.5 and the snapshot. In 2.0.5, the operator's parameters were start+end (both inclusive), which has now changed to start+count, matching RxJava's range().

The same LatchedObserver can be used here so let's see the run results:



Here, reactor2 stands for the 2.1.0 snapshot and reactor1 is 2.0.5 release. Clearly, Reactor has improved its performance by reducing the overhead in the operators (by a factor of ~10).

There is, however, a curious result with RangeFlatMapJust, similar to RxJava: both RxJava 1.x and Reactor 2.1.0 outperform RxJava 2.x, and by roughly the same amount! What's happening there?

I know that flatMap in RxJava 1.x is faster in single-threaded use because it uses the emitter-loop approach (which utilizes synchronized) which can be nicely elided by the JIT compiler and thus the overhead is removed. In 2.x, the code currently uses queue-drain with 2 unavoidable atomic operations per value on the fast path.
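For contrast, the emitter-loop pattern can be sketched like this (an illustrative serializer, not RxJava's actual class): when only a single thread ever enters, the JIT can elide the uncontended synchronized blocks entirely, whereas a queue-drain equivalent would pay its atomic operations on every value.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Illustrative emitter-loop serializer: the first caller becomes the emitter
// and drains values queued by concurrent callers; the synchronized blocks
// are elidable by the JIT when only one thread ever enters.
class EmitterLoop<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private boolean emitting;

    void emit(T value, Consumer<? super T> onNext) {
        synchronized (this) {
            if (emitting) {          // someone else is emitting; leave the value behind
                queue.add(value);
                return;
            }
            emitting = true;         // we became the emitter
        }
        onNext.accept(value);
        for (;;) {
            T next;
            synchronized (this) {
                next = queue.poll();
                if (next == null) {
                    emitting = false; // done; the next caller takes over
                    return;
                }
            }
            onNext.accept(next);
        }
    }
}

public class EmitterLoopDemo {
    public static void main(String[] args) {
        EmitterLoop<Integer> loop = new EmitterLoop<>();
        StringBuilder sb = new StringBuilder();
        loop.emit(1, sb::append);
        loop.emit(2, sb::append);
        System.out.println(sb);
    }
}
```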

So let's find out what Reactor does. Its flatMap is implemented in the FlatMapOperator class and what do I see? It's almost the same as RxJava 2.x flatMap! Even the bugs are the same!

Just kidding about the bugs. There are a few differences so let's check the same fast-path and why it can do 4-8 million values more.

The doNext() looks functionally identical: if the source is a Supplier, it gets the held value directly without subscription then tries to emit it via tryEmit().

Potential bug: If this path crashes and goes into reportError(), the execution falls through and the Publisher gets subscribed to.

Potential bug: In RxJava 2.0, we always wrap user-supplied functions into try-catches so an exception from them is handled in-place. In Reactor's implementation, this is missing from doNext (but may be present somewhere else up in the call chain).

The tryEmit() is almost the same as well with a crucial difference: it batches up requests instead of requesting one-by-one. Interesting!


if (maxConcurrency != Integer.MAX_VALUE && !cancelled
        && ++lastRequest == limit) {
    lastRequest = 0;
    subscription.request(limit);
}

The same re-batching happens with the inner subscribers in both implementations (although this doesn't come into play in the given flow example). Nice work Project Reactor!
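The effect of this re-batching can be illustrated with a hypothetical little counter (the method here is made up for illustration): with a batch limit of 128, a thousand consumed values trigger only a handful of upstream request() calls instead of a thousand.

```java
// Hypothetical sketch: count how many upstream request() calls
// N consumed values trigger for a given re-batching limit.
public class RebatchDemo {
    static int upstreamRequests(int values, int limit) {
        int requests = 0;
        int lastRequest = 0;
        for (int i = 0; i < values; i++) {
            if (++lastRequest == limit) { // batch full: one request(limit) upstream
                lastRequest = 0;
                requests++;
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        System.out.println(upstreamRequests(1000, 1));   // one request per value
        System.out.println(upstreamRequests(1000, 128)); // far fewer upstream calls
    }
}
```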

In the RangeFlatMapRange case, which doesn't exercise this fast path, Reactor is slower although it uses the same flatMap logic. The answer is a few lines above in the results: Reactor's range produces 100 million values less per second.

Following the references along, there are a bunch of wrappers and generalizations, but those only apply once per Subscriber so they can't be the cause for the times = 1000000 case.

The reason appears to be that range() is implemented like RxJava 2.x's generator (i.e., SyncOnSubscribe). The ForEachBiConsumer looks tidy enough but I can spot a few potential deficiencies:


  • Atomic read and increment is involved which forces the JIT'd code to re-read the instance fields from cache instead of keeping it in a register. The requestConsumer could be read into a local variable before the loop.
  • Use == or != as much as possible because the other kind of comparisons appear to be slower on x86.
  • The atomic decrement is an expensive operation (~10ns) but can be delayed quite a bit: once the current known requested amount runs out, one should try to read the requested amount first to see if there were more requests issued in the mean time. If so, keep emitting, otherwise subtract all that has been emitted from the request count.

RxJava's range doesn't do the latter at the moment; HotSpot's register allocator seems to be hectic at times: too many local variables and performance drops because of register spill (on x64!). Implementing this latter optimization involves more local variables and thus risks making things worse.
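The delayed-decrement idea can be sketched as follows (an illustrative, standalone drain loop, not the actual Reactor or RxJava code): values are emitted against a local budget, and the shared requested counter is touched only once that budget runs out, settling all emissions in a single atomic operation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Illustrative drain loop with a deferred atomic decrement: values are
// emitted against a local budget and the shared `requested` counter is
// only updated when the budget runs out.
class RangeDrain {
    final AtomicLong requested = new AtomicLong();
    long index;
    final long end;

    RangeDrain(long start, long count) {
        this.index = start;
        this.end = start + count;
    }

    void request(long n, LongConsumer onNext) {
        if (requested.getAndAdd(n) != 0) {
            return;                           // a drain loop is already running
        }
        long i = index;
        long r = n;                           // local request budget
        long e = 0;                           // emitted since the last atomic update
        while (i != end) {
            if (e == r) {
                r = requested.addAndGet(-e);  // settle all emissions in one op
                e = 0;
                if (r == 0) {
                    index = i;                // out of requests, park
                    return;
                }
            }
            onNext.accept(i++);
            e++;
        }
        index = i;
        requested.addAndGet(-e);              // sequence finished, settle the rest
    }
}

public class RangeDrainDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        RangeDrain drain = new RangeDrain(1, 5);
        drain.request(3, sb::append);  // emits 1, 2, 3 then parks
        drain.request(10, sb::append); // emits 4, 5 and completes
        System.out.println(sb);
    }
}
```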

In conclusion, Project Reactor gets better and better with each release, especially when it adopts RxJava 2.x structures and algorithms ;)

Akka-Streams

I believe Akka-Streams was the most advertised library from the list. With a company behind it and a port from Scala, what could go wrong?

So let's include it in the build.gradle:

compile 'com.typesafe.akka:akka-stream-experimental_2.11:1.0'

So far so good, but where do I start? Looking at the web I came across a ton of examples, in Scala. Unfortunately, I don't know Scala enough so it was difficult for me to figure out what to use. Plus, it doesn't help that with Eclipse, the source code of the library is hard to navigate because it's in Scala (and I don't want to install the plugin). Okay, we won't look at the source code.

It turns out, Akka-Streams doesn't have a range operator, therefore, I have to prepopulate a List with the values and use it as a source:

List<Integer> values = rx2Range
    .toList().toBlocking().first();

Source.from(values).???

A good thing RxJava is around. Akka-Streams uses the Source class as a factory for creating sources. However, Source does not implement Publisher at all!

One does not simply observe a Source.

After digging a bit, I found an example which shows one has to use runWith that takes a Sink.publisher() parameter. Let's apply them:


Publisher<Integer> range = Source
    .from(values).runWith(Sink.publisher());

Doesn't work; the example was out of date and one needs a Materializer in runWith. Looking at the hierarchy, ActorMaterializer does implement it so let's get one.


ActorMaterializer materializer = ActorMaterializer
    .create(???);

Publisher<Integer> range = Source.from(values)
    .runWith(Sink.publisher(), materializer);

Hmm, it requires an ActorRefFactory. But hey, I remember the examples creating an ActorSystem, so let's do that.


ActorSystem actorSystem = ActorSystem.create("sys");

ActorMaterializer materializer = ActorMaterializer
    .create(actorSystem);

Publisher<Integer> range = Source.from(values)
    .runWith(Sink.publisher(), materializer);

Finally, no more dependencies. Let's run it!

Doesn't work, crashes with missing configuration for akka.stream. Huh? After spending some time figuring out things, it appears Akka defaults to a reference.conf file in the classpath's root. But both jars of the library have this reference.conf!

As it turns out, when the Gradle-JMH plugin packages up the benchmark jar, it puts both reference.conf files into the jar and both of them end up in there under the same name; Akka then picks up the wrong one.

The solution: pull the one from the streams jar out and put it under a different name into the Gradle sources/resources.

Sidenote: this is still not enough as by default Gradle ignores non java files, especially if they are not under src/main/java. I had to add the following code to build.gradle to make it work:

processResources {
  from ('src/main/java') {
    include '**/*.conf'
  }
}

With all these set up, let's finish the preparation:


Config cfg = ConfigFactory.parseResources(
    ReactiveStreamsImpls.class, "/akka-streams.conf");

ActorSystem actorSystem = ActorSystem.create("sys", cfg);

ActorMaterializer materializer = ActorMaterializer
    .create(actorSystem);

List<Integer> values = rx2Range
    .toList().toBlocking().first();

Publisher<Integer> range = Source.from(values)
    .runWith(Sink.publisher(), materializer);


Compiles? Yes! Benchmark jar contains everything? Yes! The setup runs? Yes! Benchmark method works? No?!

After one iteration, it throws an error because the range Publisher can't be subscribed to more than once. I've asked for solutions on StackOverflow to no avail; whatever I've got back either didn't compile or didn't run. At this point, I just gave up on it and used a trick to make it work multiple times: defer(). I have to defer the creation of the whole Publisher so I get something fresh every time:


Publisher<Integer> range = s -> Source.from(values)
    .runWith(Sink.publisher(), materializer).subscribe(s);


In addition, as I suspected, there is no way to run Akka-Streams synchronously, therefore, any benchmark with the other synchronous guys can't be directly compared. Plus, I have to use the CountDownLatch to await the termination:


@Benchmark
public void akRange(Blackhole bh) throws InterruptedException {
    LatchedObserver<Integer> lo = new LatchedObserver<>(bh);
    akRange.subscribe(lo);

    if (times == 1) {
        while (lo.latch.getCount() != 0) ;
    } else {
        lo.latch.await();
    }
}

Note: I have to use a spin-loop over the latch for times == 1 because Windows' timer resolution means a wakeup can take several milliseconds at times; without spinning, the benchmark produces 35% lower throughput.

Almost ready, we still need the RangeFlatMapJust and RangeFlatMapRange equivalents. Unfortunately, Akka-Streams doesn't have flatMap but has a flatten method on Source. No problem (by now):


Publisher<Integer> rangeFlatMapJust = s ->
    Source.from(values)
        .map(v -> Source.single(v))
        .flatten(FlattenStrategy.merge())
        .runWith(Sink.publisher(), materializer)
        .subscribe(s);

Nope. Doesn't work because there is no FlattenStrategy.merge(), despite all the examples. But there is a FlattenStrategy.concat(). Have to do.

Nope, still doesn't compile because of type inference problems. Have to introduce a local variable:

FlattenStrategy<Source<Integer, BoxedUnit>> flatten =
    FlattenStrategy.concat();

Works in Eclipse, javac fails with ambiguity error. As it turns out, javadsl.FlattenStrategy extends scaladsl.FlattenStrategy which both have the same concat() factory method but different number of type arguments. This isn't the first time javac can't disambiguate but Eclipse can!

We don't give up and use reflection to get the proper method called:


Method m = akka.stream.javadsl.FlattenStrategy
    .class.getMethod("concat");

@SuppressWarnings({ "rawtypes", "unchecked" })
FlattenStrategy<Source<Integer, BoxedUnit>, Integer> flatten =
    (FlattenStrategy) m.invoke(null);

Publisher<Integer> rangeFlatMapJust = s ->
    Source.from(values)
        .map(v -> Source.single(v))
        .flatten(flatten)
        .runWith(Sink.publisher(), materializer)
        .subscribe(s);

Finally, Akka-Streams works. Let's see the benchmark results:



Remember, since Akka can't run synchronously and we had to do a bunch of workarounds, we should expect numbers will be lower by a factor of 5-10.

I don't know what's going on here. Some numbers are 100x lower. Akka certainly doesn't throw an Exception somewhere because we'd see 5M ops/s in those cases, regardless of times.

In conclusion, I'm disappointed with Akka-Streams; it takes quite a hassle to get a simple sequence running and apparently requires more thought to reach reasonable performance.

Plain Java and j.u.stream.Stream

Just for reference, let's see how the same task looks and works with plain Java for loops and j.u.stream.Streams.

For plain Java, the benchmarks look simple:


@Benchmark
public void javaRange(Blackhole bh) {
    int n = times;
    for (int i = 0; i < n; i++) {
        bh.consume(i);
    }
}

@Benchmark
public void javaRangeFlatMapJust(Blackhole bh) {
    int n = times;
    for (int i = 0; i < n; i++) {
        for (int j = i; j < i + 1; j++) {
            bh.consume(j);
        }
    }
}

@Benchmark
public void javaRangeFlatMapRange(Blackhole bh) {
    int n = times;
    for (int i = 0; i < n; i++) {
        for (int j = i; j < i + 2; j++) {
            bh.consume(j);
        }
    }
}

The Stream implementation is a bit complicated because a j.u.stream.Stream is not reusable and has to be recreated every time one wants to consume it:


@Benchmark
public void streamRange(Blackhole bh) {
    values.stream().forEach(bh::consume);
}

@Benchmark
public void streamRangeFlatMapJust(Blackhole bh) {
    values.stream()
        .flatMap(v -> Collections.singletonList(v).stream())
        .forEach(bh::consume);
}

@Benchmark
public void streamRangeFlatMapRange(Blackhole bh) {
    values.stream()
        .flatMap(v -> Arrays.asList(v, v + 1).stream())
        .forEach(bh::consume);
}

Finally, just for fun, let's do a parallel version of the stream benchmarks:


@Benchmark
public void pstreamRange(Blackhole bh) {
    values.parallelStream().forEach(bh::consume);
}

@Benchmark
public void pstreamRangeFlatMapJust(Blackhole bh) {
    values.parallelStream()
        .flatMap(v -> Collections.singletonList(v).stream())
        .forEach(bh::consume);
}

@Benchmark
public void pstreamRangeFlatMapRange(Blackhole bh) {
    values.parallelStream()
        .flatMap(v -> Arrays.asList(v, v + 1).stream())
        .forEach(bh::consume);
}

Great! Let's see the results:



Impressive, except for some parallel cases where, I presume, forEach synchronizes all parallel operations back to a single thread, negating all benefits.

In conclusion, if you have a synchronous task, try plain Java first.

Conclusion


In this blog post, I've compared the three Reactive-Streams libraries for usability and performance in the case of a synchronous flow. Both RxJava and Reactor did quite well relative to plain Java, but Akka-Streams was quite complicated to set up and didn't perform adequately "out of the box".



However, there might be some remedy for Akka-Streams in the next part where I compare the libraries in asynchronous mode.


Asynchronous Event Streams vs. Reactive-Streams


Introduction

I recently came across an interesting presentation from EclipseCon titled Asynchronous Event Streams – when java.util.stream met org.osgi.util.promise! (video, specification). It appears the OSGi folks want to solve the async problem as well and came up with an API to do so. It promises (in)finite stream processing, error handling and backpressure. As it turns out, nothing is ever simple with this kind of problem, so let's see what's going on.

The Interfaces

Unfortunately, the code behind the Asynchronous Event Streams (AsyncES) doesn't seem to exist beyond the text file I linked up there so let's extract them from the documentation instead.

As with Reactive-Streams (RS), AsyncES consists of a few basic interfaces representing a source and a consumer.

We have the PushStream and PushStreamFactory interfaces (omitted here) which contain our familiar map, flatMap, etc. methods and PushStream factories, respectively. Strangely, PushStream extends Closeable and AutoCloseable, which is odd to me. Since there is nothing in the spec indicating a hot source, I'd assume PushStreams are always cold and closing them makes no sense.

The core interface is the PushEventSource:


@FunctionalInterface
public interface PushEventSource<T> {

    Closeable open(
        PushEventConsumer<? super T> consumer) throws Exception;
}

If you remember IObservable/IObserver, the pattern should look familiar. The PushEventSource is the way of specifying a "generator" that will emit values of T to a consumer.

The method specification has two important differences compared to Publisher: the "connection" method returns a Closeable and may throw a checked Exception.

Unfortunately, this pattern has problems similar to what we had with RxJava in the early days:

  • Given a synchronous source, it may be very hard to cancel the connection from within the consumer unless the Closeable is propagated around (more on this later).
  • The method throws a checked exception which is generally inconvenient and also seems unnecessary because the exception could be delivered to the consumer.

In RS, it is the choice of the consumer whether or not to expose cancellation, and the cancellation itself works for synchronous and asynchronous sources alike.


@FunctionalInterface
public interface PushEventConsumer<T> {

    long ABORT = -1L;
    long CONTINUE = 0L;

    long accept(PushEvent<T> e) throws Exception;
}

Next comes the consumer: PushEventConsumer. At first glance, we don't have per-event-type methods, just an accept method that takes a PushEvent, may throw an Exception and returns some value.

Having only a single consumer method should remind us of Notification objects with RxJava and indeed, PushEvent is an interface indicating such a container object.

The second interesting thing is that this method may throw an Exception. It is sometimes convenient that lambdas in an IO-related stream can throw directly without wrapping, but PushEventConsumer is an end consumer and not a source/intermediate operator; where would this Exception propagate, especially in an async environment?

The most interesting part is the return type itself. The specification indicates this is the source of backpressure and cancellation support:

  • Negative values are considered indication of cancellation or abort.
  • Zero indicates the sender can call the method immediately with new data.
  • Positive value indicates the sender should wait the specified amount in milliseconds before sending any new data.

In RS, the backpressure works via the concept of co-routines. Unless the downstream requests anything, no value will be emitted/generated; the consumer may take any time to process what it requested and issue more requests at its own convenience.

However, AsyncES returns a delay value after which the source can emit the next value. But how does one know how long the accept call would take? In addition, if the upstream is synchronous, does this mean that a positive return value should trigger a Thread.sleep()? Probably not. Without a reference implementation, I can only speculate that value generation uses some form of recursive scheduling via an Executor: the source schedules an emission of a PushEvent immediately once the consumer arrives. In the Runnable, the accept is run and based on the return value, a new Runnable is submitted with a delay this time.
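The speculated scheduling loop could look like the following sketch, using a plain ScheduledExecutorService and simplified stand-ins for the source and consumer (the types and names here are made up, not the actual OSGi API): the consumer's returned delay decides whether the next emission is resubmitted immediately, after a delay, or not at all.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.ToLongFunction;

// Simplified stand-in for the speculated AsyncES delivery loop: each value is
// delivered in a scheduled task and the consumer's returned delay (ms) paces
// or aborts the next submission: -1 = abort, 0 = continue immediately.
public class DelayPacedSource {
    public static <T> CountDownLatch deliver(
            Iterator<? extends T> it,
            ToLongFunction<? super T> consumer,
            ScheduledExecutorService exec) {
        CountDownLatch done = new CountDownLatch(1);
        exec.execute(new Runnable() {
            @Override
            public void run() {
                if (!it.hasNext()) {
                    done.countDown();                 // source exhausted
                    return;
                }
                long backpressure = consumer.applyAsLong(it.next());
                if (backpressure < 0) {               // ABORT
                    done.countDown();
                } else if (backpressure == 0) {       // CONTINUE immediately
                    exec.execute(this);
                } else {                              // pace the next value
                    exec.schedule(this, backpressure, TimeUnit.MILLISECONDS);
                }
            }
        });
        return done;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        StringBuilder sb = new StringBuilder();
        deliver(Arrays.asList(1, 2, 3).iterator(),
                v -> { sb.append(v); return 0L; }, exec)
            .await();
        System.out.println(sb);
        exec.shutdown();
    }
}
```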

This approach and the interface itself have several drawbacks:

  • The delay amount is a fixed unit (of milliseconds).
  • If the Executor approach is used, it means for each source value, we have a wrapper PushEvent and a Runnable (which is quite an overhead).
  • Since the ABORT value is returned synchronously, the cancellation doesn't compose. If we'd implement a library based on this structure and our RxJava lift() approach (in which consumers call other consumers' methods), the moment the accept() call of the downstream consumer has to be called on a different thread, there is no way to get back its result to the original thread:


ExecutorService exec = ...

PushEventOperator<T, T> goAsync = downstreamConsumer -> {
    return e -> {
        exec.execute(() -> {
            long backpressure = downstreamConsumer.accept(e);
        });

        return backpressure; // ???
    };
};

Even if we use AtomicLong, the execution happens elsewhere and unless the upstream emits a value, there is no way to return the downstream's cancellation indicator, or in fact, its backpressure-delay requirement.


  • The final problem: what if the given delay amount is not enough for the consumer? This model indicates the event will be sent regardless. There are only a few things the consumer can do: accept the value and return a larger delay, abort and crash with an overflow error, or do unbounded buffering.

I believe these problems are concerning at least. It is no accident RxJava and Reactive-Streams look like they do today: finding a workable, composable asynchronous data delivery approach is a non-trivial task at best.

However, since this blog is about RxJava, why don't we try to build a bridge between an AsyncES PushEventSource and an Observable?

RxJava - AsyncES bridge

Let's start with a bridge that given an Observable (or in fact, any RS Publisher) and turns it into a PushEventSource:


public static <T> PushEventSource<T> from(
        Publisher<? extends T> publisher, Scheduler scheduler) {
    return c -> {
        // implement
    };
}

We return a PushEventSource from a Publisher of type T. Since the AsyncES deals with delays, we need a Scheduler to make sure values don't get emitted outright like with RS Subscribers. Since the return type is a functional interface, we take a lambda where c will be PushEventConsumer<T>.

Now let's see the missing implementation, chunk by chunk:


CompositeDisposable cd = new CompositeDisposable();
Scheduler.Worker w = scheduler.createWorker();
cd.add(w);

We create the usual CompositeDisposable and a Worker and bind them together to form the basis of our cancellation support.

publisher.subscribe(new Subscriber<T>() {
    Subscription s;

    @Override
    public void onSubscribe(Subscription s) {
        // implement
    }

    @Override
    public void onNext(T t) {
        // implement
    }

    @Override
    public void onError(Throwable t) {
        // implement
    }

    @Override
    public void onComplete() {
        // implement
    }
});

Next, we subscribe to the Publisher with a Subscriber and we will translate its events, but before that, the open method we are implementing here via a lambda has to return a Closeable:


            return cd::dispose;

So far, nothing special in the structure: lambdas and functional interfaces just by "following the types". Next, let's see the first event in a reactive-stream:


@Override
public void onSubscribe(Subscription s) {
    this.s = s;
    cd.add(s::cancel);
    if (!cd.isDisposed()) {
        s.request(1);
    }
}

We save the Subscription first, add it to the CompositeDisposable for mass-cancellation support and, unless it is disposed, we request exactly 1 element from upstream. But why? The reason is that we don't really know how many elements the PushEventConsumer wants, only the pacing between elements. Therefore, the safest approach is to request one element upfront and request more after the specified delay returned by the accept() method:


@Override
public void onNext(T t) {
    long backpressure;
    try {
        backpressure = c.accept(PushEvent.data(t));       // (1)
    } catch (Exception e) {
        onError(e);                                       // (2)
        return;
    }

    if (backpressure <= PushEventConsumer.ABORT) {        // (3)
        cd.dispose();
    } else
    if (backpressure == PushEventConsumer.CONTINUE) {     // (4)
        s.request(1);
    } else {
        w.schedule(() -> s.request(1),                    // (5)
            backpressure, TimeUnit.MILLISECONDS);
    }
}


  1. We call the accept method and wrap the value into an instance of the aforementioned PushEvent (not detailed here, think of a rx.Notification).
  2. In case the accept method throws, we call onError with the exception.
  3. Based on the backpressure result being an ABORT value, we dispose the composite and thus cancel the Subscription and dispose the Worker.
  4. If the backpressure result is CONTINUE (0), we can synchronously request one more element.
  5. Otherwise, we have to schedule the request of a single element after the given number of milliseconds.

The onError and onComplete events are pretty simple: dispose the composite and call the accept method with a wrapped error/close indicator:

@Override
public void onError(Throwable t) {
    cd.dispose();
    try {
        c.accept(PushEvent.error(t));
    } catch (Exception ex) {
        RxJavaPlugins.onError(ex);
    }
}

@Override
public void onComplete() {
    cd.dispose();
    try {
        c.accept(PushEvent.close());
    } catch (Exception ex) {
        RxJavaPlugins.onError(ex);
    }
}

And that's it for the Publisher -> PushEventSource conversion. In fact, there is no problem with this direction (apart from the slight deficiency due to wrapping values and requesting by 1).

All we need now is the reverse conversion PushEventSource -> Publisher. Let's start with the method:


public static <T> Observable<T> to(
        PushEventSource<? extends T> source, long backpressure) {
    return Observable.create(s -> {
        // implement
    });
}

It appears the dual of Scheduler is a backpressure amount in milliseconds. As I mentioned above, there is no clear way of communicating how much the upstream should wait until the downstream can process the next element, or more specifically, how much time to wait before the Subscriber issues a new request().

Now let's see the implementation, part by part:


CompositeDisposable cd = new CompositeDisposable();
AtomicLong requested = new AtomicLong();

s.onSubscribe(new Subscription() {
    @Override
    public void request(long n) {
        BackpressureHelper.add(requested, n);
    }

    @Override
    public void cancel() {
        cd.dispose();
    }
});

First, we create a composite again and an AtomicLong to hold the downstream request amount. Next, we create a Subscription and wire up the request (accumulate requests via helper) and cancel (dispose the composite) methods.

Next we open a connection to the source:


try {
    Closeable c = source.open(new PushEventConsumer<T>() {
        @Override
        public long accept(PushEvent<T> c) throws Exception {
            // implement
        }
    });
    cd.add(() -> {
        try {
            c.close();
        } catch (IOException ex1) {
            RxJavaPlugins.onError(ex1);
        }
    });
} catch (Exception ex2) {
    s.onError(ex2);
}

We create the consumer (detailed below) then add a Disposable to the composite that will call close on the returned Closeable by the open method. If the open method crashes, we simply emit its Exception to the Subscriber.

Finally, let's implement the accept method which should dispatch the PushEvents:


if (cd.isDisposed()) {                               // (1)
    return ABORT;
}
switch (c.getType()) {                               // (2)
case DATA:
    s.onNext(c.getData());
    for (;;) {
        long r = requested.get();                    // (3)
        if (r == 0) {
            return backpressure;                     // (4)
        }
        if (requested.compareAndSet(r, r - 1)) {
            return r > 1L ? 0L : backpressure;       // (5)
        }
    }
case ERROR:                                          // (6)
    cd.dispose();
    s.onError(c.getFailure());
    return ABORT;
case CLOSE:
    cd.dispose();
    s.onComplete();
    return ABORT;
}
return 0L;


  1. In case the composite is disposed (i.e., downstream cancellation), we return the ABORT value which should terminate the sequence. Note that this can only happen if the upstream actually sends an event, not sooner.
  2. We dispatch on the PushEvent type and if its a DATA element, we emit the contained data value.
  3. Once the value has been emitted, we have to decrement the requested counter in a CAS loop.
  4. If the counter is already zero, return with the backpressure value.
  5. Otherwise use a CAS to decrement by one and return the backpressure if we reached zero this way or zero to indicate the upstream can immediately send a new value.
  6. For the failure and close events, we call the appropriate onXXX method and dispose the composite.

There is an unavoidable problem with this implementation: what if the backpressure amount is not enough? This implementation just writes through for simplicity, so you have to apply onBackpressureXXX policies on the created Observable. Implementing a version with the BackpressureStrategy mentioned in an earlier post is left to the reader.

Now that we have the conversion methods, it should be relatively straightforward to implement PushStream's methods on top of RxJava 2.x (or any other Reactive-Streams compliant library). 

You can find the example code, involving our favorite flatMap operator, on my gist page.


Conclusion

Designing async APIs is a non-trivial task at best and there have been many attempts to do so, the latest being the Asynchronous Event Streams RFC of OSGi.

As we saw, the underlying idea of using time to exert backpressure can be easily emulated by Reactive-Streams APIs (especially RxJava), so technically, if one wants such a backpressure scheme, one can go ahead with the Reactive-Streams compliant libraries with ease. The opposite direction requires more consideration and overflow management.

With this exercise, I have doubts the Asynchronous Event Streams will work out in its current form.

The new Completable API (part 1)


Introduction


If you are following the day-to-day RxJava GitHub activity, you might have noticed a PR about a new and mysterious rx.Completable class. This PR has been merged into the 1.x branch (in @Experimental fashion) and will most likely be part of RxJava 1.1.1.

In this two part series, I'm first going to introduce the usage of this class and its relation to the existing Observable and Single classes, then I'll explain the internals and development practices of Completable operators.

Note that as the @Experimental tag indicates, method names and their availability may change at any time before (or after) 1.1.1 is released.

What is this Completable class?

We can think of a Completable object as a stripped version of Observable where only the terminal events, onError and onCompleted, are ever emitted; it may look like an Observable.empty() typified in a concrete class, but unlike empty(), Completable is an active class: it mandates side effects when subscribed to, and producing those side effects is indeed its main purpose. Completable contains some deferred computation with side effects and only notifies about the success or failure of such computation.

Similar to Single, the Completable behavior can be emulated with Observable<?> to some extent, but many API designers think codifying the valuelessness in a separate type is more expressive than messing with wildcards (and usually type-variance problems) in an Observable chain.

Completable doesn't stream a single value, therefore, it doesn't need backpressure, simplifying the internal structure from one perspective, however, optimizing these internals requires more lock-free atomics knowledge in some respect.


Hello World!

Let's see how one can build a (side-effecting) Hello World Completable:

Completable.fromAction(() -> System.out.println("Hello World!"))
.subscribe();

Quite straightforward. We have a set of fromXXX methods which can take many sources: Action, Callable, Single and even Observable (stripping any values generated by the latter three, of course). On the receiving end, we have the usual subscribe capabilities: empty, lambdas for the terminal events, an rx.Subscriber and an rx.Completable.CompletableSubscriber, the main intended receiver for Completables.

Reactive-Empty-Streams?

The definition of the CompletableSubscriber looks quite similar to a Reactive-Streams Subscriber and was chosen over the rx.Subscriber for performance reasons:


public interface CompletableSubscriber {

void onCompleted();

void onError(Throwable e);

void onSubscribe(Subscription d);
}

It features the usual onCompleted() and onError() but instead of extending Subscription and having an add() method like rx.Subscriber, it receives the unsubscription-enabling Subscription via the onSubscribe call, as in the Reactive-Streams API. This setup has the following benefits:

  • Each CompletableSubscriber implementor can decide if it wants to expose the unsubscription capability to external users unlike rx.Subscriber where anybody can unsubscribe it.
  • In rx.Subscriber, a mandatory (and maybe shared) SubscriptionList container is created to support resource association with any Subscriber instance. However, many Observable operators don't use (or require) resources themselves and thus incur unnecessary allocation and instance-size overhead.
The terminal event semantics is also the same as in Reactive-Streams. When onError or onCompleted is called, the formerly received Subscription should be considered already unsubscribed.

Thus, the protocol looks as follows:

onSubscribe (onError | onCompleted)?

It contains a mandatory onSubscribe call with a non-null argument followed by, optionally, either an onError with a non-null Throwable or an onCompleted. As within Reactive-Streams, the methods can't throw any checked exceptions or unchecked exceptions other than NullPointerException. This doesn't mean methods shouldn't fail; it means methods should fail in the downstream direction. There are many cases, however, in which one can't really put the received exception anywhere (e.g., exceptions arriving after onCompleted); the last resort is to sink it into RxJavaPlugins.getInstance().getErrorHandler().handleError(e).
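To make the protocol concrete, here is a small self-checking model of it; note that the Subscription and CompletableSubscriber interfaces below are locally defined stand-ins for illustration, not the real rx types:

```java
import java.util.ArrayList;
import java.util.List;

public class ProtocolDemo {
    // Local stand-ins; the real types live in rx.Subscription / rx.Completable.
    interface Subscription { void unsubscribe(); }

    interface CompletableSubscriber {
        void onSubscribe(Subscription d);
        void onCompleted();
        void onError(Throwable e);
    }

    /** Records events and verifies the onSubscribe (onError | onCompleted)? order. */
    static class RecordingSubscriber implements CompletableSubscriber {
        final List<String> events = new ArrayList<>();

        @Override public void onSubscribe(Subscription d) {
            if (d == null) throw new NullPointerException("d is null");
            if (!events.isEmpty()) throw new IllegalStateException("onSubscribe must come first");
            events.add("onSubscribe");
        }
        @Override public void onCompleted() {
            requireSubscribedNonTerminal();
            events.add("onCompleted");
        }
        @Override public void onError(Throwable e) {
            if (e == null) throw new NullPointerException("e is null");
            requireSubscribedNonTerminal();
            events.add("onError");
        }
        void requireSubscribedNonTerminal() {
            if (events.isEmpty()) throw new IllegalStateException("onSubscribe missing");
            if (events.size() > 1) throw new IllegalStateException("already terminated");
        }
    }

    public static void main(String[] args) {
        RecordingSubscriber rs = new RecordingSubscriber();
        rs.onSubscribe(() -> { });  // mandatory first call
        rs.onCompleted();           // at most one terminal event
        System.out.println(rs.events); // [onSubscribe, onCompleted]
    }
}
```

A second terminal call on the same subscriber would trip the IllegalStateException, which is exactly the sequential guarantee the real Completable operators rely on.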

Create, Lift and Transform

The Completable class has three additional standard helper interfaces, now becoming common with all RxJava base classes:

The first defines a way to specify the deferred computation and send the terminal notifications out to a CompletableSubscriber:

public interface CompletableOnSubscribe
extends Action1<CompletableSubscriber> { }

CompletableOnSubscribe complete = cs -> {
cs.onSubscribe(Subscriptions.unsubscribed());
cs.onCompleted();
};

It is practically a named alias of an Action1 parametrized by the CompletableSubscriber. Creating an instance via a lambda expression is also straightforward (but one has to remember to call onSubscribe before calling the other onXXX methods).

The second interface allows lifting into a Completable sequence by specifying a CompletableSubscriber level transformation.


public interface CompletableOperator 
extends Func1<CompletableSubscriber, CompletableSubscriber> { }

CompletableOperator swap = child -> new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
child.onSubscribe(s);
}
@Override
public void onCompleted() {
child.onError(new RuntimeException());
}
@Override
public void onError(Throwable e) {
child.onCompleted();
}
};

Again, the CompletableOperator is an alias for a Func1 instance that lets you wrap, replace and enrich the downstream's CompletableSubscriber. The example implementation shows how one can turn one terminal event into the other via an operator.

The final helper interface allows preparing entire chains of operators to be included in an existing chain:


public interface CompletableTransformer
extends Func1<Completable, Completable> { }

CompletableTransformer schedule = c ->
c.subscribeOn(Schedulers.io())
.observeOn(AndroidSchedulers.mainThread());


With this alias, one can pre-compose common operations and present them to a stream through the usual compose() method. The example shows the canonical case of making sure the async computation starts on the IO scheduler and the completion is observed on the main thread.


Entering into the Completable world

As with Observables, Completable offers static factory methods that let you start a stream from various sources:

  • create(CompletableOnSubscribe): lets you write your custom deferred computation that receives a CompletableSubscriber as its observer. The CompletableOnSubscribe is invoked for each CompletableSubscriber separately.
  • complete(): returns a constant instance of a Completable which calls onCompleted without doing anything else.
  • defer(Func0<Completable>): calls the function for each incoming CompletableSubscriber which should create the actual Completable instance said subscriber will subscribe to.
  • error(Throwable): it will emit the given constant Throwable to the incoming CompletableSubscribers.
  • error(Func0<Throwable>): for each incoming CompletableSubscriber, the Func0 is called individually and the returned Throwable is emitted through onError.
  • fromAction(Action0): lets you execute an action for each CompletableSubscriber, calling onCompleted afterwards (or onError if the action throws an unchecked exception).
  • fromCallable(Callable): unfortunately, Java doesn't have a standard interface for an action which returns void and can throw a checked exception (not even in 8). The closest thing is the Callable interface. This lets you write an action that doesn't require you to wrap the computation into a try-catch but mandates the return of some arbitrary value (ignored). Returning null is acceptable here.
  • fromFuture(Future): lets you attach to a Future and wait for its completion, literally. This blocks the subscriber's thread so you will have to use subscribeOn().
  • fromObservable(Observable): lets you skip all values of the source and just react to its terminal events. The Observable is observed in an unbounded backpressure mode and the unsubscription (naturally) composes through.
  • fromSingle(Single): lets you turn the onSuccess call coming from the Single into an onCompleted call.
  • never(): does nothing other than setting an empty Subscription via onSubscribe.
  • timer(long, TimeUnit): completes after the specified time elapsed.

In addition, both the Observable and Single classes feature a toCompletable() method for convenience.

The naming of the fromXXX methods is deliberately specific: the Java 8 compiler tends to run into ambiguity problems due to the similar-looking functional interfaces.


Leaving the Completable world

One has to, eventually, leave the Completable world and observe the terminal event in some fashion. The Completable offers some familiar methods to make this happen: subscribe(...).

We can group the subscribe() overloads into two sets. The first set returns a Subscription that allows external cancellation and the second relies on the provided class to allow/manage unsubscriptions.

The first group consists of the lambda-form subscriptions:

  • subscribe(): runs the Completable and relays any onError call to the RxJavaPlugins.
  • subscribe(Action0): runs the Completable and calls the given Action0 on successful completion. The onError calls are still relayed to RxJavaPlugins.
  • subscribe(Action1, Action0): runs the Completable and calls Action1 if it ended with an onError or calls Action0 if it ended with a normal onCompleted.

Since the lambda callbacks don't have access to the underlying Subscription sent through onSubscribe, these methods return a Subscription themselves to allow external unsubscription to happen. Without it, there wouldn't be any way of cancelling such subscriptions.


The second group of subscribe methods take the multi-method Subscriber instances:

  • subscribe(CompletableSubscriber): runs the Completable and calls the appropriate onXXX methods on the supplied CompletableSubscriber instance.
  • subscribe(Subscriber<T>): runs the Completable and calls the appropriate onXXX methods on the supplied rx.Subscriber instance.


Sometimes, one wants to wait for the completion on the current thread. Observable has a set of methods accessible through toBlocking() for this purpose. Since there are not many ways one can await the result of a Completable, the blocking methods are part of the Completable class itself:

  • await(): await the termination of the Completable indefinitely and rethrow any exception it received (wrapped into a RuntimeException if necessary).
  • await(long, TimeUnit): same as await() but with a bounded wait time, after which a TimeoutException is thrown.
  • get(): await the termination of the Completable indefinitely, return null for successful completion or return the Throwable received via onError.
  • get(long, TimeUnit): same as get() but with a bounded wait time, after which a TimeoutException is thrown.
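Under the hood, such blocking can be modeled with a CountDownLatch that is counted down by the terminal event. A hedged sketch of the idea in plain Java (a simplified model, not the actual Completable internals):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class BlockingGet {
    /** Captures the terminal outcome and lets a caller thread await it, like Completable.get(). */
    static class TerminalLatch {
        final CountDownLatch latch = new CountDownLatch(1);
        final AtomicReference<Throwable> error = new AtomicReference<>();

        void onCompleted() { latch.countDown(); }

        void onError(Throwable e) { error.set(e); latch.countDown(); }

        /** Blocks indefinitely; returns null on success or the received Throwable. */
        Throwable get() {
            try {
                latch.await();
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(ie);
            }
            return error.get();
        }
    }

    public static void main(String[] args) {
        TerminalLatch t = new TerminalLatch();
        new Thread(t::onCompleted).start(); // terminal event arrives on another thread
        System.out.println(t.get());        // null => completed successfully
    }
}
```

The timed variants would simply use latch.await(time, unit) and throw a TimeoutException when it returns false.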


Completable operators

Finally, let's see what operators are available to work with a Completable. Unsurprisingly, many of them match their counterpart in Observable; however, a lot of them are missing because they don't make sense in a valueless stream. This includes the familiar map, take, skip, flatMap, concatMap, switchMap, etc. operators.

The first set of operators is accessible as static methods and usually deals with a set of Completables. Many of them have overloads for varargs and Iterable sequences.

  • amb(Completable...): terminates as soon as any of the source Completables terminates, cancelling the rest.
  • concat(Completable...): runs the Completables one after another until all complete successfully or one fails.
  • merge(Completable...): runs the Completable instances "in parallel" and completes once all of them completed or any of them failed (cancelling the rest).
  • mergeDelayError(Completable...): runs all Completable instances "in parallel" and terminates once all of them terminate; if all went successful, it terminates with onCompleted, otherwise, the failure Throwables are collected and emitted in onError.
  • using(Func0, Func1, Action1): opens, uses and closes a resource for the duration of the Completable returned by Func1.

The second set of operators are the usual (valueless) transformations:

  • ambWith(Completable): completes once either this or the other Completable terminates, cancelling the still running Completable.
  • concatWith(Completable): runs the current and the other Completable in sequence.
  • delay(long, TimeUnit): delays the delivery of the terminal events by a given time amount.
  • endWith(...): continues the execution with another Completable, Single or Observable.
  • lift(CompletableOperator): lifts a custom operator into the sequence; the operator can manipulate the incoming downstream CompletableSubscriber's lifecycle and event delivery in some manner before the subscription continues upstream.
  • mergeWith(Completable): completes once both this and the other Completable complete normally.
  • observeOn(Scheduler): moves the observation of the terminal events (onError or onCompleted) to the specified Scheduler.
  • onErrorComplete(): If this Completable terminates with an onError, the exception is dropped and downstream receives just onCompleted.
  • onErrorComplete(Func1): The supplied predicate will receive the exception and should return true if the exception should be dropped and replaced by an onCompleted event.
  • onErrorResumeNext(Func1): If this Completable fails, the supplied function will receive the exception and it should return another Completable to resume with.
  • repeat(): repeatedly executes this Completable (or a given number of times in another overload).
  • repeatWhen(Func1): repeatedly execute this Completable if the Observable returned by the function emits a value or terminate if this Observable emits a terminal event.
  • retry(): retries this Completable indefinitely if it failed (or after checking some condition in other overloads).
  • retryWhen(Func1): retries this Completable if it failed and the Observable returned by the function emits a value in response to the current exception or terminates if this Observable emits a terminal event.
  • startWith(...): begins the execution with the given Completable, Single or Observable and resumes with the current Completable.
  • timeout(long, TimeUnit, Completable): switches to another Completable if this completable doesn't terminate within the specified time window.
  • to(Func1): allows fluent conversion by calling a function with this Completable instance and returning the result.
  • toObservable(): converts this Completable into an empty Observable that terminates if this Completable terminates.
  • toSingle(Func0<T>): converts this Completable into a Single in a way that when the Completable completes normally, the value provided by the Func0 is emitted as onSuccess while an onError just passes through.
  • toSingleDefault(T): converts this Completable into a Single in a way that when the Completable completes normally, the value provided is emitted as onSuccess while an onError just passes through.
  • unsubscribeOn(Scheduler): when the downstream calls unsubscribe on the supplied Subscription via onSubscribe, the action will be executed on the specified scheduler (and will propagate upstream).

The final set of operators support executing callbacks at various lifecycle stages (which can be used for debugging or other similar side-effecting purposes):

  • doAfterTerminate(Action0): executes the action after the terminal event has been sent to the downstream CompletableSubscriber.
  • doOnComplete(Action0): executes an action just before the completion event is sent downstream.
  • doOnError(Action1): calls the action with the exception in a failed Completable just before the error is sent downstream.
  • doOnTerminate(Action0): executes the action just before any terminal event is sent downstream.
  • doOnSubscribe(Action1): calls the action with the Subscription instance received during the subscription phase.
  • doOnUnsubscribe(Action0): executes the action if the downstream unsubscribed the Subscription connecting the stages.
  • doOnLifecycle(...): combines the previous operators into a single operator and calls the appropriate action.


Currently, there are no equivalent Subject implementations nor publish/replay/cache methods available. Depending on the need for these, they can be added later on. Note however that since Completable deals only with terminal events, all Observable-based Subject implementations have just a single equivalent, Completable-based Subject implementation and there is only one way to implement the publish/replay/cache methods.

It is likely the existing Completable operators can be extended or other existing Observable operators matched. Until then, you can use the

toObservable().operator.toCompletable()

conversion pattern to reach out to these unavailable operators. In addition, I didn't list all overloads so please consult with the source code of the class (or the Javadoc once it becomes available online).


Conclusion

In this post, I've introduced the new Completable base class and detailed the available methods and operators on it. Its usage pattern greatly resembles the use of Observable or Single with the difference that it doesn't deal with values at all but only with the terminal events and as such, many operators are meaningless for Completable.

In the next part, I'm going to talk about how one can create source and transformative operators for Completable by implementing the CompletableOnSubscribe and CompletableOperator interfaces respectively.

The new Completable API (part 2 - final)


Introduction


In this final part, I'm going to show how one can implement operators (source and transformative alike). Since the Completable API features no values but only the terminal onError and onCompleted events, there are far fewer meaningful operators possible than for the main Observable class; therefore, most of the examples will feature an existing Completable operator.

Empty

Our first operator will be the empty() operator that emits onCompleted once subscribed to. We implement it through the CompletableOnSubscribe functional interface:


Completable empty = Completable.create(completableSubscriber -> {
BooleanSubscription cancel = new BooleanSubscription(); // (1)

completableSubscriber.onSubscribe(cancel); // (2)

if (!cancel.isUnsubscribed()) {
cancel.unsubscribe(); // (3)
completableSubscriber.onCompleted(); // (4)
}
});


The relief here is that there is no need for type parameters. The Completable API follows the concepts of the Reactive-Streams design: child subscribers get the means to cancel their upstream through a call to the onSubscribe method with a Subscription instance.

For the case of empty, it means we have to create some instance of the Subscription interface: a BooleanSubscription that lets us examine if the child has unsubscribed or not (1). Before we can emit any terminal event, we have to call onSubscribe and send the child subscriber this BooleanSubscription instance (2). This is a mandatory step; if omitted, we can expect a NullPointerException from the various Completable operators at worst or non-functioning unsubscription when converted to an Observable at best.

Similar to the Reactive-Streams spec, when a terminal event is emitted, the aforementioned Subscription has to be considered unsubscribed. We achieve this by manually unsubscribing our BooleanSubscription (3) and calling the onCompleted() method.

This may look a bit complicated for such a simple operator. The rules, however, allow for a much simpler version:


Completable empty = Completable.create(completableSubscriber -> {
completableSubscriber.onSubscribe(Subscriptions.unsubscribed());
completableSubscriber.onCompleted();
});


Here, we send an already unsubscribed Subscription and call onCompleted immediately after. The reasoning here is threefold: unsubscription should be considered best effort, meaning that a) one must be prepared that events may slip through; b) there is a really small window between the onSubscribe and onCompleted calls, so isUnsubscribed would return false for most async usages anyway; and c) the Subscription should be considered unsubscribed just before onCompleted is called.


Empty delayed

Let's assume we want to emit the onCompleted event after some delay. We are going to need a Scheduler.Worker instance for it but we also have to ensure the delayed task can be cancelled if necessary.


public static Completable emptyDelayed(
long delay, TimeUnit unit, Scheduler scheduler) {
return Completable.create(cs -> {
Scheduler.Worker w = scheduler.createWorker();

cs.onSubscribe(w);

w.schedule(() -> {
try {
cs.onCompleted();
} finally {
w.unsubscribe();
}
}, delay, unit);
});
}


Luckily, Scheduler.Worker is a Subscription and thus we can directly send it to the child CompletableSubscriber before scheduling the call to onCompleted. In the scheduled task, we unsubscribe the worker only after calling onCompleted, so at that moment the worker is technically not yet unsubscribed. The Reactive-Streams specification states that an org.reactivestreams.Subscription should be considered cancelled at that point, and it works there because there is no way to check if a Subscription is cancelled or not. RxJava, however, lets you check it, but it doesn't really make sense to check whether your upstream considers you unsubscribed. The second reason for the shown order is that unsubscribing a Worker may cause unwanted interruptions down the line of onCompleted.

If we really want to make sure the Subscription the child receives is actually unsubscribed, we have to add a level of indirection via a MultipleAssignmentSubscription:


Scheduler.Worker w = scheduler.createWorker();

MultipleAssignmentSubscription mas =
new MultipleAssignmentSubscription();

mas.set(w);

cs.onSubscribe(mas);

w.schedule(() -> {
mas.set(Subscriptions.unsubscribed());
mas.unsubscribe();

try {
cs.onCompleted();
} finally {
w.unsubscribe();
}
}, delay, unit);


Instead of sending the Worker to the child, we wrap it inside the MultipleAssignmentSubscription and just before completing the child, we replace its content with an already unsubscribed Subscription and unsubscribe the whole container. The reason for this replacement, and the use of MultipleAssignmentSubscription is to avoid unsubscribing the worker too early; a SerialSubscription or a CompositeSubscription would not allow this.
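The key property of the multiple-assignment container is that set() replaces the current Subscription without unsubscribing it, whereas a SerialSubscription would unsubscribe the replaced one. A plain-Java model of that behavior (Runnables stand in for Subscriptions here; this is an illustrative sketch, not the real rx class):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class MultipleAssignmentDemo {
    static final Runnable UNSUBSCRIBED = () -> { };

    /** Minimal multiple-assignment container: set() replaces without running the old value. */
    static class MultiAssign {
        final AtomicReference<Runnable> current = new AtomicReference<>(() -> { });

        void set(Runnable next) {
            for (;;) {
                Runnable r = current.get();
                if (r == UNSUBSCRIBED) { next.run(); return; } // terminated: release immediately
                if (current.compareAndSet(r, next)) return;    // replace; old one is NOT run
            }
        }

        void unsubscribe() {
            Runnable r = current.getAndSet(UNSUBSCRIBED);
            if (r != UNSUBSCRIBED) r.run();
        }
    }

    public static void main(String[] args) {
        AtomicBoolean workerReleased = new AtomicBoolean();
        MultiAssign mas = new MultiAssign();
        mas.set(() -> workerReleased.set(true)); // stands in for mas.set(w)
        mas.set(() -> { });                      // replace with "already unsubscribed"
        mas.unsubscribe();                       // container unsubscribed...
        System.out.println(workerReleased.get()); // false: ...but the worker was untouched
    }
}
```

This mirrors the trick in the snippet above: swapping in an already unsubscribed Subscription before unsubscribing the container keeps the Worker alive for its final onCompleted duty.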

Finally, one must be careful with the scheduling of the onCompleted call, especially with RxJava 2.0 Schedulers. 2.0 Schedulers allow direct scheduling, that is, you can schedule a task without the need to create a Worker and unsubscribe it after use. This reduces the overhead for most one-shot scheduled operators but has the (acceptable) property of not ensuring any ordering between scheduled tasks of the same Scheduler. Therefore, if one is inclined to implement the delayed empty with it, it may look like this:


cs.onSubscribe(
scheduler.scheduleDirect(cs::onCompleted, delay, unit));


There is, however, a race condition here: it is possible the scheduled action, and the onCompleted within it gets executed before the scheduleDirect returns and thus the child subscriber will not have a Subscription set. A worse scenario is that both onXXX methods may run at the same time which violates the sequential protocol of Completable. The solution, again, is to have some indirection:


MultipleAssignmentDisposable mad = 
new MultipleAssignmentDisposable();

cs.onSubscribe(mad);

mad.set(
scheduler.scheduleDirect(cs::onCompleted, delay, unit));


Remark: due to the naming conflict, 2.0 named its resource-handling interface Disposable instead of Subscription. I'll post an entire series about RxJava 2.0 soon.


This example should foreshadow one property of the Completable API: resource management is the responsibility of the operator itself and there is no convenient add(Subscription) available anymore (such as with rx.Subscriber). This small inconvenience requires you to use subscription containers whenever there is scheduling or multiple sources involved. The benefit of this setup is that if an operator doesn't need resource management, no such structure is created (unlike with rx.Subscriber), saving on allocation cost and footprint, and giving better performance overall.


First completed

When I was in primary school, sometimes the teacher would issue a challenge where the first one to complete it would win some small prize. This is a nice analogy to the amb() operator: the first source to complete wins. The first one to fail also wins - unlike in real life where failure is not an option.

Regardless, what does it take to write such an operator for Completable? Clearly, unlike Observable.amb, we can't really capture which Completable source was the one that terminated first; therefore, there is no need for index trickery and the like, and a simple AtomicBoolean is sufficient.


public static Completable amb(Completable... students) {
return Completable.create(principal -> {
AtomicBoolean done = new AtomicBoolean(); // (1)

CompositeSubscription all =
new CompositeSubscription(); // (2)

CompletableSubscriber teacher =
new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
all.add(s); // (3)
}

@Override
public void onCompleted() {
if (done.compareAndSet(false, true)) { // (4)
all.unsubscribe();
principal.onCompleted();
}
}

@Override
public void onError(Throwable e) {
if (done.compareAndSet(false, true)) { // (5)
all.unsubscribe();
principal.onError(e);
}
}
};

principal.onSubscribe(all); // (6)

for (Completable student : students) {
if (done.get() || all.isUnsubscribed()) { // (7)
return;
}
student.subscribe(teacher); // (8)
}
});
}


In the primary school example, there is only one teacher that listens to all students for the completion indicator. This is an interesting optimization in the Completable (and Reactive-Streams) world and it does technically work: the teacher CompletableSubscriber is essentially stateless; it forwards its onXXX calls to other classes and doesn't have to remember who called its methods. I'm emphasizing technically because at this point in time, the Reactive-Streams specification forbids subscribing the same Subscriber instance to multiple Publishers, which in my opinion is overly restrictive: library writers who know what they are doing should be allowed to do this. Unsurprisingly, the compliant resolution is to move the creation of the teacher CompletableSubscriber into (8) without any changes to its internals; clearly, that should indicate it can be shared among the students. Nonetheless, let's see what each notable point does:

  1. We create the shared done indicator which is set to true once one of the student notifies the teacher about its completion (or failure).
  2. In case the head principal doesn't like the challenge, he/she can cancel the entire challenge.
  3. The teacher will register the Subscriptions given by the student Completables.
  4. and makes sure the first who signals the terminal event will also notify the principal about it (unfortunately, he/she won't know who was the first actually). At this point, there is no reason for the others to continue and will be unsubscribed from the challenge.
  5. In addition, if it turns out the challenge melts the brain of one of the students, for safety reasons, the challenge has to be cancelled and the head principal notified about the error.
  6. We allow the principal to tell each student to stop working on the challenge.
  7. For each student Completable, we "hand out" the challenge material and subscribe the teacher to each student's terminal event. It is possible that the challenge gets completed/failed/cancelled while we are still subscribing to students, at which point there is no reason to continue the process.
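The whole "first one wins" guarantee of points 4 and 5 rests on AtomicBoolean.compareAndSet: no matter how many terminal signals arrive, exactly one CAS from false to true succeeds. A minimal standalone illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class FirstWins {
    /** Delivers only the first of the given terminal signals, dropping the rest. */
    static List<String> deliverFirst(String... signals) {
        AtomicBoolean done = new AtomicBoolean();
        List<String> delivered = new ArrayList<>();
        for (String signal : signals) {
            // Only one caller can ever flip done from false to true.
            if (done.compareAndSet(false, true)) {
                delivered.add(signal);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(deliverFirst("completed-A", "error-B", "completed-C"));
        // [completed-A]
    }
}
```

The same gate works when the signals arrive concurrently from different threads, since compareAndSet is atomic; the sequential loop above just makes the outcome deterministic for demonstration.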

When all completed

Remaining at the school example, other times the students are evaluated and such evaluation procedure happens until all students have completed (or failed) their evaluation. If an accident happens, we may or may not stop the evaluation process and send the injured students to the ambulance in one batch.

I hope this operator setup sounds familiar, if not, here is the answer: merge and mergeDelayError respectively, depending on the "assembled" school policy.

Let's see the simplest case where we know exactly the number of students and any failure should stop the evaluation process:


public static Completable merge(Completable... students) {
return Completable.create(principal -> {
AtomicInteger remaining =
new AtomicInteger(students.length); // (1)

CompositeSubscription all =
new CompositeSubscription();

CompletableSubscriber evaluator =
new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
all.add(s);
}

@Override
public void onCompleted() {
if (remaining.decrementAndGet() == 0) { // (2)
all.unsubscribe();
principal.onCompleted();
}
}

@Override
public void onError(Throwable e) {
if (remaining.getAndSet(0) > 0) { // (3)
all.unsubscribe();
principal.onError(e);
}
}
};

principal.onSubscribe(all);

for (Completable student : students) {
if (all.isUnsubscribed()
|| remaining.get() <= 0) { // (4)
return;
}
student.subscribe(evaluator);
}
});
}


The implementation looks quite similar to the amb() case but has a few differences:


  1. We need to count (down) atomically the number of students who completed the evaluation successfully. 
  2. Once it reaches zero, the principal is notified about the completion of the entire evaluation.
  3. If one of the students signals an error, we set the remaining count to zero atomically and if it was previously non-zero, we cancel everybody and signal the error to the principal. Note that this can happen at most once because if there are multiple concurrent onError calls, only one of them will successfully replace a positive remaining value with a zero value. Any subsequent regular completion will just further decrement the remaining value.
  4. If there was an error or cancellation, the loop that subscribes the evaluator to the students should be stopped to not waste more time and resources.
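Point 3's claim that the error can be signaled at most once can be verified in isolation: getAndSet(0) hands the previous positive value to exactly one of the racing callers. A small concurrent sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ErrorOnce {
    /** Runs `racers` concurrent "onError" attempts; returns how many won the getAndSet gate. */
    static int race(int initialRemaining, int racers) throws InterruptedException {
        AtomicInteger remaining = new AtomicInteger(initialRemaining);
        AtomicInteger winners = new AtomicInteger();

        Thread[] threads = new Thread[racers];
        for (int i = 0; i < racers; i++) {
            threads[i] = new Thread(() -> {
                if (remaining.getAndSet(0) > 0) { // only the first caller sees a positive value
                    winners.incrementAndGet();    // ... and would call principal.onError(e)
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return winners.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(race(5, 4)); // 1 : at most one onError reaches the principal
    }
}
```

If the counter is already zero (the sequence has terminated), nobody wins the gate, matching the behavior of a late onError in the merge() implementation above.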


But what if we don't know the number of students and we don't really want to stop the evaluation in case of an error? This complicates the error management and the tracking of completed students slightly:


public static Completable mergeDelayError(
Iterable<? extends Completable> students) {
return Completable.create(principal -> {
AtomicInteger wip = new AtomicInteger(1); // (1)

CompositeSubscription all =
new CompositeSubscription();

Queue<Throwable> errors =
new ConcurrentLinkedQueue<>(); // (2)

CompletableSubscriber evaluator =
new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
all.add(s);
}

@Override
public void onCompleted() {
if (wip.decrementAndGet() == 0) { // (3)
if (errors.isEmpty()) {
principal.onCompleted();
} else {
principal.onError(
new CompositeException(errors));
}
}
}

@Override
public void onError(Throwable e) {
errors.offer(e); // (4)
onCompleted();
}
};

principal.onSubscribe(all);

for (Completable student : students) {
if (all.isUnsubscribed()) { // (5)
return;
}
wip.getAndIncrement(); // (6)
student.subscribe(evaluator);
}

evaluator.onCompleted(); // (7)
});
}

Again, the structure looks similar, but the expected behavior requires different algorithms:


  1. We start with an AtomicInteger work-in-progress counter, starting from 1. The reason for this is that we don't know how many students we are going to get from the Iterable, but we know we have finished once the wip counter reaches zero, after all students and the CompletableOnSubscribe body finished.
  2. We will collect the exceptions into a concurrent queue.
  3. The terminal condition is determined in the evaluator's onCompleted method: if the wip counter reaches zero, we check if there were errors queued up along the way and if so, we emit them as a CompositeException. Otherwise, a regular onCompleted event is emitted to the principal.
  4. Since we don't stop on error, we have to perform the same wip decrement as in onCompleted, but before that, the error has to be queued up. (Note that misbehaving Completable sources can disrupt this and trigger early completion if they send multiple terminal events.)
  5. In the loop where the student Completables are subscribed to we can only check if the principal is no longer interested in the evaluation; the value of the wip counter doesn't help here because it is going to be at least 1 while the loop is running.
  6. For each new Completable student, first we increment the wip count and then subscribe the evaluator to it. This makes sure the wip counter is at least 1 so the terminal condition isn't met while the loop is running.
  7. Finally, we call the evaluator's onCompleted method to signal no more Completables students will appear. This now allows the wip counter to reach zero and terminate the whole process.
Such compactness and reuse is rare (or even impossible) with the regular RxJava 1.x Observable operators.
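The counting trick in steps 1, 6 and 7 can be demonstrated without any Rx types at all. The following stand-alone sketch (all names are mine, not from the post) simulates the subscription loop: the wip counter starts at 1 on behalf of the loop itself, each new "source" increments it before completing or failing, and the final decrement stands in for step 7:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-alone sketch of the wip-counter pattern behind mergeDelayError:
// the counter starts at 1 for the subscription loop itself so the terminal
// condition can't be met while sources are still being added.
class WipCounterSketch {
    final AtomicInteger wip = new AtomicInteger(1);
    final Queue<Throwable> errors = new ConcurrentLinkedQueue<>();
    boolean terminated;
    boolean completedWithError;

    // Called when one source completes (normally or after queueing an error).
    void sourceCompleted() {
        if (wip.decrementAndGet() == 0) {
            terminated = true;
            completedWithError = !errors.isEmpty();
        }
    }

    void sourceFailed(Throwable e) {
        errors.offer(e);       // queue the error, then count down as usual
        sourceCompleted();
    }

    // Simulate subscribing to n sources, the failIndex-th of which fails.
    static WipCounterSketch run(int n, int failIndex) {
        WipCounterSketch s = new WipCounterSketch();
        for (int i = 0; i < n; i++) {
            s.wip.getAndIncrement();   // account for the new source first
            if (i == failIndex) {
                s.sourceFailed(new RuntimeException("source " + i));
            } else {
                s.sourceCompleted();
            }
        }
        s.sourceCompleted();           // the loop is done: drop the initial 1
        return s;
    }
}
```

Note how termination can only happen in the final sourceCompleted() call here, exactly as described in step 7 above.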


Transformative operators

I don't think there are too many ways one can transform a Completable "sequence". Most Observable operators no longer make sense and thus are omitted from the Completable API. Regardless, let's see a few examples.

With the first operator, we'd like to suppress an exception and since there are no values involved, the best we can do is to signal an onCompleted ourselves:


CompletableOperator onErrorComplete = cs -> {
return new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
cs.onSubscribe(s);
}

@Override
public void onCompleted() {
cs.onCompleted();
}

@Override
public void onError(Throwable e) {
cs.onCompleted();
}
};
};

source.lift(onErrorComplete).subscribe();


Alternatively, we'd like to resume with another Completable in case of an error:


public static Completable onErrorResumeNext(
Completable first,
Func1<Throwable, ? extends Completable> otherFactory) {

return first.lift(cs -> new CompletableSubscriber() {
final SerialSubscription serial =
new SerialSubscription(); // (1)
boolean once;

@Override
public void onSubscribe(Subscription s) {
serial.set(s); // (2)
}

@Override
public void onCompleted() {
cs.onCompleted();
}

@Override
public void onError(Throwable e) {
if (!once) { // (3)
once = true;
otherFactory.call(e).subscribe(this); // (4)
} else {
cs.onError(e); // (5)
}
}
});
}

In this example, we lift the CompletableOperator instance into the supplied Completable. In the operator body, we return a CompletableSubscriber with the following internal behavior:

  1. Since we may need to switch sources, we have to swap the incoming Subscription from the first to the other Completable. It is possible to use MultipleAssignmentSubscription for this case as well. This is analogous to the ProducerArbiter approach common with Observable operators, although much simpler in nature. In addition, we will reuse the this instance on the new Completable but we don't want to keep resubscribing to it if it fails as well.
  2. We set the incoming Subscription on the SerialSubscription, evicting the previous subscription.
  3. In case the first signals an error, we make sure the case that switches to the other Completable runs once,
  4. because we are going to reuse the current CompletableSubscriber instance for it as well, saving on allocation. The code in (2) makes sure the unsubscription chain is still maintained. This example omits the try-catch around the factory call for brevity; in that case, you can create a CompositeException for both the original error and the fresh crash and call cs.onError with it.
  5. If this is the other Completable source that fails, we simply signal the same error downstream.
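The switch-at-most-once logic of steps 3-5 can be sketched without any Rx types, using plain Callables as stand-ins for Completables (all names here are mine, for illustration only):

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

// Sketch of the resume-once logic behind onErrorResumeNext: the first
// failure switches to a fallback exactly once; a failure of the fallback
// propagates as-is instead of resubscribing again.
class ResumeOnceSketch {
    static String run(Callable<String> first,
            Function<Throwable, Callable<String>> fallbackFactory) {
        boolean once = false;
        Callable<String> current = first;
        for (;;) {
            try {
                return "completed: " + current.call();
            } catch (Exception e) {
                if (!once) {
                    once = true;                       // switch at most once
                    current = fallbackFactory.apply(e);
                } else {
                    return "error: " + e.getMessage(); // fallback failed: give up
                }
            }
        }
    }
}
```

For example, a failing first source with a succeeding fallback yields the fallback's completion, while two failures in a row surface the second error, mirroring items 3 and 5 above.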

If we think about it, implementing a regular continuation (i.e., andThen, endWith, concatWith) in case of an onCompleted practically uses the same approach. Instead of switching in onError, the switch happens in onCompleted:


        // ...
@Override
public void onCompleted() {
if (!once) { // (3)
once = true;
otherFactory.call(e).subscribe(this); // (4)
} else {
cs.onCompleted(); // (5)
}
}

@Override
public void onError(Throwable e) {
cs.onError(e);
}


Lastly, we'd like an operator that switches to another Completable on a timeout condition:


public static Completable timeout(
Completable first,
long timeout, TimeUnit unit, Scheduler scheduler,
Completable other) {
return first.lift(cs -> new CompletableSubscriber() {

final CompositeSubscription csub =
new CompositeSubscription(); // (1)

final AtomicBoolean once = new AtomicBoolean(); // (2)


@Override
public void onSubscribe(Subscription s) {
csub.add(s); // (3)

Scheduler.Worker w = scheduler.createWorker();
csub.add(w);

cs.onSubscribe(csub);

w.schedule(this::onTimeout, timeout, unit); // (4)
}

@Override
public void onCompleted() {
if (once.compareAndSet(false, true)) {
csub.unsubscribe(); // (5)
cs.onCompleted();
}
}

@Override
public void onError(Throwable e) {
if (once.compareAndSet(false, true)) {
csub.unsubscribe();
cs.onError(e);
}
}

void onTimeout() {
if (once.compareAndSet(false, true)) { // (6)
csub.clear();

other.subscribe(new CompletableSubscriber() {
@Override
public void onSubscribe(Subscription s) {
csub.add(s);
}

@Override
public void onCompleted() {
cs.onCompleted();
}

@Override
public void onError(Throwable e) {
cs.onError(e);
}
});
}
}
});
}


This operator is a bit more involved:

  1. We are going to track both the first and other Completable source's Subscription as well as the Worker of the Scheduler that triggers the timeout condition.
  2. Since the terminal event of the first Completable races with timeout event, we need to determine a winner (as with amb()) which locks out the other event.
  3. When the Subscription arrives from the first source, we add it to the composite, create the Worker and then forward the whole composite to the downstream CompletableSubscriber.
  4. Finally, we schedule the execution of the onTimeout method. This setup avoids the race between onTimeout and onSubscribe, i.e., if the timeout were scheduled at assembly time, it could fire before the Subscription arrives, at which point one needs extra logic to make things right (not detailed here). Most of the time, this style of onSubscribe implementation saves a lot of headache with Reactive-Streams compliant operators in 2.0.
  5. If the terminal event of the first Completable happens first, we atomically and conditionally set the flag to true (this prevents onTimeout from executing its inner logic). Since the composite manages the Worker resource, we have to unsubscribe it, followed by the emission of the original event downstream.
  6. If the onTimeout happens first and wins the race to set the flag, we clear the composite and subscribe to the other Completable with a fresh CompletableSubscriber. We call clear here because we still need the CompositeSubscription to store the Subscription of the other Completable and an unsubscribed composite is of no use. The additional benefit of clear is that it will cancel the first Completable as well as the Worker running the timeout.
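The race in steps 5 and 6, where the terminal event and the timeout both try to flip the same AtomicBoolean and only one of them may act, can be demonstrated stand-alone with two threads (a sketch with names of my own, no Rx types):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "first one wins" pattern from the timeout operator: two
// contenders race on a compareAndSet and only the winner runs its one-time
// logic, no matter how the threads interleave.
class OnceRaceSketch {
    static int run() {
        AtomicBoolean once = new AtomicBoolean();
        AtomicInteger actions = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(2);

        Runnable contender = () -> {
            if (once.compareAndSet(false, true)) {
                actions.incrementAndGet();   // the winner's one-time logic
            }
            done.countDown();
        };

        new Thread(contender).start();   // e.g. the source's onCompleted
        new Thread(contender).start();   // e.g. the scheduled onTimeout
        try {
            done.await();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return actions.get();            // exactly one action, whoever won
    }
}
```

Whichever thread loses the compareAndSet is locked out for good, which is precisely what keeps the downstream from receiving two terminal events.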


Hot Completable?

As a final thought, I'm not sure if a hot Completable (or a published Completable) has any use cases out there, but let's see how one can implement one if necessary. Since Completable doesn't have any value, there is only one meaningful CompletableSubject implementation possible: there is nothing to replay other than the terminal event.

First, let's see the skeleton of the CompletableSubject:


public final class CompletableSubject 
extends Completable implements CompletableSubscriber {

public static CompletableSubject create() {
State state = new State();
return new CompletableSubject(state);
}

static final class State
implements CompletableOnSubscribe, CompletableSubscriber {

// TODO state fields

boolean add(CompletableSubscription t) {
// TODO implement
}

void remove(CompletableSubscription t) {
// TODO implement
}

@Override
public void call(CompletableSubscriber t) {
// TODO implement
}

@Override
public void onSubscribe(Subscription d) {
// TODO implement
}

@Override
public void onCompleted() {
// TODO implement

}

@Override
public void onError(Throwable e) {
// TODO implement

}
}

static final class CompletableSubscription
extends AtomicBoolean implements Subscription {
/** */
private static final long serialVersionUID =
-3940816402954220866L;

final CompletableSubscriber actual;
final State state;

public CompletableSubscription(
CompletableSubscriber actual, State state) {
this.actual = actual;
this.state = state;
}

@Override
public boolean isUnsubscribed() {
return get();
}

@Override
public void unsubscribe() {
if (compareAndSet(false, true)) {
state.remove(this);
}
}
}

final State state;

private CompletableSubject(State state) {
super(state);
this.state = state;
}

@Override
public void onSubscribe(Subscription d) {
state.onSubscribe(d);
}

@Override
public void onCompleted() {
state.onCompleted();
}

@Override
public void onError(Throwable e) {
state.onError(e);
}
}


The structure starts out like any other Subject we've built before. We have to create the CompletableSubject instance with a factory and have the CompletableSubscriber methods delegate to the shared State instance. The CompletableSubscription will be used for tracking each CompletableSubscriber and help manage the unsubscription based on their identity.

The State class will hold onto the terminal indicator and the optional Throwable instance along with the array of known child CompletableSubscribers:


Throwable error;

volatile CompletableSubscription[] subscribers = EMPTY;

static final CompletableSubscription[] EMPTY =
new CompletableSubscription[0];
static final CompletableSubscription[] TERMINATED =
new CompletableSubscription[0];

boolean add(CompletableSubscription t) {
if (subscribers == TERMINATED) {
return false;
}
synchronized (this) {
CompletableSubscription[] a = subscribers;
if (a == TERMINATED) {
return false;
}

CompletableSubscription[] b =
new CompletableSubscription[a.length + 1];
System.arraycopy(a, 0, b, 0, a.length);
b[a.length] = t;
subscribers = b;
return true;
}
}

void remove(CompletableSubscription t) {
CompletableSubscription[] a = subscribers;
if (a == EMPTY || a == TERMINATED) {
return;
}

synchronized (this) {
a = subscribers;
if (a == EMPTY || a == TERMINATED) {
return;
}

int j = -1;
for (int i = 0; i < a.length; i++) {
if (a[i] == t) {
j = i;
break;
}
}

if (j < 0) {
return;
}
if (a.length == 1) {
subscribers = EMPTY;
return;
}
CompletableSubscription[] b =
new CompletableSubscription[a.length - 1];
System.arraycopy(a, 0, b, 0, j);
System.arraycopy(a, j + 1, b, j, a.length - j - 1);
subscribers = b;
}
}


The add and remove methods have the usual and familiar implementations that allow tracking the subscribed child CompletableSubscribers.
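As a stand-alone illustration of this copy-on-write pattern, here is a sketch with plain Strings standing in for the CompletableSubscriptions (class and method names are mine, not from the post):

```java
import java.util.Arrays;

// Stand-alone sketch of the copy-on-write subscriber array used by the
// State class: add/remove replace the whole array under a lock while
// readers see a consistent snapshot via the volatile field.
class CowArraySketch {
    static final String[] EMPTY = new String[0];
    static final String[] TERMINATED = new String[0];

    volatile String[] subscribers = EMPTY;

    synchronized boolean add(String t) {
        String[] a = subscribers;
        if (a == TERMINATED) {
            return false;                    // already terminated: reject
        }
        String[] b = Arrays.copyOf(a, a.length + 1);
        b[a.length] = t;
        subscribers = b;                     // publish the new array
        return true;
    }

    synchronized void remove(String t) {
        String[] a = subscribers;
        int j = Arrays.asList(a).indexOf(t);
        if (j < 0) {
            return;                          // not tracked (or already gone)
        }
        if (a.length == 1) {
            subscribers = EMPTY;
            return;
        }
        String[] b = new String[a.length - 1];
        System.arraycopy(a, 0, b, 0, j);
        System.arraycopy(a, j + 1, b, j, a.length - j - 1);
        subscribers = b;
    }

    synchronized String[] terminate() {
        String[] a = subscribers;
        subscribers = TERMINATED;
        return a;                            // snapshot of whom to notify
    }
}
```

Terminating swaps in the TERMINATED sentinel and hands back the last snapshot, which is exactly what the onCompleted/onError implementations further below iterate over.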

Next, let's handle the incoming subscribers:


    @Override
public void call(CompletableSubscriber t) {
CompletableSubscription cs =
new CompletableSubscription(t, this);
t.onSubscribe(cs);

if (add(cs)) {
if (cs.isUnsubscribed()) {
remove(cs);
}
} else {
Throwable e = error;
if (e != null) {
t.onError(e);
} else {
t.onCompleted();
}
}
}


We create a CompletableSubscription for each CompletableSubscriber that captures both the subscriber and the current state instance: if the child subscriber calls unsubscribe on it, it can remove itself from the tracking array of the state. Note that there is a race condition between a successful add and a cancellation coming from downstream which may leave the CompletableSubscriber attached. Therefore, we have to check if the child has unsubscribed during the run of add and if so, we call remove() again to be sure. If the add returns false, that means the CompletableSubject has reached its terminal state and reading the error field can tell how the child subscriber should be notified.

Handling the onXXX notifications isn't that complicated either:


    @Override
public void onSubscribe(Subscription d) {
if (subscribers == TERMINATED) {
d.unsubscribe();
}
}

@Override
public void onCompleted() {
CompletableSubscription[] a;
synchronized (this) {
a = subscribers;
subscribers = TERMINATED;
}

for (CompletableSubscription cs : a) {
cs.actual.onCompleted();
}
}

@Override
public void onError(Throwable e) {
CompletableSubscription[] a;
synchronized (this) {
a = subscribers;
error = e;
subscribers = TERMINATED;
}

for (CompletableSubscription cs : a) {
cs.actual.onError(e);
}
}


Reacting to onSubscribe is up for debate; here I unsubscribe the incoming subscription if the CompletableSubject has reached its terminal state. Otherwise, we can't do much since the CompletableSubject could be subscribed to many Completable sources, any of which can bring it to its terminal state. You can ignore the parameter of this method or keep track of all the Subscriptions in a composite.

The onError and onCompleted methods look quite alike. In both cases we atomically swap in the terminated array and loop through the previous array while emitting the appropriate terminal event. Note that we set the error field in onError before we store the TERMINATED value in subscribers, which will give it proper visibility in the call() method above.


Conclusion

In this post, I've detailed ways of implementing Completable operators that either act as sources of terminal events or transform them in some way and even thrown in a Subject-like implementation of it.

Completable is more like a type-tool to indicate a sequence won't have any value but only side-effects, and the reduced API surface may be more convenient than working with the full-blown Observable API.

Implementing Completable operators is easier than implementing the backpressure-supporting Observable operators, but one has to look out for proper unsubscription chaining, avoiding races between the onXXX methods and utilizing the AtomicXXX classes for the (efficient) state management.

Now that we have even more experience in writing Subjects, the next blog post will conclude the series about ConnectableObservables.

FlatMap (part 1)


Introduction

In this blog post, I begin to explain the outer and inner workings of the most used, most misunderstood and, at the same time, one of the most complex operators there is: flatMap.

FlatMap is most useful because it lets you replace simple values with something that can change the output in terms of time, location and value count. FlatMap is misunderstood because it is introduced late, not enough time is spent demonstrating it, and it is often surrounded with functional programming technobabble. Finally, it's complex because it has to coordinate the backpressure of a single consumer with requests to multiple sources, and we usually don't know which of them will respond with actual items. Maybe all of them.

FlatMap has a companion operator: merge. Merge lets you flatten a sequence of Observables into a single stream of values while ensuring the contract of the Observer, namely, the requirement of non-concurrent invocation of the onXXX methods and conformance to the onNext* (onError|onCompleted)? protocol. This is necessary because although the individual Observables you merge conform to this protocol on their own, they get mixed in time, location and numbers when you listen to them all at once. Of course, flatMap has to do the same, so why are there two operators?

The answer is convenience and usage pattern. FlatMap is an in-sequence operator that reacts to values from the upstream by generating an Observable, through a callback function, that is internally subscribed to, coordinated and serialized with respect to any previous or subsequent Observables generated through the same callback function. Merge, on the other hand, works on a two-dimensional sequence: an Observable of Observables. There is no function involved here, but the operator has to subscribe to all of the inner Observables emitted by the outer Observable.

The fun thing is, you can express them with the other:

Func1<T, Observable<R>> f = ...
source.flatMap(f) == Observable.merge(source.map(f))


Observable<Observable<R>> o = ...
Observable.merge(o) == o.flatMap(v -> v)

In the first case, flatMap can be expressed by mapping the values of the source onto Observable<R>s and merging them together. If you look at the RxJava 1.x source code, you'll see that flatMap is implemented in terms of merge in this way. In the second case, given the two-dimensional sequence, when we flatMap over the elements of the inner Observable<R>s as the value v, they are already of type Observable and we can return them as they are.
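The same identity holds for synchronous streams, so it can be demonstrated as an analogy with java.util.stream, entirely without Rx types (a sketch; the helper names are mine):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Synchronous analogue of the flatMap/merge identity: flatMap(f) yields the
// same elements as mapping each value to an inner stream and flattening.
class FlatMapIdentitySketch {
    static List<Integer> viaFlatMap(List<Integer> src,
            Function<Integer, Stream<Integer>> f) {
        return src.stream().flatMap(f).collect(Collectors.toList());
    }

    static List<Integer> viaMapThenFlatten(List<Integer> src,
            Function<Integer, Stream<Integer>> f) {
        return src.stream()
                .map(f)                      // source.map(f)
                .flatMap(inner -> inner)     // merge(o) == o.flatMap(v -> v)
                .collect(Collectors.toList());
    }
}
```

Of course, a Stream is strictly sequential, which is exactly the dimension (time, asynchrony) the Rx versions add on top of this identity.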

You can think of flatMap as the join part of a fork-join operation, when all threads come together again to form a single sequence. However, there are no guarantees on when this coming together happens.

For example, given a sequence of product IDs, you'd want to fire off network calls, available conveniently with Retrofit-reachable services, that return some additional information to each of them. We know that networks and databases can respond in an unpredictable way and some network calls for later IDs may come back earlier than other responses. Now you have the responses in arbitrary order. Sometimes, the order doesn't matter, sometimes it does.

But no matter how many times this lack of ordering guarantee is mentioned with flatMap, it still tends to pop up in questions everywhere. But why?

The reason is how flatMap is introduced to the reader: by showing a completely sequential example of, say, mapping a range of values onto subranges:


Observable.range(1, 10).flatMap(v -> Observable.range(v, 2))
.subscribe(System.out::println);


This is a completely synchronous use and the output is nicely ordered. Then you see the same example written with concatMap, which does keep the order, and you receive the same output. So instead, let's introduce some asynchrony into the example, a fairly obvious and simple one, to show that the order of the input may not hold all the way through:


Observable.range(1, 10)
.flatMap(v -> Observable.just(v).delay(11 - v, TimeUnit.SECONDS))
.toBlocking()
.subscribe(System.out::println);


What happens here is that we map the individual integers onto a delayed scalar Observable where the delay gets smaller as the value gets bigger. At the end, the output is a sequence of decreasing values, the complete reverse of the original input range.

FlatMap also plays a big role in asynchronous continuations. That is, when some asynchronous computation or network retrieval completes, one wishes to resume with some other asynchronous computation based on the single value returned by the first. The emphasis is on single here. My current knowledge is that no reactive network library streams multiple values at this time; they give you a single result at once (which can be a List of all values, but that is still just a single object). Thus, when one encounters a logic built with flatMap, one rarely experiences its property of running Observables in parallel.

The final property of using flatMap is its ability to change the number of items that gets emitted to the downstream. Given the proper callback function, you can make the operator emit nothing for a single input, exactly one value, multiple values or even an error. All you have to do is return empty(), just(), some chain of Observables or error() from that callback.

It is often asked what one should do when, given a regular map() operation, one would like to throw an error instead of returning a value. If that exception is a RuntimeException, you can throw it directly and RxJava will turn it into an onError for you inside the map operator. However, if you have a checked exception, like IOException, you are out of luck with map(). Either you wrap it into a RuntimeException, but then you have to unwrap it somewhere else, or you write a custom map() operator with a callback function that lets you throw checked exceptions.

The alternative is to use flatMap because you can return an error() Observable which then directly emits your error as onError without the need for wrapping:

Observable.range(1, 10)
.flatMap(v -> {
    if (v < 5) {
        return Observable.just(v * v);
    }
    return Observable.<Integer>error(new IOException("Why not?!"));
})
.subscribe(System.out::println, Throwable::printStackTrace);

Sometimes, the Observables you flatMap may themselves signal an error for some reason. The default RxJava behavior is that whenever an onError situation is encountered, everything is torn down immediately and the stream is terminated with that specific error. The problem with this is that a failing source can waste the ongoing effort of the other sources, but at the same time, you don't want to suppress that failing source with one of the onErrorXXX operators - likely because you may not know upfront which of them is going to fail.

The solution is to delay the error until all sources have terminated and report the error(s) at the very end. This allows us to apply a "global" error handler to the output of the flatMap and still work with all "successfully received" values.

Therefore, flatMap has an overload which takes a boolean delayErrors parameter just after the function callback. The operator merge has a different method name for the same behavior: mergeDelayError.

Backpressure, the way of preventing buffer-bloat with reactive flows, is a cornerstone of RxJava. Most operators that don't have timing aspects in them apply and honor backpressure. Unfortunately, flatMap by default can only say it honors backpressure but doesn't apply it towards its main input.

This means that when you use the common overload of flatMap or merge, the operator will request Long.MAX_VALUE from its upstream and realize it all at once. This unbounded behavior leads to an unbounded number of active subscriptions to the generated inner Observables.

This property doesn't cause much trouble if the inner Observables are short or infrequently emitting, but if there is an asynchronous boundary after flatMap, let's say, observeOn, items can quite easily pile up in flatMap and degrade performance considerably.

Technically, as we will see, there is nothing preventing flatMap from applying the same backpressure to its input. However, historically, some coming from Rx.NET were relying on its lack of backpressure and happily merge 1000s of Observables at once and at the same time, somehow relying on the fact that they are merged live. Thus, the default unbounded behavior stuck with RxJava.

However, many recognized that merging 1000s of hot Observables doesn't compare to 1000s of cold, networking Observables and there needs to be a way to limit the number of active Observables. Thus, flatMap and merge have overloads that take a maxConcurrency parameter to enforce this limit. The fun fact about this property is that it's much easier to implement in a backpressured environment than in the non-backpressured Rx.NET world.


Implementing FlatMap

Given the conversion between flatMap and merge above, one can ask the question which operator should we implement. Clearly, merge() doesn't have to deal with a mapping function so why not do that, like RxJava itself?

The answer is: allocation. If you implement flatMap in terms of merge, you have to use two operators: merge and map. When the sequences are assembled, the application of an operator incurs allocation cost. This has to happen because operators usually hold some state: parameters, function callbacks, etc. Having more operators means having more assembly allocation, more garbage and more GC churn, especially if the sequence is short lived.

(Things got worse a bit due to a convenience decision made some time ago in RxJava: the introduction of the lift() operator and its ubiquitous usage inside the standard operators. So in total, applying operators may incur allocating 6-10 objects whereas the theoretical minimum should be 1-2.)

FlatMap has to serialize events coming from all of the active sources, so the first building block that comes to mind is SerializedSubscriber. How easy it would be to just subscribe (or route to) an instance of it and have everything nicely serialized.

Unfortunately, that doesn't work.

First of all, subscribing the same, stateful instance of SerializedSubscriber to multiple sources is a bad idea. We can't really control the requests this way; plus, different sources may and often will set a Producer on their Subscribers, so given a single instance of SerializedSubscriber, they would overwrite each other's Producer.

Second, we need a way to get rid of completed sources and not retain them indefinitely. Since there is no Subscriber.remove() to complement Subscriber.add(), even if we instantiate multiple Subscribers that forward events to the same underlying SerializedSubscriber, we have to do some CompositeSubscription juggling to get the cleanup or downstream's unsubscription working.

Lastly, SerializedSubscriber can block due to its use of synchronized blocks. Blocking gives some "natural" backpressure but, at the same time, hinders progress. If one runs within an asynchronous requirement/environment, there is a great incentive to avoid blocking as much as possible.

Therefore, the tool to avoid blocking and get serialized output is to use the familiar queue-drain approach. So let's start by sketching out the skeleton of our flatMap operator:


public final class OpFlatMap<T, R> implements Operator<R, T> {

final Func1<? super T, ? extends Observable<? extends R>> mapper;

final int prefetch;

public OpFlatMap(Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch) {
this.mapper = mapper;
this.prefetch = prefetch;
}

@Override
public Subscriber<? super T> call(Subscriber<? super R> t) {
FlatMapSubscriber<T, R> parent = new FlatMapSubscriber<>(t, mapper, prefetch);
parent.init();
return parent;
}
}

The operator takes a mapper callback function that will generate the inner Observables and a prefetch amount that tells how many items to request from each of these inner Observables. We will hand these and the incoming child subscriber over to the parent Subscriber we create. For convenience, the setting up of the unsubscription chain and backpressure is hidden inside the init() method.

Next comes the implementation of the FlatMapSubscriber that does the coordination and value collection:


static final class FlatMapSubscriber<T, R> extends Subscriber<T> {
final Subscriber<? super R> actual;

final Func1<? super T, ? extends Observable<? extends R>> mapper;

final int prefetch; // (1)

final CompositeSubscription csub; // (2)

final AtomicInteger wip; // (3)

final Queue<Object> queue; // (4)

final AtomicLong requested; // (5)

final AtomicInteger active; // (6)

final AtomicReference<Throwable> error; // (7)

public FlatMapSubscriber(Subscriber<? super R> actual,
Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch) {
this.actual = actual;
this.mapper = mapper;
this.prefetch = prefetch;
this.csub = new CompositeSubscription();
this.wip = new AtomicInteger();
this.requested = new AtomicLong();
this.queue = new ConcurrentLinkedQueue<>();
this.active = new AtomicInteger(1);
this.error = new AtomicReference<>();
}

public void init() {
// TODO implement
}

@Override
public void onNext(T t) {
// TODO implement
}

@Override
public void onError(Throwable e) {
// TODO implement
}

@Override
public void onCompleted() {
// TODO implement
}

void childRequested(long n) {
// TODO implement
}

void innerNext(Subscriber<R> inner, R value) {
// TODO implement
}

void innerError(Throwable ex) {
// TODO implement
}

void innerComplete(Subscriber<?> inner) {
// TODO implement
}

void drain() {
// TODO implement
}
}

So far nothing special, just the usual fields and parameters:

  1. We have the child subscriber, the mapper function and the prefetch value.
  2. We will track the inner subscribers with a CompositeSubscription so when the child unsubscribes, we can unsubscribe them all at once, plus when an inner Observable terminates, we can remove just its subscriber.
  3. We have the usual work-in-progress atomic integer indicating if there is a drain going on, thus establishing the non-blocking queue-drain approach
  4. We have a shared queue where all sources will submit their value before attempting to drain it towards the child subscriber. The queue takes Object instead of R because we will also use this queue to post the Subscriber who generated that particular value so we can request more from that particular source.
  5. We need to track the child requested amount because when a child requests 1, we can't really tell which inner Observable to request that 1 from, thus we have to request from all of them. However, this may yield any number of response items and we can't simply emit all of them to the child subscriber (possibly causing a MissingBackpressureException down the line).
  6. We need to track how many active sources there are, including the main source of Ts. When this counter reaches zero, the child subscriber can be completed.
  7. Any of the sources may signal an error or even the main source itself as well. For simplicity, we will only store the very first exception and route the rest to the RxJavaPlugins' error handler.
You may think, if this so-called request amplification happens, why not request 1-by-1? The reason is twofold: a) it is very inefficient to get values 1-by-1 from most sources and b) you still get N values for a single downstream request, so you have to do some sort of delivery accounting to know when to request 1 again from the inner sources.

Before we jump into the unimplemented methods, we still need another class: FlatMapInnerSubscriber that we will use to subscribe to each individual Observable<R> generated by the mapper function. Since you can't extend two classes or implement the same generic interface with different type parameters, a separate class is required.

static final class FlatMapInnerSubscriber<T, R> extends Subscriber<R> {
final FlatMapSubscriber<T, R> parent;

public FlatMapInnerSubscriber(FlatMapSubscriber<T, R> parent, int prefetch) {
this.parent = parent;
request(prefetch); // (1)
}

@Override
public void onNext(R t) {
parent.innerNext(this, t); // (2)
}

@Override
public void onError(Throwable e) {
parent.innerError(e);
}

@Override
public void onCompleted() {
parent.innerComplete(this);
}

void requestMore(long n) {
request(n); // (3)
}
}

Here, we start with an initial request of the prefetch (1) value and delegate all onXXX methods back to the parent FlatMapSubscriber. In the parent FlatMapSubscriber, I mentioned that we will enqueue the sender along with the value it sends in the shared Queue<Object>. This may seem non-intuitive, and one would simply call request(1) just before or after (2). The problem with this is that the source will keep receiving requests and generating values, flooding the queue and not achieving backpressure at all. The solution is to make sure one requests only when that particular source's value has been taken and thus it is allowed to produce a replacement. (We will see in part 2 how this can be achieved by different means.) In addition, we need to expose the protected request() method to allow the drain loop to request replenishments.
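The enqueue-sender-and-value plus queue-drain serialization described here can be sketched stand-alone with plain strings as "senders" (a simplified, single-producer-safe sketch with names of my own; a real operator must make the two-element offer atomic, e.g. by enqueueing one wrapper object per pair):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the queue-drain serialization used by flatMap:
// producers enqueue (sender, value) pairs and call drain(); the wip
// counter guarantees only one thread emits at a time, without blocking.
class QueueDrainSketch {
    final Queue<Object> queue = new ConcurrentLinkedQueue<>();
    final AtomicInteger wip = new AtomicInteger();
    final StringBuilder out = new StringBuilder();  // stands in for the child

    void innerNext(String sender, Integer value) {
        queue.offer(sender);     // enqueue who sent it...
        queue.offer(value);      // ...followed by the value itself
        drain();
    }

    void drain() {
        if (wip.getAndIncrement() != 0) {
            return;              // someone else is already draining
        }
        do {
            Object sender;
            while ((sender = queue.poll()) != null) {
                Object value = queue.poll();
                out.append(sender).append('=').append(value).append(';');
                // the real operator would call sender.requestMore(1) here,
                // now that this sender's value has been consumed
            }
        } while (wip.decrementAndGet() != 0);
    }
}
```

The comment inside the loop marks the spot where the real FlatMapInnerSubscriber would be asked for a replacement value, which is exactly why the sender is enqueued alongside its value.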

Now back to FlatMapSubscriber:


    public void init() {
add(csub);
actual.add(this);
actual.setProducer(new Producer() {
@Override
public void request(long n) {
childRequested(n);
}
});
}

In the init() method, we set up the unsubscription link with the child subscriber and delegate its request() calls back to our childRequested() method. You could ask, why not do this in the constructor? The reason is that by keeping this separate, the this reference of the constructor won't leak before all the final fields have been sealed, avoiding memory visibility and other problems.


    @Override
public void onNext(T t) {
Observable<? extends R> o;

try {
o = mapper.call(t);
} catch (Throwable ex) {
Exceptions.throwOrReport(ex, this, t);
return;
}

active.getAndIncrement();
FlatMapInnerSubscriber<T, R> inner =
new FlatMapInnerSubscriber<>(this, prefetch);
csub.add(inner);

o.subscribe(inner);
}

In the onNext() method, we call the function to generate an Observable, increment the active counter, create the inner Subscriber and add it to the composite before (!) we subscribe it to the generated Observable. This way, when the inner terminates, it won't accidentally decrement the active count to zero and will be able to remove itself from the composite. Since the mapper can throw, we wrap the call into a try-catch and use a helper method from Exceptions to either rethrow a fatal exception (such as OutOfMemoryError or StackOverflowError) or report it through ourselves, which essentially calls onError:

    @Override
public void onError(Throwable e) {
if (error.compareAndSet(null, e)) {
unsubscribe();
drain();
} else {
RxJavaPlugins.getInstance()
.getErrorHandler().handleError(e);
}
}

We atomically try to set the Throwable inside error if it is still null; if successful, we unsubscribe ourselves (and thus all active inner subscribers) and call drain(), which will take care of emitting the error in a serialized fashion to the child subscriber. If somebody else already signaled an error, we instead send the Throwable to the plugin's error handler. We don't have to worry about the active count here because onError is an immediate terminal state for us, unlike what happens in onCompleted:


    @Override
public void onCompleted() {
if (active.decrementAndGet() == 0) {
drain();
}
}

The active count is decremented atomically and if it reaches zero, we call drain(). The drain will make sure all queued-up values get emitted before the completion signal. Since we consider the main source of Ts also an input, the counter starts at 1 and may go up and down as inner sources get created and subscribed to. We strongly expect the Observables participating in the flatmapping to honor the protocol: at most one onCompleted call from each of them. Thus if the main source completes and the active counter was 1, we can be sure no further inner sources will arrive (so there is no 0 - 1 - 0 change) to mess up the accounting. If, for some reason, you don't trust all the sources or your main Observable, feel free to surround this with a CAS to ensure the decrement is only executed once:


    AtomicBoolean once = new AtomicBoolean();
// ...
@Override
public void onCompleted() {
if (once.compareAndSet(false, true)) {
if (active.decrementAndGet() == 0) {
drain();
}
}
}

Due to design decisions in RxJava, the FlatMapSubscriber can't itself implement the Producer interface and needs to swing around any child request with the help of another Producer instance, as seen in the init() method. The target of that call looks as follows:


    void childRequested(long n) {
if (n > 0) {
BackpressureUtils.getAndAddRequest(requested, n);
drain();
}
}

The utility method will make sure the requested amount is added and capped to Long.MAX_VALUE and then invokes the drain() method to make sure any queued value is emitted up to that total requested amount.
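A sketch of the capped addition such a utility performs (my own reconstruction of the assumed semantics, not RxJava's actual BackpressureUtils code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Atomically add n to the requested amount, capping the sum at
// Long.MAX_VALUE so repeated requests cannot overflow into negatives.
class CappedAdd {
    static long getAndAdd(AtomicLong requested, long n) {
        for (;;) {
            long current = requested.get();
            if (current == Long.MAX_VALUE) {
                return Long.MAX_VALUE;   // already treated as "unbounded"
            }
            long next = current + n;
            if (next < 0L) {             // overflow -> clamp to the maximum
                next = Long.MAX_VALUE;
            }
            if (requested.compareAndSet(current, next)) {
                return current;          // return the previous value, like getAndAdd
            }
        }
    }
}
```

Once the requested field reaches Long.MAX_VALUE, it stays there, which is why Long.MAX_VALUE conventionally means an unbounded downstream.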

The delegate methods called from the FlatMapInnerSubscriber are themselves short and I'll show them together:


    void innerNext(Subscriber<?> inner, R value) {
queue.offer(inner);
queue.offer(NotificationLite.instance().next(value));
drain();
}

void innerError(Throwable ex) {
onError(ex);
}

void innerComplete(Subscriber<?> inner) {
csub.remove(inner);
onCompleted();
}

The innerNext() puts the subscriber and the value (wrapping nulls with the help of NotificationLite) into the queue before calling drain; innerError() just delegates to onError and finally, innerComplete() removes the inner Subscriber from the composite and delegates to the regular onCompleted. Note that if you applied the once trick mentioned above, this delegation won't work: you have to introduce a similar once field on the FlatMapInnerSubscriber and move the decrementAndGet() == 0 check into innerComplete() directly.

Finally, let's see the drain() method, piece by piece:


if (wip.getAndIncrement() != 0) {
return;
}

int missed = 1;

for (;;) {

long r = requested.get();
long e = 0L;

while (e != r) {

The first section is typical; we increment the wip counter and if it happened to be 0, we enter the drain loop. We'll use missed to detect if others have also called drain() and thus more work has to be performed. We read the current requested amount and prepare the emission counter. The loop runs as long as the emission count hasn't reached the requested amount.

        if (actual.isUnsubscribed()) {
return;
}

boolean done = active.get() == 0; // (1)
Throwable ex = error.get(); // (2)
if (ex != null) {
actual.onError(ex);
return;
}

Object o = queue.poll();

if (done && o == null) { // (3)
actual.onCompleted();
return;
}

if (o == null) {
break;
}

Object v;

for (;;) { // (4)
if (actual.isUnsubscribed()) {
return;
}
v = queue.poll();
if (v != null) {
break;
}
}

actual.onNext(NotificationLite
.<R>instance().getValue(v)); // (5)

((FlatMapInnerSubscriber<?, ?>)o)
.requestMore(1);

e++;
}


  1. The inside of the drain loop should look familiar, with the exception perhaps of the secondary loop. In the while loop, we detect completion by checking the active count against zero
  2. as well as by checking whether the error reference holds something non-null. 
  3. Since we put 2 objects into the queue, we have to take 2 objects out. If the first poll returns null, the queue is empty and if, at the same time, done is true, we have reached the end of all sources and can complete the child. 
  4. However, we can't just poll again because it is possible that the thread got interrupted between the two offer() calls in innerNext() above and thus the second element, the value to be emitted, isn't there yet. Therefore, we need an inner loop that keeps polling until the second element arrives (or the child unsubscribes). (Note that this can be avoided with specialized queues or with tuple types.)
  5. Once we have the second element, the real value, we unwrap it - with the help of NotificationLite - and emit it to the child subscriber. At the same time, we cast the sender back to FlatMapInnerSubscriber and ask for replenishment. The loop body ends by incrementing the emission count so we can detect whether we have fulfilled all current requests.
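The offer/poll protocol of points 3 and 4 can be modelled in isolation; this is a minimal, single-threaded sketch with made-up names, not the operator's actual code:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// The producer offers (sender, value) as two consecutive elements, so a
// consumer that has seen the sender must spin until the paired value shows up.
class TwoSlotQueue {
    final Queue<Object> queue = new ConcurrentLinkedQueue<>();

    void produce(Object sender, Object value) {
        // a producer thread may be preempted between these two offers
        queue.offer(sender);
        queue.offer(value);
    }

    Object[] consume() {
        Object sender = queue.poll();   // first slot: the sender
        if (sender == null) {
            return null;                // queue empty: nothing was produced
        }
        Object value;
        for (;;) {                      // second slot: spin until it arrives
            value = queue.poll();
            if (value != null) {
                break;
            }
        }
        return new Object[] { sender, value };
    }
}
```

In the real drain loop, the spin also checks for unsubscription so a preempted producer can't hang a departing consumer forever.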



    if (e == r) {
if (actual.isUnsubscribed()) {
return;
}
boolean done = active.get() == 0;
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
return;
}

if (done && queue.isEmpty()) {
actual.onCompleted();
return;
}
}

if (e != 0L) {
BackpressureUtils.produced(requested, e);
}

missed = wip.addAndGet(-missed);
if (missed == 0) {
break;
}
}

The last section deals with the case when the emission count reaches the requested count but all that's left is to complete the sequence. Incidentally, this case of e == r may happen if all sources were empty and the child didn't request either; the logic ensures the eager completion of the child Subscriber. If there were emissions, we deduct that amount from the requested field with the help of another utility method, then subtract the missed amount from the wip counter. If that reaches zero, no further work needs to happen. Otherwise, the outermost loop starts over (with a new missed amount).

Conclusion

If you look into how merge is implemented in RxJava 1.x or how flatMap is implemented in RxJava 2.x and in Reactor 2.5, you will find that they differ notably from the implementation I showed you. The difference is due to performance and functionality reasons that we will dive into in the next parts of this mini-series.

You may think, why not explain the implementation in RxJava 1.x immediately? The reason is twofold: complexity and building blocks. The implementation in this blog post is, in my opinion, more accessible for those who follow my blog posts and as usual, builds upon previously explained concepts as well as establishes a base for further concepts and tie-ins that will come later on.


FlatMap (part 2)


Introduction


In this post, we will look into expanding the features of our flatMap implementation and improve its performance.

RxJava's flatMap implementation offers limiting the maximum concurrency, that is, the maximum number of active subscriptions to the generated sources and allows delaying exceptions coming from any of the sources, including the main.

Limiting concurrency

For historical reasons, RxJava's flatMap (and our version of it from part 1) is unbounded towards the main source. This may work with infrequent main emissions and/or short-lived inner Observable sequences. However, while a main source such as range() can emit at any rate, the mapped inner Observables may consume limited resources such as network connections.

So the question is, how can we make sure only a user-defined number of active Observables are merged at once? How can we make sure some source emits only a limited number of values?

The answer is, of course, backpressure.

To limit the concurrency in flatMap, the idea is to request a maxConcurrency amount upfront via request(), and then whenever a source completes, request(1) extra.
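As an analogy (not how flatMap is actually wired up), the same bounded-in-flight idea can be sketched with a plain Semaphore, where the permit count plays the role of the upfront request(maxConcurrency) and releasing a permit plays the role of the extra request(1):

```java
import java.util.concurrent.Semaphore;

// Hypothetical model: at most maxConcurrency inner sources run at once.
class BoundedMerge {
    final Semaphore permits;

    BoundedMerge(int maxConcurrency) {
        this.permits = new Semaphore(maxConcurrency);
    }

    void runInner(Runnable source) {
        permits.acquireUninterruptibly(); // blocks while maxConcurrency sources are active
        try {
            source.run();                 // synchronous stand-in for an inner Observable
        } finally {
            permits.release();            // completion frees a slot, like request(1)
        }
    }
}
```

flatMap achieves the same bound declaratively: the main source simply never receives more outstanding requests than maxConcurrency.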

Let's change our OpFlatMap and FlatMapSubscriber's implementation to include this maxConcurrency parameter:


    final int maxConcurrency;

public OpFlatMap(Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch, int maxConcurrency) {
this.mapper = mapper;
this.prefetch = prefetch;
this.maxConcurrency = maxConcurrency;
}

@Override
public Subscriber<T> call(Subscriber<? super R> t) {
FlatMapSubscriber<T, R> parent =
new FlatMapSubscriber<>(t, mapper, prefetch, maxConcurrency);
parent.init();
return parent;
}

As a contract, we will handle Integer.MAX_VALUE as an indicator for the original unbounded mode:

        final int maxConcurrency;

public FlatMapSubscriber(Subscriber<? super R> actual,
Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch, int maxConcurrency) {
this.actual = actual;
this.mapper = mapper;
this.prefetch = prefetch;
this.csub = new CompositeSubscription();
this.wip = new AtomicInteger();
this.requested = new AtomicLong();
this.queue = new ConcurrentLinkedQueue<>();
this.active = new AtomicInteger(1);
this.error = new AtomicReference<>();

this.maxConcurrency = maxConcurrency;
if (maxConcurrency != Integer.MAX_VALUE) {
request(maxConcurrency);
}
}

Finally, we need to update innerComplete() to request another value from the main source:

        void innerComplete(Subscriber<?> inner) {
csub.remove(inner);

request(1);

onCompleted();
}

A quite straightforward use of backpressure. Note that innerComplete may be invoked concurrently by the different inner Observables; therefore, the main source's request handler must be thread-safe and reentrant-safe.


Delaying errors

By default, many standard operators terminate eagerly whenever they encounter an onError signal. If said operator does something with multiple sources, one sometimes wishes to process all non-error values first and only then act on any error signal that has popped up.


    final boolean delayErrors;

public OpFlatMap(Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch, int maxConcurrency, boolean delayErrors) {
this.mapper = mapper;
this.prefetch = prefetch;
this.maxConcurrency = maxConcurrency;
this.delayErrors = delayErrors;
}

@Override
public Subscriber<T> call(Subscriber<? super R> t) {
FlatMapSubscriber<T, R> parent =
new FlatMapSubscriber<>(t, mapper, prefetch, maxConcurrency, delayErrors);
parent.init();
return parent;
}

// ...

final boolean delayErrors;

public FlatMapSubscriber(Subscriber<? super R> actual,
Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch, int maxConcurrency, boolean delayErrors) {
this.actual = actual;
this.mapper = mapper;
this.prefetch = prefetch;
this.csub = new CompositeSubscription();
this.wip = new AtomicInteger();
this.requested = new AtomicLong();
this.queue = new ConcurrentLinkedQueue<>();
this.active = new AtomicInteger(1);
this.error = new AtomicReference<>();

this.maxConcurrency = maxConcurrency;
if (maxConcurrency != Integer.MAX_VALUE) {
request(maxConcurrency);
}

this.delayErrors = delayErrors;
}




Delaying errors within flatMap is simple in terms of the delay part, but requires some extra effort on the errors part: at the very end, we need to emit a single onError signal no matter how many sources (main or inner) signalled onError before. Certainly, keeping only the very first error till the end is a possible option, but dropping the rest of the errors may not be desirable either. The solution is to collect all Throwables in some data structure and emit a CompositeException at the end.

Using a concurrent Queue<Throwable> for this purpose is an option (RxJava does this), but we can reuse our existing error AtomicReference and perform a compare-and-swap loop to accumulate all the exceptions:


        @Override
public void onError(Throwable e) {
if (delayErrors) {
for (;;) {
Throwable current = error.get();
Throwable next;
if (current == null) {
next = e;
} else {
List<Throwable> list = new ArrayList<>();
if (current instanceof CompositeException) {
list.addAll(((CompositeException)current).getExceptions());
} else {
list.add(current);
}
list.add(e);

next = new CompositeException(list);
}

if (error.compareAndSet(current, next)) {
if (active.decrementAndGet() == 0) {
drain();
}
return;
}
}
} else {
if (error.compareAndSet(null, e)) {
unsubscribe();
drain();
} else {
RxJavaPlugins.getInstance()
.getErrorHandler().handleError(e);
}
}
}


In the loop, we take the current error and if it is null, we update it to the given exception. If there is an error already, we create a CompositeException to hold both the current and the new exception. However, if the current error happens to be a CompositeException itself, we flatten out its whole list of previous errors; this gives a nice, flat list of errors at the end inside a single CompositeException. Since onError is now, like onCompleted, a per-source terminal event instead of a global one, we decrement the active count and trigger a drain if it reaches zero.

Given Java 7's Throwable.addSuppressed, you may be tempted to use that to collect exceptions, but it has some drawbacks: it uses synchronized and needs a parent exception upfront, which costs time to create even if no exception occurs after all. In addition, modifying an existing exception that already has suppressed exceptions of its own may be more confusing to untangle.
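For comparison, here is the minimal shape of the addSuppressed alternative the paragraph argues against; note how the parent exception must be allocated before any error has actually happened:

```java
// The suppressed-exception mechanism: a container Throwable collects
// the individual errors via the (synchronized) addSuppressed call.
class SuppressedDemo {
    static Throwable collect() {
        Throwable parent = new RuntimeException("container"); // created upfront
        parent.addSuppressed(new IllegalStateException("first"));
        parent.addSuppressed(new IllegalArgumentException("second"));
        return parent;
    }
}
```

The CAS loop above avoids both costs: no allocation happens until the first error arrives, and no locking is involved.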

Since an innerError is no longer an immediate terminal condition, we need to adjust the method to remove the inner subscriber from the tracking structure as well as ask for replenishment in case the flatMap operator also runs with limited concurrency:


        void innerError(Throwable ex, Subscriber<?> inner) {
if (delayErrors) {
csub.remove(inner);
request(1);
}
onError(ex);
}

Lastly, the drain() method needs adjustments as well. The default implementation signalled the onError the moment it detected it. This has to change so that an error is only emitted once all values inside the shared queue have been emitted (just like the completion event):


                    boolean done = active.get() == 0;
if (!delayErrors) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
return;
}
}

Object o = queue.poll();

if (done && o == null) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
} else {
actual.onCompleted();
}
return;
}

if (o == null) {
break;
}

The original error emission case is now behind a check for delayErrors being false. Otherwise, we check if all sources terminated and the queue is empty and then check if there is any error. We emit the terminal event accordingly and quit.

In addition, we need to update the e == r case (i.e., the case when we emitted the requested amount and the next signal would be a terminal event):


                if (e == r) {
if (actual.isUnsubscribed()) {
return;
}
boolean done = active.get() == 0;

if (!delayErrors) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
return;
}
}

if (done && queue.isEmpty()) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
} else {
actual.onCompleted();
}
return;
}
}

Practically the same as above, except for the isEmpty() check instead of poll(), as we don't want to consume a value if there is one.

Now we are done with the extra features of OpFlatMap (don't forget to change the FlatMapInnerSubscriber.onError to parent.innerError(e, this); by the way).


Increasing the performance of the Queue

Our flatMap implementation is decent in performance, but all those thread-safety features put an extra toll on the throughput.

The default queue implementation we employed, ConcurrentLinkedQueue, is nice, but all those unused Queue features entail an unnecessary overhead; plus, our usage pattern is just multiple-producer single-consumer.

Fortunately, the JCTools library offers higher-performance Queue implementations, often with 5-20x lower overhead. We can just replace the queue implementation with MpscLinkedQueue (or the fresh MpscGrowableArrayQueue). In addition, if the flatMap runs with some maxConcurrency value, you can even use an MpscArrayQueue directly (with a capacity of maxConcurrency * prefetch), but note that the capacity of array-based queues is rounded up to the next power-of-two value and may waste space.
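The power-of-two rounding mentioned above can be sketched as follows (a reconstruction of the typical sizing logic, not JCTools' actual code):

```java
// Array-based queues typically size their backing array to the next
// power of two so that index wrapping becomes a cheap bitmask operation.
class Pow2 {
    static int roundUp(int capacity) {
        // smallest power of two >= capacity (assumes 1 <= capacity <= 2^30)
        return 1 << (32 - Integer.numberOfLeadingZeros(capacity - 1));
    }
}
```

For example, a requested capacity of maxConcurrency * prefetch = 100 yields a 128-slot array, wasting 28 slots.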

This change gives a decent throughput increase, but can we do better? Let me answer with another question: what is the overhead of not using the queue at all? Practically zero! If we bypassed the queue completely, that would save even more overhead!

Now the question is how and when can we bypass the queue? In other terms, when can we emit a value?

To emit a value, two conditions must be met: 1) no other source is trying to emit at the same time and 2) the downstream has requested some amount.

The first condition is ensured by the wip counter and the second is checked in the drain() loop. If you remember, the wip counter actually encodes 3 states: if zero, nobody is emitting; if 1, there is a drain going on; and 2+ indicates more work has to be performed. Therefore, if we can change the wip from 0 to 1, condition 1) is met. Next, we have to check the requested amount for condition 2) and if that is met too, we can emit the value directly to the downstream Subscriber.

To accomplish this, we have to extend the innerNext() method with some bypass logic:


        void innerNext(FlatMapInnerSubscriber<T, R> inner, R value) {
Object v = NotificationLite.instance().next(value); // (1)

if (wip.get() == 0 && wip.compareAndSet(0, 1)) { // (2)
if (requested.get() != 0L) { // (3)
actual.onNext(value); // (4)
BackpressureUtils.produced(requested, 1);
inner.requestMore(1);
} else {
queue.offer(inner); // (5)
queue.offer(v);
}

if (wip.decrementAndGet() != 0) { // (6)
drainLoop();
}
return;
}

queue.offer(inner); // (7)
queue.offer(v);
drain();
}

void drain() {
if (wip.getAndIncrement() != 0) {
return;
}
drainLoop();
}

void drainLoop() {
int missed = 1;
// ...

This pattern should be somewhat familiar too; it is a fast-path queue-drain approach introduced at the very beginning.


  1. First, we wrap the potentially null value upfront, as the wrapped form may be needed in two places later on.
  2. If the wip value is zero and can be successfully changed to 1, we enter into the serialized drain mode and are now free to emit ...
  3. if the requested amount is non-zero.
  4. Therefore, if there is no contention and there is request from downstream, we emit and don't touch the queue. Once emitted, we have to reduce the requested amount by 1 and request replenishment from the source.
  5. If the downstream hasn't requested, we have to revert to the original queuing behavior and store the value for later use.
  6. In the drain mode, we decrement the wip counter and if more work has arrived in the meantime, we resume with the loop-part of the former drain. This also means the drain() method has to be refactored into drain() and drainLoop() methods. Using the original drain() here wouldn't work because it would skip its loop due to wip being non-zero already.
  7. In case there was a contention, i.e., wip was non-zero, we have to revert to the original queue-drain behavior.
You may recall we use wip.getAndIncrement() == 0 to enter the serialized mode elsewhere, but not here. The reason is that although getAndIncrement scales better and is intrinsified into a single CPU instruction on x86, it has more overhead compared to a single compareAndSet call when there is no contention - and we base our fast-path optimization on exactly this no-contention property.

The order of checking the requested amount and entering the drain mode cannot be swapped. Imagine an emission finds the requested amount non-zero before entering the drain mode, but just before that, an already running drain decremented the requested amount to zero. When the first thread then enters the drain mode itself and emits the value, it violates the backpressure contract. Naturally, if you think the downstream will request infrequently and thus the requested amount is usually zero, you can implement innerNext in a way that checks the requested amount both before and after entering the drain mode.


Increasing performance in high-contention case

The queue-bypass optimization has its limits; it's almost never triggered when all sources emit quite rapidly, causing contention all the time.

This contention affects both the shared queue and the wip counter; therefore, we could gain some performance by getting rid of one or both contention points. Unfortunately, wip is essential and unavoidable, so let's look at the queue instead.

The problem is that all concurrent sources use the same queue and contend on the queue's offer() side, thus requiring a multi-producer queue instance that uses the heavyweight getAndSet() or getAndIncrement() atomic operations internally.

However, since each source is sequential by nature, we practically have single-threaded producers, N at once and due to the drain loop, there is only a single consumer to all of those sources.

The solution is to use a single-producer single-consumer queue for each source and in the drain loop, collect values from all of them individually. A great opportunity for JCTools again with its ultra-high performance SpscArrayQueue. We can use the array variant because our prefetch value is expected to be reasonably low; RxJava runs with 128 by default.

This requires some modest changes to both the FlatMapInnerSubscriber and its FlatMapSubscriber parent as well:

    static final class FlatMapInnerSubscriber<T, R> extends Subscriber<R> {
final FlatMapSubscriber<T, R> parent;

final int prefetch;

volatile Queue<Object> queue;

volatile boolean done;

public FlatMapInnerSubscriber(
FlatMapSubscriber<T, R> parent, int prefetch) {
this.parent = parent;
this.prefetch = prefetch;
request(prefetch);
}

@Override
public void onNext(R t) {
parent.innerNext(this, t);
}

@Override
public void onError(Throwable e) {
done = true;
parent.innerError(e, this);
}

@Override
public void onCompleted() {
done = true;
parent.innerComplete(this);
}

void requestMore(long n) {
request(n);
}

Queue<Object> getOrCreateQueue() {
Queue<Object> q = queue;
if (q == null) {
q = new SpscArrayQueue<>(prefetch);
queue = q;
}
return q;
}
}

The FlatMapInnerSubscriber gets two new fields: one storing the prefetch amount, to be used later when creating the SpscArrayQueue, and the Queue instance itself. In addition, we need to know if the source has finished emitting events, via a done flag of its own. Of course, we could pre-create the queue, but then we would forfeit the benefit of the fast path from the previous section, which doesn't require a queue if it succeeds. Regardless, if a queue is eventually needed, getOrCreateQueue() will make that happen. Note that although such a queue is created by a single thread, it may be read from the draining thread and thus the field has to be volatile.

Next step is to change the innerNext() to work with this per-source queue instead of the shared one:


        void innerNext(FlatMapInnerSubscriber<T, R> inner, R value) {
Object v = NotificationLite.instance().next(value);

if (wip.get() == 0 && wip.compareAndSet(0, 1)) {
if (requested.get() != 0L) {
actual.onNext(value);
BackpressureUtils.produced(requested, 1);
inner.requestMore(1);
} else {

Queue<Object> q = inner.getOrCreateQueue();


q.offer(v);
}

if (wip.decrementAndGet() != 0) {
drainLoop();
}
return;
}

Queue<Object> q = inner.getOrCreateQueue();


q.offer(v);
drain();
}

This incurs a small change only in the form of inner.getOrCreateQueue() as the target queue in case of contention or missing downstream requested. (At this point, one could remove the main queue from the parent class, but let's hold onto it a bit more.)

Unfortunately, this per-source queue causes some trouble because the drainLoop() can no longer use the shared queue and has to know about the currently active sources in some way, but the CompositeSubscription doesn't expose its contents. In addition, the CompositeSubscription uses an internal HashSet which has to be iterated in a thread-safe manner, adding so much overhead to the common case that all our hard work goes to waste.

Instead, we can use the same copy-on-write subscriber tracking we did with Subjects and ConnectableObservables. This gives us a nice array of FlatMapInnerSubscribers and can get rid of both csub and active fields.


        @SuppressWarnings("rawtypes")
static final FlatMapInnerSubscriber[] EMPTY = new FlatMapInnerSubscriber[0];
@SuppressWarnings("rawtypes")
static final FlatMapInnerSubscriber[] TERMINATED = new FlatMapInnerSubscriber[0];

final AtomicReference<FlatMapInnerSubscriber<T, R>[]> subscribers;

volatile boolean done;

@SuppressWarnings("unchecked")
public FlatMapSubscriber(Subscriber<? super R> actual,
Func1<? super T, ? extends Observable<? extends R>> mapper,
int prefetch, int maxConcurrency, boolean delayErrors) {
this.actual = actual;
this.mapper = mapper;
this.prefetch = prefetch;
this.wip = new AtomicInteger();
this.requested = new AtomicLong();
this.error = new AtomicReference<>();

this.subscribers = new AtomicReference<>(EMPTY);

this.maxConcurrency = maxConcurrency;
if (maxConcurrency != Integer.MAX_VALUE) {
request(maxConcurrency);
}
this.delayErrors = delayErrors;
}


We have the usual empty and terminated indicator arrays and a volatile done field which becomes true once the main source completes. The initialization logic has to change as well; plus, we need the usual add(), remove() and terminate() methods:

        public void init() {
add(Subscriptions.create(this::terminate));
actual.add(this);
actual.setProducer(new Producer() {
@Override
public void request(long n) {
childRequested(n);
}
});
}

@SuppressWarnings("unchecked")
void terminate() {
FlatMapInnerSubscriber<T, R>[] a = subscribers.get();
if (a != TERMINATED) {
a = subscribers.getAndSet(TERMINATED);
if (a != TERMINATED) {
for (FlatMapInnerSubscriber<T, R> inner : a) {
inner.unsubscribe();
}
}
}
}

boolean add(FlatMapInnerSubscriber<T, R> inner) {
for (;;) {
FlatMapInnerSubscriber<T, R>[] a = subscribers.get();
if (a == TERMINATED) {
return false;
}
int n = a.length;
@SuppressWarnings("unchecked")
FlatMapInnerSubscriber<T, R>[] b = new FlatMapInnerSubscriber[n + 1];
System.arraycopy(a, 0, b, 0, n);
b[n] = inner;
if (subscribers.compareAndSet(a, b)) {
return true;
}
}
}

@SuppressWarnings("unchecked")
void remove(FlatMapInnerSubscriber<T, R> inner) {
for (;;) {
FlatMapInnerSubscriber<T, R>[] a = subscribers.get();
if (a == TERMINATED || a == EMPTY) {
return;
}
int n = a.length;
int j = -1;
for (int i = 0; i < n; i++) {
if (a[i] == inner) {
j = i;
break;
}
}

if (j < 0) {
return;
}

FlatMapInnerSubscriber<T, R>[] b;
if (n == 1) {
b = EMPTY;
} else {
b = new FlatMapInnerSubscriber[n - 1];
System.arraycopy(a, 0, b, 0, j);
System.arraycopy(a, j + 1, b, j, n - j - 1);
}
if (subscribers.compareAndSet(a, b)) {
return;
}
}
}

The onNext method has a small change: the inner subscriber is only subscribed if add() succeeds, i.e., the operator hasn't been unsubscribed in the meantime:


        @Override
public void onNext(T t) {
Observable<? extends R> o;

try {
o = mapper.call(t);
} catch (Throwable ex) {
Exceptions.throwOrReport(ex, this, t);
return;
}

FlatMapInnerSubscriber<T, R> inner =
new FlatMapInnerSubscriber<>(this, prefetch);

if (add(inner)) {
o.subscribe(inner);
}
}


The onError method has a small change too; there is no active field to decrement and thus the drain is always called:

                    if (error.compareAndSet(current, next)) {
drain();
return;
}

The onCompleted no longer decrements the active field but instead has to set the done flag:


        @Override
public void onCompleted() {
done = true;
drain();
}


The innerError and innerComplete methods also get simpler:

        void innerError(Throwable ex, FlatMapInnerSubscriber<T, R> inner) {
onError(ex);
}

void innerComplete(FlatMapInnerSubscriber<T, R> inner) {
drain();
}

As usual, all that simplification is eventually offset by complication somewhere else. In our case, the drain loop gets more complicated: now it has to iterate over all active sources and drain their queues, ask for replenishments for the sources as well as the main source.


        void drainLoop() {

int missed = 1;

for (;;) {
boolean d = done;

FlatMapInnerSubscriber<T, R>[] a = subscribers.get();

long r = requested.get();
long e = 0L;
int requestMain = 0;
boolean again = false;

if (isUnsubscribed()) {
return;
}



The drain loop now has some additional variables. We get the fresh array of subscribers upfront and introduce a counter for requesting more from the main source as well as a flag indicating that the front of this outer loop needs to be executed again. Note that the done flag has to be checked before reading the current subscribers array due to a possible race with onNext.

                if (!delayErrors) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
return;
}
}

if (d && a.length == 0) {
Throwable ex = error.get();
if (ex != null) {
actual.onError(ex);
} else {
actual.onCompleted();
}
return;
}


The next section deals with a delayed or non-delayed error condition. Note also that the done indicator is not used on its own but in conjunction with the length of the inner subscriber array; we know we reached a terminal state if both the main has terminated and we don't have any active inner subscribers (empty array).

                for (FlatMapInnerSubscriber<T, R> inner : a) {
if (isUnsubscribed()) {
return;
}

d = inner.done;
Queue<Object> q = inner.queue;
if (q == null) {
if (d) {
remove(inner);
requestMain++;
again = true;
}
} else {


The next part has to loop over all subscribers to see if any of them has values in its respective queue, provided it actually has a queue to begin with; it is possible the fast path was taken for this source all the way through and no queue was ever created via getOrCreateQueue(). In that case, all that remains is to remove the inner subscriber and indicate replenishment from the main source.

                        long f = 0L;

                        while (e != r) {
                            if (isUnsubscribed()) {
                                return;
                            }

                            d = inner.done;
                            Object v = q.poll();
                            boolean empty = v == null;

                            if (d && empty) {
                                remove(inner);
                                requestMain++;
                                again = true;
                            }

                            if (empty) {
                                break;
                            }

                            actual.onNext(NotificationLite.<R>instance().getValue(v));

                            e++;
                            f++;
                        }

                        if (f != 0L) {
                            inner.requestMore(f);
                        }

                        if (e == r) {
                            if (inner.done && q.isEmpty()) {
                                remove(inner);
                                requestMain++;
                                again = true;
                            }
                            break;
                        }

This is a fairly usual drain loop, with the addition of the remove/replenishment logic and a break statement to stop the loop once the emission count reaches the request count and no more values can be emitted. Note the use of the f counter, which tracks how many items were consumed from that particular inner FlatMapInnerSubscriber.


                    }
                }

                if (e != 0L) {
                    BackpressureUtils.produced(requested, e);
                }
                if (requestMain != 0) {
                    request(requestMain);
                }

                if (again) {
                    continue;
                }

                missed = wip.addAndGet(-missed);
                if (missed == 0) {
                    break;
                }
            }
        }
    }

The final part performs the necessary request tracking, replenishment and missed work handling.

Given the hundreds of lines of complicated code, you should now understand why flatMap is one of the most complicated operators we have.


Inner request rebatching

Before finishing up this post, let's do a small, final optimization to the latest flatMap structure.

If you look at the innerNext() logic, you see that whenever the fast-path is taken, 1 item is requested as a replacement for the emitted item, every time. If you imagine the source is a range() operator, such 1-by-1 requests yield an atomic increment after each and every value emitted by range(), adding more overhead.

Fortunately, since the inner sources use a fixed prefetch amount, we can define a re-request point and batch up those 1-by-1 requests into a much larger amount, amortizing the request-tracking overhead in the source sequence.

This point can be anywhere between 1 and the prefetch amount and, generally, the best value depends on how the source emits values. Unfortunately, libraries can't do much to tune this per source, and any adaptive logic adds so much overhead that it likely negates any benefit. Therefore, RxJava chose to re-request after half of the prefetch amount has been emitted (and lately, I tend to use 75% of the prefetch for it).

The solution requires two additional fields and a change to the requestMore() method in FlatMapInnerSubscriber:


        final int limit;

        long produced;

        public FlatMapInnerSubscriber(
                FlatMapSubscriber<T, R> parent, int prefetch) {
            this.parent = parent;
            this.prefetch = prefetch;

            this.limit = prefetch - (prefetch >> 2);

            request(prefetch);
        }

        void requestMore(long n) {
            long p = produced + n;
            if (p >= limit) {
                produced = 0;
                request(p);
            } else {
                produced = p;
            }
        }
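To see the rebatching in action, here is a stand-alone sketch of the same arithmetic; the Rebatcher class and its counters are hypothetical stand-ins for FlatMapInnerSubscriber and the requests it would send to its upstream Producer:

```java
// Stand-alone sketch of the request-rebatching logic; Rebatcher and its
// counters are hypothetical stand-ins, not the actual RxJava classes.
class RebatchingDemo {
    static class Rebatcher {
        final int limit;
        long produced;
        long batchedRequests; // total amount that would go upstream
        int batchCount;       // how many upstream request() calls were issued

        Rebatcher(int prefetch) {
            // re-request after 75% of the prefetch has been emitted
            this.limit = prefetch - (prefetch >> 2);
        }

        void requestMore(long n) {
            long p = produced + n;
            if (p >= limit) {
                produced = 0;
                batchedRequests += p; // one batched request(p)
                batchCount++;
            } else {
                produced = p;
            }
        }
    }

    public static void main(String[] args) {
        Rebatcher r = new Rebatcher(128); // limit = 96
        for (int i = 0; i < 96; i++) {
            r.requestMore(1); // 96 one-by-one calls...
        }
        System.out.println(r.batchCount);      // ...became 1 upstream request
        System.out.println(r.batchedRequests); // of 96 items
    }
}
```

With prefetch of 128, the 96 one-by-one replenishments collapse into a single upstream request, amortizing the atomic request-tracking cost.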


Conclusion


In this post, I showed ways to improve the functionality and performance of our flatMap operator. The diligent reader may check whether we have reached the structure of RxJava's flatMap implementation, but the answer is: not yet. Apart from cutting an already long post short, there are a couple more optimizations remaining we could apply.

The first one exploits the likelihood that the last source cut short due to a lack of requests is the candidate to resume emitting values once more requests come in. Saving and restoring an index in the for-loop over the FlatMapInnerSubscriber array helps with this.

The second is called the scalar-optimization and handles the case when one flatMaps Observables of Observable.just(), avoiding the subscription overhead of such sources. This optimization adds significantly more logic to our drainLoop() method, plus it has its own queue-bypass optimization as well.

In the next part of this series, I'll add these remaining two optimizations as well as something even better. However, to understand this mysterious optimization, of which the scalar-optimization is actually a member, we have to learn about something that requires almost perfect knowledge of the internals of not just flatMap, but every other operator as well.

We call it operator fusion.

RxJava design retrospect


Introduction


RxJava has now been out for more than 3 years and has lived through several significant version changes. In this blog post, I'll point out design and implementation decisions that I personally think weren't such a good idea.

Don't get me wrong, it doesn't mean that RxJava is bad or that I knew all along how to do it "properly". It was a learning process for all of us involved, but the question is: can we learn from those mistakes and do better in the next major version?


Synchronous unsubscription


In the early days, RxJava mirrored the architecture of Rx.NET which consisted of two important interfaces, IObservable and IObserver, derived through dualizing the IEnumerable and IEnumerator. (This was also true for my own library, Reactive4Java).

If we look at IObservable, we find the subscribe() method that returns an IDisposable. This returned object allows one to dispose or cancel a running sequence. However, it has a critical problem I demonstrate with a minimalistic reactive program:

interface IDisposable {
    void dispose();
}

interface IObserver<T> {
    void onNext(T t);
}
interface IObservable<T> {
    IDisposable subscribe(IObserver<T> observer);
}  


IObservable<Integer> source = o -> {
    for (int i = 0; i < Integer.MAX_VALUE; i++) {
        o.onNext(i);
    }

    return () -> { };
};

IDisposable d = source.subscribe(System.out::println);
d.dispose();

If we run this code, it starts to print a lot of numbers to the console, despite the fact that we called dispose() on the object returned by the subscribe() method. What's wrong?

The problem is that the source observable can only return its IDisposable object after the for-loop finishes, but by then there is nothing left to dispose. The whole setup is synchronous and thus this structure can't reasonably be cancelled.

Although Rx is good at async processing, many steps in a typical pipeline are synchronous and are affected by this synchronous cancellation requirement. Since Rx.NET is at least 3 years older than RxJava, how could this shortcoming still be present in today's Rx.NET?

The example code above is just the well-known range() operator, and if we run similar code in C#, we find that it doesn't print or stops printing almost immediately. The secret is that Rx.NET's range() operator runs on an async scheduler by default, so the for-loop runs on a different thread and the operator can immediately return a meaningful IDisposable. Therefore, the synchronous cancellation issue is averted, but I wonder: was it a conscious or unconscious decision to sidestep the underlying problem? Who knows.

If you look at the source code of Rx.NET's range, you'll find something more complicated. It uses a recursive scheduling technique to deliver each value to the observer. When I measured it, it could only sustain 1M ops/second on the same machine which could do 250M ops/second with RxJava while delivering 1M elements.

Now, RxJava's range() never used a scheduler, thus the synchronous cancellation problem was discovered and mitigated by introducing the Subscriber class. An instance can be checked to see whether it still wants events or not. The example above can be rewritten so the for-loop checks its subscriber and quits accordingly.


Observable<Integer> source = Observable.create(s -> {
    for (int i = 0; i < Integer.MAX_VALUE && !s.isUnsubscribed(); i++) {
        s.onNext(i);
    }
});

Subscription d = source.subscribe(new Subscriber<Integer>() {
    @Override
    public void onNext(Integer v) {
        System.out.println(v);
        unsubscribe();
    }

    // ...
});

The Subscriber acts like a take(1) and unsubscribes itself after the first item. This unsubscribe() call internally sets a volatile boolean flag that is read by isUnsubscribed() and the loop above is stopped. Note, however, that you still can't unsubscribe via the Subscription returned by subscribe() because the lambda with the for loop doesn't exit until its terminal condition is met.
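The flag-based cooperation can be sketched in isolation; SimpleSubscriber below is a hypothetical reduction of rx.Subscriber to just its unsubscription flag:

```java
// Minimal stand-in for the volatile-flag mechanism; SimpleSubscriber is a
// hypothetical reduction of rx.Subscriber, for illustration only.
class UnsubscribeDemo {
    static class SimpleSubscriber {
        volatile boolean unsubscribed;

        void unsubscribe() {
            unsubscribed = true;
        }

        boolean isUnsubscribed() {
            return unsubscribed;
        }
    }

    public static void main(String[] args) {
        SimpleSubscriber s = new SimpleSubscriber();
        int emitted = 0;
        // the producing loop checks the flag and stops cooperatively
        for (int i = 0; i < Integer.MAX_VALUE && !s.isUnsubscribed(); i++) {
            emitted++;
            if (emitted == 5) {
                s.unsubscribe(); // acts like take(5)
            }
        }
        System.out.println(emitted); // 5
    }
}
```

Because the flag is volatile, the same check also works when unsubscribe() is invoked from another thread.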

It doesn't seem to solve our initial problem that well, does it? Luckily, this new structure has the property that it can be cancelled before or while the loop is running, the latter done from another thread of course:


Subscriber<Integer> s = new Subscriber<Integer>() {
    @Override
    public void onNext(Integer v) {
        System.out.println(v);
    }
    // ...
};

Scheduler.Worker w = Schedulers.computation().createWorker();
w.schedule(s::unsubscribe, 1, TimeUnit.SECONDS);

source.subscribe(s);

The fact that you can basically inject the cancellation support upfront, before anything is even subscribed to, allows proper propagation of unsubscription even with the most complicated operators.

In addition, there is a deeper implication of this new structure. In the new Observable, the lambda doesn't return anything, but Observable.subscribe() still returns a Subscription, which is practically the same Subscriber sent in as the parameter. (Technically, it is a slightly more involved process; see Jake Wharton's excellent video talk on the subject.)

The insight: you can't be fully reactive if you return something. Returning something implies synchronous behavior and the method has to provide some result, even though it can't at that moment. This is when one is forced to block or sleep until the real logic can produce the relevant object to be returned. I showed this in my post about the OSGi Asynchronous Event Streams initiative.


Resources of the Subscriber

The Subscriber class offers the ability to associate resources with it in the form of Subscription instances. When the operator is cancelled (or terminates), these resources are unsubscribed as well.

This is quite a convenience for operator developers, however, has its own cost: allocation.

Whenever a Subscriber is instantiated with the default constructor, an inner SubscriptionList instance is also created, whether or not the Subscriber is likely to hold resources. In the previous example, range() doesn't need resources, thus the SubscriptionList is never really used.

On one hand, there exist many operators that don't manage resources so creating the extra container is wasteful. On the other hand, many operators do use resources and expect this convenience to be present.

In addition, you may recall that Subscriber has a constructor that takes another Subscriber and gives the option to share the underlying SubscriptionList. Certainly, this could help reduce the allocation count, but most operators that use resources themselves can't share the same underlying SubscriptionList, as this would allow them to unsubscribe resources downstream (see pitfall #2). Thus, the current Subscriber structure is, performance-wise, more of a burden than a win for operator writers.

You may now think, what's wrong with giving convenience tools to operator writers? I agree that operators implemented outside RxJava should get as much help as reasonable, however, I believe internal operators should take the diligence and have efficient implementations to begin with.

I've tried a few times to resolve this problem but given the architecture of 1.x, I have doubts it can be achieved. Fortunately, the Reactive-Streams' architecture and thus RxJava 2.x solves this problem by making the resource management the responsibility of the operators.


Subscriber request

If you look into how Subscriber is implemented, you'll see the protected final request() method. This is a convenience method that makes sure if there is a Producer set via setProducer, the request is forwarded to it or accumulated until one Producer arrives. Basically, this is an inlined producer-arbiter.

One might think the method's implementation adds significant overhead to request management, but JMH benchmarks confirmed it doesn't really affect the overhead outside a small +/- 3% difference, which may also be due to noise.

The real problem with this method is that it has the same name as Producer.request, making it impossible to implement Producer when one extends Subscriber at the same time.

This has the unfortunate consequence that one usually needs an extra Producer object along with the main Subscriber if the operator does some request-manipulation.

This has the consequence of extra allocation during subscription time, which affects GC the most with short-lived sequences. It also increases the call-stack depth and may prevent some JIT optimizations.

Since Subscriber.request() is also part of the public API, it can't be renamed in 1.x to make room for Producer.request().

Again, the solution will come with 2.x: there, since the Reactive-Streams Subscriber and Subscription are both interfaces, both can be implemented at the same time; plus, a convenience request() method can be moved into a convenience implementation of Subscriber (i.e., AsyncSubscriber) without affecting the operator internals. (This also means it will be discouraged to use convenience Subscribers within operators.)


Lift

Along with backpressure, the method Observable.lift() is considered by many as the best addition to the library. It lets you step into the subscription process and given a Subscriber from downstream, you can return another Subscriber for upstream that does the business logic for that operator.

It became so popular that almost all instance operators of Observable now use it.

Unfortunately, the convenience has a cost: allocation. For most operators, applying that operator to a sequence incurs 3 object allocations. To show this, let's unroll the application of the map() operator:


public final <R> Observable<R> map(Func1<? super T, ? extends R> func) {
    OperatorMap<T, R> op = new OperatorMap<T, R>(func);
    return new Observable<R>(new OnSubscribe<R>() {
        @Override
        public void call(Subscriber<? super R> child) {
            Subscriber<? super T> parent = op.call(child);
            Observable.this.unsafeSubscribe(parent);
        }
    });
}

We have 1) the Operator instance, 2) the Observable instance and 3) the OnSubscribe instance for each application.

This may not be of concern for direct sequences that use map(), but imagine having these 3 allocations a million times because you happen to flatMap something whose inner Observables have operators applied to them:


Observable.range(1, 1_000_000).flatMap(v -> 
        Observable.just(v).observeOn(Schedulers.computation()).map(w -> w * w))
    .subscribe(...);


The lift operator is practically an OnSubscribe instance that captures the upstream Observable and calls Operator.call with the downstream Subscriber. Clearly, one could just implement operators directly with OnSubscribe and take the upstream Observable as a parameter; the total instance sizes wouldn't change much, but both the allocation count and the stack depth would be reduced.

The current lift structure has another adverse effect: it makes operator-fusion difficult to impossible in its current form because 1) it is an anonymous class and one can't discover its upstream Observable and Operator easily, and 2) even if made a named class, the two classes are hidden behind indirection and any discovery process now faces more overhead.

Luckily, the shortcomings mentioned so far can be remedied without affecting the public API, but it requires the diligence of writing and reviewing thousands of lines of code changes.

Unfortunately, when I implemented RxJava 2.0 developer-preview last September, I did not think of this overhead thus the current 2.x branch still uses lift() extensively.

However, there is light at the end of the tunnel: Project Reactor 2.5 doesn't go down the lift() path and now has lower overhead than RxJava.


Create


Lately, I'm quite outspoken against Observable.create() and now I think it should be named something more scary so beginners avoid it and look for proper factory methods in Observable that handle backpressure and unsubscription properly. I can see it as a tool for demonstrating to one's audience how to enter the reactive world, but I'm convinced it should receive less spotlight in those presentations.

Regardless, the problem with create() is that it encourages creating 2 instances per Observable: 1) the Observable instance itself and 2) the OnSubscribe holding the subscription logic.

The approach that one creates an Observable instance with create() was born from the encouragement "composition over inheritance". From a general design perspective, this sounds okay, but one has to note that in Java, composition means object allocation: outer objects and inner objects, and more inner-inner objects.

To avoid all these allocations, the solution would be to make Observable not hold an instance of OnSubscribe by default (but keep create() as the lambda-factory version) and have operators (both source and intermediate) extend Observable directly. All operator methods would still reside in Observable:


public final <R> Observable<R> map(Func1<? super T, ? extends R> func) {
    return new ObservableMap<T, R>(this, func);
}


Thus, without lift() and create(), map() would allocate a single Observable instance per application.

Such a change, I believe, wouldn't affect the public API since Observable methods are static or final to begin with and operators would still be subclasses of Observable. The change would also help with operator-fusion because each upstream source can now be directly identified and its parameters exposed without indirection.

Again, Project Reactor 2.5 is ahead of RxJava and doesn't use the create() mechanics. Its operators are implemented by extending a base class, Flux, in the way suggested above.


Conclusion


Designing and implementing RxJava was and is a learning process, with unanticipated effects on complexity and performance.

You may think: why the hassle about structures and allocations that clearly work in their current form? Two reasons: the cloud and Android/IoT. In the cloud, where billions of events happen, any inefficiency or unnecessary overhead is amplified along with the numbers. You may not easily calculate how much that range-flatMap example above costs you on your laptop, but cloud suppliers will make you pay for each second, gigabyte and gigahertz of using their service. For Android and IoT, the resource constraints of the devices and the expectation of more and more features eventually require one to budget memory usage, GC and battery life.

Operator-fusion (Part 1)


Introduction

Operator-fusion, one of the cutting-edge research topics in the reactive programming world, aims to have two or more subsequent operators combined in a way that reduces the overhead (time, memory) of the dataflow.

(Other cutting-edge topics are: 1) reactive IO, 2) more native parallel async sequences and 3) transparent remote queries.)

The key insight with operator-fusion is threefold:

  1. many sequences are started from constant or quasi-constant sources such as just(), from(T[]), from(Iterable), fromCallable() which don't really need the thread-safety dance in a sequence of operators,
  2. some pairs of operators can share internal components such as Queues and
  3. some operators can tell if they consumed the value or dropped it, avoiding request(1) call overhead.

In this mini-series, I'll describe the hows and whys of operator-fusion, as we currently understand it. By "we", I mean the joint research effort on optimizing Reactive-Streams operators beyond what's there in RxJava 2.x and has been in previous versions of Project Reactor.

The experimentation happens in the reactive-streams-commons, Rsc for short, GitHub repository. The results of Rsc are now driving Project Reactor 2.5 (currently in milestone 2) and are verified by a large user base. Hopefully, RxJava can benefit from the results as well (but maybe not before 3.x).

If you are following Akka-Streams, you might have read/heard about operator-fusion there as well. As far as I understand their approach, the objective is to make sure more stages of the pipeline run on the same Actor, avoiding the previously very likely thread-hopping within their sequences. Essentially, there is now a mode where the developer can define the async boundaries in the pipeline. Does this sound familiar? From day 1, Rx-based libraries let you do this.

Generations

Reactive libraries and the associated concepts have evolved over time. What we had 7 years ago in Rx.NET is, requirements- and implementation-wise, significantly different from what we'll have tomorrow with libraries such as Project Reactor.

Based on my experience with the history of "modern" reactive programming, I categorize the libraries into generations.

0th generation

The very first generation of reactive programming tools mainly consist of java.util.Observable API and its cousins in other languages and almost any callback-based API such as addXXXListener in Swing/AWT/Android.

The Observable API was most likely derived from the Gang-of-Four design patterns book (or the other way around, who knows) and has the drawback of being inconvenient to use and non-composable. In today's terms, it is a limited PublishSubject where you have only one stage: publisher-subscriber.

The addXXXListener style of APIs, although they facilitate push-based eventing, suffer from composability deficiencies. The lack of a common base concept would require you to implement a composable library for each of them one-by-one; or have one common abstraction like RxJava and build an adapter for each addXXXListener/removeXXXListener entry point.

1st generation

Once the deficiencies were recognized and addressed by Erik Meijer & Team at Microsoft, the first generation of reactive programming libraries were born: Rx.NET around 2010, Reactive4Java in 2011 and early versions of RxJava in 2013.

The others followed the Rx.NET architecture closely, but it soon turned out there were problems with this architecture. When the original IObservable/IObserver is implemented in a purely same-thread manner, sequences can't be cancelled while in progress by operators such as take(). Rx.NET sidestepped the issue by using mandatory asynchrony in sources such as range().

The second problem was the case when the producer side is separated by an implicit or explicit asynchronous boundary from a consumer that can't do its job fast enough. This can happen with trivial consumers as well because of the infrastructure overhead of crossing the asynchronous boundary. This is what we call the backpressure problem.

2nd generation

The new deficiencies of synchronous cancellation and the lack of backpressure were recognized by the RxJava team (I wasn't really involved) and a new architecture was designed.

The class Subscriber was introduced which could tell if it was interested in more events or not via isUnsubscribed() that had to be checked by each source or operator emitting events.

The backpressure problem was addressed by using co-routines to signal the number of items a Subscriber can process at a time, through a Producer interface.

The third addition was the method lift() which allows a functional transformation between Subscribers directly. Almost all instance operators have been rewritten to run with lift() through the new Operator interface.

3rd generation

Apart from being clumsy and limiting some optimizations, the problem with RxJava's solution was that it was incompatible with the viewpoints of other (upcoming) reactive libraries at the time. Recognizing the advent of (backpressure enabled) reactive programming, engineers from various companies got together and created the Reactive-Streams specification. The main output is a set of 4 interfaces and 30 rules regarding them and their 7 total methods.

The Reactive-Streams specification allows library implementors to be compatible with each other and compose the sequences, cancellation and backpressure across library boundaries while allowing the end-user to switch between implementations at will.

Reactive-Streams, and thus 3rd generation, libraries are, for example, RxJava 2.x, Project Reactor and Akka-Streams.

4th generation

Implementing a fluent library on top of Reactive-Streams requires quite a different internal architecture, thus RxJava 2.x had to be rewritten from scratch. While I was doing this reimplementation, I recognized that some operators could be combined in an external or internal fashion, saving on various overheads such as queueing, concurrency-atomics and requesting more.

Since RxJava 2.x development crawled to a halt due to lack of serious interest from certain parties, I set RxJava 2.x aside until Stephane Maldini (one of the contributors to Reactive-Streams and the main contributor to Project Reactor) and I started talking about a set of foundational operators that both RxJava 2.x and Project Reactor 2.5+ (and eventually Akka-Streams) could use and incorporate into the respective libraries.

With active communication, we established the reactive-streams-commons library, built the foundational operators and designed the components of optimizations that we call now operator-fusion.

Thus, a 4th generation reactive library may look like a 3rd generation from the outside, but the internals of many operators change significantly to support overhead reduction even further.

5+ generation

I think, at this point, we are about halfway into what operator-fusion can achieve, but there are signs the architecture of Reactive-Streams will need extensions to support reactive IO operations in the form of bi-directional sequences (or channels). In addition, transparent remote reactive queries may require changes as well (see QBservable in Rx.NET). I don't see the full extent of the possibilities and requirements at this point and all is open for discussion.


The Rx lifecycle

Before jumping into operator-fusion, I'd like to define the major points (thus the terminology I'll be using) of the lifecycle of an Rx sequence. This applies to any version of RxJava and any Reactive-Streams based libraries as well.

The lifecycle can be split into 3 main points:

  1. Assembly-time. This is the time when you write up just().subscribeOn().map() and assign it to a field or variable of type Observable/Publisher. This is the main difference from Future-based APIs (Promise, CompletableFuture, etc.): even if they support some fluent API, they don't have a separate assembly time but some form of interleaving among the 3 points.
  2. Subscription-time. This is the time when a Subscriber subscribes to a sequence at its very end and triggers a "storm" of subscriptions inside the various operators: an upstream-directed edge of subscribe() calls on one hand, and a downstream-directed edge of calls to setProducer/onSubscribe on the other. This is when subscription side-effects are triggered and generally no value is flowing through the pipeline yet.
  3. Runtime. This is the time when items are generated followed by zero or one terminal event of error/completion.

Each distinct point in the lifecycle enables a different set of optimization possibilities.


Operator-fusion

I admit, I got the term operator-fusion from some Intel CPU documentation describing their internal architecture doing macro- and micro-fusion on assembly-level operations. It sounded cool and the concepts behind it could be expanded up to the language level to reach the operators of reactive dataflows.

The idea, on the reactive level, is to modify the sequence the user created at various lifecycle points to remove overhead mandated by the general architecture of the reactive library.

As with the assembly-level fusion, we can define two kinds of reactive operator-fusion.

Macro-fusion

Macro-fusion happens mainly in the assembly-time in the form of replacing two or more subsequent operators with a single operator, thus reducing the subscription-time overhead (and sometimes the runtime overhead in case the JIT would be overwhelmed) of the sequence. There are several ways this can happen.

1) Replacing an operator with another operator

In this form of fusion, the operator applied looks at the upstream source (this is why I mentioned lift() causes trouble) and, instead of instantiating its own implementation, calls/instantiates a different operator.

One example of this is when you try to amb()/concat()/merge() an array of sources which has only one element. In this case, it would be unnecessary to instantiate the implementation and one can avoid the overhead by returning that single element directly. This kind of optimization is already part of RxJava 1.x.

The second example is when one uses a constant source, such as range(), and applies subscribeOn(). There is little-to-no behavioral difference between that and applying observeOn() in the same situation. Thus, subscribeOn() detecting a range() can switch to observeOn() and perhaps benefit from other optimizations that observeOn() itself can provide.

2) Replacing an operator with a custom operator

There exist operator-pairs that come up often and may work better if combined into a single operator. A very common operator-pair used for jump-starting some asynchronous computation is just().subscribeOn() or the equivalent just().observeOn().

Such sequences have quite a large overhead compared to the single value they emit: internal queues get created, workers get instantiated and released, several atomic variables are modified.

Therefore, replacing the pair with a custom operator that combines the scheduling and emission into a single value into one single operator is a win.

This approach, especially involving just(), can be extended to other operators, such as flatMap() where all the internal complexities can be avoided by invoking the mapper function once and running with the single Observable/Publisher directly, without buffering or extra synchronization.
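To illustrate the just()-flatMap shortcut, here is a sketch where java.util.List stands in for Observable (purely an assumption for demonstration; the real operator of course also deals with subscription, backpressure and error routing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the scalar flatMap shortcut; List stands in for Observable.
class ScalarFlatMapDemo {
    static <T, R> List<R> flatMap(List<T> source, Function<T, List<R>> mapper) {
        // macro-fusion: a single-element ("scalar") source needs no merging
        if (source.size() == 1) {
            return mapper.apply(source.get(0));
        }
        // general path: map each element and merge the inner sequences
        List<R> out = new ArrayList<>();
        for (T t : source) {
            out.addAll(mapper.apply(t));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flatMap(List.of(3), v -> List.of(v * v))); // [9]
    }
}
```

The scalar branch calls the mapper exactly once and hands back its result directly, skipping the whole merge machinery.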

Again, RxJava 1.x already has optimizations such as these examples above.

3) Replacing during subscription-time

There are cases when the previous two cases may happen during subscription-time instead of assembly-time.

I can see two reasons for moving the optimization into subscription-time: 1) as a safety-net in case the fluent API is bypassed and 2) for convenience if the fused and non-fused versions don't differ enough to warrant a fully independent operator class.

4) Replacing with the same operator but with modified parameters

Users of the libraries tend to apply certain operator types multiple times in a sequence, such as map() and filter():

Observable.range(1, 10)
   .filter(v -> v % 3 == 0)
   .filter(v -> v % 2 == 0)
   .map(v -> v + 1)
   .map(v -> v * v)
   .subscribe(System.out::println);

This is quite convenient to look at and one can easily understand what's happening. Unfortunately, if you have a range of 1M or resubscribe to the sequence a million times, the structure has quite a measurable overhead compared to a flatter structure.

The idea with this macro-fusion is to detect if an operator of the same type was applied before, take the original source and apply the operator where the parameters get combined. In our example, that means range() is followed, internally, by a single filter() application where the two lambda functions (in their reference form) are combined:

Predicate<Integer> p1 = v -> v % 3 == 0;
Predicate<Integer> p2 = v -> v % 2 == 0;

Predicate<Integer> p3 = v -> p1.test(v) && p2.test(v);

A similar fusion happens with the lambda of the map() operations, with the difference that the output of the first lambda is going to be the input of the second lambda:

Function<Integer, Integer> f1 = v -> v + 1;
Function<Integer, Integer> f2 = v -> v * v;

Function<Integer, Integer> f3 = v -> f2.apply(f1.apply(v));
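These combinations can be sketched with java.util.function composition (the actual fusion combines the operator instances rather than bare lambdas):

```java
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of combining subsequent filter() predicates and map() functions.
class FusionDemo {
    public static void main(String[] args) {
        Predicate<Integer> p1 = v -> v % 3 == 0;
        Predicate<Integer> p2 = v -> v % 2 == 0;
        Predicate<Integer> p3 = p1.and(p2); // single fused filter

        Function<Integer, Integer> f1 = v -> v + 1;
        Function<Integer, Integer> f2 = v -> v * v;
        Function<Integer, Integer> f3 = f1.andThen(f2); // f2(f1(v))

        // behaves like the two-stage pipeline in the text
        for (int v = 1; v <= 10; v++) {
            if (p3.test(v)) {
                System.out.println(f3.apply(v)); // prints 49 (only v = 6 passes)
            }
        }
    }
}
```

A single fused filter and a single fused map mean one operator application each, regardless of how many lambdas were chained.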

Micro-fusion

Micro-fusion happens when two or more operators share their resources or internal structures and thus bypassing some overhead of the general wired-up structure. Micro-fusion can mostly happen in subscription-time.

The original idea of micro-fusion was the recognition that operators ending in an output queue and operators starting with a front-queue could share the same Queue instance, saving on allocation and on the drain-loop work-in-progress serialization atomics. Later, the concept was extended to sources that could pose as Queues themselves, thus avoiding the creation of SpscArrayQueue instances completely.

There are several forms of micro-fusion that can happen in operators.

1) Conditional Subscriber

When filtering an (upstream) source with filter() or distinct(), if that source features a drain-loop with request accounting, there is the likely scenario that filter() will call request(1) whenever the last value has been dropped by the operator. Lots of request(1) calls, each triggering an atomic increment or a CAS loop, add up to measurable overhead quite quickly.

The idea behind a conditional subscriber is to have an extra method, boolean onNextIf(T v), whose return value indicates whether the value was actually consumed. If it wasn't, the usual drain-loop skips incrementing its emission counter and keeps emitting until the request limit is reached by successful consumptions.

This saves a lot on request management overhead and some operators in RxJava 2.x support it, but there are some drawbacks as well, mostly affecting the library writers themselves:

a) The source and filter may be separated by other operators so those operators have to offer a conditional Subscriber version of themselves to pass along the onNextIf() calls.

b) By returning non-void, the onNextIf() implementation is forced to be synchronous in nature. However, since it just returns a boolean, it can still behave as the regular onNext() method by claiming it consumed the value even though it dropped it; therefore, it has to request(1) manually again.

Since this is an internal affair, conditional Subscribers of operators still have to implement the regular onNext() behavior in case the upstream doesn't support conditional emission and/or is from some other reactive library with different internals.
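To make the mechanics concrete, here is a minimal plain-Java sketch of the idea; the `ConditionalSubscriber` interface shape and the `emit()` helper are illustrative stand-ins modeled on the description above, not RxJava's actual internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ConditionalDemo {

    // The extra method returns false when the value was dropped, letting
    // the emitter keep going without issuing request(1) for each drop.
    interface ConditionalSubscriber<T> {
        boolean onNextIf(T t);
    }

    // Emits from an array until `requested` values were actually consumed.
    public static List<Integer> emit(int[] source, long requested,
            Predicate<Integer> filter) {
        List<Integer> consumed = new ArrayList<>();
        ConditionalSubscriber<Integer> cs = v -> {
            if (filter.test(v)) {
                consumed.add(v);
                return true;   // consumed: counts against the request
            }
            return false;      // dropped: doesn't count, no extra request(1)
        };
        long emitted = 0;
        for (int i = 0; i < source.length && emitted < requested; i++) {
            if (cs.onNextIf(source[i])) {
                emitted++;
            }
        }
        return consumed;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(emit(src, 2, v -> v % 2 == 0)); // prints [2, 4]
    }
}
```

Note how the drain loop keeps emitting past the odd values without any request accounting; only the two successful consumptions count against the requested amount.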

2) Synchronous-fusion

We call synchronous micro-fusion the cases when the source to an operator is synchronous in nature, and can pretend to be a Queue itself.

Typical sources of this nature are range(), fromIterable(), fromArray(), fromStream() and fromCallable(). You could count just() here as well, but usually it is involved more in macro-fusion cases.

Operators that use an internal queue are, for example, observeOn(), flatMap() for its inner sources, publish(), zip(), etc.

The idea is for the source's Subscription to also implement Queue, and during the subscription time, the onSubscribe() can check for it and use it instead of newing up its internal Queue implementation.

This requires a different operation mode (a mode switch) from both the upstream and the operator itself: calling request() is forbidden and the mode has to be remembered in some field variable. In addition, when Queue.poll() returns null, that indicates no more values will ever come, unlike a regular poll() in an operator where null means no values are available right now but there could be some in the future.
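The changed poll() contract can be sketched in plain Java; `SimpleQueue`, `rangeQueue()` and `drain()` below are illustrative stand-ins for the Subscription-posing-as-Queue trick, not real RxJava/Rsc types:

```java
import java.util.ArrayList;
import java.util.List;

public class SyncFusionDemo {

    // Stand-in for the fused view: poll() returning null means "terminated",
    // unlike a regular operator queue where null means "nothing right now".
    interface SimpleQueue<T> {
        T poll();
    }

    // A range-like source posing as a queue over its own values.
    public static SimpleQueue<Integer> rangeQueue(int start, int count) {
        int[] state = {start};
        int end = start + count;
        return () -> state[0] < end ? state[0]++ : null;
    }

    // The consuming operator drains the shared queue directly: no request()
    // calls and no SpscArrayQueue allocation of its own.
    public static List<Integer> drain(SimpleQueue<Integer> q) {
        List<Integer> out = new ArrayList<>();
        for (Integer v = q.poll(); v != null; v = q.poll()) {
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(rangeQueue(1, 5))); // prints [1, 2, 3, 4, 5]
    }
}
```

The consumer simply polls until null; because the source is synchronous, null can safely double as the terminal signal.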

Unfortunately for RxJava 1.x, this fusion works better with the Reactive-Streams architecture because in 1.x a) setting a Producer is optional, b) the lifecycle-related behaviors are too unreliable and c) there are discovery difficulties and too much indirection.

When benchmarked in Rsc, this form of fusion makes a range().observeOn() sequence go from 55M Ops/s to 200M Ops/s in throughput, giving a ~4x overhead reduction in this trivial sequence.

Again, there are downsides of this kind of API "hacking":

a) In short sequences, the mode switch inside the operator may not be worth it.

b) This optimization is library local at the moment so unless there is a standard API like with Reactive-Streams interfaces, library A implementing micro-fusion may not cross-fuse with library B.

c) There are situations where this queue-fusion optimization is invalid, mainly due to thread-boundary violations (or other effects we haven't discovered yet that create invalid fused sequences).

d) This optimization also has some library-spanning effect, because intermediate operators have to support, or at least not interfere with, the setup protocol of the mode switch.

e) This also has the effect that in a Reactive-Streams architecture, an operator can't just pass along the Subscription from upstream to its downstream because if they fuse, the intermediate operator is cut out.


3) Asynchronous-fusion

There are other situations when the source has its own internal, downstream facing queue which is drained by requests, but the timing and count of the items are not known upfront.

In this situation, the source can also implement the Queue interface and the operator can use it instead of a fresh queue, but the protocol has to change, especially if the same operator wants to support synchronous fusion as well.

Therefore, in Rsc, instead of checking in onSubscribe() whether the received Subscription implements Queue, we established a custom interface, QueueSubscription, that extends both Subscription and Queue and adds a method called requestFusion().

The method requestFusion() takes an int flag telling the upstream what kind of fusion the current operator wants or supports, and the upstream responds with the kind of fusion mode it has activated.

For example, flatMap() would request a synchronous fusion from the inner source, which could answer with sorry-no, yes-synchronous or instead-asynchronous, and act accordingly. Generally, one can "downgrade" from a synchronous mode to asynchronous or none, but one can't "upgrade" to synchronous mode from an asynchronous mode request.

In asynchronous-fusion mode, the downstream still has to issue request() calls, but instead of the value being enqueued twice, it is generated straight into the shared queue and the upstream calls onNext() merely to indicate its availability. The actual value of that call is irrelevant; we use null as a type-neutral value, and it can trigger the usual drain() call directly.

Since fusion happens at subscription time, it is too late to change the Subscriber instance itself; therefore, one needs a mode flag in the operator and a conditional check for the fusion mode. This way, the same class can work with regular and fuseable sources alike.
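The negotiation can be sketched with plain int flags; the constants and the two requestFusion() variants below are illustrative, modeled only on the description above, not the actual Rsc/RxJava 2 values:

```java
public class FusionModeDemo {

    // Illustrative mode flags, modeled on the description above.
    public static final int NONE  = 0;
    public static final int SYNC  = 1;
    public static final int ASYNC = 2;
    public static final int ANY   = SYNC | ASYNC;

    // What a synchronous source (e.g. range) might answer: SYNC when the
    // caller accepts it, otherwise no fusion -- an ASYNC-only request
    // can't be "upgraded" to SYNC.
    public static int syncSourceRequestFusion(int requested) {
        return (requested & SYNC) != 0 ? SYNC : NONE;
    }

    // What an async source (with its own internal queue) might answer:
    // it "downgrades" any request, even a SYNC one, to instead-asynchronous.
    public static int asyncSourceRequestFusion(int requested) {
        return requested != NONE ? ASYNC : NONE;
    }

    public static void main(String[] args) {
        System.out.println(syncSourceRequestFusion(ANY));    // prints 1 (SYNC)
        System.out.println(syncSourceRequestFusion(ASYNC));  // prints 0 (NONE)
        System.out.println(asyncSourceRequestFusion(SYNC));  // prints 2 (ASYNC)
    }
}
```

The asymmetry in the two answers encodes the downgrade-only rule: asynchronous can substitute for synchronous, but never the other way around.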

This is the point when the complexity rises 50% above the complexity of a classical backpressured operator and requires quite an in-detail knowledge of all the operators and their behavior in various situations.

Invalid fusions

Before one goes ahead and fuses every queue in every operation, a problem comes up in the form of invalid fusion.

Operators tend to have some barriers associated with them. These are somewhat analogous to memory barriers and have a similar effect: 1) prevent certain reorderings and 2) prevent certain optimizations altogether.

For example, mapping from String to Integer and then Integer to Double can't be reordered because of the type mismatch. Reordering a filter() with map() may be invalid when the map changes types or by introducing side-effects in map that would have been avoided because filter didn't let the causing value through in the first place.
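A tiny plain-Java sketch shows why such a reordering is observable: it counts how often a side-effecting map function runs when the filter stays before the map versus when the two are swapped (the `BarrierDemo` helper is purely illustrative):

```java
public class BarrierDemo {

    // Counts invocations of a (side-effecting) map function over 1..10
    // with an "even" filter either before or after it.
    public static int mapCalls(boolean filterFirst) {
        int calls = 0;
        for (int v = 1; v <= 10; v++) {
            if (filterFirst) {
                if (v % 2 == 0) {
                    calls++;  // map runs only on values the filter let through
                }
            } else {
                calls++;      // map (and its side-effect) runs on every value
                // the v % 2 == 0 check would come here -- too late to
                // prevent the side-effect on the dropped values
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(mapCalls(true));  // prints 5
        System.out.println(mapCalls(false)); // prints 10
    }
}
```

The two orderings produce the same output values, but the swapped version triggers the map's side-effect twice as often, which is exactly what makes the reordering invalid.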

On one hand, these functional barriers mainly affect the macro-fusion operators and are somewhat easier to detect and understand.

On the other hand, when asynchrony is involved, in the form of a thread-jumping behavior provided by observeOn(), micro-fusion can become invalid.

For example, if you have a sequence of

source.flatMap(u -> range(1, 2)
        .map(v -> heavyComputation(v))
        .observeOn(AndroidSchedulers.mainThread()))
    .subscribe(...)

The inner range-map-observeOn-flatMap sequence would share a single fused queue, and because the map()'s behavior has been reordered to the output side of that shared queue, the heavy computation now executes on the main thread.

On a side note, the classical observeOn can also drag the emission onto its thread due to how backpressure triggers emission; thus in the example above, with a longer range(), the range's emission, and so the map()'s computation, would end up on the main thread anyway. This is why one needs subscribeOn()/observeOn() before the map to ensure it runs on the intended thread.

This required a slight change to the protocol of the requestFusion() call by introducing a bit indicating whether the caller (chain) acts as an asynchronous boundary, that is, whether the endpoint of the fused queue lives on another thread. Intermediate operators such as map() intercept this method call and simply respond with no-fusion.

Finally, there might be a subscription-time related barrier as well that prevents reordering/optimization due to subscription side-effects. We are not sure of this yet, but here are a few hands-on cases that require further study:

1) Is it valid to turn a range().subscribeOn(s1).observeOn(s2) chain, which I call a strongly-pipelined sequence because of the forced thread-boundary switch by default, into a fused range().observeOn(s2)? The tail-emission pattern is the same, you get events on Scheduler s2, but now we've lost the strong pipelining effect.

2) Subscribing to a Subject may take some time in case there are lots of Subscribers on it, thus subscribeOn() may be a valid way to offset that overhead; but generally, there are no other side-effects happening when one subscribes to a PublishSubject. Is it valid to drop/replace subscribeOn() here?


Conclusion

Operator-fusion is a great opportunity, but also a great responsibility, to reduce overhead in reactive dataflows, and sometimes, get pretty close (+50% overhead with Project Reactor 2.5 M1 instead of +200% overhead with RxJava 2.x) to a regular Java Streams sequence's overhead while still supporting asynchronous (parts of) sequences with the same API (and similar internals).

However, adding fusion to every operator overzealously may not be worth it, and one should focus on the operators doing the heavy lifting in users' code most of the time: flatMap(), observeOn(), zip(), just(), from(), etc. In addition, one could say every operator pair is macro-fuseable because a custom operator can be written for it, but then you have a combinatorial explosion of operators that now have to interact with the regular operators and with each other.

Of course, on the other side, there are operators that don't look like they could be (micro-) fused but may turn up fuseable after all. But instead of building a huge operator cross-fusion matrix, there might be a possibility to automatically discover which operators can be fused by modelling them and the sequences in some way and applying graph algorithms on the network - a topic for further research.

Anyway, in the next part, I'll dive deeper into how operator-fusion in Rsc has been implemented, but before that, I'd like to describe the in-depth technicalities and differences of the subscribeOn() and observeOn() operators in an intermediate post, for two reasons:

1) I think showing how to implement them clears up the confusion around them, because I learned about subscribeOn() and observeOn() in the same in-depth technical way in the first place (and I was never confused).

2) Knowing their structure and exact behavior helps in understanding the fusion-related changes applied to them later on.

As for where you can play with this fusion thing (as an end-user), check out Project Reactor 2.5, which has extensively (unit-) tested the solutions I have described in this post. Of course, since this is ongoing research, the Rsc project itself welcomes feedback or tips on what operator combinations we should optimize for.

Writing a custom reactive base type


Introduction


From time to time, the question or request comes up that one would really like to have his/her own reactive type. Even though RxJava's Observable has plenty of methods and extension points via lift(), extend() and compose(), one feels the Observable should have the operator xyz() or in some chains, the chain shouldn't allow calling uvw().

The first case, namely adding a new custom method without going through the project as a contribution, is as old as reactive programming on the JVM. When I first ported Rx.NET to Java, I had to face the same problem because .NET had the very convenient extension method support already back in 2010. Java doesn't have this, and the idea was rejected during the version 8 development era in favor of default methods, with the "justification" that such extension methods can't be overridden. True, they can't, but they can be replaced by another method from another class.

The second case, hiding or removing operators, comes up with custom Observables where certain operations don't make sense. For example, given a ParallelObservable that splits the input sequence into parallel processing pipelines internally, it makes sense to map() or filter() in parallel, but it doesn't make sense to use take() or skip().


Wrapping

Both cases can be solved by writing a custom type and just wrap the Observable into it.

public final class MyObservable<T> {
    private final Observable<T> actual;

    public MyObservable(Observable<T> actual) {
        this.actual = actual;
    }
}

Now we can add operators of our liking:

    // ...
    public static <T> MyObservable<T> create(Observable<T> o) {
        return new MyObservable<T>(o);
    }

    public static <T> MyObservable<T> just(T value) {
        return create(Observable.just(value));
    }

    public final MyObservable<T> goAsync() {
        return create(actual.subscribeOn(Schedulers.io())
            .observeOn(AndroidSchedulers.mainThread()));
    }

    public final <R> MyObservable<R> map(Func1<T, R> mapper) {
        return create(actual.map(mapper));
    }

    public final void subscribe(Subscriber<? super T> subscriber) {
        actual.subscribe(subscriber);
    }
    // ...

As seen here, we achieved both goals: get rid of the unnecessary operators and introduce our own operator while staying within our custom type.

If you look at the source code of RxJava, you see the same pattern where the actual object is just the OnSubscribe / Publisher type and the Observable enriches them with all sorts of operators.


Interoperation


The MyObservable looks adequate, but eventually one has to interoperate with the regular Observable or somebody else's YourObservable. Because these are distinct types, we need a common type through which they can communicate. Naturally, everybody could implement a toObservable() and return an Observable view, but that adds yet another method-call inconvenience. Instead, every MyObservable and YourObservable can extend a base class or implement an interface with the minimal set of operations that each requires.

In RxJava 1.x, the obvious choice, Observable, isn't too good, because its methods are final and leak into MyObservable and the worst, they all return Observable instead of MyObservable! Unfortunately, 1.x can't help in this regard due to binary compatibility reasons.

The lucky thing is that in 2.x, Observable (Flowable) isn't really the root of the reactive type hierarchy; Publisher is. Every Flowable is a Publisher and many operators take a Publisher as parameter instead of an Observable. This has the benefit of working with other Publisher-based types out of the box. The reason this can work is that for an Observable chain to work, the operators only need a single method on their sources: subscribe(Subscriber<? super T> s);

Therefore, if we target 2.x, the MyObservable should implement Publisher and thus immediately available as source to operators of any decent reactive library:

public class MyObservable<T> implements Publisher<T> {
    // ...
    @Override
    public void subscribe(Subscriber<? super T> subscriber) {
        actual.subscribe(subscriber);
    }
    // ...
}


Extension


Given this MyObservable, one would eventually want other custom reactive types for different use cases, but that becomes tedious as well due to the need for duplicating operators all over. Naturally, one thinks about using MyObservable as the base class for TheirObservable and adding the new operators there, but that suffers from the same problem as extending Observable with MyObservable would: the operators return the wrong type.

I believe the Java 8 Streams API suffered from a similar problem and if you look at the signature, Stream extends BaseStream<T, Stream<T>> and BaseStream<T, S extends BaseStream<T, S>>. Quite odd that some supertype has a type parameter for the subtype. The reason for this is to capture the subtype in the type signature of the methods, thus if you have MyStream, all stream methods' type signature now has MyStream as a return type.

We can achieve a similar structure by declaring MyObservable as follows:

    public class MyObservable<T, S extends MyObservable<T, S>> implements Publisher<T> {
       
        final Publisher<? extends T> actual;
       
        public MyObservable(Publisher<? extends T> actual) {
            this.actual = actual;
        }
       
        @SuppressWarnings("unchecked")
        public <R, U extends MyObservable<R, U>> U wrap(Publisher<? extends R> my) {
            return (U)new MyObservable<R, S>(my);
        }
       
        public final <R, U extends MyObservable<R, U>> U map(Function<? super T, ? extends R> mapper) {
            return wrap(Flowable.fromPublisher(actual).map(mapper));
        }
       
        @Override
        public void subscribe(Subscriber<? super T> s) {
            actual.subscribe(s);
        }
    }

Quite a set of generic type mangling. We specify a wrap() method that turns an arbitrary Publisher into MyObservable and we call it from map() to ensure the result type is ours. Descendants of MyObservable will then override wrap to provide their own type:

    public class TheirObservable<T> extends MyObservable<T, TheirObservable<T>> {
        public TheirObservable(Publisher<? extends T> actual) {
            super(actual);
        }

        @SuppressWarnings("unchecked")
        @Override
        public <R, U extends MyObservable<R, U>> U wrap(Publisher<? extends R> my) {
            return (U) new TheirObservable<R>(my);
        }
    }

Let's try it:

    public static void main(String[] args) {
        TheirObservable<Integer> their = new TheirObservable<>(Flowable.just(1));
       
        TheirObservable<String> out = their.map(v -> v.toString());

        Flowable.fromPublisher(out).subscribe(System.out::println);
    }


It works as expected; no compilation errors and it prints the number 1 to the console.

Now let's add a take() operator to TheirObservable:

        @SuppressWarnings({ "rawtypes", "unchecked" })
        public <U extends TheirObservable<T>> U take(long n) {
            Flowable<T> p = Flowable.fromPublisher(actual);
            Flowable<T> u = p.take(n);
            return (U)(TheirObservable)wrap(u);
        }

The method signatures get more complicated and the type system starts to fight back; one needs raw types and casts to make things appear as the expected type. In addition, if one writes their.map(v -> v.toString()).take(1); the compiler won't find take(). The reason is that map() returns some subtype of MyObservable, and that subtype is only pinned down by the assignment target being TheirObservable; in the middle of a fluent chain there is nothing to infer it from. To make the types work out, we have to split the fluent calls into individual steps:

        TheirObservable<Integer> their2 = new TheirObservable<>(Flowable.just(1));
        TheirObservable<String> step1 = their2.map(v -> v.toString());
        TheirObservable<String> step2 = step1.take(1);
        Flowable.fromPublisher(step2).subscribe(System.out::println);


Finally, lets extend TheirObservable further into AllObservable and let's add the filter() method:


    public static class AllObservable<T> extends TheirObservable<T> {
        public AllObservable(Publisher<? extends T> actual) {
            super(actual);
        }

        @SuppressWarnings("unchecked")
        @Override
        public <R, U extends MyObservable<R, U>> U wrap(Publisher<? extends R> my) {
            return (U)new AllObservable<R>(my);
        }

        @SuppressWarnings({ "rawtypes", "unchecked" })
        public <U extends AllObservable<T>> U filter(Predicate<? super T> predicate) {
            Flowable<T> p = Flowable.fromPublisher(actual);
            Flowable<T> u = p.filter(predicate);
            return (U)(AllObservable)wrap(u);
        }
    }

then use it:


        AllObservable<Integer> all = new AllObservable<>(Flowable.just(1));

        AllObservable<String> step1 = all.map(v -> v.toString());

        AllObservable<String> step2 = step1.take(1);

        AllObservable<String> step3 = step2.filter(v -> true);

        Flowable.fromPublisher(step3).subscribe(System.out::println);

Unfortunately, this doesn't compile because map() doesn't return AllObservable; namely, AllObservable is not a MyObservable<String, U extends MyObservable<String, U>>. Changing step1's type to TheirObservable<String> resolves the compilation issue. However, if one then swaps filter() and take(), step1 is no longer an AllObservable and filter() is no longer available.


Conclusion


Can we fix the situation with AllObservable? I don't know; this is where my understanding of Java's type system and type inference ends.

Will RxJava 2.x have such a structure then? If it were up to me, no. To support this style, we'd need wrapping all the time, even though I want to get rid of all lift() and create() use, and the type signatures of classes and methods end up way more complicated than before.

Therefore, if one wants to go down this path, the examples above show that RxJava's API doesn't have to change: it can be wrapped to do the work while one specifies their surface API at will. It is a good example of "composition over inheritance".

SubscribeOn and ObserveOn


Introduction


One of the most confused operator pairs of the reactive ecosystem is subscribeOn and observeOn. The source of confusion may be rooted in a few causes:


  • they sound alike,
  • they sometimes show similar behavior when looked at from downstream and
  • they are duals in some sense.

It appears the name-confusion isn't local to RxJava. Project Reactor faces a similar issue with their publishOn and dispatchOn operators. Apparently, it doesn't matter what they are called and people will confuse them anyhow.

When I started learning about Rx.NET back in 2010, I never experienced this confusion: subscribeOn affects subscribe() and observeOn affects the onXXX() methods.

(Remark: I've searched Channel 9 for the early videos but couldn't really find the talk where they build up these operators just like I'm about to do. The closest thing was this.)

My "thesis" is that the confusion may be resolved by walking through how one can implement these operators and thus showing the internal method-call flow.


SubscribeOn


The purpose of subscribeOn() is to make sure the side-effects of calling subscribe() happen on some other thread. However, almost no standard RxJava source does side-effects on its own; you can have side-effects with custom Observables, wrapped subscription-actions via create() or, as of late, with the SyncOnSubscribe and fromCallable() APIs.

Why would one move the side-effects? The main use cases are network calls, database access or anything else that involves a blocking wait on the current thread. Holding up a Tomcat worker thread hasn't been much of a programming problem (that doesn't mean we can't improve the stack with reactive), but holding up the Event Dispatch Thread in a Swing application or the main thread in an Android application has an adverse effect on the user experience.

(Sidenote: it's a funny thing that blocking the EDT is basically a convenience backpressure strategy in the GUI world to prevent the user from changing the application state while some activity was happening.)

Therefore, if the source does something immediately when a child subscribes, we'd want it to happen somewhere off the precious current thread. Naturally, we could just submit the whole sequence and the call to subscribe() to an ExecutorService, but then we'd be faced with the problem of cancellation being separate from the Subscriber. As more and more (complex) sequences require this asynchronous subscription behavior, the more inconvenient it becomes to manage them all in this manner.

Luckily, we can include this behavior into an operator we call subscribeOn().

For simplicity, let's build this operator on a much simpler reactive base type: the original IObservable from Rx.NET:

@FunctionalInterface
interface IObservable<T> {
    IDisposable subscribe(IObserver<T> observer);
}

@FunctionalInterface
interface IDisposable {
    void dispose();
}

interface IObserver<T> {
    void onNext(T t);
    void onError(Throwable e);
    void onCompleted();
}

Don't worry about synchronous cancellation and backpressure for now.

Let's assume we have a source which just sleeps for a second:


IObservable<Object> sleeper = o -> {
    try {
        Thread.sleep(1000);
        o.onCompleted();
    } catch (InterruptedException ex) {
        o.onError(ex);
    }
};


which will obviously go to sleep if we call sleeper.subscribe(new IObserver ... ); Let's now create an operator that moves this sleep to some other thread:


ExecutorService exec = Executors.newSingleThreadExecutor();

IObservable<Object> subscribeOn = o -> {
    Future<?> f = exec.submit(() -> sleeper.subscribe(o));
    return () -> f.cancel(true);
};


The subscribeOn instance will submit the action that subscribes to the actual IObservable to the executor and returns a disposable that will cancel the resulting Future from the submission.

Of course, one would instead have this in some static method or on a wrapper around the IObservable (as Java doesn't support extension methods):




public static <T> IObservable<T> subscribeOn(IObservable<T> source, 
ExecutorService executor);

public Observable<T> subscribeOn(Scheduler scheduler);

Two of the common questions regarding subscribeOn are what happens when one applies it twice (directly or some regular operators in between) and why can't one change the original thread with a second subscribeOn. I hope the answer becomes apparent from the simplified structure above. Let's apply the operator a second time:

ExecutorService exec2 = Executors.newSingleThreadExecutor();

IObservable<Object> subscribeOn2 = o -> {
    Future<?> f2 = exec2.submit(() -> subscribeOn.subscribe(o));
    return () -> f2.cancel(true);
};

Now let's expand subscribeOn.subscribe() in place:


IObservable<Object> subscribeOn2 = o -> {
    Future<?> f2 = exec2.submit(() -> {
        Future<?> f = exec.submit(() -> {
            sleeper.subscribe(o);
        });
    });
};

We can simply read this from top to bottom. When o arrives, a task is scheduled on exec2 which, when it executes, schedules another task on exec which, when it executes, subscribes to sleeper with the original o. Because subscribeOn2 was applied last, it gets executed first, and no matter where it runs its task, the subscription gets rescheduled by subscribeOn onto its own thread anyway. Therefore, the thread of the subscribeOn() closest to the source is what matters, and one can't use another subscribeOn() application to change this. This is why APIs built on top of Rx either should not pre-apply subscribeOn() when they return an Observable or should give the option to specify a scheduler.
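This closest-to-the-source behavior can be demonstrated with the plain-executor version of the operator; the class and thread names below are illustrative, and the subscription is stood in for by a Runnable that records which thread ran it:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SubscribeOnOrderDemo {

    // Recreates the expanded nesting above with two named single-threaded
    // executors and reports which thread ends up running the subscription.
    public static String subscriptionThread() {
        ExecutorService exec = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "closest-to-source"));
        ExecutorService exec2 = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "applied-last"));
        CompletableFuture<String> result = new CompletableFuture<>();
        // sleeper.subscribe(o) stand-in: record the thread it would run on
        Runnable subscribe = () -> result.complete(Thread.currentThread().getName());
        try {
            exec2.submit(() -> exec.submit(subscribe)); // outer reschedules onto inner
            return result.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            exec.shutdown();
            exec2.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(subscriptionThread()); // prints closest-to-source
    }
}
```

No matter which executor the outer subscribeOn uses, the inner one always gets the last word on where the subscription runs.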

Unfortunately, the subscribeOn operator above doesn't handle unsubscription properly: the result of the sleeper.subscribe() is not wired up to that external IDisposable instance and thus won't dispose the "real" subscription. Of course, this can be resolved by having a composite IDisposable and adding all relevant resources to it. In RxJava 1, however, we don't need this kind of juggling and the operator can be written with less work:


Observable.create(subscriber -> {
    Worker worker = scheduler.createWorker();
    subscriber.add(worker);
    worker.schedule(
        () -> source.unsafeSubscribe(Subscribers.wrap(subscriber))
    );
});

This makes sure the unsubscribe() call on the subscriber will affect the schedule() and whatever resources the upstream source would use. We can use unsafeSubscribe() to avoid the unnecessary wrapping into a SafeSubscriber but we have to wrap the subscriber anyway because both subscribe() and unsafeSubscribe() call onStart() on the incoming Subscriber, which has already been called by the outer Observable. This avoids repeating any effects inside the user's Subscriber.onStart() method.

The structure above composes backpressure as well, but we are not done.

Before RxJava got backpressure, the subscribeOn() implementation above made sure that an otherwise synchronous source would emit all of its events on the same thread:


Observable.create(s -> {
    for (int i = 0; i < 1000; i++) {
        if (s.isUnsubscribed()) return;

        s.onNext(i);
    }

    if (s.isUnsubscribed()) return;

    s.onCompleted();
});

Users started to implicitly rely on this property. Backpressure breaks it because usually the thread that calls request() will end up running that fragment of the loop above (see range()), causing potential thread-hopping. Therefore, to keep the property, calls to request() have to go to the very same Worker that performed the original subscription.

The actual operator thus is more involved:


subscriber -> {
    Worker worker = scheduler.createWorker();
    subscriber.add(worker);

    worker.schedule(() -> {
        Subscriber<T> s = new Subscriber<T>(subscriber) {
            @Override
            public void onNext(T v) {
                subscriber.onNext(v);
            }

            @Override
            public void onError(Throwable e) {
                subscriber.onError(e);
            }

            @Override
            public void onCompleted() {
                subscriber.onCompleted();
            }

            @Override
            public void setProducer(Producer p) {
                subscriber.setProducer(n -> {
                    worker.schedule(() -> p.request(n));
                });
            }
        };

        source.unsafeSubscribe(s);
    });
}

Other than forwarding the onXXX() methods to the child subscriber, we set a custom producer on the child where the request() method schedules an action that calls the original producer with the same amount on the scheduler, ensuring that if there is an emission tied to the request, that happens on the same thread every time.

This can be optimized a bit by capturing the current thread in the outer schedule() action, comparing it to the caller thread in the custom Producer and then calling p.request(n) directly instead of scheduling it:


Thread current = Thread.currentThread();

// ...

subscriber.setProducer(n -> {
    if (Thread.currentThread() == current) {
        p.request(n);
    } else {
        worker.schedule(() -> p.request(n));
    }
});



ObserveOn


The purpose of observeOn is to make sure values coming from any thread are received or observed on the proper thread. RxJava is by default synchronous, which technically means that onXXX() methods are called in sequence on the same thread:


for (int i = 0; i < 1000; i++) {
    MapSubscriber.onNext(i) {
        FilterSubscriber.onNext(i) {
            TakeSubscriber.onNext(i) {
                MySubscriber.onNext(i);
            }
        }
    }
}

There are several use cases for moving this onNext() call (and any subsequent calls chained after) to another thread. For example, generating the input to a map() operation is cheap but the calculation itself is expensive and would hold off the GUI thread. Another example is when there is a background activity (database, network or the previous heavy computation), the results should be presented on the GUI and that requires the programmer to only interact with the GUI framework on the specific thread.

In concept, observeOn works by scheduling a task for each onXXX() call from the source on a specific scheduler, where the original parameter value is handed to the downstream's onXXX() methods:


ExecutorService exec = Executors.newSingleThreadExecutor();

IObservable<T> observeOn = o -> source.subscribe(new IObserver<T>() {
    @Override
    public void onNext(T t) {
        exec.submit(() -> o.onNext(t));
    }

    @Override
    public void onError(Throwable e) {
        exec.submit(() -> o.onError(e));
    }

    @Override
    public void onCompleted() {
        exec.submit(() -> o.onCompleted());
    }
});

This pattern only works if the executor is single threaded or otherwise ensures FIFO behavior and doesn't execute multiple tasks from the same "client" at the same time.
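The FIFO requirement is exactly what a single-threaded executor provides. A small, self-contained sketch (class and event names are illustrative) submits the onXXX() calls the same way as above and observes them arrive in order:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FifoDemo {

    // Submits onNext(0..4) and onCompleted() the way the observeOn sketch
    // does; a single-threaded executor keeps the submission order intact.
    public static List<String> run() {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        List<String> events = new CopyOnWriteArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int v = i;
            exec.submit(() -> events.add("onNext(" + v + ")"));
        }
        exec.submit(() -> events.add("onCompleted()"));
        exec.shutdown();
        try {
            exec.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
        // prints [onNext(0), onNext(1), onNext(2), onNext(3), onNext(4), onCompleted()]
    }
}
```

With a multi-threaded pool, the same submissions could interleave and onCompleted() could overtake a pending onNext(), violating the sequential protocol.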

Unsubscription here is more complicated, because one has to keep track of all the pending tasks, remove them when they have finished and make sure every pending task can be mass-cancelled.

I believe Rx.NET has some complicated machinery for this, but luckily RxJava has a simple solution in the form of the Scheduler.Worker, which takes care of all the required unsubscription behavior:


Observable.create(subscriber -> {
    Worker worker = scheduler.createWorker();
    subscriber.add(worker);

    source.unsafeSubscribe(new Subscriber<T>(subscriber) {
        @Override
        public void onNext(T t) {
            worker.schedule(() -> subscriber.onNext(t));
        }

        @Override
        public void onError(Throwable e) {
            worker.schedule(() -> subscriber.onError(e));
        }

        @Override
        public void onCompleted() {
            worker.schedule(() -> subscriber.onCompleted());
        }
    });
});


Now if we compare subscribeOn and observeOn, one can see that subscribeOn schedules the entire source.subscribe(...) part whereas observeOn schedules the individual subscriber.onXXX() calls onto another thread.

You can now see that if observeOn is applied twice, the inner scheduled task expands to another level of scheduling:

worker.schedule(() -> worker2.schedule(() -> subscriber.onNext(t)));

thus it overrides the emission thread in the chain, therefore, functionally, the closest observeOn to the consumer will win. From the expanded call above, you can see that worker is now wasted as a resource while providing no functional value to the sequence.

The observeOn with the given structure has a drawback. If the source is some trivial Observable such as range(0, 1M), it will emit all of its values and suddenly we have a large number of pending tasks in the underlying threadpool of the scheduler. This can overwhelm the downstream consumer and also consumes a lot of memory.

Backpressure was introduced mostly to handle such cases, preventing internal buffer bloat and unbounded memory usage due to an asynchronous boundary. Consumers specify the number of items they can consume via request(), which makes sure the producer side will emit at most that many elements via onNext(). Once the consumer is ready, it will issue another request(). The observeOn() above, with its new Subscriber<T>(subscriber) wrapping, already composes backpressure and relays the request() calls to the upstream source. However, this doesn't prevent the consumer from requesting everything via Long.MAX_VALUE, and then we have the same bloat problem again.

Unfortunately, RxJava discovered the backpressure problem too late, and mandatory requesting would have required a lot of user code changes. Instead, backpressure was introduced as an optional behavior and made the responsibility of operators such as observeOn to handle, while maintaining transparency with bounded Subscribers and unbounded Observers alike.

The way it can be handled is via a queue, request tracking for the child Subscriber, fixed request amount towards the source and a queue-drain loop.


Observable.create(subscriber -> {
    Worker worker = scheduler.createWorker();
    subscriber.add(worker);

    source.unsafeSubscribe(new Subscriber<T>(subscriber) {
        final Queue<T> queue = new SpscAtomicArrayQueue<T>(128);

        final AtomicLong requested = new AtomicLong();

        final AtomicInteger wip = new AtomicInteger();

        Producer p;

        volatile boolean done;
        Throwable error;

        @Override
        public void onNext(T t) {
            queue.offer(t);
            trySchedule();
        }

        @Override
        public void onError(Throwable e) {
            error = e;
            done = true;
            trySchedule();
        }

        @Override
        public void onCompleted() {
            done = true;
            trySchedule();
        }

        @Override
        public void setProducer(Producer p) {
            this.p = p;
            subscriber.setProducer(n -> {
                BackpressureUtils.getAndAddRequest(requested, n);
                trySchedule();
            });
            p.request(128);
        }

        void trySchedule() {
            if (wip.getAndIncrement() == 0) {
                worker.schedule(this::drain);
            }
        }

        void drain() {
            int missed = 1;
            for (;;) {
                long r = requested.get();
                long e = 0L;

                while (e != r) {
                    boolean d = done;
                    T v = queue.poll();
                    boolean empty = v == null;

                    if (checkTerminated(d, empty)) {
                        return;
                    }

                    if (empty) {
                        break;
                    }

                    subscriber.onNext(v);
                    e++;
                }

                if (e == r && checkTerminated(done, queue.isEmpty())) {
                    return;
                }

                if (e != 0) {
                    BackpressureUtils.produced(requested, e);
                    p.request(e);
                }

                missed = wip.addAndGet(-missed);
                if (missed == 0) {
                    break;
                }
            }
        }

        boolean checkTerminated(boolean d, boolean empty) {
            if (subscriber.isUnsubscribed()) {
                queue.clear();
                return true;
            }
            if (d) {
                Throwable e = error;
                if (e != null) {
                    subscriber.onError(e);
                    return true;
                } else
                if (empty) {
                    subscriber.onCompleted();
                    return true;
                }
            }
            return false;
        }
    });
});

By now, the pattern should be quite familiar. We queue up the item or save the exception, then increment the wip counter and schedule the draining of the queue. This is necessary as values may arrive at the same time the downstream issues a request. Issuing a request has to schedule the drain as well because values may already be available in the queue. The drain loop emits what it can and asks for replenishment from the upstream Producer it received through the setProducer() call.

Naturally, one can extend this with additional safeguards, error-delay capability, a parametric initial request amount and even stable replenishment amounts. This trySchedule setup has the property that it doesn't require a single-threaded scheduler to begin with, as it self-trampolines: due to the getAndIncrement, only a single thread will issue the drain task at a time, and only when the wip counter is decremented to zero does the opportunity open for somebody else to schedule another drain task.
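The self-trampolining wip pattern can be demonstrated in isolation with plain JDK classes (a minimal sketch; the class and counters are mine, not RxJava API):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class TrampolineDemo {
    static final Queue<Integer> queue = new ConcurrentLinkedQueue<>();
    static final AtomicInteger wip = new AtomicInteger();
    static final AtomicInteger drained = new AtomicInteger();

    // Only the caller that moves wip from 0 to 1 enters drain(); concurrent
    // callers merely bump the counter, recording a "missed" signal.
    static void trySchedule() {
        if (wip.getAndIncrement() == 0) {
            drain();
        }
    }

    static void drain() {
        int missed = 1;
        for (;;) {
            Integer v;
            while ((v = queue.poll()) != null) {
                drained.incrementAndGet();
            }
            missed = wip.addAndGet(-missed);
            if (missed == 0) {
                break; // no signals arrived while draining; we may stop
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int n = 4, per = 10_000;
        CountDownLatch done = new CountDownLatch(n);
        for (int t = 0; t < n; t++) {
            new Thread(() -> {
                for (int i = 0; i < per; i++) {
                    queue.offer(i);  // offer first,
                    trySchedule();   // then signal - same order as onNext
                }
                done.countDown();
            }).start();
        }
        done.await();
        while (wip.get() != 0) {
            Thread.yield(); // wait for the last drain pass to finish
        }
        System.out.println("drained: " + drained.get());
    }
}
```

Despite four threads signalling concurrently, exactly one thread drains at any moment, and no offered item is ever lost or processed twice.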


Conclusion


In this post, I've tried to clear up the confusion around the subscribeOn and observeOn operators by showing a simplified, clutter free way of implementing them.

We saw then that the complication in RxJava comes from the need for handling backpressure somewhat transparently for consumers that do or don't directly drive a sequence through it.

Now that the inner workings and structures have been clarified, let's continue with the discussion about operator fusion where I can now use subscribeOn and observeOn as an example how macro- and micro-fusion can help around the asynchronous boundaries they provide.

Operator fusion (part 2 - final)


Introduction


In the previous part, I've introduced the concepts around operator fusion. In this post, I'll detail the API and protocols required to make operator fusion happen.

In its current form, operator fusion works between two subsequent operators and is based on the ability to identify each other and, in case of micro-fusion, switch to a different protocol than Reactive-Streams (RS) if both agree.

Macro-fusion constructs


The primary targets of macro-fusion are the single element sources: just(), empty(), fromCallable(). Firing up the complete RS infrastructure for such single elements is quite expensive, but half of the API use in RxJava and Reactor comes from these. Therefore, RxJava introduced Single and Reactor introduced Mono to help as much as possible and offer (ever increasingly) optimized operators on them.

However, knowing at assembly time that a source will generate 0 or 1 element is a great help in regular Observable / Flux use as well. In addition, knowing that the source is a constant helps inline it via some custom operator.

Creating 0 or 1 element synchronous sources


To indicate a source returns a single value, the Reactive-Streams-Commons (Rsc) project (and Reactor off it) established a contract:

If a Publisher implements java.util.concurrent.Callable, it is considered a 0 or 1 element source.

You can implement Callable and return a non-null value that can be computed synchronously. You can also return null which indicates an empty result. (Remember, RS doesn't allow null values over onNext.) The call to call() will happen during subscription time.

public class MySingleSource implements Publisher<Object>, Callable<Object> {
    @Override
    public void subscribe(Subscriber<? super Object> s) {
        s.onSubscribe(new ScalarSubscription<>(s, System.currentTimeMillis()));
    }

    @Override
    public Object call() throws Exception {
        return System.currentTimeMillis();
    }
}

If the 0 or 1 element source is known to be constant, the source can be the subject of assembly time optimizations. For example, if it returns null, indicating emptiness (like empty()), there are only a handful of operators that can be applied to it (which don't work on items) and the assembly process can just return empty().

We can extend Callable with a new interface ScalarCallable to indicate a 0 or 1 element constant source.

public interface ScalarCallable<T> extends Callable<T> {
    @Override
    T call();
}


By extending Callable, any place that expects a dynamic 0 or 1 element source can also work with a constant source. The reverse is not true; those expecting a constant source won't execute an arbitrary Callable (which could block or trigger side-effects) during assembly time:

public class MyScalarSource implements Publisher<Object>, ScalarCallable<Object> {
    @Override
    public void subscribe(Subscriber<? super Object> s) {
        s.onSubscribe(new ScalarSubscription<>(s, 1));
    }

    @Override
    public Object call() {
        return 1;
    }
}

Note that the ScalarCallable overrides the call() method and removes the throws Exception clause: scalar constants should not throw for one and consumers should not need to wrap the call() into a try-catch.

Consuming 0 or 1 element synchronous sources

Consuming Callable and ScalarCallable is a matter of instanceof checks performed either in subscription time or assembly time respectively, followed by the extraction of the single value through call().

For example, a macro-fusion on the operator count() could check for a scalar value and return a constant 0 for an empty or 1 for a single value:


public final Flux<Long> count() {
    if (this instanceof ScalarCallable) {

        T value = ((ScalarCallable<T>)this).call();

        return just(value == null ? 0L : 1L);
    }
    return new FluxCount<>(this);
}


Another example is to have a shortcut in flatMap(), concatMap() or switchMap() for 0 or 1 element sources. In this case, there is no need to run the full infrastructure but just subscribe to the Publisher returned by their mapping function.

Note that since the mapper function can side-effect itself, one can't use assembly-time optimization on it and a new source operator has to be introduced.

public final <R> Px<R> flatMap(
        Function<? super T, ? extends Publisher<? extends R>> mapper) {

    if (this instanceof Callable) {

        return new PublisherCallableMap<>((Callable<T>)this, mapper);
    }

    return new PublisherFlatMap<>(this, mapper, ...);
}

(Remark: Px stands for Publisher Extensions in Rsc and is the base type for Rsc's fluent API - more of a convenience in tests and perf to avoid spelling out all those PublisherXXX classes than a fully fledged API entry point.)


public final class PublisherCallableMap<T, R> implements Publisher<R> {
    final Callable<? extends T> source;
    final Function<? super T, ? extends Publisher<? extends R>> mapper;

    public PublisherCallableMap(
            Callable<? extends T> source,
            Function<? super T, ? extends Publisher<? extends R>> mapper) {
        this.source = source;
        this.mapper = mapper;
    }

    @Override
    public void subscribe(Subscriber<? super R> s) {
        T value;

        try {
            value = source.call();   // (1)
        } catch (Throwable ex) {
            ExceptionHelper.throwIfFatal(ex);
            EmptySubscription.error(s, ex);
            return;
        }

        if (value == null) {
            EmptySubscription.complete(s);
            return;
        }

        Publisher<? extends R> p;

        try {
            p = mapper.apply(value);   // (2)
        } catch (Throwable ex) {
            ExceptionHelper.throwIfFatal(ex);
            EmptySubscription.error(s, ex);
            return;
        }

        if (p == null) {
            EmptySubscription.error(s,
                new NullPointerException("The mapper returned null"));
            return;
        }

        if (p instanceof Callable) {   // (3)
            R result;

            try {
                result = ((Callable<R>)p).call();
            } catch (Throwable ex) {
                ExceptionHelper.throwIfFatal(ex);
                EmptySubscription.error(s, ex);
                return;
            }

            if (result == null) {
                EmptySubscription.complete(s);
                return;
            }

            s.onSubscribe(new ScalarSubscription<>(s, result));

            return;
        }

        p.subscribe(s);
    }
}

First (1), we extract the single value from the underlying Callable instance. If it is null, we complete the Subscriber immediately. Otherwise, we call the mapper that returns a Publisher (2). Since this publisher could also be a Callable, we do the extraction again (3) and either complete or set a backpressure-enabled ScalarSubscription on the Subscriber. Because call() can throw, we catch the exceptions, rethrow the fatal ones in a library-specific way and signal the non-fatal ones to the Subscriber (while setting its Subscription at the right time).

Caution with Callable

Since Callable is an established interface, one must be careful with implementors of both Publisher and Callable where, functionally, the callable means something different from being a shortcut to a 0 or 1 element source.

My hope is that since RS is relatively new and only a few people have actually implemented operators with it, we can avoid any pitfalls related to this combined interface approach.


Micro-fusion constructs

Unlike macro-fusion, micro-fusion requires a protocol switch between two subsequent operators; instead of using the standard RS method calls, some or all of them get replaced by other method calls. This allows sharing internal structures or state between the two.

In theory, in a pair of operators, the upstream operator can be the initiator and work with the internals of the downstream operator. In practice, so far, we implemented fusion the other way around: the downstream operator works with the internals of the upstream operator.

However, going for a full custom interaction is not advised because that may lead to a complete custom implementation and duplication of a lot of code. (That being said, unfortunately, ConditionalSubscriber requires code duplication to avoid casting.)

Currently, Rsc and Reactor can do two kinds of micro-fusion: conditional and queue-based. On a second dimension, we can think of 3 kinds of operators:

  • sources that support fusion (range(), UnicastProcessor)
  • intermediate operators that may support fusion (concatMap, observeOn, groupBy, window)
    • front fusion (concatMap)
    • back fusion (groupBy)
    • transitive fusion (map, filter)
  • consumers (flatMap inner, zip)

The third dimension appears with queue-based fusion, where the source can be synchronous (e.g., fromArray) or asynchronous (UnicastProcessor).


Conditional micro-fusion


The conditional micro-fusion ability is indicated by an interface: ConditionalSubscriber extending Subscriber with one extra method:

public interface ConditionalSubscriber<T> extends Subscriber<T> {
    boolean onNextIf(T t);
}

If a source or intermediate operator sees that its consumer is a ConditionalSubscriber it may call the onNextIf method. (By nature, this means a synchronous execution and response, thus conditional fusion is for synchronous cases only.)

If the method returns true, the value has been consumed as usual. If the method returns false, it means the value was dropped and a new value can be sent immediately. This avoids a request(1) call for a replenishment in filter and other operators as well.

Sidenote: You may ask, why is this important? A call to request() usually ends up in an atomic CAS, costing 21-45 cycles for each dropped element.
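The saving can be illustrated with a tiny stand-in for the conditional contract (a sketch; ConditionalConsumer and the counters are made-up names, not Rsc API):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ConditionalDemo {
    // Minimal stand-in for the ConditionalSubscriber contract.
    interface ConditionalConsumer<T> {
        boolean onNextIf(T t); // true: consumed, false: dropped
    }

    public static void main(String[] args) {
        AtomicLong requestCalls = new AtomicLong();

        // A filter that keeps even numbers only.
        ConditionalConsumer<Integer> filter = v -> v % 2 == 0;

        // Plain path: every dropped value triggers a request(1) replenishment.
        for (int i = 0; i < 1000; i++) {
            if (!filter.onNextIf(i)) {
                requestCalls.incrementAndGet(); // simulated request(1)
            }
        }
        System.out.println("plain request(1) calls: " + requestCalls.get());

        // Conditional path: the source consults onNextIf directly and simply
        // emits the next value on false - no request(1) calls at all.
        long emitted = 0;
        for (int i = 0; i < 1000; i++) {
            if (filter.onNextIf(i)) {
                emitted++;
            }
        }
        System.out.println("conditional emitted: " + emitted
            + ", request(1) calls: 0");
    }
}
```

Half the values are dropped either way, but the conditional path replaces 500 atomic request(1) round-trips with 500 plain boolean returns.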

To work with ConditionalSubscribers in source operators, you may have to first switch on the incoming Subscriber's type and do a different implementation to avoid casting the downstream Subscriber all the time.


@Override
public void subscribe(Subscriber<? super Integer> s) {
    if (s instanceof ConditionalSubscriber) {

        s.onSubscribe(new RangeConditionalSubscription<>(
            (ConditionalSubscriber<? super Integer>)s, start, count));

    } else {
        s.onSubscribe(new RangeSubscription<>(s, start, count));
    }
}

The implementation can then use the onNextIf method during emissions. For example, the fast path can be rewritten as follows:


for (long i = start; i < (long)start + count; i++) {
    if (cancelled) {
        return;
    }
    s.onNextIf((int)i);
}
if (!cancelled) {
    s.onComplete();
}

You may think: why call onNextIf if we don't care about the return value? For composition reasons. Even though this path in range() doesn't need the return value, if the downstream is also calling onNextIf further down, this can avoid a whole chain of unnecessary request(1) calls.

The slow path is more interesting in this regard:


long i = index;
long end = (long)start + count;
long r = requested;
long e = 0L;

while (i != end && e != r) {
    if (cancelled) {
        return;
    }

    if (s.onNextIf((int)i)) {
        e++;
    }
    i++;
}

if (i == end) {
    if (!cancelled) {
        s.onComplete();
    }
    return;
}

if (e != 0L) {
    index = i;
    REQUESTED.addAndGet(this, -e);
}

In the while loop, if the onNextIf returns false, we don't increment the emission count which means the next integer value can come immediately. If a downstream consumer requests only 1 and then drops all values, the loop can exhaust the available integers and not call the atomic addAndGet even once.

Since filter is one of the most common operators in a chain, one should be prepared to work with ConditionalSubscriber even if one doesn't interfere with the number of events flowing through. For example, map() and filter often appear together, and it is advised that map() also support conditional fusion by switching on the Subscriber's type just like above and using a ConditionalSubscriber-based Subscriber:

static final class MapConditionalSubscriber<T, R> implements ConditionalSubscriber<T> {
    final ConditionalSubscriber<? super R> actual;

    final Function<? super T, ? extends R> mapper;

    boolean done;

    Subscription s;

    // ...

    @Override
    public boolean onNextIf(T t) {
        if (done) {
            return true;   // already terminated: no replenishment needed
        }

        R v;

        try {
            v = mapper.apply(t);
        } catch (Throwable ex) {
            ExceptionHelper.throwIfFatal(ex);
            s.cancel();
            onError(ex);
            return true;
        }

        if (v == null) {
            s.cancel();
            onError(new NullPointerException("..."));
            return true;
        }

        return actual.onNextIf(v);
    }

    // ...
}


The final case for conditional micro-fusion is the "terminal" operator or consumer implementation. Luckily, one usually doesn't have to provide two implementations, one ConditionalSubscriber and one Subscriber, but can have them together. Consumers that can work with the ConditionalSubscriber part will do so; the others will just use the regular Subscriber methods:


static final class FilterSubscriber<T> implements ConditionalSubscriber<T> {
    final Subscriber<? super T> actual;

    final Predicate<? super T> predicate;

    boolean done;

    Subscription s;

    // ...

    @Override
    public void onNext(T t) {
        if (!onNextIf(t)) {
            s.request(1);
        }
    }

    @Override
    public boolean onNextIf(T t) {
        if (done) {
            return true;
        }

        boolean b;

        try {
            b = predicate.test(t);
        } catch (Throwable ex) {
            ExceptionHelper.throwIfFatal(ex);
            s.cancel();
            onError(ex);
            return true;   // terminated: don't ask for replenishment
        }

        if (b) {
            actual.onNext(t);
            return true;
        }
        return false;      // dropped: the caller may replenish
    }

    // ...
}

In conclusion, conditional micro-fusion is a relatively simple but sometimes verbose way of avoiding request(1) calls and the resulting per-item overhead.


Queue-based micro-fusion

Believe me when I tell you: this is the most complicated thing, so far, in the reactive landscape. Not because it requires complicated structures or algorithms, but because of its implications for operators and the combinatorial explosion of what happens if op1 is followed by op2 and how they can or can't fuse.

The queue-based micro-fusion is built upon the idea that many operators employ a queue to work out backpressure-related or asynchrony-related cases when notifying the downstream and happen to face their queue towards each other.

For example, UnicastProcessor has a backend-queue that holds values until the downstream requests them whereas concatMap has a front-queue that holds the source values to be mapped into Publishers. When subscribed, a value goes from one queue into the other, forming a dequeue-enqueue pair without anything functional between the two other than the atomics overhead of request management and wip-accounting.

Clearly, if we could use a single queue between the two and decrease the atomics overhead through it, we'd have a much lower overhead in terms of computation and memory usage.

However, what if there is an operator between the two that does something with the values? What if the fusion shouldn't happen in this case?

To solve this coordination problem, we can reuse the onSubscribe(Subscription) rail in RS and extend the protocol. Enter QueueSubscription.


public interface QueueSubscription<T> extends Queue<T>, Subscription {

    int NONE = 0;
    int SYNC = 1;
    int ASYNC = 2;
    int ANY = SYNC | ASYNC;
    int THREAD_BOUNDARY = 4;

    int requestFusion(int mode);

    @Override
    default boolean offer(T t) {
        throw new UnsupportedOperationException();
    }

    // ...
}

The QueueSubscription is a combination of the Queue and Subscription interfaces, adding a new requestFusion() method. Other than the following methods kept from the base interfaces, everything is defaulted to throw UnsupportedOperationException as we won't need it. (Java 7 note: yes, you may have to do this manually for classes that can't extend a base class.):


  • void request(long n)
  • void cancel()
  • T poll()
  • boolean isEmpty()
  • void clear()
(Some libraries may choose to implement size() as well, for diagnostic purposes.)


When a source supports queue-based fusion, it can send a QueueSubscription implementation through onSubscribe. Those who can deal with it can act on it, the rest will simply see it as a regular Subscription.

The idea is that those who can deal with it can use it as a Queue instead of instantiating their own, saving on allocation and overhead at the same time. In addition, a source such as range() can itself pretend to be a queue, returning the next value through poll() or null if no more integers remain.

Since there are cases where fusion can't or should not happen, we need to perform a protocol switch during the subscription phase of a flow. This switch can be requested via the requestFusion() method, that takes and returns the constants from the interface.

(Sidenote: I know enums would be more readable, but EnumSet has a nice additional overhead you know...)

As an input, it can take:


  • SYNC - indicates the consumer wants to work with a synchronous upstream, with often known length
  • ASYNC - indicates the consumer wants to work with an asynchronous upstream with often unknown length and emission timing
  • ANY - indicates a consumer can work with both SYNC and ASYNC upstream
  • (SYNC, ASYNC) | THREAD_BOUNDARY - indicates that the consumer goes over a thread boundary and poll() happens on some other thread.

It can return:


  • NONE - fusion can't happen/rejected
  • SYNC - synchronous fusion mode activated
  • ASYNC - asynchronous fusion mode activated
If the upstream is unable to work in the requested mode or is sensitive to thread-boundary effects, it can return NONE. In this case, the flow behaves just like the regular, non-fused RS stream would. (Note that conditional fusion may still be an option.)
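The bitwise negotiation can be sketched in a few lines (a hedged illustration; the two requestFusion variants mimic, but are not, the actual Rsc implementations):

```java
public class FusionModeDemo {
    static final int NONE = 0;
    static final int SYNC = 1;
    static final int ASYNC = 2;
    static final int ANY = SYNC | ASYNC;
    static final int THREAD_BOUNDARY = 4;

    // A source that only supports ASYNC fusion (like UnicastProcessor).
    static int requestFusionAsyncOnly(int mode) {
        if ((mode & ASYNC) != 0) {
            return ASYNC;
        }
        return NONE;
    }

    // An intermediate operator with a user function (like map()): it refuses
    // to fuse across a thread boundary, because its function would otherwise
    // run on the consumer's thread; otherwise it forwards the request.
    static int requestFusionMap(int mode) {
        if ((mode & THREAD_BOUNDARY) != 0) {
            return NONE;
        }
        return requestFusionAsyncOnly(mode);
    }

    public static void main(String[] args) {
        System.out.println(requestFusionAsyncOnly(SYNC));            // NONE
        System.out.println(requestFusionAsyncOnly(ANY));             // ASYNC
        System.out.println(requestFusionMap(ANY | THREAD_BOUNDARY)); // NONE
        System.out.println(requestFusionMap(ANY));                   // ASYNC
    }
}
```

Requesting ANY lets the upstream pick whichever mode it supports, while the THREAD_BOUNDARY bit vetoes fusion through operators carrying user code.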

Because fusion is optional, a successfully negotiated mode requires a different mode of execution in either or both parties. In addition, this mode switch has to happen before any events fly through the chain; therefore, onSubscribe is an ideal place for it due to the underlying RS protocol spec.

Both SYNC and ASYNC modes have extra rules implementors must adhere to.

In SYNC mode, consumers should never call request() and producers should never return null from poll() unless they mean completion. Since the only interaction between the two are through poll() and isEmpty(), sources have no opportunity to call onError but must throw a runtime exception from these two methods. On the other side, consumers now have to wrap these methods into try-catches and handle/unwrap exceptions there.
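A SYNC-mode consumption loop can be sketched with plain JDK types (IteratorQueue is a made-up stand-in for a fused source's poll() side, not a real Rsc class):

```java
import java.util.Arrays;
import java.util.Iterator;

public class SyncFusionDemo {
    // A SYNC-fused source: poll() returns the next value, or null for
    // completion; errors would be thrown from poll() directly.
    static final class IteratorQueue<T> {
        final Iterator<T> it;
        IteratorQueue(Iterator<T> it) { this.it = it; }
        T poll() {
            return it.hasNext() ? it.next() : null; // null == completed
        }
    }

    public static void main(String[] args) {
        IteratorQueue<Integer> q =
            new IteratorQueue<>(Arrays.asList(1, 2, 3, 4, 5).iterator());

        long sum = 0;
        // SYNC consumer: never calls request(), just polls until null.
        for (;;) {
            Integer v;
            try {
                v = q.poll();  // SYNC rule: wrap poll() in try-catch
            } catch (RuntimeException ex) {
                System.out.println("onError: " + ex);
                return;
            }
            if (v == null) {
                break;         // SYNC rule: null means completion
            }
            sum += v;
        }
        System.out.println("onComplete, sum = " + sum);
    }
}
```

Note how the consumer, not the producer, fully drives the pace: there is no request() anywhere, and completion and error both surface through poll() itself.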

In ASYNC mode, the producer enqueues events in its own queue and has to signal their availability to the consumer. The best way for this is through onNext. One can either signal the value itself or null - the only place where null is allowed. On the consumer side, the value in an ASYNC-mode onNext is meaningless and should be ignored. The other methods, onError, onComplete, request and cancel, are used as in regular RS cases. In this mode, poll() can return null to indicate a temporary lack of values; termination is still indicated by onError and onComplete as usual.


Implementing fusion-enabled sources

Now let's see the API in action. First, let's make range() fusion enabled:

static final class RangeSubscription implements QueueSubscription<Integer> {

    // ... the Subscription part is the same

    @Override
    public Integer poll() {
        long i = index;
        if (i == (long)start + count) {
            return null;
        }
        index = i + 1;
        return (int)i;
    }

    @Override
    public boolean isEmpty() {
        return index == (long)start + count;
    }

    @Override
    public void clear() {
        index = (long)start + count;
    }

    @Override
    public int requestFusion(int mode) {
        if ((mode & SYNC) != 0) {
            return SYNC;
        }
        return NONE;
    }
}

No sign of request accounting whatsoever, because range() works in synchronous pull mode; the consumer exerts backpressure by calling poll() when it needs a new value.
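Stripped of the Rsc types, the same poll()/isEmpty() logic and its pull-style consumption look like this (RangeQueue is a hypothetical stand-alone version, for illustration only):

```java
public class RangeQueueDemo {
    // The queue face of a range source, free of any request accounting.
    static final class RangeQueue {
        final int start;
        final int count;
        long index;

        RangeQueue(int start, int count) {
            this.start = start;
            this.count = count;
            this.index = start;
        }

        Integer poll() {
            long i = index;
            if (i == (long) start + count) {
                return null; // SYNC mode: null signals completion
            }
            index = i + 1;
            return (int) i;
        }

        boolean isEmpty() {
            return index == (long) start + count;
        }
    }

    public static void main(String[] args) {
        RangeQueue q = new RangeQueue(1, 10);
        long sum = 0;
        // The consumer pulls exactly when it is ready - that is backpressure.
        for (Integer v; (v = q.poll()) != null; ) {
            sum += v;
        }
        System.out.println(sum + " " + q.isEmpty());
    }
}
```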

UnicastProcessor (which is somewhat like onBackpressureBuffer()) can support fusion in ASYNC mode specifically:


public final class UnicastProcessor<T> implements Processor<T, T>, QueueSubscription<T> {

    volatile Subscriber<? super T> actual;

    final Queue<T> queue;

    int mode;

    // ...

    @Override
    public void onNext(T t) {
        queue.offer(t);
        Subscriber<? super T> a = actual;
        if (mode == ASYNC && a != null) {
            a.onNext(null);   // signal availability; the consumer will poll()
        } else {
            drain();
        }
    }

    @Override
    public int requestFusion(int m) {
        if ((m & ASYNC) != 0) {
            mode = ASYNC;
            return ASYNC;
        }
        return NONE;
    }

    @Override
    public T poll() {
        return queue.poll();
    }

    @Override
    public boolean isEmpty() {
        return queue.isEmpty();
    }

    @Override
    public void clear() {
        queue.clear();
    }

    @Override
    public void subscribe(Subscriber<? super T> s) {
        if (ONCE.compareAndSet(this, 0, 1)) {
            s.onSubscribe(this);
            actual = s;
            if (cancelled) {
                actual = null;
            } else {
                if (mode != NONE) {
                    if (done) {
                        if (error != null) {
                            s.onError(error);
                        } else {
                            s.onComplete();
                        }
                    } else {
                        s.onNext(null);
                    }
                } else {
                    drain();
                }
            }
        } else {
            EmptySubscription.error(s, new IllegalStateException("..."));
        }
    }
}

The fusion mode requires the following behavior changes:


  • onNext still has to enqueue the value, but then signal availability via actual.onNext(null) instead of calling drain(),
  • requestFusion has to see if the downstream actually wants ASYNC fusion,
  • the queue methods have to be delegated to the instance queue,
  • the subscribe() has to call actual.onNext instead of drain() as well.

Doesn't look too complicated, does it? At this point, you can check your understanding of supporting fusion through an exercise: can UnicastProcessor support SYNC fusion and if so, when and how; if not, why not?


Implementing fusion-enabled intermediate operators

In practice, there are usually some intermediate operators between a fuseable source and a fusion-enabled consumer. Unfortunately, this can break the fusion (thus reverting to the classical RS mode) or worse: the data may skip the intermediate operator altogether, causing all sorts of failures.

The latter manifests itself when an operator forwards the Subscription it received via its onSubscribe method. Now imagine if map() does this; what would be the output of the following sequence:

range(0, 10).map(v -> v + 1).concatMap(v -> just(v)).subscribe(System.out::println);

In a classical flow, you'd get values 1 through 10 printed to the console. If both range() and concatMap() do fusion but map() forwards its Subscription, the surprising output is 0 through 9! This can affect any operator.

The solution is to require all operators that don't want to participate in fusion to never forward the upstream's Subscription verbatim. A possible manifestation of this rule is to implement Subscription on yourself:


static final class MapSubscriber<T, R> implements Subscriber<T>, Subscription {
    // ...

    @Override
    public void onSubscribe(Subscription s) {
        this.s = s;

        actual.onSubscribe(this);
    }

    @Override
    public void request(long n) {
        s.request(n);
    }

    @Override
    public void cancel() {
        s.cancel();
    }

    // ...
}

In practice, many operators that manipulate either requests or cancellation do this, so the indirection is an acceptable trade-off for the benefit of a lower-overhead dataflow in general.

This rule unfortunately affects cross-library behavior. Even though other libraries may not speak the same fusion protocol, they could end up forwarding Subscriptions, so if you go into and out of some other library, the same problem may appear again. Generally, libraries are supposed to have a method hide() or asObservable() to hide the identity of a source as well as to prevent the propagation of unwanted internal features.

Luckily, map() can participate in the fusion: it only has to be fuseable itself, mediate the requestFusion between its upstream and downstream, plus place itself at the exit point: poll().


static final class MapSubscriber<T, R> implements Subscriber<T>, QueueSubscription<R> {
    final Subscriber<? super R> actual;

    final Function<? super T, ? extends R> mapper;

    QueueSubscription<T> qs;

    Subscription s;

    int mode;

    // ...

    @Override
    public void onSubscribe(Subscription s) {
        this.s = s;
        if (s instanceof QueueSubscription) {
            qs = (QueueSubscription<T>)s;
        }

        actual.onSubscribe(this);
    }

    @Override
    public void onNext(T t) {
        if (mode == NONE) {

            // error handling omitted for brevity

            actual.onNext(mapper.apply(t));

        } else {
            actual.onNext(null);
        }
    }

    @Override
    public int requestFusion(int m) {
        if (qs == null || (m & THREAD_BOUNDARY) != 0) {
            return NONE;
        }
        int u = qs.requestFusion(m);
        mode = u;
        return u;
    }

    @Override
    public R poll() {
        T t = qs.poll();
        if (t == null) {
            return null;
        }
        return mapper.apply(t);
    }

    @Override
    public boolean isEmpty() {
        return qs.isEmpty();
    }

    @Override
    public void clear() {
        qs.clear();
    }
}

The operator map() can implement QueueSubscription itself and have a field for the potential upstream's QueueSubscription as well. In requestFusion, if the upstream does support fusion and the downstream isn't a boundary, the request is forwarded to upstream; rejected otherwise.

Now poll() can't just forward to the upstream because the types are different. Here comes the mapper function that is applied to the upstream's value. Note that null indicates termination or temporary lack of values and should not be mapped.

The main reason THREAD_BOUNDARY was introduced as a flag is map(), or in a broader sense, the restriction on where user-supplied computations happen. In fusion mode, the execution of the mapper function happens on the exit side of the queue, which could be on some other thread. Now imagine you have a heavy computation in map which would run off the main thread before reaching an observeOn. When unfused, the result of the computation would be queued up in observeOn, then dequeued on the target thread (let's say the main thread). However, if fusion is allowed, the target thread is doing the poll() and now the heavy calculation runs on the main thread.
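This thread relocation can be made visible with a plain two-thread sketch (no Rx types; heavy() and the thread names are made up for illustration):

```java
import java.util.concurrent.*;

public class BoundaryDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService main = Executors.newSingleThreadExecutor(
            r -> new Thread(r, "main-thread"));

        // Unfused: the mapper runs on the producer thread, and only the
        // already-computed result crosses the boundary queue.
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        Thread.currentThread().setName("producer-thread");
        queue.offer(heavy(21)); // computed on producer-thread

        // Fused: the consumer would poll() through the mapper, so heavy()
        // would run on main-thread - the effect THREAD_BOUNDARY prevents.
        Future<String> where = main.submit(
            () -> "fused heavy() would run on " + Thread.currentThread().getName());
        System.out.println(where.get());
        System.out.println("unfused result: " + queue.poll());
        main.shutdown();
    }

    static int heavy(int v) {
        System.out.println("heavy() on " + Thread.currentThread().getName());
        return v * 2;
    }
}
```

Rejecting fusion when the downstream flags THREAD_BOUNDARY keeps the user's computation on the thread the operator order suggests.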


The operator filter() can be implemented in a similar fashion, but our old request(1) comes back unfortunately:


static final class FilterSubscriber<T> implements Subscriber<T>, QueueSubscription<T> {
    // ...

    @Override
    public T poll() {
        for (;;) {
            T v = qs.poll();

            if (v == null || cancelled) {
                return null;
            }

            if (predicate.test(v)) {
                return v;
            }

            if (mode == ASYNC) {
                qs.request(1);
            }
        }
    }

    @Override
    public boolean isEmpty() {
        return qs.isEmpty();
    }

    // ...
}

Since filter() drops values, we need to loop in poll() until the predicate matches or no more upstream values are available for some reason. If the predicate doesn't match, we have to replenish our ASYNC source (remember, you are not supposed to call request() in sync mode!).


Implementing fusion-enabled consumers

Generally, operator fusion is not very useful with (nor does it really happen at) end-subscribers, such as your favorite Subscriber subclass or subscribe(System.out::println).

The consumers I'm talking about can be considered intermediate operators as well, but since all operators are basically custom Subscribers that are subscribed to the upstream, they are consumers as well.

As I mentioned, many operators feature some internal queue on their front side (e.g., concatMap, observeOn) or when they consume some inner Publisher (e.g., flatMap, zip). These are the primary consumers and drivers of the fusion lifecycle.

Now that we are familiar with how observeOn is implemented, let's see how we can enable fusion with it:


static final class ObserveOnSubscriber<T> implements Subscriber<T>, Subscription {

    Queue<T> queue;

    int mode;

    Subscription s;

    // ...

    @Override
    public void onSubscribe(Subscription s) {
        this.s = s;

        if (s instanceof QueueSubscription) {
            QueueSubscription<T> qs = (QueueSubscription<T>)s;

            int m = qs.requestFusion(QueueSubscription.ANY
                    | QueueSubscription.THREAD_BOUNDARY);

            if (m == QueueSubscription.SYNC) {
                queue = qs;
                mode = m;
                done = true;

                actual.onSubscribe(this);

                return;
            }

            if (m == QueueSubscription.ASYNC) {
                queue = qs;
                mode = m;

                actual.onSubscribe(this);

                s.request(prefetch);

                return;
            }
        }

        queue = new SpscArrayQueue<>(prefetch);

        actual.onSubscribe(this);

        s.request(prefetch);
    }

    @Override
    public void onNext(T t) {
        if (mode == QueueSubscription.NONE) {
            queue.offer(t);
        }

        drain();
    }

    void drain() {

        // ...

        if (mode != QueueSubscription.SYNC) {
            s.request(p);
        }

        // ...

    }

    // ...

}

Enabling fusion has two implications: 1) queue can no longer be final but has to be created in onSubscribe, 2) onNext should not offer if fusion is enabled.

The fusion mode is requested in onSubscribe after identifying the upstream as a QueueSubscription. Since the algorithm inside drain() only sees the Queue interface and doesn't particularly care when values become available in the queue, we request the ANY mode from upstream while also indicating that this consumer is a THREAD_BOUNDARY. This should prevent the poll() side from unexpectedly relocating some user-defined function.

If SYNC mode is granted, we assign the QueueSubscription to our queue and call onSubscribe on the downstream Subscriber. In this mode, the prefetch amount is not requested, in accordance with the synchronous fusion protocol. The big win in SYNC mode is that a null from poll() is an indication of termination. We already exploit this in the standard queue-drain algorithm: if the done flag is set and the queue reports null/empty, we have completed. Note, however, that we have to adjust the drain algorithm a bit because we can't call request() in SYNC mode anymore.

If ASYNC mode is granted, we store the queue again, but can't set the done flag as we don't know when the upstream finishes - poll() returning null is just the indication of unavailability of values at the time. In addition, once the downstream Subscriber is notified, we still have to signal a prefetch-request to upstream, so it can trigger its own sources even further up.

Note that once requestFusion returns SYNC or ASYNC, there is no going back. (You may try to call requestFusion() again, which may change the mode, but that's undefined behavior at the moment and may be forbidden entirely in the future.) Changing modes is definitely not possible after elements have already been delivered in any mode.
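To make this one-shot nature concrete, here is a self-contained sketch of the negotiation; the mode constants are local stand-ins for illustration, not the actual Rsc QueueSubscription API. The source grants SYNC when asked for ANY or SYNC and remembers the decision so later calls can't change it:

```java
// Self-contained sketch (constants are local, not the Rsc API) of a
// one-shot requestFusion negotiation by a synchronous source.
public class FusionNegotiation {
    static final int NONE = 0, SYNC = 1, ASYNC = 2, ANY = SYNC | ASYNC;

    int established = -1;  // -1 = not negotiated yet

    public int requestFusion(int requestedMode) {
        if (established >= 0) {
            return established;        // no going back once granted
        }
        if ((requestedMode & SYNC) != 0) {
            return established = SYNC; // this sketch is a synchronous source
        }
        return established = NONE;
    }
}
```

A second call, even asking for a different mode, simply returns the already-established answer.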

General warnings around micro-fusion

In my experience, some of my colleagues tend to become enthusiastic about micro-fusion; they want to apply it everywhere. Whenever an operator has any queue, they see fusion happening.

I must warn against such relentlessness because fusion has some requirements and implications, and is generally subject to cost-benefit trade-offs:


  • If an operator is a thread boundary, my current understanding is that you can't fuse both its front and back side at the same time.
  • Fusion can shift computations in time and sometimes in location (even without an explicit boundary).
  • The fact an operator has a queue doesn't mean it can be exposed/replaced. A good example of this is combineLatest: my current understanding is that the post-processing of the queue elements makes this infeasible for back side fusion. Another example is flatMap where I'm not convinced the collector logic can be integrated into a poll()/isEmpty() back-side fusion.
  • Some sources, such as those of 0 or 1 elements, are likely not worth micro-fusing and are better off with macro-fusions.
  • Fusion is an extra behavior which also can be buggy or in fact, hide a bug on the regular path (i.e., groupBy) and requires extra care. In addition, it increases the test method count because now you have to test with and without fusion (see hide()).

To cheer you up, there is a great counter-example operator that supports full fusion, front and back side at the same time: flattenIterable, or as you may know it, concatMapIterable/flatMapIterable.

Conclusion

In this post, I've detailed the structures and protocols of operator fusion and shown some examples of how it can be utilized in source, intermediate and terminal operators.

Since operator fusion is an active research area, I can't say these are all the cases that can happen, and we are eager to hear about interesting chains of operators where fusion can happen or, in contrast, where fusion should not happen. See the Rsc repository for examples of all kinds of fusions.

In addition, I hope these fusion protocols will be standardized and be part of Reactive-Streams 2.0, allowing a full, cross-library efficient operation that maintains fusion as long as possible.

My next topic will be to finish up the series about ConnectableObservables.

Google Agera vs. ReactiveX


Introduction


If you are following events around Android development, or just happen to follow all things reactive, there was a "big" announcement from Google: they've released their reactive programming library targeting Android specifically: Agera. Of course, one has to look into the details to get an accurate picture.

"By Google" means a team in Google working on Google Play Movies. Certainly it sounds more amplified to say Google than the full path to the team. I happen to do this as well when someone asks where I work: in a lab at the Hungarian Academy of Sciences instead of at the Engineering and Management Intelligence Research Laboratory at the Institute for Computer Science and Control of the Hungarian Academy of Sciences. (Plus, you don't get tired and lost while I'm emitting these words :)

It doesn't really matter who released it; what matters is what they released and how it relates to the well-established reactive libraries: RxJava, Reactor and Akka-Streams.


The Core API

The Agera library is built around the valueless Observer pattern: Observables take Updatables and signal change via update() calls. It is then the responsibility of those Updatables to figure out what changed. This is practically a zero argument reactive dataflow which relies on side-effects per update().


interface Updatable {
    void update();
}

interface Observable {
    void addUpdatable(Updatable u);
    void removeUpdatable(Updatable u);
}


They look innocent and reactive, right? Unfortunately, they've run into the same issue as the original java.util.Observable and the other addListener/removeListener-based reactive APIs (which I categorized as 0th generation).

Agera Observable


The problem with this pair of methods is that every Observable that adds behavior over an incoming Updatable has to remember the original Updatable in some way for the case when that same Updatable is removed:


public final class DoOnUpdate implements Observable {
    final Observable source;

    final Runnable action;

    final ConcurrentHashMap<Updatable, DoOnUpdatable> map;

    public DoOnUpdate(Observable source, Runnable action) {
        this.source = source;
        this.action = action;
        this.map = new ConcurrentHashMap<>();
    }

    @Override
    public void addUpdatable(Updatable u) {
        DoOnUpdatable wrapper = new DoOnUpdatable(u, action);
        if (map.putIfAbsent(u, wrapper) != null) {
            throw new IllegalStateException("Updatable already registered");
        }
        source.addUpdatable(wrapper);
    }

    @Override
    public void removeUpdatable(Updatable u) {
        DoOnUpdatable wrapper = map.remove(u);
        if (wrapper == null) {
            throw new IllegalStateException("Updatable already removed");
        }
        source.removeUpdatable(wrapper);
    }

    static final class DoOnUpdatable implements Updatable {
        final Updatable actual;

        final Runnable run;

        public DoOnUpdatable(Updatable actual, Runnable run) {
            this.actual = actual;
            this.run = run;
        }

        @Override
        public void update() {
            run.run();
            actual.update();
        }
    }
}



This causes a contention point between independent downstream Updatables at every stage of a pipeline.

True, a similar contention point can be found with RxJava's Subjects and ConnectableObservables, but operators chained after them don't have this contention. Unfortunately, the Reactive-Streams spec, in its current version, mandates something similar from Publishers. RxJava 2.x, Rsc and Reactor completely ignored this, as it turned out to be over-restrictive in practice, and we are pushing back to lighten the spec instead.

The second problem, although minor, is that you can't add the same Updatable multiple times: first, because you can't distinguish between the different "subscriptions" via the Map, and second, because the spec mandates throwing an exception. Usually this rarely matters because most end-consumers are solo.

The third problem is a bigger issue: throwing when the Updatable is no longer registered with the Observable. This creates an unfortunate race condition between end-consumers triggering removal while some intermediate operator such as take also triggers it; one of them will get an exception. This is why modern reactive libraries have idempotent cancellation.

The fourth problem is that, in theory, addUpdatable and removeUpdatable can race with each other: some downstream operator may want to disconnect before an upstream operator has actually called addUpdatable. A possible outcome is that the removeUpdatable call throws yet the addUpdatable succeeds, causing the signals to flow anyway and causing an unwanted retention of all associated objects.

Agera Updatable

Let's see the API from the consumer's perspective. Updatable is a single method functional interface which makes it easy to attach a listener to an Observable:


Observable source = ...

source.addUpdatable(() -> System.out.println("Something happened"));


Simple enough, now let's remove our listener:


source.removeUpdatable(() -> System.out.println("Something happened"));


Which yields a nice Exception: the two lambdas are not the same instance/reference. This is a very common problem with addListener/removeListener based APIs. The solution is to store the lambda in a reference and use that when needed:


Updatable u = () -> System.out.println("Something happened");

source.addUpdatable(u);

// ...

source.removeUpdatable(u);
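To see concretely why the first attempt failed, here is a self-contained sketch; the Agera-style interfaces are recreated locally for illustration, and removeUpdatable here returns false instead of throwing, but the cause is the same: two textually identical lambdas are distinct objects, so the removal cannot find the registered one.

```java
import java.util.HashSet;
import java.util.Set;

// Self-contained sketch (Agera-style interfaces recreated locally) showing
// why removing via a second, identical-looking lambda fails.
public class LambdaIdentity {
    interface Updatable { void update(); }

    static class SimpleObservable {
        final Set<Updatable> updatables = new HashSet<>();
        void addUpdatable(Updatable u) { updatables.add(u); }
        boolean removeUpdatable(Updatable u) {
            // false if this exact instance was never registered
            return updatables.remove(u);
        }
    }

    public static boolean tryRemoveWithFreshLambda() {
        SimpleObservable source = new SimpleObservable();
        source.addUpdatable(() -> System.out.println("Something happened"));
        // a textually identical but distinct lambda instance:
        return source.removeUpdatable(() -> System.out.println("Something happened"));
    }
}
```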

A small inconvenience indeed, but it gets worse. What if you have many Observables and many Updatables? You have to remember who is registered with whom and keep references to them in some fields. One of the great ideas of the original Rx.NET design was to reduce this necessity to a single reference:


interface Removable extends Closeable {
    @Override
    void close(); // remove the necessity of try-catch around close()
}

public static Removable registerWith(Observable source, Updatable consumer) {
    source.addUpdatable(consumer);
    return () -> source.removeUpdatable(consumer);
}


Of course, we have to consider idempotence of calling close() here as well:


public static Removable registerWith(Observable source, Updatable consumer) {
    source.addUpdatable(consumer);
    final AtomicBoolean once = new AtomicBoolean();
    return () -> {
        if (once.compareAndSet(false, true)) {
            source.removeUpdatable(consumer);
        }
    };
}

Agera MutableRepository

The Agera MutableRepository holds a value and signals update() to registered Updatables when the value changes. This somewhat resembles our BehaviorSubject, with the distinction that the new value doesn't flow to the consumers (remember, update() has no arguments) but has to be retrieved via get() from the repository:


MutableRepository repo = Repositories.mutableRepository(0);

repo.addUpdatable(() -> System.out.println("Value: " + repo.get()));

new Thread(() -> {
    repo.accept(1);
}).start();


When created via the factory method, it has the interesting property that the observation of the update() happens on the Looper the repository was created on. (A Looper is like a per-thread trampoline scheduler/Executor that lets one execute code on a specific thread, such as the Android main thread.)

This out-of-band property creates an interesting case:


Set<Integer> set = new HashSet<>();

MutableRepository repo = Repositories.mutableRepository(0);

repo.addUpdatable(() -> set.add(repo.get()));

new Thread(() -> {
    for (int i = 0; i < 100_000; i++) {
        repo.accept(i);
    }
}).start();

Thread.sleep(20_000);

System.out.println(set.size());


Assuming 20 seconds is enough, what is the final size of the Set? One would expect it to contain all 100,000 integers. In reality, the size can be anywhere between 1 and 100,000! The reason is that accept() and get() run concurrently, and if the consumer is slower, accept() simply overwrites the current value in the repository.

In some cases, this may be acceptable (i.e., similar to when onBackpressureDrop is applied in RxJava); sometimes it's not, and you may end up spending a lot of time hunting for lost values.
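The loss can be demonstrated deterministically with a minimal stand-in (not the actual Agera class) that, like the MutableRepository above, retains only the latest value:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of why values get lost: the repository only keeps the
// latest value, so a consumer that samples after a burst of accept()
// calls sees just the final one.
public class LostValues {
    static final class LatestOnlyRepository<T> {
        final AtomicReference<T> value = new AtomicReference<>();
        void accept(T t) { value.set(t); }   // overwrite, no queueing
        T get() { return value.get(); }
    }

    public static int lastSeen() {
        LatestOnlyRepository<Integer> repo = new LatestOnlyRepository<>();
        for (int i = 0; i < 100_000; i++) {
            repo.accept(i);                  // 99_999 overwrites
        }
        return repo.get();                   // only the last value survives
    }
}
```

Here the "consumer" wakes up exactly once, after the producer finished, and finds a single value where 100,000 were emitted.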

Error handling

Being asynchronous usually means you have asynchronous errors. RxJava and the others compose nicely in this regard: somebody errors out, the whole processing graph is cleaned up automatically unless the programmer wishes otherwise by suppressing, replacing or retrying the flow. The error and cleanup can be very complicated in some cases, but we library developers put in a lot of effort so you don't have to worry about it most of the time.

The Agera base API doesn't handle errors by itself; you have to do it out-of-band, just like with values. If you have multiple services composed via Agera, you have to establish the same error-management "framework" you'd need in callback-hell situations. This is very cumbersome and error-prone by itself due to concurrency and terminal-state considerations.


Termination

Again, Agera doesn't have a notion of a completed stream - you have to figure out on your own when that happens. This might not be an issue in GUI cases where your consumer starts with your activity, ends with it as well, and signals are delivered continuously. However, asynchronous background Observables now have to somehow tell or specify how many signals they will emit, and you have to figure out whether a missing update() simply means there is no data available.

How to design a modern zero-parameter reactive API

First of all, perhaps you shouldn't bother with one and just use an existing library for this:


rx.Observable<Void> signaller = ...

rx.Observer<Void> consumer = ...

Subscription s = signaller.subscribe(consumer);

// ...

s.unsubscribe();


You get all the infrastructure, operators and performance from them at basically no additional cost. Better yet, if you generally want to deal with signals of values, you can use the appropriate type instead of Void.

If an existing library feels too cumbersome to learn due to a lot of operators, you can perhaps fork it, delete the unnecessary stuff and use that. Of course, now you have to keep up with bugfixes and performance enhancements.

If forking and pruning doesn't sound attractive, you can develop your own library on top of the Reactive-Streams specification; Publisher<Void>, Subscriber<Void> and all the things between them you need. You get practically free interop with other Reactive-Streams libraries and consumers, plus, you can test your solution via its Test Compatibility Kit (TCK).

Of course, writing a reactive library is hard, and writing a reactive library over Reactive-Streams is even harder. As a last resort, you may decide to write a barebones API from scratch.

If you really want to do a zero-argument reactive flow, here are a few tips you should consider:

1) Don't have separate addListener and removeListener. A single entry point simplifies the development of intermediate operators:


interface Observable {
    Removable register(Updatable u);
}

interface Removable {
    void remove();
}
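A minimal sketch of how such a single entry point could look in practice (all names here are illustrative, not from any existing library), including an idempotent Removable so a second remove() is a harmless no-op:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of tip 1: register() hands back an idempotent Removable,
// so consumers never juggle add/remove pairs or fear double-removal.
public class SingleEntryPoint {
    interface Updatable { void update(); }
    interface Removable { void remove(); }

    static final class SimpleSource {
        final List<Updatable> updatables = new CopyOnWriteArrayList<>();

        Removable register(Updatable u) {
            updatables.add(u);
            AtomicBoolean once = new AtomicBoolean();
            return () -> {
                if (once.compareAndSet(false, true)) {  // idempotent
                    updatables.remove(u);
                }
            };
        }

        void signal() {
            for (Updatable u : updatables) {
                u.update();
            }
        }
    }

    public static int countSignals() {
        SimpleSource source = new SimpleSource();
        int[] count = { 0 };
        Removable r = source.register(() -> count[0]++);
        source.signal();     // delivered
        r.remove();
        r.remove();          // second remove is a no-op, no exception
        source.signal();     // no longer delivered
        return count[0];
    }
}
```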

2) Consider injecting the cancellation/remove support instead of returning a cancellation token or remover action:


interface Observable {
    void register(Updatable u);
}

interface Updatable {
    void onRegister(Removable remover);
    void update();
}

// or

interface Updatable {
    void update(Removable remover);
}

3) Consider adding at least error signal delivery:

Certainly, this complicates the lives of the library writers but can save a lot of trouble on the side of your library's users.

interface Updatable {
    void onRegister(Removable remover);
    void update();
    void error(Throwable ex);
}


4) Consider offering an asynchronous boundary as an option in the sequence.

For example, with the MutableRepository case, you may want to react to the new value on the caller's thread before moving back to the main thread. This means observeOn, and perhaps subscribeOn if you intend to have cold sources.


Conclusion

Writing a reactive library is not an easy task and one can fall into a lot of mistakes if one is not familiar with the history and evolution of the field. In many companies, the "not invented here" or "we can do better" attitude is so strong that they would rather start from scratch than learn from and build upon somebody else's working solution.

(Funny thing: I sometimes offer RxJava for an in-house project and still get raised eyebrows, even though it was, for the most part, practically "developed here".)

You may ask, why do I care what Google/Agera does? Aren't I confident in RxJava? Of course I am, and Agera's existence doesn't really bother me.

However, my experience shows that if you have a big-name banner over your head, unchallenged self-confidence and a sub-par outcome may be forced upon an entire community. I don't really want to give out ideas here, but imagine if the next Android version mandated Agera, in its current form, as the standard for asynchronous programming!

(In addition, interop is inevitable at some point and I really don't want to get complaints on the main RxJava issue list if the two don't work together properly.)

Let me finish with a wisdom I came up with (as there are now 2 cases to back it up):

You want to write a reactive library? Please don't (just yet)!

Async Iterable/Enumerable vs. Reactive-Streams


Introduction


Backpressure is essential if one wants to avoid buffer bloat and excessive memory usage if two stages in a reactive pipeline consume events with different speed. RxJava and Reactive-Streams developed a non-blocking, request-coordinating protocol to solve this problem, but you may have heard there are alternatives to it. One alternative that comes up from time to time is Async Iterables (Java terminology) or Async Enumerables (C# terminology).

In fact, Rx.NET has an Ix.NET (stands for Interactive Extensions) sub-project which contains the Async Enumerables library. It solves the backpressure problem by having a Task (~ CompletableFuture, ~ Promise) returned from its MoveNext() (~ hasNext()) method, and when that Task fires, you can consume the Current property (~ next() method). The backpressure behavior comes from the fact that you'd call MoveNext() again only after you have processed the current element.

Unfortunately, I haven't found a Java implementation of IAsyncEnumerable (I haven't really looked beyond a few Google searches), so I decided to implement it on my own in Java 8, see what it takes to get data across with it and how performant it is compared to my current cutting-edge understanding of reactive flows: the Reactive-Streams-Commons library.


Base API


Since Async Enumerables are designed with deferred execution in mind, the base API consists of two interfaces:

interface IAsyncEnumerable<T> {
    IAsyncEnumerator<T> enumerator();
}

interface IAsyncEnumerator<T> {

    CompletionStage<Boolean> moveNext(CompositeSubscription cancel);

    T current();
}

The IAsyncEnumerable is the equivalent of Iterable and it hands out IAsyncEnumerators. IAsyncEnumerator has a moveNext() method which returns a CompletionStage indicating whether there is a value available via current() (signals true) or the sequence has ended (signals false). The C# CancellationToken looks like our CompositeSubscription, so I'm reusing the latter as the means of cancellation.

(Sidenote: I'm not sure how cancellation composes yet, the original Ix.NET IAsyncEnumerator is an IDisposable plus their Task can also be disposed, unlike CompletionStage. Luckily, I don't need this feature too extensively in this post.)


Consuming an IAsyncEnumerable


Consuming such an IAsyncEnumerator is straightforward, although not as convenient without C#'s async/await. If we are only interested in exactly one value, we can write:


IAsyncEnumerable<T> source = ...

IAsyncEnumerator<T> enumerator = source.enumerator();

enumerator.moveNext(new CompositeSubscription())
.whenComplete((b, e) -> {
    if (e != null) {
        e.printStackTrace();
    } else if (b) {
        System.out.println(enumerator.current());
    } else {
        System.out.println("Empty!");
    }
});

Of course, given the CompletionStage API, you are free to process the single result as you see fit.

Consuming more than one value from an IAsyncEnumerator is more involved. You have to recursively call moveNext until it errors or completes:


public void consumeAll(IAsyncEnumerator<T> enumerator, CompositeSubscription csub) {
    if (csub == null) {
        csub = new CompositeSubscription();
    }

    CompositeSubscription fcsub = csub;

    enumerator.moveNext(fcsub)
    .whenComplete((b, e) -> {
        if (e != null) {
            e.printStackTrace();
        } else if (b) {
            System.out.println(enumerator.current());

            // go recursive
            consumeAll(enumerator, fcsub);
        } else {
            System.out.println("Empty!");
        }
    });
}

Unfortunately, there is a slight problem: if the CompletionStage is a synchronous stage, you may end up with a StackOverflowError because of the recursive call to consumeAll. Therefore, to be safe, we have to trampoline the call to consumeAll to ensure the stack depth doesn't grow too large:


public final class AsyncConsumer<T> implements Subscription {

    final Consumer<? super T> onNext;

    final Consumer<Throwable> onError;

    final Runnable onComplete;

    final IAsyncEnumerator<T> enumerator;

    final AtomicInteger wip;

    final Queue<CompletionStage<Boolean>> queue;

    final CompositeSubscription csub;

    final CountDownLatch cdl;

    public AsyncConsumer(
            IAsyncEnumerator<T> enumerator,
            Consumer<? super T> onNext,
            Consumer<Throwable> onError,
            Runnable onComplete
    ) {
        this.enumerator = enumerator;
        this.onNext = onNext;
        this.onError = onError;
        this.onComplete = onComplete;
        this.wip = new AtomicInteger();
        this.queue = new SpscLinkedArrayQueue<>(16);
        this.csub = new CompositeSubscription();
        this.cdl = new CountDownLatch(1);
    }

    public void consumeAll() {
        if (csub.isUnsubscribed()) {
            cdl.countDown();
            return;
        }
        CompletionStage<Boolean> stage = enumerator.moveNext(csub);
        queue.offer(stage);
        if (wip.getAndIncrement() == 0) {
            do {
                stage = queue.poll();
                stage.whenComplete((b, e) -> {
                    if (csub.isUnsubscribed()) {
                        cdl.countDown();
                        return;
                    } else if (e != null) {
                        onError.accept(e);
                        cdl.countDown();
                    } else if (b) {
                        onNext.accept(enumerator.current());
                        consumeAll();
                    } else {
                        onComplete.run();
                        cdl.countDown();
                    }
                });
            } while (wip.decrementAndGet() != 0);
        }
    }

    public void await() throws InterruptedException {
        cdl.await();
    }

    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return cdl.await(timeout, unit);
    }

    @Override
    public void unsubscribe() {
        csub.unsubscribe();
    }

    @Override
    public boolean isUnsubscribed() {
        return csub.isUnsubscribed();
    }
}

Looks intriguing. Apart from having the callbacks for the signal types and the enumerator instance, we need the work-in-progress wip counter and a queue, just like with our typical queue-drain approach back in Reactive-Streams land. For cancellation, we have the CompositeSubscription, and for blocking waits, we have a CountDownLatch. The consumeAll() method moves the enumerator one element forward and the trampoline loop makes sure there is only one whenComplete() active at a time. Inside the handler, we call the appropriate functional interface and, in case of a value signal, we call consumeAll() recursively. The trampolining makes sure we don't get reentrant behavior, whether consumeAll() is completed synchronously or asynchronously.
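The wip-counter trampoline can be studied in isolation. This single-threaded sketch shows that a reentrant submit() merely enqueues the task, which then runs after the current one returns, keeping the stack flat:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Isolated sketch of the wip-counter trampoline used by consumeAll() above:
// only the caller that bumps wip from 0 to 1 drains the queue, so handlers
// never nest on the stack regardless of recursion depth.
public class TrampolineDemo {
    final AtomicInteger wip = new AtomicInteger();
    final Queue<Runnable> queue = new ArrayDeque<>(); // single-threaded demo

    void submit(Runnable task) {
        queue.offer(task);
        if (wip.getAndIncrement() == 0) {
            do {
                queue.poll().run();   // reentrant submit()s just enqueue
            } while (wip.decrementAndGet() != 0);
        }
    }

    public static List<Integer> run() {
        TrampolineDemo t = new TrampolineDemo();
        List<Integer> order = new ArrayList<>();
        t.submit(() -> {
            order.add(1);
            t.submit(() -> order.add(3)); // queued, runs after we return
            order.add(2);
        });
        return order;
    }
}
```

The recorded order is 1, 2, 3: the inner task does not run nested inside the outer one.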


Writing an IAsyncEnumerable source


Of course, we need some data source to work with. Perhaps the most basic and most standard source is the range() operator in all reactive libraries - range is the counted for-loop of the reactive world.


public final class AsyncRange implements IAsyncEnumerable<Integer> {
    final int start;
    final int count;

    public AsyncRange(int start, int count) {
        this.start = start;
        this.count = count;
    }

    @Override
    public IAsyncEnumerator<Integer> enumerator() {
        return new AsyncRangeEnumerator(start, count);
    }

    static final class AsyncRangeEnumerator implements IAsyncEnumerator<Integer> {
        final long end;
        long index;

        static final CompletionStage<Boolean> TRUE =
                CompletableFuture.completedFuture(true);

        static final CompletionStage<Boolean> FALSE =
                CompletableFuture.completedFuture(false);

        public AsyncRangeEnumerator(int start, int count) {
            this.index = start - 1;
            this.end = (long)start + count;
        }

        @Override
        public CompletionStage<Boolean> moveNext(CompositeSubscription csub) {
            long i = index + 1;
            if (i == end) {
                return FALSE;
            }
            index = i;
            return TRUE;
        }

        @Override
        public Integer current() {
            return (int)index;
        }
    }
}

The operation itself is pretty synchronous. The way I understand it, CompletableFuture is like an AsyncSubject, and because we only ever return a constant true or false stage, we can use shared, already-completed instances (which should be stateless and non-interfering). Because moveNext() is called before current(), we start the index from start - 1 and increment it by one in moveNext(). If it equals end, we return the FALSE stage. If it hasn't reached end yet, we update the index field and return the TRUE stage.
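The sharing assumption can be verified with a quick check: callbacks attached to an already-completed CompletableFuture fire immediately and independently, so a single TRUE instance can serve any number of consumers:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicInteger;

// Checking that a shared, completed stage is stateless and non-interfering:
// each whenComplete sees the same constant value.
public class SharedStageDemo {
    static final CompletionStage<Boolean> TRUE =
            CompletableFuture.completedFuture(true);

    public static int consumeTwice() {
        AtomicInteger calls = new AtomicInteger();
        // both consumers observe true; the non-async whenComplete on an
        // already-completed stage runs before the method returns
        TRUE.whenComplete((b, e) -> { if (b) calls.incrementAndGet(); });
        TRUE.whenComplete((b, e) -> { if (b) calls.incrementAndGet(); });
        return calls.get();
    }
}
```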


Going asynchronous


Of course, we are here for the asynchronous possibility, therefore, let's write observeOn and subscribeOn operators. The first makes sure the default continuations on the CompletionStage<Boolean> happen on a specific thread, whereas the second makes sure the actual call to moveNext() happens on a specific thread (so you can do blocking IO in moveNext() or in enumerator()). Plus, we are proficient in writing these operators, aren't we?

observeOn

If you have some unfamiliar operator to implement, the best advice I got from Erik Meijer's Channel 9 videos is: follow the types. We know we have an upstream source and some source of asynchrony. For simplicity, let's use Executor, since the CompletionStage *Async methods take one verbatim.


public final class AsyncObserveOn<T> implements IAsyncEnumerable<T> {
    final IAsyncEnumerable<T> source;

    final Executor executor;

    public AsyncObserveOn(IAsyncEnumerable<T> source, Executor executor) {
        this.source = source;
        this.executor = executor;
    }

    @Override
    public IAsyncEnumerator<T> enumerator() {
        return new AsyncObserveOnEnumerator<>(source.enumerator(), executor);
    }

    // ...
}

A very familiar pattern so far. The real work, however, happens inside AsyncObserveOnEnumerator:


    static final class AsyncObserveOnEnumerator<T> implements IAsyncEnumerator<T> {

        final IAsyncEnumerator<T> enumerator;

        final Executor executor;

        public AsyncObserveOnEnumerator(IAsyncEnumerator<T> enumerator, Executor executor) {
            this.enumerator = enumerator;
            this.executor = executor;
        }

        @Override
        public CompletionStage<Boolean> moveNext(CompositeSubscription csub) {
            return enumerator.moveNext(csub).thenApplyAsync(v -> v, executor);
        }

        @Override
        public T current() {
            return enumerator.current();
        }
    }

I admit I'm not too familiar with CompletionStage, so it appeared to me that thenApplyAsync is the closest thing to moving the value delivery to a specific executor. Otherwise, this looks quite straightforward and is much shorter than our Rx-style observeOn().
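A small experiment supports this choice of thenApplyAsync: even on an already-completed stage, the continuation runs on the supplied executor's thread (the thread name below is made up for the demo):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Verifying that thenApplyAsync(v -> v, executor) hops the continuation
// onto the given executor's thread.
public class HopDemo {
    public static String continuationThread() {
        ExecutorService exec = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "demo-executor"));
        try {
            return CompletableFuture.completedFuture(true)
                    .thenApplyAsync(v -> Thread.currentThread().getName(), exec)
                    .join();
        } finally {
            exec.shutdown();
        }
    }
}
```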

subscribeOn

Since we can't assume an IAsyncEnumerable won't do synchronous work in its moveNext() method, we need a way, via subscribeOn, to make sure moveNext() is called on some other thread:


public final class AsyncSubscribeOn<T> implements IAsyncEnumerable<T> {
    final IAsyncEnumerable<T> source;

    final Executor executor;

    public AsyncSubscribeOn(IAsyncEnumerable<T> source, Executor executor) {
        this.source = source;
        this.executor = executor;
    }

    @Override
    public IAsyncEnumerator<T> enumerator() {
        AxSubscribeOnEnumerator<T> enumerator = new AxSubscribeOnEnumerator<>(executor);
        executor.execute(() -> {
            IAsyncEnumerator<T> ae = source.enumerator();
            enumerator.setEnumerator(ae);
        });
        return enumerator;
    }

    // ...
}

Instead of directly calling source.enumerator(), we offload it to the executor, which, when executed, makes the call with a valid IAsyncEnumerator - now deferred from the consumer's perspective - but we still have to return an IAsyncEnumerator ourselves. The difficulty is how to allow calling moveNext() when we don't have the upstream's enumerator yet. Luckily, CompletionStage comes to our rescue:


    static final class AxSubscribeOnEnumerator<T> implements IAsyncEnumerator<T> {

        final Executor executor;

        final CompletableFuture<IAsyncEnumerator<T>> onEnumerator;

        public AxSubscribeOnEnumerator(Executor executor) {
            this.executor = executor;
            this.onEnumerator = new CompletableFuture<>();
        }

        void setEnumerator(IAsyncEnumerator<T> enumerator) {
            onEnumerator.complete(enumerator);
        }

        @Override
        public CompletionStage<Boolean> moveNext(CompositeSubscription token) {
            return onEnumerator.thenComposeAsync(ae -> ae.moveNext(token), executor);
        }

        @Override
        public T current() {
            IAsyncEnumerator<T> ae = onEnumerator.getNow(null);
            return ae != null ? ae.current() : null;
        }
    }

We set up an onEnumerator CompletableFuture which is completed via setEnumerator upon receiving the actual upstream IAsyncEnumerator. The big trick is how we use composition over this deferred onEnumerator value to call moveNext() on the upstream's enumerator, once available, on the executor. The operator thenComposeAsync is basically flatMap. The method current() needs some extra logic: we try to get the upstream's enumerator and if it's not yet available, we simply return null - one shouldn't call current() without the corresponding CompletionStage firing anyway.
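The deferred composition boils down to a few lines: nothing runs until the outer future (playing the role of onEnumerator) is completed, then the inner stage's result flows through, just like flatMap:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Tiny illustration of thenCompose acting like flatMap over a value that
// arrives later, mirroring the setEnumerator() trick above.
public class ComposeDemo {
    public static int composed() {
        CompletableFuture<Integer> onValue = new CompletableFuture<>();

        // attached before completion: the inner stage isn't created yet
        CompletionStage<Integer> result =
                onValue.thenCompose(v ->
                        CompletableFuture.completedFuture(v + 1));

        onValue.complete(41);           // the "setEnumerator" analogue
        return result.toCompletableFuture().join();
    }
}
```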


Benchmark


Now that we have the three most basic operators available, let's benchmark our IAsyncEnumerable implementation against the cutting-edge equivalent in Reactive-Streams-Commons. For the source code, please refer to the benchmark implementation in my repository. For convenience, I've implemented the operators above in a fluent way where the base type is Ax - Async Extensions.

Results of the throughput benchmark (bigger is better): (i7 4790, Windows 7 x64, Java 8u92)


The benchmark range is just the basic range(1, count), the rangeAsync is a range(1, count).observeOn(executor) and the rangePipeline is range(1, count).subscribeOn(executor1).observeOn(executor2). In the columns, ax is my implementation of IAsyncEnumerable, px is the Reactive-Streams-Commons (Rsc) Publisher Extensions fluent API entry point. Since Rsc uses operator fusion, rangeAsync() is run with and without operator fusion enabled (the others don't fuse in Rsc), the latter is in the pxf column.

Evaluation

Looks like Rsc outperforms the IAsyncEnumerable implementation considerably, both in synchronous and asynchronous use. Without an independent library to compare against, I can only speculate why IAsyncEnumerable has so much overhead. Naturally, my limited experience with CompletionStage could explain some of it, but I doubt that's the main reason. Since both libraries use the same single-threaded Executor in the benchmark, we can rule out the executor overhead itself.

What remains is the architectural and conceptual differences:


  • We have possibly one allocation of the CompletionStage plus a known continuation stage per value - Rsc doesn't allocate anything
  • CompletionStage is actually between hot and cold and acts like an AsyncSubject, when one attaches the continuation to it, it could be still running or already completed - determining this and acting accordingly adds overhead - Rsc calls onNext as directly as possible
  • The longer the pipeline the more temporary CompletionStages get involved, which means allocation and individual task scheduling - Rsc exploits the emergent batching property of the streams over an async boundary.
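The second point can be demonstrated with a plain CompletableFuture: a continuation attached before completion runs when complete() is called, while one attached after completion fires immediately, just like an AsyncSubject replaying its single terminal value - and the operator has to handle both cases:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class HotColdDemo {
    public static void main(String[] args) {
        AtomicInteger received = new AtomicInteger();

        CompletableFuture<Integer> future = new CompletableFuture<>();
        // continuation attached while the "computation" is still pending:
        // it runs when complete() is called below
        future.thenAccept(received::addAndGet);
        future.complete(20);

        // continuation attached after completion: it fires immediately,
        // like an AsyncSubject replaying its single value
        future.thenAccept(received::addAndGet);

        System.out.println(received.get()); // prints: 40
    }
}
```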


Conclusion

I believe what we have here as IAsyncEnumerable is a corner case of the reactive-flow approach where one basically has request(1) at each stage plus some allocation overhead, giving the approach more overhead than the highly optimized flow approach.

It certainly looks simpler and its operators are shorter to implement, but I have to ask: what's the benefit over the Reactive-Streams approach?

If somebody has some tips for optimizing our IAsyncEnumerable implementation or can point me to an independent implementation, I'd be glad to benchmark and compare it and re-evaluate my position on the topic!

The Reactive Scrabble benchmarks


Introduction

In the past year, I've been posting benchmark results under the mysterious Shakespeare Plays (Reactive) Scrabble name. In this blog post, I'll explain what this benchmark is, where it comes from, how it works, what its intent is and how to apply it to your favorite, not-yet-benchmarked library.

History

The benchmark was designed and developed by Jose Paumard and its results were presented in his 2015 Devoxx talk (a bit long but worth watching). The benchmark measures how fast a certain data-processing library can find the most valuable word from a set of words taken from (one of) Shakespeare's works, based on the rules and point schema of Scrabble. RxJava at the time was at version 1.0.x and, to my surprise, it performed poorly compared to Java 8 Streams:

https://youtu.be/fabN6HNZ2qY?t=8369

The benchmark, utilizing JMH, is completely synchronous; no thread hopping happens, yet RxJava performs 10x slower - or more precisely, it has 10x more overhead in the associated set of operators. In addition, Jose also added a parallel-stream version which runs the main "loop" in parallel before joining for the final result.

More disappointingly, the RxJava 2 developer preview at the time was terrible as well (relatively; measured on a weak CPU in February).

Therefore, instead of blaming the benchmark or the author, I set out on a quest to understand the benchmark's expectations and improve RxJava 2's performance and if possible, port that back to RxJava 1.

The original Stream-benchmark

Perhaps the easiest way to understand how the computation in the benchmark works is to look at the original, non-parallel Stream version. Since going sequential or parallel requires only a sequential() or parallel() operator on a Stream, both versions extend an abstract superclass containing the majority of the code and only get specialized for the operation mode in two additional classes.

ShakespearePlaysScrabbleWithStreamBeta.java

I added the postfix "Beta" - meaning an alternate version in this context - to distinguish it from a version that has a slight difference in one of the computation steps. I'll explain why when I describe the original RxJava benchmark down below.

The benchmark is built in a somewhat unconventional, perhaps over-functionalized manner and I had a bit of trouble putting the functionality back together in my head. It isn't that complicated though.

The inputs to the benchmark are hidden in a base class' fields: shakespeareWords (HashSet<String>), scrabbleWords (HashSet<String>), letterScores (int[]) and scrabbleAvailableLetters (int[]). shakespeareWords contains all the words, lowercased, of Shakespeare's work. scrabbleWords contains the words, lowercased, allowed by Scrabble itself. letterScores contains the scores of the letters a-z and scrabbleAvailableLetters (it seems to me) is there to limit the score if a particular letter appears multiple times in a word.

The benchmark, due to the dependencies of one step on the other, is written in "backwards" order, starting with a function that finds the score of a letter. Given a lowercase English letter with character code 97-122, the function maps it to the 0-25 range and gets the score from the array.

IntUnaryOperator scoreOfALetter = letter -> letterScores[letter - 'a'];

The next function, given a histogram of letters in a word in the form of a Map.Entry (where the key is the letter and the value is the number of occurrence in the word), calculates a bounded score of that letter in the word.

ToIntFunction<Entry<Integer, Long>> letterScore =
    entry ->
        letterScores[entry.getKey() - 'a'] *
        Integer.min(
            entry.getValue().intValue(),
            scrabbleAvailableLetters[entry.getKey() - 'a']
        );

For that, we need the actual histogram of the letters in a word, which is computed by the following function:


Function<String, Map<Integer, Long>> histoOfLetters =
    word -> word.chars()
        .boxed()
        .collect(
            Collectors.groupingBy(
                Function.identity(),
                Collectors.counting()
            )
        );

This is where a particular dataflow library comes into play. Given a word as a Java String, split it into individual characters and count how many of each character the word contains. For example, "jezebel" will count 1-j, 3-e, 1-z, 1-b and 1-l. In the Stream version, the IntStream of characters provided by String itself is converted into a boxed Stream<Integer> and grouped into a Map by the counting standard collector with no key mapping. Note that the return type of the function is Map and not Stream<Map>.
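Wrapped into a small standalone class, applying the function above to the example word confirms those counts:

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class HistoDemo {
    public static void main(String[] args) {
        // same histogram function as in the benchmark
        Function<String, Map<Integer, Long>> histoOfLetters =
            word -> word.chars()
                .boxed()
                .collect(
                    Collectors.groupingBy(
                        Function.identity(),
                        Collectors.counting()
                    )
                );

        Map<Integer, Long> histo = histoOfLetters.apply("jezebel");
        System.out.println(histo.get((int) 'e')); // prints: 3
        System.out.println(histo.get((int) 'j')); // prints: 1
    }
}
```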

The next function calculates the blank score of a character occurrence:


ToLongFunction<Entry<Integer, Long>> blank =
    entry ->
        Long.max(
            0L,
            entry.getValue() -
            scrabbleAvailableLetters[entry.getKey() - 'a']
        );

Given an entry from the histogram above, it computes how many blank tiles are needed when the particular letter occurs more often than its available count in the scrabbleAvailableLetters array. For example, if the letter 'd' appears twice and the scrabbleAvailableLetters entry for it is 1, this computes to 1. If the letter 'e' appears twice, the array entry for it is 12 and the function computes 0.


The next function combines the histoOfLetters with the blank function to compute the number of blanks in an entire word:

Function<String, Long> nBlanks =
    word -> histoOfLetters.apply(word)
        .entrySet().stream()
        .mapToLong(blank)
        .sum();


Here the histogram of the letters in the given word is computed and returned in a Map, then each entry of this Map is streamed, mapped into the blank letter value and summed up into a final value. (Honestly, I'm not familiar with the rules of Scrabble and these last two functions seem to be extra convolution to make the computation work harder.)
The follow-up function takes the result of nBlanks and checks if a word can be written with 2 or fewer blanks:


Predicate<String> checkBlanks = word -> nBlanks.apply(word) <= 2;


The next 2 functions pick the first 3 and last 3 letters of a word:

Function<String, IntStream> first3 = word -> word.chars().limit(3);

Function<String, IntStream> last3 =
    word -> word.chars().skip(Integer.max(0, word.length() - 4));


These won't stay separated and are immediately combined back together:

Function<String, IntStream> toBeMaxed =
    word -> Stream.of(first3.apply(word), last3.apply(word))
        .flatMapToInt(Function.identity());

Practically, the first 3 and last 3 letters (possibly overlapping for shorter words) are concatenated back into a single IntStream via flatMapToInt, i.e., "jezebel" will stream letter-by-letter as "jezbel".

Given the merged character stream, we compute the maximum score of the letters:

ToIntFunction<String> bonusForDoubleLetter =
    word -> toBeMaxed.apply(word)
        .map(scoreOfALetter)
        .max()
        .orElse(0);


Note that IntStream.max() returns Optional.

We then calculate the final score of a word:

Function<String, Integer> score3 =
    word ->
        2 * (score2.apply(word) + bonusForDoubleLetter.applyAsInt(word))
        + (word.length() == 7 ? 50 : 0);


This adds a bonus of 50 points for words of length 7 and doubles the sum of the base word score and the double-letter bonus. Note that both score2 and bonusForDoubleLetter are evaluated only once and their sum is multiplied by the literal two.

Now we reach the actual "loop" for calculating the scores of each word in the Shakespeare word set:

Function<Function<String, Integer>, Map<Integer, List<String>>> buildHistoOnScore =
    score -> shakespeareWords.stream()
        .filter(scrabbleWords::contains)
        .filter(checkBlanks)
        .collect(
            Collectors.groupingBy(
                score,
                () -> new TreeMap<Integer, List<String>>(Comparator.reverseOrder()),
                Collectors.toList()
            )
        );


This is an odd function because it takes another function, the score function, as input and returns a Map keyed by score where each value is the list of words having that score. The body takes the set of shakespeareWords, streams it (the parallel version has parallelStream() here), keeps only those words that are in the allowed Scrabble word set, keeps those that can be written with at most two blanks, then groups the remaining words based on their computed score into a reverse-ordered TreeMap with Integer keys and List values - all with the help of the standard Stream Collectors.

Finally, we are not interested in all words but only the top 3 scoring set of words:

List<Entry<Integer, List<String>>> finalList =
    buildHistoOnScore.apply(score3)
        .entrySet()
        .stream()
        .limit(3)
        .collect(Collectors.toList());


We apply the score3 function to the "loop" and put the top 3 entries into the final list. If all went well, we should get the following entries:

120 = jezebel, quickly
118 = zephyrs
116 = equinox


The original RxJava benchmark


Given the fact that RxJava and Java Streams have quite similar APIs and equivalent operators, the original RxJava benchmark was written with an odd set of helper components and changes to the pattern of the functions above.

The first oddity is the introduction of a functional style of counting: LongWrapper

interface LongWrapper {
    long get();

    default LongWrapper incAndSet() {
        return () -> get() + 1;
    }
}

LongWrapper zero = () -> 0;
LongWrapper one = zero.incAndSet();


(This over-functionalization is a recurring theme with Jose, see this year's JavaOne video for example.)
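To see why this counting style is costly, note that every incAndSet() allocates a new lambda that defers to the previous one, so get() on the final wrapper walks the entire chain of nested calls. A quick standalone demonstration:

```java
public class LongWrapperDemo {

    interface LongWrapper {
        long get();

        default LongWrapper incAndSet() {
            // allocates a new lambda capturing the previous wrapper
            return () -> get() + 1;
        }
    }

    public static void main(String[] args) {
        LongWrapper counter = () -> 0;
        // each increment creates another wrapper object
        for (int i = 0; i < 5; i++) {
            counter = counter.incAndSet();
        }
        // get() now performs 5 nested calls to produce the value
        System.out.println(counter.get()); // prints: 5
    }
}
```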

The second oddity is that many return types that were scalar in the Stream version above were turned into Observables - which adds unnecessary overhead and no operational benefit. So let's see how the original benchmark looks with RxJava 1:

First, the score functions now return a single-element Observable with the score value:
 
Func1<Integer, Observable<Integer>> scoreOfALetter = 
    letter -> Observable.just(letterScores[letter - 'a']) ;

Func1<Entry<Integer, LongWrapper>, Observable<Integer>> letterScore =
    entry ->
        Observable.just(
            letterScores[entry.getKey() - 'a'] *
            Integer.min(
                (int) entry.getValue().get(),
                scrabbleAvailableLetters[entry.getKey() - 'a']
            )
        );


This has cascading effects as dependent functions now have to deal with an Observable. RxJava doesn't have a direct means to stream the characters of a String, so a helper indirection was introduced by reusing Stream's tools and turning that into an Iterable RxJava can understand:

Func1<String, Observable<Integer>> toIntegerObservable =
    string -> Observable.from(
        IterableSpliterator.of(string.chars().boxed().spliterator()));

Building the histogram now uses the LongWrapper and RxJava's collect() operator to build the Map with it:

Func1<String, Observable<HashMap<Integer, LongWrapper>>> histoOfLetters =
    word -> toIntegerObservable.call(word)
        .collect(
            () -> new HashMap<>(),
            (HashMap<Integer, LongWrapper> map, Integer value) -> {
                LongWrapper newValue = map.get(value);
                if (newValue == null) {
                    newValue = () -> 0L;
                }
                map.put(value, newValue.incAndSet());
            }
        );


Calculating blanks also returns an Observable instead of a scalar value:

Func1<Entry<Integer, LongWrapper>, Observable<Long>> blank =
    entry ->
        Observable.just(
            Long.max(
                0L,
                entry.getValue().get() -
                scrabbleAvailableLetters[entry.getKey() - 'a']
            )
        );

Func1<String, Observable<Long>> nBlanks =
    word -> histoOfLetters.call(word)
        .flatMap(map -> Observable.from(() -> map.entrySet().iterator()))
        .flatMap(blank)
        .reduce(Long::sum);

Func1<String, Observable<Boolean>> checkBlanks =
    word -> nBlanks.call(word)
        .flatMap(l -> Observable.just(l <= 2L));


Now calculating the scores:

Func1<String, Observable<Integer>> score2 =
    word -> histoOfLetters.call(word)
        .flatMap(map -> Observable.from(() -> map.entrySet().iterator()))
        .flatMap(letterScore)
        .reduce(Integer::sum);

Func1<String, Observable<Integer>> first3 =
    word -> Observable.from(
        IterableSpliterator.of(word.chars().boxed().limit(3).spliterator()));

Func1<String, Observable<Integer>> last3 =
    word -> Observable.from(
        IterableSpliterator.of(word.chars().boxed().skip(3).spliterator()));

Func1<String, Observable<Integer>> toBeMaxed =
    word -> Observable.just(first3.call(word), last3.call(word))
        .flatMap(observable -> observable);

Func1<String, Observable<Integer>> bonusForDoubleLetter =
    word -> toBeMaxed.call(word)
        .flatMap(scoreOfALetter)
        .reduce(Integer::max);

(Note that last3 returns the letters 4..n instead of the last 3 letters, not sure if this was intentional or not. Changing it to really return the last 3 letters has no measurable performance difference.)

Then we compute the final score per word:

Func1<String, Observable<Integer>> score3 =
    word ->
        Observable.just(
            score2.call(word),
            score2.call(word),
            bonusForDoubleLetter.call(word),
            bonusForDoubleLetter.call(word),
            Observable.just(word.length() == 7 ? 50 : 0)
        )
        .flatMap(observable -> observable)
        .reduce(Integer::sum);

Remember the "times 2" from the Stream benchmark: here both scores are streamed again instead of multiplying their results by 2 (via map). This inconsistency with the original Stream version alone is responsible for ~30% of the overhead in the original RxJava 1 benchmark. For comparison, when the same double-streaming is applied to the original Stream benchmark, its measured sample time goes from 27 ms/op up to 39 ms/op.

Lastly, the processing of the entire Shakespeare word set and picking the top 3:

Func1<Func1<String, Observable<Integer>>, Observable<TreeMap<Integer, List<String>>>>
buildHistoOnScore =
    score -> Observable.from(() -> shakespeareWords.iterator())
        .filter(scrabbleWords::contains)
        .filter(word -> checkBlanks.call(word).toBlocking().first())
        .collect(
            () -> new TreeMap<Integer, List<String>>(Comparator.reverseOrder()),
            (TreeMap<Integer, List<String>> map, String word) -> {
                Integer key = score.call(word).toBlocking().first();
                List<String> list = map.get(key);
                if (list == null) {
                    list = new ArrayList<>();
                    map.put(key, list);
                }
                list.add(word);
            }
        );

List<Entry<Integer, List<String>>> finalList2 =
    buildHistoOnScore.call(score3)
        .flatMap(map -> Observable.from(() -> map.entrySet().iterator()))
        .take(3)
        .collect(
            () -> new ArrayList<Entry<Integer, List<String>>>(),
            (list, entry) -> {
                list.add(entry);
            }
        )
        .toBlocking()
        .first();


Here, we need to go blocking to get the first (and only) value of checkBlanks, as well as to get the single List holding the final top 3 scores and words. The collector function taking the TreeMap and the current entry had to be explicitly typed because, for some reason, Eclipse can't properly infer the types in that expression.

The optimized version

The mistakes and drawbacks of the original RxJava version have been identified over a long period of time and the optimized benchmark is still "under optimization". Using the right operator for the right job is essential in synchronous processing: some operators are better suited for this type of work than others, which carry extra overhead due to their need to support an asynchronous operation mode. The other important thing is to know when to use a reactive type and when to stick to a scalar value.

As mentioned above, the original RxJava benchmark had a bunch of functions return an Observable wrapping a scalar value for no apparent benefit. Changing these back to scalar functions - just like in the Stream version - helps avoid unnecessary indirection and allocation:


Func1<Integer, Integer> scoreOfALetter = letter -> letterScores[letter - 'a'];


Streaming the characters of a word is the hottest operation, executed several tens of thousands of times. Instead of the Stream-Spliterator indirection, one can simply index-map a string into its characters:

Func1<String, Observable<Integer>> toIntegerObservable =
    word -> Observable.range(0, word.length()).map(i -> (int) word.charAt(i));


Instead of the convoluted LongWrapper and its lambda-capture overhead, we can define a simple mutable container for the histogram:

public final class MutableLong {

    public long value;

    public void incAndSet() {
        value++;
    }
}

Func1<String, Observable<HashMap<Integer, MutableLong>>> histoOfLetters =
    word -> toIntegerObservable.call(word)
        .collect(
            () -> new HashMap<>(),
            (HashMap<Integer, MutableLong> map, Integer value) -> {
                MutableLong newValue = map.get(value);
                if (newValue == null) {
                    newValue = new MutableLong();
                    map.put(value, newValue);
                }
                newValue.incAndSet();
            }
        );


The next optimization is the use of flatMapIterable; there is no need to get the iterator of an entrySet() but just iterate it directly since it already implements Iterable:

Func1<String, Observable<Long>> nBlanks =
    word -> MathObservable.sumLong(
        histoOfLetters.call(word)
            .flatMapIterable(map -> map.entrySet())
            .map(blank)
    );


In addition, reduce() has some overhead because of the constant boxing and unboxing of the running sum or max value of the stream; it can be replaced by a dedicated operator from the RxJavaMath library: MathObservable.sumLong().

In synchronous scenarios, concat works better than merge/flatMap most of the time:

Func1<String, Observable<Integer>> toBeMaxed =
    word -> Observable.concat(first3.call(word), last3.call(word));

Func1<String, Observable<Integer>> score3 =
    word ->
        MathObservable.sumInteger(
            Observable.concat(
                score2.call(word).map(v -> v * 2),
                bonusForDoubleLetter.call(word).map(v -> v * 2),
                Observable.just(word.length() == 7 ? 50 : 0)
            )
        );


Note the use of map(v -> v * 2) to multiply the two score components instead of streaming them again.

In addition, there were several internal optimizations to RxJava to improve performance with this type of usage: concat(o1, o2, ...) received a dedicated operator instead of delegating to the Observable of Observables overload. The toBlocking().first() overhead has been improved as well. Currently, the optimized benchmark with RxJava 1.2.4 runs under 67 ms/op, the "Beta" benchmark runs under 100 ms/op and the original benchmark runs under 170 ms/op.

Benchmarking other libraries


Following similar patterns, other streaming libraries (synchronous and asynchronous) were benchmarked over the year. The following subsections summarize what it takes to have them do the Scrabble computation with the functional structures above, how they perform and why they run at the speed they do.

https://twitter.com/akarnokd/status/808995627237601280


Kotlin


Kotlin has its own rich, synchronous streaming standard library and performs quite well in the optimized benchmark: 20 ms/op. It requires a separate project because it is a completely separate JVM language, which works best under IntelliJ. I'm not deeply familiar with Kotlin, thus I'm not sure what makes it that much faster than Stream (or IxJava).

The streaming part of the language is certainly well optimized, but it is also possible that using a HashMap with primitive types gets a custom implementation. The streaming standard library is very rich and the whole Scrabble logic could be expressed without building new operators or invoking external libraries.

IxJava

IxJava, short for Iterable eXtensions for Java, started out as a companion library to Reactive4Java, the first black-box re-implementation of the Rx.NET library on the JVM (2011). Since then, it has been rewritten from scratch based on the advanced ideas of RxJava 2. The optimized benchmark runs around 23 ms/op, 3 ms faster than the Stream version. Currently, this is the fastest Java library to do the Scrabble benchmark and it has all the operators built in for the task. It features less allocation and less indirection, plus there are optimizations for certain shapes of inputs (constant-scalar and single-element sources).

RxJava 2

RxJava is the de-facto standard reactive library for Java 6+ and version 2 supports the Reactive-Streams initiative with its Flowable type. Version 2 was rewritten from scratch in late 2015 and then drastically re-architected in mid 2016. The late 2015 version performed poorly on the Scrabble benchmark, but was still 2 times faster than RxJava 1 at the time. Since the dataflow types in RxJava have to anticipate asynchronous and/or backpressured usage with largely the same code path, they have a noticeable overhead when used in a purely synchronous manner.

Therefore, the Scrabble benchmark is implemented for both the backpressure-enabled Flowable and the backpressure-lacking (but otherwise similarly advanced) Observable type. They perform at 27.75 ms/op and 26.83 ms/op respectively. Unfortunately, the main RxJava library lacks dedicated operators such as streaming the characters of a String and summing up a stream of numbers, so these were implemented in the RxJava 2 Extensions companion library. The additional performance improvement over RxJava 1.x comes from the much leaner architecture with fewer indirections, fewer allocations, a dedicated concat(source1, source2, ...) operator, a very low overhead blockingFirst() and generally the operator fusion many operators participate in. In the late release-candidate phase, it was decided that certain operators should return Single, Completable or Maybe instead of the source's own type. The change did not affect the benchmark result in any measurable way (but the code had to change to work with the new types, of course).

In addition, the extension library features a ParallelFlowable type that allows parallel computations over a regular Flowable sequence, somewhat similar to parallel Streams. The parallelization happens for the set of Shakespeare words and requires a manual reduction back to sequential reactive type:


Function<Function<String, Flowable<Integer>>, Flowable<TreeMap<Integer, List<String>>>>
buildHistoOnScore =
    score ->
        ParallelFlowable.from(Flowable.fromIterable(shakespeareWords))
            .runOn(scheduler)
            .filter(scrabbleWords::contains)
            .filter(word -> checkBlanks.apply(word).blockingFirst())
            .collect(
                () -> new TreeMap<Integer, List<String>>(Comparator.reverseOrder()),
                (TreeMap<Integer, List<String>> map, String word) -> {
                    Integer key = score.apply(word).blockingFirst();
                    List<String> list = map.get(key);
                    if (list == null) {
                        list = new ArrayList<>();
                        map.put(key, list);
                    }
                    list.add(word);
                }
            )
            .reduce((m1, m2) -> {
                for (Map.Entry<Integer, List<String>> e : m2.entrySet()) {
                    List<String> list = m1.get(e.getKey());
                    if (list == null) {
                        m1.put(e.getKey(), e.getValue());
                    } else {
                        list.addAll(e.getValue());
                    }
                }
                return m1;
            });

The parallel version measures 7.23 ms/op compared to the Java parallel Streams version with 6.71 ms/op.

Reactor 3

Pivotal's Reactor-Core library is practically RxJava 2 under a different company banner, with implementation differences due to it being Java 8+. It was originally contributed to by me and, as of today, the relevant components of Reactor 3 exercised by the Scrabble benchmark still use my algorithms. The few implementation differences come from the use of atomic field updaters instead of atomic classes (such as AtomicInteger), which reduces the allocation amount even further. Unfortunately, even though the same field updaters are available for Java 6 and Android, certain devices don't play nicely with the underlying reflection mechanics. The optimized benchmark code uses custom implementations for streaming the characters of a word and for finding the sum/max of a sequence.

Given this difference, Reactor measures 27.39 ms/op, putting it between RxJava 2's Observable and Flowable, somewhat expectedly.

Reactor 3 has direct support for converting to its parallel type, ParallelFlux, which is also practically the same as RxJava 2 Extensions' ParallelFlowable. The ParallelFlux benchmark clocks in at 8.53 ms/op; however, that ~1 ms difference to RxJava 2 is certainly odd and it is unclear why it happens.

Guava

Google Guava is a library with lots of features, among other things offering a sub-library with fluent-API support via FluentIterable. It has a limited set of streaming operators and the implementation has some unnecessary overhead in it. The design reminds me of RxJava 1's Observable where there is a mandatory indirection to an inner type.

Given the limited API, the optimized benchmark code uses custom operators, such as streaming the characters of a word, custom sum/max and custom collect() operators, all written by me in the vocabulary of FluentIterable. Therefore, the measured 35.98 ms/op is not entirely the achievement of the library's authors.

Interestingly, the backpressure-enabled, async-capable Flowable/Flux outperforms the sync-only and thus theoretically lower-overhead FluentIterable.


Ix.NET


When Rx.NET was developed several years ago, its authors also implemented its dual, the Interactive eXtensions, building a rich API over .NET's standard, synchronous streaming IEnumerable interface.

Ix.NET is well optimized, most likely due to nice language features (yield return) and a great compiler (state-machine building around yield return). Even though .NET supports "primitive specialization", its JIT compiler is not an optimizing runtime compiler and this is likely why the ported Scrabble benchmark measures only 45.4 ms/op.

Unfortunately, there were some missing operators in Ix.NET I had to write manually, such as the now-typically-needed streaming of characters and a reduce() operator to support sum/max. (There is no need for a custom sum operator because the primitive specialization of reduce() provides it automatically.)


Reactor.NET

About a year ago, there was a non-zero chance I'd have to include C# development in my professional line of work. Unfortunately, Rx.NET was and still is an old library with a significant performance overhead due to its synchronous ties, namely returning an IDisposable from Subscribe() instead of injecting it via an OnSubscribe() like all the other 3rd-generation (inspired) libraries do. When 3.x didn't change the architecture, I decided that instead of battling them over advancing it, I could just roll my own library. Since in early 2016 I was involved with Pivotal's Reactor and its API surface a third the size of Rx.NET's, I started working on Reactor-Core.NET with all the 4th-generation goodies RxJava 2 and Reactor now feature. Unfortunately, the risk of me doing C# faded and I took over leading RxJava, sending this project to sleep.

Regardless, enough operators were implemented that the Scrabble benchmark for it is available; it measures 80.51 ms/op. This may be partly due to the .NET platform and partly due to a less-than-optimal implementation of streaming characters.


JOOLambda

Back in Java land, this library is part of the JOOx family of extension libraries, supporting JVM tasks such as JDBC-based database interactions and extending the standard Java Stream with features. This, unfortunately, means wrapping a Stream in their Seq type and thus adding a level of indirection. This wouldn't be much of a problem, but the API lacks operators that stay in the Seq type for tasks such as collect or sum/max. Therefore, these operators had to be emulated with other operators. A second unfortunate property of JOOLambda is the difficulty of extending it (even non-fluently). I couldn't find any way of implementing my own operator directly (as with the Rx-style and Ix-style APIs) and the closest thing wanted me to implement 70+ standard Stream operators again.

I believe it is still interesting to show how a convenient collect() operator can be implemented if there is no reduce() or even scan() to help us:


Function<String, Seq<HashMap<Integer, MutableLong>>> histoOfLetters =
    word -> {
        HashMap<Integer, MutableLong> map = new HashMap<>();
        return charSeq.apply(word)
            .map(value -> {
                MutableLong newValue = map.get(value);
                if (newValue == null) {
                    newValue = new MutableLong();
                    map.put(value, newValue);
                }
                newValue.incAndSet();
                return map;
            })
            .skip(Long.MAX_VALUE)
            .append(map);
    };

First, the resulting HashMap is instantiated upfront, knowing that this function will be invoked sequentially and non-recursively, thus there won't be any clash between the computations of different words. Second, we stream the characters of the word, mapping each character into the histogram inside the map. We need only the final state of the HashMap but there is no takeLast() operator to ignore all but the very last time the map is forwarded. Instead, we skip all elements and append the single HashMap to the now-empty Seq.

Summing up values is no less convoluted with JOOL:


Function<String, Seq<Integer>> score2 =
    word -> {
        int[] sum = { 0 };
        return histoOfLetters.apply(word)
            .flatMap(map -> Seq.seq(map.entrySet()))
            .map(letterScore)
            .map(v -> sum[0] += v)
            .skip(Long.MAX_VALUE)
            .append(0)
            .map(v -> sum[0]);
    };

We set up a single-element array to be the accumulator for the summing, stream the histogram and sum the letter scores into this array. We then skip all elements and append 0, followed by mapping (this zero) to the contents of the sum array. Note that append(sum[0]) would be evaluated at assembly time (before the sum actually happens), yielding the initial zero every time.
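The same pattern, and the assembly-time pitfall, can be demonstrated with plain Java Streams, where Stream.concat plays the role of append:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AssemblyTimeDemo {
    public static void main(String[] args) {
        int[] sum = { 0 };

        // wrong: sum[0] is read at assembly time, before any element flowed
        List<Integer> eager = Stream.concat(
                Stream.of(1, 2, 3).map(v -> sum[0] += v).skip(Long.MAX_VALUE),
                Stream.of(sum[0]))
            .collect(Collectors.toList());

        sum[0] = 0;

        // right: sum[0] is read inside map() when the appended element flows,
        // i.e., after the first stream has been drained and the sum computed
        List<Integer> deferred = Stream.concat(
                Stream.of(1, 2, 3).map(v -> sum[0] += v).skip(Long.MAX_VALUE),
                Stream.of(0).map(v -> sum[0]))
            .collect(Collectors.toList());

        System.out.println(eager);    // prints: [0]
        System.out.println(deferred); // prints: [6]
    }
}
```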

The code measures 86-92 ms/op; however, this might not be that bad because, while writing this post, I've noticed a missing optimization that adds an unnecessary burden to a core computation - my bad. No worries, I'll remeasure everything next year since some libraries have updated their code since.


Cyclops-React

This is an odd library, developed mainly by one person. Looking at the GitHub site, I'm sure it used to say Reactive-Streams in the title. I came across this library a month or so back when the author posted an extensive article about its benefits: extending Java Stream with missing features and reactive concepts. When I see "library" and "Reactive-Streams" together I jump - writing a reactive library is a very difficult task. It turns out, the library's claim of "Reactive-Streams" was a bit misleading. It is no more reactive than IxJava, i.e., a completely synchronous streaming API, with the exception that there is a wrapper/converter to a Reactive-Streams Publisher. IxJava has such converters as well, but only inside the various other reactive libraries: Flux.fromIterable() and Flowable.fromIterable().

That aside, it is still a kind of dataflow library and as such can be benchmarked with Scrabble. Cyclops-React builds on top of JOOLambda and my first naive implementation performed similarly to JOOLambda (to be precise, I measured Cyclops-React first, then JOOLambda to see where the poor performance might come from).

Cyclops-React at the time didn't have any collect()/reduce() operators, but it had scan (called scanLeft) and takeLast (called takeRight), allowing me to build the necessary computation steps:


Function<String, ReactiveSeq<HashMap<Integer, MutableLong>>> histoOfLetters =
    word -> toIntegerIx.apply(word)
        .scanLeft(new HashMap<Integer, MutableLong>(), (map, value) -> {
            MutableLong newValue = map.get(value);
            if (newValue == null) {
                newValue = new MutableLong();
                map.put(value, newValue);
            }
            newValue.incAndSet();
            return map;
        })
        .takeRight(1);


From an allocation perspective, this is very similar to JOOLambda's workaround since the HashMap is instantiated when the outer function is called and not per consumer of the aggregation as with RxJava's collect() operator. One convenience, though, is takeRight(1), which picks the very last value of the map (as scan emits it every time a new source element comes up).

The first benchmarks, with version 1.0.3, yielded 108 ms/op. The diagram at the beginning of this section lists it twice. The author of Cyclops-React and I tried to work out a better optimization but, due to our different understandings of what the Scrabble benchmark represents, we didn't come to an agreement on the proper optimization (he practically wanted to remove ReactiveSeq, the base type of the library, and basically benchmark Java Stream again; I want to measure the overhead of ReactiveSeq itself).

Since then, version 1.0.5 has been released with library optimizations, and my code now runs at 54 ms/op while having the same structure as before. The author has also run a few Scrabble benchmarks of his own that show lower overhead, comparable to Stream now. If he achieved that by honoring the structure, that's fantastic; if he practically skipped his own type as the workhorse, that's bad.


Rx.NET

The first modern reactive library was designed and developed more than 8 years ago at Microsoft. Since then, Rx.NET has become open source, had 3 major releases, and helps (drives?) famous technologies such as Cortana.

It's a bit sad it couldn't evolve beyond its 1st-generation reactive architecture. First, it has heavily invested developers who are quite comfortable with how it is implemented; second, the .NET platform has absorbed its base interface types, IObservable and IObserver, which have the unfortunate design of requiring a synchronous IDisposable to be returned. Luckily, the 4th-generation architecture works on the .NET platform as well, and the community-driven Reactive-Streams.NET initiative may give some hope there too.

This unfortunate design remnant is visible in the Scrabble benchmark: 413 ms/op. The main overhead comes from the trampolining that range() and the enumerable-to-Observable conversion have to do. This trampolining is necessary to solve the synchronous cancellation problem, which RxJava solved by having a stateful consumer with a flag and a callback mechanism indicating cancellation (which led to the Subscription injection method in Reactive-Streams).
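The flag-based approach can be sketched in plain Java. This is a deliberately simplified, hypothetical illustration (not actual RxJava or Rx.NET code): the producer polls a volatile flag that the consumer may set from within its own callback, so no trampolining is needed for synchronous cancellation.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of flag-based synchronous cancellation:
// the producer polls a volatile flag that the consumer may set
// from within its own callback, so no trampoline is required.
public class FlagCancellation {

    static class Subscription {
        volatile boolean cancelled;

        void cancel() {
            cancelled = true;
        }
    }

    // Emits 0..n-1 to onNext but stops as soon as the consumer cancels;
    // returns the number of items actually emitted.
    static int range(int n, Subscription s, IntConsumer onNext) {
        int emitted = 0;
        for (int i = 0; i < n; i++) {
            if (s.cancelled) {
                break; // synchronous cancellation, no trampoline needed
            }
            onNext.accept(i);
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        Subscription s = new Subscription();
        // The consumer cancels from within its callback, like a take(3) would.
        int count = range(10, s, v -> {
            if (v == 2) {
                s.cancel();
            }
        });
        System.out.println(count); // prints 3
    }
}
```

With an IObservable/IDisposable design, the consumer has no such flag to set until subscribe() returns the IDisposable, which is what forces the trampoline.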

Interestingly, I've implemented a minimalist, non-backpressured type Ox, similar to RxJava 2's Observable type, and it measures 45 ms/op, practically on par with the Ix.NET benchmark.


Swave

Perhaps this library is the youngest of the "reactive" libraries. Its implementation resembles Akka-Stream with its graph-like internal workings, but it is not a native Reactive-Streams library. It has conversions from and to Publisher, but the steps themselves aren't Publishers, which adds interoperation overhead. In addition, the library is part of the Yoda-family of reactive libraries: there is no retry. (Maybe because for retry to work, one needs to hold onto the chain that establishes the flow and allow resubscribing without manually reassembling the entire flow.) The library is written entirely in Scala and I gave up on trying to call it from a Java project, hence a separate project for it.

The library itself appears to have a single developer and the documentation is lacking a bit at the moment - not that I can't find operators on my own, but a few times it was unclear whether I was fighting the Scala compiler (through IntelliJ) or this library (you know, when IntelliJ says all is okay but then the build fails with a compilation error due to implicits). The library, version 0.5 at least, didn't have collect, reduce, sum or max, but it does have takeLast, so the emulations mentioned before work.

Nonetheless, I managed to port the benchmark to Scala and run it, getting a surprising 781 ms/op. Since I can't really read Scala code, I can only speculate that this comes from the graph-architecture overhead and/or some mandatory asynchrony implicitly present.


Akka-Stream

I've read so much goodness about Akka-Stream: about the technologies and frameworks it supports, its advanced and high-performance optimizations over the flow-graph, the vibrant community and developer base around it, the spearheading of the Reactive-Streams initiative itself - yet it constantly fails to deliver for me. In addition, I've recently found out Akka-Stream is just inspired by Reactive-Streams, and the reason it provides a converter/wrapper to a Publisher instead of implementing it at every step is that working with Reactive-Streams' deferred nature is too hard. Also, I couldn't find any means of retrying an Akka-Stream Source, so it could be yet another Yoda-library (so how does it support resilience then?).

At least Akka-Stream has a Java DSL, so I could implement the Scrabble benchmark within the familiar Java context. The DSL doesn't have collect but supports reduce (thus sum and max require minimal work). Therefore, the collect operations were implemented with the same map+drop(Long.MAX_VALUE)+concat(map) emulation.

The benchmark results are "mind-blasting": 5563 ms/op; that is, it takes about 5.5 seconds to compute the Scrabble answer once. Since Akka-Stream is originally written in Scala, I don't know for sure the source of this overhead, but I have a few ideas: the graph overhead, the mandatory asynchronous nature, and perhaps the "fusion optimization" they employ that wastes time trying to optimize a graph that can't be further optimized.

This problem seems to hit any use case that has flatMap in it - one of the most common operators involved in microservices composition. Of course, one can blame the synchronous nature of the Scrabble use case, which is not the target for Akka-Stream; however, its interoperation capabilities through the Reactive-Streams Publisher show some serious trouble (ops/s, larger is better):


Here, the task is to deliver 1M elements (part of an Integer array) from one thread to another, where the work is divided between Akka-Stream and RxJava 2: one delivers count sub-streams of 1M/count items each, and the other flattens these sub-streams back into a single stream on the other side. Surprisingly, using Rx as the driver or middle worker improves throughput significantly (but not always). This benchmark stresses mostly the optimizer of Akka-Stream. Do people flatMap with Akka-Stream at all, and has nobody noticed this?

Conclusion

Writing a reactive library is hard; writing a benchmark to measure those libraries is at best non-trivial. Figuring out why some of them are extremely fast while others are extremely slow requires mastery in both synchronous and asynchronous design and development.

Instead of getting mad at the Scrabble benchmark a year ago, I invested time and effort into improving and optimizing the libraries that I could affect, and thanks to that, those libraries are now considerably better at this benchmark and in general use, due to deep architectural and conceptual improvements.

I must warn the reader against interpreting the results of the Scrabble benchmarks as the ultimate ranking of the libraries. The fact that the libraries perform as they do in this particular benchmark doesn't mean they perform the same in other situations with other types of tasks. Computation and/or IO overhead may hide the subtle differences in those cases, leveling the field between them in the end.


Java 9 Flow API: asynchronous integer range source


Introduction


Java 9 is becoming more reactive by introducing the Reactive-Streams interfaces under the parent class java.util.concurrent.Flow, enabling a new standard interoperation between future libraries built on top.

There is almost no documentation beyond an underwhelming Oracle document and the SubmissionPublisher class's JavaDoc about how to write Publishers, Subscriptions and Subscribers under the Flow API. Plus, the Oracle document practically concludes with "see RxJava".

Indeed, replacing the imports of org.reactivestreams.* with java.util.concurrent.Flow.* in RxJava 2's sources gets one a fully fledged reactive library, but there seems to be one crucial expectation of components built on the Flow API: they have to be asynchronous at every stage. I could argue that the underlying concepts work totally fine in synchronous mode, but who am I to question the established definitions?

Oh well, if the constraint is to be asynchronous, then let's do it in an asynchronous way.

To see what it takes, we could start with a relatively simple source: an asynchronous integer range.

Since both Java 9 and the IDE support for it are in a non-final state, I recommend IntelliJ 2017.1 EAP for this "exercise".


Asynchronous integer range source


Unfortunately, Java 9 won't introduce any standard fluent API entry point with all the well-loved map(), filter(), flatMap() etc. operators; one has to build individual Publishers and compose them stage by stage.

This involves creating a parent Publisher class with the following typical pattern to host the input parameters of the flow to be observed:


import java.util.concurrent.*;

public final class FlowRange implements Flow.Publisher<Integer> {

    final int start;
    final int end;
    final Executor executor;

    public FlowRange(int start, int count, Executor executor) {
        this.start = start;
        this.end = start + count;
        this.executor = executor;
    }

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        // TODO implement
    }
}

For brevity, the potential integer overflow of start + count was ignored here. Otherwise, so far nothing special.
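If one did want to guard against that overflow, a saturating end computation would do. Below is a hedged sketch (a hypothetical helper, not part of the original FlowRange) that caps the end index at Integer.MAX_VALUE instead of wrapping around:

```java
// Hypothetical helper: saturating version of "start + count" that
// returns Integer.MAX_VALUE instead of wrapping on overflow.
public class RangeBounds {

    static int saturatedEnd(int start, int count) {
        long end = (long) start + count; // widen to avoid int overflow
        return end > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) end;
    }

    public static void main(String[] args) {
        System.out.println(saturatedEnd(1, 5));                     // prints 6
        System.out.println(saturatedEnd(Integer.MAX_VALUE - 1, 5)); // prints 2147483647
    }
}
```

The constructor could then assign `this.end = saturatedEnd(start, count);` at the cost of emitting fewer items than requested in the extreme case.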

Generally in RxJava, if one writes a backpressure-enabled source, one has to implement something on top of the Subscription interface. For intermediate operators (such as map()), one usually has to implement a Subscriber + Subscription wrapper together. Since the integer range is a plain source, we have to take the Flow.Subscription route.

The general pattern with that is to repeat the input parameters along with the actual Flow.Subscriber that will receive the notifications:


    // inside FlowRange
    // (requires import java.util.concurrent.atomic.AtomicLong)

    static final class RangeSubscription
            extends AtomicLong                        // (1)
            implements Flow.Subscription, Runnable {  // (2)

        final Flow.Subscriber<? super Integer> actual;

        final int end;

        final Executor executor;

        int index;                                    // (3)
        boolean hasSubscribed;                        // (4)

        volatile boolean cancelled;
        volatile boolean badRequest;                  // (5)

        RangeSubscription(
                Flow.Subscriber<? super Integer> actual,
                int start, int end,
                Executor executor) {
            this.actual = actual;
            this.index = start;
            this.end = end;
            this.executor = executor;
        }

        @Override
        public void request(long n) {
            // TODO implement
        }

        @Override
        public void cancel() {
            // TODO implement
        }

        @Override
        public void run() {
            // TODO implement
        }
    }


There are a couple of things that need a bit of an explanation:


  1. In order to ensure no more than the requested amount is emitted, we have to track the downstream's request amounts. Generally, you'd want to use a volatile long requested field along with a VarHandle REQUESTED for fast atomics, but our range source has only the requested amount itself needing atomic support, hence extending AtomicLong is a cheap way to get those atomics.
  2. Since we have to be asynchronous when interacting with the actual Subscriber, task(s) have to be submitted to the Executor. We'd like to avoid creating an excess amount of Runnables in general, and in this particular case we don't need to, since all cross-thread communication is done via thread-safe fields.
  3. Speaking of thread-safety, the index field, which tracks how many items have been emitted, will be confined to the thread that runs the emission logic in run(). We initialize it to the start value of the range and let it run until it reaches the end value.
  4. One of the implications of going fully async is that the call to onSubscribe() has to happen asynchronously as well, unlike what we can see in RxJava. This is a tradeoff between eager cancellation and thread confinement.
  5. This may seem an odd field. In the Reactive-Streams specification, calling request() with a non-positive value must be rewarded with an IllegalArgumentException that contains the rule number "3.9" and has to be sent via onError() downstream. Since the calls to the onXXX methods have to be serialized (no concurrent invocations), we have to communicate the violation in some way to the emitting thread. The easiest way is to use this volatile field.
So far, with only the skeleton definition of the integer range source in place, there is nothing too complicated or convoluted in the code.
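As a side note, the volatile long field plus VarHandle alternative mentioned in (1) would look roughly like this. It is a sketch of the general pattern with hypothetical names, not code from FlowRange:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of the "volatile long + VarHandle" pattern (hypothetical names):
// useful when a class needs atomics on several fields and can't simply
// extend AtomicLong.
public class RequestedField {

    volatile long requested;

    static final VarHandle REQUESTED;
    static {
        try {
            REQUESTED = MethodHandles.lookup()
                .findVarHandle(RequestedField.class, "requested", long.class);
        } catch (ReflectiveOperationException ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    // Atomically add n, capping the sum at Long.MAX_VALUE,
    // and return the amount *before* the addition.
    long addRequested(long n) {
        for (;;) {
            long r = (long) REQUESTED.getVolatile(this);
            long u = r + n;
            if (u < 0L) {
                u = Long.MAX_VALUE; // cap on overflow
            }
            if (REQUESTED.compareAndSet(this, r, u)) {
                return r;
            }
        }
    }

    public static void main(String[] args) {
        RequestedField f = new RequestedField();
        System.out.println(f.addRequested(5));              // prints 0
        System.out.println(f.addRequested(Long.MAX_VALUE)); // prints 5
        System.out.println(f.requested);                    // prints 9223372036854775807
    }
}
```

Extending AtomicLong, as RangeSubscription does, gives the same semantics with less ceremony when only one field needs atomics.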

However, we now have a few problems to solve when trying to implement the TODO marked methods:

  1. Unlike Scheduler.Worker, the Executor interface gives no guarantee that submitting two Runnables, one after the other from the same thread, will be executed in the same order by the underlying thread(pool). Therefore, we need a way to make sure there is no concurrent execution happening when, for example, the downstream requests concurrently.
  2. The implementation of request() must be thread-safe and reentrant-safe, and has to trigger the emission of the requested amount of values on the given Executor. Bad requests should also be signalled through the Executor.
  3. Flow.Subscriber.onSubscribe() has to be called before any other signal is emitted, on the given Executor as well.

To resolve these problems, perhaps surprisingly, the core component we need is the request accounting (AtomicLong) itself, cleverly using its value transitions along with the extra fields we saw in the skeleton above. In short:


  1. This is called trampolining in RxJava's terminology, and we'll use the request amount's (atomic) transition from 0 to N (where N > 0L), at which point we will "schedule" the RangeSubscription itself via Executor.execute(). This transition guarantees that when the request amount is 0, there is no concurrent modification or notification happening and it is safe to start a new run of emission.
  2. By using the same trampolining and atomics guarantees, calling request() is also thread-safe and reentrant-safe. Since a bad request may come from any thread as well, we have to set the badRequest flag and "imitate" a request(1) situation to get the emission thread going. Of course, the emission thread has to detect that this "1" is not a real downstream request by reading the badRequest flag first and signalling the required exception.
  3. To make sure onSubscribe() is always called first and exactly once, we have to check and store the hasSubscribed flag accordingly. Since this has to happen asynchronously and as a consequence of subscribing to FlowRange, we will use the same request(1) trick to avoid reentrancy problems from the real requests as well as to jump to the right thread via the Executor.
Now let's see how these look in code. The subscribe() method is straightforward, based on (3):

    // ...

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        RangeSubscription sub = new RangeSubscription(subscriber, start, end, executor);
        sub.request(1);
    }

// ...



Next, we have to deal with the requests (2):

    // ...

    @Override
    public void request(long n) {
        if (n <= 0L) {
            badRequest = true;
            n = 1;
        }
        for (;;) {
            long r = get();
            long u = r + n;
            if (u < 0L) {
                u = Long.MAX_VALUE;
            }
            if (compareAndSet(r, u)) {
                if (r == 0L) {
                    executor.execute(this);
                }
                break;
            }
        }
    }

// ...


First, we check for non-positive request amounts and set the badRequest flag to notify the emitter thread about the problem. Then, we perform the typical atomic request addition capped at Long.MAX_VALUE, and in case the previous request was zero, we start the emission by submitting this to the Executor. If the previous request was non-zero, this atomic change will signal to the emitter loop inside run() that it should loop a bit more.

Cancellation is trivial: set cancelled to true, since we don't have to execute any cleanup with this type of source. On the emitter thread, the emissions will stop reasonably quickly.

    // ...

    @Override
    public void cancel() {
        this.cancelled = true;
    }

// ...


Finally, the most complicated part is the run() method responsible for emitting signals on the Executor's thread (1).

    // ...

    @Override
    public void run() {

        Flow.Subscriber<? super Integer> a = actual;

        if (!hasSubscribed) {                     // (1)
            hasSubscribed = true;
            a.onSubscribe(this);
            if (decrementAndGet() == 0) {         // (2)
                return;
            }
        }

        long r = get();                           // (3)
        int idx = index;
        int f = end;
        long e = 0L;

        for (;;) {
            while (e != r && idx != f) {          // (4)
                if (cancelled) {
                    return;
                }
                if (badRequest) {                 // (5)
                    cancelled = true;
                    a.onError(new IllegalArgumentException(
                        "§3.9 violated: non-positive request received"));
                    return;
                }

                a.onNext(idx);

                idx++;                            // (6)
                e++;
            }

            if (idx == f) {                       // (7)
                if (!cancelled) {
                    a.onComplete();
                }
                return;
            }

            r = get();                            // (8)
            if (e == r) {
                index = idx;
                r = addAndGet(-e);
                if (r == 0L) {
                    break;
                }
                e = 0L;
            }
        }
    }

// ...


Let's see what's happening next to the notable lines:

  1. Once run() is executing, the very first step is to make sure onSubscribe() is called exactly once.
  2. Decrementing the requested amount has two purposes here: first, it removes the virtual request(1) that came from the subscribe() method as the initial signal triggering the call to onSubscribe() itself. This decrement has to happen after the call to onSubscribe() because, second, the downstream may now issue real requests on top; if it does, we need the correct amount later on. If there is no request, we can quit, because there is no reason to emit anything at that point.
  3. We read out the current request amount and the index where we have to start or have left off in the previous emission loop, and load the end value (exclusive) into a local variable since we are going to access it frequently.
  4. After the typical queue-drain loop pattern is entered, we loop until the emission count e and the initially known request amount r match, or we reach the end of the range.
  5. Since a bad request triggers a virtual request(1) as well, we have to check the badRequest flag and signal the error instead of emitting a value (which was probably not requested by the downstream anyway), then quit the method.
  6. Once the current index value has been emitted, we move the emission count and the index itself forward.
  7. If the loop in (4) was stopped because we reached the end of the range, we emit the onComplete signal (unless cancelled in the meantime) and quit the method.
  8. Since atomic operations are expensive and it is very likely more requests arrive from downstream while the emission loop executes, we can avoid the atomic subtraction by first checking whether the request amount has changed since it was last read in (3) and, if so, just going another round and continuing to emit. If it hasn't changed, we atomically subtract the emitted count. At this point, it is still possible a concurrent request() changes the amount and we have to resume the loop again, this time starting the emitted count from zero.
Now that we have the full source ready, let's test it!

Testing


Unfortunately, Java 9 doesn't offer any built-in, reusable consumer we could use to verify the FlowRange source; therefore, we have to build one from scratch. Depending on the convenience we'd like to have, the test consumer - let's call it TestFlowSubscriber - can be relatively simple:


import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;

public class TestFlowSubscriber<T>
implements Flow.Subscriber<T> {

    protected final List<T> values;

    protected final List<Throwable> errors;

    protected int completions;

    protected Flow.Subscription subscription;

    protected final CountDownLatch done;

    public TestFlowSubscriber() {
        this.values = new ArrayList<>();
        this.errors = new ArrayList<>();
        this.done = new CountDownLatch(1);
    }

    @Override
    public final void onSubscribe(
            Flow.Subscription subscription) {
        this.subscription = subscription;
        onStart();
    }

    public void onStart() {
        subscription.request(Long.MAX_VALUE);
    }

    @Override
    public void onNext(T item) {
        values.add(item);
    }

    @Override
    public void onError(Throwable throwable) {
        errors.add(throwable);
        done.countDown();
    }

    @Override
    public void onComplete() {
        completions++;
        done.countDown();
    }

    public final List<T> values() {
        return values;
    }

    public final List<Throwable> errors() {
        return errors;
    }

    public final int completions() {
        return completions;
    }

    public final boolean await(long timeout, TimeUnit unit)
            throws InterruptedException {
        return done.await(timeout, unit);
    }
}



The TestFlowSubscriber offers these basic features:

  • Override onStart() to issue a custom request amount upfront; the rest can be requested via this.subscription.request() from onNext() later on.
  • Override onNext(), onError() or onComplete() to perform custom actions instead of/on top of saving the item, saving the error or incrementing the completion counter.
  • Since sources are expected to emit asynchronously, the internal CountDownLatch's await() is exposed, which waits until the source completes normally or with an error.
Now let's validate the FlowRange source via a JUnit 4 test case:

import java.util.Arrays;

import org.junit.Test;
import static org.junit.Assert.*;

public class FlowRangeTest {

    @Test
    public void normal() {
        FlowRange source = new FlowRange(1, 5, Runnable::run);

        TestFlowSubscriber<Integer> ts = new TestFlowSubscriber<>();

        source.subscribe(ts);

        assertEquals(Arrays.asList(1, 2, 3, 4, 5), ts.values());
        assertEquals(1, ts.completions());
        assertTrue(ts.errors().isEmpty());
    }
}

Now wait a minute - I said asynchrony is required, yet this test uses Runnable::run as the Executor! This could be surprising to newcomers, but it is a pretty standard property of the design employed here (and in RxJava 2): asynchrony is orthogonal to the emission in some sense, and due to the trampolining/coroutine-like structure, it works in both synchronous and asynchronous mode!

Therefore, let's see a real asynchronous test case:


    // ...

    @Test
    public void async() throws InterruptedException {
        FlowRange source = new FlowRange(1, 5, ForkJoinPool.commonPool());

        TestFlowSubscriber<Integer> ts = new TestFlowSubscriber<>();

        source.subscribe(ts);

        assertTrue(ts.await(5, TimeUnit.SECONDS));
        assertEquals(Arrays.asList(1, 2, 3, 4, 5), ts.values());
        assertEquals(1, ts.completions());
        assertTrue(ts.errors().isEmpty());
    }

// ...


Still works, great!

Conclusion


Java 9 is becoming reactive, but documentation and guides are scarce at the moment, and since many developers on the desktop/server JVM are unaware of the state of the art in reactive libraries available today, a new set of guides and posts written specifically from the Java 9 Flow API's perspective and terminology could help extend the JDK's own use of reactive technology much earlier.

Don't underestimate the difficulty of building reactive components this way; the state/flow management can become quite complicated, and it is often difficult to understand why tricks such as reusing the requested amount in RangeSubscription work. (However, if you have seen the typical concurrency-related source code in the JDK, such as SubmissionPublisher, I believe the style in this blog post and in RxJava in general is more comprehensible.)

In the next post of the Java 9 Flow API series, I'm going to show how one can implement asynchronous map() and filter() intermediate operators with it.

Java 9 Flow API: mapping and filtering in one stage


Introduction

In most reactive libraries, mapping and filtering can be done on a flow via the separate operators map() and filter() respectively. On rare occasions, the functions given to these operators need to communicate with each other, without sharing information in a flow-external manner and without using defer(). Such a combined, standard mapFilter() operator doesn't exist, and one has to write one's own.
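To see why flow-external sharing is unattractive, here is a small illustration (using java.util.stream for brevity, with hypothetical names): the filtering stage reads state that the preceding stage wrote through a captured variable, which makes the pipeline stateful and single-use.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustration of flow-external communication between two stages:
// the filter reads a flag that an earlier stage set through a captured
// AtomicBoolean, so the pipeline cannot be reassembled or reused safely.
public class SharedStateProblem {

    // Emits the string form of even numbers seen before the value 7 occurs.
    static List<String> evensBeforeSeven() {
        AtomicBoolean seenSeven = new AtomicBoolean();

        return IntStream.rangeClosed(1, 10)
            .peek(v -> { if (v == 7) seenSeven.set(true); }) // one stage writes
            .filter(v -> !seenSeven.get() && v % 2 == 0)     // the next stage reads
            .mapToObj(Integer::toString)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(evensBeforeSeven()); // prints [2, 4, 6]
    }
}
```

A combined mapFilter() keeps such cross-stage decisions inside a single handler, so no captured mutable state is needed.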

Given that the Java 9 Flow API is brand new, one definitely has to write a custom operator for it, as Java 9 itself doesn't provide any rich set of predefined operations on Flow.Publishers, unlike its dual, the Stream API.

Shameless advertising

By the way, if you are looking for a Java 9 Flow-based, native and modern reactive library with rich set of operators, similar to RxJava 2 (even including some operators from its extension project), I happen to have one for you: Reactive4JavaFlow. It is free and open-source with the promising outlook that one day, it may form the basis for the next major RxJava version...


MapFilter API design

When the Reactive4Java library was first conceived in 2011, the first significant stumbling block was not the lack of lambdas in Java 6/7 but the lack of extension methods. C# had them, which made Rx.NET conveniently extendable (assuming you managed to understand how to write operators for it, as it wasn't open source at the time). Java still doesn't show any sign of ever getting extension methods; therefore, we either need a rich abstract base class, such as Flowable or Flux, or a utility class whose methods almost look like extension-method definitions, with the exception that the developer has to stack them on top of one another:


import java.util.concurrent.*;
import static FlowUtils.*;

Flow.Publisher<String> f =
    timeout(
        mapFilter(
            new FlowRange(1, 10, Runnable::run),
            (v, e) -> {
                if (v % 2 == 0) {
                    e.next(v.toString());
                }
                if (v == 7) {
                    e.complete();
                }
            }
        ),
        5, TimeUnit.MILLISECONDS
    );

When thinking about a combined map and filter operator, the problem arises that we'd need to allow the function to emit a mapped item, to indicate the input item is dropped, to fail and, while we are at it, to stop the sequence entirely. A functional interface can only return one thing at a time or throw; trying to encode the return value or a stop indicator introduces potential manual casting and null-use otherwise absent in modern reactive flows.

At first thought, exposing the Flow.Subscriber to a user-provided function may be attractive; however, such direct access has implications and is susceptible to incorrect use:


  • The operator has to honor backpressure, hopefully without buffering at all. Calling any of the methods multiple times deliberately or by accident would violate the Reactive-Streams protocol.
  • The upstream's Flow.Subscription has to be cancelled if the user function called onError() or onComplete().
  • Call to onSubscribe() should be prevented.
  • At best, the three previous options would need an extra wrapper around the downstream Flow.Subscriber which adds one layer of indirection and one extra allocation.

Instead, we'll define a MapFilterEmitter interface that limits the API surface to the 4 possible responses to an upstream value in the operator:

interface MapFilterEmitter<R> {

    void next(R result);

    void error(Throwable throwable);

    void complete();
}

The choice of dropping an upstream value with this interface design is indicated by simply not calling any of the methods. The reason the method names don't use the onX pattern is that, by having different names, the same class can implement both the Flow.Subscriber and MapFilterEmitter interfaces without name and functionality clashes.

Now the operator's method signature in the FlowUtils class can be defined as follows:


public static <T, R> Flow.Publisher<R> mapFilter(Flow.Publisher<T> source,
        java.util.function.BiConsumer<? super T, MapFilterEmitter<R>> handler) {

    return new MapFilterPublisher<>(source, handler);
}


We employ the Java 8+ standard functional interface BiConsumer instead of a Function because we indicate the mapping outcome by requiring the implementor of this handler to call methods on the provided MapFilterEmitter instance. Naturally, the upstream's value we'd like the handler to respond to must also be provided.

Sometimes it is more user-friendly to allow the user-provided lambda to throw checked exceptions. Unfortunately, the Java 8+ standard functional interfaces don't let us do that, and one has to define one's own functional interface with the single abstract method declaring throws Exception or even throws Throwable. Adapting the operator API to this case is left as an exercise to the reader.
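A sketch of such a custom functional interface might look like the following (hypothetical names; it assumes, as the actual operator does, that the caller catches Throwable around handler.accept() anyway, so widening the throws clause costs nothing):

```java
// Hypothetical checked-exception-friendly variant of BiConsumer.
public class CheckedHandlerDemo {

    @FunctionalInterface
    interface CheckedBiConsumer<T, U> {
        void accept(T t, U u) throws Throwable;
    }

    // Returns true if the handler completed, false if it threw; a real
    // operator would instead cancel the upstream and emit onError.
    static <T, U> boolean invoke(CheckedBiConsumer<T, U> handler, T t, U u) {
        try {
            handler.accept(t, u);
            return true;
        } catch (Throwable ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The lambda may now throw a checked exception without any wrapping.
        CheckedBiConsumer<Integer, StringBuilder> handler = (v, sb) -> {
            if (v < 0) {
                throw new Exception("negative input");
            }
            sb.append(v);
        };

        System.out.println(invoke(handler, 5, new StringBuilder()));  // prints true
        System.out.println(invoke(handler, -1, new StringBuilder())); // prints false
    }
}
```

The mapFilter() signature would then simply take a CheckedBiConsumer instead of a BiConsumer.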

The operator implementation

Working with a reactive stream of data practically means creating a Flow.Subscriber to receive data from an upstream and then calling the methods of a downstream Flow.Subscriber at the right time. When Flow.Publisher.subscribe() happens, the downstream's Flow.Subscriber becomes available; it gets wrapped into the operator's own Flow.Subscriber instance, and this instance gets subscribed to the upstream Flow.Publisher.

This is usually surrounded by a class implementing Flow.Publisher itself and can be considered a boilerplate to set up: have a constructor taking the parameters and callbacks to be used by the operator and perform the subscription in its subscribe() method:


public final class MapFilterPublisher<T, R> implements Flow.Publisher<R> {

    final Flow.Publisher<T> upstream;

    final BiConsumer<? super T, MapFilterEmitter<R>> handler;

    public MapFilterPublisher(
            Flow.Publisher<T> upstream,
            BiConsumer<? super T, MapFilterEmitter<R>> handler
    ) {
        this.upstream = upstream;
        this.handler = handler;
    }

    @Override
    public void subscribe(Flow.Subscriber<? super R> downstream) {
        upstream.subscribe(new MapFilterSubscriber<>(downstream, handler));
    }
}


The main benefit of this structure is that the same, usually complicated, stream can be run multiple times and independently, or repeatedly if (these) parts of the flow must be retried due to failure.

The MapFilterSubscriber is practically shimmed between the upstream and downstream for the purpose of altering the stream's emission pattern. For this, it has to implement the Flow.Subscriber interface and, to save on complications, we will implement the MapFilterEmitter interface on it at the same time.

It is recommended to implement Flow.Subscription as well and delegate the downstream's request() and cancel() up the chain, even though this means a level of indirection. The reason for this is to remain operator-fusion friendly, even though the current operator won't support operator-fusion.


static final class MapFilterSubscriber<T, R> implements
        Flow.Subscriber<T>, MapFilterEmitter<R>, Flow.Subscription {

    final Flow.Subscriber<? super R> downstream;

    final BiConsumer<? super T, MapFilterEmitter<R>> handler;

    Flow.Subscription upstream;

    R result;

    Throwable error;

    boolean done;

    MapFilterSubscriber(
            Flow.Subscriber<? super R> downstream,
            BiConsumer<? super T, MapFilterEmitter<R>> handler
    ) {
        this.downstream = downstream;
        this.handler = handler;
    }

    @Override
    public void onSubscribe(Flow.Subscription s) {
        this.upstream = s;
        downstream.onSubscribe(this);
    }

    @Override
    public void request(long n) {
        upstream.request(n);
    }

    @Override
    public void cancel() {
        upstream.cancel();
    }

    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++

    @Override
    public void onNext(T item) {
        // TODO implement
    }

    @Override
    public void onError(Throwable throwable) {
        // TODO implement
    }

    @Override
    public void onComplete() {
        // TODO implement
    }

    @Override
    public void next(R result) {
        // TODO implement
    }

    @Override
    public void error(Throwable throwable) {
        // TODO implement
    }

    @Override
    public void complete() {
        // TODO implement
    }
}

The fields result, error and done store the operator's state in response to the call to the given handler. This indirection is required to detect invalid multiple calls to the next, error and complete methods of the MapFilterEmitter. The onSubscribe(), request() and cancel() methods are straightforward: they forward the calls to the downstream and upstream respectively.

First, let's implement the methods onNext(), onError() and onComplete(), which respond to upstream signals:


@Override
public void onNext(T item) {
    if (done) {                           // (1)
        return;
    }

    try {
        handler.accept(item, this);       // (2)
    } catch (Throwable ex) {

        upstream.cancel();                // (3)

        Throwable error = this.error;
        if (error != null) {              // (4)
            error.addSuppressed(ex);
        } else {
            this.error = ex;
        }
        done = true;
    }

    R v = result;
    if (v != null) {                      // (5)
        result = null;

        downstream.onNext(v);
    }

    if (done) {

        Throwable error = this.error;     // (6)
        this.error = null;

        if (error == null) {
            downstream.onComplete();
        } else {
            downstream.onError(error);
        }
        return;
    }

    if (v == null) {
        upstream.request(1);              // (7)
    }
}


  1. In case the upstream can't react to cancellation immediately when the operator reaches its terminal state due to crashing/completing, checking for the done flag and returning immediately will prevent the unnecessary call to the handler later on.
  2. The handler is called with the current item from upstream and with the this instance, which implements the MapFilterEmitter to receive the decision inside the handler regarding the item.
  3. User provided functions are prone to crashing which we capture via catching Throwable. In this case, the upstream Flow.Subscription gets cancelled.
  4. The error gets saved, or appended to an existing error in case the handler called error() first and then crashed afterwards for some reason.
  5. Since the handler could have signalled a result value, we will first emit that value to downstream.
  6. If the handler terminated or crashed, the done flag will be true, indicating the terminal event should be emitted to downstream as well.
  7. Finally, if there was no result value from the handler, indicating the item should be dropped, we have to request a "replacement" value from upstream because the downstream won't know it should request more - there is no signal for "no value produced" but only the lack of signals.
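The next()/complete() contract described in the steps above can be modeled in isolation with a small, synchronous sketch. The Emitter interface and the run() helper below are illustrative stand-ins for the actual MapFilterEmitter/operator API, not the real implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class MapFilterSketch {

    // hypothetical stand-in for the MapFilterEmitter protocol
    interface Emitter<R> {
        void next(R item);
        void complete();
    }

    // runs the handler over a plain list: at most one next() per item,
    // no next() call means the item is dropped, complete() stops the run
    static <T, R> List<R> run(List<T> source, BiConsumer<T, Emitter<R>> handler) {
        List<R> out = new ArrayList<>();
        boolean[] done = { false };
        for (T t : source) {
            if (done[0]) {
                break;
            }
            Object[] result = { null };
            handler.accept(t, new Emitter<R>() {
                @Override public void next(R item) {
                    if (result[0] != null) {
                        throw new IllegalStateException("Multiple next() calls not allowed");
                    }
                    result[0] = item;
                }
                @Override public void complete() {
                    done[0] = true;
                }
            });
            if (result[0] != null) {
                @SuppressWarnings("unchecked")
                R r = (R) result[0];
                out.add(r);
            }
            // in the real operator, a dropped item triggers upstream.request(1)
        }
        return out;
    }

    public static void main(String[] args) {
        // square the even numbers, drop the odd ones, stop when 8 is seen
        List<Integer> result = run(List.of(1, 2, 3, 4, 8, 10), (t, e) -> {
            if (t == 8) { e.complete(); return; }
            if (t % 2 == 0) e.next(t * t);
        });
        System.out.println(result); // [4, 16]
    }
}
```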

Implementing onError() and onComplete() only requires preventing the upstream to call onError() or onComplete() on the downstream in case the handler in onNext() has indicated termination or crashed:


@Override
public void onError(Throwable throwable) {
if (!done) {
downstream.onError(throwable);
}
}

@Override
public void onComplete() {
if (!done) {
downstream.onComplete();
}
}


Finally, the role of the remaining next(), error() and complete() methods is to ensure they are called at most once per handler.accept() invocation, and in case of a violation, cancel the upstream and prepare the error to be emitted by (6) in onNext() listed above.


@Override
public void next(R item) {
if (done) {
return;
}

if (item == null) {
error(new NullPointerException("item == null"));
return;
}

if (this.result != null) {
error(new IllegalStateException("Multiple next() calls not allowed"));
} else {
this.result = item;
}
}


In next(), we first check whether the sequence has already been terminated by a prior error() or complete() call from within the handler. Then we check for null, as null values are not allowed in Reactive-Streams. Note that when writing operators, using null internally is beneficial: as long as it never leads to an onNext(null) call, we can use it for our own purposes, such as indicating that next() has not yet been called from the same invocation of the handler.accept() method.

@Override
public void error(Throwable throwable) {
if (done) {
return;
}
if (throwable == null) {
throwable = new NullPointerException("throwable == null");
}

upstream.cancel();

this.error = throwable;
done = true;
}

Similarly, error() checks for an already terminated state, turns a null Throwable into a NullPointerException and stores it in the error field to be emitted from (6) in onNext(). Since such a call is a terminal signal, the upstream has to be cancelled, as no further items should or would be handled by the operator from that point on.

@Override
public void complete() {
upstream.cancel();
done = true;
}

Finally, complete() just cancels the upstream for the same reason mentioned before and sets the done flag. Multiple calls to complete() have no (observable) effect and the method is considered idempotent, therefore no special checks for multiple calls are necessary. One could, of course, do the same if (done) return; as in the other methods of the listing.

Conclusion

Writing in-sequence operators that don't amplify item counts usually doesn't require complicated request management and/or the typical queue-drain approach mentioned a while ago on this blog. Combining operator features, however, may require an operator-specific API definition to interact with, help and at the same time limit the user's interaction with the flow to ensure it remains Reactive-Streams compliant.

You might say: "I'm using RxJava 2, how do I implement such an operator for it?". Easily; since conceptually both the Java 9 Flow API and RxJava 2 are derived from and built upon the Reactive-Streams principles, you only have to

  1. replace the imports of java.util.concurrent.Flow.* with org.reactivestreams.*,
  2. remove the Flow. prefix from the code above and
  3. replace implements Publisher<R> with extends Flowable<R>:


import org.reactivestreams.*;
import io.reactivex.*;

public final class MapFilterPublisher<T, R> extends Flowable<R> {
// ...
}

(Before you rush and propose a PR for RxJava 2, it should be noted that this type of operator is already available in the RxJava 2 Extensions project via the FlowableTransformers.mapFilter() operator.)

Interestingly, there was no concurrency-related component to this operator. This is no accident since the handling of an upstream item happens in-sequence due to onNext being called in a sequential manner, and as long as the onNext() method doesn't return control to its caller, we are safe from reentrant calls. From the handler's perspective, this means it must not go asynchronous by carrying the emitter instance, provided to it as the second argument of the BiConsumer.accept() it implements, over to another thread. It has to produce a response synchronously, although it may block while doing so.

However, in case the mapping and/or filtering response is the result of an asynchronous computation "forked out" from the current element, concurrency has to be handled quite differently. We will explore this case in the next blog post.

RxJava vs. Kotlin Coroutines, a quick look


Introduction

Does Kotlin Coroutines make RxJava and reactive programming obsolete? The answer depends on who you ask. Enthusiasts and marketing departments would say yes without hesitation. If so, sooner or later developers would have to convert Rx code into coroutines or write something with coroutines from the start.

Since Coroutines are currently experimental, there is always the prospect that deficiencies, especially regarding the overhead, will be resolved eventually. Therefore, this post will focus more on usability than raw performance.

The scenario

Let's say we have two functions imitating unreliable services: f1 and f2, both returning a number after some delay. We have to call these services, sum up their returned values and present the result to the user. However, if this doesn't happen within 500 milliseconds, we don't expect it to happen reasonably soon either, thus we'd like to cancel and retry the two services, giving up after a limited number of retries.

The Coroutine Way

Programming via coroutines feels like programming with the traditional ExecutorService- and Future-based toolset with the difference that the underlying infrastructure will use suspension, state machine(s) and task rescheduling instead of blocking a thread.
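To make the comparison concrete, here is a rough CompletableFuture-based sketch of the same timeout-and-retry shape in Java (Java 9+ for orTimeout). The class name, the executor choice and the exact structure are illustrative and not part of the original example:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FutureStyle {

    // same behavior as the post's f1/f2: slow unless the attempt index is 2
    static int f1(int i) { sleep(i != 2 ? 2000L : 200L); return 1; }
    static int f2(int i) { sleep(i != 2 ? 2000L : 200L); return 2; }

    static void sleep(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new CompletionException(e);
        }
    }

    static int sumWithRetry(ExecutorService exec) {
        for (int i = 0; ; i++) {
            final int attempt = i;
            CompletableFuture<Integer> v1 =
                CompletableFuture.supplyAsync(() -> f1(attempt), exec);
            CompletableFuture<Integer> v2 =
                CompletableFuture.supplyAsync(() -> f2(attempt), exec);
            try {
                // combine the two results but give up after 500 ms
                return v1.thenCombine(v2, Integer::sum)
                         .orTimeout(500, TimeUnit.MILLISECONDS)
                         .join();
            } catch (CompletionException ex) {
                // note: cancel() does not interrupt the still-sleeping tasks,
                // mirroring the resumption problem observed with coroutines below
                v1.cancel(true);
                v2.cancel(true);
                if (!(ex.getCause() instanceof TimeoutException) || attempt >= 2) {
                    throw ex;
                }
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newCachedThreadPool();
        try {
            System.out.println(sumWithRetry(exec)); // 3, after two timed-out attempts
        } finally {
            exec.shutdownNow();
        }
    }
}
```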

First, we need the functions that exhibit the delaying behavior:


suspend fun f1(i: Int): Int {
Thread.sleep(if (i != 2) 2000L else 200L)
return 1
}

suspend fun f2(i: Int): Int {
Thread.sleep(if (i != 2) 2000L else 200L)
return 2
}


Functions that participate in a coroutine execution should be declared with the suspend keyword and executed within a coroutine context. For demonstration purposes, the logic will sleep for 2 seconds if the parameter supplied to the functions is not 2. This will give a chance to the timeout logic to kick in yet the 3rd attempt to succeed before the timeout.


Since going asynchronous usually ends up leaving the main thread, we need a way to block it until the business logic completes before letting the JVM quit. For this, we can use the runBlocking execution mode in the main method:


fun main(arg: Array<String>) = runBlocking {

coroutineWay()

reactiveWay()
}

suspend fun coroutineWay() {
// TODO implement
}

fun reactiveWay() {
// TODO implement
}

The coroutine way of writing the desired logic promises some simplicity compared to the functional ways of RxJava; it should look as if everything were written in a sequential and synchronous manner.


suspend fun coroutineWay() {
val t0 = System.currentTimeMillis()

var i = 0;
while (true) { // (1)
println("Attempt " + (i + 1) + " at T=" +
(System.currentTimeMillis() - t0))

var v1 = async(CommonPool) { f1(i) } // (2)
var v2 = async(CommonPool) { f2(i) }

var v3 = launch(CommonPool) { // (3)
Thread.sleep(500)
println(" Cancelling at T=" +
(System.currentTimeMillis() - t0))
val te = TimeoutException();
v1.cancel(te); // (4)
v2.cancel(te);
}

try {
val r1 = v1.await(); // (5)
val r2 = v2.await();
v3.cancel(); // (6)
println(r1 + r2)
break;
} catch (ex: TimeoutException) { // (7)
println(" Crash at T=" +
(System.currentTimeMillis() - t0))
if (++i > 2) { // (8)
throw ex;
}
}
}
println("End at T="
+ (System.currentTimeMillis() - t0)) // (9)

}

The printlns were added to see what happens and when in this logic.


  1. In traditional sequential programming, there is no convenient way of retrying an operation under certain conditions, therefore, we first need a loop with a retry counter i.
  2. We fork off the async computations via async(CommonPool) which will start and execute the functions immediately on some background thread. It returns a Deferred<Int> we will need later. If we applied await() right away to get var v1 as the resulting value, that would suspend the current coroutine and the calculation for v2 wouldn't start until the first one resumed it. Plus, we'll need a way to cancel the ongoing computations in case of a timeout. See steps 3 and 5.
  3. If we'd like to timeout both computations, it seems we have to do the timed waiting ourselves with another async task. The method launch(CommonPool), returning a Job, will be used for this. The difference from async is that such tasks can't return values. We save the returned Job because in case the previous async calls succeed in time, we no longer need the timer to fire anymore.
  4. In the timeout job, we cancel v1 and v2 with a TimeoutException, that will unblock any routine that is suspended on getting a result from either of them.
  5. We await the results of the two computations. If there is a timeout, the await will rethrow the exception we used in step 4.
  6. If there was no exception, we cancel the timeout task itself as its services are no longer needed, and break out of the loop.
  7. If there was a timeout, we catch it the traditional way and perform state checks to determine what to do. Note that any other exception simply falls through and exits the loop.
  8. In case this was the 3rd (or later) attempt, we simply give up and rethrow the exception.
  9. If everything went okay, we print the total time the run took and leave the function. 
Looks straightforward, although the cancellation management can get scary: what if v2 crashes with some other exception (such as an IOException due to network access)? Certainly we have to keep those task references around so they can get cancelled in such cases as well (i.e., try-with-resources in Kotlin?). However, this scheme also has the drawback that even if v1 returned in time, we can't cancel v1 or detect the crash from v2 until there is an attempt to await it.

Regardless, the setup works and we get a printout something like this:



Attempt 1 at T=0
Cancelling at T=531
Crash at T=2017
Attempt 2 at T=2017
Cancelling at T=2517
Crash at T=4026
Attempt 3 at T=4026
3
End at T=4229


3 attempts, the last one succeeds and we get the sum of 3. Looks reasonable, right? Not so fast (pun intended)! We can see the cancellation happened about on time, ~500 milliseconds into the two unsuccessful attempts, yet the crash detection printout happened 2000 milliseconds after each attempt started! We know the cancel() invocation worked because it was the source of the exception we actually caught. Therefore, it looks like the Thread.sleep() calls in the functions were not actually interrupted, or in coroutine terms, not resumed with the cancellation exception. This could be a property of the CommonPool, the use of Future.cancel(false) in the underlying infrastructure or simply a limitation of it.


The Reactive Way

Now let's see how to accomplish the same with RxJava 2. Unfortunately, once a function is marked suspend, one can't call it from regular contexts, therefore we have to redo the functions in a traditional fashion:


fun f3(i: Int) : Int {
Thread.sleep(if (i != 2) 2000L else 200L)
return 1
}

fun f4(i: Int) : Int {
Thread.sleep(if (i != 2) 2000L else 200L)
return 2
}


To match the functionality of a blocking outer context, we will use the BlockingScheduler from the RxJava 2 Extensions project that allows returning to the main thread. As its name says, it blocks the caller/main thread when started until something submits a task through the scheduler to be executed.


fun reactiveWay() {
RxJavaPlugins.setErrorHandler({ }) // (1)

val sched = BlockingScheduler() // (2)
sched.execute {
val t0 = System.currentTimeMillis()
val count = Array<Int>(1, { 0 }) // (3)

Single.defer({ // (4)
val c = count[0]++;
println("Attempt " + (c + 1) +
" at T=" + (System.currentTimeMillis() - t0))

Single.zip( // (5)
Single.fromCallable({ f3(c) })
.subscribeOn(Schedulers.io()),
Single.fromCallable({ f4(c) })
.subscribeOn(Schedulers.io()),
BiFunction<Int, Int, Int> { a, b -> a + b } // (6)
)
})
.doOnDispose({ // (7)
println(" Cancelling at T=" +
(System.currentTimeMillis() - t0))
})
.timeout(500, TimeUnit.MILLISECONDS) // (8)
.retry({ x, e ->
println(" Crash at " +
(System.currentTimeMillis() - t0))
x < 3 && e is TimeoutException // (9)
})
.doAfterTerminate { sched.shutdown() } // (10)
.subscribe({
println(it)
println("End at T=" +
(System.currentTimeMillis() - t0)) // (11)
},
{ it.printStackTrace() })
}
}

A slightly longer implementation that certainly may look scary to those who are not used to this many lambdas.


  1. RxJava 2 notoriously delivers exceptions one way or another. On Android, undeliverable exceptions will crash the app unless handled with RxJavaPlugins.setErrorHandler. Here, since we know a cancellation will interrupt a Thread.sleep(), the resulting stack trace printed to the console would just clutter the output, so we decided to ignore such excess exceptions.
  2. We setup the BlockingScheduler and issue the first task to be executed on it, containing the rest of the logic to be executed in the main thread. This is due to the fact that because it blocks, a regular start() will livelock the main thread as any subsequent work, that would otherwise unblock it, wouldn't get executed.
  3. We setup a heap variable that will count the number of retries.
  4. We increment this counter and print out the "Attempt" string whenever there is a subscription via Single.defer. The operator allows us to have a per subscription state which we expect from the resubscriptions of a retry() operator down the chain.
  5. We use the zip operator that starts two single-element asynchronous calculation, each calling the respective function from a background thread. 
  6. Once both finish, we add the resulting number together.
  7. To make a cancellation from the timeout visible, we add the doOnDispose operator to print out the indicator and timestamp of such event.
  8. We define the overall timeout to get the sum via the timeout operator. The overload will signal a TimeoutException if the timeout happens (i.e., no fallback for this scenario).
  9. The retry operator overload provides the number of times the retry happened and the current error. After printing the error, we should return true - which indicates the retry must happen - if the number of retries so far is less than 3 and the error itself is a TimeoutException. Any other error will simply fall through without triggering a retry.
  10. Once we are done, we should shut down the scheduler so it can release the main thread and the JVM can quit.
  11. However, just before that, we print the resulting sum and the time it took the whole operation to finish.
One could say it is more convoluted compared to the coroutine version. At least it works:



Cancelling at T=4527

Attempt 1 at T=72
Cancelling at T=587
Crash at 587
Attempt 2 at T=587
Cancelling at T=1089
Crash at 1090
Attempt 3 at T=1090
Cancelling at T=1291
3
End at T=1292


The Cancelling at T=4527 entry, interestingly, comes from the coroutineWay() call when we run the two functions together from the main method above: even though there was no timeout on the last attempt, cancelling the timeout job itself suffered the same non-interruptible computation problem, hence the belated and moot signal about cancelling the already finished tasks.

RxJava, on the other hand, promptly cancels and retries the functions at least. There is, however, a practically unnecessary Cancelling at T=1291 entry in the printout too. This is an artifact, or rather my sloppiness, in how Single.timeout is implemented: if it succeeds without a timeout, the internal CompositeDisposable hosting the upstream's Disposable gets cancelled along with the timeout task regardless of the actual state of the operator.

Conclusion


As a final thought, let's illustrate the power of reactive design by a small change in the expectations: Why retry the whole sum if we could only retry that function which doesn't respond in time? The solution is straightforward in RxJava: move the doOnDispose().timeout().retry() into each of the function call sequence (perhaps through a transformer to avoid code duplication):


val timeoutRetry = SingleTransformer<Int, Int> { 
it.doOnDispose({
println(" Cancelling at T=" +
(System.currentTimeMillis() - t0))
})
.timeout(500, TimeUnit.MILLISECONDS)
.retry({ x, e ->
println(" Crash at " +
(System.currentTimeMillis() - t0))
x < 3 && e is TimeoutException
})
}

// ...

Single.zip(
Single.fromCallable({ f3(c) })
.subscribeOn(Schedulers.io())
.compose(timeoutRetry)
,
Single.fromCallable({ f4(c) })
.subscribeOn(Schedulers.io())
.compose(timeoutRetry)
,
BiFunction<Int, Int, Int> { a, b -> a + b }
)
// ...

I welcome the reader to try and update the coroutine implementation to accomplish the same behavior (including any other form of cancellation possibility while you are at it).

One of the benefits of declarative reactive programming is the ability to not bother with complications such as threading, propagation of cancellation and operation composition most of the time. Libraries such as RxJava give an API and a viewpoint that hide these lower level "evils" from the typical user.

So, are coroutines useful after all? Certainly they are, but I believe this usefulness is rather limited and I have my doubts on how it could replace reactive programming in general.

Java 9 Flow API: mapping asynchronously


Introduction


There are cases where an upstream value of type T has to be mapped to a value of type U, one-for-one, but the mapping process itself involves asynchronous work. With RxJava, this is a de-facto use case for concatMap, concatMapEager and flatMap, depending on the concurrency expectations about the mapping itself (i.e., one at a time; multiple at once but in order; and arbitrary order, respectively).

Let's assume we don't want to run multiple concurrent mapping thus concatMap would suffice. We can (and will in a future post) write that operator, but we should face two additional challenges: the standard Java 9 Flow API has no notion of 0..1 reactive type so we have to restrict the inner Flow.Publisher to at most one element (take(1)); and we'd sometimes zip the original and the mapped result into a third type R.
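A synchronous model of these semantics over materialized lists may help fix the intent before diving into the operator. This is purely illustrative plain Java; the real operator works on Flow.Publishers and handles backpressure:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MapWhenModel {

    // models mapWhen over lists: each upstream value t is mapped to a
    // (possibly empty) inner sequence; only its first element is combined
    // with t, and an empty inner sequence drops the value entirely
    static <T, U, R> List<R> mapWhen(
            List<T> source,
            Function<? super T, ? extends List<U>> mapper,
            BiFunction<? super T, ? super U, ? extends R> combiner) {
        List<R> out = new ArrayList<>();
        for (T t : source) {
            List<U> inner = mapper.apply(t);
            if (!inner.isEmpty()) {
                out.add(combiner.apply(t, inner.get(0))); // take(1) semantics
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "async" lookup that yields nothing for 2, so 2 is dropped
        List<String> result = mapWhen(
            List.of(1, 2, 3),
            (Integer i) -> i == 2 ? List.<Integer>of() : List.of(i * 10),
            (i, u) -> i + "->" + u);
        System.out.println(result); // [1->10, 3->30]
    }
}
```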

These requirements warrant their own custom operator, enter mapWhen().

The mapWhen operator

I must admit, the name comes from Reactor-Core after they picked up my implementation named mapAsync() from RxJava 2 Extensions. It certainly matches the naming of other operators, such as retryWhen(), but arguably the function parameter signature is different (i.e., not a Publisher -> Publisher transformation):


public static <T, U> Flow.Publisher<U> mapWhen(Flow.Publisher<T> source,
Function<? super T, ? extends Flow.Publisher<U>> mapper) {
return mapWhen(source, mapper, (t, u) -> u);
}

public static <T, U, R> Flow.Publisher<R> mapWhen(Flow.Publisher<T> source,
Function<? super T, ? extends Flow.Publisher<U>> mapper,
BiFunction<? super T, ? super U, ? extends R> combiner
) {
return new FlowMapWhen<>(source, mapper, combiner);
}


One would think that supporting the combiner case within the same operator implementation adds unreasonable overhead. We'll see later that this is not the case because both the original and the mapped value will be available in a way that makes applying the (t, u) -> u bi-function a trivial and, when JIT-ed, a fall-through case.

I'll omit the outer FlowMapWhen class' implementation because it is just a boilerplate forwarding the Flow.Subscriber and the two functional callbacks to the actual MapWhenSubscriber operator implementation. Let's see the skeleton of the MapWhenSubscriber:


// inside class FlowMapWhen
static final class MapWhenSubscriber<T, U, R>
implements Flow.Subscriber<T>, Flow.Subscription {

final Flow.Subscriber<? super R> downstream;

final Function<? super T, ? extends Flow.Publisher<U>> mapper;

final BiFunction<? super T, ? super U, ? extends R> combiner;

final T[] buffer;

long producerIndex;

long consumerIndex;

Flow.Subscription upstream;

int wip;

boolean done;

Throwable error;

boolean cancelled;

long emitted;

int consumed;

long requested;

U mapperResult;

int mapperState;

MapWhenInnerSubscriber<T, U, R> inner;

// -------------------------------------

static final VarHandle WIP;

static final VarHandle ERROR;

static final VarHandle DONE;

static final VarHandle CANCELLED;

static final VarHandle REQUESTED;

static final VarHandle BUFFER;

static final VarHandle MAPPER_STATE;

static final VarHandle INNER;

static final MapWhenInnerSubscriber<Object, Object, Object> INNER_CANCELLED =
new MapWhenInnerSubscriber<>(null);

static {
Lookup lk = MethodHandles.lookup();
try {
WIP = lk.findVarHandle(
MapWhenSubscriber.class, "wip", int.class);
ERROR = lk.findVarHandle(
MapWhenSubscriber.class, "error", Throwable.class);
DONE = lk.findVarHandle(
MapWhenSubscriber.class, "done", boolean.class);
CANCELLED = lk.findVarHandle(
MapWhenSubscriber.class, "cancelled", boolean.class);
REQUESTED = lk.findVarHandle(
MapWhenSubscriber.class, "requested", long.class);

MAPPER_STATE = lk.findVarHandle(
MapWhenSubscriber.class, "mapperState", int.class);
INNER = lk.findVarHandle(
MapWhenSubscriber.class, "inner", MapWhenInnerSubscriber.class);
} catch (Throwable ex) {
throw new InternalError(ex);
}

BUFFER = MethodHandles.arrayElementVarHandle(Object[].class);
}

// -------------------------------------

MapWhenSubscriber(
Flow.Subscriber<? super R> downstream,
Function<? super T, ? extends Flow.Publisher<U>> mapper,
BiFunction<? super T, ? super U, ? extends R> combiner) {

this.downstream = downstream;
this.mapper = mapper;
this.combiner = combiner;
this.buffer = (T[])new Object[Flow.defaultBufferSize()];
}

@Override
public void onSubscribe(Flow.Subscription s) {
// TODO implement
}

@Override
public void onNext(T item) {
// TODO implement
}

@Override
public void onError(Throwable throwable) {
// TODO implement
}

@Override
public void onComplete() {
// TODO implement
}

@Override
public void request(long n) {
// TODO implement
}

@Override
public void cancel() {
// TODO implement
}

void innerSuccess(U result) {
// TODO implement
}

void innerError(Throwable throwable) {
// TODO implement
}

void innerComplete() {
// TODO implement
}

void updateError(Throwable throwable) {
// TODO implement
}

void drain() {
// TODO implement
}
}

We have quite a set of fields so let's describe them from top to bottom:
  • downstream represents the recipient of each mapped value.
  • mapper is the primary mapping function that creates a Flow.Publisher for each individual upstream value
  • combiner is the secondary mapping function that turns the original upstream value and the async-mapped value into the end result.
  • buffer is a power-of-2, fixed-size circular buffer for prefetching and holding the upstream values efficiently. Since we'll match the buffer size with the backpressure requests, it will never overflow. Accessing the array elements has to be atomic, thus we will use the BUFFER VarHandle to accomplish that.
  • producerIndex and consumerIndex are pointers into the buffer and will be updated by the onNext()'s thread and the drain() thread respectively.
  • upstream is the connection to the upstream Flow.Publisher; we'll request from it in a stable prefetch manner.
  • wip is the work-in-progress counter for the usual queue-drain serialization approach: all sorts of events (requesting, upstream values, mapping responses) will be in flight but only one thread can act on them at a time. Changing this value has to be atomic thus the WIP companion VarHandle is also present.
  • done indicates the upstream has finished emitting events. It could be volatile but we can save some nanoseconds by using the DONE VarHandle and its setRelease access mode, otherwise not available on a plain volatile field. This is correct because, as we'll see, setting done is always followed by a full-barrier atomic increment of wip.
  • cancelled is an indication from the downstream that the processing should stop. We'll check this inside the drain loop. Again, this could also simply be a volatile field but we'll use compareAndSet in cancel() to make the cancellation process happen at most once. For this, the CANCELLED VarHandle will be used.
  • emitted tracks how many result items have been sent to the downstream consumer. When the emitted amount is equal to the requested amount, we quit emitting temporarily until the downstream requests more. No concurrent access happens to this variable because it will be only incremented from within the drain-loop.
  • consumed tracks the number of items processed from the upstream so that when a certain threshold has been reached, a replenishing request() call will be issued for more. Reusing consumerIndex is not an option because we have to reset the counter whenever such replenishing request() call happens.
  • requested tracks the total amount the downstream consumer has requested. This has to be atomically updated and the REQUESTED VarHandle will help us ensure that. Tracking requested and emitted separately instead of decrementing the requested amount again saves us a few nanoseconds.
  • mapperResult stores the result of the mapper's activity or null if there was no response.
  • mapperState indicates where the inner mapping is at: 0 indicates no mapper Flow.Publisher is being observed, 1 indicates observation is in progress but no value yet, 2 indicates a value has been received and the main flow can continue, and 3 indicates the observation completed without any value. Since observing and changing this state can happen from different threads, the MAPPER_STATE VarHandle's atomics will ensure proper visibility.
  • inner holds onto the currently running MapWhenInnerSubscriber observing the generated Flow.Publisher for an upstream value. If the downstream cancels the entire flow, the operator should prevent any further subscription to an inner Flow.Publisher that may happen concurrently. The INNER VarHandle helps with this "terminal atomics" scenario through a static cancellation indicator instance INNER_CANCELLED of type MapWhenInnerSubscriber.
The implementation of MapWhenInnerSubscriber should make sure at most one result value is considered. We could, of course, request 1 with it, but I prefer requesting an unbounded amount because either way we still have to cancel the inner Flow.Subscription and ensure only one of the innerX methods of MapWhenSubscriber is called.


static final class MapWhenInnerSubscriber<T, U, R>
extends AtomicReference<Flow.Subscription>
implements Flow.Subscriber<U>, Flow.Subscription {

final MapWhenSubscriber<T, U, R> parent;

MapWhenInnerSubscriber(MapWhenSubscriber<T, U, R> parent) {
this.parent = parent;
}

@Override
public void onSubscribe(Flow.Subscription s) {
if (compareAndSet(null, s)) {
s.request(Long.MAX_VALUE);
} else {
s.cancel();
}
}

@Override
public void onNext(U item) {
Flow.Subscription s = getPlain();
if (s != this) {
setPlain(this);
s.cancel();

parent.innerSuccess(item);
}
}

@Override
public void onError(Throwable throwable) {
Flow.Subscription s = getPlain();
if (s != this) {
setPlain(this);

parent.innerError(throwable);
}
}

@Override
public void onComplete() {
Flow.Subscription s = getPlain();
if (s != this) {
setPlain(this);

parent.innerComplete();
}
}

@Override
public void request(long n) {
// the inner subscription requests unbounded in onSubscribe; nothing to do here
}

@Override
public void cancel() {
Flow.Subscription s = getAndSet(this);
if (s != null && s != this) {
s.cancel();
}
}
}

When working with inner subscriber such as this, I found it practical to delegate most logic back to the parent coordinator class; sometimes this allows the reuse of the inner Flow.Subscriber implementation, other times it allows us to focus on the behavior of the operator now confined into a single parent class.

One new pattern that the MapWhenInnerSubscriber shows is the use of getPlain() and setPlain(). These were Java 9 additions to the AtomicXXX classes that allow us to access their contents without memory barriers. Since onNext(), onError() and onComplete() are guaranteed to execute sequentially, we don't really need atomics in the case when an onNext() should prevent any subsequent effects of an onError() or onComplete(). Usually a library has its standard cancelled Flow.Subscription indicator defined somewhere, but here we simply resort to using this to indicate a cancelled MapWhenInnerSubscriber, hence it also implements Flow.Subscription to support this case.
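The plain-access idiom can be tried in isolation (Java 9+). The takeOnce() helper and the TERMINATED sentinel below are illustrative, not part of the operator:

```java
import java.util.concurrent.atomic.AtomicReference;

public class PlainAccessDemo {

    static final String TERMINATED = "terminated";

    // consumes the stored value at most once using plain (fence-free) access;
    // this is only valid because the reactive spec guarantees the calling
    // onNext/onError/onComplete methods execute sequentially, never concurrently
    static String takeOnce(AtomicReference<String> ref) {
        String current = ref.getPlain();
        if (current != TERMINATED) {
            ref.setPlain(TERMINATED); // blocks any later signal's effect
            return current;
        }
        return null;
    }

    public static void main(String[] args) {
        AtomicReference<String> ref = new AtomicReference<>();
        ref.setPlain("value");
        System.out.println(takeOnce(ref)); // value
        System.out.println(takeOnce(ref)); // null
    }
}
```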

Now let's get back to the main MapWhenSubscriber and implement the methods.

onSubscribe


@Override
public void onSubscribe(Flow.Subscription s) {
this.upstream = s;
downstream.onSubscribe(this);
s.request(Flow.defaultBufferSize());
}


A typical stable-prefetch onSubscribe implementation: save the upstream handle, introduce ourselves to downstream and then request the fixed amount from upstream.

onNext


@Override
public void onNext(T item) {
T[] buf = buffer;
int mask = buf.length - 1;
long pi = producerIndex;

BUFFER.setRelease(buf, (int)pi & mask, item);
producerIndex = pi + 1;

drain();
}


The code before drain() is usually hidden behind a Queue.offer() implementation; you may recognize the algorithm from the JCTools library's SpscArrayQueue implementation. Of course, I could have just used that, but then we'd miss a nice use for VarHandles which, unlike field updaters, can target array elements directly. Compared to the JCTools version, there are a few important differences: no look-ahead, no item padding (to avoid false sharing) and no overflow detection. The latter is omitted because we are in a backpressured flow: remember we sized the buffer with Flow.defaultBufferSize() and requested the same amount? By definition, a reactive flow should not emit more than requested. Note that we don't map the upstream item into a Flow.Publisher and subscribe a MapWhenInnerSubscriber here, unlike a flatMap() for example, because we want only one of them to be active at a time and we want to wait for it to signal.
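The offer side above, together with a matching poll, can be sketched as a minimal single-producer single-consumer ring. This is illustrative; the real operator inlines the offer and relies on backpressure so the producer never needs a capacity check:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class SpscRing<T> {

    // atomic access to individual array elements, as in the operator
    static final VarHandle BUFFER = MethodHandles.arrayElementVarHandle(Object[].class);

    final Object[] buffer;
    long producerIndex;  // written only by the producer thread
    long consumerIndex;  // written only by the consumer thread

    public SpscRing(int capacityPowerOf2) {
        this.buffer = new Object[capacityPowerOf2];
    }

    // producer side; backpressure must guarantee the slot is free
    public void offer(T item) {
        int mask = buffer.length - 1;
        long pi = producerIndex;
        BUFFER.setRelease(buffer, (int) pi & mask, item);
        producerIndex = pi + 1;
    }

    // consumer side; returns null when the next slot has not been published yet
    @SuppressWarnings("unchecked")
    public T poll() {
        int mask = buffer.length - 1;
        long ci = consumerIndex;
        int offset = (int) ci & mask;
        T item = (T) BUFFER.getAcquire(buffer, offset);
        if (item != null) {
            BUFFER.setRelease(buffer, offset, null); // free the slot
            consumerIndex = ci + 1;
        }
        return item;
    }

    public static void main(String[] args) {
        SpscRing<Integer> q = new SpscRing<>(8);
        q.offer(1);
        q.offer(2);
        System.out.println(q.poll()); // 1
        System.out.println(q.poll()); // 2
        System.out.println(q.poll()); // null
    }
}
```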

onError


Before we implement onError(), a helper method needs to be introduced that captures a behavior required by later handler methods as well:

void updateError(Throwable throwable) {
for (;;) {
Throwable current = (Throwable)ERROR.getAcquire(this);
Throwable next;
if (current == null) {
next = throwable;
} else {
next = new Throwable();
next.addSuppressed(current);
next.addSuppressed(throwable);
}
if (ERROR.compareAndSet(this, current, next)) {
break;
}
}
}


Perhaps the easier behavior is to delay errors until all upstream values have been processed and present the accumulated errors at once at the end. The trouble is that both the upstream and the inner Flow.Publisher may signal error at the same time. To handle this case, we do something like a copy-on-write atomic update. If error delaying is not preferred, a simple ERROR.compareAndSet(this, null, throwable) suffices (don't forget to do something with throwable in case the compareAndSet returns false!).
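The copy-on-write accumulation can be exercised standalone. The host class below and the aggregate Throwable's use are illustrative of the pattern, not the operator's exact code:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class ErrorAccumulator {

    volatile Throwable error;

    static final VarHandle ERROR;
    static {
        try {
            ERROR = MethodHandles.lookup()
                .findVarHandle(ErrorAccumulator.class, "error", Throwable.class);
        } catch (Throwable ex) {
            throw new InternalError(ex);
        }
    }

    // copy-on-write accumulation: a CAS loop keeps it safe against a
    // concurrent updateError call from the other (inner or outer) source
    void updateError(Throwable throwable) {
        for (;;) {
            Throwable current = (Throwable) ERROR.getAcquire(this);
            Throwable next;
            if (current == null) {
                next = throwable;
            } else {
                next = new Throwable();
                next.addSuppressed(current);
                next.addSuppressed(throwable);
            }
            if (ERROR.compareAndSet(this, current, next)) {
                break;
            }
        }
    }

    public static void main(String[] args) {
        ErrorAccumulator acc = new ErrorAccumulator();
        acc.updateError(new IllegalStateException("first"));
        acc.updateError(new java.io.IOException("second"));
        // both errors survive as suppressed exceptions of the aggregate
        System.out.println(acc.error.getSuppressed().length); // 2
    }
}
```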

@Override
public void onError(Throwable throwable) {
updateError(throwable);
DONE.setRelease(this, true);
drain();
}


Handling the upstream error requires us to update the errors we are collecting, indicating the upstream has completed and trigger a drain() that will perform the necessary emissions of events.

onComplete


@Override
public void onComplete() {
    DONE.setRelease(this, true);
    drain();
}


Handling the normal upstream completion is trivial (and similar to onError): indicate the upstream completed and issue drain() to emit events as the current overall state indicates.

request


@Override
public void request(long n) {
    if (n <= 0L) {
        updateError(new IllegalArgumentException("positive request expected"));
    } else {
        for (;;) {
            long current = (long)REQUESTED.getAcquire(this);
            long next = current + n;
            if (next < 0L) {
                next = Long.MAX_VALUE;
            }
            if (REQUESTED.compareAndSet(this, current, next)) {
                break;
            }
        }
    }
    drain();
}


Handling downstream requests is done by aggregating them atomically and capping the total at Long.MAX_VALUE. Non-positive requests are rewarded with an IllegalArgumentException, as the specification requires. Either way, drain() will actually deliver the relevant events.
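The additive accumulation with the overflow cap can be tried out on its own; here is a sketch with an AtomicLong standing in for the REQUESTED VarHandle (class and method names are mine):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the requested-amount accounting: additions are atomic and the
// total is capped at Long.MAX_VALUE, which Reactive-Streams treats as an
// effectively unbounded request.
public class RequestCap {
    static final AtomicLong REQUESTED = new AtomicLong();

    static void addRequest(long n) {
        for (;;) {
            long current = REQUESTED.get();
            if (current == Long.MAX_VALUE) {
                return;                  // already unbounded, nothing to add
            }
            long next = current + n;
            if (next < 0L) {             // two's-complement overflow: clamp
                next = Long.MAX_VALUE;
            }
            if (REQUESTED.compareAndSet(current, next)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        addRequest(10);
        addRequest(Long.MAX_VALUE - 5);  // overflows past 2^63 - 1
        System.out.println(REQUESTED.get() == Long.MAX_VALUE); // true
    }
}
```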

cancel


@Override
public void cancel() {
    if (CANCELLED.compareAndSet(this, false, true)) {
        upstream.cancel();

        MapWhenInnerSubscriber inner =
            (MapWhenInnerSubscriber)INNER.getAndSet(this, INNER_CANCELLED);
        if (inner != null && inner != INNER_CANCELLED) {
            inner.cancel();
        }

        if ((int)WIP.getAndAdd(this, 1) == 0) {
            innerResult = null;
            Arrays.fill(buffer, null);
        }
    }
}


We atomically change into the cancelled state (exactly once) and cancel the upstream. Since there might be an active inner observation going on, we have to cancel that as well and prevent any further observation from taking place by storing INNER_CANCELLED. Note the unfortunate need for casting the object returned by the VarHandle invocation. The last part, incrementing the wip counter, is there to make sure the buffer and any current result are cleaned up even when there is no drain loop running at the moment.

innerSuccess


@Override
public void innerSuccess(U result) {
    this.innerResult = result;
    MAPPER_STATE.setRelease(this, 2);
    drain();
}


Handling the emission from the inner source requires storing the mapped result into the field, indicating that there is a value available (2) and issuing a drain() to perform the signal emissions accordingly.

innerError


@Override
public void innerError(Throwable throwable) {
    updateError(throwable);
    MAPPER_STATE.setRelease(this, 3);
    drain();
}


Handling the error from the MapWhenInnerSubscriber looks similar, update the errors, indicate the empty state (3) for the inner result and issue a drain().

innerComplete


@Override
public void innerComplete() {
    MAPPER_STATE.setRelease(this, 3);
    drain();
}


Pretty much a copy-paste of innerError(), minus the updating of the error of course.

drain

Arguably, the methods described in previous sections were only there to prepare the state for the actual workhorse: the drain() method, which is a cornerstone of most lock-free operator implementations. Since we are to coalesce the handling of upstream values with the handling of inner results, the drain() method will be described in several code listings.

First, let's write the skeleton drain loop:


void drain() {

    if ((int)WIP.getAndAdd(this, 1) != 0) {
        return;
    }

    int missed = 1;

    Flow.Subscriber<? super R> downstream = this.downstream;

    T[] buf = buffer;
    int mask = buf.length - 1;

    int limit = buf.length - (buf.length >> 2);

    long ci = consumerIndex;

    int c = consumed;
    long e = emitted;

    for (;;) {
        long r = (long)REQUESTED.getAcquire(this);

        while (e != r) {
            // TODO implement
        }

        if (e == r) {
            // TODO implement
        }

        consumerIndex = ci;
        consumed = c;
        emitted = e;

        missed = (int)WIP.getAndAdd(this, -missed) - missed;
        if (missed == 0) {
            break;
        }
    }
}


So far, this is a typical drain loop: it loads frequently needed values into local variables, emits while there are unfulfilled requests, and saves any changes to the tracking variables before (potentially) giving up the emission thread. Unfortunately, there is no VarHandle.addAndGet (unlike on AtomicInteger), thus we have to get the same effect from its dual, getAndAdd: we manually subtract the decrement from the previous value returned by the method to get the "current" after-value of the wip field.
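This wip accounting can be exercised in isolation; the sketch below (an AtomicInteger standing in for the WIP VarHandle, names mine) shows a single uncontended drain() entry performing exactly one pass and leaving wip at zero:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the work-in-progress (wip) accounting of a drain loop.
// getAndAdd returns the *previous* value, so the value after the update
// has to be reconstructed by subtracting manually, as with VarHandles.
public class WipAccounting {
    static final AtomicInteger WIP = new AtomicInteger();

    static int drainPasses;

    static void drain() {
        if (WIP.getAndAdd(1) != 0) {
            return;                     // somebody else is already draining
        }
        int missed = 1;
        for (;;) {
            drainPasses++;              // one pass over the operator state
            // previous - missed == the wip value after this decrement
            missed = WIP.getAndAdd(-missed) - missed;
            if (missed == 0) {
                break;                  // no new work arrived meanwhile
            }
        }
    }

    public static void main(String[] args) {
        drain();                        // uncontended entry: a single pass
        System.out.println(drainPasses + " " + WIP.get()); // 1 0
    }
}
```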

The next step is to implement the while loop. It has two main responsibilities: figuring out there are no more upstream values and acting on the result of the inner Flow.Publisher.


while (e != r) {

    if ((boolean)CANCELLED.getAcquire(this)) {
        innerResult = null;
        Arrays.fill(buf, null);
        return;
    }

    boolean d = (boolean)DONE.getAcquire(this);

    T value = (T)buf[(int)ci & mask];

    boolean empty = value == null;

    if (d && empty) {
        Throwable error = (Throwable)ERROR.getAcquire(this);
        if (error == null) {
            downstream.onComplete();
        } else {
            downstream.onError(error);
        }
        return;
    }

    if (empty) {
        break;
    }

    // TODO implement the rest
}


First, we detect a cancellation and clean up the internal storage of the operator before quitting. Then, we gather the state of the operator. Is it done? Is there a current upstream value to work with?

If the upstream is done and there are no further buffered values, we check for an error and terminate the downstream accordingly. Note that the buffer is not "dequeued" at the consumerIndex (ci) until its associated inner Flow.Publisher has produced its result. If the upstream has simply not produced its next item yet, we quit the while loop.

The next step is to determine what to do with the upstream value. There are four cases to handle:

int state = (int)MAPPER_STATE.getAcquire(this);

if (state == 0) {
    Flow.Publisher<U> publisher;

    try {
        publisher = Objects.requireNonNull(mapper.apply(value));
    } catch (Throwable ex) {
        updateError(ex);
        MAPPER_STATE.setRelease(this, 3);
        continue;
    }

    MapWhenInnerSubscriber<T, U, R> next = new MapWhenInnerSubscriber<>(this);

    MapWhenInnerSubscriber current = (MapWhenInnerSubscriber)INNER.getAcquire(this);

    if (current != INNER_CANCELLED && INNER.compareAndSet(this, current, next)) {

        MAPPER_STATE.setRelease(this, 1);

        publisher.subscribe(next);
    } else {
        return;
    }

} else if (state == 2) {

    // TODO implement

} else if (state == 3) {

    // TODO implement

} else {
    break;
}


State 0 is when there is no ongoing inner observation and, given an upstream value, we can set up that observation. First, we map the upstream value into a Flow.Publisher instance, checking for nullness along the way. If this mapping crashes for some reason, we update the error tracking and jump to state 3, similar to innerError(). Otherwise, we prepare a new MapWhenInnerSubscriber and, unless there has been a concurrent cancellation, atomically store it in the inner field, update the state to 1 and subscribe to the Flow.Publisher with it. Whether or not this inner source is synchronous, the drain loop protects against reentrance.

State 1 indicates an ongoing inner observation in which case there is nothing to do but quit the while loop and wait for another drain() invocation by one of the innerXXX() methods.

State 2 happens when the inner observation produces a result of type U:


if (state == 2) {
    U result = innerResult;
    innerResult = null;

    R output;

    try {
        output = Objects.requireNonNull(combiner.apply(value, result));
    } catch (Throwable ex) {
        updateError(ex);
        MAPPER_STATE.setRelease(this, 3);
        continue;
    }

    downstream.onNext(output);

    e++;

    BUFFER.setRelease(buf, (int)ci & mask, null);
    ci++;

    if (++c == limit) {
        c = 0;
        upstream.request(limit);
    }
    MAPPER_STATE.setRelease(this, 0);
}


We pick up the inner result and apply the combiner function to it to get the output for the downstream. Again, if the mapping crashes, we update the error tracking and move to state 3. Once the emission has happened, we increment the emission counter, release the buffer slot of the original item and increment the consumer index for getting the next upstream value later on. We also increment the number of items consumed from upstream (c) and once it hits the prefetch limit (75% of the buffer size), we reset the count and request more from upstream. Finally, we move back to state 0 to allow handling the next item.
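The replenishment arithmetic is worth seeing on its own. A small sketch (names mine) of the 75% batching described above:

```java
// Sketch of the stable prefetch replenishment used in the drain loop:
// instead of requesting one-by-one, the operator re-requests a batch of
// 75% of the buffer capacity every time that many items were consumed.
public class PrefetchLimit {

    static int limit(int bufferSize) {
        return bufferSize - (bufferSize >> 2); // 75% of the capacity
    }

    // total amount re-requested from upstream after consuming n items
    static long replenished(int bufferSize, int n) {
        int limit = limit(bufferSize);
        int consumed = 0;
        long requested = 0;
        for (int i = 0; i < n; i++) {
            if (++consumed == limit) {
                consumed = 0;        // reset the counter and
                requested += limit;  // replenish in a single batch
            }
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(limit(128));            // 96
        System.out.println(replenished(128, 256)); // 192, i.e. two batches of 96
    }
}
```

Batching the request calls this way keeps roughly a quarter of the buffer in flight as slack while avoiding a request() call per item.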

The last state, 3, handles the case when there was no inner result at all; we have to move on to the next upstream item:


if (state == 3) {
    BUFFER.setRelease(buf, (int)ci & mask, null);
    ci++;

    if (++c == limit) {
        c = 0;
        upstream.request(limit);
    }
    MAPPER_STATE.setRelease(this, 0);
}


This time, there is nothing to combine and we don't have anything of type R to emit, thus what remains is the release of the buffer slot. The emission count is unchanged in this case but we still have to replenish as we have just handled an item from upstream. Finally, we move back to state 0.

The last section deals with the case when a terminal state has been reached but the downstream didn't request anything. The Reactive-Streams specification allows signalling a terminal event without requests and it is often desirable to end a stream eagerly.


if (e == r) {
    if ((boolean)CANCELLED.getAcquire(this)) {
        innerResult = null;
        Arrays.fill(buf, null);
        return;
    }

    if ((boolean)DONE.getAcquire(this) && buf[(int)ci & mask] == null) {
        Throwable error = (Throwable)ERROR.getAcquire(this);
        if (error == null) {
            downstream.onComplete();
        } else {
            downstream.onError(error);
        }
        return;
    }
}


Practically, it is just the front half of the code from the while loop in a more concise fashion: clean up if cancellation happened while there were no requests, or see whether we have run out of upstream items and no more will arrive.

Conclusion

I think we can agree that these types of operators can become quite complicated due to the potential asynchronous execution of their parts. Working out the state management is difficult without some experience, and even then, ensuring proper visibility (and finding bugs due to the lack of it) is an advanced developer responsibility.

Perhaps the main takeaway for mapWhen is that when events from multiple streams have to be managed, the safest bet is to use the queue-drain approach. The other methods should focus on preparing the "queue" or state of the operator, including proper visibility, so that the draining thread can see and act upon that state while also honoring the sequential-call requirements of the specification.

In the next post, we'll see how one can handle the task of merging multiple streams while keeping the order of the resulting sequence according to a Comparator.


Rewriting RxJava with Kotlin Coroutines?


Introduction


Someone influential stated that RxJava should be rewritten with Kotlin Coroutines. I haven't seen any attempt at it as of now, and declaring such a thing (not) worth it without actually trying is irresponsible.

As we saw in the earlier post and the response in the comment section, following up on the imperative-reactive promise leads to some boilerplate and questionable cancellation management, and the idiomatic Kotlin/Coroutine enhancement suggested is to ... factor out the imperative control structures into common routines and have the user specify lambda callback(s); thus it can become declarative-reactive, just like RxJava interpreted from a higher level viewpoint. Kind of defeats one of the premises in my understanding.

This doesn't diminish the power of coroutine-based abstraction but certainly implies a relevant question: who is supposed to write these abstract operators?

One possible answer is, of course, library writers who not only have experience with abstracting away control structures but perhaps wield deeper knowledge about how the coroutine infrastructure can be utilized in certain complicated situations.

If this assumption of mine is true, that somewhat defeats another premise of coroutines: the end user will likely have to stick to writing suspendable functions and discover operators provided by a library most of the time.

So what's mainly left is to see if implementing a declarative-reactive library on top of coroutines gives benefits to the library developer (i.e., ease of writing) over hand crafted state-machines and (reasonable) performance to the user of the library itself.

The library implementation


Perhaps one of the more attractive properties of RxJava is the deferred, lazy execution of a reactive flow (cold). One sets up a template of transformations and issues a subscribe() call to begin execution. In contrast, CompletableFuture and imperative Coroutines can be thought of as eager executions - in order to retry them, one has to recreate the whole chain, plus their execution may be ongoing while one is still busy applying operators on top of them.

Base interfaces


Since the former structure is more enabling at little to no overhead, we'll define our base types as follows:


interface CoFlow<out T> {
    suspend fun subscribe(consumer: CoConsumer<T>)
}


The main interface, CoFlow, matches the usual pattern of the Reactive-Streams Publisher.

interface CoConsumer<in T> {

    suspend fun onSubscribe(connection: CoConnection)

    suspend fun onNext(t: T)

    suspend fun onError(t: Throwable)

    suspend fun onComplete()
}


The consumer type, CoConsumer, is also matching the Reactive-Streams Subscriber pattern.

interface CoConnection {
    suspend fun close()
}


The final type, CoConnection, is responsible for cancelling a flow. Unlike the Reactive-Streams Subscription, there is no request() method because we will follow up on the non-blocking suspension promise of the coroutines: the sender will be suspended if the receiver is not in the position to receive, thus there should be no need for request accounting as the state machine generated by the compiler will implicitly do it for us.

Those with deeper understanding of how cancellation works with coroutines may object to this connection object. Indeed, there are probably better ways of including cancellation support, however, my limited understanding of the coroutine infrastructure didn't yield any apparent concept-match between the two. Suggestions welcome.

Entering the CoFlow world

Perhaps the most basic way of creating a flow of values is the Just(T) operator that, when subscribed to, emits its single item followed by a completion signal. Since we don't have to deal with a backpressure state machine, this should be relatively short to write:


class Just<out T>(private val value: T) : CoFlow<T> {
    override suspend fun subscribe(consumer: CoConsumer<T>) {
        consumer.onSubscribe(???)
        consumer.onNext(value)
        consumer.onComplete()
    }
}

In order to allow the downstream to indicate cancellation, we have to send something along via onSubscribe. Since coroutines appear as synchronous execution, we would have the same synchronous cancellation problem that the Reactive-Streams Subscription (and RxJava before it) solves: inversion of control by sending down something cancellable first, then checking if the consumer has had enough.


class BooleanConnection : CoConnection {

    @Volatile var cancelled : Boolean = false

    override suspend fun close() {
        cancelled = true
    }
}


Which we now can use with Just(T):

class Just<out T>(private val value: T) : CoFlow<T> {
    override suspend fun subscribe(consumer: CoConsumer<T>) {
        val conn = BooleanConnection()
        consumer.onSubscribe(conn)

        if (conn.cancelled) {
            return
        }
        consumer.onNext(value)

        if (conn.cancelled) {
            return
        }
        consumer.onComplete()
    }
}

Since everything is declared suspend, we should have no problem interacting with an operator downstream that suspends execution in case of an immediate backpressure.

Let's see a source that emits multiple items, but for an (expected) twist, we implement an uncommon source: Chars(String), which emits the characters of a string as Ints:


class Chars(private val string: String) : CoFlow<Int> {
    override suspend fun subscribe(consumer: CoConsumer<Int>) {
        val conn = BooleanConnection()
        consumer.onSubscribe(conn)

        for (v in 0 until string.length) {
            if (conn.cancelled) {
                return
            }
            consumer.onNext(string[v].toInt())
        }
        if (conn.cancelled) {
            return
        }
        consumer.onComplete()
    }
}

And lastly for this subsection, we will implement FromIterable(T):


class FromIterable<T>(private val source: Iterable<T>) : CoFlow<T> {
override suspend fun subscribe(consumer: CoConsumer<T>) {
val conn = BooleanConnection()
consumer.onSubscribe(conn)

for (v in source) {
if (conn.cancelled) {
return
}
consumer.onNext(v)
}
if (conn.cancelled) {
return
}
consumer.onComplete()
}
}


So far, these sources look pretty much like how the non-backpressured RxJava 2 Observable is implemented. I'm sure there are more concise ways of expressing them; I have, unfortunately, only limited knowledge about Kotlin's syntax improvements over Java. However, since the blog's audience is, I think, mainly Java programmers, something familiar looking should be "less alien" at this point.

Transformations

What is the most common transformation in the reactive world? Mapping, of course! Therefore, let's see what the instance extension method Map(T -> R) looks like.


suspend fun <T, R> CoFlow<T>.map(mapper: suspend (T) -> R): CoFlow<R> {
    val source = this

    return object: CoFlow<R> {
        override suspend fun subscribe(consumer: CoConsumer<R>) {

            source.subscribe(object: CoConsumer<T> {

                var upstream: CoConnection? = null
                var done: Boolean = false

                override suspend fun onSubscribe(conn: CoConnection) {
                    upstream = conn
                    consumer.onSubscribe(conn)
                }

                override suspend fun onNext(t: T) {
                    val v: R
                    try {
                        v = mapper(t)
                    } catch (ex: Throwable) {
                        done = true
                        upstream!!.close()
                        consumer.onError(ex)
                        return
                    }
                    consumer.onNext(v)
                }

                override suspend fun onError(t: Throwable) {
                    if (!done) {
                        consumer.onError(t)
                    }
                }

                override suspend fun onComplete() {
                    if (!done) {
                        consumer.onComplete()
                    }
                }
            })
        }
    }
}

Perhaps what I most envy of Kotlin is the extension method support. I can only hope for it in Java now that Oracle switches to a six-month feature release cycle. The val source = this may seem odd to a Kotlin developer; maybe there is a syntax so that the outer this can be accessed from the anonymous inner class (object: CoFlow<R>) in some other way. Note also the suspend (T) -> R signature: we will, of course, mainly support suspendable functions.

The logic, again, resembles RxJava's own map() implementation. We save and forward the upstream connection instance to the consumer, as there is no real need to intercept the close call. We apply the upstream's value to the mapper function and forward the result to the consumer. If the mapper function crashes, we stop the upstream and emit the error. This may happen for the very last item while the upstream still emits a regular onComplete(), which should be avoided just like with Reactive-Streams.

The next common operator is Filter(T):


suspend fun <T> CoFlow<T>.filter(predicate: suspend (T) -> Boolean): CoFlow<T> {
    val source = this

    return object: CoFlow<T> {
        override suspend fun subscribe(consumer: CoConsumer<T>) {
            source.subscribe(object: CoConsumer<T> {

                var upstream: CoConnection? = null
                var done: Boolean = false

                override suspend fun onSubscribe(conn: CoConnection) {
                    upstream = conn
                    consumer.onSubscribe(conn)
                }

                override suspend fun onNext(t: T) {
                    val v: Boolean
                    try {
                        v = predicate(t)
                    } catch (ex: Throwable) {
                        done = true
                        upstream!!.close()
                        consumer.onError(ex)
                        return
                    }
                    if (v) {
                        consumer.onNext(t)
                    }
                }

                override suspend fun onError(t: Throwable) {
                    if (!done) {
                        consumer.onError(t)
                    }
                }

                override suspend fun onComplete() {
                    if (!done) {
                        consumer.onComplete()
                    }
                }
            })
        }
    }
}

I guess the pattern is now obvious. Let's see a couple of other operators.

Take

suspend fun <T> CoFlow<T>.take(n: Long): CoFlow<T> {

    // ...

    var remaining = n

    override suspend fun onNext(t: T) {
        var r = remaining
        if (r != 0L) {
            remaining = --r
            consumer.onNext(t)
            if (r == 0L) {
                upstream!!.close()
                consumer.onComplete()
            }
        }
    }

    // ...

    override suspend fun onComplete() {
        if (remaining != 0L) {
            consumer.onComplete()
        }
    }
}

Skip


suspend fun <T> CoFlow<T>.skip(n: Long): CoFlow<T> {

    // ...

    var remaining = n

    override suspend fun onNext(t: T) {
        val r = remaining
        if (r == 0L) {
            consumer.onNext(t)
        } else {
            remaining = r - 1
        }
    }

    // ...
}

Collect


suspend fun <T, R> CoFlow<T>.collect(
        collectionSupplier: suspend () -> R,
        collector: suspend (R, T) -> Unit
): CoFlow<R> {
    val source = this

    return object: CoFlow<R> {

        override suspend fun subscribe(consumer: CoConsumer<R>) {

            val coll : R

            try {
                coll = collectionSupplier()
            } catch (ex: Throwable) {
                consumer.onSubscribe(BooleanConnection())
                consumer.onError(ex)
                return
            }

            source.subscribe(object: CoConsumer<T> {

                var upstream: CoConnection? = null
                var done: Boolean = false
                val collection: R = coll

                override suspend fun onSubscribe(conn: CoConnection) {
                    upstream = conn
                    consumer.onSubscribe(conn)
                }

                override suspend fun onNext(t: T) {
                    try {
                        collector(collection, t)
                    } catch (ex: Throwable) {
                        done = true
                        upstream!!.close()
                        consumer.onError(ex)
                        return
                    }
                }

                override suspend fun onError(t: Throwable) {
                    if (!done) {
                        consumer.onError(t)
                    }
                }

                override suspend fun onComplete() {
                    if (!done) {
                        consumer.onNext(collection)
                        consumer.onComplete()
                    }
                }
            })
        }
    }
}


Sum


suspend fun <T: Number> CoFlow<T>.sumInt(): CoFlow<Int> {

    // ...

    var sum: Int = 0
    var hasValue: Boolean = false

    override suspend fun onNext(t: T) {
        if (!hasValue) {
            hasValue = true
        }
        sum += t.toInt()
    }

    // ...

    override suspend fun onComplete() {
        if (hasValue) {
            consumer.onNext(sum)
        }
        consumer.onComplete()
    }
}

Max


suspend fun <T: Comparable<T>> CoFlow<T>.max(): CoFlow<T> {

    // ...

    var value: T? = null

    override suspend fun onNext(t: T) {
        val v = value
        if (v == null || v < t) {
            value = t
        }
    }

    // ...

    override suspend fun onComplete() {
        val v = value
        if (v != null) {
            consumer.onNext(v)
        }
        consumer.onComplete()
    }
}

Flatten


suspend fun <T, R> CoFlow<T>.flatten(mapper: suspend (T) -> Iterable<R>): CoFlow<R> {

    // ...

    override suspend fun onNext(t: T) {
        try {
            for (v in mapper(t)) {
                consumer.onNext(v)
            }
        } catch (ex: Throwable) {
            done = true
            upstream!!.close()
            consumer.onError(ex)
            return
        }
    }
}

Concat


suspend fun <T> CoFlow<T>.concat(vararg sources: CoFlow<T>): CoFlow<T> {
    return object: CoFlow<T> {
        override suspend fun subscribe(consumer: CoConsumer<T>) {
            val closeToken = SequentialConnection()
            consumer.onSubscribe(closeToken)
            launch(Unconfined) {
                val ch = Channel<Unit>(1)

                for (source in sources) {

                    source.subscribe(object: CoConsumer<T> {
                        override suspend fun onSubscribe(conn: CoConnection) {
                            closeToken.replace(conn)
                        }

                        override suspend fun onNext(t: T) {
                            consumer.onNext(t)
                        }

                        override suspend fun onError(t: Throwable) {
                            consumer.onError(t)
                            ch.close()
                        }

                        override suspend fun onComplete() {
                            ch.send(Unit)
                        }
                    })

                    try {
                        ch.receive()
                    } catch (ex: Throwable) {
                        // the channel was closed due to an error; stop the sequence
                        return@launch
                    }
                }

                consumer.onComplete()
            }
        }
    }
}


Before concat, we did not have to interact with the cancellation mechanism of the coroutine world. Here, if one wants to avoid unbounded recursion due to switching to the next source, some trampolining is necessary. The launch(Unconfined), as I understand it, should do just that. Note that the returned Job is not joined into the CoConnection rail, partly to avoid writing a CompositeCoConnection, partly because I don't know how such a contextual component should generally interact with our CoFlow setup. Suggestions welcome.

As for the use of Channel(1), I encountered two problems:

  • I don't know how to hold off the loop otherwise as suspendCoroutine { } doesn't allow its block to be suspendable and we have subscribe() as suspendable.
  • The plain Channel() is a so-called rendezvous primitive where send() and receive() have to meet. Unfortunately, a synchronously executed CoFlow will livelock because send() suspends - because there is no matching receive() call on the same thread - which would resume receive(). A one element channel solved this.

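The rendezvous-vs-buffered distinction has a direct analog in java.util.concurrent (this is my Java illustration, not coroutine code): a non-blocking hand-off on a SynchronousQueue fails unless a consumer is already waiting, while a capacity-1 queue accepts the item and lets the producer move on, just like the Channel(1) above:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.SynchronousQueue;

// Java analog of the rendezvous-vs-buffered channel issue: on a single
// thread there is never a waiting taker, so a rendezvous hand-off cannot
// succeed, while a one-element buffer decouples producer and consumer.
public class ChannelAnalogy {
    public static void main(String[] args) {
        SynchronousQueue<String> rendezvous = new SynchronousQueue<>();
        ArrayBlockingQueue<String> oneSlot = new ArrayBlockingQueue<>(1);

        // no taker is waiting on the current (single) thread:
        System.out.println(rendezvous.offer("done")); // false: nobody to meet
        System.out.println(oneSlot.offer("done"));    // true: buffered for later
    }
}
```

A suspending send() on a rendezvous channel behaves like the blocking put() here, which is exactly the single-threaded livelock described above.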

The (simpler) SequentialConnection is implemented as follows:


class SequentialConnection : AtomicReference<CoConnection?>(), CoConnection {

    object Disconnected : CoConnection {
        override suspend fun close() {
        }
    }

    suspend fun replace(conn: CoConnection?) : Boolean {
        while (true) {
            val a = get()
            if (a == Disconnected) {
                conn?.close()
                return false
            }
            if (compareAndSet(a, conn)) {
                return true
            }
        }
    }

    override suspend fun close() {
        getAndSet(Disconnected)?.close()
    }
}

It uses the same atomics logic as the SequentialDisposable in RxJava.
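For reference, that atomics logic in plain Java looks like this (a sketch with my own names, not RxJava's actual SequentialDisposable source): a shared slot holds the "current" resource, and once the slot is closed, any later replacement is closed immediately instead of being stored.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the SequentialDisposable-style logic: CAS-replace the current
// resource; a terminal CLOSED sentinel makes close() idempotent and causes
// late replace() calls to close the incoming resource instead.
public class SequentialResource {
    interface Resource { void close(); }

    static final Resource CLOSED = () -> { };

    final AtomicReference<Resource> current = new AtomicReference<>();

    boolean replace(Resource next) {
        for (;;) {
            Resource a = current.get();
            if (a == CLOSED) {
                if (next != null) {
                    next.close();     // terminal state: close the newcomer
                }
                return false;
            }
            if (current.compareAndSet(a, next)) {
                return true;
            }
        }
    }

    void close() {
        Resource a = current.getAndSet(CLOSED);
        if (a != null) {
            a.close();                // close whatever was current, once
        }
    }
}
```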

Leaving the reactive world

Eventually, we'd like to return to the plain coroutine world and resume our imperative code section after a CoFlow has run. One case is to actually ignore any emission and just wait for the CoFlow to terminate. Let's write an await() operator for that:


suspend fun <T> CoFlow<T>.await() {
    val source = this

    val ch = Channel<T>(1)

    source.subscribe(object : CoConsumer<T> {
        var upstream : CoConnection? = null

        override suspend fun onSubscribe(conn: CoConnection) {
            upstream = conn
        }

        override suspend fun onNext(t: T) {
        }

        override suspend fun onError(t: Throwable) {
            ch.close(t)
        }

        override suspend fun onComplete() {
            ch.close()
        }
    })

    try {
        ch.receive()
    } catch (ex: ClosedReceiveChannelException) {
        // expected closing
    }
}

The same Channel(1) trick is used here. Again, I don't know how to attach the CoConnection to the caller's context.

Sometimes, we are interested in the first or last item generated through the CoFlow. Let's see how to get to the first item via an awaitFirst():


suspend fun <T> CoFlow<T>.awaitFirst() : T {
    val source = this

    val ch = Channel<T>(1)

    source.subscribe(object : CoConsumer<T> {
        var upstream : CoConnection? = null
        var done : Boolean = false

        override suspend fun onSubscribe(conn: CoConnection) {
            upstream = conn
        }

        override suspend fun onNext(t: T) {
            done = true
            upstream!!.close()
            ch.send(t)
        }

        override suspend fun onError(t: Throwable) {
            if (!done) {
                ch.close(t)
            }
        }

        override suspend fun onComplete() {
            if (!done) {
                ch.close(NoSuchElementException())
            }
        }
    })

    return ch.receive()
}


The benchmark


Since benchmarking concurrent performance would be somewhat unfair at this point, the next best benchmark I can think of is our standard Shakespeare Plays Scrabble. It can show the infrastructure overhead of a solution without any explicitly stated concurrency need in the solution itself.

Rather than showing the somewhat long Kotlin source code adapted for CoFlow, you can find the benchmark code in my repository. The environment: i7 4770K, Windows 7 x64, Java 8u144, Kotlin 1.1.4-3, Coroutines 0.18, RxJava 2.1.3 for comparison:

    RxJava Flowable: 26 milliseconds / op
    CoroutinesCoFlow: 52.4 milliseconds / op

Not bad for a first try with limited knowledge. I can only speculate about the source of the 2x slower CoFlow implementation: Channel. I suspect it is meant to support multiple senders and multiple receivers, thus the internal queue is involved in way more atomic operations than necessary for our single-producer-single-consumer CoFlow/Reactive-Streams architecture.

Conclusion


As demonstrated, it is possible to rewrite (a set of) RxJava operators with coroutines and depending on the use case, even this (unoptimized) 2x overhead could be acceptable. Does this mean the rest of the 180 operators can be (reasonably) well translated?

I don't know yet; flatMap(), groupBy() and window() are the most notoriously difficult operators due to the increased concurrency and backpressure interaction:


  • flatMap has to manage a dynamic set of sources which each have to be backpressured. Should each of them use the same Channel.send() or go round robin in some way?
  • groupBy is prone to livelock if the groups as whole and individually are not consumed.
  • window has a peculiar operation mode (true for groupBy as well): if one takes only one window, the upstream should not be cancelled until the items aimed at that window have been emitted by the upstream or the consumption of the window is cancelled.

Can RxJava be ported to Kotlin Coroutines: yes. Should the next RxJava rather be written in Kotlin Coroutines: I don't think so. The reasons I'm still not for "Coroutines everywhere" despite all the code shown in this post are:

  • I had to do this porting myself, which hardly constitutes an unbiased and independent verification.
  • The coroutine concept is great, but tied to Kotlin as a compiler and its standard library. What should happen with the non-Kotlin, non-Android reactive users? What about other JVM languages?
  • Building the state machine is hidden from the developer by the compiler. There is always the risk that the compiler doesn't do a reasonable optimization job and/or introduces certain bugs you can't easily work around from the user level. How often is the Kotlin language/standard library updated to fix issues? How is that SAM issue doing?

Solving problems developers face is great; hyping about "burying Reactive programming as obsolete" without supporting evidence is not.
