Motivation
You can kinda think of stepping as “React for backends”.
In ye olde JQuery days, you would:
- Render the page.
- Set up a load of listeners that twiddled with the elements on the rest of the page.
This often turned into a spaghetti mess, as a change of input might need to update many other components and you had competing callbacks happening asynchronously.
React (in theory at least) solved this by allowing you to do:
- Render the page as f(state).
- Mutate the state from event handlers, e.g. <input onInput=mutateState(...) >.
- Re-render the page by recomputing f(state).
Deciding which bits of the page to twiddle is taken care of by React.
Using stepping involves a similar shift, but on the backend, in this case from twiddling with cached data in the database to declaratively describing outputs = f(inputs) and letting stepping handle efficient updates.
Why?
The Python backend you’re currently building probably has a really simple “interview-question” version along the lines of:
test-data/
    external-api-call-2012-01-01.json
    user-input-2012-01-02.json
    ...
def process_data(
    inputs: list[Input],
    t: int
) -> Output:
    output = f(inputs[:t])
    return output
Where t is time and the Output is the state of the system considering all the inputs up to and including that time.
There are many reasons why your production system has more complexity. Often, computing process_data(...) at request-time would be prohibitively expensive, and if you squint, a lot of backend code exists to surmount this problem by writing to various caches.
stepping’s aim is to let you write your production system as something closer to process_data(...): you describe a rich, declarative function of all your inputs, feed it changes, and it tells you what changed in the output.
There are some example applications here.
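As a rough illustration of "feed it changes, and it tells you what changed", the sketch below compares a from-scratch process_data with a hand-written incremental version for one specific f: a running total per user. This is not stepping's API. The names Input, IncrementalTotals and feed are hypothetical, chosen only for this example, and a real system would derive the incremental version from the declarative one rather than have you write both.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Input:
    # Hypothetical input shape, for illustration only.
    user: str
    amount: int


Output = dict[str, int]


def f(inputs: list[Input]) -> Output:
    """Declarative version: total amount per user, recomputed from scratch."""
    totals: dict[str, int] = defaultdict(int)
    for i in inputs:
        totals[i.user] += i.amount
    return dict(totals)


def process_data(inputs: list[Input], t: int) -> Output:
    # The cost of this call grows with t, every time it is called.
    return f(inputs[:t])


class IncrementalTotals:
    """Hand-rolled incremental version of f: feed one change, get back what changed."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = defaultdict(int)

    def feed(self, change: Input) -> Output:
        self.totals[change.user] += change.amount
        # Only the key touched by this change is reported.
        return {change.user: self.totals[change.user]}


inputs = [Input("ann", 5), Input("bob", 2), Input("ann", 1)]

inc = IncrementalTotals()
deltas = [inc.feed(i) for i in inputs]

# The incrementally maintained state agrees with a full recompute.
assert dict(inc.totals) == process_data(inputs, len(inputs))
print(deltas)
```

Each call to feed does a constant amount of work, whereas process_data re-reads every input up to t.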
Incremental View Maintenance
In most SQL dbs there are two ways of declaratively describing outputs = f(inputs), each with different pros and cons:
- VIEWs: the output is always up to date, but can be slow to SELECT from as the data has to be recomputed each query.
- MATERIALIZED VIEWs: SELECT is quick because the data has been precomputed, but the data is only as fresh as the most recent REFRESH MATERIALIZED VIEW (which might itself be an expensive operation).
Incremental View Maintenance is an attempt to have one’s cake and eat it - fresh data, quickly.
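The trade-off can be sketched with Python's built-in sqlite3 module. SQLite has no MATERIALIZED VIEW, so in this example a plain table kept up to date by a trigger stands in for an incrementally maintained view. The table, view and trigger names are invented for the example, and this is a hand-written illustration of the idea rather than how stepping or any particular database implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT NOT NULL, amount INTEGER NOT NULL);

    -- A VIEW: always fresh, but the aggregate is recomputed on every SELECT.
    CREATE VIEW totals_view AS
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

    -- A table standing in for an incrementally maintained view.
    CREATE TABLE totals_ivm (customer TEXT PRIMARY KEY, total INTEGER NOT NULL);

    -- On each insert, apply just the delta instead of recomputing everything.
    CREATE TRIGGER orders_insert AFTER INSERT ON orders
    BEGIN
        INSERT OR IGNORE INTO totals_ivm (customer, total) VALUES (NEW.customer, 0);
        UPDATE totals_ivm SET total = total + NEW.amount WHERE customer = NEW.customer;
    END;
    """
)

conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ann", 5), ("bob", 2), ("ann", 1)],
)

from_view = conn.execute(
    "SELECT customer, total FROM totals_view ORDER BY customer"
).fetchall()
from_ivm = conn.execute(
    "SELECT customer, total FROM totals_ivm ORDER BY customer"
).fetchall()

# Same answer either way; the second query reads precomputed rows.
assert from_view == from_ivm
```

This sketch only handles INSERT. Supporting UPDATE and DELETE would need further triggers, and JOINs or nested aggregates get fiddly quickly, which is the bookkeeping Incremental View Maintenance tools aim to take off your hands.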
Existing Incremental View Maintenance software
There are numerous existing pieces of Incremental View Maintenance software; Jamie Brandon has written a nice taxonomy of them.
Then why write stepping?
The niche stepping tries to sit in is:
- Less focus on big-data pipelines, more focus on application development.
- Allows describing the computation in Python not SQL.
- Can sit next to existing applications, potentially sharing Postgres databases/transactions.
- Provides an educational example of DBSP: about 3000 lines of pure Python at time of writing.
What about Event Sourcing?
Event Sourcing has many meanings depending on who you speak to. For example, there is the classic Martin Fowler definition and Martin Kleppmann’s influential talk.
In practice, these systems often amount to many services broadcasting changes to each other over message buses. This can lead to some problems:
- The developer ergonomics can be bad.
- Replaying messages on changes to the code (and the downstream ramifications) is often an afterthought. (stepping will in future try to tackle this in an opinionated manner with stepping manager.)
- No TRANSACTIONs.
- Often no easy way to express JOINs/GROUP BYs.
- Shifting all the messages around over the wire often incurs significant performance cost (see below).
Should I use stepping?
Probably not, at least right now:
- For most applications, storing all the data in normalised form with suitable indexes, then computing everything at request-time in a single thread will outperform stepping (or any other Event Sourcing approach for that matter). Profile your code!
- stepping is currently very much in Alpha form.