kentnek tech blog

Deferred Computation, Placeholders and Proxies (Part 1)

During my time at Finsify, we had to crawl a large number of websites. Although each website was different, the core operations were the same: click on links, fill text fields, parse data from tables, and so on. These operations run in series, where one step may depend on the previous step’s results, much like a pipeline.

This inspired me to write Flowstrike, a pipeline scripting extension with a focus on web crawling:

flow([
    goto("http://example.com"), click("#some-link"),
    parseElement("#account-link"),
    
    // Match the href of the link parsed above with a regex 
    // to extract some data, then fill some text field
    fill("#text-box", _.href.match(/SomeRegex/)[1]),
    click("#some-button"),

    // this returns a 2D array for all <td> elements of the table
    parseTable("#data-table"),
    ArrayMap( // For each row, creates a new JSON object
        { 
            // 'id' is the row's first column, and so on
            id: _[0],           
            name: _[1].trim(), 
            url: _[2].href
        }         
    ),
    // ...
]);

The underscore _ [1] is an interesting feature of Flowstrike called a placeholder. Placeholders allow steps to manipulate data returned by previous steps.

For example, here _ refers to the link returned by parseElement:

parseElement("#account-link"),
fill("#account-query", _.href.match(/SomeRegex/)[1])

Meanwhile, in this step, _ refers to an array of <td> elements in a table row:

parseTable("#data-table"),
ArrayMap({ 
    id: _[0],           
    name: _[1].trim(),
    url: _[2].href
})

The value of _ is unknown until the preceding step has completed. How can _ be implemented?

Deferred Computation

It’s possible to defer operations on the placeholder _ by wrapping them in a closure:

// a step's input is the previous step's output
fill("#account-query", input => input.href.match(/SomeRegex/)[1])
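To make the deferral concrete, here is a minimal sketch of how a runner might hand each step the previous step's output. The names runSteps, parseElementStub and fillStub are hypothetical stand-ins of mine, not Flowstrike's actual API, and no browser is involved:

```javascript
// Hypothetical step builders: each returns a function (input) => output.
// parseElementStub pretends to have parsed a link from the page.
const parseElementStub = () => () => ({ href: "/accounts/12345" });

// fillStub accepts either a plain value or a closure to evaluate later.
const fillStub = (selector, value) => input => {
    const text = (typeof value === "function") ? value(input) : value;
    return { selector, text };
};

// Runs the steps in series, passing each step's output to the next one.
function runSteps(steps, initial) {
    return steps.reduce((input, step) => step(input), initial);
}

const result = runSteps([
    parseElementStub(),
    // The arrow function is not called here, only when the step runs.
    fillStub("#account-query", input => input.href.match(/(\d+)$/)[1])
]);

console.log(result); // { selector: '#account-query', text: '12345' }
```

The closure is only invoked once the runner reaches that step, by which time the parsed link exists.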

But imagine doing that for more complicated statements:

ArrayMap({ 
    id: input => input[0],           
    name: input => input[1].trim(),
    url: input => input[2].href
})

With a typical crawling script of ~200 steps, writing all those arrows => is a pain in the ass.

So, we need a better approach to delay computation of expressions like _.some.random.property.

Approach 1: Store the history of property access

Instead of evaluating right away, we just store the necessary information until the placeholder is evaluated with real data. Naturally, we can keep a chain of properties being accessed:

_         ->  []
_.a       ->  ['a']
_.a.b     ->  ['a', 'b']
_.a.b.c   ->  ['a', 'b', 'c']
_.a.x[1]  ->  ['a', 'x', '1']

This leads to a simple resolve mechanism:

function resolve(chain, input) {
    return chain.reduce(
        (current, key) => current[key],
        input
    );
} // resolve(['a', 'b', 'c'], data) returns data.a.b.c
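As a quick sanity check, here is the same reduce-based resolve run against some made-up sample data (the data object is my own assumption, purely for illustration):

```javascript
// Same reduce-based resolve as above.
function resolve(chain, input) {
    return chain.reduce((current, key) => current[key], input);
}

// Made-up sample data.
const data = { a: { b: { c: "deep" }, x: ["zero", "one"] } };

console.log(resolve([], data) === data);     // true
console.log(resolve(["a", "b", "c"], data)); // deep
console.log(resolve(["a", "x", "1"], data)); // one
```

An empty chain returns the input untouched, and the string key '1' indexes into the array just like the number 1 would.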

However, we want to replace chain with a placeholder, so we can do stuff like resolve(_.a.b.c, data).

Before diving into the implementation, let’s decide on a concrete definition of placeholders so later we know what we are doing.

Placeholder’s definition

Calling createPlaceholder generates a new placeholder.

const _ = createPlaceholder();

Accessing any String-keyed property of a placeholder creates a new placeholder:

_  -->  _.a  -->  _.a.b  -->  _.a.b.c  and so on

We also require placeholders to contain a method that “evaluates” itself with some input. A minimal version of createPlaceholder can be:

function createPlaceholder() {
    return {
        evaluate: input => input // no-op
    };
}

We then redefine our resolve method:

function resolve(placeholder, data) {
    return placeholder.evaluate(data);
}
// resolve(_.a.b.c, data) should return data.a.b.c

Although unlikely, data might contain a property named evaluate, which would clash with the placeholder’s method. Rather than a String key, we will use a Symbol to store the placeholder’s evaluate method.

const $ = Symbol("evaluate");

function resolve(placeholder, data) {
    return placeholder[$](data);
}
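To illustrate why the Symbol avoids the clash, here is a small sketch with made-up data that happens to contain its own evaluate field. The holder object is hypothetical and is not a real placeholder yet:

```javascript
const $ = Symbol("evaluate");

// Made-up crawled data that happens to have its own "evaluate" field.
const data = { evaluate: "a string from the page" };

// An object can carry both keys without one shadowing the other.
const holder = {
    evaluate: "string-keyed property",
    [$]: input => input
};

console.log(holder.evaluate);          // string-keyed property
console.log(holder[$](data) === data); // true
console.log(Object.keys(holder));      // [ 'evaluate' ]
```

The Symbol-keyed method lives in a separate key space, so the String key evaluate stays free to mean "the evaluate property of the data".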

Last but not least, for consistency’s sake, resolve(_, data) will just trivially return the target data itself.

To make it easier to follow this post, I have created a JSFiddle playground including several test cases:

// 'assert' calls 'resolve(placeholder, data)'
// and checks if the result is correct.
assert(_, data); // level 0, trivial base case
assert(_.a      ,  "level 1");
assert(_.b.x    ,  "level 2");
assert(_.b.y.u  ,  "level 3");
assert(_.b.z[1] ,  "element 1");

Now we’re ready to tackle the magic function createPlaceholder.

The first attempt

Based on how we resolved a property chain with reduce earlier, this is our first version of createPlaceholder:

function createPlaceholder(chain = []) {
    return {
        [$]: input => chain.reduce(
            (current, key) => current[key],
            input
        )
    };
} // Note: square brackets are needed to use a Symbol as a property key

With this, our first (trivial) test passes: assert(_, data).

However, in the second test, _.a is undefined, so resolve tries to read _.a[$] and throws a TypeError:

TypeError: Cannot read property 'Symbol(evaluate)' of undefined
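The failure is easy to reproduce outside the playground. A minimal sketch, using the plain-object version of createPlaceholder from above and a one-property data object that I made up:

```javascript
const $ = Symbol("evaluate");

// First attempt from above: a plain object with only the $ method.
function createPlaceholder(chain = []) {
    return {
        [$]: input => chain.reduce((current, key) => current[key], input)
    };
}

const resolve = (placeholder, data) => placeholder[$](data);

const _ = createPlaceholder();
const data = { a: "level 1" }; // made-up sample data

console.log(resolve(_, data) === data); // true
console.log(_.a);                       // undefined

let caught = null;
try {
    resolve(_.a, data); // reading [$] on undefined throws
} catch (err) {
    caught = err;
}
console.log(caught instanceof TypeError); // true
```

The exact wording of the error message varies between JavaScript engines and versions, but it is always a TypeError.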

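The correct behavior should have been for every property access to return a new placeholder that remembers the extended chain: _.a should be a placeholder for ['a'], _.a.b a placeholder for ['a', 'b'], and so on.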

Therefore, the placeholder must somehow detect when its properties are being accessed and return something sensible instead of letting JavaScript read a non-existent property.

Luckily, ES6 JavaScript introduces the concept of Proxy, which allows us to hack an object and capture its property access events.

Proxies 101

Proxies are an interesting new feature in JavaScript ES6 that unfortunately receive much less attention than the well-celebrated let/const keywords or the arrow function syntax. Along with Symbol and Reflect, proxies bring various long-awaited meta-programming capabilities to JavaScript.

In a nutshell, proxies open up a legitimate way to intercept and customize operations performed on objects. In our case, we want to hack an object’s property access operation. Let’s have a look at the textbook example:

const source = { x: 1 };

const handler = {
    get(target, key) {
        if (key in target) {
            console.log("Accessing existing property " + key);
            return target[key];
        } else {
            console.log("Accessing non-existing property " + key);
            return 42;
        }
    }    
};

const proxy = new Proxy(source, handler);

Here, we wrap our source object with a Proxy, which defines a trap for every property access via handler.get: print the name of the property being accessed, then return the corresponding value if the property exists, otherwise return the number 42.

console.log(source.x);          // 1
console.log(source.something);  // undefined

console.log(proxy.x);
// Accessing existing property x
// 1

console.log(proxy.something);
// Accessing non-existing property something
// 42

So even when the original source does not contain the property something, proxy.something still returns 42, plus it tells us something has been accessed. Sweet.
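One detail of the get trap matters for the chain notation used earlier: the key always arrives as a String (or a Symbol), even when we index with a number. That is why _.a.x[1] was written as ['a', 'x', '1']. A small sketch, where recorder and seen are names I made up:

```javascript
// Records every key the get trap receives, then forwards the access.
const seen = [];
const recorder = new Proxy({ list: ["zero", "one"] }, {
    get(target, key) {
        seen.push(key);
        return target[key];
    }
});

recorder.list;    // dot access
recorder["list"]; // bracket access with a string
recorder[1];      // bracket access with a number

console.log(seen);           // [ 'list', 'list', '1' ]
console.log(typeof seen[2]); // string
```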

Proxy is a very powerful concept. If you’re interested in learning more, ES6 Proxies in Depth is a pretty good place to start.

The first attempt (proxified)

Let’s rewrite our previous code with Proxy:

function createPlaceholder(chain = []) {
    return new Proxy({
        [$]: input => chain.reduce(
            (current, key) => current[key],
            input
        )
    }, {
        get(target, key) {
            // placeholder[$] simply returns the $ function
            // instead of creating another placeholder
            return (key === $) ? target[$] : createPlaceholder([...chain, key]);
        }
    });
}

We’ve wrapped the placeholder with Proxy, enhancing it with the ability to access whatever property we want. The real magic lies in the get(target, key) handler, which recursively calls createPlaceholder with the augmented chain.

And with that, voilà, all of our five tests pass with flying colors!
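For readers who want to run this outside the playground, here is a self-contained sketch of the whole thing. The sample data and the check helper are my assumptions, shaped to match the five test cases; the playground's own assert and data may look different:

```javascript
const $ = Symbol("evaluate");

function createPlaceholder(chain = []) {
    return new Proxy({
        [$]: input => chain.reduce((current, key) => current[key], input)
    }, {
        get(target, key) {
            return (key === $) ? target[$] : createPlaceholder([...chain, key]);
        }
    });
}

const resolve = (placeholder, data) => placeholder[$](data);

// Assumed sample data, shaped to match the five test cases.
const data = {
    a: "level 1",
    b: { x: "level 2", y: { u: "level 3" }, z: ["element 0", "element 1"] }
};
const _ = createPlaceholder();

// Hypothetical stand-in for the playground's assert helper.
function check(placeholder, expected) {
    const actual = resolve(placeholder, data);
    console.log(actual === expected ? "pass" : "FAIL", expected);
    return actual === expected;
}

const results = [
    check(_, data),
    check(_.a, "level 1"),
    check(_.b.x, "level 2"),
    check(_.b.y.u, "level 3"),
    check(_.b.z[1], "element 1")
];
```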


Storing property chains is exactly how placeholders were originally implemented in Flowstrike. However, to support more complicated expressions with function calls like _.a.someString.trim().split(',')[1], we need an even better approach.
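To see where the property-chain approach stops working, here is a short sketch of what happens when we try to call a method on a placeholder. It reuses the proxified createPlaceholder from above:

```javascript
const $ = Symbol("evaluate");

function createPlaceholder(chain = []) {
    return new Proxy({
        [$]: input => chain.reduce((current, key) => current[key], input)
    }, {
        get(target, key) {
            return (key === $) ? target[$] : createPlaceholder([...chain, key]);
        }
    });
}

const _ = createPlaceholder();

// Property access alone still works: this is a placeholder
// for the chain ['a', 'trim'].
const trimRef = _.a.trim;
console.log(typeof trimRef); // object

let caught = null;
try {
    _.a.trim(); // the proxy wraps a plain object, so it cannot be called
} catch (err) {
    caught = err;
}
console.log(caught instanceof TypeError); // true
```

A chain of keys can describe where a function lives, but it has no way to record that the function should be called, or with which arguments.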

In the next blog post, I’ll introduce a bottom-up approach that relies more on the idea of recursion.

  1. I used the underscore _ symbol for placeholders as an allusion to the Python REPL, in which _ is the variable holding the result of the last executed statement.