Pattern Matching DataShapes

DataShape includes type variables, as symbols beginning with a capital letter. For example A * int32 represents a one-dimensional array of int32, where the size or type of the dimension is unspecified. Similarly, 3 * A represents a size 3 one-dimensional array where the data type is unspecified.

The main usage of pattern matching in the DataShape system is for specifying function signatures. To provide a little bit of motivation, let’s examine what happens in NumPy ufuncs, and see how we can represent their behaviors via DataShape types.

NumPy ldexp UFunc Example

We’re going to use the ldexp ufunc, which is for the C/C++ function with overloads double ldexp(double x, int exp) and float ldexp(float x, int exp), computing x * 2^exp by tweaking the exponent in the floating point format. (We’re ignoring the long double for now.)

These C++ functions can be represented with the DataShape function signatures:

(float32, int32) -> float32
(float64, int32) -> float64

As a NumPy ufunc, there is an behavior for arrays, where the source arrays are “broadcasted” together, and the function is computed elementwise.

In the simplest case, given two arrays which match, the result is an array of the same size. When one array has size one in a dimension, it gets repeated to match the size of the other dimension. When one array has fewer dimensions, it gets repeated to fill in the outer dimensions. The “broadcast” array shape is the result of combining all these repetitions, and is the shape of the output. Represented as DataShape function signatures, some examples are:

(12 * float32, 12 * int32) -> 12 * float32
(10 * float64, 1 * int32) -> 10 * float64
(float32, 3 * 4 * int32) -> 3 * 4 * float32
(3 * float64, 4 * 1 * int64) -> 4 * 3 * float64

Ellipsis for Broadcasting

To represent the general broadcasting behavior, DataShape provides ellipsis type variables.:

(A... * float32, A... * int32) -> A... * float32
(A... * float64, A... * int64) -> A... * float64

Coercions/Broadcasting as a System of Equations

Let’s say as input we get two arrays with datashapes 3 * 4 * float64 and int32. We can express this as two systems of coercion equations as follows (using ==> as a “coerces to” operator):

# float32 prototype
3 * 4 * float64 ==> A... * float32
int32 ==> A... * int32

# float64 prototype
3 * 4 * float64 ==> A... * float64
int32 ==> A... * int32

To solve these equations, we evaluate the legality of each coercion, and accumulate the set of values the A... type variable must take.:

# float32 prototype
float64 ==> float32 # ILLEGAL
3 * 4 * ==> A... *  # "3 * 4 *" in A...
int32 ==> int32     # LEGAL
* ==> A...          # "*" in A...

# float64 prototype
float64 ==> float64 # LEGAL
3 * 4 * ==> A... *  # "3 * 4 *" in A...
int32 ==> int32     # LEGAL
* ==> A...          # "*" in A...

The float32 prototype can be discarded because it requires an illegal coercion. In the float64 prototype, we collect the set of all A... values {“3 * 4 *”, “*”}, broadcast them together to get “3 * 4 *”, and substitute this in the output. Doing all the substitutions in the full prototype produces:

(3 * 4 * float64, int32) -> 3 * 4 * float64

as the matched function prototype that results.

Disallowing Coercion

In the particular function we picked, ideally we don’t want to allow implicit coercion of the type, because the nature of the function is to “load the exponent” in particular formats of floating point number. Saying ldexp(True, 3), and having it work is kind of weird.

One way to tackle this would be to add an exact type, both as a dimension and a data type, which indicates that broadcasting should be disallowed. For the discussion, in addition to ldexp, lets introduce a vector magnitude function mag, where we want to disallow scalar arrays to broadcast into it.:

# ldexp signatures
(A... * exact[float32], A... * int32) -> A... * float32
(A... * exact[float64], A... * int64) -> A... * float64

# mag signatures
(A... * exact[2] * float32) -> A... * float32
(A... * exact[3] * float32) -> A... * float32

# ufunc but disallowing broadcasting
(exact[A...] * int32, exact[A...] * int32) -> A... * int32

A possible syntactic sugar (which I’m not attached to, I think this needs some exploration) for this is:

# ldexp signatures
(A... * float32=, A... * int32) -> A... * float32
(A... * float64=, A... * int64) -> A... * float64

# mag signatures
(A... * 2= * float32) -> A... * float32
(A... * 3= * float32) -> A... * float32

# ufunc but disallowing broadcasting
(A=.. * int32, A=.. * int32) -> A... * int32

Factoring a Set of Signatures

One of the main things the multiple dispatch in DataShape has to do is match input arrays against a set of signatures very efficiently. We need to be able to hide the abstraction we’re creating, and provide performance competitive with, but ideally superior to, what NumPy provides in its ufunc system.

Factoring the set of signatures into two or more stages which are simpler to solve and can prune the possibilities more quickly is one way to do this abstraction hiding. Let’s use the add function for our example, with the following subset of signatures. We’ve included the datetime signatures to dispel any notion that the signatures will always match precisely.:

# add signatures
(A... * int32, A... * int32) -> A... * int32
(A... * int64, A... * int64) -> A... * int64
(A... * float32, A... * float32) -> A... * float32
(A... * float64, A... * float64) -> A... * float64
(A... * timedelta, A... * timedelta) -> A... * timedelta
(A... * datetime, A... * timedelta) -> A... * datetime
(A... * timedelta, A... * datetime) -> A... * datetime

Because the broadcasting of all these cases is identical, we can transform this set of signatures into two stages as follows:

# broadcasting stage
(A... * X, A... * Y) -> A... * Z

# data type stage matched against (X, Y)
(int32, int32) -> int32
(int64, int64) -> int64
(float32, float32) -> float32
(float64, float64) -> float64
(timedelta, timedelta) -> timedelta
(datetime, timedelta) -> datetime
(timedelta, datetime) -> datetime

Let’s work through this example to illustrate how it works.:

# Stage 1: Input arrays "3 * 1 * int32", "4 * float32"
#    (A... * X, A... * Y) -> A... * Z
int32 ==> X       # "int32" in X
3 * 1 * ==> A...  # "3 * 1 *" in A...
float32 ==> Y     # "float32" in Y
4 * ==> A...      # "4 *" in A...

# Solution: A... is "3 * 4 *", X is "int32", and Y is "float32"
# Stage 2: Input arrays "int32" and "float32"
#    (int32, int32) -> int32
int32 ==> int32     # LEGAL
float32 ==> int32   # ILLEGAL
#    (float32, float32) -> float32
int32 ==> float32   # LEGAL
float32 ==> float32 # LEGAL
# etc.

# Assume we picked (float32, float32) -> float32
# so the variables are:
# X is "float32"
# Y is "float32"
# Z is "float32"
# giving the solution substituted into stage 1:
(3 * 1 * float32, 4 * float32) -> 3 * 4 * float32