Datashape Overview

Datashape is a data layout language for array programming. It is designed to describe in-situ structured data without requiring transformation into a canonical form.

Similar to NumPy, datashape includes shape and dtype, but combined together in the type system.

Units

Single named types in datashape are called unit types. They represent either a dtype like int32 or datetime, or a single dimension like var. Dimensions and a single dtype are composed together in a datashape type.

Primitive Types

DataShape includes a variety of dtypes corresponding to C/C++ types, similar to NumPy.

Bit type Description
bool Boolean (True or False) stored as a byte
int8 Byte (-128 to 127)
int16 Two’s Complement Integer (-32768 to 32767)
int32 Two’s Complement Integer (-2147483648 to 2147483647)
int64 Two’s Complement Integer (-9223372036854775808 to 9223372036854775807)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4294967295)
uint64 Unsigned integer (0 to 18446744073709551615)
float16 Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex[float32] Complex number, represented by two 32-bit floats (real and imaginary components)
complex[float64] Complex number, represented by two 64-bit floats (real and imaginary components)

Additionally, there are types which are not fully specified at the bit/byte level.

Bit type Description
string Variable length Unicode string.
bytes Variable length array of bytes.
json Variable length Unicode string which contains JSON.
date Date in the proleptic Gregorian calendar.
time Time not attached to a date.
datetime Point in time, combination of date and time.
units Associates physical units with numerical values.

Many python types can be mapped to datashape types:

Python type Datashape
int int32
bool bool
float float64
complex complex[float64]
str string
unicode string
datetime.date date
datetime.time time
datetime.datetime datetime or datetime[tz=’<timezone>’]
datetime.timedelta units[‘microsecond’, int64]
bytes bytes
bytearray bytes
buffer bytes

String Types

To Blaze, all strings are sequences of unicode code points, following in the footsteps of Python 3. The default Blaze string atom, simply called “string”, is a variable-length string which can contain any unicode values. There is also a fixed-size variant compatible with NumPy’s strings, like string[16, "ascii"].

Dimensions

An asterisk (*) between two types signifies an array. A datashape consists of 0 or more dimensions followed by a dtype.

For example, an integer array of size three is:

3 * int

In this type, 3 is is a fixed dimension, which means it is a dimension whose size is always as given. Other dimension types include var.

Comparing with NumPy, the array created by np.empty((2, 3), 'int32') has datashape 2 * 3 * int32.

Records

Record types are ordered struct dtypes which hold a collection of types keyed by labels. Records look similar to Python dictionaries but the order the names appear is important.

Example 1:

{
    name   : string,
    age    : int,
    height : int,
    weight : int
}

Example 2:

{
    r: int8,
    g: int8,
    b: int8,
    a: int8
}

Records are themselves types declaration so they can be nested, but cannot be self-referential:

Example 2:

{
    a: { x: int, y: int },
    b: { x: int, z: int }
}

Datashape Traits

While datashape is a very general type system, there are a number of patterns a datashape might fit in.

Tabular datashapes have just one dimension, typically fixed or var, followed by a record containing only simple types, not nested records. This can be intuitively thought of as data which will fit in a SQL table.:

var * { x : int, y : real, z : date }

Homogenous datashapes are arrays that have a simple dtype, the kind of data typically used in numeric computations. For example, a 3D velocity field might look like:

100 * 100 * 100 * 3 * real

Type Variables

Type variables are a separate class of types that express free variables scoped within type signatures. Holding type variables as first order terms in the signatures encodes the fact that a term can be used in many concrete contexts with different concrete types.

For example the type capable of expressing all square two dimensional matrices could be written as a datashape with type variable A, constraining the two dimensions to be the same:

A * A * int32

A type capable of rectangular variable length arrays of integers can be written as two free type vars:

A * B * int32

Note

Any name beginning with an uppercase letter is parsed as a symbolic type (as opposed to concrete). Symbolic types can be used both as dimensions and as data types.

Option

An option type represents data which may be there or not. This is like data with NA values in R, or nullable columns in SQL. Given a type like int, it can be transformed by prefixing it with a question mark as ?int, or equivalently using the type constructor option[int]

For example a 5 * ?int array can model the Python data:

[1, 2, 3, None, None, 4]