# Datashape Overview

Datashape is a data layout language for array programming. It is designed to describe in-situ structured data without requiring transformation into a canonical form.

Similar to NumPy, datashape includes `shape` and `dtype`, but combined
together in the type system.

## Units

Single named types in datashape are called `unit` types. They represent
either a dtype like `int32` or `datetime`, or a single dimension
like `var`. Dimensions and a single dtype are composed together in
a datashape type.

### Primitive Types

Datashape includes a variety of dtypes corresponding to C/C++ types, similar to NumPy.

Bit type | Description |
---|---|
bool | Boolean (True or False) stored as a byte |
int8 | Byte (-128 to 127) |
int16 | Two’s Complement Integer (-32768 to 32767) |
int32 | Two’s Complement Integer (-2147483648 to 2147483647) |
int64 | Two’s Complement Integer (-9223372036854775808 to 9223372036854775807) |
uint8 | Unsigned integer (0 to 255) |
uint16 | Unsigned integer (0 to 65535) |
uint32 | Unsigned integer (0 to 4294967295) |
uint64 | Unsigned integer (0 to 18446744073709551615) |
float16 | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
float32 | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
float64 | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
complex[float32] | Complex number, represented by two 32-bit floats (real and imaginary components) |
complex[float64] | Complex number, represented by two 64-bit floats (real and imaginary components) |
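
The integer ranges in the table follow directly from the bit width, and the float layouts each add up to the full width of the type. The sketch below recomputes them in plain Python; `int_range` and `FLOAT_LAYOUT` are illustrative names, not part of datashape.

```python
def int_range(bits, signed=True):
    """Return (min, max) for a two's-complement or unsigned integer of `bits` width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

# (sign bits, exponent bits, mantissa bits), copied from the table above
FLOAT_LAYOUT = {
    "float16": (1, 5, 10),
    "float32": (1, 8, 23),
    "float64": (1, 11, 52),
}

for bits in (8, 16, 32, 64):
    print(f"int{bits}: {int_range(bits)}  uint{bits}: {int_range(bits, signed=False)}")

for name, parts in FLOAT_LAYOUT.items():
    print(name, "uses", sum(parts), "bits")
```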

Additionally, there are types which are not fully specified at the bit/byte level.

Type | Description |
---|---|
string | Variable length Unicode string. |
bytes | Variable length array of bytes. |
json | Variable length Unicode string which contains JSON. |
date | Date in the proleptic Gregorian calendar. |
time | Time not attached to a date. |
datetime | Point in time, combination of date and time. |
units | Associates physical units with numerical values. |

Many Python types can be mapped to datashape types:

Python type | Datashape |
---|---|
int | int32 |
bool | bool |
float | float64 |
complex | complex[float64] |
str | string |
unicode | string |
datetime.date | date |
datetime.time | time |
datetime.datetime | datetime or datetime[tz='<timezone>'] |
datetime.timedelta | units['microsecond', int64] |
bytes | bytes |
bytearray | bytes |
buffer | bytes |
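
As a rough illustration, the mapping can be written as a lookup table keyed by the exact Python type. This is a sketch only: `PYTHON_TO_DATASHAPE` and `datashape_name` are hypothetical names rather than the library's API, the entries simply copy the table above, and the Python 2 types `unicode` and `buffer` are left out because they do not exist in Python 3.

```python
import datetime

# Hypothetical lookup table mirroring the mapping table above.
PYTHON_TO_DATASHAPE = {
    bool: "bool",
    int: "int32",
    float: "float64",
    complex: "complex[float64]",
    str: "string",
    datetime.date: "date",
    datetime.time: "time",
    datetime.datetime: "datetime",
    datetime.timedelta: "units['microsecond', int64]",
    bytes: "bytes",
    bytearray: "bytes",
}

def datashape_name(value):
    """Look up the exact type of `value`, ignoring subclass relationships."""
    return PYTHON_TO_DATASHAPE[type(value)]

print(datashape_name(True))
print(datashape_name(3))
print(datashape_name(datetime.datetime(2020, 1, 1)))
```

The lookup uses `type(value)` rather than `isinstance` because `bool` is a subclass of `int` and `datetime.datetime` is a subclass of `datetime.date`; an `isinstance` chain could pick the wrong row.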

### String Types

To Blaze, all strings are sequences of Unicode code points, following
in the footsteps of Python 3. The default Blaze string atom, simply
called "string", is a variable-length string which can contain any
Unicode values. There is also a fixed-size variant compatible with
NumPy's strings, like `string[16, "ascii"]`.
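
To make the fixed-size variant concrete, the sketch below packs a Python string into a 16-byte ASCII field padded with NUL bytes, which is how NumPy's fixed-width byte strings are laid out. Treating `string[16, "ascii"]` this way is an assumption for illustration, and `to_fixed_ascii` is a hypothetical helper, not a datashape function.

```python
def to_fixed_ascii(text, size=16):
    """Encode `text` as ASCII and pad with NUL bytes to exactly `size` bytes."""
    raw = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    if len(raw) > size:
        raise ValueError(f"{text!r} does not fit in {size} bytes")
    return raw.ljust(size, b"\x00")

field = to_fixed_ascii("datashape")
print(len(field))
print(field)
```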

## Dimensions

An asterisk (*) between two types signifies an array. A datashape
consists of 0 or more `dimensions` followed by a `dtype`.

For example, an integer array of size three is:

```
3 * int
```

In this type, 3 is a `fixed` dimension, which means it is a dimension
whose size is always as given. Other dimension types include `var`.

Comparing with NumPy, the array created by
`np.empty((2, 3), 'int32')` has datashape `2 * 3 * int32`.
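
The correspondence with a NumPy-style shape tuple and dtype name can be sketched with plain string handling. Both helpers below are hypothetical and deliberately naive: they handle only fixed integer dimensions and a simple dtype name, and they do not use the datashape parser.

```python
def to_datashape(shape, dtype):
    """Join the dimensions and the dtype name with ' * '."""
    return " * ".join([str(n) for n in shape] + [dtype])

def split_datashape(text):
    """Naive inverse: fixed integer dimensions followed by a simple dtype name."""
    *dims, dtype = text.split(" * ")
    return tuple(int(d) for d in dims), dtype

print(to_datashape((2, 3), "int32"))
print(split_datashape("2 * 3 * int32"))
```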

### Records

Record types are ordered struct dtypes which hold a collection of types keyed by labels. Records look similar to Python dictionaries, but the order in which the names appear is significant.

Example 1:

```
{
name : string,
age : int,
height : int,
weight : int
}
```

Example 2:

```
{
r: int8,
g: int8,
b: int8,
a: int8
}
```

Records are themselves type declarations, so they can be nested, but they cannot be self-referential:

Example 3:

```
{
a: { x: int, y: int },
b: { x: int, z: int }
}
```
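
One way to see why field order matters is to model a record as an ordered tuple of `(name, type)` pairs instead of a dictionary. The representation and the `render` helper below are illustrative only and are not how the library stores records.

```python
# Two records with the same fields in a different order.
point_xy = (("x", "int"), ("y", "int"))
point_yx = (("y", "int"), ("x", "int"))

def render(record):
    """Render a (possibly nested) record as datashape-style text."""
    fields = []
    for name, typ in record:
        if isinstance(typ, tuple):  # a nested record
            typ = render(typ)
        fields.append(f"{name}: {typ}")
    return "{ " + ", ".join(fields) + " }"

nested = (
    ("a", (("x", "int"), ("y", "int"))),
    ("b", (("x", "int"), ("z", "int"))),
)

print(render(nested))
print(dict(point_xy) == dict(point_yx))  # dictionaries ignore field order
print(point_xy == point_yx)              # ordered pairs do not
```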

### Datashape Traits

While datashape is a very general type system, there are a number of patterns a datashape might fit in.

Tabular datashapes have just one dimension, typically `fixed` or
`var`, followed by a record containing only simple types, not
nested records. This can be intuitively thought of as data which
will fit in a SQL table:

```
var * { x : int, y : real, z : date }
```

Homogeneous datashapes are arrays that have a simple dtype, the kind of data typically used in numeric computations. For example, a 3D velocity field might look like:

```
100 * 100 * 100 * 3 * real
```
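
The two traits can be expressed as simple predicates. The sketch assumes a toy representation in which a datashape is a tuple of dimensions plus a measure, where the measure is either a dtype name (a string) or a record (a tuple of `(name, type)` pairs); `is_tabular` and `is_homogeneous` are hypothetical helpers, not library functions.

```python
def is_tabular(dims, measure):
    """One dimension followed by a record whose fields are all simple types."""
    return (
        len(dims) == 1
        and isinstance(measure, tuple)
        and all(isinstance(typ, str) for _, typ in measure)
    )

def is_homogeneous(dims, measure):
    """At least one dimension followed by a simple dtype."""
    return len(dims) >= 1 and isinstance(measure, str)

table = (("var",), (("x", "int"), ("y", "real"), ("z", "date")))
velocity = ((100, 100, 100, 3), "real")

print(is_tabular(*table), is_homogeneous(*table))
print(is_tabular(*velocity), is_homogeneous(*velocity))
```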

### Type Variables

Type variables are a separate class of types that express free variables scoped within type signatures. Holding type variables as first order terms in the signatures encodes the fact that a term can be used in many concrete contexts with different concrete types.

For example, the type capable of expressing all square two-dimensional
matrices could be written as a datashape with type variable `A`,
constraining the two dimensions to be the same:

```
A * A * int32
```

A type capable of expressing rectangular variable-length arrays of integers can be written with two free type variables:

```
A * B * int32
```

Note

Any name beginning with an uppercase letter is parsed as a symbolic type (as opposed to concrete). Symbolic types can be used both as dimensions and as data types.
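
The constraint that a repeated type variable imposes can be sketched as a small matching routine: each uppercase-initial name is bound to the first size it meets and must agree with every later occurrence. `match_dims` is a hypothetical helper written for this illustration, not the library's unification code.

```python
def match_dims(pattern, concrete):
    """Bind type variables in `pattern` to the sizes in `concrete`.

    Returns the bindings as a dict, or None if the shapes are incompatible.
    """
    if len(pattern) != len(concrete):
        return None
    bindings = {}
    for p, c in zip(pattern, concrete):
        if isinstance(p, str) and p[:1].isupper():  # symbolic dimension
            if bindings.setdefault(p, c) != c:
                return None                          # same name, different size
        elif p != c:                                 # concrete dimension must match
            return None
    return bindings

print(match_dims(("A", "A"), (3, 3)))  # square: A is bound once
print(match_dims(("A", "A"), (3, 4)))  # not square: no match
print(match_dims(("A", "B"), (3, 4)))  # rectangular: two independent bindings
```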

### Option

An option type represents data which may or may not be present. This is like
data with `NA` values in R, or nullable columns in SQL. Given a type
like `int`, it can be transformed by prefixing it with a question mark
as `?int`, or equivalently using the type constructor `option[int]`.

For example, a `6 * ?int` array can model the Python data:

```
[1, 2, 3, None, None, 4]
```
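
In Python terms, an option of `int` corresponds to values that are either an `int` or `None`. The check below is a minimal sketch of that idea; `fits_option_int` is a hypothetical helper, and the fixed dimension is taken from the length of the list itself.

```python
data = [1, 2, 3, None, None, 4]

def fits_option_int(values, size):
    """True if `values` has `size` items, each an int or a missing value (None)."""
    return len(values) == size and all(
        v is None or (isinstance(v, int) and not isinstance(v, bool))
        for v in values
    )

print(fits_option_int(data, len(data)))      # every item is an int or None
print(fits_option_int([1, "two", None], 3))  # a string is not a ?int
print(sum(v is None for v in data))          # number of missing values
```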