This is a proposal for a new DataFrame format in https://github.com/hyparam/hightable.
Goals:
- synchronous rendering
- events to trigger rerenders
- support sorting
- support sampling and shuffling
- support selecting a slice of sorted rows
- support rendering on every cell resolution
- add/remove/reorder columns?
- add/remove/reorder rows? (sampling and shuffling, but controlled)
- update cell values?
- infinite number of rows?
The dataframe is a representation of the underlying data. It has a fixed number of rows and columns, and it can be sorted, sampled, or shuffled. The result of each operation is a new dataframe.
Name | Description | Range |
---|---|---|
df.numRows | the number of rows in the dataframe. It might be less than in the parent dataframe in case of sampling. It cannot be infinite (at least for now). | [0, +Infinity[ |
rowNumber | the index of the row in the underlying data. | Depends on the parent dataframes or data sources. |
rowIndex | the index of the row in the dataframe. It might be different from rowNumber in case of sorting, sampling and/or shuffling. | [0, df.numRows - 1] |
interface DataFrame {
numRows: number;
header: string[];
// Checks if the required data is available, and if not, fetches it.
// The method is asynchronous and resolves when all the data has been fetched.
//
// It rejects on the first error, which can be the abort of the signal (in that case it must reject with an `AbortError`).
//
// It's responsible for dispatching the "cell:resolve" and "rownumber:resolve" events when data has resolved
// (i.e. when some new data is available synchronously through the methods `getCell` and `getRowNumber`).
// It can dispatch the events multiple times if the data is fetched in chunks.
//
// Note that it does not return the data.
fetch(data: {
rowStart: number; // inclusive
rowEnd: number; // exclusive
columns: string[]; // column names to render, along with the row numbers (always fetched). It can be empty to fetch only the row numbers.
signal?: AbortSignal; // optional signal to abort the fetch jobs
}): Promise<void>;
// Returns the row number (index in the underlying data) for the given row index in the dataframe.
// undefined if the row number is not available yet.
getRowNumber(data: {
row: number; // row index in the dataframe
}): ResolvedValue<number> | undefined;
// Returns the cell value for the given row index and column name in the dataframe.
// undefined if the cell value is not available yet.
getCell(data: {
row: number; // row index in the dataframe
column: string; // column name
}): ResolvedValue<any> | undefined;
// Event target to listen for changes in the dataframe.
// It can be used to trigger rerenders when the data is updated.
eventTarget: CustomEventTarget<DataFrameEvents>;
}
interface ResolvedValue<T> {
value: T; // the resolved value
}
interface DataFrameEvents {
"cell:resolve": {
rowStart: number;
rowEnd: number;
columns: string[];
};
"rownumber:resolve": {
rowStart: number;
rowEnd: number;
};
}
interface CustomEventTarget<TDetails> {
addEventListener<TType extends keyof TDetails>(
type: TType,
listener: (ev: CustomEvent<TDetails[TType]>) => any,
options?: boolean | AddEventListenerOptions
): void;
removeEventListener<TType extends keyof TDetails>(
type: TType,
listener: (ev: CustomEvent<TDetails[TType]>) => any,
options?: boolean | EventListenerOptions
): void;
dispatchEvent<TType extends keyof TDetails>(
ev: _TypedCustomEvent<TDetails, TType>
): void;
}
declare class _TypedCustomEvent<
TDetails,
TType extends keyof TDetails
> extends CustomEvent<TDetails[TType]> {
constructor(
type: TType,
eventInitDict: { detail: TDetails[TType] } & EventInit
);
}
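As an illustration of how the typed events above could be dispatched and consumed, here is a minimal sketch. It assumes a runtime that provides the global `EventTarget` and `CustomEvent` (browsers, recent Node.js versions). The helpers `dispatchResolve` and `onResolve` are hypothetical names, not part of the proposal.

```typescript
// Event details, copied from the proposal.
interface DataFrameEvents {
  "cell:resolve": { rowStart: number; rowEnd: number; columns: string[] };
  "rownumber:resolve": { rowStart: number; rowEnd: number };
}

// Hypothetical helper: dispatches a typed event on a plain EventTarget.
function dispatchResolve<T extends keyof DataFrameEvents>(
  target: EventTarget,
  type: T,
  detail: DataFrameEvents[T]
): void {
  target.dispatchEvent(new CustomEvent(type, { detail }));
}

// Hypothetical helper: registers a typed listener and returns an unsubscribe function.
function onResolve<T extends keyof DataFrameEvents>(
  target: EventTarget,
  type: T,
  listener: (detail: DataFrameEvents[T]) => void
): () => void {
  const wrapped = (ev: Event): void =>
    listener((ev as CustomEvent<DataFrameEvents[T]>).detail);
  target.addEventListener(type, wrapped);
  return () => target.removeEventListener(type, wrapped);
}

// Usage: a renderer would trigger a rerender inside the listener.
const demoTarget = new EventTarget();
const receivedRanges: string[] = [];
const unsubscribe = onResolve(demoTarget, "cell:resolve", ({ rowStart, rowEnd, columns }) => {
  receivedRanges.push(`${rowStart}-${rowEnd}:${columns.join(",")}`);
});
dispatchResolve(demoTarget, "cell:resolve", { rowStart: 0, rowEnd: 2, columns: ["name"] });
unsubscribe();
// Not received: the listener has been removed.
dispatchResolve(demoTarget, "cell:resolve", { rowStart: 2, rowEnd: 4, columns: ["name"] });
```

Dispatching is synchronous, so `receivedRanges` only contains the first range once this code has run.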
New dataframes can be created from a dataframe using functions:
sort(options: { dataFrame: DataFrame, orderBy: OrderBy }): DataFrame
: sorts the dataframe by the given columns and directions. The result is a new dataframe with the same number of rows, but the rows are sorted according to the given order.
sample(options: { dataFrame: DataFrame, numRows: number }): DataFrame
: samples the dataframe to the given number of rows. The result is a new dataframe with the given number of rows. The rows are randomly selected from the original dataframe and keep their original relative order.
shuffle(options: { dataFrame: DataFrame }): DataFrame
: shuffles the dataframe. The result is a new dataframe with the same number of rows, but the rows are randomly re-ordered.
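One possible way to think about `sort` is as a permutation of the parent row indexes. The sketch below is an assumption, not the proposed implementation: it works on an in-memory array instead of a `DataFrame`, the `OrderBy` shape is inferred from the examples in this document, and `sortedIndexes` is a hypothetical helper name.

```typescript
// Shape inferred from the examples in this document.
type OrderBy = { column: string; direction: "ascending" | "descending" }[];

// Hypothetical helper: for each row index of the sorted dataframe, returns the
// corresponding row index in the parent dataframe.
function sortedIndexes(rows: Record<string, any>[], orderBy: OrderBy): number[] {
  const compareValues = (a: any, b: any): number => (a < b ? -1 : a > b ? 1 : 0);
  return rows
    .map((_, index) => index)
    .sort((left, right) => {
      for (const { column, direction } of orderBy) {
        const order = compareValues(rows[left][column], rows[right][column]);
        if (order !== 0) return direction === "ascending" ? order : -order;
      }
      // Tie on every column: keep the order of the parent dataframe.
      return left - right;
    });
}

const exampleRows = [
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
];
const byAgeDescending = sortedIndexes(exampleRows, [
  { column: "age", direction: "descending" },
]); // [1, 0, 2, 3]
const byAnimalThenAge = sortedIndexes(exampleRows, [
  { column: "animal", direction: "ascending" },
  { column: "age", direction: "ascending" },
]); // [3, 0, 1, 2]
```

The explicit tie-breaker on the parent index makes the fallback to the parent order visible instead of relying on the stability of the sort.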
A dataframe can also be created from an array of objects:
fromArray(data: Record<string, any>[]): DataFrame
: creates a dataframe from the given array of objects. The objects are used as rows, and the keys of the objects are used as columns. The order of the columns is determined by the first object in the array.
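A rough sketch of what `fromArray` could look like for in-memory data follows. It is written against a reduced, hypothetical version of the interface (`InMemoryDataFrame`, without `signal` and `eventTarget`), since every value is already resolved. Returning `undefined` for out-of-range rows is an assumption: the proposal does not specify that case.

```typescript
interface ResolvedValue<T> {
  value: T;
}

// Reduced, hypothetical version of the proposed interface.
interface InMemoryDataFrame {
  numRows: number;
  header: string[];
  fetch(data: { rowStart: number; rowEnd: number; columns: string[] }): Promise<void>;
  getRowNumber(data: { row: number }): ResolvedValue<number> | undefined;
  getCell(data: { row: number; column: string }): ResolvedValue<any> | undefined;
}

function fromArraySketch(data: Record<string, any>[]): InMemoryDataFrame {
  // The column order is given by the first object.
  const header = data.length > 0 ? Object.keys(data[0]) : [];
  const isValidRow = (row: number): boolean =>
    Number.isInteger(row) && row >= 0 && row < data.length;
  return {
    numRows: data.length,
    header,
    // Nothing to fetch: the promise resolves right away.
    fetch: () => Promise.resolve(),
    // The dataframe is the underlying data: row index and row number are equal.
    getRowNumber: ({ row }) => (isValidRow(row) ? { value: row } : undefined),
    getCell: ({ row, column }) =>
      isValidRow(row) && header.includes(column)
        ? { value: data[row][column] }
        : undefined,
  };
}

const sketchDf = fromArraySketch([
  { name: "Charlie", age: 25 },
  { name: "Alice", age: 30 },
]);
const aliceAge = sketchDf.getCell({ row: 1, column: "age" }); // { value: 30 }
```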
The dataframe is responsible for caching the cells and row numbers, so that the data is fetched only once and reused when needed.
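The caching responsibility could be sketched as follows. `CellCache` and its methods are hypothetical names. The point of the sketch is that wrapping values in `ResolvedValue` lets the cache tell a cell that has not been fetched yet from a cell whose resolved value is `undefined`.

```typescript
interface ResolvedValue<T> {
  value: T;
}

// Hypothetical cache: one map per column, keyed by row index.
class CellCache {
  private readonly columns = new Map<string, Map<number, ResolvedValue<any>>>();

  get(row: number, column: string): ResolvedValue<any> | undefined {
    return this.columns.get(column)?.get(row);
  }

  set(row: number, column: string, value: any): void {
    let cells = this.columns.get(column);
    if (cells === undefined) {
      cells = new Map();
      this.columns.set(column, cells);
    }
    cells.set(row, { value });
  }

  // Rows of [rowStart, rowEnd) that still have to be fetched for the column.
  missingRows(rowStart: number, rowEnd: number, column: string): number[] {
    const missing: number[] = [];
    for (let row = rowStart; row < rowEnd; row++) {
      if (this.get(row, column) === undefined) missing.push(row);
    }
    return missing;
  }
}

const cellCache = new CellCache();
cellCache.set(1, "name", "Alice");
cellCache.set(2, "name", undefined); // resolved, and the value itself is undefined
const rowsToFetch = cellCache.missingRows(0, 4, "name"); // [0, 3]
```

A `fetch` implementation could call `missingRows` first and only request those rows from the data source.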
Let's use the following array as the underlying data:
data = [
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Alice", age: 30, animal: "dog" },
{ name: "Dani", age: 20, animal: "fish" },
{ name: "Bob", age: 20, animal: "cat" },
];
The dataframe represents the underlying data unmodified. It has 4 rows:
df1 = fromArray(data);
expect(df1.numRows).toBe(4);
// toArray() and toRowNumbers() are used for illustration only; they are not part of the DataFrame interface.
expect(df1.header).toEqual(["name", "age", "animal"]);
expect(df1.toArray()).toEqual([
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Alice", age: 30, animal: "dog" },
{ name: "Dani", age: 20, animal: "fish" },
{ name: "Bob", age: 20, animal: "cat" },
]);
expect(df1.toRowNumbers()).toEqual([0, 1, 2, 3]);
Show all the rows and columns:
df1.rowIndex | rowNumber | name | age | animal |
---|---|---|---|---|
0 | 0 | Charlie | 25 | cat |
1 | 1 | Alice | 30 | dog |
2 | 2 | Dani | 20 | fish |
3 | 3 | Bob | 20 | cat |
df1.fetch({ rowStart: 0, rowEnd: 4, columns: ["name", "age", "animal"] });
// on each render, for each row (0-3) and column ("name", "age", "animal"):
df1.getRowNumber({ row });
df1.getCell({ row, column });
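The render pass described above is synchronous: it only reads what is already available, and shows a placeholder otherwise. A minimal sketch, assuming a hypothetical `renderRows` helper that turns the cells into strings and a stub `getCell` in which only the first two rows have been fetched:

```typescript
interface ResolvedValue<T> {
  value: T;
}
type GetCell = (data: { row: number; column: string }) => ResolvedValue<any> | undefined;

// Hypothetical render helper: one string per cell, with a placeholder for
// the cells that are not resolved yet.
function renderRows(options: {
  getCell: GetCell;
  rowStart: number; // inclusive
  rowEnd: number; // exclusive
  columns: string[];
  placeholder?: string;
}): string[][] {
  const { getCell, rowStart, rowEnd, columns, placeholder = "(pending)" } = options;
  const rendered: string[][] = [];
  for (let row = rowStart; row < rowEnd; row++) {
    rendered.push(
      columns.map((column) => {
        const cell = getCell({ row, column });
        return cell === undefined ? placeholder : String(cell.value);
      })
    );
  }
  return rendered;
}

// Stub: stands in for a dataframe where only the first two rows are fetched.
const fetchedRows: Record<string, any>[] = [
  { name: "Charlie", age: 25 },
  { name: "Alice", age: 30 },
];
const getCellStub: GetCell = ({ row, column }) =>
  row < fetchedRows.length ? { value: fetchedRows[row][column] } : undefined;

const renderedRows = renderRows({
  getCell: getCellStub,
  rowStart: 0,
  rowEnd: 3,
  columns: ["name", "age"],
});
// [["Charlie", "25"], ["Alice", "30"], ["(pending)", "(pending)"]]
```

In the proposal, the same pass would run again each time a "cell:resolve" event is received, until no placeholder is left.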
Show a subset of the data (2nd and 3rd rows, and two columns):
df1.rowIndex | rowNumber | name | age |
---|---|---|---|
1 | 1 | Alice | 30 |
2 | 2 | Dani | 20 |
df1.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df1.getRowNumber({ row });
df1.getCell({ row, column });
The dataframe represents the example data, sorted by descending age (the natural order in example data is used in case of tie). It has 4 rows:
const orderBy = [{ column: "age", direction: "descending" }];
df2 = sort({ dataFrame: df1, orderBy });
expect(df2.numRows).toBe(4);
expect(df2.header).toEqual(["name", "age", "animal"]);
expect(df2.toArray()).toEqual([
{ name: "Alice", age: 30, animal: "dog" },
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Dani", age: 20, animal: "fish" },
{ name: "Bob", age: 20, animal: "cat" },
]);
expect(df2.toRowNumbers()).toEqual([1, 0, 2, 3]);
Show the name and age for the 2nd and 3rd rows:
df2.rowIndex | df1.rowIndex | rowNumber | name | age |
---|---|---|---|---|
1 | 0 | 0 | Charlie | 25 |
2 | 2 | 2 | Dani | 20 |
df2.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df2.getRowNumber({ row });
df2.getCell({ row, column });
The dataframe represents the example data, sorted by animal, then by age in case of tie (then by the natural order in example data in case of another tie). It has 4 rows:
const orderBy = [
{ column: "animal", direction: "ascending" },
{ column: "age", direction: "ascending" },
];
// Note that df3 is derived from df1, not df2.
df3 = sort({ dataFrame: df1, orderBy });
expect(df3.numRows).toBe(4);
expect(df3.header).toEqual(["name", "age", "animal"]);
expect(df3.toArray()).toEqual([
{ name: "Bob", age: 20, animal: "cat" },
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Alice", age: 30, animal: "dog" },
{ name: "Dani", age: 20, animal: "fish" },
]);
expect(df3.toRowNumbers()).toEqual([3, 0, 1, 2]);
Show the name and age for the 2nd and 3rd rows:
df3.rowIndex | df1.rowIndex | rowNumber | name | age |
---|---|---|---|---|
1 | 0 | 0 | Charlie | 25 |
2 | 1 | 1 | Alice | 30 |
The dataframe represents a sample of the example data with 3 rows instead of 4. Let's say it removes Alice:
df4 = sample({ dataFrame: df1, numRows: 3 });
expect(df4.numRows).toBe(3);
expect(df4.header).toEqual(["name", "age", "animal"]);
expect(df4.toArray()).toEqual([
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Dani", age: 20, animal: "fish" },
{ name: "Bob", age: 20, animal: "cat" },
]);
expect(df4.toRowNumbers()).toEqual([0, 2, 3]);
Show the name and age for the 2nd and 3rd rows:
df4.rowIndex | df1.rowIndex | rowNumber | name | age |
---|---|---|---|---|
1 | 2 | 2 | Dani | 20 |
2 | 3 | 3 | Bob | 20 |
df4.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df4.getRowNumber({ row });
df4.getCell({ row, column });
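Sampling "in order" can be obtained with selection sampling: walk the parent rows once and keep each one with probability (picks remaining) / (rows remaining). This is only one possible technique, not necessarily the one the proposal would use. `sampleIndexes` and its injectable `random` parameter are hypothetical.

```typescript
// Hypothetical helper: returns `numRows` parent row indexes, picked uniformly
// at random and kept in increasing order (selection sampling).
function sampleIndexes(
  parentNumRows: number,
  numRows: number,
  random: () => number = Math.random // must return values in [0, 1)
): number[] {
  const target = Math.min(numRows, parentNumRows);
  const picked: number[] = [];
  for (let index = 0; index < parentNumRows && picked.length < target; index++) {
    const remainingRows = parentNumRows - index;
    const remainingPicks = target - picked.length;
    // Keep this row with probability remainingPicks / remainingRows.
    if (random() * remainingRows < remainingPicks) picked.push(index);
  }
  return picked;
}

// The result changes on every call: 3 increasing indexes between 0 and 3.
const df4ParentIndexes = sampleIndexes(4, 3);
```

Because the walk follows the parent order, the sample never needs to be sorted afterwards, and it always contains exactly the requested number of rows (capped by the size of the parent).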
The dataframe represents the sampled data (df4) but the order is shuffled: let's say Dani, Bob, Charlie. It has 3 rows.
df5 = shuffle({ dataFrame: df4 });
expect(df5.numRows).toBe(3);
expect(df5.header).toEqual(["name", "age", "animal"]);
expect(df5.toArray()).toEqual([
{ name: "Dani", age: 20, animal: "fish" },
{ name: "Bob", age: 20, animal: "cat" },
{ name: "Charlie", age: 25, animal: "cat" },
]);
expect(df5.toRowNumbers()).toEqual([2, 3, 0]);
Show the name and age for the 2nd and 3rd rows:
df5.rowIndex | df4.rowIndex | df1.rowIndex | rowNumber | name | age |
---|---|---|---|---|---|
1 | 2 | 3 | 3 | Bob | 20 |
2 | 0 | 0 | 0 | Charlie | 25 |
df5.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df5.getRowNumber({ row });
df5.getCell({ row, column });
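The shuffle could be represented the same way, as a random permutation of the parent row indexes. The sketch below uses the Fisher-Yates algorithm, which is an assumption (the proposal does not name an algorithm). `shuffleIndexes` and its injectable `random` parameter are hypothetical.

```typescript
// Hypothetical helper: returns a random permutation of [0, numRows), where
// entry i is the parent row index shown at row index i (Fisher-Yates shuffle).
function shuffleIndexes(
  numRows: number,
  random: () => number = Math.random // must return values in [0, 1)
): number[] {
  const indexes = Array.from({ length: numRows }, (_, index) => index);
  for (let i = numRows - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [indexes[i], indexes[j]] = [indexes[j], indexes[i]];
  }
  return indexes;
}

// The result changes on every call: a permutation of [0, 1, 2].
const df5ParentIndexes = shuffleIndexes(3);
```

Injecting a seeded `random` function would make the shuffle reproducible, which might matter if the same shuffled dataframe has to be rebuilt later.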
The dataframe represents the sampled and shuffled data (df5) but sorted by animal. It has 3 rows.
const orderBy = [{ column: "animal", direction: "ascending" }];
df6 = sort({ dataFrame: df5, orderBy });
expect(df6.numRows).toBe(3);
expect(df6.header).toEqual(["name", "age", "animal"]);
expect(df6.toArray()).toEqual([
{ name: "Bob", age: 20, animal: "cat" },
{ name: "Charlie", age: 25, animal: "cat" },
{ name: "Dani", age: 20, animal: "fish" },
]);
expect(df6.toRowNumbers()).toEqual([3, 0, 2]);
Show the name and age for the 2nd and 3rd rows:
df6.rowIndex | df5.rowIndex | df4.rowIndex | df1.rowIndex | rowNumber | name | age |
---|---|---|---|---|---|---|
1 | 2 | 0 | 0 | 0 | Charlie | 25 |
2 | 0 | 1 | 2 | 2 | Dani | 20 |
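The table above can be read as a chain of index mappings: each derived dataframe maps its row indexes to row indexes of its parent, until the underlying data is reached. A small sketch with the values of this example, where `resolveRowNumber` is a hypothetical helper:

```typescript
// Hypothetical helper: follows the chain of index mappings, from the derived
// dataframe down to the underlying data. Each mapping gives, for a row index
// in one dataframe, the row index in its parent.
function resolveRowNumber(row: number, mappings: number[][]): number | undefined {
  let index: number | undefined = row;
  for (const mapping of mappings) {
    if (index === undefined) return undefined;
    index = mapping[index];
  }
  return index;
}

const df6ToDf5 = [1, 2, 0]; // sort by animal: Bob, Charlie, Dani
const df5ToDf4 = [1, 2, 0]; // shuffle: Dani, Bob, Charlie
const df4ToDf1 = [0, 2, 3]; // sample: Alice is removed
const mappingChain = [df6ToDf5, df5ToDf4, df4ToDf1];

const df6RowNumbers = [0, 1, 2].map((row) => resolveRowNumber(row, mappingChain));
// [3, 0, 2]
```

In a lazy implementation, some entries of a mapping might not be known yet, in which case the helper returns `undefined`, in line with `getRowNumber`.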
In HighTable, rows can be selected by the user, or programmatically. The selection is represented by the row numbers, i.e. the index of each selected row in the underlying data.
As the row number is not available until the data is fetched, HighTable has to manage the selection state:
- if HighTable is passed a selection, and not all the row numbers are available, it sets all the selection controls to "pending mode" (e.g. disabled, with an unknown state) until the data is fetched (or the fetch is aborted).
- when shift-clicking to select a range of rows, the row numbers of some rows in the range might not be available yet. If so, when the gesture occurs, all the selection controls are set to "pending mode" until the data is fetched.
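The "pending mode" rule could be sketched as a pure function of the selection and of the row numbers that are currently resolved. This is an assumption about how HighTable might implement it: `selectionControls` is a hypothetical helper, and it only looks at a range of rows, while the rule above might apply to a different set of rows.

```typescript
interface ResolvedValue<T> {
  value: T;
}
type SelectionControl = "pending" | "selected" | "unselected";

// Hypothetical helper: state of the selection control of each row in
// [rowStart, rowEnd). If any row number is unresolved, all are pending.
function selectionControls(options: {
  selection: Set<number>; // selected row numbers
  rowStart: number;
  rowEnd: number;
  getRowNumber: (data: { row: number }) => ResolvedValue<number> | undefined;
}): SelectionControl[] {
  const { selection, rowStart, rowEnd, getRowNumber } = options;
  const rowNumbers: number[] = [];
  for (let row = rowStart; row < rowEnd; row++) {
    const rowNumber = getRowNumber({ row });
    if (rowNumber === undefined) {
      return Array.from({ length: rowEnd - rowStart }, (): SelectionControl => "pending");
    }
    rowNumbers.push(rowNumber.value);
  }
  return rowNumbers.map((rowNumber): SelectionControl =>
    selection.has(rowNumber) ? "selected" : "unselected"
  );
}

// Row numbers of df2 (sorted by descending age); the third one is not resolved yet.
const knownRowNumbers: (number | undefined)[] = [1, 0, undefined, 3];
const getRowNumberStub = ({ row }: { row: number }): ResolvedValue<number> | undefined => {
  const value = knownRowNumbers[row];
  return value === undefined ? undefined : { value };
};
const selectedRowNumbers = new Set([0, 3]); // Charlie and Bob

const controlsWhilePending = selectionControls({
  selection: selectedRowNumbers,
  rowStart: 0,
  rowEnd: 4,
  getRowNumber: getRowNumberStub,
}); // ["pending", "pending", "pending", "pending"]

knownRowNumbers[2] = 2; // the missing row number resolves
const controlsOnceResolved = selectionControls({
  selection: selectedRowNumbers,
  rowStart: 0,
  rowEnd: 4,
  getRowNumber: getRowNumberStub,
}); // ["unselected", "selected", "unselected", "selected"]
```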
Hmmmm. I realized that sorting is not the same as sampling and shuffling.
We will keep having the option to sort any dataframe, while sampling, shuffling, or versioning (think iceberg versions) should be done by creating a different dataframe.