HighTable DataFrame V2

This is a proposal for a new DataFrame format in https://github.com/hyparam/hightable.

Goals:

synchronous rendering
events to trigger rerenders
support sorting
support sampling and shuffling
support selecting a slice of sorted rows
support rendering on every cell resolution
add/remove/reorder columns?
add/remove/reorder rows? (sampling and shuffling, but controlled)
update cell values?
infinite number of rows?

DataFrame

The dataframe is a representation of the underlying data. It has a fixed number of rows and columns, and it can be sorted, sampled, or shuffled. The result of each operation is a new dataframe.

Name	Description	Range
`df.numRows`	the number of rows in the dataframe. It might be less than the parent dataframe in case of sampling. It cannot be infinite (at least for now).	`[0, +Infinity[`
`rowNumber`	the index of the row in the underlying data.	Depends on the parent dataframes or data sources.
`rowIndex`	the index of the row in the dataframe. It might be different from `rowNumber` in case of sorting, sampling and/or shuffling.	`[0, df.numRows - 1]`
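
To illustrate how a `rowIndex` resolves to a `rowNumber` through a chain of parent dataframes, here is a minimal sketch. It assumes, purely for illustration, that each derived dataframe can be modelled as an array mapping its own row indexes to row indexes in its parent. `IndexMap` and `resolveRowNumber` are hypothetical names, not part of the proposal.

```typescript
// Hypothetical model: a derived dataframe is an array that maps each of its
// row indexes to a row index in its parent (the proposal does not mandate this).
type IndexMap = readonly number[];

// Resolves a row index to a row number in the underlying data.
// `maps` is ordered from the outermost dataframe down to the data source.
function resolveRowNumber(maps: readonly IndexMap[], rowIndex: number): number {
  let index = rowIndex;
  for (const map of maps) {
    const parentIndex = map[index];
    if (parentIndex === undefined) {
      throw new RangeError(`row index ${index} is outside [0, ${map.length - 1}]`);
    }
    index = parentIndex;
  }
  return index;
}

// A sorted view of 4 rows, then a sample keeping rows 1 and 3 of the sorted view.
const sortedMap: IndexMap = [1, 0, 2, 3];
const sampledMap: IndexMap = [1, 3];

const sampledRowNumbers = sampledMap.map((_, rowIndex) =>
  resolveRowNumber([sampledMap, sortedMap], rowIndex)
);
console.log(sampledRowNumbers); // row numbers 0 and 3
```

In this model the sampled dataframe has `numRows` equal to 2, its row indexes are 0 and 1, and its row numbers still point at the underlying data.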

TypeScript interface

interface DataFrame {
  numRows: number;
  header: string[];

  // Checks if the required data is available and, if not, fetches it.
  // The method is asynchronous and resolves when all the data has been fetched.
  //
  // It rejects on the first error, which can be the signal abort (it must throw `AbortError`).
  //
  // It is responsible for dispatching the "cell:resolve" and "rownumber:resolve" events when data has resolved
  // (i.e. when some new data is available synchronously through the methods `getCell` and `getRowNumber`).
  // It can dispatch the events multiple times if the data is fetched in chunks.
  //
  // Note that it does not return the data.
  fetch(data: {
    rowStart: number; // inclusive
    rowEnd: number; // exclusive
    columns: string[]; // column names to render, along with the row numbers (always fetched). It can be empty to fetch only the row numbers.
    signal?: AbortSignal; // optional signal to abort the fetch jobs
  }): Promise<void>;

  // Returns the row number (index in the underlying data) for the given row index in the dataframe.
  // undefined if the row number is not available yet.
  getRowNumber(data: {
    row: number; // row index in the dataframe
  }): ResolvedValue<number> | undefined;

  // Returns the cell value for the given row index and column name in the dataframe.
  // undefined if the cell value is not available yet.
  getCell(data: {
    row: number; // row index in the dataframe
    column: string; // column name
  }): ResolvedValue<any> | undefined;

  // Event target to listen for changes in the dataframe.
  // It can be used to trigger rerenders when the data is updated.
  eventTarget: CustomEventTarget<DataFrameEvents>;
}

interface ResolvedValue<T> {
  value: T; // the resolved value
}

interface DataFrameEvents {
  "cell:resolve": {
    rowStart: number;
    rowEnd: number;
    columns: string[];
  };
  "rownumber:resolve": {
    rowStart: number;
    rowEnd: number;
  };
}

interface CustomEventTarget<TDetails> {
  addEventListener<TType extends keyof TDetails>(
    type: TType,
    listener: (ev: CustomEvent<TDetails[TType]>) => any,
    options?: boolean | AddEventListenerOptions
  ): void;

  removeEventListener<TType extends keyof TDetails>(
    type: TType,
    listener: (ev: CustomEvent<TDetails[TType]>) => any,
    options?: boolean | EventListenerOptions
  ): void;

  dispatchEvent<TType extends keyof TDetails>(
    ev: _TypedCustomEvent<TDetails, TType>
  ): void;
}

// Helper type: a CustomEvent whose `detail` is typed according to the event name.
declare class _TypedCustomEvent<
  TDetails,
  TType extends keyof TDetails
> extends CustomEvent<TDetails[TType]> {
  constructor(
    type: TType,
    eventInitDict: { detail: TDetails[TType] } & EventInit
  );
}
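
The interface above implies a synchronous render pass that reads whatever has already resolved and shows a placeholder for the rest. The following is a minimal sketch of such a pass, not the HighTable implementation: `CellSource`, `renderRows`, `PENDING` and the stub dataframe are hypothetical names introduced for illustration.

```typescript
interface ResolvedValue<T> {
  value: T;
}

// Subset of the proposed DataFrame interface that a render pass needs.
interface CellSource {
  getRowNumber(data: { row: number }): ResolvedValue<number> | undefined;
  getCell(data: { row: number; column: string }): ResolvedValue<unknown> | undefined;
}

const PENDING = "(pending)"; // hypothetical placeholder for unresolved values

// Synchronously builds the text of rows [rowStart, rowEnd) for the given columns.
function renderRows(df: CellSource, rowStart: number, rowEnd: number, columns: string[]): string[][] {
  const lines: string[][] = [];
  for (let row = rowStart; row < rowEnd; row++) {
    const rowNumber = df.getRowNumber({ row });
    const line = [rowNumber ? String(rowNumber.value) : PENDING];
    for (const column of columns) {
      const cell = df.getCell({ row, column });
      // Test the wrapper, not `cell.value`: a resolved value may itself be undefined.
      line.push(cell ? String(cell.value) : PENDING);
    }
    lines.push(line);
  }
  return lines;
}

// Stub dataframe: row 0 resolved to "Charlie", row 1 resolved to undefined, row 2 not fetched yet.
const resolvedNames = new Map<number, unknown>([[0, "Charlie"], [1, undefined]]);
const partialDf: CellSource = {
  getRowNumber: ({ row }) => ({ value: row }),
  getCell: ({ row, column }) =>
    column === "name" && resolvedNames.has(row) ? { value: resolvedNames.get(row) } : undefined,
};

const rendered = renderRows(partialDf, 0, 3, ["name"]);
```

The stub shows why `getCell` returns a `ResolvedValue` wrapper rather than the bare value: the wrapper lets the renderer tell "resolved to `undefined`" apart from "not available yet". In a real table, a listener on "cell:resolve" would presumably trigger the same pass again.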

New dataframes can be created from a dataframe using functions:

`sort(options: { dataFrame: DataFrame, orderBy: OrderBy }): DataFrame` sorts the dataframe by the given columns and directions. The result is a new dataframe with the same number of rows, ordered according to `orderBy`.
`sample(options: { dataFrame: DataFrame, numRows: number }): DataFrame` samples the dataframe down to the given number of rows. The result is a new dataframe with `numRows` rows, selected at random from the original dataframe and kept in their original relative order.
`shuffle(options: { dataFrame: DataFrame }): DataFrame` shuffles the dataframe. The result is a new dataframe with the same number of rows, randomly reordered.
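
As a sketch of how `sample` and `shuffle` could pick their rows, the helpers below compute only the row indexes of the parent dataframe to keep. They are assumptions for illustration: `sampleIndexes`, `shuffleIndexes` and the injectable `Random` source are hypothetical names, and rejecting a sample size larger than the parent is a choice the proposal does not specify. Sampling uses selection sampling and shuffling uses Fisher-Yates.

```typescript
// Random source returning a number in [0, 1); injectable so results can be reproduced.
type Random = () => number;

// Selection sampling: picks exactly `sampleSize` row indexes out of `numRows`,
// in ascending order, so the sampled rows keep their original relative order.
function sampleIndexes(numRows: number, sampleSize: number, random: Random = Math.random): number[] {
  if (!Number.isInteger(sampleSize) || sampleSize < 0 || sampleSize > numRows) {
    // Assumption: the proposal does not say how an oversized sample is handled.
    throw new RangeError("sampleSize must be an integer in [0, numRows]");
  }
  const picked: number[] = [];
  for (let row = 0; row < numRows && picked.length < sampleSize; row++) {
    const remainingRows = numRows - row;
    const remainingPicks = sampleSize - picked.length;
    // Keep this row with probability remainingPicks / remainingRows.
    if (random() * remainingRows < remainingPicks) picked.push(row);
  }
  return picked;
}

// Fisher-Yates shuffle: returns a random permutation of the indexes 0 .. numRows - 1.
function shuffleIndexes(numRows: number, random: Random = Math.random): number[] {
  const indexes = Array.from({ length: numRows }, (_, i) => i);
  for (let i = numRows - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [indexes[i], indexes[j]] = [indexes[j], indexes[i]];
  }
  return indexes;
}

const sampled = sampleIndexes(4, 2);
const shuffled = shuffleIndexes(4);
```

Both helpers return an index array, which a derived dataframe could use to translate its own row indexes into those of its parent.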

A dataframe can also be created from an array of objects:

`fromArray(data: Record<string, any>[]): DataFrame` creates a dataframe from the given array of objects. The objects are used as rows, and the keys of the objects are used as columns. The order of the columns is determined by the first object in the array.
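
A reduced sketch of what `fromArray` might look like is given below. It is an assumption-laden illustration rather than the proposed implementation: `fromArraySketch` and `ArrayDataFrame` are hypothetical names, `eventTarget` is left out because in-memory data never resolves later, and throwing on an out-of-range row or unknown column is a choice the proposal does not specify.

```typescript
interface ResolvedValue<T> {
  value: T;
}

// Subset of the proposed DataFrame interface (no `eventTarget` in this sketch).
interface ArrayDataFrame {
  numRows: number;
  header: string[];
  fetch(data: { rowStart: number; rowEnd: number; columns: string[] }): Promise<void>;
  getRowNumber(data: { row: number }): ResolvedValue<number> | undefined;
  getCell(data: { row: number; column: string }): ResolvedValue<unknown> | undefined;
}

function fromArraySketch(rows: Record<string, unknown>[]): ArrayDataFrame {
  // Column order comes from the first object, as described above.
  const header = rows.length > 0 ? Object.keys(rows[0]) : [];

  const checkRow = (row: number): void => {
    if (!Number.isInteger(row) || row < 0 || row >= rows.length) {
      throw new RangeError(`row ${row} is outside [0, ${rows.length - 1}]`);
    }
  };

  return {
    numRows: rows.length,
    header,
    // Everything is already in memory, so there is nothing to fetch.
    fetch: async () => {},
    // Without sorting or sampling, the row index is the row number.
    getRowNumber: ({ row }) => {
      checkRow(row);
      return { value: row };
    },
    getCell: ({ row, column }) => {
      checkRow(row);
      if (!header.includes(column)) throw new Error(`unknown column: ${column}`);
      return { value: rows[row][column] };
    },
  };
}

const people = fromArraySketch([
  { name: "Charlie", age: 25 },
  { name: "Alice", age: 30 },
]);
const aliceAge = people.getCell({ row: 1, column: "age" });
```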

The dataframe is responsible for caching the cells and row numbers, so that each value is fetched only once and reused when needed.
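
One possible shape for such a cache is sketched below, assuming one map per column keyed by row index. `CellCache` and its methods are hypothetical names. The proposal does not prescribe a cache structure, and a real implementation might prefer range-based bookkeeping for large tables.

```typescript
// Hypothetical cell cache: one Map per column, keyed by row index.
class CellCache {
  private readonly columns = new Map<string, Map<number, unknown>>();

  set(row: number, column: string, value: unknown): void {
    let cells = this.columns.get(column);
    if (!cells) {
      cells = new Map<number, unknown>();
      this.columns.set(column, cells);
    }
    cells.set(row, value);
  }

  // Returns a wrapper so that a cached `undefined` differs from a cache miss.
  get(row: number, column: string): { value: unknown } | undefined {
    const cells = this.columns.get(column);
    if (!cells || !cells.has(row)) return undefined;
    return { value: cells.get(row) };
  }

  // Rows in [rowStart, rowEnd) that still need to be fetched for `column`.
  missingRows(rowStart: number, rowEnd: number, column: string): number[] {
    const cells = this.columns.get(column);
    const missing: number[] = [];
    for (let row = rowStart; row < rowEnd; row++) {
      if (!cells || !cells.has(row)) missing.push(row);
    }
    return missing;
  }
}

const cache = new CellCache();
cache.set(1, "age", 30);
const toFetch = cache.missingRows(0, 3, "age"); // rows 0 and 2 are still missing
```

With this shape, `fetch` would only request the rows returned by `missingRows`, and `getCell` would be a direct call to `get`.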

Examples

Let's use the following array as the underlying data:

const data = [
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
];

df1: underlying data

The dataframe represents the underlying data unmodified. It has 4 rows:

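const df1 = fromArray(data);

// `toArray()` and `toRowNumbers()` are illustrative helpers, not part of the interface above.
// Arrays and objects are compared with `toEqual` (deep equality), since `toBe` checks identity.
expect(df1.numRows).toBe(4);
expect(df1.header).toEqual(["name", "age", "animal"]);
expect(df1.toArray()).toEqual([
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
]);
expect(df1.toRowNumbers()).toEqual([0, 1, 2, 3]);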

Show all the rows and columns:

df1.rowIndex	rowNumber	name	age	animal
0	0	Charlie	25	cat
1	1	Alice	30	dog
2	2	Dani	20	fish
3	3	Bob	20	cat

df1.fetch({ rowStart: 0, rowEnd: 4, columns: ["name", "age", "animal"] });
// on each render, for each row (0-3) and column ("name", "age", "animal"):
df1.getRowNumber({ row });
df1.getCell({ row, column });

Show a subset of the data (2nd and 3rd rows, and two columns):

df1.rowIndex	rowNumber	name	age
1	1	Alice	30
2	2	Dani	20

df1.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df1.getRowNumber({ row });
df1.getCell({ row, column });

df2: sorted by descending age

The dataframe represents the example data, sorted by descending age (rows with the same age keep their natural order from the example data). It has 4 rows:

const orderBy = [{ column: "age", direction: "descending" }];
const df2 = sort({ dataFrame: df1, orderBy });

// Arrays and objects are compared with `toEqual` (deep equality), since `toBe` checks identity.
expect(df2.numRows).toBe(4);
expect(df2.header).toEqual(["name", "age", "animal"]);
expect(df2.toArray()).toEqual([
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
]);
expect(df2.toRowNumbers()).toEqual([1, 0, 2, 3]);
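
The row numbers expected above can be reproduced with a stable sort over the row numbers of the underlying data. The sketch below is one way to do it, not necessarily how `sort` is implemented: `sortedRowNumbers`, `compareValues`, `OrderByItem` and `Row` are hypothetical names, and cell values are assumed to be strings or numbers.

```typescript
type Direction = "ascending" | "descending";

interface OrderByItem {
  column: string;
  direction: Direction;
}

// Assumption for this sketch: cell values are strings or numbers.
type Row = Record<string, string | number>;

function compareValues(a: string | number, b: string | number): number {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

// Returns the row numbers of `rows` in the order defined by `orderBy`.
function sortedRowNumbers(rows: Row[], orderBy: OrderByItem[]): number[] {
  const rowNumbers = rows.map((_, rowNumber) => rowNumber);
  // Array.prototype.sort is stable (ES2019), so ties keep their natural order.
  return rowNumbers.sort((a, b) => {
    for (const { column, direction } of orderBy) {
      const order = compareValues(rows[a][column], rows[b][column]);
      if (order !== 0) return direction === "ascending" ? order : -order;
    }
    return 0;
  });
}

const exampleRows: Row[] = [
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
];

const descendingAge = sortedRowNumbers(exampleRows, [{ column: "age", direction: "descending" }]);
console.log(descendingAge); // row numbers 1, 0, 2, 3
```

Dani (row 2) stays ahead of Bob (row 3) because both have the same age and the sort is stable.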