@severo
Last active July 1, 2025 16:16
HighTable DataFrame V2

This is a proposal for a new DataFrame format in https://github.com/hyparam/hightable.

Goals:

  • synchronous rendering
  • events to trigger rerenders
  • support sorting
  • support sampling and shuffling
  • support selecting a slice of sorted rows
  • support rendering on every cell resolution
  • add/remove/reorder columns?
  • add/remove/reorder rows? (sampling and shuffling, but controlled)
  • update cell values?
  • infinite number of rows?

DataFrame

The dataframe is a representation of the underlying data. It has a fixed number of rows and columns, and it can be sorted, sampled, or shuffled. The result of each operation is a new dataframe.

| Name | Description | Range |
| --- | --- | --- |
| df.numRows | The number of rows in the dataframe. It might be less than in the parent dataframe in case of sampling. It cannot be infinite (at least for now). | [0, +Infinity[ |
| rowNumber | The index of the row in the underlying data. Depends on the parent dataframes or data sources. | |
| rowIndex | The index of the row in the dataframe. It might differ from rowNumber in case of sorting, sampling and/or shuffling. | [0, df.numRows - 1] |

TypeScript interface

interface DataFrame {
  numRows: number;
  header: string[];

  // Checks if the required data is available, and if not, fetches it.
  // The method is asynchronous and resolves when all the data has been fetched.
  //
  // It rejects on the first error, which can be an abort triggered by the signal (in that case it must reject with an `AbortError`).
  //
  // It's responsible for dispatching the "cell:resolve" and "rownumber:resolve" events when data has resolved
  // (ie: when some new data is available synchronously with the methods `getCell` and `getRowNumber`).
  // It can dispatch the events multiple times if the data is fetched in chunks.
  //
  // Note that it does not return the data.
  fetch(data: {
    rowStart: number; // inclusive
    rowEnd: number; // exclusive
    columns: string[]; // column names to render, along with the row numbers (always fetched). It can be empty to fetch only the row numbers.
    signal?: AbortSignal; // optional signal to abort the fetch jobs
  }): Promise<void>;

  // Returns the row number (index in the underlying data) for the given row index in the dataframe.
  // undefined if the row number is not available yet.
  getRowNumber(data: {
    row: number; // row index in the dataframe
  }): ResolvedValue<number> | undefined;

  // Returns the cell value for the given row index and column name in the dataframe.
  // undefined if the cell value is not available yet.
  getCell(data: {
    row: number; // row index in the dataframe
    column: string; // column name
  }): ResolvedValue<any> | undefined;

  // Event target to listen for changes in the dataframe.
  // It can be used to trigger rerenders when the data is updated.
  eventTarget: CustomEventTarget<DataFrameEvents>;
}

interface ResolvedValue<T> {
  value: T; // the resolved value
}

interface DataFrameEvents {
  "cell:resolve": {
    rowStart: number;
    rowEnd: number;
    columns: string[];
  };
  "rownumber:resolve": {
    rowStart: number;
    rowEnd: number;
  };
}

interface CustomEventTarget<TDetails> {
  addEventListener<TType extends keyof TDetails>(
    type: TType,
    listener: (ev: CustomEvent<TDetails[TType]>) => any,
    options?: boolean | AddEventListenerOptions
  ): void;

  removeEventListener<TType extends keyof TDetails>(
    type: TType,
    listener: (ev: CustomEvent<TDetails[TType]>) => any,
    options?: boolean | EventListenerOptions
  ): void;

  dispatchEvent<TType extends keyof TDetails>(
    ev: _TypedCustomEvent<TDetails, TType>
  ): void;
}
declare class _TypedCustomEvent<
  TDetails,
  TType extends keyof TDetails
> extends CustomEvent<TDetails[TType]> {
  constructor(
    type: TType,
    eventInitDict: { detail: TDetails[TType] } & EventInit
  );
}
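To make the contract between `fetch`, the synchronous getters and the events more concrete, here is a minimal in-memory sketch. It is not the HighTable implementation: the class name `LazyArrayDataFrame` and the `DetailEvent` helper are hypothetical, `DetailEvent` is a plain `Event` subclass standing in for `CustomEvent`, and the asynchronous source is simulated with a resolved promise. It assumes the `Event`, `EventTarget` and `DOMException` globals are available.

```typescript
// Hypothetical sketch, not the HighTable implementation: an in-memory source
// that resolves rows one at a time, as if they were loaded asynchronously.
type Row = Record<string, unknown>;
interface ResolvedValue<T> { value: T }
interface FetchOptions {
  rowStart: number; // inclusive
  rowEnd: number; // exclusive
  columns: string[];
  signal?: AbortSignal;
}

// Stand-in for CustomEvent: a plain Event carrying a `detail` payload.
class DetailEvent<T> extends Event {
  readonly detail: T;
  constructor(type: string, detail: T) {
    super(type);
    this.detail = detail;
  }
}

class LazyArrayDataFrame {
  readonly numRows: number;
  readonly header: string[];
  readonly eventTarget = new EventTarget();
  private readonly source: Row[];
  // Cache of resolved values: each cell is resolved once, then reused.
  private readonly cells = new Map<string, ResolvedValue<unknown>>();
  private readonly rowNumbers = new Map<number, ResolvedValue<number>>();

  constructor(source: Row[]) {
    this.source = source;
    this.numRows = source.length;
    this.header = source.length > 0 ? Object.keys(source[0]) : [];
  }

  async fetch({ rowStart, rowEnd, columns, signal }: FetchOptions): Promise<void> {
    for (let row = rowStart; row < rowEnd; row++) {
      const missing = columns.filter((column) => !this.cells.has(`${row}:${column}`));
      const needsRowNumber = !this.rowNumbers.has(row);
      if (missing.length === 0 && !needsRowNumber) continue; // already cached
      await Promise.resolve(); // simulate asynchronous loading
      if (signal?.aborted) throw new DOMException("Aborted", "AbortError");
      if (needsRowNumber) {
        // No sorting or sampling here, so rowNumber equals rowIndex.
        this.rowNumbers.set(row, { value: row });
        this.eventTarget.dispatchEvent(
          new DetailEvent("rownumber:resolve", { rowStart: row, rowEnd: row + 1 })
        );
      }
      if (missing.length > 0) {
        for (const column of missing) {
          this.cells.set(`${row}:${column}`, { value: this.source[row][column] });
        }
        this.eventTarget.dispatchEvent(
          new DetailEvent("cell:resolve", { rowStart: row, rowEnd: row + 1, columns: missing })
        );
      }
    }
  }

  getRowNumber({ row }: { row: number }): ResolvedValue<number> | undefined {
    return this.rowNumbers.get(row);
  }

  getCell({ row, column }: { row: number; column: string }): ResolvedValue<unknown> | undefined {
    return this.cells.get(`${row}:${column}`);
  }
}

const df = new LazyArrayDataFrame([
  { name: "Charlie", age: 25 },
  { name: "Alice", age: 30 },
]);
let cellEvents = 0;
df.eventTarget.addEventListener("cell:resolve", () => { cellEvents++; });
const before = df.getCell({ row: 1, column: "name" }); // not fetched yet
const done = df.fetch({ rowStart: 0, rowEnd: 2, columns: ["name"] });
```

In this sketch the getters never trigger loading: they only read the cache, and a listener on `eventTarget` is what tells a renderer that calling them again will return more data.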

New dataframes can be created from a dataframe using functions:

  • sort(options: { dataFrame: DataFrame, orderBy: OrderBy }): DataFrame: sorts the dataframe by the given columns and directions. The result is a new dataframe with the same number of rows, but the rows are sorted according to the given order.
  • sample(options: { dataFrame: DataFrame, numRows: number }): DataFrame: samples the dataframe down to the given number of rows. The result is a new dataframe with the given number of rows, randomly selected from the original dataframe and kept in their original order.
  • shuffle(options: { dataFrame: DataFrame }): DataFrame: shuffles the dataframe. The result is a new dataframe with the same number of rows, but the rows are randomly re-ordered.

A dataframe can also be created from an array of objects:

  • fromArray(data: Record<string, any>[]): DataFrame: creates a dataframe from the given array of objects. The objects are used as rows, and the keys of the objects are used as columns. The order of the columns is determined by the first object in the array.

The dataframe is responsible for caching the cells and row numbers, so that the data is fetched only once and reused when needed.

Examples

Let's use the following array as the underlying data:

const data = [
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
];

df1: underlying data

The dataframe represents the underlying data unmodified. It has 4 rows:

const df1 = fromArray(data);

// Note: toArray() and toRowNumbers() are not part of the interface above;
// they are used here for illustration.
expect(df1.numRows).toBe(4);
expect(df1.header).toEqual(["name", "age", "animal"]);
expect(df1.toArray()).toEqual([
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
]);
expect(df1.toRowNumbers()).toEqual([0, 1, 2, 3]);

Show all the rows and columns:

| df1.rowIndex | rowNumber | name | age | animal |
| --- | --- | --- | --- | --- |
| 0 | 0 | Charlie | 25 | cat |
| 1 | 1 | Alice | 30 | dog |
| 2 | 2 | Dani | 20 | fish |
| 3 | 3 | Bob | 20 | cat |

df1.fetch({ rowStart: 0, rowEnd: 4, columns: ["name", "age", "animal"] });
// on each render, for each row (0-3) and column ("name", "age", "animal"):
df1.getRowNumber({ row });
df1.getCell({ row, column });
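The per-render loop described in the comment above could look like the following sketch. It only relies on the two synchronous getters; the `CellSource` type, the `renderRows` function, the `PENDING` placeholder and the stub are hypothetical names introduced for illustration, and a real renderer would produce DOM or React elements rather than strings.

```typescript
// Hypothetical sketch: structural subset of the DataFrame interface.
interface ResolvedValue<T> { value: T }
interface CellSource {
  getRowNumber(data: { row: number }): ResolvedValue<number> | undefined;
  getCell(data: { row: number; column: string }): ResolvedValue<unknown> | undefined;
}

const PENDING = "(pending)"; // placeholder text, an assumption of this sketch

// Synchronously builds one line per row: the row number, then the cells.
// Unresolved values are rendered as a placeholder instead of blocking.
function renderRows(
  df: CellSource,
  rowStart: number,
  rowEnd: number,
  columns: string[]
): string[][] {
  const lines: string[][] = [];
  for (let row = rowStart; row < rowEnd; row++) {
    const rowNumber = df.getRowNumber({ row });
    const cells = columns.map((column) => {
      const cell = df.getCell({ row, column });
      return cell === undefined ? PENDING : String(cell.value);
    });
    lines.push([rowNumber === undefined ? PENDING : String(rowNumber.value), ...cells]);
  }
  return lines;
}

// Stub in which only row 0 has been resolved so far.
const stub: CellSource = {
  getRowNumber: ({ row }) => (row === 0 ? { value: 0 } : undefined),
  getCell: ({ row, column }) =>
    row === 0 ? { value: column === "name" ? "Charlie" : 25 } : undefined,
};
const rendered = renderRows(stub, 0, 2, ["name", "age"]);
```

Because resolved values are wrapped in `ResolvedValue`, a cell whose value is itself `undefined` or `null` can still be told apart from a cell that has not been fetched yet.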

Show a subset of the data (2nd and 3rd rows, and two columns):

| df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- |
| 1 | 1 | Alice | 30 |
| 2 | 2 | Dani | 20 |

df1.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df1.getRowNumber({ row });
df1.getCell({ row, column });

df2: sorted by descending age

The dataframe represents the example data, sorted by descending age (ties keep the natural order of the example data). It has 4 rows:

const orderBy = [{ column: "age", direction: "descending" }];
const df2 = sort({ dataFrame: df1, orderBy });

expect(df2.numRows).toBe(4);
expect(df2.header).toEqual(["name", "age", "animal"]);
expect(df2.toArray()).toEqual([
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
]);
expect(df2.toRowNumbers()).toEqual([1, 0, 2, 3]);

Show the name and age for the 2nd and 3rd rows:

| df2.rowIndex | df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | Charlie | 25 |
| 2 | 2 | 2 | Dani | 20 |

df2.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df2.getRowNumber({ row });
df2.getCell({ row, column });

df3: sorted by animal, then by age

The dataframe represents the example data, sorted by animal, then by age in case of a tie (then by the natural order of the example data in case of another tie). It has 4 rows:

const orderBy = [
  { column: "animal", direction: "ascending" },
  { column: "age", direction: "ascending" },
];
// Note that df3 is derived from df1, not df2.
const df3 = sort({ dataFrame: df1, orderBy });

expect(df3.numRows).toBe(4);
expect(df3.header).toEqual(["name", "age", "animal"]);
expect(df3.toArray()).toEqual([
  { name: "Bob", age: 20, animal: "cat" },
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
]);
expect(df3.toRowNumbers()).toEqual([3, 0, 1, 2]);

Show the name and age for the 2nd and 3rd rows:

| df3.rowIndex | df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | Charlie | 25 |
| 2 | 1 | 1 | Alice | 30 |
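The orderings of df2 and df3 can be reproduced with a multi-column comparison that falls back to the parent order on ties. The sketch below is only one possible way to do it: it assumes the whole column data is available in memory, and the `sortedIndexes` name and the shape of the `OrderBy` type are assumptions inferred from the examples, not a confirmed part of the proposal.

```typescript
type Row = Record<string, string | number>;
// Assumed shape, inferred from the `orderBy` values used in the examples.
type OrderBy = { column: string; direction: "ascending" | "descending" }[];

// Returns, for each row index of the sorted dataframe, the row index in the parent.
function sortedIndexes(rows: Row[], orderBy: OrderBy): number[] {
  const indexes = rows.map((_, index) => index);
  return indexes.sort((a, b) => {
    for (const { column, direction } of orderBy) {
      const x = rows[a][column];
      const y = rows[b][column];
      if (x === y) continue; // tie on this column: try the next one
      const cmp = x < y ? -1 : 1;
      return direction === "ascending" ? cmp : -cmp;
    }
    return a - b; // tie on every column: keep the parent order
  });
}

const data: Row[] = [
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Alice", age: 30, animal: "dog" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
];

const byAgeDescending = sortedIndexes(data, [
  { column: "age", direction: "descending" },
]); // [1, 0, 2, 3], as in df2

const byAnimalThenAge = sortedIndexes(data, [
  { column: "animal", direction: "ascending" },
  { column: "age", direction: "ascending" },
]); // [3, 0, 1, 2], as in df3
```

Sorting an array of indexes rather than the rows themselves keeps the link to the parent dataframe, which is what the df1.rowIndex column of the tables expresses.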

df4: sampled data

The dataframe represents a sample of the example data with 3 rows instead of 4. Let's say it removes Alice:

const df4 = sample({ dataFrame: df1, numRows: 3 });

expect(df4.numRows).toBe(3);
expect(df4.header).toEqual(["name", "age", "animal"]);
expect(df4.toArray()).toEqual([
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
]);
expect(df4.toRowNumbers()).toEqual([0, 2, 3]);

Show the name and age for the 2nd and 3rd rows:

| df4.rowIndex | df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- | --- |
| 1 | 2 | 2 | Dani | 20 |
| 2 | 3 | 3 | Bob | 20 |

df4.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df4.getRowNumber({ row });
df4.getCell({ row, column });
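One way to pick the sampled rows while keeping them in their original order is selection sampling (Knuth's Algorithm S). The proposal does not say which algorithm `sample` uses, so the sketch below is an assumption; `sampleIndexes` and its injectable `random` parameter are hypothetical names, the latter added only to make the example reproducible.

```typescript
// Picks `numRows` distinct parent row indexes, returned in ascending order.
// `random` must return values in [0, 1); it defaults to Math.random.
function sampleIndexes(
  parentNumRows: number,
  numRows: number,
  random: () => number = Math.random
): number[] {
  if (!Number.isInteger(numRows) || numRows < 0 || numRows > parentNumRows) {
    throw new RangeError("numRows must be an integer between 0 and parentNumRows");
  }
  const selected: number[] = [];
  let remaining = numRows;
  for (let index = 0; index < parentNumRows && remaining > 0; index++) {
    // Keep this index with probability remaining / (parentNumRows - index).
    if (random() * (parentNumRows - index) < remaining) {
      selected.push(index);
      remaining--;
    }
  }
  return selected;
}

// Scripted random values that reproduce the df4 example (Alice is dropped).
const scripted = [0.1, 0.9, 0.5, 0.5];
let call = 0;
const df4Indexes = sampleIndexes(4, 3, () => scripted[call++]); // [0, 2, 3]
```

The returned array maps each df4.rowIndex to its df1.rowIndex, and each index is visited once, so the parent dataframe never has to be copied.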

df5: sampled shuffled data

The dataframe represents the sampled data (df4) but the order is shuffled: let's say Dani, Bob, Charlie. It has 3 rows.

const df5 = shuffle({ dataFrame: df4 });

expect(df5.numRows).toBe(3);
expect(df5.header).toEqual(["name", "age", "animal"]);
expect(df5.toArray()).toEqual([
  { name: "Dani", age: 20, animal: "fish" },
  { name: "Bob", age: 20, animal: "cat" },
  { name: "Charlie", age: 25, animal: "cat" },
]);
expect(df5.toRowNumbers()).toEqual([2, 3, 0]);

Show the name and age for the 2nd and 3rd rows:

| df5.rowIndex | df4.rowIndex | df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 3 | Bob | 20 |
| 2 | 0 | 0 | 0 | Charlie | 25 |

df5.fetch({ rowStart: 1, rowEnd: 3, columns: ["name", "age"] });
// on each render, for each row (1-2) and column ("name", "age"):
df5.getRowNumber({ row });
df5.getCell({ row, column });
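A shuffle can be represented as a random permutation of the parent row indexes. The sketch below uses the Fisher-Yates algorithm, which is an assumption since the proposal does not name one; `shuffledIndexes` and its injectable `random` parameter are hypothetical.

```typescript
// Returns a random permutation of [0, numRows), built with Fisher-Yates.
// `random` must return values in [0, 1); it defaults to Math.random.
function shuffledIndexes(numRows: number, random: () => number = Math.random): number[] {
  const indexes = Array.from({ length: numRows }, (_, index) => index);
  for (let i = numRows - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // 0 <= j <= i
    [indexes[i], indexes[j]] = [indexes[j], indexes[i]];
  }
  return indexes;
}

// With a random source that always returns 0, three rows are permuted
// to [1, 2, 0], which happens to be the df5 order: Dani, Bob, Charlie.
const df5Indexes = shuffledIndexes(3, () => 0);
```

Each value is a df4.rowIndex, so `df5Indexes[1] === 2` matches the first row of the table above (Bob).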

df6: sorted sampled and shuffled data

The dataframe represents the sampled and shuffled data (df5) but sorted by animal. It has 3 rows.

const orderBy = [{ column: "animal", direction: "ascending" }];
const df6 = sort({ dataFrame: df5, orderBy });

expect(df6.numRows).toBe(3);
expect(df6.header).toEqual(["name", "age", "animal"]);
expect(df6.toArray()).toEqual([
  { name: "Bob", age: 20, animal: "cat" },
  { name: "Charlie", age: 25, animal: "cat" },
  { name: "Dani", age: 20, animal: "fish" },
]);
expect(df6.toRowNumbers()).toEqual([3, 0, 2]);

Show the name and age for the 2nd and 3rd rows:

| df6.rowIndex | df5.rowIndex | df4.rowIndex | df1.rowIndex | rowNumber | name | age |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 0 | 0 | 0 | Charlie | 25 |
| 2 | 0 | 1 | 2 | 2 | Dani | 20 |
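The chain of index columns in the table above suggests one way to compute row numbers: each derived dataframe keeps, for every row, the row index in its parent, and the row number is obtained by walking the chain down to the underlying data. The following sketch restates the example values; the variable names and the `toRowNumber` function are hypothetical, and a real implementation might resolve these maps lazily.

```typescript
// Parent row index for every row of each derived dataframe (values from the examples).
const df4ToDf1 = [0, 2, 3]; // sample: Alice (row 1) removed
const df5ToDf4 = [1, 2, 0]; // shuffle: Dani, Bob, Charlie
const df6ToDf5 = [1, 2, 0]; // sort by animal: Bob, Charlie, Dani

// Walks the chain from the outermost dataframe down to the underlying data.
function toRowNumber(rowIndex: number, chain: number[][]): number {
  return chain.reduce((index, parentIndexes) => {
    const parent = parentIndexes[index];
    if (parent === undefined) {
      throw new RangeError(`row index ${index} is out of range`);
    }
    return parent;
  }, rowIndex);
}

const df6Chain = [df6ToDf5, df5ToDf4, df4ToDf1];
const df6RowNumbers = [0, 1, 2].map((row) => toRowNumber(row, df6Chain)); // [3, 0, 2]
const df5RowNumbers = [0, 1, 2].map((row) => toRowNumber(row, [df5ToDf4, df4ToDf1])); // [2, 3, 0]
```

Both results agree with the `toRowNumbers()` expectations given for df6 and df5.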

HighTable - rows selection

In HighTable, rows can be selected by the user or programmatically. The selection is represented by row numbers, i.e. the index of each selected row in the underlying data.

As the row number is not available until the data is fetched, HighTable has to manage the selection state:

  • if HighTable is passed a selection and not all the row numbers are available, it will put all the selection controls in "pending mode" (e.g. disabled, with an unknown state) until the data is fetched (or the fetch is aborted).
  • when shift-clicking to select a range of rows, the row numbers of some rows in the range might not be available yet. If so, once the gesture occurs, all the selection controls will be put in "pending mode" until the data is fetched.
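The pending rule can be sketched as a pure function from the selection and the currently resolved row numbers to the state of each control. This is only an illustration under assumptions: the `ControlState` type and the `selectionControlStates` function are hypothetical, and the check is limited to a range of rows (for example the rendered ones), which the text above does not specify.

```typescript
interface ResolvedValue<T> { value: T }
type ControlState = "selected" | "unselected" | "pending";

// Computes the state of the selection control of each row in [rowStart, rowEnd).
// If any row number in the range is unresolved, every control is pending.
function selectionControlStates(
  selection: ReadonlySet<number>, // selected row numbers
  getRowNumber: (data: { row: number }) => ResolvedValue<number> | undefined,
  rowStart: number,
  rowEnd: number
): ControlState[] {
  const rowNumbers: number[] = [];
  for (let row = rowStart; row < rowEnd; row++) {
    const resolved = getRowNumber({ row });
    if (resolved === undefined) {
      return Array.from({ length: rowEnd - rowStart }, (): ControlState => "pending");
    }
    rowNumbers.push(resolved.value);
  }
  return rowNumbers.map((rowNumber) => (selection.has(rowNumber) ? "selected" : "unselected"));
}

// Row numbers of df5 (Dani, Bob, Charlie), with Bob (row number 3) selected.
const df5Numbers = [2, 3, 0];
const selection = new Set([3]);
const allResolved = selectionControlStates(
  selection, ({ row }) => ({ value: df5Numbers[row] }), 0, 3
); // ["unselected", "selected", "unselected"]
const oneMissing = selectionControlStates(
  selection, ({ row }) => (row === 2 ? undefined : { value: df5Numbers[row] }), 0, 3
); // ["pending", "pending", "pending"]
```

Re-running the function from a "rownumber:resolve" listener would move the controls out of pending mode once the missing row numbers arrive.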
@severo
severo commented Jul 1, 2025

Hmmmm. I realized that sorting is not the same as sampling and shuffling.

We will keep having the option to sort any dataframe, while sampling, shuffling, or versioning (think iceberg versions) should be done by creating a different dataframe.