Skip to content

Instantly share code, notes, and snippets.

@HamdaanAliQuatil
Created March 19, 2026 00:10
Show Gist options
  • Select an option

  • Save HamdaanAliQuatil/1bf1d9b4e76b5ff5a31bcccf55b968b5 to your computer and use it in GitHub Desktop.

Select an option

Save HamdaanAliQuatil/1bf1d9b4e76b5ff5a31bcccf55b968b5 to your computer and use it in GitHub Desktop.

Design Critique of Labeller.java

File reviewed: common/core/src/main/java/zingg/common/core/executor/Labeller.java

I read this together with:

  • ZinggBase.java
  • TrainingDataModel.java
  • LabelDataViewHelper.java
  • LabelUpdater.java
  • IZinggModelInfo.java
  • ITrainingDataModel.java

TL;DR

After reading Labeller and the classes around it, I think the main issue is boundary placement. The class handles phase orchestration, record retrieval, CLI interaction, and default collaborator wiring all in one place.

Right now Labeller is doing all of these:

  • phase orchestration
  • record loading/filtering
  • CLI session control
  • raw console input parsing
  • state update coordination
  • default dependency construction

That makes the class hard to test and harder to extend than it needs to be.

The cleanest fix is not a rewrite. I would keep Labeller as the phase orchestrator, keep TrainingDataModel for stats/output work, keep LabelDataViewHelper for presentation, and extract one session-level abstraction for the interactive labeling flow.

I also think there is a difference between intentional framework layering and actual duplication:

  • ZinggBase looks like the intended default implementation of the IZinggModelInfo contract
  • Labeller.getUnmarkedRecords() looks like a drifted override
  • TrainingDataModel contains the clearest real duplicate implementation
  • Labeller and LabelUpdater share enough workflow that a missing abstraction is showing through

What Labeller Does Today

At a high level, execute() does the whole label phase end to end:

  1. reads existing marked records
  2. sets current stats
  3. loads unmarked records
  4. preprocesses them
  5. runs the CLI labeling loop
  6. post-processes the updated rows
  7. writes the output

That path is in Labeller.java:34-50.

The issue is that the class does not stop at orchestration.

getUnmarkedRecords() in Labeller.java:54-78 is doing record retrieval and filtering, but it also updates stats as a side effect.

processRecordsCli(...) in Labeller.java:80-142 is doing the actual interaction workflow:

  • cluster discovery
  • iteration
  • prompt building
  • display
  • input
  • stats update
  • output accumulation

readCliInput() in Labeller.java:152-164 is handling raw terminal input directly.

And getTrainingDataModel() / getLabelDataViewHelper() in Labeller.java:167-189 are still choosing concrete implementations from inside the abstract executor.

So the class is sitting across multiple abstraction layers at once. That is the main design issue.

Bigger Picture

Labeller matters because the Spark side is thin. The real behavior lives here.

The rough execution path is:

  1. client resolves the phase
  2. SparkZFactory selects SparkLabeller
  3. SparkLabeller mostly inherits behavior from Labeller

So if this class is hard to extend, the whole labeling phase is hard to extend.

The nearby classes already show a better shape trying to emerge:

  • TrainingDataModel already owns stats, record mutation, and writes
  • LabelDataViewHelper already owns display and prompt-related behavior
  • LabellerUtil already owns a post-processing step

That is why I would not rewrite the subsystem. I would fix the boundaries.

Current Boundary Model

This is the current boundary as I read it from the code. It is intentionally a boundary sketch, not a full repo diagram.

classDiagram
    class ZinggBase
    class IPreprocessors
    class Labeller {
        +execute()
        +getUnmarkedRecords()
        +processRecordsCli(lines)
        ~readCliInput()
        +getTrainingDataModel()
        +getLabelDataViewHelper()
        #getDfObjectUtil()
    }
    class LabelUpdater
    class ITrainingDataModel
    class TrainingDataModel
    class ILabelDataViewHelper
    class LabelDataViewHelper
    class LabellerUtil

    ZinggBase <|-- Labeller
    IPreprocessors <|.. Labeller
    Labeller <|-- LabelUpdater

    Labeller --> ITrainingDataModel
    Labeller --> ILabelDataViewHelper
    Labeller ..> TrainingDataModel : lazy default
    Labeller ..> LabelDataViewHelper : lazy default
    Labeller --> LabellerUtil
Loading

Findings

1. Labeller mixes phase logic with interaction logic

This is the biggest issue.

execute() is the phase entrypoint in Labeller.java:34-50. That part makes sense.

But the same class also owns the full CLI loop in Labeller.java:80-142 and raw input parsing in Labeller.java:152-164.

I would expect the phase class to answer:

  • what steps make up labeling

I would not expect it to answer:

  • how one interactive session runs
  • how terminal input is validated

Those are different reasons to change.

Why it matters:

  • testing gets harder because orchestration and interaction are coupled
  • adding a non-CLI reviewer flow gets harder because CLI behavior is baked into the executor
  • small UX changes to the label flow require editing the phase class itself

2. Record ownership is split in a confusing way

This part needs a careful read because not all repetition here is bad.

IZinggModelInfo defines:

  • getMarkedRecords()
  • getUnmarkedRecords()
  • marked-record stat helpers

See IZinggModelInfo.java:3-17.

ZinggBase then provides the default implementation in ZinggBase.java:94-137.

That looks intentional. I would not call that duplication in the negative sense. It looks like the base executor implementing a broad framework contract.

The messy part comes after that:

  • Labeller overrides getUnmarkedRecords() in Labeller.java:54-78
  • TrainingDataModel redefines record access and stat helpers in TrainingDataModel.java:118-160
  • ITrainingDataModel does not ask for those methods at all in ITrainingDataModel.java:6-27

At that point there are too many places that can claim ownership of the same concept.

That matters because the next change to marked/unmarked record resolution now has three plausible homes:

  • base executor
  • phase executor
  • training data model

That is how behavior drift starts.

3. Labeller.getUnmarkedRecords() looks like drift, not a strong specialization

Labeller.getUnmarkedRecords() mostly repeats the pattern from ZinggBase.getUnmarkedRecords():

  • read unmarked pipe
  • read marked pipe
  • anti-join out already labeled clusters

Compare:

  • ZinggBase.java:103-117
  • Labeller.java:54-78

The override adds two things:

  • different error handling
  • getTrainingDataModel().setMarkedRecordsStat(markedRecords) at Labeller.java:69

But execute() already does stat setup earlier at Labeller.java:38.

So now a method named like a data read helper also mutates session state, and it does that redundantly.

That is why I read this as drift instead of clean polymorphism.

4. TrainingDataModel has the clearest real duplication

This is the part I would call real duplicate implementation with the most confidence.

TrainingDataModel already has useful, focused responsibilities:

  • setMarkedRecordsStat(...)
  • updateRecords(...)
  • updateLabellerStat(...)
  • writeLabelledOutput(...)

See TrainingDataModel.java:30-85.

But then the same class also redefines:

  • getMarkedRecords()
  • getUnmarkedRecords()
  • all the marked-record stat helpers

See TrainingDataModel.java:118-160.

Those methods already exist in ZinggBase.java:94-137, and ITrainingDataModel does not require them.

One more detail makes this worse: the constructor in TrainingDataModel.java:23-27 sets context and client options, but not args, and the duplicated record-access methods depend on args.

So this is not just extra API surface. Some of it looks actively unsafe or at least misleading.

5. Labeller and LabelUpdater are not copy-paste twins, but they are still duplicating workflow

I would be careful not to overstate this.

The flows are different:

  • Labeller walks newly found pairs
  • LabelUpdater lets the user pick an existing cluster and relabel it

So I would not describe them as the same method copied twice.

But the shared scaffolding is obvious:

  • stat initialization and printing
  • ZidAndFieldDefSelector
  • pair display
  • user decision capture
  • stats update
  • updated-record accumulation
  • quit handling

Relevant code:

  • Labeller.java:80-142
  • LabelUpdater.java:36-140

To me, that points to a missing session abstraction, not a bad duplication hygiene. Debatable and open for discussion.

6. Dependency inversion is only half done

Labeller stores collaborators behind interfaces:

  • ITrainingDataModel
  • ILabelDataViewHelper

and it even provides setters.

But the abstract executor still instantiates the concrete defaults itself in Labeller.java:167-189.

That means the code wants the flexibility of indirection, but not quite enough to make the dependency boundary explicit.

This is not as serious as the abstraction-level problem, but it still matters:

  • tests need to push against a built-in default path
  • alternate interaction models have to work around the executor's construction choices
  • the abstraction is weaker than it looks from the field types alone

What I Would Keep

I would keep the general shape of the subsystem. I would not throw it away.

Keep Labeller as the phase entrypoint

The phase still needs one place that says:

  • get records
  • preprocess
  • run labeling
  • post-process
  • write

That should stay in Labeller. The problem is that Labeller currently goes deeper than that.

Keep TrainingDataModel for state and output work

This class already has a useful center of gravity:

  • stats
  • record mutation
  • output write

That is worth keeping.

Keep LabelDataViewHelper for presentation

This class already handles:

  • current pair extraction support
  • score/prediction messages
  • display formatting
  • stats printing

I would keep that direction. I would just stop short of letting it own the whole session.

Keep the post-processing extraction

LabellerUtil.postProcessLabel(...) is actually one of the cleaner separations in this flow. It does one distinct thing near the end of the pipeline.

Proposed Design

I would move toward this split:

Proposed Boundary Model

classDiagram
    class Labeller {
        +execute()
    }

    class LabelRecordSource {
        +getMarkedRecords()
        +getUnmarkedRecords()
    }

    class LabelSession {
        +run(lines)
    }

    class LabelInputHandler {
        +readDecision()
    }

    class TrainingDataModel {
        +setMarkedRecordsStat(marked)
        +updateRecords(...)
        +updateLabellerStat(...)
        +writeLabelledOutput(...)
    }

    class LabelDataViewHelper {
        +displayRecords(...)
        +printMarkedRecordsStat(...)
        +getCurrentPair(...)
        +getMsg1(...)
        +getMsg2(...)
    }

    class LabellerUtil

    Labeller --> LabelRecordSource
    Labeller --> LabelSession
    Labeller --> TrainingDataModel
    Labeller --> LabellerUtil
    LabelSession --> LabelDataViewHelper
    LabelSession --> LabelInputHandler
    LabelSession --> TrainingDataModel
Loading

Under that model, I would split responsibilities like this:

  • Labeller phase orchestration only
  • LabelRecordSource marked/unmarked record retrieval and filtering
  • LabelSession interactive labeling workflow
  • LabelInputHandler CLI input parsing and validation
  • TrainingDataModel stats, record mutation, output writing
  • LabelDataViewHelper presentation and prompt/message formatting
  • LabellerUtil or LabelPostProcessor post-processing before write

The main design choice here is deliberate: I would add only one meaningful new abstraction first, which is LabelSession. That is the missing concept the current code keeps hinting at.

Everything else can be lighter weight.

How I Would Write It

If I were writing this today, I would want the class boundaries to reflect intent directly.

Something like:

class Labeller {
    void execute() throws ZinggClientException;
}

interface LabelRecordSource<D, R, C> {
    ZFrame<D, R, C> getMarkedRecords();
    ZFrame<D, R, C> getUnmarkedRecords();
}

interface LabelSession<D, R, C> {
    ZFrame<D, R, C> run(ZFrame<D, R, C> candidates) throws ZinggClientException;
}

interface LabelInputHandler {
    int readDecision() throws ZinggClientException;
}

The exact names are less important than the split:

  • phase object decides the sequence
  • session object decides how interaction runs
  • record source decides how candidates are resolved

That gives each class one dominant reason to change.

Migration Plan

I would refactor this in small steps.

Step 1. Pick one owner for record retrieval

Either:

  • keep record retrieval in ZinggBase

or:

  • move it into a small LabelRecordSource

But I would not leave it spread across ZinggBase, Labeller, and TrainingDataModel.

As part of that step, I would remove duplicate record-access/stat-helper methods from TrainingDataModel.

Step 2. Extract raw input handling

Move readCliInput() out of Labeller into a tiny CLI input helper.

This is small but useful because it makes the terminal boundary explicit immediately.

Step 3. Extract the session loop

Move the body of processRecordsCli(...) into LabelSession.

After that, Labeller.execute() becomes much cleaner:

  • load records
  • preprocess
  • run session
  • post-process
  • write

Step 4. Reuse session scaffolding in LabelUpdater

I would not force Labeller and LabelUpdater into one identical method. Their entry flows are genuinely different.

But I would make them share:

  • display/prompt flow
  • decision capture
  • state update scaffolding

That is the right level of reuse here.

Step 5. Clean up construction

Stop instantiating TrainingDataModel and LabelDataViewHelper inside the abstract executor. Push default construction into a factory, constructor path, or context.

Step 6. Tighten the interfaces

IZinggModelInfo already has a TODO need to revisit this interface comment. I think that comment is justified.

Broad interfaces are part of why responsibility has spread in a blurry way here.

Final Take

I think Labeller has the right role and the wrong boundary.

The code already contains the pieces of a cleaner design:

  • a phase class
  • a state/output helper
  • a presentation helper
  • a post-processing helper

What is missing is one explicit abstraction for the interaction session, plus a decision about who really owns record retrieval.

That is why I would treat this as a boundary-fixing refactor, not a rewrite.

Evidence Summary

Main files reviewed:

  • common/core/src/main/java/zingg/common/core/executor/Labeller.java
  • common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
  • common/core/src/main/java/zingg/common/core/executor/TrainingDataModel.java
  • common/core/src/main/java/zingg/common/core/executor/LabelDataViewHelper.java
  • common/core/src/main/java/zingg/common/core/executor/ZinggBase.java
  • common/client/src/main/java/zingg/common/client/IZinggModelInfo.java
  • common/client/src/main/java/zingg/common/client/ITrainingDataModel.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment