Design Critique of `Labeller.java`

File reviewed: common/core/src/main/java/zingg/common/core/executor/Labeller.java

I read this together with:

ZinggBase.java
TrainingDataModel.java
LabelDataViewHelper.java
LabelUpdater.java
IZinggModelInfo.java
ITrainingDataModel.java

TL;DR

After reading Labeller and the classes around it, I think the main issue is boundary placement. The class handles phase orchestration, record retrieval, CLI interaction, and default collaborator wiring all in one place.

Right now Labeller is doing all of these:

phase orchestration
record loading/filtering
CLI session control
raw console input parsing
state update coordination
default dependency construction

That makes the class hard to test and harder to extend than it needs to be.

The cleanest fix is not a rewrite. I would keep Labeller as the phase orchestrator, keep TrainingDataModel for stats/output work, keep LabelDataViewHelper for presentation, and extract one session-level abstraction for the interactive labeling flow.

I also think there is a difference between intentional framework layering and actual duplication:

ZinggBase looks like the intended default implementation of the IZinggModelInfo contract
Labeller.getUnmarkedRecords() looks like a drifted override
TrainingDataModel contains the clearest real duplicate implementation
Labeller and LabelUpdater share enough workflow that a missing abstraction is showing through

What `Labeller` Does Today

At a high level, execute() does the whole label phase end to end:

reads existing marked records
sets current stats
loads unmarked records
preprocesses them
runs the CLI labeling loop
post-processes the updated rows
writes the output

That path is in Labeller.java:34-50.

The issue is that the class does not stop at orchestration.

getUnmarkedRecords() in Labeller.java:54-78 is doing record retrieval and filtering, but it also updates stats as a side effect.

processRecordsCli(...) in Labeller.java:80-142 is doing the actual interaction workflow:

cluster discovery
iteration
prompt building
display
input
stats update
output accumulation

readCliInput() in Labeller.java:152-164 is handling raw terminal input directly.

And getTrainingDataModel() / getLabelDataViewHelper() in Labeller.java:167-189 are still choosing concrete implementations from inside the abstract executor.

So the class is sitting across multiple abstraction layers at once. That is the main design issue.

Bigger Picture

Labeller matters because the Spark side is thin. The real behavior lives here.

The rough execution path is:

client resolves the phase
SparkZFactory selects SparkLabeller
SparkLabeller mostly inherits behavior from Labeller

So if this class is hard to extend, the whole labeling phase is hard to extend.

The nearby classes already show a better shape trying to emerge:

TrainingDataModel already owns stats, record mutation, and writes
LabelDataViewHelper already owns display and prompt-related behavior
LabellerUtil already owns a post-processing step

That is why I would not rewrite the subsystem. I would fix the boundaries.

Current Boundary Model

This is the current boundary as I read it from the code. It is intentionally a boundary sketch, not a full repo diagram.

classDiagram
    class ZinggBase
    class IPreprocessors
    class Labeller {
        +execute()
        +getUnmarkedRecords()
        +processRecordsCli(lines)
        ~readCliInput()
        +getTrainingDataModel()
        +getLabelDataViewHelper()
        #getDfObjectUtil()
    }
    class LabelUpdater
    class ITrainingDataModel
    class TrainingDataModel
    class ILabelDataViewHelper
    class LabelDataViewHelper
    class LabellerUtil

    ZinggBase <|-- Labeller
    IPreprocessors <|.. Labeller
    Labeller <|-- LabelUpdater

    Labeller --> ITrainingDataModel
    Labeller --> ILabelDataViewHelper
    Labeller ..> TrainingDataModel : lazy default
    Labeller ..> LabelDataViewHelper : lazy default
    Labeller --> LabellerUtil

Findings

1. `Labeller` mixes phase logic with interaction logic

This is the biggest issue.

execute() is the phase entrypoint in Labeller.java:34-50. That part makes sense.

But the same class also owns the full CLI loop in Labeller.java:80-142 and raw input parsing in Labeller.java:152-164.

I would expect the phase class to answer:

what steps make up labeling

I would not expect it to answer:

how one interactive session runs
how terminal input is validated

Those are different reasons to change.

Why it matters:

testing gets harder because orchestration and interaction are coupled
adding a non-CLI reviewer flow gets harder because CLI behavior is baked into the executor
small UX changes to the label flow require editing the phase class itself

2. Record ownership is split in a confusing way

This part needs a careful read because not all repetition here is bad.

IZinggModelInfo defines:

getMarkedRecords()
getUnmarkedRecords()
marked-record stat helpers

See IZinggModelInfo.java:3-17.

ZinggBase then provides the default implementation in ZinggBase.java:94-137.

That looks intentional. I would not call that duplication in the negative sense. It looks like the base executor implementing a broad framework contract.

The messy part comes after that:

Labeller overrides getUnmarkedRecords() in Labeller.java:54-78
TrainingDataModel redefines record access and stat helpers in TrainingDataModel.java:118-160
ITrainingDataModel does not ask for those methods at all in ITrainingDataModel.java:6-27

At that point there are too many places that can claim ownership of the same concept.

That matters because the next change to marked/unmarked record resolution now has three plausible homes:

base executor
phase executor
training data model

That is how behavior drift starts.

3. `Labeller.getUnmarkedRecords()` looks like drift, not a strong specialization

Labeller.getUnmarkedRecords() mostly repeats the pattern from ZinggBase.getUnmarkedRecords():

read unmarked pipe
read marked pipe
anti-join out already labeled clusters

Compare:

ZinggBase.java:103-117
Labeller.java:54-78

The override adds two things:

different error handling
getTrainingDataModel().setMarkedRecordsStat(markedRecords) at Labeller.java:69

But execute() already does stat setup earlier at Labeller.java:38.

So now a method named like a data read helper also mutates session state, and it does that redundantly.

That is why I read this as drift instead of clean polymorphism.

4. `TrainingDataModel` has the clearest real duplication

This is the part I would call real duplicate implementation with the most confidence.

TrainingDataModel already has useful, focused responsibilities:

setMarkedRecordsStat(...)
updateRecords(...)
updateLabellerStat(...)
writeLabelledOutput(...)

See TrainingDataModel.java:30-85.

But then the same class also redefines:

getMarkedRecords()
getUnmarkedRecords()
all the marked-record stat helpers

See TrainingDataModel.java:118-160.

Those methods already exist in ZinggBase.java:94-137, and ITrainingDataModel does not require them.

One more detail makes this worse: the constructor in TrainingDataModel.java:23-27 sets context and client options, but not args, and the duplicated record-access methods depend on args.

So this is not just extra API surface. Some of it looks actively unsafe or at least misleading.

5. `Labeller` and `LabelUpdater` are not copy-paste twins, but they are still duplicating workflow

I would be careful not to overstate this.

The flows are different:

Labeller walks newly found pairs
LabelUpdater lets the user pick an existing cluster and relabel it

So I would not describe them as the same method copied twice.

But the shared scaffolding is obvious:

stat initialization and printing
ZidAndFieldDefSelector
pair display
user decision capture
stats update
updated-record accumulation
quit handling

Relevant code:

Labeller.java:80-142
LabelUpdater.java:36-140

To me, that points to a missing session abstraction, not a bad duplication hygiene. Debatable and open for discussion.

6. Dependency inversion is only half done

Labeller stores collaborators behind interfaces:

ITrainingDataModel
ILabelDataViewHelper

and it even provides setters.

But the abstract executor still instantiates the concrete defaults itself in Labeller.java:167-189.

That means the code wants the flexibility of indirection, but not quite enough to make the dependency boundary explicit.

This is not as serious as the abstraction-level problem, but it still matters:

tests need to push against a built-in default path
alternate interaction models have to work around the executor's construction choices
the abstraction is weaker than it looks from the field types alone

What I Would Keep

I would keep the general shape of the subsystem. I would not throw it away.

Keep `Labeller` as the phase entrypoint

The phase still needs one place that says:

get records
preprocess
run labeling
post-process
write

That should stay in Labeller. The problem is that Labeller currently goes deeper than that.

Keep `TrainingDataModel` for state and output work

This class already has a useful center of gravity:

stats
record mutation
output write

That is worth keeping.

Keep `LabelDataViewHelper` for presentation

This class already handles:

current pair extraction support
score/prediction messages
display formatting
stats printing

I would keep that direction. I would just stop short of letting it own the whole session.

Keep the post-processing extraction

LabellerUtil.postProcessLabel(...) is actually one of the cleaner separations in this flow. It does one distinct thing near the end of the pipeline.

Proposed Design

I would move toward this split:

Proposed Boundary Model

classDiagram
    class Labeller {
        +execute()
    }

    class LabelRecordSource {
        +getMarkedRecords()
        +getUnmarkedRecords()
    }

    class LabelSession {
        +run(lines)
    }

    class LabelInputHandler {
        +readDecision()
    }

    class TrainingDataModel {
        +setMarkedRecordsStat(marked)
        +updateRecords(...)
        +updateLabellerStat(...)
        +writeLabelledOutput(...)
    }

    class LabelDataViewHelper {
        +displayRecords(...)
        +printMarkedRecordsStat(...)
        +getCurrentPair(...)
        +getMsg1(...)
        +getMsg2(...)
    }

    class LabellerUtil

    Labeller --> LabelRecordSource
    Labeller --> LabelSession
    Labeller --> TrainingDataModel
    Labeller --> LabellerUtil
    LabelSession --> LabelDataViewHelper
    LabelSession --> LabelInputHandler
    LabelSession --> TrainingDataModel

Under that model, I would split responsibilities like this:

Labeller phase orchestration only
LabelRecordSource marked/unmarked record retrieval and filtering
LabelSession interactive labeling workflow
LabelInputHandler CLI input parsing and validation
TrainingDataModel stats, record mutation, output writing
LabelDataViewHelper presentation and prompt/message formatting
LabellerUtil or LabelPostProcessor post-processing before write

The main design choice here is deliberate: I would add only one meaningful new abstraction first, which is LabelSession. That is the missing concept the current code keeps hinting at.

Everything else can be lighter weight.

How I Would Write It

If I were writing this today, I would want the class boundaries to reflect intent directly.

Something like:

class Labeller {
    void execute() throws ZinggClientException;
}

interface LabelRecordSource<D, R, C> {
    ZFrame<D, R, C> getMarkedRecords();
    ZFrame<D, R, C> getUnmarkedRecords();
}

interface LabelSession<D, R, C> {
    ZFrame<D, R, C> run(ZFrame<D, R, C> candidates) throws ZinggClientException;
}

interface LabelInputHandler {
    int readDecision() throws ZinggClientException;
}

The exact names are less important than the split:

phase object decides the sequence
session object decides how interaction runs
record source decides how candidates are resolved

That gives each class one dominant reason to change.

Migration Plan

I would refactor this in small steps.

Step 1. Pick one owner for record retrieval

Either:

keep record retrieval in ZinggBase

or:

move it into a small LabelRecordSource

But I would not leave it spread across ZinggBase, Labeller, and TrainingDataModel.

As part of that step, I would remove duplicate record-access/stat-helper methods from TrainingDataModel.

Step 2. Extract raw input handling

Move readCliInput() out of Labeller into a tiny CLI input helper.

This is small but useful because it makes the terminal boundary explicit immediately.

Step 3. Extract the session loop

Move the body of processRecordsCli(...) into LabelSession.

After that, Labeller.execute() becomes much cleaner:

load records
preprocess
run session
post-process
write

Step 4. Reuse session scaffolding in `LabelUpdater`

I would not force Labeller and LabelUpdater into one identical method. Their entry flows are genuinely different.

But I would make them share:

display/prompt flow
decision capture
state update scaffolding

That is the right level of reuse here.

Step 5. Clean up construction

Stop instantiating TrainingDataModel and LabelDataViewHelper inside the abstract executor. Push default construction into a factory, constructor path, or context.

Step 6. Tighten the interfaces

IZinggModelInfo already has a TODO need to revisit this interface comment. I think that comment is justified.

Broad interfaces are part of why responsibility has spread in a blurry way here.

Final Take

I think Labeller has the right role and the wrong boundary.

The code already contains the pieces of a cleaner design:

a phase class
a state/output helper
a presentation helper
a post-processing helper

What is missing is one explicit abstraction for the interaction session, plus a decision about who really owns record retrieval.

That is why I would treat this as a boundary-fixing refactor, not a rewrite.

Evidence Summary

Main files reviewed:

common/core/src/main/java/zingg/common/core/executor/Labeller.java
common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
common/core/src/main/java/zingg/common/core/executor/TrainingDataModel.java
common/core/src/main/java/zingg/common/core/executor/LabelDataViewHelper.java
common/core/src/main/java/zingg/common/core/executor/ZinggBase.java
common/client/src/main/java/zingg/common/client/IZinggModelInfo.java
common/client/src/main/java/zingg/common/client/ITrainingDataModel.java

HamdaanAliQuatil/LABELLER_DESIGN_CRITIQUE.md

Select an option

No results found

Select an option

No results found

Design Critique of `Labeller.java`

TL;DR

What `Labeller` Does Today

Bigger Picture

Current Boundary Model

Findings

1. `Labeller` mixes phase logic with interaction logic

2. Record ownership is split in a confusing way

3. `Labeller.getUnmarkedRecords()` looks like drift, not a strong specialization

4. `TrainingDataModel` has the clearest real duplication

5. `Labeller` and `LabelUpdater` are not copy-paste twins, but they are still duplicating workflow

6. Dependency inversion is only half done

What I Would Keep

Keep `Labeller` as the phase entrypoint

Keep `TrainingDataModel` for state and output work

Keep `LabelDataViewHelper` for presentation

Keep the post-processing extraction

Proposed Design

Proposed Boundary Model

How I Would Write It

Migration Plan

Step 1. Pick one owner for record retrieval

Step 2. Extract raw input handling

Step 3. Extract the session loop

Step 4. Reuse session scaffolding in `LabelUpdater`

Step 5. Clean up construction

Step 6. Tighten the interfaces

Final Take

Evidence Summary

HamdaanAliQuatil/LABELLER_DESIGN_CRITIQUE.md

Design Critique of Labeller.java

TL;DR

What Labeller Does Today

Bigger Picture

Current Boundary Model

Findings

1. Labeller mixes phase logic with interaction logic

2. Record ownership is split in a confusing way

3. Labeller.getUnmarkedRecords() looks like drift, not a strong specialization

4. TrainingDataModel has the clearest real duplication

5. Labeller and LabelUpdater are not copy-paste twins, but they are still duplicating workflow

6. Dependency inversion is only half done

What I Would Keep

Keep Labeller as the phase entrypoint

Keep TrainingDataModel for state and output work

Keep LabelDataViewHelper for presentation

Keep the post-processing extraction

Proposed Design

Proposed Boundary Model

How I Would Write It

Migration Plan

Step 1. Pick one owner for record retrieval

Step 2. Extract raw input handling

Step 3. Extract the session loop

Step 4. Reuse session scaffolding in LabelUpdater

Step 5. Clean up construction

Step 6. Tighten the interfaces

Final Take

Evidence Summary

Design Critique of `Labeller.java`

What `Labeller` Does Today

1. `Labeller` mixes phase logic with interaction logic

3. `Labeller.getUnmarkedRecords()` looks like drift, not a strong specialization

4. `TrainingDataModel` has the clearest real duplication

5. `Labeller` and `LabelUpdater` are not copy-paste twins, but they are still duplicating workflow

Keep `Labeller` as the phase entrypoint

Keep `TrainingDataModel` for state and output work

Keep `LabelDataViewHelper` for presentation

Step 4. Reuse session scaffolding in `LabelUpdater`