File reviewed: common/core/src/main/java/zingg/common/core/executor/Labeller.java
I read this together with:
- ZinggBase.java
- TrainingDataModel.java
- LabelDataViewHelper.java
- LabelUpdater.java
- IZinggModelInfo.java
- ITrainingDataModel.java
After reading Labeller and the classes around it, I think the main issue is boundary placement. The class handles phase orchestration, record retrieval, CLI interaction, and default collaborator wiring all in one place.
Right now Labeller is doing all of these:
- phase orchestration
- record loading/filtering
- CLI session control
- raw console input parsing
- state update coordination
- default dependency construction
That makes the class hard to test and harder to extend than it needs to be.
The cleanest fix is not a rewrite. I would keep Labeller as the phase orchestrator, keep TrainingDataModel for stats/output work, keep LabelDataViewHelper for presentation, and extract one session-level abstraction for the interactive labeling flow.
I also think there is a difference between intentional framework layering and actual duplication:
- ZinggBase looks like the intended default implementation of the IZinggModelInfo contract
- Labeller.getUnmarkedRecords() looks like a drifted override
- TrainingDataModel contains the clearest real duplicate implementation
- Labeller and LabelUpdater share enough workflow that a missing abstraction is showing through
At a high level, execute() does the whole label phase end to end:
- reads existing marked records
- sets current stats
- loads unmarked records
- preprocesses them
- runs the CLI labeling loop
- post-processes the updated rows
- writes the output
That path is in Labeller.java:34-50.
The issue is that the class does not stop at orchestration.
getUnmarkedRecords() in Labeller.java:54-78 is doing record retrieval and filtering, but it also updates stats as a side effect.
processRecordsCli(...) in Labeller.java:80-142 is doing the actual interaction workflow:
- cluster discovery
- iteration
- prompt building
- display
- input
- stats update
- output accumulation
readCliInput() in Labeller.java:152-164 is handling raw terminal input directly.
And getTrainingDataModel() / getLabelDataViewHelper() in Labeller.java:167-189 are still choosing concrete implementations from inside the abstract executor.
So the class is sitting across multiple abstraction layers at once. That is the main design issue.
Labeller matters because the Spark side is thin. The real behavior lives here.
The rough execution path is:
- client resolves the phase
- SparkZFactory selects SparkLabeller
- SparkLabeller mostly inherits behavior from Labeller
So if this class is hard to extend, the whole labeling phase is hard to extend.
The nearby classes already show a better shape trying to emerge:
- TrainingDataModel already owns stats, record mutation, and writes
- LabelDataViewHelper already owns display and prompt-related behavior
- LabellerUtil already owns a post-processing step
That is why I would not rewrite the subsystem. I would fix the boundaries.
This is the current boundary as I read it from the code. It is intentionally a boundary sketch, not a full repo diagram.
```mermaid
classDiagram
    class ZinggBase
    class IPreprocessors
    class Labeller {
        +execute()
        +getUnmarkedRecords()
        +processRecordsCli(lines)
        ~readCliInput()
        +getTrainingDataModel()
        +getLabelDataViewHelper()
        #getDfObjectUtil()
    }
    class LabelUpdater
    class ITrainingDataModel
    class TrainingDataModel
    class ILabelDataViewHelper
    class LabelDataViewHelper
    class LabellerUtil
    ZinggBase <|-- Labeller
    IPreprocessors <|.. Labeller
    Labeller <|-- LabelUpdater
    Labeller --> ITrainingDataModel
    Labeller --> ILabelDataViewHelper
    Labeller ..> TrainingDataModel : lazy default
    Labeller ..> LabelDataViewHelper : lazy default
    Labeller --> LabellerUtil
```
The biggest issue is that Labeller mixes phase orchestration with the interactive session.
execute() is the phase entrypoint in Labeller.java:34-50. That part makes sense.
But the same class also owns the full CLI loop in Labeller.java:80-142 and raw input parsing in Labeller.java:152-164.
I would expect the phase class to answer:
- what steps make up labeling
I would not expect it to answer:
- how one interactive session runs
- how terminal input is validated
Those are different reasons to change.
Why it matters:
- testing gets harder because orchestration and interaction are coupled
- adding a non-CLI reviewer flow gets harder because CLI behavior is baked into the executor
- small UX changes to the label flow require editing the phase class itself
This part needs a careful read because not all repetition here is bad.
IZinggModelInfo defines:
- getMarkedRecords()
- getUnmarkedRecords()
- marked-record stat helpers
See IZinggModelInfo.java:3-17.
ZinggBase then provides the default implementation in ZinggBase.java:94-137.
That looks intentional. I would not call that duplication in the negative sense. It looks like the base executor implementing a broad framework contract.
The messy part comes after that:
- Labeller overrides getUnmarkedRecords() in Labeller.java:54-78
- TrainingDataModel redefines record access and stat helpers in TrainingDataModel.java:118-160
- ITrainingDataModel does not ask for those methods at all in ITrainingDataModel.java:6-27
At that point there are too many places that can claim ownership of the same concept.
That matters because the next change to marked/unmarked record resolution now has three plausible homes:
- base executor
- phase executor
- training data model
That is how behavior drift starts.
Labeller.getUnmarkedRecords() mostly repeats the pattern from ZinggBase.getUnmarkedRecords():
- read unmarked pipe
- read marked pipe
- anti-join out already labeled clusters
Compare:
- ZinggBase.java:103-117
- Labeller.java:54-78
The override adds two things:
- different error handling
- getTrainingDataModel().setMarkedRecordsStat(markedRecords) at Labeller.java:69
But execute() already does stat setup earlier at Labeller.java:38.
So now a method named like a data read helper also mutates session state, and it does that redundantly.
That is why I read this as drift instead of clean polymorphism.
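A minimal sketch of the cleaner shape: a read helper that only resolves candidates, with any stat update left to the caller. It is hypothetical throughout. Plain Java collections stand in for ZFrame, and SketchRow, UnmarkedRecordReader, and clusterId are illustrative names, not taken from the Zingg code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for one row of a labeling frame; not a Zingg type.
final class SketchRow {
    final String clusterId;
    final String payload;

    SketchRow(String clusterId, String payload) {
        this.clusterId = clusterId;
        this.payload = payload;
    }
}

// Illustrative read helper: resolves candidates and touches no session state.
final class UnmarkedRecordReader {

    // Anti-join: keep unmarked rows whose cluster has not been labeled yet.
    static List<SketchRow> unmarkedOnly(List<SketchRow> unmarked, List<SketchRow> marked) {
        Set<String> labeledClusters = new HashSet<>();
        for (SketchRow row : marked) {
            labeledClusters.add(row.clusterId);
        }
        return unmarked.stream()
                .filter(row -> !labeledClusters.contains(row.clusterId))
                .collect(Collectors.toList());
    }
}
```

With this shape, the stat setup stays in exactly one place (the caller), and the read helper can be tested without any session object.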
This is the part I would call real duplicate implementation with the most confidence.
TrainingDataModel already has useful, focused responsibilities:
- setMarkedRecordsStat(...)
- updateRecords(...)
- updateLabellerStat(...)
- writeLabelledOutput(...)
See TrainingDataModel.java:30-85.
But then the same class also redefines:
- getMarkedRecords()
- getUnmarkedRecords()
- all the marked-record stat helpers
See TrainingDataModel.java:118-160.
Those methods already exist in ZinggBase.java:94-137, and ITrainingDataModel does not require them.
One more detail makes this worse: the constructor in TrainingDataModel.java:23-27 sets context and client options, but not args, and the duplicated record-access methods depend on args.
So this is not just extra API surface. Some of it looks actively unsafe or at least misleading.
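The failure mode is easy to reproduce in miniature. The class below is a hypothetical reduction, not the real TrainingDataModel: every name in it is invented, and modelPath merely stands in for args. It only shows what happens when a constructor skips a field that a later method dereferences.

```java
import java.util.Objects;

// Hypothetical reduction of the pattern; names are illustrative only.
final class HalfInitializedModel {
    private final String context;
    private final String clientOptions;
    private String modelPath; // stands in for args; never set by the constructor

    HalfInitializedModel(String context, String clientOptions) {
        this.context = Objects.requireNonNull(context);
        this.clientOptions = Objects.requireNonNull(clientOptions);
    }

    void setModelPath(String modelPath) {
        this.modelPath = modelPath;
    }

    String describe() {
        return context + " / " + clientOptions;
    }

    // Compiles and looks safe, but depends on state the constructor never sets.
    String markedRecordsLocation() {
        return modelPath.trim() + "/marked";
    }
}
```

Calling markedRecordsLocation() on a freshly constructed instance throws a NullPointerException. Nothing in the method's signature warns the caller about that precondition.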
I would be careful not to overstate the overlap between Labeller and LabelUpdater.
The flows are different:
- Labeller walks newly found pairs
- LabelUpdater lets the user pick an existing cluster and relabel it
So I would not describe them as the same method copied twice.
But the shared scaffolding is obvious:
- stat initialization and printing
- ZidAndFieldDefSelector
- pair display
- user decision capture
- stats update
- updated-record accumulation
- quit handling
Relevant code:
- Labeller.java:80-142
- LabelUpdater.java:36-140
To me, that points to a missing session abstraction rather than poor duplication hygiene. This is debatable and open for discussion.
Labeller stores collaborators behind interfaces:
- ITrainingDataModel
- ILabelDataViewHelper
and it even provides setters.
But the abstract executor still instantiates the concrete defaults itself in Labeller.java:167-189.
That means the code wants the flexibility of indirection, but not quite enough to make the dependency boundary explicit.
This is not as serious as the abstraction-level problem, but it still matters:
- tests need to push against a built-in default path
- alternate interaction models have to work around the executor's construction choices
- the abstraction is weaker than it looks from the field types alone
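The difference between the two wiring styles can be sketched as follows. All names here are hypothetical: StatsModel stands in for a collaborator contract such as ITrainingDataModel, and neither executor class mirrors the real Labeller.

```java
import java.util.Objects;

// Hypothetical collaborator contract; stands in for ITrainingDataModel.
interface StatsModel {
    String name();
}

final class DefaultStatsModel implements StatsModel {
    @Override
    public String name() {
        return "default";
    }
}

// Pattern described above: interface-typed field, concrete default chosen inside.
class LazyDefaultExecutor {
    private StatsModel statsModel;

    StatsModel getStatsModel() {
        if (statsModel == null) {
            statsModel = new DefaultStatsModel(); // the executor picks the implementation
        }
        return statsModel;
    }

    void setStatsModel(StatsModel statsModel) {
        this.statsModel = statsModel;
    }
}

// Alternative: the caller (factory, constructor path, or context) supplies it.
class InjectedExecutor {
    private final StatsModel statsModel;

    InjectedExecutor(StatsModel statsModel) {
        this.statsModel = Objects.requireNonNull(statsModel);
    }

    StatsModel getStatsModel() {
        return statsModel;
    }
}
```

In the second form the executor never names a concrete class, so a test or an alternate interaction model can pass its own implementation without overriding a getter or racing the lazy default.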
I would keep the general shape of the subsystem. I would not throw it away.
The phase still needs one place that says:
- get records
- preprocess
- run labeling
- post-process
- write
That should stay in Labeller. The problem is that Labeller currently goes deeper than that.
TrainingDataModel already has a useful center of gravity:
- stats
- record mutation
- output write
That is worth keeping.
LabelDataViewHelper already handles:
- current pair extraction support
- score/prediction messages
- display formatting
- stats printing
I would keep that direction. I would just stop short of letting it own the whole session.
LabellerUtil.postProcessLabel(...) is actually one of the cleaner separations in this flow. It does one distinct thing near the end of the pipeline.
I would move toward this split:
```mermaid
classDiagram
    class Labeller {
        +execute()
    }
    class LabelRecordSource {
        +getMarkedRecords()
        +getUnmarkedRecords()
    }
    class LabelSession {
        +run(lines)
    }
    class LabelInputHandler {
        +readDecision()
    }
    class TrainingDataModel {
        +setMarkedRecordsStat(marked)
        +updateRecords(...)
        +updateLabellerStat(...)
        +writeLabelledOutput(...)
    }
    class LabelDataViewHelper {
        +displayRecords(...)
        +printMarkedRecordsStat(...)
        +getCurrentPair(...)
        +getMsg1(...)
        +getMsg2(...)
    }
    class LabellerUtil
    Labeller --> LabelRecordSource
    Labeller --> LabelSession
    Labeller --> TrainingDataModel
    Labeller --> LabellerUtil
    LabelSession --> LabelDataViewHelper
    LabelSession --> LabelInputHandler
    LabelSession --> TrainingDataModel
```
Under that model, I would split responsibilities like this:
- Labeller: phase orchestration only
- LabelRecordSource: marked/unmarked record retrieval and filtering
- LabelSession: interactive labeling workflow
- LabelInputHandler: CLI input parsing and validation
- TrainingDataModel: stats, record mutation, output writing
- LabelDataViewHelper: presentation and prompt/message formatting
- LabellerUtil or LabelPostProcessor: post-processing before write
The main design choice here is deliberate: I would add only one meaningful new abstraction first, which is LabelSession. That is the missing concept the current code keeps hinting at.
Everything else can be lighter weight.
If I were writing this today, I would want the class boundaries to reflect intent directly.
Something like:
```java
// Phase orchestration only.
abstract class Labeller {
    abstract void execute() throws ZinggClientException;
}

// Marked/unmarked record retrieval and filtering.
interface LabelRecordSource<D, R, C> {
    ZFrame<D, R, C> getMarkedRecords();
    ZFrame<D, R, C> getUnmarkedRecords();
}

// Interactive labeling workflow.
interface LabelSession<D, R, C> {
    ZFrame<D, R, C> run(ZFrame<D, R, C> candidates) throws ZinggClientException;
}

// CLI input parsing and validation.
interface LabelInputHandler {
    int readDecision() throws ZinggClientException;
}
```

The exact names are less important than the split:
- phase object decides the sequence
- session object decides how interaction runs
- record source decides how candidates are resolved
That gives each class one dominant reason to change.
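To make the session role concrete, here is a small runnable sketch of the loop a session object would own. It is a simplification under stated assumptions: strings replace ZFrame, the Sketch-prefixed names and LabeledPair are hypothetical, and the quit code 9 is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical input contract: one decision per call.
interface SketchInputHandler {
    int readDecision();
}

// Hypothetical result row: the pair shown and the label the reviewer chose.
final class LabeledPair {
    final String pair;
    final int label;

    LabeledPair(String pair, int label) {
        this.pair = pair;
        this.label = label;
    }
}

// Session object: owns iteration, decision capture, and quit handling.
final class SketchLabelSession {
    static final int QUIT = 9; // assumed quit code for this sketch

    private final SketchInputHandler input;

    SketchLabelSession(SketchInputHandler input) {
        this.input = input;
    }

    List<LabeledPair> run(List<String> candidates) {
        List<LabeledPair> updated = new ArrayList<>();
        for (String pair : candidates) {
            int decision = input.readDecision();
            if (decision == QUIT) {
                break; // stop early, keep what was labeled so far
            }
            updated.add(new LabeledPair(pair, decision));
        }
        return updated;
    }
}
```

Because the session only sees an input contract, the same loop can be driven by a terminal, a scripted list of decisions in a test, or a non-CLI reviewer flow.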
I would refactor this in small steps.
Either:
- keep record retrieval in ZinggBase
or:
- move it into a small LabelRecordSource
But I would not leave it spread across ZinggBase, Labeller, and TrainingDataModel.
As part of that step, I would remove duplicate record-access/stat-helper methods from TrainingDataModel.
Move readCliInput() out of Labeller into a tiny CLI input helper.
This is small but useful because it makes the terminal boundary explicit immediately.
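A minimal sketch of such a helper, assuming the accepted codes are 0, 1, 2, and 9 (that set is an assumption made for this example, as is the choice to treat exhausted input as quit). CliDecisionReader is a hypothetical name. The streams are injected so the terminal boundary can be replaced in tests.

```java
import java.io.InputStream;
import java.io.PrintStream;
import java.util.Scanner;

// Hypothetical CLI input helper; the accepted codes below are assumptions.
final class CliDecisionReader {
    static final int QUIT = 9;

    private final Scanner scanner;
    private final PrintStream out;

    CliDecisionReader(InputStream in, PrintStream out) {
        this.scanner = new Scanner(in);
        this.out = out;
    }

    // Reads whitespace-separated tokens until one is an accepted code.
    int readDecision() {
        while (scanner.hasNext()) {
            String token = scanner.next();
            if (token.matches("[0129]")) {
                return Integer.parseInt(token);
            }
            out.println("Invalid choice '" + token + "', expected 0, 1, 2 or 9");
        }
        return QUIT; // treat exhausted input as a request to quit
    }
}
```

In a real run this would be constructed with System.in and System.out; a test can pass a ByteArrayInputStream with scripted input instead.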
Move the body of processRecordsCli(...) into LabelSession.
After that, Labeller.execute() becomes much cleaner:
- load records
- preprocess
- run session
- post-process
- write
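The five steps above can be sketched as an orchestration-only class. This is an illustration rather than a proposed signature: LabelPhaseSketch is a hypothetical name, lists of strings replace ZFrame, and each collaborator is modeled as a standard functional interface.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical orchestration-only phase; every collaborator is passed in.
final class LabelPhaseSketch {
    private final Supplier<List<String>> recordSource;      // load records
    private final UnaryOperator<List<String>> preprocess;   // preprocess
    private final UnaryOperator<List<String>> session;      // run session
    private final UnaryOperator<List<String>> postProcess;  // post-process
    private final Consumer<List<String>> writer;            // write

    LabelPhaseSketch(Supplier<List<String>> recordSource,
                     UnaryOperator<List<String>> preprocess,
                     UnaryOperator<List<String>> session,
                     UnaryOperator<List<String>> postProcess,
                     Consumer<List<String>> writer) {
        this.recordSource = recordSource;
        this.preprocess = preprocess;
        this.session = session;
        this.postProcess = postProcess;
        this.writer = writer;
    }

    // The phase decides only the sequence; each step's logic lives elsewhere.
    void execute() {
        List<String> records = recordSource.get();
        List<String> prepared = preprocess.apply(records);
        List<String> labeled = session.apply(prepared);
        List<String> finished = postProcess.apply(labeled);
        writer.accept(finished);
    }
}
```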
I would not force Labeller and LabelUpdater into one identical method. Their entry flows are genuinely different.
But I would make them share:
- display/prompt flow
- decision capture
- state update scaffolding
That is the right level of reuse here.
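One possible shape for that reuse is a shared per-pair step that both entry flows call. The sketch below is hypothetical: PairReviewStep, NewPairsFlow, and RelabelFlow are invented names, strings replace real records, and quit handling is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical shared step: display, capture a decision, update counts.
final class PairReviewStep {
    private final IntSupplier decisions;
    final Map<Integer, Integer> countsByLabel = new LinkedHashMap<>();
    final List<String> shown = new ArrayList<>();

    PairReviewStep(IntSupplier decisions) {
        this.decisions = decisions;
    }

    int review(String pair) {
        shown.add(pair);                             // stands in for pair display
        int label = decisions.getAsInt();            // decision capture
        countsByLabel.merge(label, 1, Integer::sum); // stats update
        return label;
    }
}

// Entry flow 1: walk every newly found pair.
final class NewPairsFlow {
    static List<Integer> run(List<String> pairs, PairReviewStep step) {
        List<Integer> labels = new ArrayList<>();
        for (String pair : pairs) {
            labels.add(step.review(pair));
        }
        return labels;
    }
}

// Entry flow 2: relabel one existing cluster chosen by the user.
final class RelabelFlow {
    static int run(Map<String, String> pairsByCluster, String chosenCluster, PairReviewStep step) {
        String pair = pairsByCluster.get(chosenCluster);
        if (pair == null) {
            throw new IllegalArgumentException("Unknown cluster: " + chosenCluster);
        }
        return step.review(pair);
    }
}
```

The entry flows stay different, which matches how the two executors behave, while display, decision capture, and stats update live in one place.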
Stop instantiating TrainingDataModel and LabelDataViewHelper inside the abstract executor. Push default construction into a factory, constructor path, or context.
IZinggModelInfo already has a TODO comment about needing to revisit this interface. I think that comment is justified.
Broad interfaces are part of why responsibility has spread in a blurry way here.
I think Labeller has the right role and the wrong boundary.
The code already contains the pieces of a cleaner design:
- a phase class
- a state/output helper
- a presentation helper
- a post-processing helper
What is missing is one explicit abstraction for the interaction session, plus a decision about who really owns record retrieval.
That is why I would treat this as a boundary-fixing refactor, not a rewrite.
Main files reviewed:
- common/core/src/main/java/zingg/common/core/executor/Labeller.java
- common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
- common/core/src/main/java/zingg/common/core/executor/TrainingDataModel.java
- common/core/src/main/java/zingg/common/core/executor/LabelDataViewHelper.java
- common/core/src/main/java/zingg/common/core/executor/ZinggBase.java
- common/client/src/main/java/zingg/common/client/IZinggModelInfo.java
- common/client/src/main/java/zingg/common/client/ITrainingDataModel.java