Stepwise Development Plan
Enterprise Agentic Pipeline for API Change Notes

Step 1 — Formalize scope & contracts (do this first)
Goal: Eliminate ambiguity before engineering starts.
Actions:
- Define what qualifies as an API change (public endpoints, DTOs, OpenAPI, versioned contracts).
- Define the exact structure of the output file (sections, metadata, wording constraints).
- Define change severity categories (breaking / non-breaking / informational).
- Define confidence thresholds and publication rules.
Deliverables:
- Written API-change definition
- Output contract (schema + example; see the sketch below)
- Acceptance metrics
Quality Gate: Stakeholders (API owners + business) agree that the output format is correct.
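
For the output contract above, a minimal sketch of what one entry in the release file could look like in Java. The field names and severity values are assumptions; the real contract is whatever the stakeholders agree on in this step.

```java
import java.util.List;

/** One entry in the release-notes output file (illustrative field names, not a fixed contract). */
public record ApiChangeNote(
        String releaseId,          // e.g. "2024.06"
        String summary,            // 1-2 line business-facing text
        Severity severity,         // drives publication rules
        double confidence,         // 0.0 - 1.0, compared against the agreed threshold
        List<String> jiraIds,      // traceability back to intent
        List<String> commitShas) { // traceability back to code

    public enum Severity { BREAKING, NON_BREAKING, INFORMATIONAL }
}
```
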
Step 2 — Normalize historical data into a golden dataset
Goal: Establish ground truth and evaluation baseline.
Actions:
- Convert last 50 human-written releases into structured records:
  - Release ID
  - API notes
  - Referenced JIRA IDs
  - Mentioned API elements (best-effort)
- Classify changes by type and severity.
- Extract canonical phrasing patterns.
Deliverables:
- Versioned "golden dataset"
- Label definitions and mapping rules
Quality Gate: Domain experts confirm historical normalization reflects reality.

Step 3 — Define system boundaries & trust model
Goal: Decide where determinism ends and probabilistic reasoning begins.
Actions:
- Declare deterministic vs non-deterministic components.
- Define what the agent is allowed to do.
- Define fail-closed behavior (what happens on uncertainty).
Decisions:
- Agent only summarizes and classifies; it does not fetch data.
- Pipeline controls execution order.
Deliverables:
- Architecture decision record (ADR)
Quality Gate: Architecture review sign-off.

Step 4 — Build deterministic ingestion & filtering pipeline
Goal: Create a reproducible, auditable foundation.
Actions:
- Scan sprint-bounded commits.
- Extract JIRA IDs (assumed reliable).
- Filter candidate commits via deterministic heuristics (see the filter sketch after this step):
  - Paths, file types
  - Signature changes
  - Spec deltas
- Produce structured change artifacts.
Deliverables:
- Commit → API-delta mapping
- Deterministic pipeline output
Quality Gate: Same input always produces same output.
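
A minimal sketch of the kind of deterministic filter Step 4 describes, assuming a Spring-style Java codebase. The path prefixes and annotations are illustrative stand-ins for whatever the agreed API-change definition specifies.

```java
import java.util.List;
import java.util.Set;

/** Deterministic commit filter: the same inputs always yield the same verdict. */
final class ApiChangeHeuristics {

    // Illustrative values; in practice these come from the agreed API-change definition.
    private static final Set<String> API_PATH_HINTS =
            Set.of("src/main/java/com/example/api/", "openapi/", "api-spec/");
    private static final Set<String> API_ANNOTATIONS =
            Set.of("@RestController", "@RequestMapping", "@Path");

    /** A commit is a candidate if it touches API paths/specs or changes an annotated signature. */
    static boolean isCandidate(List<String> changedFiles, String unifiedDiff) {
        boolean touchesApiPath = changedFiles.stream()
                .anyMatch(f -> API_PATH_HINTS.stream().anyMatch(f::startsWith)
                        || f.endsWith("openapi.yaml"));
        boolean touchesAnnotation = API_ANNOTATIONS.stream().anyMatch(unifiedDiff::contains);
        return touchesApiPath || touchesAnnotation;
    }
}
```
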
Step 5 — Integrate selective JIRA enrichment
Goal: Add intent and context without overfetching.
Actions:
- Fetch JIRA summary + description only for filtered commits.
- Normalize and sanitize text.
- Link JIRA intent to code deltas.
Deliverables:
- Enriched structured context bundles
Quality Gate: JIRA outages or failures do not break the pipeline (see the fail-safe sketch below).
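
One way to satisfy the quality gate is to wrap enrichment so a JIRA outage degrades to "no context" instead of an exception. The fetcher below is a hypothetical stand-in for whichever JIRA client is chosen; it is injected so it can be stubbed in tests.

```java
import java.util.Optional;
import java.util.function.Function;

/** Enrichment wrapper: a JIRA outage yields an empty context instead of failing the pipeline. */
final class JiraEnricher {

    /** Hypothetical fetcher (e.g. backed by the JIRA REST API), injected for testability. */
    private final Function<String, Optional<String>> fetchSummaryAndDescription;

    JiraEnricher(Function<String, Optional<String>> fetchSummaryAndDescription) {
        this.fetchSummaryAndDescription = fetchSummaryAndDescription;
    }

    /** Returns sanitized JIRA text, or empty when the issue cannot be fetched. */
    Optional<String> enrich(String jiraId) {
        try {
            return fetchSummaryAndDescription.apply(jiraId)
                    .map(text -> text.replaceAll("\\s+", " ").trim());
        } catch (RuntimeException e) {
            return Optional.empty(); // downstream stages flag the note as "context unavailable"
        }
    }
}
```
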
Step 6 — Back-test detection against historical releases
Goal: Validate correctness before adding agent logic.
Actions:
- Replay historical releases through the pipeline.
- Compare detected API changes vs historical notes.
- Measure precision, recall, and false positives.
Deliverables:
- Evaluation report
- Tuned heuristic rules
Quality Gate: Detection metrics meet agreed thresholds.

Step 7 — Introduce bounded agent for summarization
Goal: Generate human-readable API notes safely.
Actions:
- Feed agent only structured context bundles.
- Ground prompts with historical examples.
- Enforce strict output schema.
- Require confidence scoring and evidence references.
Deliverables:
- Agent output conforming to schema
Quality Gate: Invalid or low-confidence outputs are rejected automatically (see the acceptance-gate sketch below).
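
The "rejected automatically" quality gate can be a small, deterministic acceptance check on the agent's structured output. The field names and the 0.75 threshold below are assumptions, not agreed values.

```java
import java.util.List;

/** Reject-by-default gate for agent output; thresholds and field names are illustrative. */
final class AgentOutputGate {

    record AgentNote(String summary, double confidence, List<String> evidenceCommits) {}

    private static final double MIN_CONFIDENCE = 0.75; // assumed publication threshold

    /** Accepts output only if it is schema-complete, evidence-backed, and confident enough. */
    static boolean accept(AgentNote note) {
        if (note == null || note.summary() == null || note.summary().isBlank()) return false;
        if (note.summary().lines().count() > 2) return false;          // enforce the 1-2 line constraint
        if (note.evidenceCommits() == null || note.evidenceCommits().isEmpty()) return false;
        return note.confidence() >= MIN_CONFIDENCE;                    // low confidence -> human review
    }
}
```
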
Step 8 — Validate agent outputs using historical replay
Goal: Prove agent behaves like human authors.
Actions:
- Run agent on historical releases.
- Compare summaries to golden dataset.
- Measure semantic similarity, verbosity, tone alignment.
Deliverables:
- Agent evaluation metrics
- Approved prompt versions
Quality Gate: Agent meets or exceeds human similarity thresholds.

Step 9 — Add human-in-the-loop workflow
Goal: Maintain trust while capturing feedback.
Actions:
- Route outputs to reviewers based on confidence.
- Allow edit, approve, or reject.
- Capture edits as labeled feedback.
Deliverables:
- Review workflow
- Audit logs
Quality Gate: All published notes are traceable to approvals.

Step 10 — Implement governance & auditability
Goal: Make the system enterprise-compliant.
Actions:
- Store:
  - Commits, JIRA data, diffs
  - Agent inputs/outputs
  - Prompt and model versions
- Implement access controls and retention rules.
Deliverables:
- Audit trail
- Compliance documentation
Quality Gate: System passes internal audit review.

Step 11 — Gradual automation rollout
Goal: Reduce human effort safely.
Actions:
- Auto-publish low-risk, high-confidence changes (see the policy sketch below).
- Keep breaking changes always reviewed.
- Monitor drift and error rates.
Deliverables:
- Automation policy
- Monitoring dashboards
Quality Gate: Error rates remain below defined thresholds.
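
A sketch of the Step 11 rollout policy: breaking changes never bypass review, and auto-publish is gated by both a flag and a confidence floor. The 0.85 value is an assumption to be set against the agreed thresholds.

```java
/** Routing policy for gradual rollout: only low-risk, high-confidence notes bypass review. */
final class PublicationPolicy {

    enum Severity { BREAKING, NON_BREAKING, INFORMATIONAL }
    enum Route { AUTO_PUBLISH, HUMAN_REVIEW }

    static Route route(Severity severity, double compositeConfidence, boolean autoPublishEnabled) {
        if (!autoPublishEnabled) return Route.HUMAN_REVIEW;            // kill switch stays available
        if (severity == Severity.BREAKING) return Route.HUMAN_REVIEW;  // breaking changes are always reviewed
        return compositeConfidence >= 0.85 ? Route.AUTO_PUBLISH : Route.HUMAN_REVIEW;
    }
}
```
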
Step 12 — Continuous improvement
Goal: Sustain quality over time.
Actions:
- Periodic historical replay with new data.
- Prompt and heuristic tuning.
- Confidence threshold adjustments.
Deliverables:
- Versioned improvements
- Updated evaluation reports
Quality Gate: No regression in quality metrics.

----------------------------------
1. Evidence-first artifacts (treat evidence as a product)
Suggestion: Persist evidence bundles as first-class, versioned artifacts.
What this means: Every pipeline run produces an immutable "evidence package" containing (see the sketch below):
- Commit SHAs + diffs
- Extracted API deltas
- JIRA summary + description snapshot
- Pipeline version + ruleset version
Why it matters:
- You can re-run the agent without touching Git or JIRA again
- Auditors and reviewers can inspect facts independently of the AI
- Enables deterministic replay and regression testing
Enterprise pattern: Separate fact collection from interpretation.
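
A minimal sketch of such an evidence package as an immutable Java record. The fields mirror the list above; the exact types are assumptions.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

/** Immutable evidence package produced by each run; field names are illustrative, not a fixed schema. */
public record EvidenceBundle(
        String releaseId,
        Instant producedAt,
        String pipelineVersion,
        String rulesetVersion,
        List<String> commitShas,
        Map<String, String> diffsBySha,        // commit SHA -> unified diff
        List<String> apiDeltas,                // extracted API deltas, structured upstream
        Map<String, String> jiraSnapshots) {   // JIRA ID -> summary + description at run time
}
```
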
2. Two-pass classification before summarization
Suggestion: Split reasoning into classification → summarization, even if both use the same model.
Pass 1: Classification
- Is this an API change?
- Change type?
- Breaking vs non-breaking?
- Confidence score
Pass 2: Summarization
- Only runs if pass 1 succeeds
- Uses classification outputs as constraints
Why it matters:
- Reduces hallucinations
- Enables partial automation (e.g., auto-publish non-breaking changes)
- Makes evaluation easier and more explainable

3. Confidence as a computed value, not a model guess
Suggestion: Treat confidence as a composite score, not just an LLM output.
Combine (a weighted-blend sketch follows this suggestion):
- Heuristic confidence (deterministic)
- Historical similarity score
- Model confidence
- JIRA intent clarity score (e.g., explicit "API change" mention)
Why it matters:
- Prevents over-trusting the model
- Enables predictable automation policies
- Easier to justify decisions to stakeholders
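
For illustration, the composite score could be a simple weighted blend of the four signals above. The weights are assumptions to be tuned against the golden dataset, not recommended values.

```java
/** Composite confidence: a weighted blend of deterministic and model signals (illustrative weights). */
final class CompositeConfidence {

    static double score(double heuristicConfidence,
                        double historicalSimilarity,
                        double modelConfidence,
                        double jiraIntentClarity) {
        // Deterministic signals dominate; the model contributes but cannot carry the score alone.
        double score = 0.35 * heuristicConfidence
                     + 0.25 * historicalSimilarity
                     + 0.20 * modelConfidence
                     + 0.20 * jiraIntentClarity;
        return Math.max(0.0, Math.min(1.0, score));
    }
}
```
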
4. Negative capability testing (explicit "should NOT happen" cases)
Suggestion: Create a test suite of known non-API changes (see the test sketch below).
Examples:
- Internal refactors
- Logging changes
- Performance optimizations
- Test-only commits
Why it matters:
- Enterprise failures often come from false positives, not false negatives
- Business teams lose trust faster from noise than from missing items
Pattern: Measure false positives as aggressively as recall.
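
A sketch of what such negative tests could look like with JUnit 5. It reuses the hypothetical ApiChangeHeuristics.isCandidate filter from the Step 4 sketch earlier, so the suite fails loudly if the heuristics start flagging internal refactors or logging changes.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;

import java.util.List;
import org.junit.jupiter.api.Test;

/** "Should NOT happen" cases: commits that must never be reported as API changes. */
class NonApiChangeTest {

    @Test
    void internalRefactorIsNotFlagged() {
        assertFalse(ApiChangeHeuristics.isCandidate(
                List.of("src/main/java/com/example/internal/CacheWarmup.java"),
                "- int batchSize = 10;\n+ int batchSize = 50;"));
    }

    @Test
    void loggingChangeIsNotFlagged() {
        assertFalse(ApiChangeHeuristics.isCandidate(
                List.of("src/main/resources/logback.xml"),
                "+ <logger name=\"com.example\" level=\"DEBUG\"/>"));
    }
}
```
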
5. Human language alignment layer
Suggestion: Introduce a language normalization step before final output.
What it does:
- Enforces:
  - Verb tense
  - Terminology ("consumer" vs "client")
  - Severity words ("breaking", "minor")
- Strips speculative phrasing ("might", "appears")
Why it matters:
- Business users care about consistency more than intelligence
- Prevents stylistic drift across releases

6. Drift detection on process, not just model
Suggestion: Monitor drift in inputs and behavior, not only outputs.
Track:
- Average commits per release
- Average API deltas per release
- JIRA description length
- Agent confidence distribution
Why it matters:
- Codebase evolution breaks heuristics silently
- Organizational process changes (e.g., worse commit messages) degrade quality
Enterprise lesson: Most AI failures are upstream data failures.
7. "What changed since last run?" awareness
Suggestion: Maintain a release memory.
What it enables:
- Detect repeated changes to the same API
- Collapse noisy updates into a single coherent note
- Prevent duplicate reporting across releases
Why it matters:
- Humans naturally reason across time; pipelines do not unless they are designed to
8. Safe degradation paths
Suggestion: Define explicit downgrade modes (see the sketch below).
Examples:
- JIRA unavailable → produce notes without intent, but flag them
- Model unavailable → produce a structured diff summary only
- Heuristics uncertain → force human review
Why it matters:
- Enterprise systems must degrade gracefully, not fail hard
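
The downgrade modes can be made explicit in code so they are testable. The enum below is a sketch; its selection rules simply restate the examples above.

```java
/** Explicit downgrade modes; names and selection rules are illustrative. */
enum DegradationMode {
    FULL,               // code deltas + JIRA intent + agent summary
    NO_JIRA_CONTEXT,    // JIRA unavailable: notes are produced but flagged as intent-free
    DIFF_SUMMARY_ONLY,  // model unavailable: publish structured diff summaries, no prose
    FORCED_REVIEW;      // heuristics uncertain: nothing publishes without a human

    static DegradationMode select(boolean jiraUp, boolean modelUp, boolean heuristicsConfident) {
        if (!heuristicsConfident) return FORCED_REVIEW;
        if (!modelUp) return DIFF_SUMMARY_ONLY;
        if (!jiraUp) return NO_JIRA_CONTEXT;
        return FULL;
    }
}
```
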
9. Separation of "release assembly" from "change detection"
Suggestion: Treat "release notes assembly" as a distinct stage.
Why:
- One API change may span multiple commits
- One commit may touch multiple APIs
- Release notes need aggregation logic, not just detection
This avoids the trap of "one commit → one note".

10. Kill switches & feature flags (non-optional)
Suggestion: Everything agentic should be behind flags (see the flag sketch below):
- Auto-publish on/off
- Agent on/off
- Confidence thresholds
Why it matters:
- You will need to disable parts quickly
- Builds trust with senior stakeholders
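
A minimal sketch of that flag surface. In practice these values would live in externalized configuration rather than code, and the names are illustrative.

```java
/** Feature flags controlling agentic behavior; conservative defaults keep auto-publish off. */
public record AgenticFlags(
        boolean agentEnabled,          // master switch for the LLM stage
        boolean autoPublishEnabled,    // switch for publishing without review
        double minAutoPublishConfidence) {

    /** Conservative defaults: agent on, auto-publish off, high confidence floor. */
    static AgenticFlags defaults() {
        return new AgenticFlags(true, false, 0.90);
    }
}
```
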
11. Idempotent pipeline + replay = enterprise superpower
Combine:
- Idempotent pipeline
- Evidence bundles
- Historical replay
This gives you:
- Deterministic debugging
- Compliance confidence
- Safe iteration on prompts and heuristics
Most "AI failures" happen because teams can't replay the past.

12. Design principle to remember
Your agent is replaceable. Your pipeline is the product.
If you design for:
- Reproducibility
- Evidence-first processing
- Controlled autonomy
…you'll end up with a system senior architects trust.

-------------------------------------
Problem Statement

Build an enterprise-grade autonomous / agentic workflow that generates API change notes for each release by analyzing Git commits and associated JIRA issues.
The output is a file containing 1–2 line summaries of API changes, intended for business and non-technical stakeholders, and distributed to the team.

Functional Requirements

Input Sources
- Git repository
  - Commits belonging to a specific sprint or release window
  - Commits almost always contain a JIRA issue ID in the commit message
- JIRA
  - JIRA tickets are fetched only for JIRA IDs extracted from commits
  - It is not possible or desirable to fetch all JIRA tickets

Processing Logic
1. Pull the Git repository
2. Identify commits within the sprint/release
3. Filter commits that appear related to API changes
4. Extract code changes introduced by those commits
5. Fetch corresponding JIRA issues (using extracted JIRA IDs)
6. Connect code changes with JIRA descriptions
7. Summarize API changes into concise human-readable text
8. Aggregate summaries into a single output file
9. Send the file to the relevant team

API Change Scope
- Focused on public API changes
- API changes may include:
  - Endpoint changes
  - Request/response contract changes
  - Public DTO or interface changes
  - OpenAPI / specification changes
- Non-API changes (e.g., refactoring, internal logic, tests) are out of scope unless they affect the public API

Output Requirements
- A single file per release
- Contains short (1–2 line) summaries of API changes
- Written for business stakeholders and API consumers
- Includes traceability information (e.g., linked JIRA ID, commit reference)
- Generated automatically but may support human review
- Intended for distribution to stakeholders after generation

Historical Data
- 50 previous releases already exist
- API change notes for these releases were written by humans
- Historical data can be used as:
  - Reference behavior
  - Ground truth for evaluation
  - Validation baseline for automation

Architectural Constraints
- Enterprise-grade quality required
- Auditability and traceability are required
- Idempotent pipeline behavior is required
  - Re-running the pipeline with the same inputs should produce the same outputs
- Agentic behavior must be controlled and bounded
- Deterministic processing preferred where possible

Technology Constraints
- Primary implementation language: Java
- Cloud environment available: GCP
- Preference for cloud-agnostic design
- No hard dependency on vendor-specific AI platforms
- Integration with:
  - Git
  - JIRA (via API or MCP)

Workflow Characteristics
- Commit messages are the primary entry point for identifying relevant JIRA issues
- JIRA is used for contextual enrichment, not discovery
- API change detection occurs before JIRA enrichment
- Reasoning and summarization are part of the workflow
- Output must be suitable for enterprise consumption and governance

Quality & Governance Constraints
- Traceability from output → JIRA → commits → code changes
- Ability to replay historical releases
- Ability to evaluate system output against historical human-generated releases
- Support for human-in-the-loop review
- Clear separation between:
  - Deterministic pipeline stages
  - Probabilistic / agentic reasoning stages

Non-Goals (Explicit or Implicit)
- No requirement to fetch or index all JIRA tickets
- No requirement to allow agents to directly access Git or JIRA
- No requirement for full autonomous publishing without governance
- No requirement for implementation-level detail at this stage

----------------------------------
Step-by-Step PoC Plan (with Explicit Enterprise Path)

Step 1 — Lock problem definition & PoC boundaries
Purpose: Prevent scope creep and ensure results are interpretable.
Do:
- Write a one-page definition of:
  - What counts as an API change
  - Target audience (business/API consumers)
  - Output format (1–2 line summaries)
- Choose:
  - One repository
  - 1–2 recent releases or sprints
  - One API surface (e.g., REST controllers)
Skip (for PoC):
- Multi-repo support
- Multiple API styles
Carries to Enterprise:
- API-change definition
- Output structure

Step 2 — Create a minimal golden dataset
Purpose: Establish objective comparison early.
Do:
- Select 5–10 historical releases from your 50
- Extract:
  - Human-written API notes
  - Approximate commit ranges
- Normalize notes into a simple structured format
Skip:
- Perfect commit ↔ note mapping
- Full historical ingestion
Carries to Enterprise:
- Golden dataset format
- Evaluation mindset

Step 3 — Implement minimal deterministic commit ingestion
Purpose: Ground everything in real signals.
Do:
- Pull commits for the chosen release window
- Extract JIRA IDs from commit messages (assumed reliable; see the extraction sketch below)
- Store commit metadata and diffs
Skip:
- Caching
- Idempotency guarantees
- Retry logic
Carries to Enterprise:
- Commit ingestion logic
- JIRA ID extraction rules
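
Assuming commit messages follow the usual PROJECT-123 key convention, JIRA ID extraction can be a single regex. The pattern is an assumption about local naming rules and would be tightened to the actual project keys.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts JIRA IDs (e.g. "PAY-1234") from a commit message. */
final class JiraIdExtractor {

    private static final Pattern JIRA_ID = Pattern.compile("\\b[A-Z][A-Z0-9]+-\\d+\\b");

    static Set<String> extract(String commitMessage) {
        Set<String> ids = new LinkedHashSet<>();   // preserve first-seen order, drop duplicates
        Matcher m = JIRA_ID.matcher(commitMessage);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }
}
```
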
Step 4 — First-pass API-change heuristics
Purpose: Reduce noise before LLM involvement.
Do:
- Implement simple, explicit heuristics:
  - File paths (controllers, API packages)
  - Known annotations
  - OpenAPI spec file changes
- Output a list of candidate API-change commits
Skip:
- AST diffs
- Weighted scoring
- Complex rule engines
Carries to Enterprise:
- Heuristic categories
- Observed false positives/negatives

Step 5 — Lightweight JIRA enrichment
Purpose: Add intent and context cheaply.
Do:
- Fetch:
  - JIRA summary
  - JIRA description
- Attach JIRA context to candidate commits
Skip:
- Caching
- Rate-limit handling beyond basics
Carries to Enterprise:
- JIRA field selection
- Linking strategy

Step 6 — Define structured input & output schemas
Purpose: Prevent PoC chaos and future rewrites.
Do:
- Define simple schemas (see the record sketch below):
  - Commit context
  - API delta
  - Agent output (summary, justification, confidence)
- Validate outputs against schema
Skip:
- Versioning
- Backward compatibility
Carries to Enterprise:
- Core data contracts
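
A sketch of the three PoC schemas as plain Java records. The field lists are illustrative and deliberately minimal, matching the PoC decision to skip versioning.

```java
import java.util.List;

/** Minimal PoC data contracts; names and fields are illustrative and unversioned at this stage. */
public final class PocSchemas {

    /** Commit plus the context the later stages need. */
    public record CommitContext(String sha, String message, List<String> jiraIds, String diff) {}

    /** One detected change to a public API element; kind could be ADDED / CHANGED / REMOVED. */
    public record ApiDelta(String element, String kind, String before, String after) {}

    /** What the agent is asked to return: summary, justification bullets, confidence. */
    public record AgentOutput(String summary, List<String> justification, double confidence) {}

    private PocSchemas() {}
}
```
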
Step 7 — Single-pass agent summarization
Purpose: Validate LLM usefulness, not autonomy.
Do:
- One prompt
- One agent
- Provide:
  - Structured inputs
  - 2–3 real historical examples
- Generate:
  - 1–2 line API summaries
  - Justification bullets
Skip:
- Multi-agent orchestration
- Tool-calling
- Confidence automation logic
Carries to Enterprise:
- Prompt patterns
- Output structure

Step 8 — Manual evaluation against history
Purpose: Answer "Is this good enough?"
Do:
- Compare agent output to human notes:
  - Coverage
  - Accuracy
  - Tone alignment
- Capture:
  - Missed changes
  - Hallucinations
  - Edits needed
Metrics (simple):
- % of human changes detected
- False positives
- Reviewer usefulness rating
Carries to Enterprise:
- Evaluation criteria
- Known failure modes

Step 9 — Iterate fast (tight PoC loop)
Purpose: Maximize learning, not code quality.
Do:
- Iterate on:
  - Heuristics
  - Prompt wording
  - Output phrasing
- Re-run against the same historical set
Skip:
- Refactoring for cleanliness
- Performance tuning
Carries to Enterprise:
- Refined heuristics
- Stable prompt templates

Step 10 — Produce PoC artifacts for decision-making
Purpose: Enable an informed "go / no-go".
Deliver:
- Example generated API-change files
- Side-by-side comparisons with human releases
- List of failure modes
- Quantified value estimate (time saved)
Carries to Enterprise:
- Business justification
- Architectural confidence

Step 11 — Define explicit enterprise transition criteria
Purpose: Avoid PoC limbo.
Define:
- Minimum acceptable detection recall
- Maximum tolerable false positives
- Human approval rate threshold
Decision: Proceed to enterprise hardening only if the criteria are met.
Carries to Enterprise:
- Quality gates

Step 12 — Transition to enterprise build (after PoC)
What changes:
- Add idempotent pipeline behavior
- Add audit logs
- Add confidence-based automation
- Add governance & security
- Harden heuristics and diffing
What stays:
- Definitions
- Schemas
- Prompts
- Evaluation framework
- Historical dataset

Key principle (to keep yourself honest):
The PoC validates signal and behavior. The enterprise build hardens what proved valuable.