Workflow Extraction HID to ID Migration - Research and Planning Documents

Workflow Extraction: HID to ID Migration Plan

Executive Summary

Galaxy's workflow extraction API uses HIDs (History ID numbers) to identify datasets for workflow input selection. This creates limitations for cross-history workflows and handling copied datasets:

  1. Cross-history restrictions: HID-based identification requires a single history context
  2. Copied dataset issues: Datasets copied between histories lose proper job chain tracing

This plan details migrating from HID-based to ID-based dataset identification, enabling:

  • Cross-history workflow extraction (datasets from multiple histories)
  • Proper handling of copied datasets and job cache hits
  • Dynamic upstream job chain following in the new Vue UI
  • API consistency (current dataset_ids param uses HIDs despite the name)

Note: Anonymous users cannot create workflows in Galaxy, so these APIs are only relevant for authenticated users.


Current State Analysis

API Endpoint

Route: POST /api/workflows with from_history_id param

File: /lib/galaxy/webapps/galaxy/api/workflows.py (lines 196-298)

Current Parameters:

from_history_id: str           # Encoded history ID (required for extraction)
job_ids: List[str]             # Encoded job database IDs - decoded to int
dataset_ids: List[int]         # HIDs (not encoded IDs!) - used as-is
dataset_collection_ids: List[int]  # HIDs (not encoded IDs!) - used as-is
workflow_name: str             # Name for extracted workflow

Key code (lines 278-298):

if "from_history_id" in payload:
    from_history_id = payload.get("from_history_id")
    from_history_id = self.decode_id(from_history_id)
    history = self.history_manager.get_accessible(...)

    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    dataset_ids = payload.get("dataset_ids", [])           # HIDs! Not decoded
    dataset_collection_ids = payload.get("dataset_collection_ids", [])  # HIDs!
    workflow_name = payload["workflow_name"]
    stored_workflow = extract_workflow(
        trans=trans,
        user=trans.user,
        history=history,
        job_ids=job_ids,
        dataset_ids=dataset_ids,                 # Passed as HIDs
        dataset_collection_ids=dataset_collection_ids,  # Passed as HIDs
        workflow_name=workflow_name,
    )

Naming inconsistency: job_ids are actual database IDs (encoded), but dataset_ids are HIDs (plain integers). This is confusing and undocumented.
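
For illustration, a request under the current API mixes both semantics (the values below are hypothetical):

payload = {
    "from_history_id": "f2db41e1fa331b3e",  # encoded history database ID
    "job_ids": ["a799d38679e985db"],        # encoded job database IDs
    "dataset_ids": [3, 7],                  # plain HIDs, NOT encoded IDs
    "workflow_name": "My workflow",
}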

Mako UI

File: /templates/build_from_current_history.mako

How HIDs are passed (lines 21-22):

<input type="checkbox" id="as-input-${ encoded_id }" class="as-input"
       name="${data.history_content_type}_ids" value="${data.hid}" .../>

Key observations:

  • Form checkbox value is data.hid (not encoded ID)
  • Input naming uses dataset_ids / dataset_collection_ids (HIDs)
  • Job checkboxes use job.id (actual database ID, unencoded)

Job selection (line 162):

<input type="checkbox" name="job_ids" value="${job.id}" .../>

Note: For real jobs, job.id is database ID. For FakeJob, it's "fake_{dataset_id}".

Extraction Logic

File: /lib/galaxy/workflow/extract.py

HID usage in extract_steps() (lines 80-197):

def extract_steps(trans, history=None, job_ids=None, dataset_ids=None, ...):
    # Coerce to integers (HIDs; may arrive as strings from form submission)
    dataset_ids = [int(_) for _ in dataset_ids]
    dataset_collection_ids = [int(_) for _ in dataset_collection_ids]

    # Build hid_to_output_pair mapping
    hid_to_output_pair = {}

    # Input dataset steps - indexed by HID
    for i, hid in enumerate(dataset_ids):
        step = model.WorkflowStep()
        step.type = "data_input"
        hid_to_output_pair[hid] = (step, "output")  # Key is HID

WorkflowSummary HID resolution (lines 259-275):

def hid(self, object):
    if object.history_content_type == "dataset_collection":
        if object.id in self.hdca_hid_in_history:
            return self.hdca_hid_in_history[object.id]  # Mapped HID from copy chain
        elif object.history == self.history:
            return object.hid  # Same history - use native HID
        else:
            log.warning("extraction issue, using hdca hid from outside current history")
            return object.hid  # Fallback with warning

Proposed API Changes

Phase 1: Add ID-Based Parameters with Cross-History Support

Key design decision: The single-history restriction is removed. from_history_id becomes optional - it's only needed for backward compatibility and UI context, not for validation. Datasets/collections from any history the user can access are valid inputs.

Rationale for cross-history support:

  • Fixes copied dataset issues (#9161, #13823) more completely
  • Job cache can return outputs from jobs in other histories - ID extraction handles this
  • ID-based lookup is globally unique
  • Single-history was only necessary due to HID semantics
  • Enables dynamic upstream chain following in Vue UI

New parameter schema:

# In lib/galaxy/webapps/galaxy/api/workflows.py

# New ID-based params (preferred) - named to match Galaxy models
hda_ids: Optional[List[DecodedDatabaseIdField]] = None
hdca_ids: Optional[List[DecodedDatabaseIdField]] = None

# Legacy HID-based params (unchanged, still supported)
dataset_ids: Optional[List[int]] = None           # HIDs
dataset_collection_ids: Optional[List[int]] = None  # HIDs

# Shared params
job_ids: Optional[List[DecodedDatabaseIdField]] = None  # Already IDs
workflow_name: str

# Changed: from_history_id is now OPTIONAL for ID-based extraction
from_history_id: Optional[DecodedDatabaseIdField] = None
# Required only when using legacy HID params
# When provided with ID params: used for UI context, not validation

Updated extraction call:

if "hda_ids" in payload or "hdca_ids" in payload:
    # ID-based extraction (new path) - cross-history supported
    hda_ids = [self.decode_id(_) for _ in payload.get("hda_ids", [])]
    hdca_ids = [self.decode_id(_) for _ in payload.get("hdca_ids", [])]
    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    workflow_name = payload["workflow_name"]

    # from_history_id optional for ID-based extraction
    from_history_id = None
    if "from_history_id" in payload:
        from_history_id = self.decode_id(payload.get("from_history_id"))

    stored_workflow = extract_workflow_by_ids(
        trans=trans,
        user=trans.user,
        history_id=from_history_id,  # Optional context
        job_ids=job_ids,
        hda_ids=hda_ids,
        hdca_ids=hdca_ids,
        workflow_name=workflow_name,
    )
elif "from_history_id" in payload:
    # Legacy HID-based extraction (still fully supported)
    from_history_id = self.decode_id(payload.get("from_history_id"))
    history = self.history_manager.get_accessible(...)
    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    dataset_ids = payload.get("dataset_ids", [])
    dataset_collection_ids = payload.get("dataset_collection_ids", [])
    workflow_name = payload["workflow_name"]
    stored_workflow = extract_workflow(
        trans=trans,
        user=trans.user,
        history=history,
        job_ids=job_ids,
        dataset_ids=dataset_ids,
        dataset_collection_ids=dataset_collection_ids,
        workflow_name=workflow_name,
    )
else:
    raise exceptions.RequestParameterMissingException(
        "Either hda_ids/hdca_ids or from_history_id required"
    )

Phase 1b: New Extraction Function with Permission-Based Validation

File: /lib/galaxy/workflow/extract.py

Add extract_steps_by_ids() function with permission checks instead of history membership checks:

def extract_steps_by_ids(
    trans,
    history_id=None,  # Optional - for UI context only
    job_ids=None,
    hda_ids=None,           # Database IDs, not HIDs
    hdca_ids=None,          # Database IDs, not HIDs
    dataset_names=None,
    dataset_collection_names=None,
):
    """Extract workflow steps using database IDs for input selection.

    Unlike extract_steps() which uses HIDs, this function uses database IDs
    for input dataset/collection selection. This enables cross-history
    extraction and proper handling of copied datasets.

    Datasets/collections may be from any history the user can access.
    Permission-based validation replaces history membership checks.
    """
    hda_manager = trans.app.hda_manager
    hdca_manager = trans.app.hdca_manager

    job_ids = listify(job_ids)
    hda_ids = listify(hda_ids)
    hdca_ids = listify(hdca_ids)

    job_ids = [int(_) for _ in job_ids]
    hda_ids = [int(_) for _ in hda_ids]
    hdca_ids = [int(_) for _ in hdca_ids]

    # Load history if provided (for context, not validation)
    history = None
    if history_id:
        history = trans.sa_session.get(model.History, history_id)

    summary = WorkflowSummaryByIds(trans, history)
    steps = []
    step_labels = set()

    # Use ID-to-output mapping instead of HID-to-output
    id_to_output_pair = {}

    # Input dataset steps - keyed by database ID
    for i, dataset_id in enumerate(hda_ids):
        dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
        if not dataset:
            raise exceptions.ObjectNotFound(f"Dataset {dataset_id} not found")

        # Permission check instead of history membership check
        hda_manager.error_unless_accessible(dataset, trans.user)

        # Trace to original unconverted dataset for implicit conversions
        original_dataset = _trace_to_original(dataset)

        step = model.WorkflowStep()
        step.type = "data_input"
        name = dataset_names[i] if dataset_names else "Input Dataset"
        if name not in step_labels:
            step.label = name
            step_labels.add(name)
        step.tool_inputs = dict(name=name)

        # Key by ID, not HID
        id_to_output_pair[("dataset", original_dataset.id)] = (step, "output")
        steps.append(step)

    # Input collection steps - similar pattern with permission check
    for i, collection_id in enumerate(hdca_ids):
        collection = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, collection_id)
        if not collection:
            raise exceptions.ObjectNotFound(f"Collection {collection_id} not found")

        # Permission check on HDCA only (not individual elements)
        hdca_manager.error_unless_accessible(collection, trans.user)

        step = model.WorkflowStep()
        step.type = "data_collection_input"
        collection_type = collection.collection.collection_type
        name = dataset_collection_names[i] if dataset_collection_names else "Input Dataset Collection"
        if name not in step_labels:
            step.label = name
            step_labels.add(name)
        step.tool_inputs = dict(name=name, collection_type=collection_type)

        id_to_output_pair[("collection", collection_id)] = (step, "output")
        steps.append(step)

    # Tool steps - use IDs for connection lookup
    for job_id in job_ids:
        job = trans.app.job_manager.get_accessible_job(trans, job_id)
        tool_inputs, associations = step_inputs_by_id(trans, job)  # New function

        step = model.WorkflowStep()
        step.type = "tool"
        step.tool_id = job.tool_id
        step.tool_version = job.tool_version
        step.tool_inputs = tool_inputs

        # Connect inputs using IDs
        for (content_type, content_id), input_name in associations:
            key = (content_type, content_id)
            if key in id_to_output_pair:
                step_input = step.get_or_add_input(input_name)
                other_step, other_name = id_to_output_pair[key]
                conn = model.WorkflowStepConnection()
                conn.input_step_input = step_input
                conn.output_step = other_step
                conn.output_name = other_name

        steps.append(step)

        # Store output IDs
        for assoc in job.output_datasets + job.output_dataset_collection_instances:
            assoc_name = assoc.name
            if ToolOutputCollectionPart.is_named_collection_part_name(assoc_name):
                continue
            if assoc_name.startswith("__new_primary_file"):
                continue

            if hasattr(assoc, "dataset"):
                content_type = "dataset"
                content_id = assoc.dataset.id
            else:
                content_type = "collection"
                content_id = assoc.dataset_collection_instance.id

            id_to_output_pair[(content_type, content_id)] = (step, assoc.name)

    return steps


def _trace_to_original(dataset):
    """Trace implicit conversions back to the original unconverted dataset.

    NOTE: the attribute used to detect implicit conversions still needs
    verification (see TODO item 1 below).
    """
    # Recurse up the conversion chain until we reach an unconverted dataset
    if dataset.implicitly_converted_parent:
        return _trace_to_original(dataset.implicitly_converted_parent)
    return dataset

Phase 1c: Abstract Base Class for WorkflowSummary

Refactor to use abstract base class with concrete implementations for HID vs ID content-to-step mapping:

from abc import ABC, abstractmethod

class BaseWorkflowSummary(ABC):
    """Abstract base for workflow extraction summary with content-to-step mapping."""

    def __init__(self, trans, history=None):
        self.trans = trans
        self.history = history
        self.sa_session = trans.sa_session

    @abstractmethod
    def get_output_key(self, content):
        """Return the key used to map content to workflow steps.

        For HID-based: returns HID (integer)
        For ID-based: returns (content_type, id) tuple
        """
        pass

    @abstractmethod
    def resolve_input(self, job_input):
        """Resolve a job input to its output key for connection mapping."""
        pass

    # Shared methods that don't depend on HID vs ID...


class WorkflowSummary(BaseWorkflowSummary):
    """HID-based workflow summary (legacy)."""

    def __init__(self, trans, history):
        super().__init__(trans, history)
        # Existing HID mapping logic
        self.hda_hid_in_history = {}
        self.hdca_hid_in_history = {}
        # ... existing initialization

    def get_output_key(self, content):
        return self.hid(content)

    def resolve_input(self, job_input):
        # Existing HID-based resolution
        pass


class WorkflowSummaryByIds(BaseWorkflowSummary):
    """ID-based workflow summary (new)."""

    def __init__(self, trans, history=None):
        super().__init__(trans, history)

    def get_output_key(self, content):
        content_type = "collection" if hasattr(content, "collection") else "dataset"
        return (content_type, content.id)

    def resolve_input(self, job_input):
        # ID-based resolution - simpler, no HID translation needed
        if hasattr(job_input, "dataset"):
            return ("dataset", job_input.dataset.id)
        else:
            return ("collection", job_input.dataset_collection.id)

Response Changes

No changes needed to response format. The extracted workflow structure is identical regardless of whether HID or ID was used for input selection.


Cross-History Extraction Benefits

1. Fixes Copied Dataset Problem

From WORKFLOW_EXTRACTION_ISSUES.md (#9161):

When datasets are copied from other histories: all connections are broken, and tools from the original history are pulled in.

With cross-history ID extraction:

  • User copies dataset from History A to History B
  • Runs tools on copy in History B
  • Extraction request includes: datasets from History B, jobs from History B
  • The copied dataset is marked as an input (its ID in B)
  • Connections to jobs in B work correctly via IDs
  • No "foreign jobs" pulled from History A

2. Supports Job Cache Outputs

When job caching returns outputs from a job in another history, ID-based extraction can still trace connections because it doesn't require all outputs to be in the "current" history.

3. Enables Dynamic Upstream Chain Following

The Vue UI can dynamically load and display upstream job chains across history boundaries, allowing users to explore provenance and select inputs from any point in the chain.


Permission Model

Permission Checks Replace History Membership

The existing code relies on implicit security through history membership. ID-based extraction with cross-history support requires explicit permission checks:

Explicit Dataset Access Checks:

for dataset_id in hda_ids:
    dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
    if not dataset:
        raise ObjectNotFound(f"Dataset {dataset_id} not found")

    # Explicit permission check
    hda_manager.error_unless_accessible(dataset, trans.user)

Job Access Checks:

for job_id in job_ids:
    # get_accessible_job checks user owns job OR job is in accessible history
    job = trans.app.job_manager.get_accessible_job(trans, job_id)

Collection Access: Check HDCA accessibility only, not individual element HDAs.

Permission Scenarios:

Scenario                                  Should Work?  Permission Check
User's own dataset from another history   Yes           hda_manager.error_unless_accessible()
Shared dataset user can access            Yes           Galaxy's standard dataset access rules
Published dataset                         Yes           dataset.dataset.published
Private dataset from another user         No            Access check fails

Mako UI Fix

Recommendation: Implement minimal Mako fix to use new ID-based params.

Current (line 22):

name="${data.history_content_type}_ids" value="${data.hid}"

Updated:

name="${'hda_ids' if data.history_content_type == 'dataset' else 'hdca_ids'}" value="${ trans.app.security.encode_id(data.id) }"

This allows Mako users to benefit from ID-based extraction immediately while Vue development proceeds.


Vue UI Implementation

The Vue UI should use IDs throughout and enable dynamic upstream chain exploration.

Use IDs Throughout

API Client (client/src/api/workflowExtraction.ts):

export interface ExtractWorkflowPayload {
    from_history_id?: string;             // Optional - for context
    workflow_name: string;
    job_ids: string[];                    // Encoded database IDs
    hda_ids: string[];                    // Encoded database IDs (NEW)
    hdca_ids: string[];                   // Encoded database IDs (NEW)
}

Component Data Model (WorkflowExtraction.vue):

// Track inputs by encoded ID, not HID
const inputDatasets = ref<Map<string, { id: string; hid: number; name: string; type: string }>>(new Map());

function toggleInputDataset(output: ExtractionOutputDataset) {
    // Key by encoded ID
    if (inputDatasets.value.has(output.id)) {
        inputDatasets.value.delete(output.id);
    } else {
        inputDatasets.value.set(output.id, {
            id: output.id,
            hid: output.hid,
            name: output.name,
            type: output.history_content_type,
        });
    }
}

Form Submission:

async function onSubmit() {
    const hdaIds: string[] = [];
    const hdcaIds: string[] = [];

    inputDatasets.value.forEach((input) => {
        if (input.type === "dataset") {
            hdaIds.push(input.id);  // Use encoded ID, not HID
        } else {
            hdcaIds.push(input.id);
        }
    });

    await extractWorkflow({
        from_history_id: props.historyId,  // Optional context
        workflow_name: workflowName.value,
        job_ids: Array.from(selectedJobs.value),
        hda_ids: hdaIds,
        hdca_ids: hdcaIds,
    });
}

Display Still Uses HIDs

The UI should still display HIDs for user readability:

<span class="hid">{{ output.hid }}</span>
<span class="name">{{ output.name }}</span>

But form submission uses IDs, not HIDs.

Dynamic Upstream Chain (Future Enhancement)

The Vue UI should lazily load and display upstream job dependencies as the user explores, enabling selection of inputs from any point in the provenance chain. This feature leverages ID-based extraction's cross-history capability.
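
A minimal client-side sketch of such chain following, hedged: it assumes the creating_job field returned by GET /api/datasets/{id} and the inputs map returned by GET /api/jobs/{id}?full=true; the server URL and API key are placeholders, and whether these endpoints suffice is an open question (see TODO item 6 below). Written in Python for brevity:

import requests

GALAXY = "https://usegalaxy.example"  # hypothetical server
HEADERS = {"x-api-key": "REPLACE_ME"}

def upstream_jobs(dataset_id, seen=None):
    """Walk creating jobs upstream from a dataset, depth-first."""
    seen = set() if seen is None else seen
    dataset = requests.get(f"{GALAXY}/api/datasets/{dataset_id}", headers=HEADERS).json()
    job_id = dataset.get("creating_job")
    if not job_id or job_id in seen:
        return []
    seen.add(job_id)
    job = requests.get(f"{GALAXY}/api/jobs/{job_id}", params={"full": "true"}, headers=HEADERS).json()
    chain = [job]
    # Recurse into each HDA input; IDs cross history boundaries freely
    for job_input in (job.get("inputs") or {}).values():
        if job_input.get("src") == "hda":
            chain.extend(upstream_jobs(job_input["id"], seen))
    return chain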


FakeJob Handling

FakeJob IDs use "fake_{dataset_id}" format for datasets without real jobs (e.g., uploaded files). The ID-based API continues to accept FakeJob IDs in this format, decoding them to the underlying dataset ID.
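
A hedged sketch of how the server side could split real from fake job IDs (the helper name is hypothetical; whether the embedded dataset ID needs decoding depends on the caller):

def split_job_id(raw_id, decode_id):
    """Return ("dataset", id) for FakeJob ids, ("job", id) for real jobs."""
    if isinstance(raw_id, str) and raw_id.startswith("fake_"):
        # FakeJob: the suffix is the underlying dataset's ID
        return "dataset", int(raw_id[len("fake_"):])
    return "job", decode_id(raw_id)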


Testing Requirements

Unit Tests

File: lib/galaxy_test/unit/workflow/test_extract.py (new)

class TestExtractStepsByIds:
    """Tests for ID-based extraction."""

    def test_basic_extraction_by_id(self):
        """Single dataset marked as input using ID."""
        pass

    def test_collection_extraction_by_id(self):
        """Dataset collection marked as input using ID."""
        pass

    def test_invalid_dataset_id(self):
        """Non-existent dataset ID raises ObjectNotFound."""
        pass

    def test_inaccessible_dataset_rejected(self):
        """Dataset user cannot access raises ItemAccessibilityException."""
        pass

    def test_cross_history_same_user(self):
        """Dataset from different history (same user) works."""
        pass

    def test_implicit_conversion_traces_to_original(self):
        """Implicitly converted dataset traces back to original."""
        pass

API Integration Tests

File: lib/galaxy_test/api/test_workflow_extraction.py (extend)

class TestWorkflowExtractionByIds(ApiTestCase):
    """Tests for ID-based workflow extraction API."""

    def test_extract_with_hda_ids(self):
        """Extract workflow using hda_ids param."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)

        # Run a tool
        run_result = self.dataset_populator.run_tool(
            tool_id="cat1",
            inputs={"input1": {"src": "hda", "id": hda["id"]}},
            history_id=history_id,
        )
        self.dataset_populator.wait_for_history(history_id)

        # Extract using IDs
        response = self._post(
            "workflows",
            data={
                "from_history_id": history_id,
                "workflow_name": "Test Workflow",
                "job_ids": [run_result["jobs"][0]["id"]],
                "hda_ids": [hda["id"]],  # Encoded ID, not HID
            },
        )
        self._assert_status_code_is(response, 200)
        workflow = response.json()
        assert "id" in workflow

    def test_extract_without_from_history_id(self):
        """Extract workflow using only IDs, no from_history_id."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)

        response = self._post(
            "workflows",
            data={
                # No from_history_id
                "workflow_name": "Test Workflow",
                "hda_ids": [hda["id"]],
            },
        )
        self._assert_status_code_is(response, 200)

    def test_cross_history_extraction(self):
        """Extract workflow with datasets from multiple histories."""
        history_a = self.dataset_populator.new_history()
        history_b = self.dataset_populator.new_history()
        hda_a = self.dataset_populator.new_dataset(history_a, wait=True)
        hda_b = self.dataset_populator.new_dataset(history_b, wait=True)

        response = self._post(
            "workflows",
            data={
                "workflow_name": "Cross-History Workflow",
                "hda_ids": [hda_a["id"], hda_b["id"]],
            },
        )
        self._assert_status_code_is(response, 200)

    def test_inaccessible_dataset_rejected(self):
        """Dataset from another user's private history rejected."""
        # Create dataset as different user
        # Try to extract - should fail with permission error
        pass

    def test_backward_compat_hid_params(self):
        """Legacy HID params still work."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)

        response = self._post(
            "workflows",
            data={
                "from_history_id": history_id,
                "workflow_name": "Test Workflow",
                "dataset_ids": [1],  # HID, not encoded ID
            },
        )
        self._assert_status_code_is(response, 200)

    def test_copied_dataset_with_ids(self):
        """Copied datasets work correctly with ID-based extraction."""
        # Create history A with dataset
        # Copy dataset to history B
        # Run tool on copy in history B
        # Extract workflow from history B using IDs
        # Verify connections are correct
        pass

Selenium Tests

File: lib/galaxy_test/selenium/test_workflow_extraction.py

class TestWorkflowExtractionVue(SeleniumTestCase):
    """Selenium tests for Vue extraction UI using IDs."""

    def test_extraction_sends_ids_not_hids(self):
        """Verify Vue UI submits encoded IDs, not HIDs."""
        # Intercept network request
        # Verify payload contains hda_ids with encoded IDs
        pass

TODO / Open Questions

Items to resolve before or during implementation:

  1. _trace_to_original implementation: Verify the implicit conversion tracing logic is correct. Need to check how Galaxy marks implicit conversions (is implicitly_converted_parent the right attribute?).

  2. step_inputs_by_id function: Design this new function - it should mirror step_inputs but return ID-based associations instead of HID-based ones (a hedged sketch follows this list).

  3. Abstract base class boundaries: Finalize which methods are shared vs abstract in BaseWorkflowSummary. Need to review existing WorkflowSummary to identify all HID-dependent code paths.

  4. Mako template testing: After Mako fix, verify form submission works correctly with new param names. May need JavaScript changes if form handling is client-side.

  5. Vue UI architecture: Detailed Vue component design is out of scope for this plan but referenced. Ensure the API design here supports the Vue UI requirements from WORKFLOW_EXTRACTION_VUE_CONVERSION_PLAN.md.

  6. Dynamic upstream chain API: Determine if existing /api/jobs/{id} provides enough info for Vue UI's dynamic chain following, or if new endpoints are needed. (Deferred - out of scope for initial implementation.)
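
For TODO item 2, a possible shape for step_inputs_by_id, hedged: the association building leans on the Job model's input_datasets and input_dataset_collections relationships, while the tool_inputs construction is deliberately simplified (a real implementation would mirror step_inputs()):

def step_inputs_by_id(trans, job):
    """Sketch: like step_inputs(), but associations are keyed by database ID."""
    # Simplified; step_inputs() rebuilds proper tool state from the job
    tool_inputs = {p.name: p.value for p in job.parameters}
    associations = []
    for assoc in job.input_datasets:
        associations.append((("dataset", assoc.dataset.id), assoc.name))
    for assoc in job.input_dataset_collections:
        associations.append((("collection", assoc.dataset_collection.id), assoc.name))
    return tool_inputs, associations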

Multiple History Support for ID-Based Workflow Extraction

Executive Summary

Yes, the plan should be updated to remove the single-history limitation.

The single-history restriction in WORKFLOW_EXTRACTION_HID_TO_ID_ISSUE.md is an artificial constraint inherited from HID-based assumptions, not a technical requirement. The extraction logic does not fundamentally require datasets to be in one history - it needs a set of datasets/jobs and their relationships. ID-based extraction unlocks the ability to support cross-history datasets, which would fix the "copied dataset problem" (#9161, #13823) more comprehensively.


Analysis: Why Single-History Exists in Current Plan

In the HID-Based System

The current HID-based extraction requires a single history because:

  1. HID is history-scoped: HIDs only have meaning within a specific history. hid=5 in History A is unrelated to hid=5 in History B.

  2. WorkflowSummary iterates one history: Line 282 of extract.py:

    for content in self.history.visible_contents:

    This builds the hid_to_output_pair mapping by scanning one history.

  3. HID lookup requires context: The hid() method (lines 259-275) tries to map cross-history objects back to "current history" HIDs, but this is fragile.

In the Proposed ID-Based System

The plan carries over from_history_id as a required parameter, but this is not technically necessary when using IDs:

# From the plan (lines 226-232):
if dataset.history_id != history.id:
    # Cross-history dataset - could be supported in future
    raise exceptions.RequestParameterInvalidException(
        f"Dataset {dataset_id} is not in history {history.id}"
    )

This check is artificially restrictive. With ID-based lookup:

  • Dataset ID uniquely identifies the dataset across all histories
  • No HID collision is possible
  • Connection tracing uses IDs not HIDs

Technical Feasibility of Cross-History Extraction

What Would Need to Change

1. API Changes

Current plan:

from_history_id: DecodedDatabaseIdField  # Required
input_dataset_ids: List[DecodedDatabaseIdField]

Updated for cross-history:

from_history_id: Optional[DecodedDatabaseIdField] = None  # Optional, for UI context only
input_dataset_ids: List[DecodedDatabaseIdField]  # Required
input_dataset_collection_ids: List[DecodedDatabaseIdField]

The from_history_id becomes optional/context-only rather than required.

2. Extraction Function Changes

Remove history-scoped validation:

# OLD (restrictive):
if dataset.history_id != history.id:
    raise RequestParameterInvalidException(...)

# NEW (permissive with access check):
if not self.hda_manager.is_accessible(dataset, trans.user):
    raise ItemAccessibilityException(...)

3. WorkflowSummary Changes

For ID-based extraction, WorkflowSummary would not iterate history.visible_contents. Instead:

def extract_steps_by_ids(trans, input_dataset_ids, input_collection_ids, job_ids):
    # Load datasets directly by ID
    for dataset_id in input_dataset_ids:
        dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
        # No history check needed, just permission check

    # Load jobs directly by ID
    for job_id in job_ids:
        job = trans.sa_session.get(model.Job, job_id)
        # Trace connections via IDs not HIDs

4. Connection Mapping Uses IDs

Current (HID-based):

hid_to_output_pair[hid] = (step, "output")
# Later...
if other_hid in hid_to_output_pair:
    # connect

ID-based:

id_to_output_pair[(content_type, content_id)] = (step, "output")
# Later...
if (content_type, content_id) in id_to_output_pair:
    # connect

This is already proposed in the plan. It naturally supports cross-history because IDs are globally unique.


Benefits of Cross-History Extraction

1. Fixes Copied Dataset Problem Completely

From WORKFLOW_EXTRACTION_ISSUES.md (#9161):

When datasets are copied from other histories: all connections are broken, and tools from the original history are pulled in.

With cross-history ID extraction:

  • User copies dataset from History A to History B
  • Runs tools on copy in History B
  • Extraction request includes: datasets from History B, jobs from History B
  • The copied dataset is marked as an input (its ID in B)
  • Connections to jobs in B work correctly via IDs
  • No "foreign jobs" pulled from History A

2. Supports Job Cache Outputs

When job caching returns outputs from a job in another history, ID-based extraction can still trace connections because it doesn't require all outputs to be in the "current" history.

3. More Flexible Workflow Construction

Users could potentially construct workflows from multiple analysis sessions across histories:

  • Select outputs from History A (training data processing)
  • Select outputs from History B (validation data processing)
  • Combine into single workflow

Permission Model Considerations

Current State

The existing code has minimal permission checking in extract.py:

# Line 97-98:
# Find each job, for security we (implicitly) check that they are
# associated with a job in the current history.

This "implicit" security relies on:

  1. User can only access their own history's contents
  2. History access is checked at API level (history_manager.get_accessible)

Required Changes for Cross-History

Explicit Dataset Access Checks:

for dataset_id in input_dataset_ids:
    dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
    if not dataset:
        raise ObjectNotFound(f"Dataset {dataset_id} not found")

    # Explicit permission check
    if not hda_manager.is_accessible(dataset, trans.user):
        raise ItemAccessibilityException(
            f"Dataset {dataset_id} is not accessible to user"
        )

Explicit Job Access Checks:

for job_id in job_ids:
    job = job_manager.get_accessible_job(trans, job_id)
    # This already exists in jobs.py:340

Permission Scenarios:

Scenario                                  Should Work?  Permission Check
User's own dataset from another history   Yes           dataset.user_id == trans.user.id
Shared dataset user can access            Yes           Galaxy's standard dataset access rules
Published dataset                         Yes           dataset.dataset.published
Private dataset from another user         No            Access check fails
Anonymous user's dataset (session-based)  Complex       May need session tracking

Recommendation

Use existing HDAManager.get_accessible() or similar for each dataset/collection. This leverages Galaxy's existing permission model without reinventing it.
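
For example (a one-line sketch; the manager attribute follows this document's earlier snippets, and get_accessible raises when access is denied):

# Resolve each requested input through the manager's accessibility check
hda = trans.app.hda_manager.get_accessible(dataset_id, trans.user)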


UI Implications

Current UI Flow

  1. User opens extraction from current history
  2. UI shows all jobs/datasets from that history
  3. User selects items
  4. Extraction runs on single history

Cross-History UI Options

Option A: History-Scoped UI (Minimal Change)

  • Keep current UI showing one history at a time
  • User selects items from current history
  • For copied datasets, UI could show "(copied from History X)"
  • API accepts cross-history IDs but UI doesn't expose it directly

Option B: Multi-History Selection (Future Enhancement)

  • UI allows browsing multiple histories
  • User can select items from different histories
  • More complex but more powerful

Recommendation: Start with Option A. The primary benefit of cross-history support is handling copied datasets correctly, which happens transparently when using IDs. The UI can show the current history but the backend accepts any accessible dataset.


Required Changes to the Plan

1. Make from_history_id Optional

# Changed from required to optional
from_history_id: Optional[DecodedDatabaseIdField] = None

# If provided, used for:
# - Backward compatibility
# - UI context (which history was open)
# - Fallback for HID-based params

2. Remove Cross-History Validation Error

In extract_steps_by_ids():

# REMOVE this code:
if dataset.history_id != history.id:
    raise exceptions.RequestParameterInvalidException(...)

# REPLACE with:
self.hda_manager.error_unless_accessible(dataset, trans.user)

3. Add Permission Checks

def extract_steps_by_ids(trans, input_dataset_ids, ...):
    hda_manager = trans.app.hda_manager  # or inject via DI

    for dataset_id in input_dataset_ids:
        dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
        if not dataset:
            raise ObjectNotFound(f"Dataset {dataset_id} not found")
        hda_manager.error_unless_accessible(dataset, trans.user)

        # ... rest of step creation

4. Update API Documentation

:param from_history_id: Optional. The history context for extraction.
    Not required for ID-based extraction but may be used for UI purposes.
:param input_dataset_ids: Dataset IDs to use as workflow inputs.
    Datasets may be from any history the user can access.

5. Update Test Cases

Add tests for:

  • Extracting with datasets from different histories (same user)
  • Permission denied for inaccessible cross-history dataset
  • Mixed: some datasets from current history, some from another

Recommendation

Update the plan to support cross-history extraction. Specifically:

  1. Phase 1 (Initial Implementation):

    • Add ID-based params as planned
    • Remove the history_id != history.id validation
    • Add explicit permission checks per dataset/collection
    • Keep from_history_id optional (for UI context, backward compat)
  2. Phase 2 (Vue UI):

    • UI continues to show current history
    • Copied datasets work correctly without special handling
    • Future: consider multi-history selection UI
  3. Documentation:

    • Document that ID-based extraction supports cross-history datasets
    • Note permission requirements

Justification:

  • Fixes copied dataset bugs (#9161, #13823) more completely
  • Minimal additional complexity (just permission checks)
  • ID-based lookup naturally supports cross-history
  • Single-history was only needed due to HID semantics

Unresolved Questions

  1. Should from_history_id be required for backward compat? Or can we make it fully optional from the start?

  2. Anonymous users: If a session-based user copies datasets between histories, does session tracking support cross-history access?

  3. Job access: When job cache is used, the job may be in a different history. Should we allow referencing jobs from other histories if user can access them?

  4. UI discovery: How does a user discover/select datasets from other histories for extraction? Is this needed in initial implementation or can we rely on copied datasets being in current history?

  5. Shared histories: If History A is shared with user B, can user B extract workflows using datasets from History A? (Probably yes if access checks pass, but need to verify.)

  6. Collection elements: For copied collections, do all elements need to be accessible, or just the HDCA itself?

Workflow Extraction: HID vs ID Analysis

Executive Summary

This document analyzes the trade-offs of switching Galaxy's workflow extraction API from HID-based (History ID numbers) to ID-based (database primary key) references. The current HID approach prioritizes human readability but has fundamental limitations: HIDs are not unique within a history (implicit conversions share HIDs), and cross-history operations cause HID collisions. Switching to IDs would enable correct extraction at the cost of increased API complexity.


1. Current HID-Based Approach

How It Works

The extraction API accepts HIDs (History ID numbers) to specify which datasets and jobs to include:

# API parameters
job_ids: List[int]           # Actually job database IDs (confusing naming)
dataset_ids: List[int]       # HIDs of datasets to mark as inputs
dataset_collection_ids: List[int]  # HIDs of collections to mark as inputs

Key implementation details:

  1. WorkflowSummary iterates history.visible_contents (HDAs/HDCAs sorted by HID)
  2. Builds mappings: hda_hid_in_history[hda_id] -> hid, hdca_hid_in_history[hdca_id] -> hid
  3. Follows copied_from_* chains to find the original dataset, then maps the original's ID to the current HID (sketched below)
  4. Connections between steps use HIDs to locate source/target datasets
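
A condensed sketch of that mapping construction (attribute names follow Galaxy's HDA model; error handling and the HDCA analogue are omitted):

hda_hid_in_history = {}
for content in history.visible_contents:
    if content.history_content_type != "dataset":
        continue
    original = content
    # Walk the copied_from chain back to the source HDA
    while original.copied_from_history_dataset_association:
        original = original.copied_from_history_dataset_association
    # Map the original's database ID to the HID the copy has here
    hda_hid_in_history[original.id] = content.hid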

Where HIDs Are Used

Location                      Purpose
Mako template                 Display and checkbox values for job/dataset selection
dataset_ids param             Identify datasets to mark as workflow inputs
dataset_collection_ids param  Identify collections to mark as inputs
WorkflowSummary.hid()         Resolve dataset references during step extraction
Connection building           Find producing step by matching HIDs

2. Pros of Current HID Approach

Human Readability

  • HIDs match what users see in history panel (1, 2, 3, etc.)
  • Debugging extraction issues easier: "dataset 5 is disconnected"
  • API calls human-readable: dataset_ids=[1, 3, 5]

Simplicity

  • Single history scope assumption simplifies implementation
  • No need to validate cross-history permissions

Note: HIDs are NOT unique within a history. Multiple HDAs can share the same HID when implicit format conversions occur (e.g., original HID=5, implicitly converted HID=5). The ImplicitlyConvertedDatasetAssociation table links these datasets, and the UI groups them by HID.

Backward Compatibility

  • Existing UI, API clients, and scripts rely on HIDs
  • Mako template directly uses HIDs as form values

Natural Ordering

  • HIDs reflect creation order in history
  • Step ordering in extracted workflow roughly matches execution order

3. Cons of Current HID Approach

Fundamental: Copied Dataset Problem

The most critical limitation. When datasets are copied between histories:

  1. Copied HDA's creating_job_associations points to job in original history
  2. That job's inputs/outputs reference HIDs in original history
  3. HID collision: HID 5 in History A is different dataset than HID 5 in History B
  4. Result: broken connections, wrong jobs pulled in, extraction failures

Affected issues: #9161, #13823, #12236, possibly #21336

Fundamental: Single-History Limitation

  • Cannot extract workflows involving datasets from multiple histories
  • Job cache hits may reference jobs in different history contexts
  • Subworkflow extraction can't reference cross-history provenance

Fundamental: HID Non-Uniqueness (Implicit Conversions) - PRIORITY 1

This is the highest priority fix. See WORKFLOW_EXTRACTION_IMPLICIT_CONVERSION.md for detailed analysis.

HIDs are NOT unique within a single history. When Galaxy performs implicit format conversions:

  1. Original dataset: HID=5, ID=100
  2. Implicitly converted dataset: HID=5, ID=101
  3. Another conversion: HID=5, ID=102

The ImplicitlyConvertedDatasetAssociation table links these. The UI groups them by HID, but:

  • HID-based lookups can return wrong dataset when multiple share the same HID
  • Connection building may connect to converted variant instead of original
  • API parameter dataset_ids=[5] is ambiguous when HID=5 has multiple datasets

Important clarification: Runtime handles the conversion itself (same code path for workflows and interactive execution). The issue is dataset selection during extraction, not format handling at runtime. When extraction picks the wrong dataset due to HID ambiguity, the workflow may still run (runtime triggers conversion as needed), but provenance and job caching can be incorrect.

Existing mitigation: basic.py:2350-2380 has logic to prefer originals over conversions when building tool parameters. However, extract.py lacks equivalent logic - it filters out converted datasets via visible_contents but doesn't explicitly handle HID collisions when they occur.
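
A hedged sketch of what equivalent disambiguation could look like in extract.py (the implicitly_converted_parent_datasets relationship follows the ImplicitlyConvertedDatasetAssociation model; exact attribute names need verification):

def resolve_hid_collision(history, hid):
    """Among HDAs sharing a HID, prefer the unconverted original."""
    candidates = [hda for hda in history.datasets if hda.hid == hid and not hda.deleted]
    for candidate in candidates:
        # An original is never the child side of an implicit conversion
        if not candidate.implicitly_converted_parent_datasets:
            return candidate
    return candidates[0] if candidates else None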

HID Instability

  • Deleting/purging datasets doesn't reassign HIDs (gaps appear)
  • Copied datasets get new HIDs, breaking the mental model
  • HID 5 in current history might be copy of HID 100 in source history

Ambiguity in API

  • job_ids are actually database IDs (not HIDs)
  • dataset_ids are HIDs
  • This inconsistency is confusing

Cannot Support Advanced Features

  • Cross-history workflow extraction impossible
  • Job cache provenance tracing broken
  • Multi-history lineage reconstruction not feasible

4. Pros of Switching to IDs

Correctness

  • Database IDs are globally unique across all histories AND within a history
  • copied_from_* chains can be followed without HID translation
  • No ambiguity about which dataset is referenced, even when multiple HDAs share same HID (implicit conversions)
  • Resolves both cross-history HID collisions AND intra-history HID non-uniqueness

Enables Cross-History Operations

With ID-based API:

  • Extract workflow from datasets spanning multiple histories
  • Trace provenance through job cache hits to original jobs
  • Build workflows from datasets copied from shared histories, libraries, etc.

Consistency

  • All identifiers would be database IDs (jobs already use IDs)
  • Matches REST API conventions elsewhere in Galaxy
  • Encoded IDs are standard Galaxy API format

Future-Proofing

  • Enables ToolRequest-based extraction (which uses IDs)
  • Supports eventual "reconstruct workflow from any output" feature
  • Compatible with federated Galaxy scenarios

Eliminates HID Resolution Bugs

Current code has complex HID resolution:

def hid(self, object):
    if object.id in self.hdca_hid_in_history:
        return self.hdca_hid_in_history[object.id]  # Use mapped HID
    elif object.history == self.history:
        return object.hid  # Same history
    else:
        return object.hid  # Fallback with warning

ID-based approach eliminates this entirely.


5. Cons of Switching to IDs

Human Readability Loss

  • Encoded IDs like f2db41e1fa331b3e are opaque
  • Debugging harder: "dataset f2db41e1fa331b3e is disconnected"
  • API calls harder to construct manually

UI Complexity Increase

  • Selection UI must map between displayed HIDs and submitted IDs
  • Error messages need to translate IDs back to HIDs for user
  • Form handling more complex

Backward Compatibility Break

  • Existing API clients sending HIDs would break
  • Scripts using dataset_ids=[1, 2, 3] would fail
  • Mako template overhaul required (already planned per #17506)

Implementation Effort

  • All HID resolution code needs replacement
  • Tests need updating
  • API versioning or migration path needed

Security Considerations

  • IDs expose internal database structure (mitigated by encoding)
  • Cross-history extraction needs permission validation for each ID
  • More attack surface for IDOR vulnerabilities

Potential for User Confusion

  • Users might paste wrong IDs from different histories
  • No visual validation that ID belongs to expected history
  • Error messages harder to make actionable

6. Hybrid Approaches

Option A: Accept Both, Prefer IDs

dataset_ids: List[EncodedId]  # Primary (database IDs, encoded)
dataset_hids: List[int]       # Legacy support (HIDs, single history only)

  • New clients use IDs, old clients use HIDs
  • HID mode restricted to single-history extraction
  • ID mode enables cross-history

Pros: Backward compatible, gradual migration
Cons: Two code paths to maintain, API complexity

Option B: IDs with HID Annotations

datasets: List[{
    "id": "f2db41e1fa331b3e",
    "hid": 5,  # Optional, for display/debugging
    "history_id": "a799d38679e985db"  # Optional, for cross-history
}]

  • IDs are authoritative
  • HIDs included for human readability
  • History context explicit

Pros: Best of both worlds
Cons: Verbose API, complex validation

Option C: History-Scoped IDs

sources: List[{
    "history_id": "a799d38679e985db",
    "dataset_id": "f2db41e1fa331b3e"  # ID within that history
}]

  • Explicit history scoping
  • Supports single and multi-history modes
  • Validates dataset exists in specified history

Pros: Clear semantics, safe
Cons: Verbose for single-history case


7. Recommendation

Recommended approach: Option A (Accept Both, Prefer IDs) with eventual deprecation of HID mode.

Justification

  1. The HID-based approach has fundamental correctness issues

    • Cross-history: 3+ open issues directly caused by HID approach (#9161, #13823, #12236)
    • Intra-history (Priority 1): HID non-uniqueness from implicit conversions causes ambiguous references. Per implicit conversion research, this is not mitigated by runtime - it causes real dataset selection bugs.
    • Workaround of "download and re-upload" is unacceptable UX
    • Only ID-based approach can fundamentally fix both problems
  2. Backward compatibility matters

    • Abrupt API break would affect external tools, scripts
    • HID mode can serve legacy clients indefinitely
    • Mako->Vue conversion (#17506) is natural transition point
  3. Cross-history extraction is valuable

    • Users already expect to copy datasets and have extraction work
    • Job caching benefits require cross-history provenance tracing
    • Future features (federated Galaxy, shared lineage) depend on this
  4. Implementation is incremental

    • Add ID-based code path alongside existing HID path
    • New Vue UI uses IDs from start
    • Deprecate HID path after migration period (2-3 releases)

Implementation Sketch

# Phase 1: Add ID support
class WorkflowExtractPayload(Model):
    # New ID-based params
    dataset_ids: Optional[List[DecodedDatabaseIdField]] = None
    dataset_collection_ids: Optional[List[DecodedDatabaseIdField]] = None

    # Legacy HID-based params (deprecated)
    dataset_hids: Optional[List[int]] = None
    dataset_collection_hids: Optional[List[int]] = None

    # Shared
    job_ids: Optional[List[DecodedDatabaseIdField]] = None  # Already IDs
    workflow_name: Optional[str] = None
    from_history_id: DecodedDatabaseIdField

# Phase 2: Vue UI uses IDs
# Phase 3: Deprecation warnings on HID params
# Phase 4: Remove HID params (v26.0+?)

Success Criteria

  1. All copied dataset issues (#9161, #13823, #12236) are fixed when using ID mode
  2. Existing HID-based clients continue to work (single history only)
  3. New Vue UI supports cross-history extraction when provenance exists
  4. Test coverage for both modes during transition

8. Unresolved Questions

  1. What permission model for cross-history extraction? Must user own all histories, or just have access to the datasets?
  2. Should ID mode auto-discover required jobs from other histories, or require explicit specification?
  3. How to handle UI when extracted workflow would span multiple histories - show warning? prevent? allow?
  4. Timeline for HID deprecation relative to Mako->Vue conversion?
  5. Does ToolRequest-based extraction (proposed solution in ISSUES doc) change this analysis? Would it use IDs inherently?
