Galaxy's workflow extraction API uses HIDs (History ID numbers) to identify datasets for workflow input selection. This creates limitations for cross-history workflows and handling copied datasets:
- Cross-history restrictions: HID-based identification requires a single history context
- Copied dataset issues: Datasets copied between histories lose proper job chain tracing
This plan details migrating from HID-based to ID-based dataset identification, enabling:
- Cross-history workflow extraction (datasets from multiple histories)
- Proper handling of copied datasets and job cache hits
- Dynamic upstream job chain following in the new Vue UI
- API consistency (the current `dataset_ids` param uses HIDs despite its name)
Note: Anonymous users cannot create workflows in Galaxy, so these APIs are only relevant for authenticated users.
Route: POST /api/workflows with from_history_id param
File: /lib/galaxy/webapps/galaxy/api/workflows.py (lines 196-298)
Current Parameters:
```python
from_history_id: str               # Encoded history ID (required for extraction)
job_ids: List[str]                 # Encoded job database IDs - decoded to int
dataset_ids: List[int]             # HIDs (not encoded IDs!) - used as-is
dataset_collection_ids: List[int]  # HIDs (not encoded IDs!) - used as-is
workflow_name: str                 # Name for extracted workflow
```

Key code (lines 278-298):
```python
if "from_history_id" in payload:
    from_history_id = payload.get("from_history_id")
    from_history_id = self.decode_id(from_history_id)
    history = self.history_manager.get_accessible(...)
    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    dataset_ids = payload.get("dataset_ids", [])  # HIDs! Not decoded
    dataset_collection_ids = payload.get("dataset_collection_ids", [])  # HIDs!
    workflow_name = payload["workflow_name"]
    stored_workflow = extract_workflow(
        trans=trans,
        user=trans.user,
        history=history,
        job_ids=job_ids,
        dataset_ids=dataset_ids,  # Passed as HIDs
        dataset_collection_ids=dataset_collection_ids,  # Passed as HIDs
        workflow_name=workflow_name,
    )
```

Naming inconsistency: `job_ids` are actual database IDs (encoded), but `dataset_ids` are HIDs (plain integers). This is confusing and undocumented.
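To make the mixed semantics concrete, here is a hypothetical pair of request payloads; the encoded ID strings are invented for illustration:

```python
# Hypothetical legacy payload: job_ids carry encoded database IDs while
# dataset_ids carry plain integer HIDs (the encoded strings are invented).
legacy_payload = {
    "from_history_id": "f2db41e1fa331b3e",  # encoded history database ID
    "job_ids": ["1cd8e2f6b131e891"],        # encoded job database IDs
    "dataset_ids": [3, 7],                  # HIDs: positions within the history
    "workflow_name": "Extracted workflow",
}

# The same selection under the proposed ID-based params is uniform:
id_payload = {
    "workflow_name": "Extracted workflow",
    "job_ids": ["1cd8e2f6b131e891"],
    "hda_ids": ["a799d38679e985db"],        # encoded database IDs throughout
}
```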
File: /templates/build_from_current_history.mako
How HIDs are passed (lines 21-22):
```mako
<input type="checkbox" id="as-input-${ encoded_id }" class="as-input"
       name="${data.history_content_type}_ids" value="${data.hid}" .../>
```

Key observations:
- Form checkbox `value` is `data.hid` (not the encoded ID)
- Input naming uses `dataset_ids`/`dataset_collection_ids` (HIDs)
- Job checkboxes use `job.id` (the actual database ID, unencoded)

Job selection (line 162):

```mako
<input type="checkbox" name="job_ids" value="${job.id}" .../>
```

Note: For real jobs, `job.id` is the database ID. For FakeJob, it's `"fake_{dataset_id}"`.
File: /lib/galaxy/workflow/extract.py
HID usage in extract_steps() (lines 80-197):
```python
def extract_steps(trans, history=None, job_ids=None, dataset_ids=None, ...):
    # Convert to integers (they're already int HIDs from the API)
    dataset_ids = [int(_) for _ in dataset_ids]
    dataset_collection_ids = [int(_) for _ in dataset_collection_ids]
    # Build hid_to_output_pair mapping
    hid_to_output_pair = {}
    # Input dataset steps - indexed by HID
    for i, hid in enumerate(dataset_ids):
        step = model.WorkflowStep()
        step.type = "data_input"
        hid_to_output_pair[hid] = (step, "output")  # Key is HID
```

WorkflowSummary HID resolution (lines 259-275):
```python
def hid(self, object):
    if object.history_content_type == "dataset_collection":
        if object.id in self.hdca_hid_in_history:
            return self.hdca_hid_in_history[object.id]  # Mapped HID from copy chain
        elif object.history == self.history:
            return object.hid  # Same history - use native HID
        else:
            log.warning("extraction issue, using hdca hid from outside current history")
            return object.hid  # Fallback with warning
```

Key design decision: The single-history restriction is removed. `from_history_id` becomes optional - it is only needed for backward compatibility and UI context, not for validation. Datasets and collections from any history the user can access are valid inputs.
Rationale for cross-history support:
- Fixes copied dataset issues (#9161, #13823) more completely
- Job cache can return outputs from jobs in other histories - ID extraction handles this
- ID-based lookup is globally unique
- Single-history was only necessary due to HID semantics
- Enables dynamic upstream chain following in Vue UI
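The "globally unique" point can be illustrated with minimal stand-in objects (these are not Galaxy models, just a sketch of the two numbering schemes):

```python
# Minimal stand-ins (not Galaxy models) illustrating HID vs database ID scope.
class FakeHDA:
    _next_id = 1  # class-level counter plays the role of the database sequence

    def __init__(self, history):
        self.id = FakeHDA._next_id  # globally unique "database ID"
        FakeHDA._next_id += 1
        history.append(self)
        self.hid = len(history)     # HID: position within this one history

history_a, history_b = [], []
a1 = FakeHDA(history_a)
b1 = FakeHDA(history_b)

assert a1.hid == b1.hid == 1  # HIDs collide across histories
assert a1.id != b1.id         # database IDs never do
```

Because HIDs collide, HID-keyed extraction cannot span histories without a disambiguating `from_history_id`; ID-keyed extraction needs no such context.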
New parameter schema:
```python
# In lib/galaxy/webapps/galaxy/api/workflows.py

# New ID-based params (preferred) - named to match Galaxy models
hda_ids: Optional[List[DecodedDatabaseIdField]] = None
hdca_ids: Optional[List[DecodedDatabaseIdField]] = None

# Legacy HID-based params (unchanged, still supported)
dataset_ids: Optional[List[int]] = None  # HIDs
dataset_collection_ids: Optional[List[int]] = None  # HIDs

# Shared params
job_ids: Optional[List[DecodedDatabaseIdField]] = None  # Already IDs
workflow_name: str

# Changed: from_history_id is now OPTIONAL for ID-based extraction
from_history_id: Optional[DecodedDatabaseIdField] = None
# Required only when using legacy HID params
# When provided with ID params: used for UI context, not validation
```

Updated extraction call:
```python
if "hda_ids" in payload or "hdca_ids" in payload:
    # ID-based extraction (new path) - cross-history supported
    hda_ids = [self.decode_id(_) for _ in payload.get("hda_ids", [])]
    hdca_ids = [self.decode_id(_) for _ in payload.get("hdca_ids", [])]
    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    workflow_name = payload["workflow_name"]
    # from_history_id is optional for ID-based extraction
    from_history_id = None
    if "from_history_id" in payload:
        from_history_id = self.decode_id(payload.get("from_history_id"))
    stored_workflow = extract_workflow_by_ids(
        trans=trans,
        user=trans.user,
        history_id=from_history_id,  # Optional context
        job_ids=job_ids,
        hda_ids=hda_ids,
        hdca_ids=hdca_ids,
        workflow_name=workflow_name,
    )
elif "from_history_id" in payload:
    # Legacy HID-based extraction (still fully supported)
    from_history_id = self.decode_id(payload.get("from_history_id"))
    history = self.history_manager.get_accessible(...)
    job_ids = [self.decode_id(_) for _ in payload.get("job_ids", [])]
    dataset_ids = payload.get("dataset_ids", [])
    dataset_collection_ids = payload.get("dataset_collection_ids", [])
    workflow_name = payload["workflow_name"]
    stored_workflow = extract_workflow(
        trans=trans,
        user=trans.user,
        history=history,
        job_ids=job_ids,
        dataset_ids=dataset_ids,
        dataset_collection_ids=dataset_collection_ids,
        workflow_name=workflow_name,
    )
else:
    raise exceptions.RequestParameterMissingException(
        "Either hda_ids/hdca_ids or from_history_id is required"
    )
```

File: /lib/galaxy/workflow/extract.py
Add extract_steps_by_ids() function with permission checks instead of history membership checks:
```python
def extract_steps_by_ids(
    trans,
    history_id=None,  # Optional - for UI context only
    job_ids=None,
    hda_ids=None,   # Database IDs, not HIDs
    hdca_ids=None,  # Database IDs, not HIDs
    dataset_names=None,
    dataset_collection_names=None,
):
    """Extract workflow steps using database IDs for input selection.

    Unlike extract_steps(), which uses HIDs, this function uses database IDs
    for input dataset/collection selection. This enables cross-history
    extraction and proper handling of copied datasets.

    Datasets/collections may be from any history the user can access.
    Permission-based validation replaces history membership checks.
    """
    hda_manager = trans.app.hda_manager
    hdca_manager = trans.app.hdca_manager
    job_ids = [int(_) for _ in listify(job_ids)]
    hda_ids = [int(_) for _ in listify(hda_ids)]
    hdca_ids = [int(_) for _ in listify(hdca_ids)]

    # Load history if provided (for context, not validation)
    history = None
    if history_id:
        history = trans.sa_session.get(model.History, history_id)
    summary = WorkflowSummaryByIds(trans, history)

    steps = []
    step_labels = set()
    # Use an ID-to-output mapping instead of HID-to-output
    id_to_output_pair = {}

    # Input dataset steps - keyed by database ID
    for i, dataset_id in enumerate(hda_ids):
        dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
        if not dataset:
            raise exceptions.ObjectNotFound(f"Dataset {dataset_id} not found")
        # Permission check instead of history membership check
        hda_manager.error_unless_accessible(dataset, trans.user)
        # Trace to the original unconverted dataset for implicit conversions
        original_dataset = _trace_to_original(dataset)
        step = model.WorkflowStep()
        step.type = "data_input"
        name = dataset_names[i] if dataset_names else "Input Dataset"
        if name not in step_labels:
            step.label = name
            step_labels.add(name)
        step.tool_inputs = dict(name=name)
        # Key by ID, not HID
        id_to_output_pair[("dataset", original_dataset.id)] = (step, "output")
        steps.append(step)

    # Input collection steps - same pattern, with a permission check
    for i, collection_id in enumerate(hdca_ids):
        collection = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, collection_id)
        if not collection:
            raise exceptions.ObjectNotFound(f"Collection {collection_id} not found")
        # Permission check on the HDCA only (not individual elements)
        hdca_manager.error_unless_accessible(collection, trans.user)
        step = model.WorkflowStep()
        step.type = "data_collection_input"
        collection_type = collection.collection.collection_type
        name = dataset_collection_names[i] if dataset_collection_names else "Input Dataset Collection"
        if name not in step_labels:
            step.label = name
            step_labels.add(name)
        step.tool_inputs = dict(name=name, collection_type=collection_type)
        id_to_output_pair[("collection", collection_id)] = (step, "output")
        steps.append(step)

    # Tool steps - use IDs for connection lookup
    for job_id in job_ids:
        job = trans.app.job_manager.get_accessible_job(trans, job_id)
        tool_inputs, associations = step_inputs_by_id(trans, job)  # New function
        step = model.WorkflowStep()
        step.type = "tool"
        step.tool_id = job.tool_id
        step.tool_version = job.tool_version
        step.tool_inputs = tool_inputs
        # Connect inputs using IDs
        for (content_type, content_id), input_name in associations:
            key = (content_type, content_id)
            if key in id_to_output_pair:
                step_input = step.get_or_add_input(input_name)
                other_step, other_name = id_to_output_pair[key]
                conn = model.WorkflowStepConnection()
                conn.input_step_input = step_input
                conn.output_step = other_step
                conn.output_name = other_name
        steps.append(step)

        # Store output IDs so downstream jobs can connect to this step
        for assoc in job.output_datasets + job.output_dataset_collection_instances:
            assoc_name = assoc.name
            if ToolOutputCollectionPart.is_named_collection_part_name(assoc_name):
                continue
            if assoc_name.startswith("__new_primary_file"):
                continue
            if hasattr(assoc, "dataset"):
                content_type = "dataset"
                content_id = assoc.dataset.id
            else:
                content_type = "collection"
                content_id = assoc.dataset_collection_instance.id
            id_to_output_pair[(content_type, content_id)] = (step, assoc.name)

    return steps


def _trace_to_original(dataset):
    """Trace implicit conversions back to the original unconverted dataset."""
    # If this dataset is the result of an implicit conversion, walk back to
    # its parent (see Open Questions: the attribute name needs verification).
    if dataset.implicitly_converted_parent:
        return _trace_to_original(dataset.implicitly_converted_parent)
    return dataset
```

Refactor to use an abstract base class with concrete implementations for HID- vs ID-based content-to-step mapping:
```python
from abc import ABC, abstractmethod


class BaseWorkflowSummary(ABC):
    """Abstract base for workflow extraction summary with content-to-step mapping."""

    def __init__(self, trans, history=None):
        self.trans = trans
        self.history = history
        self.sa_session = trans.sa_session

    @abstractmethod
    def get_output_key(self, content):
        """Return the key used to map content to workflow steps.

        For HID-based: returns the HID (an integer)
        For ID-based: returns a (content_type, id) tuple
        """

    @abstractmethod
    def resolve_input(self, job_input):
        """Resolve a job input to its output key for connection mapping."""

    # Shared methods that don't depend on HID vs ID...


class WorkflowSummary(BaseWorkflowSummary):
    """HID-based workflow summary (legacy)."""

    def __init__(self, trans, history):
        super().__init__(trans, history)
        # Existing HID mapping logic
        self.hda_hid_in_history = {}
        self.hdca_hid_in_history = {}
        # ... existing initialization

    def get_output_key(self, content):
        return self.hid(content)

    def resolve_input(self, job_input):
        # Existing HID-based resolution
        pass


class WorkflowSummaryByIds(BaseWorkflowSummary):
    """ID-based workflow summary (new)."""

    def get_output_key(self, content):
        content_type = "collection" if hasattr(content, "collection") else "dataset"
        return (content_type, content.id)

    def resolve_input(self, job_input):
        # ID-based resolution - simpler, no HID translation needed
        if hasattr(job_input, "dataset"):
            return ("dataset", job_input.dataset.id)
        else:
            return ("collection", job_input.dataset_collection.id)
```

No changes are needed to the response format. The extracted workflow structure is identical regardless of whether HIDs or IDs were used for input selection.
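As a sanity check on the request shape, a client call under the proposed schema might be assembled as below. The URL path and payload field names follow the schema proposed above; the `x-api-key` header, base URL, and ID strings are illustrative assumptions:

```python
import json

def build_extract_request(galaxy_url, api_key, workflow_name, hda_ids, job_ids=()):
    """Assemble (url, headers, body) for a POST /api/workflows extraction call.

    A hedged sketch: field names follow the proposed ID-based schema; the
    header name and example values are assumptions, not verified constants.
    """
    url = f"{galaxy_url}/api/workflows"
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({
        "workflow_name": workflow_name,
        "hda_ids": list(hda_ids),  # encoded database IDs, from any history
        "job_ids": list(job_ids),
    })
    return url, headers, body

url, headers, body = build_extract_request(
    "https://galaxy.example.org", "SECRET-KEY", "Cross-history workflow", ["abc123"]
)
```

Note that no `from_history_id` is included: under the proposed schema it is optional for the ID-based path.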
From WORKFLOW_EXTRACTION_ISSUES.md (#9161): when datasets are copied from other histories, all connections are broken and the extracted workflow includes tools from the original history.
With cross-history ID extraction:
- User copies dataset from History A to History B
- Runs tools on copy in History B
- Extraction request includes: datasets from History B, jobs from History B
- The copied dataset is marked as an input (its ID in B)
- Connections to jobs in B work correctly via IDs
- No "foreign jobs" pulled from History A
When job caching returns outputs from a job in another history, ID-based extraction can still trace connections because it doesn't require all outputs to be in the "current" history.
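A toy version of the lookup shows why this works: the mapping key carries no history information, so a cache-hit output produced in another history resolves exactly like a local one (step names and ID values below are invented for illustration):

```python
# Invented step/ID values; keys are (content_type, database_id) tuples,
# mirroring the id_to_output_pair structure from extract_steps_by_ids().
id_to_output_pair = {
    ("dataset", 101): ("input_step", "output"),          # input dataset in History B
    ("dataset", 205): ("cached_job_step", "out_file1"),  # cache hit from History A
}

def resolve_connection(content_type, content_id):
    """Look up the producing (step, output_name) pair - history-agnostic."""
    return id_to_output_pair.get((content_type, content_id))
```

A HID-keyed map could not represent the second entry at all without first translating the foreign HID into the current history's numbering.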
The Vue UI can dynamically load and display upstream job chains across history boundaries, allowing users to explore provenance and select inputs from any point in the chain.
The existing code relies on implicit security through history membership. ID-based extraction with cross-history support requires explicit permission checks:
Explicit Dataset Access Checks:
```python
for dataset_id in hda_ids:
    dataset = trans.sa_session.get(model.HistoryDatasetAssociation, dataset_id)
    if not dataset:
        raise ObjectNotFound(f"Dataset {dataset_id} not found")
    # Explicit permission check
    hda_manager.error_unless_accessible(dataset, trans.user)
```

Job Access Checks:

```python
for job_id in job_ids:
    # get_accessible_job checks that the user owns the job OR the job is in an accessible history
    job = trans.app.job_manager.get_accessible_job(trans, job_id)
```

Collection Access: Check HDCA accessibility only, not individual element HDAs.
Permission Scenarios:
| Scenario | Should Work? | Permission Check |
|---|---|---|
| User's own dataset from another history | Yes | hda_manager.error_unless_accessible() |
| Shared dataset user can access | Yes | Galaxy's standard dataset access rules |
| Published dataset | Yes | dataset.dataset.published |
| Private dataset from another user | No | Access check fails |
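The decision table above can be condensed into a single predicate. The attribute names below (`owner`, `published`, `users_shared_with`) are illustrative stand-ins, not Galaxy's actual model fields:

```python
# Illustrative sketch of the access decision table; attribute names are
# invented stand-ins, not Galaxy's actual model fields.
class FakeDataset:
    def __init__(self, owner, published=False, shared_with=()):
        self.owner = owner
        self.published = published
        self.users_shared_with = set(shared_with)

def can_use_as_input(dataset, user):
    """Published, owned, or shared datasets are usable; anything else is not."""
    return dataset.published or dataset.owner == user or user in dataset.users_shared_with
```

In the real implementation this logic lives behind `hda_manager.error_unless_accessible()`, which raises instead of returning a boolean.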
Recommendation: Implement minimal Mako fix to use new ID-based params.
Current (line 22):

```mako
name="${data.history_content_type}_ids" value="${data.hid}"
```

Updated:

```mako
name="${'hda_ids' if data.history_content_type == 'dataset' else 'hdca_ids'}" value="${ trans.app.security.encode_id(data.id) }"
```

This allows Mako users to benefit from ID-based extraction immediately while Vue development proceeds.
The Vue UI should use IDs throughout and enable dynamic upstream chain exploration.
API Client (client/src/api/workflowExtraction.ts):
```typescript
export interface ExtractWorkflowPayload {
    from_history_id?: string; // Optional - for context
    workflow_name: string;
    job_ids: string[];  // Encoded database IDs
    hda_ids: string[];  // Encoded database IDs (NEW)
    hdca_ids: string[]; // Encoded database IDs (NEW)
}
```

Component Data Model (WorkflowExtraction.vue):
```typescript
// Track inputs by encoded ID, not HID
const inputDatasets = ref<Map<string, { id: string; hid: number; name: string; type: string }>>(new Map());

function toggleInputDataset(output: ExtractionOutputDataset) {
    // Key by encoded ID
    if (inputDatasets.value.has(output.id)) {
        inputDatasets.value.delete(output.id);
    } else {
        inputDatasets.value.set(output.id, {
            id: output.id,
            hid: output.hid,
            name: output.name,
            type: output.history_content_type,
        });
    }
}
```

Form Submission:
```typescript
async function onSubmit() {
    const hdaIds: string[] = [];
    const hdcaIds: string[] = [];
    inputDatasets.value.forEach((input) => {
        if (input.type === "dataset") {
            hdaIds.push(input.id); // Use encoded ID, not HID
        } else {
            hdcaIds.push(input.id);
        }
    });
    await extractWorkflow({
        from_history_id: props.historyId, // Optional context
        workflow_name: workflowName.value,
        job_ids: Array.from(selectedJobs.value),
        hda_ids: hdaIds,
        hdca_ids: hdcaIds,
    });
}
```

The UI should still display HIDs for user readability:
```html
<span class="hid">{{ output.hid }}</span>
<span class="name">{{ output.name }}</span>
```

But form submission uses IDs, not HIDs.
The Vue UI should lazily load and display upstream job dependencies as the user explores, enabling selection of inputs from any point in the provenance chain. This feature leverages ID-based extraction's cross-history capability.
FakeJob IDs use "fake_{dataset_id}" format for datasets without real jobs (e.g., uploaded files). The ID-based API continues to accept FakeJob IDs in this format, decoding them to the underlying dataset ID.
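A helper for splitting a mixed `job_ids` list could look like the sketch below. `partition_job_ids` and its `decode_id` parameter are hypothetical names; `decode_id` stands in for Galaxy's encoded-ID decoding, with plain `int` sufficing for the illustration:

```python
def partition_job_ids(raw_ids, decode_id=int):
    """Split mixed job_ids into real job IDs and FakeJob dataset IDs.

    Hypothetical helper: FakeJob entries use the "fake_{dataset_id}" format
    described above; decode_id stands in for Galaxy's encoded-ID decoding.
    """
    real_job_ids, fake_dataset_ids = [], []
    for raw in raw_ids:
        if isinstance(raw, str) and raw.startswith("fake_"):
            # FakeJob: recover the underlying dataset ID
            fake_dataset_ids.append(decode_id(raw[len("fake_"):]))
        else:
            real_job_ids.append(decode_id(raw))
    return real_job_ids, fake_dataset_ids
```

Real job IDs feed the tool-step loop, while FakeJob dataset IDs can be treated as additional input datasets.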
File: lib/galaxy_test/unit/workflow/test_extract.py (new)
```python
class TestExtractStepsByIds:
    """Tests for ID-based extraction."""

    def test_basic_extraction_by_id(self):
        """Single dataset marked as input using its ID."""
        pass

    def test_collection_extraction_by_id(self):
        """Dataset collection marked as input using its ID."""
        pass

    def test_invalid_dataset_id(self):
        """Non-existent dataset ID raises ObjectNotFound."""
        pass

    def test_inaccessible_dataset_rejected(self):
        """Dataset the user cannot access raises ItemAccessibilityException."""
        pass

    def test_cross_history_same_user(self):
        """Dataset from a different history (same user) works."""
        pass

    def test_implicit_conversion_traces_to_original(self):
        """Implicitly converted dataset traces back to the original."""
        pass
```

File: lib/galaxy_test/api/test_workflow_extraction.py (extend)
```python
class TestWorkflowExtractionByIds(ApiTestCase):
    """Tests for the ID-based workflow extraction API."""

    def test_extract_with_hda_ids(self):
        """Extract workflow using the hda_ids param."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)
        # Run a tool
        run_result = self.dataset_populator.run_tool(
            tool_id="cat1",
            inputs={"input1": {"src": "hda", "id": hda["id"]}},
            history_id=history_id,
        )
        self.dataset_populator.wait_for_history(history_id)
        # Extract using IDs
        response = self._post(
            "workflows",
            data={
                "from_history_id": history_id,
                "workflow_name": "Test Workflow",
                "job_ids": [run_result["jobs"][0]["id"]],
                "hda_ids": [hda["id"]],  # Encoded ID, not HID
            },
        )
        self._assert_status_code_is(response, 200)
        workflow = response.json()
        assert "id" in workflow

    def test_extract_without_from_history_id(self):
        """Extract workflow using only IDs, no from_history_id."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)
        response = self._post(
            "workflows",
            data={
                # No from_history_id
                "workflow_name": "Test Workflow",
                "hda_ids": [hda["id"]],
            },
        )
        self._assert_status_code_is(response, 200)

    def test_cross_history_extraction(self):
        """Extract workflow with datasets from multiple histories."""
        history_a = self.dataset_populator.new_history()
        history_b = self.dataset_populator.new_history()
        hda_a = self.dataset_populator.new_dataset(history_a, wait=True)
        hda_b = self.dataset_populator.new_dataset(history_b, wait=True)
        response = self._post(
            "workflows",
            data={
                "workflow_name": "Cross-History Workflow",
                "hda_ids": [hda_a["id"], hda_b["id"]],
            },
        )
        self._assert_status_code_is(response, 200)

    def test_inaccessible_dataset_rejected(self):
        """Dataset from another user's private history is rejected."""
        # Create a dataset as a different user
        # Try to extract - should fail with a permission error
        pass

    def test_backward_compat_hid_params(self):
        """Legacy HID params still work."""
        history_id = self.dataset_populator.new_history()
        hda = self.dataset_populator.new_dataset(history_id, wait=True)
        response = self._post(
            "workflows",
            data={
                "from_history_id": history_id,
                "workflow_name": "Test Workflow",
                "dataset_ids": [1],  # HID, not encoded ID
            },
        )
        self._assert_status_code_is(response, 200)

    def test_copied_dataset_with_ids(self):
        """Copied datasets work correctly with ID-based extraction."""
        # Create history A with a dataset
        # Copy the dataset to history B
        # Run a tool on the copy in history B
        # Extract a workflow from history B using IDs
        # Verify connections are correct
        pass
```

File: lib/galaxy_test/selenium/test_workflow_extraction.py
```python
class TestWorkflowExtractionVue(SeleniumTestCase):
    """Selenium tests for the Vue extraction UI using IDs."""

    def test_extraction_sends_ids_not_hids(self):
        """Verify the Vue UI submits encoded IDs, not HIDs."""
        # Intercept the network request
        # Verify the payload contains hda_ids with encoded IDs
        pass
```

Items to resolve before or during implementation:
- `_trace_to_original` implementation: Verify the implicit conversion tracing logic is correct. Need to check how Galaxy marks implicit conversions (is `implicitly_converted_parent` the right attribute?).
- `step_inputs_by_id` function: Design this new function - it should mirror `step_inputs` but return ID-based associations instead of HID-based ones.
- Abstract base class boundaries: Finalize which methods are shared vs abstract in `BaseWorkflowSummary`. Need to review the existing `WorkflowSummary` to identify all HID-dependent code paths.
- Mako template testing: After the Mako fix, verify form submission works correctly with the new param names. May need JavaScript changes if form handling is client-side.
- Vue UI architecture: Detailed Vue component design is out of scope for this plan but referenced. Ensure the API design here supports the Vue UI requirements from WORKFLOW_EXTRACTION_VUE_CONVERSION_PLAN.md.
- Dynamic upstream chain API: Determine whether the existing `/api/jobs/{id}` endpoint provides enough info for the Vue UI's dynamic chain following, or if new endpoints are needed. (Deferred - out of scope for the initial implementation.)