This document describes the planned repository split for nucleotide-fragment
generation, library construction, rotaconformer generation, and downstream
analysis. The plan originated in professional-TODO/ and is still to be
finalized.
graph TD;
pipeline["nucleotide-fragment-pipeline"];
fragments["nucleotide-fragments"];
completeness["nucleotide-completeness"];
buildLibrary["nucleotide-build-library"];
library["nucleotide-library"];
rotPipeline["Rotaconformer generation pipeline"];
rotaconformers["Rotaconformers"];
analysis["Analysis repository / nucleotide-stacking-interaction"];
interaction["nucleotide-interaction"];
libraryFit["nucleotide-library-fit"];
subgraph pipelineSteps["nucleotide-fragment-pipeline steps"]
p0["0. all-PDB mmCIF input"];
p1["1. parse structures to ppdb"];
p1a["1a. parse headers"];
p1b["1b. summarize headers"];
p1c["1c. cif-to-auth chain mapping"];
p2["2. detect interfaces"];
p3["3. filter protein-RNA interfaces"];
p4["4. collect interface structures and RNA"];
p5["5. add missing atoms with RNA topology"];
p5a["5a. detect RNA segments"];
p6["6. extract initial fragments"];
p7["7. mutate sequences to motifs"];
p8["8. deredundant clustering at 0.2 A"];
p9["9. filter clashes and disconnected bonds"];
p10["10. publish final fragments"];
p0 --> p1;
p0 --> p1a;
p1a --> p1b;
p1b --> p1c;
p1 --> p2;
p1c --> p2;
p2 --> p3;
p3 --> p4;
p4 --> p5;
p5 --> p5a;
p5a --> p6;
p6 --> p7;
p7 --> p8;
p8 --> p9;
p9 --> p10;
end;
pipeline --> p0;
p10 --> fragments;
fragments --> completeness;
fragments --> buildLibrary;
buildLibrary --> library;
library --> rotPipeline;
rotPipeline --> rotaconformers;
fragments --> analysis;
library --> analysis;
rotaconformers --> analysis;
completeness -.-> analysis;
analysis --> interaction;
library --> interaction;
interaction --> libraryFit;
interaction --> analysis;
Pipeline for parsing all protein-RNA complexes and extracting nucleotide fragments.
Possible extensions:
- Protein-DNA complexes.
- Isolated RNA or DNA.
- Mononucleotides and fragments longer than trinucleotides.
Responsibilities:
- Extract dinucleotides and trinucleotides.
- Filter fragments with clashes or missing connectivity.
- Cluster dinucleotides and trinucleotides into non-redundant fragments.
- Work incrementally by reusing previous Seamless results.
- Store results in the repository as checksums, not as materialized files.
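Storing results as checksums rather than materialized files can be pictured as a small content-addressed index. The sketch below is illustrative only: it uses SHA-256 from the Python standard library, and the function name, the result names, and the JSON layout are assumptions, not the checksum scheme or file format that Seamless itself uses.

```python
import hashlib
import json

def checksum_index(results):
    """Map each result name to the SHA-256 hex digest of its bytes.

    `results` is a dict of name -> bytes. Only the digests are meant to be
    committed; the bytes themselves would live in external storage.
    """
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(results.items())}

# Hypothetical result names and placeholder payloads, for illustration only.
index = checksum_index({
    "example-result-a": b"placeholder bytes A",
    "example-result-b": b"placeholder bytes B",
})
index_json = json.dumps(index, indent=2)  # text that would be version-controlled
```

Only `index_json` would be committed in this picture, which keeps the repository small regardless of the size of the underlying data.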
Pipeline steps:
- Step 0: start from the complete PDB in mmCIF format, about 300 GB. Generate the allpdb deepcell and allpdb-keyorder.
- Step 1: parse all PDB entries with Biopython into ppdb, a NumPy structured "parsed PDB" format. Main result: allpdb-struc-index.json plus the ppdb deepcell.
- Step 1a: parse PDB headers with Biopython. Main result: allpdb-header-index.json.
- Step 1b: summarize the parsed headers. Main results: intermediate/allpdb-header-summarized.json and intermediate/allpdb-header-summarized-asym.json.
- Step 1c: build the cif-to-auth chain mapping.
- Step 2: detect interfaces with allpdb-detect-interfaces.py.
- Step 3: filter interfaces to protein-RNA interfaces. Main result: allpdb-filtered-interfaces.mixed.
- Step 4: collect interface structures and extract RNA with allpdb-collect-interface-struc.py and allpdb-rna.py. Main result: allpdb-rna.mixed.
- Step 5: add missing atoms using RNA topology and ATTRACT aareduce. Main result: allpdb-rna-attract.mixed.
- Step 5a: detect RNA segments with allpdb-detect-segments.py.
- Steps 6-9 run per fragment library, sequence, or motif. Current libraries are dinucleotides and trinucleotides.
- Step 6: extract initial fragments with lib-$lib-initial.py. Main results: lib-$lib-initial-$seq.npy and lib-$lib-initial-$seq-origin.txt.
- Step 7: mutate sequences to motifs with lib-$lib-initial.py. Main results: lib-$lib-mutated-$motif.npy and lib-$lib-mutated-$motif-origin.txt.
- Step 8: deduplicate by clustering at 0.2 A with lib-$lib-deredundant.py, clusterlib/, and deredundant.py. Main result: lib-$lib-nonredundant-$motif.clust.
- Step 9: filter clustered fragments for internal clashes and disconnected bonds with filter-fragments.sh. Main results: lib-$lib-nonredundant-filtered-$motif.npy and lib-$lib-nonredundant-filtered-$motif-origin.txt.
- Step 10: publish the filtered non-redundant fragments into nucleotide-fragments/$lib/$motif.npy and nucleotide-fragments/$lib/origin/$motif.npy with publish.sh.
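Step 8 can be illustrated with a minimal greedy (leader) clustering on superposed RMSD. This is a sketch under assumptions, not the algorithm implemented in deredundant.py or clusterlib/: the function names are hypothetical, fragments are assumed to be arrays of shape (n_atoms, 3) with identical atom order, and a fragment joins the first representative closer than the threshold.

```python
import numpy as np

def kabsch_rmsd(a, b):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # SVD of the covariance matrix yields the optimal rotation (Kabsch algorithm).
    u, s, vt = np.linalg.svd(a.T @ b)
    # If the best orthogonal fit is a reflection, flip the smallest singular value.
    if np.linalg.det(u @ vt) < 0:
        s[-1] = -s[-1]
    msd = (np.sum(a * a) + np.sum(b * b) - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))

def greedy_cluster(fragments, threshold=0.2):
    """Leader clustering: join the first representative within `threshold`."""
    representatives = []  # fragment indices acting as cluster representatives
    members = []          # members[i] lists the fragment indices of cluster i
    for idx, frag in enumerate(fragments):
        for c, rep in enumerate(representatives):
            if kabsch_rmsd(fragments[rep], frag) < threshold:
                members[c].append(idx)
                break
        else:
            # No representative is close enough: start a new cluster.
            representatives.append(idx)
            members.append([idx])
    return representatives, members

# Toy data: a random fragment, a rotated and translated copy, and an unrelated one.
rng = np.random.default_rng(0)
base = rng.normal(scale=3.0, size=(10, 3))
angle = np.deg2rad(30.0)
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = base @ rot_z.T + np.array([5.0, -2.0, 1.0])
other = rng.normal(scale=3.0, size=(10, 3))
reps, members = greedy_cluster([base, moved, other], threshold=0.2)
```

The rotated copy lands in the same cluster as its source because the RMSD is computed after superposition, while the unrelated fragment starts its own cluster. Greedy clustering depends on input order, which is one reason a production implementation may differ from this sketch.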
Configuration and environment notes:
- A future nucleotide-fragment-pipeline-config subrepository should hold job files, Seamless deployment config, and job output handling.
- The intended execution environment is seamless-exact plus Biopython and opt_einsum.
- Some large generated files should be stored only by checksum, while selected indexes and compact interface files should be added to the repository.
Code dependencies:
- Seamless.
- ATTRACT aareduce.
- Superposition functions.
- Clustering functions.
- Filtering functions.
Stores the final results produced by nucleotide-fragment-pipeline. Intermediate
pipeline results are excluded.
The same final data should also be published to Zenodo or a similar archive.
Responsibilities:
- Store final non-redundant fragment data.
- Provide scripts for analysis of the non-redundant fragments.
- Store analysis results in the repository.
Analyses:
- Brute-force closest-fit analysis. A smarter version requires 1 A and 2 A clustering and belongs in nucleotide-build-library.
- Origin tracking:
  - Source PDB for each fragment.
  - Source PDBs for each non-redundant fragment.
- Origin analysis:
  - Counts and masks of non-redundant fragments with a single-PDB origin.
- Terminality analysis:
  - Which non-redundant fragments are terminal-only.
  - Which non-redundant fragments are not terminal-only.
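The origin and terminality analyses above reduce to boolean masks over the non-redundant fragments. The following sketch assumes a hypothetical in-memory layout (one list of source PDB IDs and one list of per-member terminal flags for each non-redundant fragment); the actual origin file format is not specified here, and "terminal-only" is taken to mean that every member of the cluster is terminal.

```python
import numpy as np

def single_pdb_mask(cluster_origins):
    """True where all members of a non-redundant fragment come from one PDB entry."""
    return np.array([len(set(pdbs)) == 1 for pdbs in cluster_origins], dtype=bool)

def terminal_only_mask(cluster_terminal_flags):
    """True where every member of a non-redundant fragment is terminal."""
    return np.array([all(flags) for flags in cluster_terminal_flags], dtype=bool)

# Toy input with made-up PDB IDs, for illustration only.
origins = [["1abc", "1abc"], ["1abc", "2xyz"], ["3def"]]
terminal = [[True, True], [True, False], [False]]

single = single_pdb_mask(origins)
term_only = terminal_only_mask(terminal)
n_single = int(single.sum())
```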
Repository dependency:
- nucleotide-fragment-pipeline.
Code dependencies:
- PDB parsing functions.
- Superposition functions.
- crocodile.dinuc and crocodile.trinuc Python packages.
Completeness analysis for the fragments, without clustering.
Requires a closest-fit analysis result from nucleotide-fragments,
nucleotide-build-library, or nucleotide-library.
Responsibilities:
- Failure analysis for fragments with fit RMSD greater than 1 A.
- Analyze relationships between failure, origin, and terminality.
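The relationship between fit failure and terminality can be summarized as a 2x2 contingency table. This is a minimal sketch: the function name and the input arrays are hypothetical, and only the 1 A failure threshold comes from the text above.

```python
import numpy as np

def failure_table(fit_rmsd, terminal, threshold=1.0):
    """Count fragments by (failed, terminal).

    table[i, j]: i = 1 if fit RMSD > threshold, j = 1 if the fragment is terminal.
    """
    failed = (np.asarray(fit_rmsd) > threshold).astype(int)
    term = np.asarray(terminal, dtype=bool).astype(int)
    table = np.zeros((2, 2), dtype=int)
    # np.add.at accumulates correctly when the same cell is indexed repeatedly.
    np.add.at(table, (failed, term), 1)
    return table

# Toy values, for illustration only.
table = failure_table([0.3, 1.5, 2.0, 0.9, 1.2],
                      [False, True, True, True, False])
```

The same table can be built against a single-PDB-origin mask instead of the terminality mask to probe the relationship between failure and origin.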
Repository dependencies:
- nucleotide-fragments, including version-controlled files and analysis results.
- Optionally nucleotide-build-library or nucleotide-library.
Code dependencies:
- Superposition functions.
Pipeline for building a fragment library through clustering.
Responsibilities:
- Cluster fragments from scratch.
- Produce clustering files at 0.5 A, 1.0 A, and 2.0 A thresholds.
- Produce origin files.
- Build primary, replacement, and extension libraries.
- Run smart closest-fit analysis using 1 A and 2 A clustering.
- Analyze best fit with and without clustering.
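One way a clustering can speed up closest-fit search is triangle-inequality pruning: if a cluster center lies farther from the query than the current best distance plus the cluster radius, none of its members can improve the result. The sketch below is an assumption about what "smart" could mean, not the planned implementation. It uses a single clustering level and a generic metric `dist`; in this project the metric would presumably be superposed RMSD, and the 1 A and 2 A clusterings could be nested in the same way. All names are hypothetical.

```python
import numpy as np

def pruned_closest(query, items, clusters, radius, dist):
    """Return (best_index, best_distance, n_evaluated) for `query` among `items`.

    `clusters` maps a center index to its member indices (center included);
    every member lies within `radius` of its center under the metric `dist`.
    """
    best_idx, best_d = -1, np.inf
    center_d = {c: dist(query, items[c]) for c in clusters}
    n_eval = len(center_d)
    # Visit clusters from nearest to farthest center so pruning starts early.
    for c in sorted(center_d, key=center_d.get):
        # Triangle inequality: no member is closer than d(center) - radius.
        if center_d[c] - radius >= best_d:
            break  # all remaining centers are at least as far away
        for m in clusters[c]:
            if m == c:
                d = center_d[c]
            else:
                d = dist(query, items[m])
                n_eval += 1
            if d < best_d:
                best_idx, best_d = m, d
    return best_idx, best_d, n_eval

def euclid(a, b):
    """Stand-in metric for the toy example."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Toy one-dimensional data with three clusters of radius <= 1.
items = np.array([[0.0], [0.5], [10.0], [10.4], [20.0]])
clusters = {0: [0, 1], 2: [2, 3], 4: [4]}
idx, d, n_eval = pruned_closest(np.array([10.3]), items, clusters,
                                radius=1.0, dist=euclid)
```

In the toy run only four of the five distances are evaluated; the saving grows with the number of members per cluster.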
Repository dependency:
- nucleotide-fragments.
Code dependencies:
- Superposition functions.
- Clustering functions.
- Possibly filtering functions.
Stores the dinucleotide and trinucleotide libraries produced by
nucleotide-build-library.
The same library data should also be published to Zenodo or a similar archive.
Responsibilities:
- Store primary, replacement, and extension libraries in .npy format.
- Store minimal origin files for replacement libraries.
- Store minimal origin files for eliminating extension members.
Repository dependency:
- nucleotide-build-library.
Pipeline for generating rotaconformers from the nucleotide libraries.
Repository dependency:
- nucleotide-library.
Code dependencies:
- Seamless.
Stores the rotaconformer results produced by the rotaconformer generation pipeline.
Note: the full result set is too large for a regular source repository.
Repository dependency:
- Rotaconformer generation pipeline.
The analysis repository now appears to be represented by
nucleotide-stacking-interaction.
It contains analysis tools and their final results. Intermediate results are not stored directly; only their checksums are stored. The current target library is a 0.5 A library with extension, but the approach can be extended to other libraries.
The library should be refit independently to the PDB dataset, without relying on
fragments.json as in the old ProtNAffs workflow.
Responsibilities:
- Store best-fitted conformers.
- Store overlap RMSD matrices.
- Store compatibility RMSD matrices.
- Store fit RMSD values to the library.
- Store the best-fitted library conformer.
- Support the crocodile scheme: rotaconformer plus 0.5 A translation.
Reference:
- allpdb-trinuc-fit.py from the old stacking_interaction repository.
Responsibilities:
- Calculate and store additional properties, including:
  - Double-stranded or single-stranded RNA.
  - Hairpin annotation.
  - Protein or RNA family.
  - Stacking annotations.
Implementation notes:
- Property calculations should be incremental with respect to the PDB set, using a Seamless command-line style.
- Fit calculations should be incremental with respect to both the PDB set and library conformers. This does not necessarily require Seamless.
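Incremental fitting over both axes can be sketched as a cache keyed by (PDB entry, library conformer) pairs, so that adding PDB entries or conformers triggers only the missing computations. The function name, the dictionary cache, and the `fit` callback are hypothetical; the actual storage and fit routine are not specified here.

```python
from itertools import product

def update_fits(cache, pdb_ids, conformer_ids, fit):
    """Fill `cache` for every (pdb_id, conformer_id) pair not yet present.

    Returns the list of pairs that were newly computed.
    """
    missing = [key for key in product(pdb_ids, conformer_ids) if key not in cache]
    for pdb_id, conformer_id in missing:
        cache[(pdb_id, conformer_id)] = fit(pdb_id, conformer_id)
    return missing

# Toy usage with a stand-in fit function that records its calls.
calls = []

def fake_fit(pdb_id, conformer_id):
    calls.append((pdb_id, conformer_id))
    return 0.0  # placeholder for a fit RMSD

cache = {}
first = update_fits(cache, ["1abc", "2xyz"], [0, 1], fake_fit)
# Later, one new PDB entry and one new conformer are added.
second = update_fits(cache, ["1abc", "2xyz", "3def"], [0, 1, 2], fake_fit)
```

The second call computes only the five pairs introduced by the new PDB entry and the new conformer; the four earlier results are reused.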
Tools should run on arbitrary datasets.
Use cases:
- New PDBs not used in library generation, possibly updated weekly.
- Deep-learning prediction outputs.
Repository dependencies:
- nucleotide-fragments for the parsed PDB dataset.
- nucleotide-library for the library.
- Rotaconformers for the crocodile scheme.
Code dependencies:
- Seamless.
- Superposition routines.
Small analysis-side repository for annotating fitted RNA fragments with protein contact counts.
It consumes precomputed all-PDB interface and RNA-fit artifacts, then reduces
them to per-atom and per-fitted-dinucleotide contact counts. The interface input
checksums match the inputs used by nucleotide-stacking-interaction, so this
repository appears to share the same parsed all-PDB interface dataset.
Responsibilities:
- Count protein contacts for each RNA atom in the all-PDB RNA dataset.
- Sum those contact counts over fitted dinucleotide spans.
- Store contact-count arrays for downstream filtering and interaction analysis.
Inputs:
- allpdb-filtered-interfaces.mixed.
- allpdb-interface-struc.mixed.
- allpdb-keyorder.json.
- allpdb-rna-aareduce.mixed.
- allpdb-rna-fit.npy.
- allpdb-rna-fit-indices.npy.
Outputs:
- allpdb-count-contacts.npy: per-RNA-atom protein contact counts.
- allpdb-rna-fit-count-contacts.npy: per-fitted-dinucleotide summed contact counts.
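The two outputs can be pictured as a neighbor count followed by a span sum. The sketch below uses SciPy's KDTree, which is listed among the dependencies, but everything else is an assumption: the function names, the 5 A distance cutoff, and the representation of fitted dinucleotides as half-open [start, end) ranges of RNA atom indices are hypothetical and may not match the layout of allpdb-rna-fit-indices.npy.

```python
import numpy as np
from scipy.spatial import KDTree

def count_contacts(rna_coords, protein_coords, cutoff=5.0):
    """Number of protein atoms within `cutoff` of each RNA atom."""
    tree = KDTree(protein_coords)
    # For an (n, 3) query array this returns one list of neighbor indices per atom.
    hits = tree.query_ball_point(rna_coords, r=cutoff)
    return np.array([len(h) for h in hits], dtype=np.int64)

def sum_over_spans(counts, spans):
    """Sum per-atom counts over half-open [start, end) atom index spans."""
    prefix = np.concatenate(([0], np.cumsum(counts)))
    spans = np.asarray(spans)
    return prefix[spans[:, 1]] - prefix[spans[:, 0]]

# Toy coordinates, for illustration only.
protein = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [50.0, 0.0, 0.0]])
rna = np.array([[0.5, 0.0, 0.0], [49.0, 0.0, 0.0], [100.0, 0.0, 0.0]])

per_atom = count_contacts(rna, protein)
per_span = sum_over_spans(per_atom, [[0, 2], [1, 3]])
```

The prefix sum makes each span total a constant-time lookup, which matters when many overlapping fitted spans cover the same atoms.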
Known downstream use:
- nucleotide-library-fit/pair-freq-words.py uses allpdb-rna-fit-count-contacts.npy to filter fragments and fragment pairs to those with protein contacts.
Repository dependencies:
- nucleotide-stacking-interaction or the same parsed all-PDB interface inputs.
- RNA-fit results from the library fitting workflow.
Code dependencies:
- Seamless.
- NumPy.
- SciPy KDTree.
- tqdm.