Nucleotide repositories plan

Nucleotide Repositories

This document describes the planned repository split for nucleotide-fragment generation, library construction, rotaconformer generation, and downstream analysis. The plan originated in professional-TODO/ and is still to be finalized.

Repository Dependency Diagram

graph TD;
    pipeline["nucleotide-fragment-pipeline"];
    fragments["nucleotide-fragments"];
    completeness["nucleotide-completeness"];
    buildLibrary["nucleotide-build-library"];
    library["nucleotide-library"];
    rotPipeline["Rotaconformer generation pipeline"];
    rotaconformers["Rotaconformers"];
    analysis["Analysis repository / nucleotide-stacking-interaction"];
    interaction["nucleotide-interaction"];
    libraryFit["nucleotide-library-fit"];

    subgraph pipelineSteps["nucleotide-fragment-pipeline steps"]
        p0["0. all-PDB mmCIF input"];
        p1["1. parse structures to ppdb"];
        p1a["1a. parse headers"];
        p1b["1b. summarize headers"];
        p1c["1c. cif-to-auth chain mapping"];
        p2["2. detect interfaces"];
        p3["3. filter protein-RNA interfaces"];
        p4["4. collect interface structures and RNA"];
        p5["5. add missing atoms with RNA topology"];
        p5a["5a. detect RNA segments"];
        p6["6. extract initial fragments"];
        p7["7. mutate sequences to motifs"];
        p8["8. deredundant clustering at 0.2 A"];
        p9["9. filter clashes and disconnected bonds"];
        p10["10. publish final fragments"];

        p0 --> p1;
        p0 --> p1a;
        p1a --> p1b;
        p1b --> p1c;
        p1 --> p2;
        p1c --> p2;
        p2 --> p3;
        p3 --> p4;
        p4 --> p5;
        p5 --> p5a;
        p5a --> p6;
        p6 --> p7;
        p7 --> p8;
        p8 --> p9;
        p9 --> p10;
    end;

    pipeline --> p0;
    p10 --> fragments;
    fragments --> completeness;
    fragments --> buildLibrary;
    buildLibrary --> library;
    library --> rotPipeline;
    rotPipeline --> rotaconformers;
    fragments --> analysis;
    library --> analysis;
    rotaconformers --> analysis;
    completeness -.-> analysis;
    analysis --> interaction;
    library --> interaction;
    interaction --> libraryFit;
    interaction --> analysis;

1. nucleotide-fragment-pipeline

Pipeline for parsing all protein-RNA complexes and extracting nucleotide fragments.

Possible extensions:

  • Protein-DNA complexes.
  • Isolated RNA or DNA.
  • Mononucleotides and fragments longer than trinucleotides.

Responsibilities:

  • Extract dinucleotides and trinucleotides.
  • Filter fragments with clashes or missing connectivity.
  • Cluster dinucleotides and trinucleotides into non-redundant fragments.
  • Work incrementally by reusing previous Seamless results.
  • Store results in the repository as checksums, not as materialized files.
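The last two responsibilities (incremental reuse and checksum-only storage) can be illustrated with a minimal sketch. It does not use the Seamless API; the function names, the SHA-256 hash, the JSON index, and the buffer directory layout are all assumptions made for illustration only.

```python
import hashlib
import json
from pathlib import Path

def checksum(data: bytes) -> str:
    # Hypothetical choice of hash; the real pipeline relies on Seamless checksums.
    return hashlib.sha256(data).hexdigest()

def run_incremental(inputs, func, index_path, buffer_dir):
    """Apply func (bytes -> bytes) to each input, skipping inputs seen before.

    The index file maps input checksum -> result checksum and is small enough
    to commit; the result buffers live outside the repository in buffer_dir.
    """
    index_path, buffer_dir = Path(index_path), Path(buffer_dir)
    buffer_dir.mkdir(parents=True, exist_ok=True)
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    results, computed = {}, 0
    for key, data in inputs.items():
        in_cs = checksum(data)
        out_cs = index.get(in_cs)
        if out_cs is None or not (buffer_dir / out_cs).exists():
            result = func(data)  # only new or lost results are recomputed
            out_cs = checksum(result)
            (buffer_dir / out_cs).write_bytes(result)
            index[in_cs] = out_cs
            computed += 1
        results[key] = out_cs
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True))
    return results, computed
```

Running the function a second time with one extra input recomputes only that input; everything else is resolved through the committed index.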

Pipeline steps:

  • Step 0: start from the complete PDB in mmCIF format, about 300 GB. Generate the allpdb deepcell and allpdb-keyorder.
  • Step 1: parse all PDB entries with Biopython into ppdb, a NumPy structured "parsed PDB" format. Main result: allpdb-struc-index.json plus the ppdb deepcell.
  • Step 1a: parse PDB headers with Biopython. Main result: allpdb-header-index.json.
  • Step 1b: summarize the parsed headers. Main results: intermediate/allpdb-header-summarized.json and intermediate/allpdb-header-summarized-asym.json.
  • Step 1c: build the cif-to-auth chain mapping.
  • Step 2: detect interfaces with allpdb-detect-interfaces.py.
  • Step 3: filter interfaces to protein-RNA interfaces. Main result: allpdb-filtered-interfaces.mixed.
  • Step 4: collect interface structures and extract RNA with allpdb-collect-interface-struc.py and allpdb-rna.py. Main result: allpdb-rna.mixed.
  • Step 5: add missing atoms using RNA topology and ATTRACT aareduce. Main result: allpdb-rna-attract.mixed.
  • Step 5a: detect RNA segments with allpdb-detect-segments.py.
  • Steps 6-9 run per fragment library, sequence, or motif. Current libraries are dinucleotides and trinucleotides.
  • Step 6: extract initial fragments with lib-$lib-initial.py. Main results: lib-$lib-initial-$seq.npy and lib-$lib-initial-$seq-origin.txt.
  • Step 7: mutate sequences to motifs with lib-$lib-initial.py. Main results: lib-$lib-mutated-$motif.npy and lib-$lib-mutated-$motif-origin.txt.
  • Step 8: remove redundancy by clustering at 0.2 A with lib-$lib-deredundant.py, clusterlib/, and deredundant.py. Main result: lib-$lib-nonredundant-$motif.clust.
  • Step 9: filter clustered fragments for internal clashes and disconnected bonds with filter-fragments.sh. Main results: lib-$lib-nonredundant-filtered-$motif.npy and lib-$lib-nonredundant-filtered-$motif-origin.txt.
  • Step 10: publish the filtered non-redundant fragments into nucleotide-fragments/$lib/$motif.npy and nucleotide-fragments/$lib/origin/$motif.npy with publish.sh.
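The redundancy removal in step 8 can be pictured as follows. This is a minimal sketch, not the code in clusterlib/ or deredundant.py: it assumes fragments are stored as an array of shape (nfragments, natoms, 3), uses a plain Kabsch superposition, and uses a simple greedy assignment. The function names are hypothetical.

```python
import numpy as np

def kabsch_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """RMSD between two (natoms, 3) coordinate sets after optimal superposition."""
    a0 = a - a.mean(axis=0)
    b0 = b - b.mean(axis=0)
    # The singular values of the covariance matrix give the optimal-fit RMSD directly.
    u, s, vt = np.linalg.svd(a0.T @ b0)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        s[-1] = -s[-1]  # exclude improper rotations (reflections)
    msd = (np.sum(a0**2) + np.sum(b0**2) - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))

def greedy_cluster(fragments, threshold=0.2):
    """Assign each fragment to the first representative closer than threshold (in A)."""
    representatives = []  # fragment indices that start a cluster
    assignment = np.empty(len(fragments), dtype=int)
    for i, frag in enumerate(fragments):
        for cluster, rep in enumerate(representatives):
            if kabsch_rmsd(fragments[rep], frag) < threshold:
                assignment[i] = cluster
                break
        else:
            assignment[i] = len(representatives)
            representatives.append(i)
    return representatives, assignment

# Toy data: a fragment, a rotated and shifted copy of it, and an unrelated fragment.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 3))
c, s = np.cos(0.7), np.sin(0.7)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
frags = np.stack([base, base @ rot.T + 5.0, 3.0 * rng.normal(size=(10, 3))])
reps, assignment = greedy_cluster(frags)  # the rotated copy joins the first cluster
```

The greedy loop is quadratic in the worst case, so at all-PDB scale a dedicated clustering library remains the practical choice; the sketch only shows what "non-redundant at 0.2 A" means.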

Configuration and environment notes:

  • A future nucleotide-fragment-pipeline-config subrepository should hold job files, Seamless deployment config, and job output handling.
  • The intended execution environment is seamless-exact plus Biopython and opt_einsum.
  • Some large generated files should be stored only by checksum, while selected indexes and compact interface files should be added to the repository.

Code dependencies:

  • Seamless.
  • ATTRACT aareduce.
  • Superposition functions.
  • Clustering functions.
  • Filtering functions.

2. nucleotide-fragments

Stores the final results produced by nucleotide-fragment-pipeline. Intermediate pipeline results are excluded.

The same final data should also be published to Zenodo or a similar archive.

Responsibilities:

  • Store final non-redundant fragment data.
  • Provide scripts for analysis of the non-redundant fragments.
  • Store analysis results in the repository.

Analyses:

  • Brute-force closest-fit analysis. A smarter version requires 1 A and 2 A clustering and belongs in nucleotide-build-library.
  • Origin tracking:
    • Source PDB for each fragment.
    • Source PDBs for each non-redundant fragment.
  • Origin analysis:
    • Counts and masks of non-redundant fragments with a single-PDB origin.
  • Terminality analysis:
    • Which non-redundant fragments are terminal-only.
    • Which non-redundant fragments are not terminal-only.

Repository dependency:

  • nucleotide-fragment-pipeline.

Code dependencies:

  • PDB parsing functions.
  • Superposition functions.
  • crocodile.dinuc and crocodile.trinuc Python packages.

3. nucleotide-completeness

Completeness analysis for the fragments, without clustering.

Requires a closest-fit analysis result from nucleotide-fragments, nucleotide-build-library, or nucleotide-library.

Responsibilities:

  • Failure analysis for fragments with fit RMSD greater than 1 A.
  • Analyze relationships between failure, origin, and terminality.
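The relationship between failure, origin, and terminality can be summarized with simple 2x2 contingency tables. The sketch below assumes three aligned per-fragment arrays (fit RMSD, single-origin mask, terminal-only mask); the array names and the example values are made up for illustration.

```python
import numpy as np

def failure_crosstab(fit_rmsd, flag, cutoff=1.0):
    """2x2 counts: rows are (failed, not failed), columns are (flag set, flag unset)."""
    failed = np.asarray(fit_rmsd, dtype=float) > cutoff  # failure: fit RMSD above 1 A
    flag = np.asarray(flag, dtype=bool)
    return np.array([
        [np.sum(failed & flag), np.sum(failed & ~flag)],
        [np.sum(~failed & flag), np.sum(~failed & ~flag)],
    ])

# Made-up values for five fragments.
fit_rmsd = np.array([0.3, 1.4, 0.8, 2.1, 1.0])
single_origin = np.array([False, True, False, True, True])
terminal_only = np.array([False, True, True, False, False])

origin_table = failure_crosstab(fit_rmsd, single_origin)
terminal_table = failure_crosstab(fit_rmsd, terminal_only)
```

Each table can then be passed to a standard association test if a statistical statement is wanted.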

Repository dependencies:

  • nucleotide-fragments, including version-controlled files and analysis results.
  • Optionally nucleotide-build-library or nucleotide-library.

Code dependencies:

  • Superposition functions.

4. nucleotide-build-library

Pipeline for building a fragment library through clustering.

Responsibilities:

  • Cluster fragments from scratch.
  • Produce clustering files at 0.5 A, 1.0 A, and 2.0 A thresholds.
  • Produce origin files.
  • Build primary, replacement, and extension libraries.
  • Run smart closest-fit analysis using 1 A and 2 A clustering.
  • Analyze best fit with and without clustering.
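One way a clustering can speed up the closest-fit search is pruning with the triangle inequality: if every member of a cluster lies within a known radius of its center, then a cluster whose center is farther from the query than the current best distance plus that radius cannot contain a better match. The sketch below shows this idea only; it is not taken from the existing code. The function name and arguments are hypothetical, and `dist` stands for any distance that obeys the triangle inequality.

```python
import numpy as np

def closest_with_pruning(query, members, centers, labels, radius, dist):
    """Return (index, distance) of the member closest to query.

    members: sequence of items; centers: member indices of the cluster centers;
    labels[i]: cluster number of member i; radius: largest member-to-center distance.
    """
    labels = np.asarray(labels)
    center_d = np.array([dist(query, members[c]) for c in centers])
    best_i, best_d = -1, np.inf
    for cluster in np.argsort(center_d):
        # No member of this cluster can be closer than center_d - radius.
        if center_d[cluster] - radius >= best_d:
            break  # clusters are visited by increasing center distance
        for i in np.flatnonzero(labels == cluster):
            d = dist(query, members[i])
            if d < best_d:
                best_i, best_d = int(i), d
    return best_i, best_d
```

With a 2 A clustering as the outer level and a 1 A clustering inside each 2 A cluster, the same test could be applied twice, which would be consistent with the two thresholds listed above.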

Repository dependency:

  • nucleotide-fragments.

Code dependencies:

  • Superposition functions.
  • Clustering functions.
  • Possibly filtering functions.

5. nucleotide-library

Stores the dinucleotide and trinucleotide libraries produced by nucleotide-build-library.

The same library data should also be published to Zenodo or a similar archive.

Responsibilities:

  • Store primary, replacement, and extension libraries in .npy format.
  • Store minimal origin files for replacement libraries.
  • Store minimal origin files for eliminating extension members.

Repository dependency:

  • nucleotide-build-library.

6. Rotaconformer Generation Pipeline

Pipeline for generating rotaconformers from the nucleotide libraries.

Repository dependency:

  • nucleotide-library.

Code dependencies:

  • Seamless.

7. Rotaconformers

Stores the rotaconformer results produced by the rotaconformer generation pipeline.

Note: the full result set is too large for a regular source repository.

Repository dependency:

  • Rotaconformer generation pipeline.

8. Analysis Repository

The analysis repository now appears to be represented by nucleotide-stacking-interaction.

It contains analysis tools and their final results. Intermediate results are not stored directly; only their checksums are stored. The current target library is a 0.5 A library with extension, but the approach can be extended to other libraries.

The library should be refit independently to the PDB dataset, without relying on fragments.json as in the old ProtNAffs workflow.

Fit Analysis

Responsibilities:

  • Store best-fitted conformers.
  • Store overlap RMSD matrices.
  • Store compatibility RMSD matrices.
  • Store fit RMSD values to the library.
  • Store the best-fitted library conformer.
  • Support the crocodile scheme: rotaconformer plus 0.5 A translation.

Reference:

  • allpdb-trinuc-fit.py from the old stacking_interaction repository.

Stacking Interaction Analysis

Responsibilities:

  • Calculate and store additional properties, including:
    • Double-stranded or single-stranded RNA.
    • Hairpin annotation.
    • Protein or RNA family.
    • Stacking annotations.

Implementation notes:

  • Property calculations should be incremental with respect to the PDB set, using a Seamless command-line style.
  • Fit calculations should be incremental with respect to both the PDB set and library conformers. This does not necessarily require Seamless.

Tools should run on arbitrary datasets.

Use cases:

  • New PDBs not used in library generation, possibly updated weekly.
  • Deep-learning prediction outputs.

Repository dependencies:

  • nucleotide-fragments for the parsed PDB dataset.
  • nucleotide-library for the library.
  • Rotaconformers for the crocodile scheme.

Code dependencies:

  • Seamless.
  • Superposition routines.

9. nucleotide-interaction

Small analysis-side repository for annotating fitted RNA fragments with protein contact counts.

It consumes precomputed all-PDB interface and RNA-fit artifacts, then reduces them to per-atom and per-fitted-dinucleotide contact counts. The interface input checksums match the inputs used by nucleotide-stacking-interaction, so this repository appears to share the same parsed all-PDB interface dataset.

Responsibilities:

  • Count protein contacts for each RNA atom in the all-PDB RNA dataset.
  • Sum those contact counts over fitted dinucleotide spans.
  • Store contact-count arrays for downstream filtering and interaction analysis.

Inputs:

  • allpdb-filtered-interfaces.mixed.
  • allpdb-interface-struc.mixed.
  • allpdb-keyorder.json.
  • allpdb-rna-aareduce.mixed.
  • allpdb-rna-fit.npy.
  • allpdb-rna-fit-indices.npy.

Outputs:

  • allpdb-count-contacts.npy: per-RNA-atom protein contact counts.
  • allpdb-rna-fit-count-contacts.npy: per-fitted-dinucleotide summed contact counts.
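The two outputs correspond to two small reductions, sketched below with SciPy's KDTree. The 5 A cutoff, the representation of fitted dinucleotides as half-open (start, end) atom index ranges, and the function names are assumptions made for illustration; the real scripts may differ.

```python
import numpy as np
from scipy.spatial import KDTree

def count_contacts(rna_xyz, protein_xyz, cutoff=5.0):
    """Number of protein atoms within cutoff (in A) of each RNA atom."""
    tree = KDTree(protein_xyz)
    hits = tree.query_ball_point(rna_xyz, r=cutoff)  # one list of neighbors per RNA atom
    return np.array([len(h) for h in hits], dtype=np.int64)

def sum_over_spans(counts, spans):
    """Sum per-atom counts over half-open (start, end) atom ranges."""
    csum = np.concatenate([[0], np.cumsum(counts)])
    spans = np.asarray(spans)
    return csum[spans[:, 1]] - csum[spans[:, 0]]

# Toy coordinates: two protein atoms, four RNA atoms, two fitted spans.
protein_xyz = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
rna_xyz = np.array([[1.0, 0.0, 0.0], [9.0, 0.0, 0.0], [5.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
per_atom = count_contacts(rna_xyz, protein_xyz, cutoff=5.5)
per_span = sum_over_spans(per_atom, [(0, 2), (2, 4)])
```

The cumulative-sum trick keeps the per-span reduction vectorized, which matters when the span list covers every fitted dinucleotide in the all-PDB dataset.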

Known downstream use:

  • nucleotide-library-fit/pair-freq-words.py uses allpdb-rna-fit-count-contacts.npy to filter fragments and fragment pairs to those with protein contacts.

Repository dependencies:

  • nucleotide-stacking-interaction or the same parsed all-PDB interface inputs.
  • RNA-fit results from the library fitting workflow.

Code dependencies:

  • Seamless.
  • NumPy.
  • SciPy KDTree.
  • tqdm.