Marcus Fedarko fedarko
fedarko / explode_primer.py
Created May 22, 2025 01:13
Compute all possible variations of a primer sequence
# This code makes it easy to do exact matching of a primer sequence to e.g.
# a 16S rRNA gene sequence -- you can use explode() on the primer sequence
# to get all possible variations of this primer sequence, and then just
# search through your gene to find exact matches to any of these primers.
#
# I know there are real tools that can do this faster (e.g. Primer Prospector),
# but I just needed something quick and dirty.
# Some example sequences with degenerate nucleotides: the EMP 16S primers,
# per https://earthmicrobiome.org/protocols-and-standards/16s/
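The approach described in the comments above can be sketched roughly as follows. This is not the gist's actual code: the `IUPAC` table is the standard nucleotide ambiguity code, while the body of `explode()` and the helper `find_exact_matches()` are assumptions written for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def explode(primer):
    """Return every non-degenerate sequence the primer can represent."""
    choices = [IUPAC[c] for c in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


def find_exact_matches(primer, gene):
    """Hypothetical helper: sorted 0-based starts where any variant occurs."""
    hits = set()
    for variant in explode(primer):
        start = gene.find(variant)
        while start != -1:
            hits.add(start)
            start = gene.find(variant, start + 1)
    return sorted(hits)
```

For example, `explode("AYG")` returns `["ACG", "ATG"]`. The number of variants is the product of the option counts at each position, so primers with many `N` positions grow quickly.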
fedarko / rmdegen.py
Last active April 1, 2025 22:20
Replace degenerate nucleotides in a FASTA file with random nucleotides (respecting IUPAC codes)
#! /usr/bin/env python3
#
# You can run this script as follows:
# $ ./rmdegen.py [input FASTA] [output FASTA]
import sys
import time
import random
import pyfastx
from collections import Counter
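A minimal sketch of the core replacement step, working on a plain string rather than a FASTA file so that it needs no third-party library. The function name `replace_degenerate` and the `rng` parameter are assumptions for illustration; the gist itself reads its input with pyfastx.

```python
import random

# IUPAC degenerate codes mapped to the bases each one may be replaced with.
DEGENERATE = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def replace_degenerate(seq, rng=random):
    """Replace each degenerate code with a random base it allows.

    Characters that are not upper-case degenerate codes are left as they are.
    """
    return "".join(
        rng.choice(DEGENERATE[c]) if c in DEGENERATE else c for c in seq
    )
```

Passing a seeded `random.Random(...)` instance as `rng` makes the output reproducible.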
fedarko / get_runs_of_ints.py
Created November 12, 2024 04:52
Identify runs of consecutive integers in a list
def get_runs(e):
    """Identifies runs of consecutive ints in a list.

    (The main reason I created this: identifying runs of 0-coverage positions in
    the output of "samtools depth -a".)

    Parameters
    ----------
    e: list of int
        Must not contain any duplicate elements.
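One way such a function could work is sketched below. The preview above does not show the gist's body or return format, so the name `runs_of_consecutive_ints`, the sorting step, and the `(start, end)` tuple output are all assumptions.

```python
def runs_of_consecutive_ints(values):
    """Return inclusive (start, end) pairs for each run of consecutive ints.

    Assumes `values` has no duplicates; it is sorted here, so input order
    does not matter.
    """
    runs = []
    for x in sorted(values):
        if runs and x == runs[-1][1] + 1:
            runs[-1][1] = x  # extend the current run
        else:
            runs.append([x, x])  # start a new run
    return [tuple(r) for r in runs]
```

For instance, `runs_of_consecutive_ints([3, 4, 5, 9, 10, 12])` returns `[(3, 5), (9, 10), (12, 12)]`.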
fedarko / filter_gfa_lowcov.py
Created April 30, 2024 23:53
Filter low-kmer-coverage segments from a jumboDBG GFA file
#! /usr/bin/env python
# Filters low-coverage segments from a jumboDBG GFA file.
# This assumes that the second-from-the-last entry in each segment line will be
# length, and that the last entry in each segment line will be k-mer coverage.
# It also uses the definition of k-mer coverage assumed by jumboDBG -- so, the
# conventional "coverage" of a segment can be computed as KC / (length - K).
K = 5001
MINCOV = 5
UPDATE_FREQ = 1000
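The coverage formula in the comments, KC / (length - K), could be applied to a segment line roughly like this. It is a sketch under assumptions: the last two fields are taken to be tags of the form `LN:i:<length>` and `KC:i:<k-mer coverage>`, the helper names are hypothetical, and keeping segments with coverage equal to `MINCOV` is a guess at the threshold's direction.

```python
K = 5001
MINCOV = 5


def segment_coverage(line, k=K):
    """Compute KC / (length - K) from the last two fields of an S line.

    Assumes tab-separated fields whose last two entries look like
    'LN:i:<length>' and 'KC:i:<k-mer coverage>'.
    """
    fields = line.rstrip("\n").split("\t")
    length = int(fields[-2].split(":")[-1])
    kc = float(fields[-1].split(":")[-1])
    return kc / (length - k)


def keep_segment(line, k=K, mincov=MINCOV):
    """Hypothetical filter: keep segments whose coverage is at least mincov."""
    return segment_coverage(line, k) >= mincov
```

A full filter would also need to drop link lines that refer to removed segments, which this sketch leaves out.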
fedarko / read_stats.py
Last active December 17, 2023 22:43
Compute simple statistics (number of reads, total read length, average read length) for a set of (maybe gzipped) FASTA / FASTQ files
#! /usr/bin/env python3
#
# Computes the total number of reads, total read length, and average read
# length of a set of (maybe gzipped) FASTA / FASTQ files. Requires the pyfastx
# library (https://github.com/lmdu/pyfastx). I designed this in the context of
# computing read statistics, but if you have a set of other sequences (e.g.
# contigs) then I guess this would still work for that.
#
# USAGE:
# ./read_stats.py file1.fa [file2.fa ...]
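The three statistics could be computed along these lines. The split into `read_lengths()` and `summarize()` is an assumption made for illustration, not the gist's structure; `read_lengths()` relies on `pyfastx.Fastx`, which yields tuples whose second element is the sequence for both FASTA and FASTQ input.

```python
def read_lengths(paths):
    """Yield the length of every sequence in the given files."""
    import pyfastx  # third-party; imported lazily so summarize() works alone

    for path in paths:
        for record in pyfastx.Fastx(path):
            yield len(record[1])  # record[1] is the sequence


def summarize(lengths):
    """Return (number of reads, total length, average length)."""
    n = 0
    total = 0
    for length in lengths:
        n += 1
        total += length
    average = total / n if n else 0.0
    return n, total, average
```

With pyfastx installed, `summarize(read_lengths(sys.argv[1:]))` would cover the usage line above (after `import sys`).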
fedarko / shorten_edge_labels.py
Last active August 17, 2023 21:08
Shortens each edge label in a LJA DOT file to just the first line
#! /usr/bin/env python
#
# Shortens edge labels in a DOT file output by LJA to just show the first line
# and then a count of how many other lines are omitted. (If an edge's label
# spans exactly one or two lines, then the entire label is preserved.)
#
# USAGE:
# ./shorten_edge_labels.py in.dot out.dot
import sys
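The shortening rule described above might look like this for a single label. The function name, the assumption that label lines are separated by the two-character escape `\n` inside the DOT string, and the wording of the omitted-line note are all hypothetical.

```python
def shorten_label(label, sep="\\n"):
    """Keep the first line of a label plus a count of the omitted lines.

    Labels that span one or two lines are returned unchanged.
    """
    lines = label.split(sep)
    if len(lines) <= 2:
        return label
    omitted = len(lines) - 1
    return f"{lines[0]}{sep}[{omitted} more lines]"
```

Leaving two-line labels alone makes sense because replacing one line with a count would not save any space.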
fedarko / check_for_conflicting_node_ids.py
Created August 16, 2023 05:55
Checks for "conflicting" node IDs defined multiple times in a DOT file
#! /usr/bin/env python
#
# Scans through a jumboDBG / LJA output DOT file; looks for cases where
# the same node is "defined" on multiple lines. This can be caused by the
# same truncated node ID being misused across lines.
#
# USAGE:
# ./check_for_conflicting_node_ids.py graph.dot
#
# Note that this assumes that the input graph was output by jumboDBG / LJA --
fedarko / rm_seqs_from_gfa.py
Last active May 29, 2024 05:15
Remove sequences from a GFA 1 file
#! /usr/bin/env python3
#
# SUMMARY
# =======
# Outputs a copy of a GFA 1 file with each segment (S) line that contains a
# sequence (not just a "*" character) altered to have an LN:i: tag describing
# the length of the sequence, and the sequence replaced with a "*" character.
#
# All other lines (including S lines that already do not contain a sequence,
# and other types of lines [e.g. H, L, ...]) will be included unchanged in the
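A per-line sketch of the transformation summarized above. The function name is hypothetical, and skipping the new tag when an `LN:i:` tag is already present is an assumption; the preview does not say how the gist handles that case.

```python
def strip_sequence(line):
    """Replace the sequence on a GFA 1 S line with '*' and record its length.

    Lines that are not S lines, or whose sequence is already '*', are
    returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S" or len(fields) < 3 or fields[2] == "*":
        return line
    seq_len = len(fields[2])
    fields[2] = "*"
    # Assumption: do not add a second LN tag if one is already there.
    if not any(f.startswith("LN:i:") for f in fields[3:]):
        fields.append(f"LN:i:{seq_len}")
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```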
fedarko / sort-rmdup-bbl.py
Last active August 31, 2022 08:11
Sort and remove duplicate entries from BBL (BibTeX output) files; useful when combining multiple BBL files (e.g. if using the multibib package) into a single one
#! /usr/bin/env python3
# NOTE: this is a hack, so it will probably break if you have BBL files that
# don't look like the natbib-generated ones I'm used to. It is also pretty
# unintelligent about *how* it sorts entries (it defers most of the work
# to Python), so if you have cases where some of your references are by
# the same person or whatever then that might cause the output to not match
# your expectations.
import sys
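A rough sketch of the sort-and-deduplicate idea. Both helper names are hypothetical. It assumes each entry begins on a line starting with `\bibitem`, and that the text passed in is the body between `\begin{thebibliography}` and `\end{thebibliography}`. Sorting uses Python's default string ordering, in line with the note above about deferring to Python.

```python
def split_entries(bbl_body):
    """Split the body of a BBL file into one string per \\bibitem entry."""
    entries, current = [], []
    for line in bbl_body.splitlines():
        if line.startswith("\\bibitem") and current:
            entries.append("\n".join(current).strip())
            current = []
        # Ignore anything before the first \bibitem line.
        if line.startswith("\\bibitem") or current:
            current.append(line)
    if current:
        entries.append("\n".join(current).strip())
    return entries


def sort_and_dedupe(entries):
    """Drop exact duplicate entries and sort the rest as plain strings."""
    return sorted(set(entries))
```

Because whole entries are compared as strings, two entries for the same reference that differ in any character are both kept.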
fedarko / gfa-to-fasta.py
Created April 15, 2022 02:45
Convert GFA to FASTA
#! /usr/bin/env python3
# Converts a GFA assembly graph to a FASTA file of all sequences
# within the graph. Notably, this ignores connections between sequences
# in the graph.
#
# Depends on Python 3.6 or later.
#
# Usage:
# $ ./gfa-to-fasta.py mygraph.gfa contigs.fasta
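The conversion described above could be sketched as a generator over GFA lines. The function name is hypothetical, and skipping segments whose sequence is `*` is an assumption about how missing sequences should be handled.

```python
def gfa_to_fasta_lines(gfa_lines):
    """Yield FASTA lines (header, then sequence) for each GFA segment.

    Only S lines are used; links and other line types are ignored.
    Segments without a stored sequence ('*') are skipped.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3 and fields[2] != "*":
            yield f">{fields[1]}"
            yield fields[2]
```

Given an open GFA file `f`, writing `"\n".join(gfa_to_fasta_lines(f))` to the output path would produce one unwrapped FASTA record per segment.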