fedarko’s gists

fedarko / explode_primer.py

Created May 22, 2025 01:13

Compute all possible variations of a primer sequence

	# This code makes it easy to do exact matching of a primer sequence to e.g.
	# a 16S rRNA gene sequence -- you can use explode() on the primer sequence
	# to get all possible variations of this primer sequence, and then just
	# search through your gene to find exact matches to any of these primers.
	#
	# I know there are real tools that can do this faster (e.g. Primer Prospector),
	# but I just needed something quick and dirty.

	# Some example sequences with degenerate nucleotides: the EMP 16S primers,
	# per https://earthmicrobiome.org/protocols-and-standards/16s/
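
The idea described above can be sketched as follows. This is a minimal sketch, not the gist's actual code: only the name `explode()` comes from the description, while the `IUPAC` table name and the example gene sequence are assumptions made for illustration. It expands each degenerate position into its possible bases and takes the Cartesian product.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they can represent.
# (The table name IUPAC is hypothetical; the gist may store this differently.)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def explode(primer):
    """Returns every non-degenerate sequence that a primer can represent."""
    choices = [IUPAC[nt] for nt in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


# The EMP 515F primer has two degenerate positions (Y and M).
variants = explode("GTGYCAGCMGCCGCGGTAA")

# A made-up gene fragment, used only to illustrate exact matching.
gene = "TTTGTGCCAGCAGCCGCGGTAATAC"
hits = [v for v in variants if v in gene]
```

The number of variants is the product of the number of options at each position, so this approach only stays practical for primers with a handful of degenerate positions.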

fedarko / rmdegen.py

Last active April 1, 2025 22:20

Replace degenerate nucleotides in a FASTA file with random nucleotides (respecting IUPAC codes)

	#! /usr/bin/env python3
	#
	# You can run this script as follows:
	# $ ./rmdegen.py [input FASTA] [output FASTA]

	import sys
	import time
	import random
	import pyfastx
	from collections import Counter
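
A minimal sketch of the replacement step, assuming upper-case sequences and leaving any character that is not a degenerate IUPAC code unchanged. The function name `remove_degenerates` and the `DEGENERATE` table are hypothetical, and the sketch works on plain strings rather than reading FASTA files with pyfastx as the script does.

```python
import random

# Degenerate IUPAC codes mapped to the bases they may stand for.
DEGENERATE = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def remove_degenerates(seq, rng=random):
    """Replaces each degenerate code with a random base it permits."""
    return "".join(
        rng.choice(DEGENERATE[nt]) if nt in DEGENERATE else nt for nt in seq
    )


# Passing a seeded generator makes the replacement reproducible.
rng = random.Random(42)
cleaned = remove_degenerates("ACGTNRYACGT", rng)
```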

fedarko / get_runs_of_ints.py

Created November 12, 2024 04:52

Identify runs of consecutive integers in a list

	def get_runs(e):
	    """Identifies runs of consecutive ints in a list.

	    (The main reason I created this: identifying runs of 0-coverage positions in
	    the output of "samtools depth -a".)

	    Parameters
	    ----------
	    e: list of int
	        Must not contain any duplicate elements.
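
One way to find such runs is sketched below. The preview above does not show the return format, so the inclusive `(start, end)` tuples, the sorting of the input, and the name `get_runs_sketch` are all assumptions rather than the gist's actual behavior.

```python
def get_runs_sketch(e):
    """Returns a list of inclusive (start, end) tuples, one per run.

    Assumes e contains no duplicates; the input is sorted first so that
    the elements may be given in any order.
    """
    runs = []
    for n in sorted(e):
        if runs and n == runs[-1][1] + 1:
            # n extends the current run by one
            runs[-1][1] = n
        else:
            # n starts a new run
            runs.append([n, n])
    return [tuple(r) for r in runs]


# e.g. 0-coverage positions taken from "samtools depth -a" output
runs = get_runs_sketch([1, 2, 3, 7, 9, 10])
# runs == [(1, 3), (7, 7), (9, 10)]
```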

fedarko / filter_gfa_lowcov.py

Created April 30, 2024 23:53

Filter low-kmer-coverage segments from a jumboDBG GFA file

	#! /usr/bin/env python
	# Filters low-coverage segments from a jumboDBG GFA file.
	# This assumes that the second-from-the-last entry in each segment line will be
	# length, and that the last entry in each segment line will be k-mer coverage.
	# It also uses the definition of k-mer coverage assumed by jumboDBG -- so, the
	# conventional "coverage" of a segment can be computed as KC / (length - K).

	K = 5001
	MINCOV = 5
	UPDATE_FREQ = 1000
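
The coverage definition in the comments above, KC / (length - K), can be sketched like this. The tag spellings `LN:i:` and `KC:i:`, the helper names, and the choice to keep segments whose coverage is at least `MINCOV` are assumptions; the script itself may apply the threshold differently.

```python
K = 5001
MINCOV = 5


def segment_coverage(line, k=K):
    """Computes KC / (length - k) from the last two fields of an S line."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout: "... LN:i:<length>  KC:i:<kmer coverage>"
    length = int(fields[-2].split(":")[-1])
    kc = float(fields[-1].split(":")[-1])
    return kc / (length - k)


def keep_segment(line, k=K, mincov=MINCOV):
    """True if the segment's coverage meets the (assumed inclusive) cutoff."""
    return segment_coverage(line, k) >= mincov


high = "S\tseg1\t*\tLN:i:6001\tKC:i:10000\n"  # 10000 / 1000 = 10.0
low = "S\tseg2\t*\tLN:i:6001\tKC:i:2000\n"    # 2000 / 1000 = 2.0
```

A complete filter would presumably also need to drop link (L) lines that refer to removed segments; that part is not shown here.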

fedarko / read_stats.py

Last active December 17, 2023 22:43

Compute simple statistics (number of reads, total read length, average read length) for a set of (maybe gzipped) FASTA / FASTQ files

	#! /usr/bin/env python3
	#
	# Computes the total number of reads, total read length, and average read
	# length of a set of (maybe gzipped) FASTA / FASTQ files. Requires the pyfastx
	# library (https://github.com/lmdu/pyfastx). I designed this in the context of
	# computing read statistics, but if you have a set of other sequences (e.g.
	# contigs) then I guess this would still work for that.
	#
	# USAGE:
	# ./read_stats.py file1.fa [file2.fa ...]
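
For illustration, the same three statistics can be computed with only the standard library. This is a sketch that swaps pyfastx out for a simple hand-written parser, so it is not how the script works. It assumes gzipped files end in `.gz`, that FASTQ records span exactly four lines, and it reads each file fully into memory. The names `read_lengths` and `read_stats` are hypothetical.

```python
import gzip


def read_lengths(path):
    """Returns the length of every sequence in a FASTA or FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        lines = [line.strip() for line in f]
    if lines and lines[0].startswith("@"):
        # FASTQ: the sequence is the second line of each four-line record
        return [len(seq) for seq in lines[1::4]]
    lengths = []
    for line in lines:
        if line.startswith(">"):
            lengths.append(0)           # a header starts a new sequence
        elif lengths:
            lengths[-1] += len(line)    # sequences may wrap across lines
    return lengths


def read_stats(paths):
    """Returns (number of reads, total read length, average read length)."""
    lengths = [n for p in paths for n in read_lengths(p)]
    total = sum(lengths)
    average = total / len(lengths) if lengths else 0
    return len(lengths), total, average
```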

fedarko / shorten_edge_labels.py

Last active August 17, 2023 21:08

Shortens each edge label in an LJA DOT file to just the first line

	#! /usr/bin/env python
	#
	# Shortens edge labels in a DOT file output by LJA to just show the first line
	# and then a count of how many other lines are omitted. (If an edge's label
	# spans exactly one or two lines, then the entire label is preserved.)
	#
	# USAGE:
	# ./shorten_edge_labels.py in.dot out.dot

	import sys
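
The shortening rule described in the comments can be sketched as below. It assumes that the lines of a label are separated by the two-character escape `\n` inside the DOT label string, and the wording of the omitted-line count is made up; the script's actual output format is not shown in the preview. The name `shorten_label` is hypothetical.

```python
def shorten_label(label, sep="\\n"):
    """Keeps the first line of a label and counts the lines left out.

    Labels of one or two lines are returned unchanged, since replacing
    the second line with a count would not make them any shorter.
    """
    lines = label.split(sep)
    if len(lines) <= 2:
        return label
    omitted = len(lines) - 1
    return f"{lines[0]}{sep}[{omitted} more lines]"


short = shorten_label("e1\\ne2")           # unchanged
long = shorten_label("e1\\ne2\\ne3\\ne4")  # "e1\n[3 more lines]"
```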

fedarko / check_for_conflicting_node_ids.py

Created August 16, 2023 05:55

Checks for "conflicting" node IDs defined multiple times in a DOT file

	#! /usr/bin/env python
	#
	# Scans through a jumboDBG / LJA output DOT file; looks for cases where
	# the same node is "defined" on multiple lines. This can be caused by the
	# same truncated node ID being misused across lines.
	#
	# USAGE:
	# ./check_for_conflicting_node_ids.py graph.dot
	#
	# Note that this assumes that the input graph was output by jumboDBG / LJA --
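
A rough sketch of the scan follows. It assumes that node definitions look like `"id" [attributes]`, that edge lines contain `->`, and that every node is meant to be defined on exactly one line. The regular expression and the name `find_conflicting_ids` are hypothetical and may not match the format jumboDBG / LJA actually writes.

```python
import re
from collections import Counter

# Matches an optionally quoted ID followed by an attribute list.
NODE_LINE = re.compile(r'^\s*"?([^"\s\[]+)"?\s*\[')
# DOT keywords that set default attributes rather than define a node.
KEYWORDS = {"node", "edge", "graph"}


def find_conflicting_ids(lines):
    """Returns the node IDs that are defined on more than one line."""
    counts = Counter()
    for line in lines:
        if "->" in line:
            continue  # edge line, not a node definition
        match = NODE_LINE.match(line)
        if match and match.group(1) not in KEYWORDS:
            counts[match.group(1)] += 1
    return sorted(node for node, n in counts.items() if n > 1)


dot = [
    "digraph g {",
    '"12" [label="12 ACGT"]',
    '"34" [label="34 GGCC"]',
    '"12" [label="12 TTTT"]',
    '"12" -> "34" [label="edge"]',
    "}",
]
conflicts = find_conflicting_ids(dot)  # ["12"]
```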

fedarko / rm_seqs_from_gfa.py

Last active May 29, 2024 05:15

Remove sequences from a GFA 1 file

	#! /usr/bin/env python3
	#
	# SUMMARY
	# =======
	# Outputs a copy of a GFA 1 file with each segment (S) line that contains a
	# sequence (not just a "*" character) altered to have an LN:i: tag describing
	# the length of the sequence, and the sequence replaced with a "*" character.
	#
	# All other lines (including S lines that already do not contain a sequence,
	# and other types of lines [e.g. H, L, ...]) will be included unchanged in the

fedarko / sort-rmdup-bbl.py

Last active August 31, 2022 08:11

Sort and remove duplicate entries in BBL files (the bibliography files that BibTeX generates); useful when combining multiple BBL files (e.g. if using the multibib package) into a single one

	#! /usr/bin/env python3
	# NOTE: this is a hack, so it will probably break if you have BBL files that
	# don't look like the natbib-generated ones I'm used to. It is also pretty
	# unintelligent about how it sorts entries (it defers most of the work
	# to python), so if you have cases where some of your references are by
	# the same person or whatever then that might cause the output to not match
	# your expectations.

	import sys

fedarko / gfa-to-fasta.py

Created April 15, 2022 02:45

Convert GFA to FASTA

	#! /usr/bin/env python3
	# Converts a GFA assembly graph to a FASTA file of all sequences
	# within the graph. Notably, this ignores connections between sequences
	# in the graph.
	#
	# Depends on Python 3.6 or later.
	#
	# Usage:
	# $ ./gfa-to-fasta.py mygraph.gfa contigs.fasta
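
The conversion described above amounts to reading the name and sequence from each segment (S) line and ignoring everything else. A minimal sketch is given below; the name `gfa_to_fasta_records` is hypothetical, and skipping segments whose sequence is `*` is an assumption about how such segments should be handled.

```python
def gfa_to_fasta_records(gfa_lines):
    """Yields one FASTA record per GFA segment line that has a sequence."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # links and other connections are ignored
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # assumption: segments without a sequence are skipped
        yield f">{name}\n{seq}\n"


gfa = [
    "H\tVN:Z:1.0\n",
    "S\tcontig1\tACGT\n",
    "S\tcontig2\t*\tLN:i:10\n",
    "L\tcontig1\t+\tcontig2\t-\t0M\n",
]
fasta = "".join(gfa_to_fasta_records(gfa))  # ">contig1\nACGT\n"
```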

Marcus Fedarko fedarko