Marcus Fedarko fedarko

fedarko / rmdegen.py
Last active April 1, 2025 22:20
Replace degenerate nucleotides in a FASTA file with random nucleotides (respecting IUPAC codes)
#! /usr/bin/env python3
#
# You can run this script as follows:
# $ ./rmdegen.py [input FASTA] [output FASTA]
import sys
import time
import random
import pyfastx
from collections import Counter
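The preview above stops at the imports, so the following is only a minimal sketch of the idea in the description (replacing each degenerate base with a random base permitted by its IUPAC code). The names `IUPAC` and `replace_degenerate` are hypothetical and are not taken from the gist; FASTA reading and writing (which the gist does with pyfastx) is left out.

```python
import random

# Standard IUPAC degenerate nucleotide codes and the bases each one permits.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def replace_degenerate(seq, rng=random):
    """Return seq with each degenerate base replaced by a random allowed base.

    Hypothetical helper: non-degenerate characters pass through unchanged,
    and replacements are always emitted in upper case (an assumption).
    """
    out = []
    for c in seq:
        choices = IUPAC.get(c.upper())
        out.append(rng.choice(choices) if choices else c)
    return "".join(out)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which can be handy when comparing runs.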
fedarko / get_runs_of_ints.py
Created November 12, 2024 04:52
Identify runs of consecutive integers in a list
def get_runs(e):
    """Identifies runs of consecutive ints in a list.

    (The main reason I created this: identifying runs of 0-coverage positions in
    the output of "samtools depth -a".)

    Parameters
    ----------
    e: list of int
        Must not contain any duplicate elements.
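The body of `get_runs` is not shown in this preview. One way to identify runs of consecutive integers is sketched below; `find_runs` is a hypothetical name, and both the sorting step and the `(start, end)` return format are assumptions rather than the gist's actual behavior.

```python
def find_runs(values):
    """Group distinct ints into inclusive (start, end) runs of consecutive values.

    Hypothetical sketch: the input is sorted first, and each run is reported
    as a tuple of its first and last element.
    """
    runs = []
    for v in sorted(values):
        if runs and v == runs[-1][1] + 1:
            # v extends the current run by one.
            runs[-1] = (runs[-1][0], v)
        else:
            # v starts a new run.
            runs.append((v, v))
    return runs
```

For zero-coverage positions taken from `samtools depth -a`, each tuple would correspond to one uncovered interval.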
fedarko / filter_gfa_lowcov.py
Created April 30, 2024 23:53
Filter low-kmer-coverage segments from a jumboDBG GFA file
#! /usr/bin/env python
# Filters low-coverage segments from a jumboDBG GFA file.
# This assumes that the second-from-the-last entry in each segment line will be
# length, and that the last entry in each segment line will be k-mer coverage.
# It also uses the definition of k-mer coverage assumed by jumboDBG -- so, the
# conventional "coverage" of a segment can be computed as KC / (length - K).
K = 5001
MINCOV = 5
UPDATE_FREQ = 1000
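The formula in the comments above, KC / (length - K), can be sketched as follows. This assumes the last two fields of a segment line are colon-separated tags such as `LN:i:6001` and `KC:i:10000`, and that a segment is kept when its conventional coverage is at least `MINCOV`; neither detail is confirmed by the preview. `segment_coverage` and `keep_segment` are hypothetical names.

```python
K = 5001
MINCOV = 5


def segment_coverage(line, k=K):
    """Compute KC / (length - k) for one tab-separated GFA segment line.

    Assumes the second-from-last field holds the length and the last field
    holds the k-mer coverage, each as a TAG:TYPE:VALUE entry. A segment whose
    length equals k would divide by zero and is not handled here.
    """
    fields = line.rstrip("\n").split("\t")
    length = int(fields[-2].rsplit(":", 1)[-1])
    kc = float(fields[-1].rsplit(":", 1)[-1])
    return kc / (length - k)


def keep_segment(line, k=K, mincov=MINCOV):
    """Return True if the segment's coverage meets the (assumed) threshold."""
    return segment_coverage(line, k) >= mincov
```

A full filter would presumably also need to drop link lines that refer to removed segments, which this sketch does not attempt.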
fedarko / read_stats.py
Last active December 17, 2023 22:43
Compute simple statistics (number of reads, total read length, average read length) for a set of (maybe gzipped) FASTA / FASTQ files
#! /usr/bin/env python3
#
# Computes the total number of reads, total read length, and average read
# length of a set of (maybe gzipped) FASTA / FASTQ files. Requires the pyfastx
# library (https://github.com/lmdu/pyfastx). I designed this in the context of
# computing read statistics, but if you have a set of other sequences (e.g.
# contigs) then I guess this would still work for that.
#
# USAGE:
# ./read_stats.py file1.fa [file2.fa ...]
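The three statistics named above can be computed with the standard library alone, as in the sketch below. It does not use pyfastx, so it is not the gist's implementation; it assumes FASTQ records span exactly four lines with no blank lines, and the names `open_maybe_gzip`, `seq_lengths`, and `read_stats` are hypothetical.

```python
import gzip
from itertools import chain


def open_maybe_gzip(path):
    """Open a plain or gzipped text file based on its extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def seq_lengths(lines):
    """Yield sequence lengths from FASTA or 4-line FASTQ text lines."""
    stripped = (ln.strip() for ln in lines)
    first = next(stripped, None)
    if first is None:
        return
    records = chain([first], stripped)
    if first.startswith("@"):
        # FASTQ: the sequence is the second line of every 4-line record.
        for i, ln in enumerate(records):
            if i % 4 == 1:
                yield len(ln)
    else:
        # FASTA: a sequence may wrap across several lines after its header.
        length = None
        for ln in records:
            if ln.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(ln)
        if length is not None:
            yield length


def read_stats(lengths):
    """Return (number of reads, total length, average length)."""
    lengths = list(lengths)
    total = sum(lengths)
    n = len(lengths)
    return n, total, (total / n if n else 0.0)
```

To cover several files, the lengths from each `open_maybe_gzip(path)` handle could be chained together before calling `read_stats`.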
fedarko / shorten_edge_labels.py
Last active August 17, 2023 21:08
Shortens each edge label in a LJA DOT file to just the first line
#! /usr/bin/env python
#
# Shortens edge labels in a DOT file output by LJA to just show the first line
# and then a count of how many other lines are omitted. (If an edge's label
# spans exactly one or two lines, then the entire label is preserved.)
#
# USAGE:
# ./shorten_edge_labels.py in.dot out.dot
import sys
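The shortening rule described in the comments above (keep labels of one or two lines, otherwise keep the first line plus a count of omitted lines) might look like the sketch below. It assumes label lines are separated by the literal two-character escape `\n` used in DOT strings, and the wording of the count line is invented for illustration; `shorten_label` is a hypothetical name, and locating labels inside the DOT file is not shown.

```python
def shorten_label(label, sep="\\n"):
    """Shorten a multi-line DOT label to its first line plus an omitted count.

    Hypothetical sketch: labels with one or two lines are returned unchanged,
    since replacing a single second line with a count would save nothing.
    """
    lines = label.split(sep)
    if len(lines) <= 2:
        return label
    return f"{lines[0]}{sep}[{len(lines) - 1} more lines]"
```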
fedarko / check_for_conflicting_node_ids.py
Created August 16, 2023 05:55
Checks for "conflicting" node IDs defined multiple times in a DOT file
#! /usr/bin/env python
#
# Scans through a jumboDBG / LJA output DOT file; looks for cases where
# the same node is "defined" on multiple lines. This can be caused by the
# same truncated node ID being misused across lines.
#
# USAGE:
# ./check_for_conflicting_node_ids.py graph.dot
#
# Note that this assumes that the input graph was output by jumboDBG / LJA --
fedarko / rm_seqs_from_gfa.py
Last active May 29, 2024 05:15
Remove sequences from a GFA 1 file
#! /usr/bin/env python3
#
# SUMMARY
# =======
# Outputs a copy of a GFA 1 file with each segment (S) line that contains a
# sequence (not just a "*" character) altered to have an LN:i: tag describing
# the length of the sequence, and the sequence replaced with a "*" character.
#
# All other lines (including S lines that already do not contain a sequence,
# and other types of lines [e.g. H, L, ...]) will be included unchanged in the
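The transformation summarized above can be sketched per line as follows. `strip_sequence` is a hypothetical name; appending the `LN:i:` tag at the end of the line, and leaving an already-present `LN:i:` tag untouched, are assumptions about details the preview does not show.

```python
def strip_sequence(line):
    """Replace the sequence on a GFA 1 segment line with "*" plus an LN:i: tag.

    Hypothetical sketch: non-segment lines, malformed segment lines, and
    segment lines whose sequence is already "*" are returned unchanged.
    """
    if not line.startswith("S\t"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[2] == "*":
        return line
    seq = fields[2]
    fields[2] = "*"
    # Assumption: keep an existing LN:i: tag rather than adding a second one.
    if not any(f.startswith("LN:i:") for f in fields[3:]):
        fields.append(f"LN:i:{len(seq)}")
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applying this function to each line of the input and writing the results in order would give a copy of the graph without sequences.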
fedarko / sort-rmdup-bbl.py
Last active August 31, 2022 08:11
Sort and remove duplicate entries in BBL files (bibliographies generated by BibTeX); useful when combining multiple BBL files (e.g. if using the multibib package) into a single one
#! /usr/bin/env python3
# NOTE: this is a hack, so it will probably break if you have BBL files that
# don't look like the natbib-generated ones I'm used to. It is also pretty
# unintelligent about *how* it sorts entries (it defers most of the work
# to python), so if you have cases where some of your references are by
# the same person or whatever then that might cause the output to not match
# your expectations.
import sys
fedarko / gfa-to-fasta.py
Created April 15, 2022 02:45
Convert GFA to FASTA
#! /usr/bin/env python3
# Converts a GFA assembly graph to a FASTA file of all sequences
# within the graph. Notably, this ignores connections between sequences
# in the graph.
#
# Depends on Python 3.6 or later.
#
# Usage:
# $ ./gfa-to-fasta.py mygraph.gfa contigs.fasta
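The conversion described above amounts to emitting one FASTA record per segment line and ignoring everything else. A minimal sketch follows; `gfa_to_fasta_records` is a hypothetical name, and skipping segments whose sequence is `*` is an assumption, since the preview does not say how the gist treats them.

```python
def gfa_to_fasta_records(gfa_lines):
    """Yield one FASTA record string per GFA segment line that has a sequence.

    Hypothetical sketch: link, header, and other non-segment lines are
    ignored, so connections between sequences are discarded.
    """
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        name, seq = fields[1], fields[2]
        # Assumption: segments without a stored sequence are skipped.
        if seq != "*":
            yield f">{name}\n{seq}\n"
```

With an open input handle `gfa` and output handle `out`, `out.writelines(gfa_to_fasta_records(gfa))` would write the FASTA file.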
fedarko / handle_duplicate_sample_ids.py
Last active December 16, 2019 22:12
Script to report on duplicate IDs in a plate map spreadsheet (and modify certain duplicate IDs, in a very specific case); also attempts to update Qiita prep files accordingly. As a warning, code is untested / pretty gross.
#! /usr/bin/env python3
import os
from collections import Counter
from math import ceil
import re
from numpy import argmax
import pandas as pd
from qiime2 import Metadata
# "Parameters" of this script