Last active
August 29, 2015 14:06
-
-
Save senvey/75d4188922a3f8061804 to your computer and use it in GitHub Desktop.
[pika] Clump Finding Problem
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times. For example, TGCA forms a (25,3)-clump in the following Genome: gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac. | |
Clump Finding Problem | |
Find patterns forming clumps in a string. | |
Given: A string Genome, and integers k, L, and t. | |
Return: All distinct k-mers forming (L, t)-clumps in Genome. | |
""" | |
from collections import defaultdict | |
def search(inseq, k, L, t): | |
lookup = defaultdict(list) | |
result = set() | |
for cursor in range(len(inseq) - k + 1): | |
seg = inseq[cursor:cursor + k] | |
# remove prior positions of the same segment | |
# if they are more than L distance far | |
while lookup[seg] and cursor + k - lookup[seg][0] > L: | |
lookup[seg].pop(0) | |
lookup[seg].append(cursor) | |
if len(lookup[seg]) == t: | |
result.add(seg) | |
return result |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment