https://leetcode.com/problems/repeated-dna-sequences/
Repeated DNA Sequences
Total Accepted: 26334 Total Submissions: 125642 Difficulty: Medium
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Code:
class Solution(object): def findRepeatedDnaSequences(self, s): """ :type s: str :rtype: List[str] """ ecd = 0 res = [] d = dict() for i in xrange(len(s)): ecd = ecd << 3 ecd = ecd | ord(s[i]) & 7 ecd = ecd & int('0b'+'1'*30, 2) if d.get(ecd) is None: d[ecd] = 1 elif d[ecd] == 1 and i >= 9: res.append(s[i-9:i+1]) d[ecd] = 2 return res
Idea:
Since we only have “A”, “C”, “G” and “T”, we can encode each character using the last three bits of its binary representation. We want to encode every 10 character substring eventually. We create a variable `ecd`, which should only have 30 bits valid (10 character*3 bits/character). Since python can allow you to have very large integer, you need to mask the last 30 bits. (Line 13) For every reading bit, you left shift the previous code 3 bits (line 11), intercept this character’s last 3 bits and then fill the last 3 empty bits (line 12).
Then it is easy to find whether one encode has been saved or not. If the encode already exists as a key in the dictionary, you only need to report it once.
Reference:
https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c