Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

CMPS 3120 Algorithm Analysis

 

Space and Time Trade-offs

 


String Searching Problem: Given a pattern and a text, find out all the matching locations.

pattern: a string of m characters to search for
text: a (long) string of n characters to search in
matching: a fully match for all the characters
comparison: compare 2 characters (positive or negative result)

Solution


 

Hashing (PDF ver)

 




The Human Genome Project: The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome Project’s signature accomplishment – generating the first sequence of the human genome – provided fundamental information about the human blueprint, which has since accelerated the study of human biology and improved the practice of medicine.

--National Human Genome Research Institute

Base Pair (BP): A base pair consists of two complementary DNA nucleotide bases that pair together to form a “rung of the DNA ladder.” DNA is made of two linked strands that wind around each other to resemble a twisted ladder — a shape known as a double helix. Each strand has a backbone made of alternating sugar (deoxyribose) and phosphate groups.

The human genome contains 3 billion bps that form tens of thousands of genes, yet its complex structure is made up of only four molecules: adenine (A), cytosine (C), guanine (G), and thymine (T).

 acgt

Motif: Motif is a region (a subsequence) of protein or DNA sequence that has a specific structure.

FOXA1: This gene encodes a member of the forkhead class of DNA-binding proteins. These hepatocyte nuclear factors are transcriptional activators for liver-specific transcripts such as albumin and transthyretin, and they also interact with chromatin.            AAAGTAAACA

 

 

How fast we can go:
We have FOXA1 (AAAGTAAACA) as the pattern (10 bps); and the entire gene as the text (3,000,000,000 bps).

What if we change to other motifs?

Hints