********************************************** * REGULAR LANGUAGES AND REGULAR EXPRESSIONS * ********************************************** A language is a set that can be informally described through examples, e.g. "The orange cat jumped over the fence." and "The hat is red." are sentences in the set of sentences in the English language. The set of a language (which could be infinite) can be more precisely described by set-builder notation. Set builder notation is used here to describe the language of regular expressions (regexes). The purpose is to show that regexes are regular languages. An important property of a regular language is that it can be scanned for validity from left to right without backtracking. Def. Let A be an alphabet. The class R of regular languages over A, and the corresponding regular expressions are defined by: 1. Φ (phi) or {} (the empty-set) is a regular language expression and the corresponding regular expression is {}. This trivial case says that an empty set of strings is regular. 2. {λ} is a regular language, and the corresponding regular expression is λ (lambda), where lambda is the empty string. The empty string λ is not the same as the empty language {}. This trivial case says that a set containing the empty string is regular. 3. For each 'a' in A, {a} is a regular language, and 'a' is the corresponding regular expression. 4. If L1 and L2 are regular languages and r1 and r2 are the corresponding regular expressions, then (a) L1 | L2 (set union - sometimes denoted by +) is a regular language with the regular expression (r1 | r2). Reads L1 or L2. (b) L1L2 is a regular language with the regular expression (r1r2). This is string concatenation. (c) L1* is a regular language with the regular expression (r1*), where * means L1 concatenated 0 or more times. (d) L1^n is a regular language with L1 concatenated n times. A new regex can be formed by applying the operators * and | to a regex. A new regex can be formed by applying concatenation to two existing regexes. Examples. Let A = {a,b} ((a|b)*|(ab|bb)) # denotes all possible strings over A including lambda {lambda,b,bb,ab,abb,bab,bbb,abab,...} (a|b)*(ab) # denotes all strings over A that end with ab {ab,bab,aab,abab, ... } (a|b)^2|(ab)* # denotes the union of all strings over A of length # 2 with all strings over A of 0 or more 'ab's {aa,bb,ab,ba,aaab,bbab,abab,baab,...} a(a|b)*b|b(a|b)*a # denotes all strings over A whose first and last # symbols differ. {ab,ba,aab,abb,baa,bba,...} (ab|b)*(b|ab)^2 # denotes all strings over A that are the concatenation # of ab and b with strings formed by two concatenations # of b and ab {bb,abab,bbb,babab,abbb,ababab,abbb,...} There is a precedence hierarchy for operators in a regex that does not contain parentheses. regular expression operator hierarchy ------------------------------------- parentheses Kleene star * concatenation union | Example of applying hierarchy for regular expressions b|cb = b|(cb) { concatenation has higher priority } b|ab* = b|a(b*) { Kleene star has priority over concatenation } EXAMPLE. (0|1)*00(0|1)* # denotes all strings containing substring 00 ((01|1)*)(0|lambda) # denotes all strings without substring 00 Note that there is no clear relationship between the regular expression for a language and a regular expression for its complement. EXAMPLE Identifiers: A string of letters and/or digits beginning with letter. letter(letter|digit)* where letter = a|b|c...|z|A|B|C...|Z and digit = 0|1|2..|9 Integers: ('+' + '-' + lambda)(digit)(digit)* Reals: ('+'|'-'|lambda)digit digit* (lambda | '.' digit digit*) ((lambda)| ('E' (lambda | '+' | '-') digit digit* ) EXAMPLE (a|bc)*(c|lambda) = {lambda, a, c, aa, ac, bc, ...} L=the set of all strings of a or bc, with an optional c at the end. EXAMPLE: Let L be the language of the following regular expression. c*(b|(ac*))* Is cba in L? yes Is ccbaac in L? yes Is cbc in L? no, we can't get a second c without an a. Is acabacc in L? yes Is bbaaacc in L? yes EX: Given a finite language, find a regular expression. L = {a, bcc, acb} reg. expr: a|bcc|acb Clearly, there is a reg. expr. for any finite language. That regex is created by the union of all the elements of L. Thus, any finite language is regular. EX: A = {a,b} L = {w : w = (a^n)(b^n) , 0<=n<=3 } ab|aabb + aaabbb is a regular expression for L. This language is regular because it is finite. HOWEVER: A = {a,b} L = {w : w = (a^n)(b^n) , 0<=n } a*b* does NOT work, because it includes aab (among others) (ab)* does NOT work, because it includes abababab... There is no regular expression for this language. EX: L = the set of all palindromes over {a,b} For example: a abba bab bb We cannot write a regular expression for this language. Why? Informally, because as we scan from left to right, the test for validity of the string depends on something yet to come. We could write a regular expression for a *finite* set of palindrones, but not for the set of all palindromes.