Formal Languages, Regular Expression, Automata, Transducers
Formal Language
Formal language can model a phenomenon, e.g written english
Examples
Mathmatical equations: 1+1=2
A sentence: I ate breakfast
Some combination of letters
Formal Grammar
Formal Grammar has a set of rules and matches all and only instances of a formal language
A formal grammar defines a formal language
Formal grammars are used to generate and recoginize formal languages e.g programming language has a set of rules and syntax
Formal Grammar Properties
N: a finite set of non-terminal symbols
Non-Terminal: Abstract symbols representing groups of terminals or other non-terminals
Example: S (sentence), NP (noun phrase), VP (verb phrase)
T: a finite set of terminal symbols
Terminal: Basic symbols that cannot be broken down furthur
Example: "I ate breakfast" can be broken down into 3 terminal symbols, "I", "ate", "breakfast"
R: a set of rewrite rules
Rules defining how non-terminals can be written into sequences of terminals/non-terminals
Example: S -> NP VP
S: A special non-terminal that is the start symbol
Or the root non-terminal that can derive to other non/terminals
The Chomsky Hierarchy
Type 0: No restrictions on rules
Most powerful, no restrictions on rules
Type 1: Context sensitive rules
Example: Modeling agreements (he walks vs they walk)
αAβ → αγβ, rule means replace A with γ where A is between α and β
Type 2: Context free rules
Like type 1, but left-hand side can only include one non-terminal
Example: Parsing sentence structure ("The dog barks" as S → NP VP).
Type 3: Regular Grammars
Simplest, used for pattern matching
Like type 2, but right-hand side is constrained
Non-terminal precedes but don't follow terminals in left regular grammar
Non-terminal follow but don't precede terminals in right regular grammar
Type0⊇Type1⊇Type2⊇Type3
Type 3 is least expressive most efficient, Type 1 is most expressive least efficient
Type 0 = exponential; Type 1 = polynomial; Type 2 = O(n^3); Type 3 = O(n log n)
Regular Expression
Concatenation
If X is a regexp and Y is a regexp, then XY is a regexp
Examples
If ABC and DEF are regexps, then ABCDEF is a regexp
If AB* and BC* are regexps, then ABBC is a regexp
Note: Kleene * is explained below
Disjunction
If X is a regexp and Y is a regexp, then X | Y is a regexp
Example: ABC|DEF will match either ABC or DEF
Repetition
If X is a regexp than a repetition of X will also be a regexp
The Kleene Star: A* means 0 or more instances of A
Regexp{number}: A{2} means exactly 2 instances of A
Disjunction of characters
[ABC] – means the same thing as A | B | C
[a-zA-Z0-9] – character ranges are equivalent to lists: a|b|c|...|A|B|...|0|1|...|9|
Negation of character lists/sequences
^ inside bracket means complement of disjunction, e.g., [^a-z] means a character that is neither a nor b nor c … nor z
Question: Why is character negation equivalent to a disjunction?
Parentheses
Disambiguate scope of operators
(ABC)|(DEF) means ABC or ADEF
Otherwise defaults apply, e.g., ABC|D means ABC or ABD
? signifies optionality
ABC? is equivalent to (ABC)|(AB)
plus+ indiates 1 or more
A(BC)* is equivalent to A|(A(BC)+)
Special Symbols:
Period means any character, e.g., A.*B – matches A and B and any characters between
Carrot (^) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator]
Dollar sign ($) means the end of a line, .e.g., [.?!] * $ matches final punctuation, zero or more spaces and the end of a line
Finite State Automata (FSA)
FSAs are used to recognize patterns in sequences including regex
Deterministic Finite State Automata (DFSA)
Rules are clear/unambiguous For every state, there is exactly one transition
Non-Deterministic FSA (NDFSA)
Rules are unclear/ambiguous, can have mutiple rules and transitions: