Formal Languages, Regular Expression, Automata, Transducers

Formal Language

  • Formal language can model a phenomenon, e.g written english

  • Examples

    • Mathmatical equations: 1+1=2

    • A sentence: I ate breakfast

    • Some combination of letters

Formal Grammar

  • Formal Grammar has a set of rules and matches all and only instances of a formal language

  • A formal grammar defines a formal language

  • Formal grammars are used to generate and recoginize formal languages e.g programming language has a set of rules and syntax

Formal Grammar Properties

  • N: a finite set of non-terminal symbols

    • Non-Terminal: Abstract symbols representing groups of terminals or other non-terminals

    • Example: S (sentence), NP (noun phrase), VP (verb phrase)

  • T: a finite set of terminal symbols

    • Terminal: Basic symbols that cannot be broken down furthur

    • Example: "I ate breakfast" can be broken down into 3 terminal symbols, "I", "ate", "breakfast"

  • R: a set of rewrite rules

    • Rules defining how non-terminals can be written into sequences of terminals/non-terminals

    • Example: S -> NP VP

  • S: A special non-terminal that is the start symbol

    • Or the root non-terminal that can derive to other non/terminals

The Chomsky Hierarchy

  • Type 0: No restrictions on rules

    • Most powerful, no restrictions on rules

  • Type 1: Context sensitive rules

    • Example: Modeling agreements (he walks vs they walk)

    • αAβ → αγβ, rule means replace A with γ where A is between α and β

  • Type 2: Context free rules

    • Like type 1, but left-hand side can only include one non-terminal

    • Example: Parsing sentence structure ("The dog barks" as S → NP VP).

  • Type 3: Regular Grammars

    • Simplest, used for pattern matching

    • Like type 2, but right-hand side is constrained

      • Non-terminal precedes but don't follow terminals in left regular grammar

      • Non-terminal follow but don't precede terminals in right regular grammar

  • Type0⊇Type1⊇Type2⊇Type3

    • Type 3 is least expressive most efficient, Type 1 is most expressive least efficient

    • Type 0 = exponential; Type 1 = polynomial; Type 2 = O(n^3); Type 3 = O(n log n)

Regular Expression

  • Concatenation

    • If X is a regexp and Y is a regexp, then XY is a regexp

    • Examples

      • If ABC and DEF are regexps, then ABCDEF is a regexp

      • If AB* and BC* are regexps, then ABBC is a regexp

        • Note: Kleene * is explained below

  • Disjunction

    • If X is a regexp and Y is a regexp, then X | Y is a regexp

    • Example: ABC|DEF will match either ABC or DEF

  • Repetition

    • If X is a regexp than a repetition of X will also be a regexp

      • The Kleene Star: A* means 0 or more instances of A

      • Regexp{number}: A{2} means exactly 2 instances of A

  • Disjunction of characters

    • [ABC] – means the same thing as A | B | C

    • [a-zA-Z0-9] – character ranges are equivalent to lists: a|b|c|...|A|B|...|0|1|...|9|

  • Negation of character lists/sequences

    • ^ inside bracket means complement of disjunction, e.g., [^a-z] means a character that is neither a nor b nor c … nor z

    • Question: Why is character negation equivalent to a disjunction?

  • Parentheses

    • Disambiguate scope of operators

    • (ABC)|(DEF) means ABC or ADEF

    • Otherwise defaults apply, e.g., ABC|D means ABC or ABD

  • ? signifies optionality

    • ABC? is equivalent to (ABC)|(AB)

  • plus+ indiates 1 or more

    • A(BC)* is equivalent to A|(A(BC)+)

  • Special Symbols:

    • Period means any character, e.g., A.*B – matches A and B and any characters between

    • Carrot (^) means the beginning of a line, e.g., ^ABC matches ABC at the beginning of a line [*Note dual usage of ^ as negation operator]

    • Dollar sign ($) means the end of a line, .e.g., [.?!] * $ matches final punctuation, zero or more spaces and the end of a line

Finite State Automata (FSA)

  • FSAs are used to recognize patterns in sequences including regex

  • Deterministic Finite State Automata (DFSA)

    • Rules are clear/unambiguous For every state, there is exactly one transition

  • Non-Deterministic FSA (NDFSA)

    • Rules are unclear/ambiguous, can have mutiple rules and transitions: