HMM and Part of Speech Tagging

POS

  • Verbs: VB, VBP, VBZ, VBD, VBG, VBN

    • base, present-non-3rd, present-3rd, past, -ing, -en

  • Nouns: NNP, NNPS, NN, NNS

    • proper/common, singular/plural (singular includes mass + generic)

  • Adjectives: JJ, JJR, JJS (base, comparative, superlative)

  • Adverbs: RB, RBR, RBS, RP (base, comparative, superlative, particle)

  • Pronouns: PRP, PP$ (personal, possessive)

  • Interrogatives: WP, WP$, WDT, WRB (compare to: PRP, PP$, DT, RB)

  • Other Closed Class: CC, CD, DT, PDT, IN, MD

  • Punctuation: # $ . , : ( ) “ ” '' ' `

  • Weird Cases: FW (déjà vu), SYM (@), LS (1, 2, a, b), TO (to), POS ('s, '), UH (no, OK, well), EX (existential there)

  • Newer tags: HYPH, PU

Tokenization Rules

  1. Divide at spaces and hyphens

  2. Divide before punctuation that is followed by a space or the end of the line

  3. Break off the following as separate tokens when followed by a space or end of line:

    • 's, n't, 'd, 'll, 're, etc

  4. Sentence splitting potentially after . ? !

    • [S]This is this.[S][S]That is that.[S]
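The rules above can be sketched as a minimal tokenizer. This is an illustrative simplification, not a full Penn Treebank tokenizer; the regexes and clitic list are assumptions chosen to match rules 1–3.

```python
import re

# Rule 3: clitics to break off as separate tokens (list is illustrative)
CLITICS = re.compile(r"(?<=\w)('s|n't|'d|'ll|'re|'ve|'m)$", re.IGNORECASE)

def tokenize(text):
    tokens = []
    # Rule 1: divide at spaces and hyphens
    for chunk in re.split(r"[ \-]+", text.strip()):
        # Rule 2: divide before trailing punctuation
        m = re.match(r"^(.*?)([.,!?;:]*)$", chunk)
        core, punct = m.group(1), m.group(2)
        # Rule 3: break off clitics such as 's and n't
        cm = CLITICS.search(core)
        if cm:
            tokens.append(core[:cm.start()])
            tokens.append(cm.group(1))
        elif core:
            tokens.append(core)
        tokens.extend(punct)  # each punctuation mark is its own token
    return tokens
```

For example, `tokenize("She isn't here.")` yields `["She", "is", "n't", "here", "."]`.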

Finite State Automata

  • DFSA

    • Rules are unambiguous

    • Less flexible

  • NDFSA

    • Rules are ambiguous

    • Flexible but difficult to implement
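A DFSA can be sketched as a transition table where every (state, symbol) pair has at most one successor, so the rules are unambiguous. The "sheep language" baa+! below is a standard toy example; the state numbering is an assumption.

```python
# DFSA recognizing baa+! (one b, two or more a's, then !)
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of extra a's
    (3, "!"): 4,
}
ACCEPT = {4}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:      # no rule for this (state, symbol) pair
            return False
    return state in ACCEPT
```

In an NDFSA the table would map some (state, symbol) pairs to *sets* of successor states, which is why recognition requires search or backtracking.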

Hearst Pattern

  • Hypernym(X,Y) means that X is a type of Y
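One classic Hearst pattern is "Y such as X1, X2, and X3", which yields hypernym(Xi, Y) for each Xi. A regex sketch (the pattern and its coverage are simplified assumptions; real systems use a family of such patterns):

```python
import re

# Matches "Y such as X1, X2, and X3" and extracts (hyponym, hypernym) pairs
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:(?:, and |, | and )\w+)*)")

def hearst_pairs(text):
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1)
        for hyponym in re.split(r", and |, | and ", m.group(2)):
            pairs.append((hyponym, hypernym))   # hypernym(X, Y): X is a type of Y
    return pairs
```

For example, "fruits such as apples, oranges, and pears" yields hypernym(apples, fruits), hypernym(oranges, fruits), hypernym(pears, fruits).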

HMM

  • A weighted finite-state automaton (WFSA)

    • each transition arc is associated with a probability

    • the probabilities on all arcs leaving a state sum to 1

  • A Markov chain predicts the probability of the next state based only on the current (or a fixed number of previous) states

  • A Hidden Markov Model (HMM) differs in that the states (the POS tags) are hidden; only the words are observed

  • An HMM consists of:

    • Q = set of states, q0 start state ... qf final state

    • A = transition probability matrix (n × n)

    • O = sequence of T observations (words) drawn from a vocabulary V

    • B = sequence of observation likelihoods, a.k.a. emission probabilities: b_i(o_t), the probability of observation o_t being generated from state i
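The components above can be sketched for a toy two-tag model. All state names and probability values here are made-up illustrations, not estimates from data.

```python
# Q = {NN, VB} plus an implicit start state <s>
states = ["NN", "VB"]

# A: transition probabilities P(tag_i | tag_{i-1})  (illustrative numbers)
A = {
    ("<s>", "NN"): 0.7, ("<s>", "VB"): 0.3,
    ("NN", "NN"): 0.3,  ("NN", "VB"): 0.7,
    ("VB", "NN"): 0.6,  ("VB", "VB"): 0.4,
}

# B: emission probabilities b_i(o_t) = P(word | tag)  (illustrative numbers)
B = {
    ("NN", "dogs"): 0.4,  ("NN", "bark"): 0.1,
    ("VB", "dogs"): 0.05, ("VB", "bark"): 0.5,
}

def joint_prob(tags, words):
    """P(tags, words) = product over i of A[t_{i-1}, t_i] * B[t_i, w_i]."""
    p = 1.0
    prev = "<s>"
    for t, w in zip(tags, words):
        p *= A[(prev, t)] * B[(t, w)]
        prev = t
    return p
```

Tagging then amounts to finding the tag sequence that maximizes this joint probability, which the Viterbi algorithm does efficiently.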