HMM and Part of Speech Tagging

POS

  • Verbs: VB, VBP, VBZ, VBD, VBG, VBN

    • base, present-non-3rd, present-3rd, past, -ing, -en

  • Nouns: NNP, NNPS, NN, NNS

    • proper/common, singular/plural (singular includes mass + generic)

  • Adjectives: JJ, JJR, JJS (base, comparative, superlative)

  • Adverbs: RB, RBR, RBS, RP (base, comparative, superlative, particle)

  • Pronouns: PRP, PP$ (personal, possessive)

  • Interrogatives: WP, WP$, WDT, WRB (compare to: PRP, PP$, DT, RB)

  • Other Closed Class: CC, CD, DT, PDT, IN, MD

  • Punctuation: # $ . , : ( ) “ ” '' ' `

  • Weird Cases: FW (déjà vu), SYM (@), LS (1, 2, a, b), TO (to), POS ('s, '), UH (no, OK, well), EX (existential there)

  • Newer tags: HYPH, PU

Tokenization Rules

  1. Divide at spaces and hyphens

  2. Divide before punctuation that is followed by a space or the end of the line

  3. Break off the following as separate tokens when followed by a space or end of line:

    • 's, n't, 'd, 'll, 're, etc

  4. Sentence splitting potentially after . ? !

    • [S]This is this.[S][S]That is that.[S]
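The rules above can be sketched as a minimal tokenizer. This is an illustrative simplification, not a full Penn Treebank tokenizer; the regexes and clitic list are assumptions chosen to match rules 1–3.

```python
import re

# Rule 3: clitics to break off as separate tokens (list is illustrative)
CLITICS = re.compile(r"(?<=\w)('s|n't|'d|'ll|'re|'ve|'m)$", re.IGNORECASE)

def tokenize(text):
    tokens = []
    # Rule 1: divide at spaces and hyphens
    for chunk in re.split(r"[ \-]+", text.strip()):
        # Rule 2: divide before trailing punctuation
        m = re.match(r"^(.*?)([.,!?;:]*)$", chunk)
        core, punct = m.group(1), m.group(2)
        # Rule 3: break off clitics such as 's and n't
        cm = CLITICS.search(core)
        if cm:
            tokens.append(core[:cm.start()])
            tokens.append(cm.group(1))
        elif core:
            tokens.append(core)
        tokens.extend(punct)  # each punctuation mark is its own token
    return tokens
```

For example, `tokenize("She isn't here.")` yields `["She", "is", "n't", "here", "."]`.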

Finite State Automata

  • DFSA

    • Rules are unambiguous

    • Less flexible

  • NDFSA

    • Rules are ambiguous

    • Flexible but difficult to implement
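A DFSA can be sketched as a transition table where every (state, symbol) pair has at most one successor, so the rules are unambiguous. The "sheep language" baa+! below is a standard toy example; the state numbering is an assumption.

```python
# DFSA recognizing baa+! (one b, two or more a's, then !)
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of extra a's
    (3, "!"): 4,
}
ACCEPT = {4}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:      # no rule for this (state, symbol) pair
            return False
    return state in ACCEPT
```

In an NDFSA the table would map some (state, symbol) pairs to *sets* of successor states, which is why recognition requires search or backtracking.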

Hearst Pattern

  • Hypernym(X,Y) means that X is a type of Y
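One classic Hearst pattern is "Y such as X1, X2, and X3", which yields hypernym(Xi, Y) for each Xi. A regex sketch (the pattern and its coverage are simplified assumptions; real systems use a family of such patterns):

```python
import re

# Matches "Y such as X1, X2, and X3" and extracts (hyponym, hypernym) pairs
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:(?:, and |, | and )\w+)*)")

def hearst_pairs(text):
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1)
        for hyponym in re.split(r", and |, | and ", m.group(2)):
            pairs.append((hyponym, hypernym))   # hypernym(X, Y): X is a type of Y
    return pairs
```

For example, "fruits such as apples, oranges, and pears" yields hypernym(apples, fruits), hypernym(oranges, fruits), hypernym(pears, fruits).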

HMM

  • A weighted finite-state automaton (WFSA)

    • each transition arc is associated with a probability

    • the probabilities on all arcs leaving a state sum to 1

  • A Markov chain predicts the probability of the next state based only on the current (or a fixed number of previous) states

  • A Hidden Markov Model (HMM) differs in that the states (the POS tags) are hidden; only the words are observed

  • An HMM consists of:

    • Q = set of states, q0 start state ... qf final state

    • A = transition probability matrix (n × n)

    • O = sequence of T observations (words) drawn from a vocabulary V

    • B = sequence of observation likelihoods, a.k.a. emission probabilities: b_i(o_t), the probability of observation o_t being generated from state i
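The components above can be sketched for a toy two-tag model. All state names and probability values here are made-up illustrations, not estimates from data.

```python
# Q = {NN, VB} plus an implicit start state <s>
states = ["NN", "VB"]

# A: transition probabilities P(tag_i | tag_{i-1})  (illustrative numbers)
A = {
    ("<s>", "NN"): 0.7, ("<s>", "VB"): 0.3,
    ("NN", "NN"): 0.3,  ("NN", "VB"): 0.7,
    ("VB", "NN"): 0.6,  ("VB", "VB"): 0.4,
}

# B: emission probabilities b_i(o_t) = P(word | tag)  (illustrative numbers)
B = {
    ("NN", "dogs"): 0.4,  ("NN", "bark"): 0.1,
    ("VB", "dogs"): 0.05, ("VB", "bark"): 0.5,
}

def joint_prob(tags, words):
    """P(tags, words) = product over i of A[t_{i-1}, t_i] * B[t_i, w_i]."""
    p = 1.0
    prev = "<s>"
    for t, w in zip(tags, words):
        p *= A[(prev, t)] * B[(t, w)]
        prev = t
    return p
```

Tagging then amounts to finding the tag sequence that maximizes this joint probability, which the Viterbi algorithm does efficiently.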