HMM and Part of Speech Tagging
POS
Verbs: VB, VBP, VBZ, VBD, VBG, VBN
base, present-non-3rd, present-3rd, past, -ing, -en
Nouns: NNP, NNPS, NN, NNS
proper/common, singular/plural (singular includes mass + generic)
Adjectives: JJ, JJR, JJS (base, comparative, superlative)
Adverbs: RB, RBR, RBS, RP (base, comparative, superlative, particle)
Pronouns: PRP, PP$ (personal, possessive)
Interrogatives: WP, WP$, WDT, WRB (compare to: PRP, PP$, DT, RB)
Other Closed Class: CC, CD, DT, PDT, IN, MD
Punctuation: # $ . , : ( ) “ ” '' ' `
Weird Cases: FW (déjà vu), SYM (@), LS (1, 2, a, b), TO (to), POS ('s, '), UH (no, OK, well), EX (existential there)
Newer tags: HYPH, PU
Tokenization Rules
Divide at spaces and hyphens
Divide before punctuation that is followed by a space or the end of the line
Break off the following as separate tokens when followed by a space or end of line:
's, n't, 'd, 'll, 're, etc
Sentence splitting potentially after . ? !
[S]This is this.[S][S]That is that.[S]
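The rules above can be sketched as a rough tokenizer. This is an illustrative sketch, not the actual Penn Treebank tokenizer; the regexes and the clitic list are simplifications:

```python
import re

def tokenize(text):
    # Break off clitics such as 's, n't, 'll, 're, 'd as separate tokens
    # (simplified: assumes a space or end of line follows the clitic).
    text = re.sub(r"(?<=\w)(n't|'s|'ll|'re|'d|'ve|'m)\b", r" \1", text)
    # Divide before punctuation that is followed by a space or end of line.
    text = re.sub(r"([.,!?;:])(?=\s|$)", r" \1", text)
    # Divide at spaces and hyphens.
    return [t for t in re.split(r"[\s\-]+", text) if t]

print(tokenize("The well-known dog didn't bark, she said."))
# → ['The', 'well', 'known', 'dog', 'did', "n't", 'bark', ',', 'she', 'said', '.']
```

Note that "didn't" comes out as "did" + "n't" and the hyphenated "well-known" is split, matching the rules above.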
Finite State Automata
DFSA
Rules are unambiguous
Less flexible
NDFSA
Rules can be ambiguous (multiple possible transitions for the same state and input)
Flexible but difficult to implement
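A DFSA is easy to implement as a transition table, since each (state, symbol) pair has at most one next state. A toy example (the language "one or more a's followed by one or more b's" and the state names are made up for illustration):

```python
def dfa_accepts(s):
    # Deterministic FSA: each (state, symbol) pair maps to exactly one
    # next state, so the rules are unambiguous.
    transitions = {("q0", "a"): "q1", ("q1", "a"): "q1",
                   ("q1", "b"): "q2", ("q2", "b"): "q2"}
    state = "q0"
    for ch in s:
        if (state, ch) not in transitions:
            return False  # no arc defined: reject
        state = transitions[(state, ch)]
    return state == "q2"  # q2 is the accepting (final) state

print(dfa_accepts("aab"))  # → True
print(dfa_accepts("aba"))  # → False (no arc from q2 on 'a')
```

An NDFSA would instead map a (state, symbol) pair to a *set* of next states, requiring backtracking or parallel simulation, which is why it is harder to implement directly.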
Hearst Pattern
Hypernym(X,Y) means that X is a type of Y
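Hearst patterns are lexical templates like "Y such as X" used to extract hypernym pairs from text. A minimal sketch covering just that one pattern (there are several others, e.g. "X and other Y"):

```python
import re

def hearst_hypernyms(sentence):
    # Extract Hypernym(X, Y) pairs from the pattern "Y such as X".
    pairs = []
    m = re.search(r"(\w+)\s+such as\s+(\w+)", sentence)
    if m:
        hypernym, hyponym = m.group(1), m.group(2)
        pairs.append((hyponym, hypernym))  # Hypernym(X, Y): X is a type of Y
    return pairs

print(hearst_hypernyms("He likes fruits such as apples and pears."))
# → [('apples', 'fruits')]
```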
HMM
A weighted finite-state automaton (WFSA)
each transition arc is associated with a probability
the probabilities on all arcs leaving a state sum to 1
A Markov chain predicts the probability of the next state based only on the current state (or a fixed number of previous states)
A Hidden Markov Model (HMM) differs in that the states (the POS tags) are hidden; we only observe the words they emit
An HMM consists of:
Q = set of states, q0 start state ... qF final state
A = n×n transition probability matrix
O = sequence of T observations (words) drawn from a vocabulary V
B = observation likelihoods, aka emission probabilities: the probability of each word being generated from a given state
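The components above can be written out concretely. The tags, words, and probability values below are made up for illustration; the joint probability of a tag sequence and word sequence is the product of transition probabilities (A) and emission probabilities (B):

```python
# Toy HMM: Q = {N, V}, with a <s> start state.
A = {"<s>": {"N": 0.8, "V": 0.2},       # A: transition probabilities
     "N":   {"N": 0.3, "V": 0.7},
     "V":   {"N": 0.6, "V": 0.4}}
B = {"N": {"fish": 0.6, "sleep": 0.4},  # B: emission probabilities
     "V": {"fish": 0.3, "sleep": 0.7}}

def joint_prob(tags, words):
    # P(tags, words) = product over t of A[prev_tag][tag_t] * B[tag_t][word_t]
    p, prev = 1.0, "<s>"
    for tag, word in zip(tags, words):
        p *= A[prev][tag] * B[tag][word]
        prev = tag
    return p

# P(N V, "fish sleep") = 0.8 * 0.6 * 0.7 * 0.7 = 0.2352
print(joint_prob(["N", "V"], ["fish", "sleep"]))
```

A POS tagger then searches for the tag sequence maximizing this quantity (typically with the Viterbi algorithm rather than enumerating all sequences).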