2.1 Understanding word embeddings
DL models and LLMs cannot digest raw text directly or apply mathematical operations to it, because text is categorical; therefore we need embeddings
Embedding: converting discrete data into a continuous vector format
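A minimal sketch of the idea (toy example, not a trained model): an embedding is just a lookup table that maps each word to a fixed-length vector of floats. The vocabulary, dimension, and random values here are arbitrary choices for illustration.

```python
import random

vocab = ["cat", "dog", "sat"]
dim = 4  # embedding dimension (chosen arbitrarily here)

# Each word maps to a vector of `dim` floats; a real model learns these values
random.seed(0)
embeddings = {word: [random.uniform(-1, 1) for _ in range(dim)] for word in vocab}

vec = embeddings["cat"]
print(len(vec))  # 4
```

In a real model the vectors are learned during training so that similar words end up with similar vectors.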
RAG: combines generation (like producing text) with retrieval (like searching an external knowledge
base) to pull relevant information when generating text
Word2Vec: a library that generates embeddings by training a neural network to predict the context of a word given the target word (skip-gram), or the target word given its context (CBOW).
The advantage of not using a pre-trained model is that we have the ability to fine-tune the embeddings to our specific task
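The training data for the skip-gram setup can be sketched as (target, context) pairs drawn from a sliding window; a minimal pure-Python version (the window size and sentence are arbitrary examples):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as in Word2Vec's skip-gram setup."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1))  # 10 (target, context) pairs
```

Word2Vec then trains a small neural network on these pairs, and the learned hidden-layer weights become the word embeddings.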
2.2 Tokenizing Text
There are a couple of different ways of splitting text; one is to use the regular-expression library, splitting on whitespace and punctuation such as commas, and then removing the whitespace tokens
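A small sketch of that regex-based split, keeping punctuation as separate tokens and dropping whitespace (the pattern and sample text are illustrative choices):

```python
import re

text = "Hello, world. This is a test."
# Split on commas, periods, and whitespace; the capture group keeps the delimiters
tokens = re.split(r'([,.]|\s)', text)
# Drop whitespace-only and empty strings
tokens = [t for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']
```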
2.3 Converting tokens into token IDs
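Converting tokens to IDs means building a vocabulary that assigns each unique token an integer, then mapping the token sequence through it. A minimal sketch (the token list is a made-up example):

```python
tokens = ['Hello', ',', 'world', '.', 'Hello', 'again', '.']

# Build a vocabulary: each unique token gets an integer ID (sorted for determinism)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encode: replace each token with its ID
token_ids = [vocab[t] for t in tokens]
print(vocab)      # {',': 0, '.': 1, 'Hello': 2, 'again': 3, 'world': 4}
print(token_ids)  # [2, 0, 4, 1, 2, 3, 1]
```

Decoding is the inverse lookup: an ID-to-token dict built by reversing `vocab`.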
2.4 Adding special context tokens
We add special tokens to a vocabulary to deal with certain contexts. For instance, we
add an <|unk|> token to represent new and unknown words that were not part of the training
data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token
that we can use to separate two unrelated text sources.
Handling end-of-text and unknown tokens:
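A minimal sketch of both special tokens in action, assuming a toy two-word vocabulary; unseen words fall back to `<|unk|>`, and `<|endoftext|>` separates two unrelated texts:

```python
known = {'Hello': 0, 'world': 1}

# Extend the vocabulary with the two special tokens
vocab = dict(known)
vocab['<|endoftext|>'] = len(vocab)  # 2
vocab['<|unk|>'] = len(vocab)        # 3

def encode(tokens, vocab):
    # Map each token to its ID, falling back to <|unk|> for unseen words
    return [vocab.get(t, vocab['<|unk|>']) for t in tokens]

text1 = ['Hello', 'world']
text2 = ['Goodbye', 'world']
# Join two unrelated texts with <|endoftext|> as a separator
combined = text1 + ['<|endoftext|>'] + text2
print(encode(combined, vocab))  # [0, 1, 2, 3, 1] -- 'Goodbye' maps to <|unk|>
```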
2.5 Byte pair encoding
The BPE tokenizer encodes and decodes unknown words by breaking them down into subwords or individual characters
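The core of BPE training can be sketched as: start from individual characters and repeatedly merge the most frequent adjacent pair into a new subword. A single merge step in pure Python (the words are made-up examples; real tokenizers like GPT's run many such merges over a large corpus):

```python
from collections import Counter

def bpe_merge_step(sequences):
    """One BPE training step: find the most frequent adjacent pair and merge it."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return sequences, None
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one subword
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

# Start from individual characters; repeated merges build up larger subwords
words = [list("lower"), list("lowest")]
words, pair = bpe_merge_step(words)
print(pair, words)
```

Because any word can always fall back to its individual characters, BPE never needs an `<|unk|>` token.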