2. Working with Text Data

2.1 Understanding word embeddings

Deep learning models and LLMs cannot digest raw text directly: text is categorical, so mathematical operations cannot be applied to it. We therefore need embeddings.

Embedding: converting discrete data into a continuous vector format
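At its simplest, an embedding layer is a lookup table mapping each token ID to a vector. A minimal sketch with made-up numbers and a hypothetical three-word vocabulary (in practice the vectors are learned during training):

```python
# Toy embedding table: each token ID indexes a row (a vector) in the table.
# The numbers here are illustrative; real embeddings are learned parameters.
embedding_table = [
    [0.1, 0.3, -0.2],   # ID 0 -> vector for "the"
    [0.7, -0.1, 0.4],   # ID 1 -> vector for "cat"
    [-0.5, 0.2, 0.9],   # ID 2 -> vector for "sat"
]

def embed(token_ids):
    """Convert a sequence of token IDs into a sequence of vectors."""
    return [embedding_table[i] for i in token_ids]

print(embed([1, 2]))  # vectors for "cat sat"
```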

RAG (retrieval-augmented generation): combines retrieval (searching an external knowledge base) with generation (producing text), pulling in relevant information at generation time

Word2Vec: a library that generates embeddings by training a neural network to predict the context of a word given the target word (or the target given the context).

  • The advantage of not using a pre-trained model is that we can fine-tune the embeddings to our specific task
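Word2Vec's skip-gram variant trains on (target, context) pairs drawn from a sliding window over the text. A hedged pure-Python sketch of how such training pairs could be generated (the sentence and window size are illustrative, not from any particular library):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = tokens within `window` positions of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```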

2.2 Tokenizing Text

There are a couple of different ways to split text; one is to use the regular expression library

Then remove the whitespace tokens, keeping punctuation such as commas as separate tokens
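For example, using Python's `re` module to split on commas, periods, and whitespace, then dropping empty and whitespace-only entries (the exact pattern is one plausible choice):

```python
import re

text = "Hello, world. This is a test."
# Capturing group keeps the delimiters (commas, periods) as their own tokens.
tokens = re.split(r'([,.]|\s)', text)
# Drop empty strings and pure-whitespace tokens.
tokens = [t for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']
```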

2.3 Converting tokens into token IDs
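One way to sketch this step: build a vocabulary from the sorted unique tokens, then map each token to its integer index (the sample tokens below are illustrative):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Vocabulary: each unique token gets an integer ID.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Encode: tokens -> IDs; decode: IDs -> tokens.
ids = [vocab[t] for t in tokens]
inverse = {i: t for t, i in vocab.items()}
decoded = [inverse[i] for i in ids]

print(vocab)    # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)      # [4, 0, 3, 2, 4, 1]
print(decoded)  # round-trips back to the original tokens
```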

2.4 Adding special context tokens

We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.

Handling end-of-text and unknown tokens:
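A minimal sketch of how a tokenizer could fall back to `<|unk|>` for out-of-vocabulary words and join two unrelated texts with `<|endoftext|>` (the vocabulary and texts are illustrative):

```python
vocab = {"hello": 0, "world": 1, "<|unk|>": 2, "<|endoftext|>": 3}

def encode(tokens, vocab):
    # Any token not in the vocabulary maps to the <|unk|> ID.
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

text1 = ["hello", "world"]
text2 = ["goodbye", "world"]          # "goodbye" is not in the vocabulary
# Separate the two unrelated texts with <|endoftext|>.
joined = text1 + ["<|endoftext|>"] + text2
print(encode(joined, vocab))          # [0, 1, 3, 2, 1]
```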

2.5 Byte pair encoding

The BPE tokenizer encodes and decodes unknown words by breaking them down into subwords or individual characters
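A simplified sketch of that idea: greedily split an unknown word into the longest subwords present in the vocabulary, falling back to single characters. (This greedy longest-match splitting illustrates the subword fallback, not the actual merge-based BPE algorithm; the subword vocabulary is made up.)

```python
def split_into_subwords(word, subword_vocab):
    """Greedily split a word into the longest known subwords,
    falling back to single characters when nothing matches.
    A simplification of BPE's subword fallback, not real BPE."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible subword starting at position i.
        for end in range(len(word), i, -1):
            if word[i:end] in subword_vocab or end == i + 1:
                pieces.append(word[i:end])
                i = end
                break
    return pieces

subword_vocab = {"un", "know", "able"}
print(split_into_subwords("unknowable", subword_vocab))
# ['un', 'know', 'able']
```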
