2. Working with Text Data

2.1 Understanding word embeddings

Deep learning models and LLMs cannot digest raw text directly: text is categorical, so mathematical operations cannot be applied to it. We therefore need embeddings.

Embedding: converting discrete data into a continuous vector format
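At its simplest, an embedding layer is a lookup table mapping each token ID to a vector. A minimal sketch with made-up numbers and a hypothetical three-word vocabulary (in practice the vectors are learned during training):

```python
# Toy embedding table: each token ID indexes a row (a vector) in the table.
# The numbers here are illustrative; real embeddings are learned parameters.
embedding_table = [
    [0.1, 0.3, -0.2],   # ID 0 -> vector for "the"
    [0.7, -0.1, 0.4],   # ID 1 -> vector for "cat"
    [-0.5, 0.2, 0.9],   # ID 2 -> vector for "sat"
]

def embed(token_ids):
    """Convert a sequence of token IDs into a sequence of vectors."""
    return [embedding_table[i] for i in token_ids]

print(embed([1, 2]))  # vectors for "cat sat"
```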

RAG (retrieval-augmented generation): combines retrieval (searching an external knowledge base) with generation (producing text), pulling in relevant information at generation time

Word2Vec: a library that generates embeddings by training a neural network to predict the context of a word given the target word (or the target given the context).

  • The advantage of not using a pre-trained model is that we can fine-tune the embeddings to our specific task
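Word2Vec's skip-gram variant trains on (target, context) pairs drawn from a sliding window over the text. A hedged pure-Python sketch of how such training pairs could be generated (the sentence and window size are illustrative, not from any particular library):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = tokens within `window` positions of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```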

2.2 Tokenizing Text

There are a couple of different ways to split text; one is to use the regular expression library

Then remove the whitespace tokens, keeping punctuation such as commas as separate tokens
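For example, using Python's `re` module to split on commas, periods, and whitespace, then dropping empty and whitespace-only entries (the exact pattern is one plausible choice):

```python
import re

text = "Hello, world. This is a test."
# Capturing group keeps the delimiters (commas, periods) as their own tokens.
tokens = re.split(r'([,.]|\s)', text)
# Drop empty strings and pure-whitespace tokens.
tokens = [t for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']
```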

2.3 Converting tokens into token IDs
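One way to sketch this step: build a vocabulary from the sorted unique tokens, then map each token to its integer index (the sample tokens below are illustrative):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Vocabulary: each unique token gets an integer ID.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Encode: tokens -> IDs; decode: IDs -> tokens.
ids = [vocab[t] for t in tokens]
inverse = {i: t for t, i in vocab.items()}
decoded = [inverse[i] for i in ids]

print(vocab)    # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)      # [4, 0, 3, 2, 4, 1]
print(decoded)  # round-trips back to the original tokens
```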

2.4 Adding special context tokens

We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.

Handling end-of-text and unknown tokens:
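A minimal sketch of how a tokenizer could fall back to `<|unk|>` for out-of-vocabulary words and join two unrelated texts with `<|endoftext|>` (the vocabulary and texts are illustrative):

```python
vocab = {"hello": 0, "world": 1, "<|unk|>": 2, "<|endoftext|>": 3}

def encode(tokens, vocab):
    # Any token not in the vocabulary maps to the <|unk|> ID.
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

text1 = ["hello", "world"]
text2 = ["goodbye", "world"]          # "goodbye" is not in the vocabulary
# Separate the two unrelated texts with <|endoftext|>.
joined = text1 + ["<|endoftext|>"] + text2
print(encode(joined, vocab))          # [0, 1, 3, 2, 1]
```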

2.5 Byte pair encoding

The BPE tokenizer encodes and decodes unknown words by breaking them down into subwords or individual characters
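A simplified sketch of that idea: greedily split an unknown word into the longest subwords present in the vocabulary, falling back to single characters. (This greedy longest-match splitting illustrates the subword fallback, not the actual merge-based BPE algorithm; the subword vocabulary is made up.)

```python
def split_into_subwords(word, subword_vocab):
    """Greedily split a word into the longest known subwords,
    falling back to single characters when nothing matches.
    A simplification of BPE's subword fallback, not real BPE."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible subword starting at position i.
        for end in range(len(word), i, -1):
            if word[i:end] in subword_vocab or end == i + 1:
                pieces.append(word[i:end])
                i = end
                break
    return pieces

subword_vocab = {"un", "know", "able"}
print(split_into_subwords("unknowable", subword_vocab))
# ['un', 'know', 'able']
```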
