Data Classification Algorithms

K-Nearest Neighbors

Assumptions:

  • Data is in some metric space and there is a notion of distance between points

  • Each training example has a label associated with it. KNN works on both binary and multiclass classification

  • There is one hyperparameter, k, which decides how many neighbors influence the classification; it is usually chosen to be odd to avoid ties

KNN is a lazy learner: during training it simply stores the data, and all distance computation is deferred to prediction time, so every prediction has to scan the stored training set

How to choose k

  1. Take the square root of the number of training samples

  2. Trial and error
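
The lazy-learner behavior above can be sketched in a few lines (a minimal sketch using NumPy; the toy data, k = 3, and Euclidean distance are illustrative assumptions, not from the notes):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every stored training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: class 0 clusters near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # 1
```

Note that "training" here is just storing `X_train` and `y_train`; all the work happens inside `knn_predict`.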

Linear Regression

Line of best fit

Minimize a loss function, typically the sum of squared residuals (the vertical distances between the observed points and the fitted line)
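
The sum-of-squared-residuals fit has a closed form via least squares. A sketch with made-up toy points lying near y = 2x + 1:

```python
import numpy as np

# Toy data lying near the line y = 2x + 1 (illustrative values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Solve for (intercept, slope) minimizing the sum of squared residuals
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(b0, 2), round(b1, 2))  # 1.04 1.99
```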

Logistic Regression

Estimates P(Y|X) where Y is a binary variable, X is features

Predict the class with the larger posterior, i.e. max(P(Y=1|X), P(Y=0|X))

Model P(Y|X) as a function of a linear combination of the X variables, i.e.:

P(Y = 1|X) = f(w0X0 + w1X1 + w2X2 + … + wnXn) = f(wᵀX)

Sigmoid Function

The logistic function maps any real-valued number into the (0, 1) interval, making it suitable for modeling probabilities.

Decision Boundary

Take the ratio P(Y=0|X) / P(Y=1|X): if the ratio is less than 1, assign class 1; otherwise assign class 0. Equivalently, predict class 1 whenever P(Y=1|X) > 0.5
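
The ratio rule can be sketched as follows (the weight vector `w` here is a hypothetical learned model, chosen only for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(w, x):
    """Assign class 1 when P(Y=0|X) / P(Y=1|X) < 1, i.e. P(Y=1|X) > 0.5."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # P(Y=1|X)
    ratio = (1 - p1) / p1                               # P(Y=0|X) / P(Y=1|X)
    return 1 if ratio < 1 else 0

w = [-1.0, 2.0]                  # hypothetical weights (bias term, coefficient)
print(classify(w, [1.0, 2.0]))   # w.x = 3.0  -> ratio < 1 -> class 1
print(classify(w, [1.0, 0.1]))   # w.x = -0.8 -> ratio > 1 -> class 0
```

Note the ratio test reduces to checking the sign of wᵀX: the decision boundary is the hyperplane wᵀX = 0.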

Gradient Descent

Stochastic Gradient Descent

Updates parameters for each training example individually

Computes the gradient using a single random data point per update.

Pros:

  • Faster iterations and can start improving parameters immediately.

  • Can handle large datasets efficiently.

  • Introduces randomness that may help escape local minima.

Cons:

  • Updates have high variance, leading to a noisy path toward convergence.

  • May not converge to the exact minimum; instead, it fluctuates around it.

  • Requires careful tuning of the learning rate.
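
A minimal SGD sketch fitting a line to noiseless toy data (the data, learning rate, seed, and iteration count are all illustrative assumptions):

```python
import random

# Toy data drawn exactly from y = 2x + 1, to recover w and b by SGD
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w, b, lr = 0.0, 0.0, 0.05
random.seed(0)
for _ in range(2000):
    x, y = random.choice(data)   # one random example per update
    err = (w * x + b) - y        # residual for that single point
    w -= lr * err * x            # gradient of squared error w.r.t. w
    b -= lr * err                # gradient of squared error w.r.t. b

print(round(w, 2), round(b, 2))  # close to 2 and 1
```

Each update touches a single point, so iterations are cheap, but the path of (w, b) is noisy for the reasons listed above.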

Batch Gradient Descent

Computes the gradient of the cost function with respect to the parameters for the entire training dataset

Pros:

  • Converges smoothly towards the minimum.

  • The update direction is accurate since it uses all data.

Cons:

  • Computationally expensive for large datasets.

  • Requires loading the entire dataset into memory.

  • Slow updates due to processing the whole dataset.
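
For contrast, a batch version of the same fit, where every update averages the gradient over the entire dataset (same illustrative toy data and hyperparameters as assumptions):

```python
# Toy data drawn exactly from y = 2x + 1
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    # Full-dataset gradient of the mean squared error
    gw = sum(((w * x + b) - y) * x for x, y in data) / len(data)
    gb = sum(((w * x + b) - y) for x, y in data) / len(data)
    w -= lr * gw
    b -= lr * gb

print(round(w, 2), round(b, 2))  # close to 2 and 1
```

Every step sees all seven points, so the trajectory is smooth, but each step costs a full pass over the data.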

Naive Bayes

Naive Bayes = Classification using probability + Bayes' Theorem + Conditional Independence (the "naive" assumption)

Independence

Two events are independent if

  • P(A and B) = P(A) × P(B)

  • P(A|B) = P(A)

Conditional Independence

An event A is conditionally independent of another event B given an event C, if the probability distribution governing A is independent of the value of B, given the value of C

  • P(A and B|C) = P(A|C) × P(B|C)

  • P(A|B,C) = P(A|C)

e.g.

A = Alex goes out

B = Brett goes out

C = Sky is clear

If we don't know C, then A tells us something about B, since Alex going out hints that the sky is clear. But once we know C, A carries no additional information about B: A and B are conditionally independent given C
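
This can be checked numerically on a toy joint distribution built so that A and B are drawn independently given C (all probabilities below are made-up illustrative numbers):

```python
from itertools import product

# P(C=1) = 0.5; given C, A and B are each drawn independently
p_c = {1: 0.5, 0: 0.5}
p_a_given_c = {1: 0.9, 0: 0.2}   # P(A=1 | C)
p_b_given_c = {1: 0.8, 0: 0.3}   # P(B=1 | C)

# Build the full joint P(A, B, C)
joint = {}
for a, b, c in product([0, 1], repeat=3):
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    joint[(a, b, c)] = p_c[c] * pa * pb

def p(pred):
    """Probability of the event described by predicate pred(a, b, c)."""
    return sum(v for k, v in joint.items() if pred(*k))

# Conditional independence: P(A,B|C) == P(A|C) * P(B|C)
lhs = p(lambda a, b, c: a and b and c) / p(lambda a, b, c: c)
rhs = (p(lambda a, b, c: a and c) / p(lambda a, b, c: c)) * \
      (p(lambda a, b, c: b and c) / p(lambda a, b, c: c))
print(abs(lhs - rhs) < 1e-12)  # True

# But unconditionally, A and B are NOT independent here
pab = p(lambda a, b, c: a and b)
pa = p(lambda a, b, c: a)
pb = p(lambda a, b, c: b)
print(abs(pab - pa * pb) > 0.01)  # True
```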

Classification with Probability

Given a feature vector X = {X1, X2, ..., Xn} describing each Titanic passenger,

predict class Y (Y = {Survived, Died})

max(P(Y=survived|X), P(Y=died|X)) for each X

In general, calculate P(Y|X) for all classes, and select the class with the highest probability

The per-feature conditional distribution P(Xi|Y) can be modeled as:

  • Bernoulli

  • Gaussian

  • Multinomial

  • Categorical
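
A minimal Gaussian Naive Bayes sketch: fit per-class feature means and variances, then score each class by log P(Y) + Σ log P(Xi|Y) and take the maximum (the toy data is an illustrative assumption):

```python
import numpy as np

# Toy 2-D data: class 0 clusters near (1, 2), class 1 near (5, 6)
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}            # P(Y)
means  = {c: X[y == c].mean(axis=0) for c in classes}     # per-class mu
vars_  = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # smoothed sigma^2

def log_gaussian(x, mu, var):
    """Log density of a Gaussian, evaluated per feature."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(x):
    # Naive assumption: features are conditionally independent given Y,
    # so the joint log-likelihood is a sum over features
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], vars_[c]).sum()
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1.0, 2.0])))  # 0
print(predict(np.array([5.0, 6.0])))  # 1
```

Swapping `log_gaussian` for a Bernoulli, multinomial, or categorical log-likelihood gives the other variants listed above.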
