Data Classification Algorithms
K-Nearest Neighbors
Assumptions:
Data is in some metric space and there is a notion of distance between points
Each training example has a label associated with it; KNN works for both binary and multiclass classification
There is one hyperparameter, k, which decides how many neighbors influence the classification; it is usually chosen to be odd to avoid ties
KNN is a lazy learner: during training it only stores the data, and all distance computation is deferred to prediction time, so every prediction scans the stored training set
How to find K
Take the square root of the number of training samples
Trial and error
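The steps above can be sketched in a few lines of NumPy (toy data and function name assumed for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```

Note that all work happens inside `knn_predict` at query time, which is exactly what "lazy learner" means.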
Linear Regression

Line of best fit
Minimize the loss function, typically the sum of squared residuals
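A minimal sketch of the line of best fit, using `np.linalg.lstsq` to minimize the sum of squared residuals on assumed toy data:

```python
import numpy as np

# Toy data generated from y = 1 + 2x (so the fit should recover these weights)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix [1, x] so the first weight is the intercept
A = np.column_stack([np.ones_like(x), x])
# Least-squares solution minimizes ||A w - y||^2 (the squared residuals)
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # -> [1. 2.]  (intercept 1, slope 2)
```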
Logistic Regression
Estimates P(Y|X) where Y is a binary variable, X is features
Predict the class with the larger posterior, i.e. compare P(Y=1|X) and P(Y=0|X)
Model P(Y|X) as a function of a linear combination of the X variables, i.e.:
P(Y = 1|X) = f(w0X0 + w1X1 + w2X2 + … + wnXn) = f(wTX)
Sigmoid Function
The logistic function maps any real-valued number into the (0, 1) interval, making it suitable for modeling probabilities.
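The logistic function itself is one line; a small sketch to confirm the (0, 1) mapping:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # -> 0.5 (the decision-boundary value)
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```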


Decision Boundary
Take the ratio P(Y=0|X) / P(Y=1|X); if the ratio is less than 1 (class 1 is more probable), assign class 1, otherwise class 0
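The ratio rule can be written directly (weights here are made up for illustration; note the ratio test is equivalent to checking sigmoid(w·x) > 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(w, x):
    """Assign class 1 when P(Y=0|X)/P(Y=1|X) < 1, otherwise class 0."""
    p1 = sigmoid(w @ x)          # P(Y=1|X) under the logistic model
    ratio = (1.0 - p1) / p1      # P(Y=0|X) / P(Y=1|X)
    return 1 if ratio < 1 else 0

w = np.array([0.5, -1.0])  # hypothetical learned weights
print(predict_class(w, np.array([4.0, 1.0])))  # w.x = 1.0 > 0  -> class 1
print(predict_class(w, np.array([1.0, 2.0])))  # w.x = -1.5 < 0 -> class 0
```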



Gradient Descent

Stochastic Gradient Descent
Updates parameters for each training example individually
Computes the gradient using a single random data point per update.
Pros:
Faster iterations and can start improving parameters immediately.
Can handle large datasets efficiently.
Introduces randomness that may help escape local minima.
Cons:
Updates have high variance, leading to a noisy path toward convergence.
May not converge to the exact minimum; instead, it fluctuates around it.
Requires careful tuning of the learning rate.
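A sketch of SGD on an assumed toy regression problem (one random example per parameter update, as described above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 2x + 1 with a little noise
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + rng.normal(0, 0.1, 200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):   # visit examples in random order
        err = (w * x[i] + b) - y[i]     # residual on ONE example
        w -= lr * err * x[i]            # gradient of 0.5 * err^2 w.r.t. w
        b -= lr * err                   # ... and w.r.t. b
print(w, b)  # close to 2 and 1, but noisy — it fluctuates around the minimum
```

The per-example updates are what make each iteration cheap; the noise in the final values illustrates the high-variance path mentioned in the cons.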
Batch Gradient Descent
Computes the gradient of the cost function with respect to the parameters for the entire training dataset
Pros:
Converges smoothly towards the minimum.
The update direction is accurate since it uses all data.
Cons:
Computationally expensive for large datasets.
Requires loading the entire dataset into memory.
Slow updates due to processing the whole dataset.
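For contrast, batch gradient descent on the same kind of toy problem, where every update averages the gradient over the entire dataset:

```python
import numpy as np

# Noise-free toy data from y = 2x + 1
x = np.linspace(-1, 1, 200)
y = 2 * x + 1

w, b, lr = 0.0, 0.0, 0.5
for step in range(200):
    err = (w * x + b) - y          # residuals over ALL examples at once
    w -= lr * (err * x).mean()     # average gradient -> smooth, accurate step
    b -= lr * err.mean()
print(w, b)  # converges smoothly to 2 and 1
```

Each step touches all 200 points, which is why this scales poorly to large datasets even though the descent path is smooth.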
Naive Bayes
Naive Bayes = Classification using probability + Bayes Theorem + Conditional Independence (Naive)
Independence
Two events are independent if
P(A and B) = P(A) × P(B)
P(A|B) = P(A)
Conditional Independence
An event A is conditionally independent of another event B given an event C, if the probability distribution governing A is independent of the value of B, given the value of C
P(A and B|C) = P(A|C) × P(B|C)
P(A|B,C) = P(A|C)
e.g.
A = Alex goes out
B = Brett goes out
C = The sky is clear
Without observing C, A and B are correlated: knowing Alex went out suggests the sky is clear, which makes Brett going out more likely. Once C is known, A tells us nothing extra about B — they are conditionally independent given C
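The Alex/Brett example can be checked numerically on an assumed toy joint distribution (probabilities below are made up for illustration):

```python
# P(C=1) = 0.5; given C=c, A and B are each Bernoulli(p[c]) and independent.
p = {1: 0.9, 0: 0.1}  # P(A=1|C=c) = P(B=1|C=c)

def joint(a, b, c):
    """P(A=a, B=b, C=c) under the conditional-independence assumption."""
    pc = 0.5
    pa = p[c] if a == 1 else 1 - p[c]
    pb = p[c] if b == 1 else 1 - p[c]
    return pc * pa * pb

# Given C: P(A=1, B=1 | C=1) equals P(A=1|C=1) * P(B=1|C=1)
p_ab_given_c = joint(1, 1, 1) / 0.5
print(p_ab_given_c, p[1] * p[1])   # both 0.81

# Marginally (C unknown): P(A=1, B=1) != P(A=1) * P(B=1)
p_ab = joint(1, 1, 1) + joint(1, 1, 0)
p_a = sum(joint(1, b, c) for b in (0, 1) for c in (0, 1))
print(p_ab, p_a * p_a)             # 0.41 vs 0.25 — not independent
```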
Classification with Probability
Given a list of features about the Titanic passengers (Feature Vector X = {X1, X2, ..., Xn})
predict class Y (Y = {Survived, Died})
max(P(Y=survived|X), P(Y=died|X)) for each X
In general, calculate P(Y|X) for all classes, and select the class with the highest probability


Conditional probability could be:
Bernoulli
Gaussian
Multinomial
Categorical
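Putting the pieces together, here is a sketch of Naive Bayes with Gaussian conditional probabilities (toy data and function names assumed; a small variance floor is added for numerical stability):

```python
import numpy as np

def fit_gnb(X, y):
    """Per class: store the prior P(Y=c) and per-feature mean/variance."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model

def predict_gnb(model, x):
    """Pick the class maximizing log P(Y=c) + sum_i log P(x_i | Y=c)."""
    def log_posterior(c):
        prior, mu, var = model[c]
        # Log Gaussian likelihood; features are assumed conditionally
        # independent given the class, so the log-likelihoods just add.
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + ll
    return max(model, key=log_posterior)

X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 5.1]])
y = np.array([0, 0, 1, 1])
model = fit_gnb(X, y)
print(predict_gnb(model, np.array([1.1, 2.0])))  # -> 0
```

Working in log space avoids underflow when many per-feature probabilities are multiplied together.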