Data Preprocessing Algorithms
Problems with Data
Outlier: Extreme values that deviate significantly from the rest of the data
Causes: data entry errors, measurement errors, natural variability
Solutions: Detect and remove outliers using statistical methods (z-scores, interquartile range). Use algorithms, such as tree-based models, that are less sensitive to outliers
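The two detection rules above can be sketched with numpy on a small made-up sample (the data values and cutoffs here are illustrative, not from the notes):

```python
# Sketch: flagging outliers with z-scores and the IQR rule (illustrative data).
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 100.0])  # 100.0 is the outlier

# z-score rule: |z| above a cutoff (2 here; 3 is also common) flags a point.
# Note the outlier itself inflates the std, so a strict cutoff of 3 can miss it.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(iqr_outliers)  # [100.]
```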
Imbalanced Data: In classification, imbalanced datasets (where one class is very rare) can cause models to be biased toward the majority class
Causes: Natural rarity of certain events, sampling error
Solutions: Resampling, or use techniques designed to handle imbalance (SMOTE, cost-sensitive learning)
Downsample: if the minority class is only 0.5% of the data, downsample the majority class so the classes are balanced; if accuracy on the balanced set is still comparable, the model is genuinely accurate rather than just predicting the majority class.
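A minimal sketch of random downsampling of the majority class (synthetic labels and placeholder features, just to show the index bookkeeping):

```python
# Sketch: randomly downsampling the majority class to balance a binary dataset.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 995 + [1] * 5)       # minority class is 0.5% of the data
X = np.arange(len(y)).reshape(-1, 1)    # placeholder features

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
# keep only as many majority rows as there are minority rows
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, keep_majority])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))  # [5 5]
```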
Multicollinearity/correlation between features: occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to distinguish their individual effects
Redundant features or variables
Data leakage: when information from outside the training dataset is used to create the model, leading to overly optimistic performance during training
Causes: mistakenly using future data, or including variables that would not be available at prediction time in the real world
Solutions: ensure that the training set only includes data available at the time of prediction. Conduct a thorough review of features
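One common instance of this rule is fitting preprocessing statistics on the training split only, never on the full dataset. A sketch with numpy on synthetic data (the split sizes and distribution here are arbitrary):

```python
# Sketch: avoiding leakage by computing scaling statistics on the training split only.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic features

# split first, THEN compute statistics from the training portion only
X_train, X_test = X[:80], X[80:]
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # test set reuses the training statistics
```

Computing `mu` and `sigma` from all of `X` would let information from the test rows leak into training.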
High dimensionality: having too many features can lead to overfitting, where models learn the noise in the data
Causes: datasets with a large number of variables relative to the number of observations
Solutions: PCA, feature selection
Time series data: autocorrelation or seasonality can impact a model's accuracy
Causes: time-related patterns need to be accounted for
Solutions: use time series models (ARIMA, SARIMA, LSTMs)
Add time-related features
Seasonality: recurring time-based events can be mistakenly treated as outliers, which can affect the model
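Adding time-related features, as suggested above, might look like this with pandas (the daily sales series is hypothetical):

```python
# Sketch: adding calendar and lag features to a daily time series with pandas.
import pandas as pd

ts = pd.DataFrame(
    {"sales": range(10)},  # hypothetical daily sales
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

ts["dayofweek"] = ts.index.dayofweek      # 0 = Monday
ts["month"] = ts.index.month
ts["is_weekend"] = ts["dayofweek"] >= 5   # captures weekly seasonality
ts["sales_lag1"] = ts["sales"].shift(1)   # lag feature to capture autocorrelation
```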
Basic Descriptive Statistics
Descriptive Stats:
Frequency distributions
Proportions
Chi-square test
Mean and variance
Graphical Rep:
Box plot
Covariance: When mean and variance don't work, use covariance. A positive covariance means two variables tend to move in the same direction. A negative covariance means two variables tend to move in opposite directions
Sample formula: cov(x, y) = (1 / (N − 1)) Σ (xᵢ − x̄)(yᵢ − ȳ)
On a covariance matrix, the diagonal entries are always variances: cov(x, x) = E(x²) − (E(x))²
Correlation: measures linear association; derived from covariance divided by the standard deviations of x and y
Covariance of a variable with itself is its variance
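The covariance and correlation definitions above, checked on a tiny example where y = 2x (so the correlation must be exactly 1):

```python
# Sketch: sample covariance and correlation for two variables with numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x                           # perfectly correlated with x

cov_matrix = np.cov(x, y)           # np.cov uses the N-1 (sample) denominator
# diagonal entries are the variances of x and y
var_x, var_y = cov_matrix[0, 0], cov_matrix[1, 1]

# correlation = cov(x, y) / (std(x) * std(y))
corr = cov_matrix[0, 1] / np.sqrt(var_x * var_y)
print(corr)  # 1.0
```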
Feature Engineering
Transforming raw data into meaningful features
Curse of dimensionality: various challenges due to high dimensional spaces
Data sparsity: in high-dimensional spaces, data points become sparse; the data may not cover the space sufficiently, making it difficult to detect reliable patterns
Distance measures lose their meaning: in high dimensions, differences in distance become less meaningful, affecting algorithms that rely on distance (e.g., kNN)
PCA
Trying to find a new axis that can maximize the spread or variance of the data
Find the vector that maximizes the variance of your data projected onto that vector
Compute the covariance matrix after normalizing (centering) the data
Compute eigenvalues and eigenvectors based on cov matrix by using SVD
The eigenvector with the largest eigenvalue captures the most variance
Sort eigenvectors by eigenvalues, select the top k eigenvectors, and the data is reduced to dimension k
Project data on the new feature space
We can use a scree plot to see how much variance is explained by each number of principal components
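The PCA steps above can be sketched end-to-end with numpy (the correlated 2-D dataset is synthetic, chosen so the first component dominates):

```python
# Sketch of the PCA steps: center, covariance matrix, eigendecomposition, project.
import numpy as np

rng = np.random.default_rng(0)
# correlated 2-D data: second feature is a noisy copy of the first
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]     # sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1
Z = Xc @ eigvecs[:, :k]               # project onto the top-k components

explained = eigvals / eigvals.sum()   # variance-explained ratios (scree plot values)
print(explained[0])                   # close to 1 for this data
```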
SVD
Decomposes any matrix X (doesn't have to be square) into three matrices: X = UΣVᵀ
Used to extract eigenvalues and eigenvectors for PCA
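A small numpy check of both claims: the three factors reconstruct X, and (for centered data) the squared singular values divided by N − 1 equal the covariance-matrix eigenvalues used in PCA. The 6×3 matrix is arbitrary synthetic data:

```python
# Sketch: SVD of a non-square matrix, and its link to PCA eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))          # 6 x 3, not square
Xc = X - X.mean(axis=0)              # center, as PCA requires

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# reconstruction check: U @ diag(S) @ Vt recovers the centered matrix
assert np.allclose(U @ np.diag(S) @ Vt, Xc)

eigvals_from_svd = S**2 / (len(Xc) - 1)
eigvals_direct = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(eigvals_from_svd, eigvals_direct))  # True
```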
Combining Datasets
Enriched analysis: can introduce additional features
Holistic understanding
Enhanced BI
Comparing Data
Similarity: Higher values indicate greater similarity.
Distance: Higher values indicate greater dissimilarity.
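The contrast between the two conventions, illustrated with cosine similarity (higher = more similar) and Euclidean distance (higher = more dissimilar) on made-up vectors:

```python
# Sketch: cosine similarity vs Euclidean distance with numpy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, different magnitude
c = np.array([-1.0, -2.0, -3.0]) # opposite direction

def cosine_sim(u, v):
    # similarity in [-1, 1]; 1 = same direction, -1 = opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_sim(a, b))        # 1.0  (maximally similar in direction)
print(cosine_sim(a, c))        # -1.0 (maximally dissimilar in direction)
print(np.linalg.norm(a - b))   # Euclidean distance: larger = more dissimilar
```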