Data Preprocessing Algorithms

Problems with Data

  • Outliers: Extreme values that deviate significantly from the rest of the data

    • Causes: data entry errors, measurement errors, natural variability

    • Solutions: Detect and remove outliers using statistical methods (z-scores, interquartile range). Use algorithms like tree-based models that are less sensitive to outliers
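The IQR rule mentioned above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the `iqr_outliers` helper and the multiplier `k=1.5` (the common rule of thumb) are choices for this example, not a fixed standard.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (x < lo) | (x > hi)

data = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])  # 95.0 is an obvious outlier
mask = iqr_outliers(data)
print(data[mask])   # the flagged outliers
print(data[~mask])  # data with the outliers removed
```

Z-score filtering works the same way, except the fences come from the mean and standard deviation instead of the quartiles.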

  • Imbalanced Data: In classification, imbalanced datasets (where one class is very rare) can cause models to be biased toward the majority class

    • Causes: Natural rarity of certain events, sampling error

    • Solutions: Resampling (e.g., SMOTE), or use techniques designed to handle imbalance (cost-sensitive learning)

      • Downsample: if the minority class is only 0.5% of the data, downsample the majority class until the classes are balanced; if accuracy holds up after rebalancing, the model is genuinely accurate
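A minimal downsampling sketch, assuming NumPy. The toy labels (990 negatives, 10 positives) are made up for illustration; the idea is just to sample the majority class down to the minority-class size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels: 990 negatives, 10 positives (~1% minority class).
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Downsample the majority class to the size of the minority class.
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([keep, minority_idx])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))  # balanced class counts: [10 10]
```

Evaluation should still happen on the original (imbalanced) distribution, so that accuracy reflects real-world conditions.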

  • Multicollinearity/correlation between features: occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to distinguish their individual effects

    • Redundant features or variables
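One simple way to spot redundant features is to scan the feature correlation matrix for highly correlated pairs. A minimal sketch assuming NumPy; the synthetic features and the 0.9 threshold are illustrative choices, not a fixed standard.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 * 2.0 + rng.normal(scale=0.01, size=n)  # nearly a linear copy of x1
x3 = rng.normal(size=n)                         # independent feature
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)  # 3x3 feature correlation matrix
# Flag feature pairs whose |r| exceeds the threshold.
i, j = np.triu_indices_from(corr, k=1)
for a, b in zip(i, j):
    if abs(corr[a, b]) > 0.9:
        print(f"features {a} and {b} are highly correlated: r={corr[a, b]:.3f}")
```

Dropping one feature from each flagged pair (or combining them, e.g. via PCA) removes the redundancy.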

  • Data leakage: when information from outside the training dataset is used to create the model, leading to overly optimistic performance during training

    • Causes: mistakenly using future data, or including variables that would not be available at prediction time in the real world

    • Solutions: ensure the training set only includes data available at the time of prediction. Conduct a thorough review of features
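For time-stamped data, the usual safeguard against look-ahead leakage is a chronological split rather than a random one. A minimal sketch; the toy records and the 80% cutoff are illustrative.

```python
# Split chronologically so the model never trains on records
# dated after its test records (avoids look-ahead leakage).
records = [
    {"date": "2023-01", "y": 1},
    {"date": "2023-02", "y": 0},
    {"date": "2023-03", "y": 1},
    {"date": "2023-04", "y": 0},
    {"date": "2023-05", "y": 1},
]

records.sort(key=lambda r: r["date"])  # ensure chronological order
cut = int(len(records) * 0.8)          # train on the earliest 80%
train, test = records[:cut], records[cut:]

# Every training record predates every test record.
assert max(r["date"] for r in train) <= min(r["date"] for r in test)
print(len(train), len(test))  # 4 1
```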

  • High dimensionality: having too many features can lead to overfitting, where models learn the noise in the data

    • Causes: datasets with a large number of variables relative to the number of observations

    • Solutions: PCA, feature selection

  • Time series data: autocorrelation or seasonality can hurt a model's accuracy

    • Causes: time-related patterns that need to be accounted for

    • Solutions: use time series models (ARIMA, SARIMA, LSTMs)

      • Add time-related features

    • Seasonality: a recurring seasonal event can be mistaken for an outlier, which can affect the model
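Adding time-related features usually means deriving calendar fields from a timestamp. A minimal sketch using the standard library; the `time_features` helper and its particular fields are illustrative choices.

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Derive simple calendar features a model can use to capture seasonality."""
    return {
        "month": ts.month,
        "day_of_week": ts.weekday(),  # 0 = Monday
        "is_weekend": ts.weekday() >= 5,
        "quarter": (ts.month - 1) // 3 + 1,
    }

print(time_features(datetime(2024, 12, 25)))
# {'month': 12, 'day_of_week': 2, 'is_weekend': False, 'quarter': 4}
```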

Basic Descriptive Statistics

  • Descriptive Stats:

    • Frequency distributions

    • Proportions

      • Chi-square test

    • Mean and variance

  • Graphical Rep:

    • Box plot

  • Covariance: When mean and variance alone can't describe how two variables relate, use covariance. A positive covariance means the two variables tend to move in the same direction. A negative one means they tend to move in opposite directions

    • Sample formula: $cov_{x,y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$

    • On a covariance matrix, the diagonal is always variance: $cov(x,x) = E(x^2) - (E(x))^2$

  • Correlation: measures linear association; derived from covariance divided by the standard deviations of x and y

  • The covariance of a variable with itself is its variance
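These covariance and correlation relationships can be checked numerically. A minimal sketch assuming NumPy; the two short arrays are arbitrary example data.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

C = np.cov(x, y)  # 2x2 sample covariance matrix (N-1 in the denominator)
# The diagonal holds variances: cov(x, x) == var(x).
print(np.isclose(C[0, 0], np.var(x, ddof=1)))  # True

# Correlation = cov(x, y) / (std(x) * std(y)).
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r, C[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))))  # True
```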

Feature Engineering

Transforming raw data into meaningful features

  • Curse of dimensionality: various challenges due to high dimensional spaces

  • Data sparsity: in high-dimensional spaces, data points become sparse; the data may not cover the space sufficiently, making it difficult to detect reliable patterns

  • Distance measures lose their meaning: in high dimensions, differences in distance become less meaningful, affecting algorithms that rely on distance (e.g., kNN)

PCA

  • Trying to find a new axis that maximizes the spread (variance) of the data

  1. Find the vector that maximizes the variance of your data projected onto that vector

    1. Compute the covariance matrix of the normalized data

    2. Compute eigenvalues and eigenvectors of the covariance matrix by using SVD

    3. The eigenvector with the largest eigenvalue captures the most variance

    4. Sort eigenvectors by eigenvalue and select the top k eigenvectors to reduce the data to dimension k

  2. Project data on the new feature space

We can use a scree plot to see how much variance is explained by each number of principal components
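The steps above can be sketched directly with NumPy. A minimal illustration on synthetic data; the column scalings (3, 1, 0.1) are made up so the components have clearly different variances. It uses `eigh` on the covariance matrix (SVD, covered next, is an equivalent route).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])  # synthetic data

# 1. Center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# 2-4. Eigendecompose, sort by eigenvalue, keep the top-k eigenvectors.
vals, vecs = np.linalg.eigh(C)  # eigh: C is symmetric
order = np.argsort(vals)[::-1]
k = 2
W = vecs[:, order[:k]]          # top-k principal directions

# Project the data onto the new feature space.
Z = Xc @ W
print(Z.shape)  # (100, 2)

# Scree-plot data: fraction of variance explained by each component.
explained = vals[order] / vals.sum()
print(np.round(explained, 3))
```

Plotting `explained` against the component index gives the scree plot used to pick k.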

SVD

  • Decomposes any matrix X (it doesn't have to be square) into three matrices: $X = U \Sigma V^T$

  • Used to extract eigenvalues and eigenvectors for PCA
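A minimal sketch of both points, assuming NumPy: the decomposition works on a non-square matrix, and on centered data the singular values/right singular vectors give the covariance eigenpairs used by PCA. The small 3x2 matrix is arbitrary example data.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # non-square, as noted above

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (3, 2) (2,) (2, 2)
# The three factors reconstruct X exactly.
print(np.allclose(U @ np.diag(s) @ Vt, X))  # True

# Link to PCA: on centered data, the covariance eigenvalues are s**2 / (N - 1)
# and the rows of Vt are the corresponding eigenvectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
C = Vt.T @ np.diag(s**2 / (len(X) - 1)) @ Vt
print(np.allclose(C, np.cov(Xc, rowvar=False)))  # True
```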

Combining Datasets

  • Enriched analysis: can introduce additional features

  • Holistic understanding

  • Enhanced BI

Comparing Data

Similarity: Higher values indicate greater similarity.

Distance: Higher values indicate greater dissimilarity.
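A minimal sketch of the contrast, assuming NumPy, using cosine similarity and Euclidean distance (the vectors are arbitrary example data, with `b` a scaled copy of `a`):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Cosine similarity: higher = more similar; 1.0 means identical direction
# (b is just a scaled copy of a).
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # 1.0

# Euclidean distance: higher = more dissimilar, even though the
# direction is the same.
dist = np.linalg.norm(a - b)
print(round(dist, 3))  # 3.742
```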
