Introduction to Data Analytics
CRISP-DM (Cross-Industry Standard Process for Data Mining)
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding
Objective: To understand the project objectives and requirements from a business perspective
Task:
Define business goal
Convert the business goal into a data goal
Data Understanding
Objective: To collect and explore the data to understand its structure and quality
Task:
Gather initial data
Describe the data
Explore the data
Verify data quality
Data Preparation
Objective: To prepare the data for modeling by cleaning, transforming, and organizing it
Task:
Select relevant data
Clean the data
Construct new features
Integrate data
Format and structure the data
Modeling
Objective: To build and evaluate predictive models based on the prepared data
Task:
Select appropriate modeling techniques
Train models using the prepared data
Evaluate model performance (for example, with precision and recall)
Tune model parameters
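As a rough illustration of the precision and recall metrics named in the tasks above, the sketch below computes both from a list of true labels and a list of predicted labels. The function name precision_recall and the sample labels are made up for this example; precision is taken as TP / (TP + FP) and recall as TP / (TP + FN).

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75 (3 true positives, 1 false positive, 1 false negative)
```

Precision answers "of the items predicted positive, how many were right", while recall answers "of the items that were truly positive, how many were found".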
Evaluation
Objective: To assess the model's performance and ensure it meets business objectives
Task:
Evaluate the model's result against business objectives
Validate the model's effectiveness and reliability
Review the process and results
Deployment
Objective: To implement the model in a real-world setting and monitor its performance
Task:
Deploy the model into production
Monitor and maintain the model's performance
Update and refine the model
What is predictive analytics?
Data mining
Statistical inference
Machine Learning
Business Sense
Machine Learning
Supervised Learning: The model is trained on a labeled dataset, meaning that for each input in the training set the corresponding output (or label) is known. The goal is to learn a mapping from inputs to outputs so that the model can accurately predict the output for new, unseen data. It always requires a labeled training dataset. Examples: predictive modeling, uplift modeling, recommender systems, sentiment analysis.
Unsupervised Learning: The model is trained on a dataset that does not contain labeled outputs. Instead, the model tries to find hidden patterns, structures, or relationships within the data without any explicit instructions on what to predict. Examples: association rule mining, clustering.
Forms of Predictive Analytics
Predictive Modeling
Regression: It estimates relationships between variables to predict a continuous numerical outcome.
Classification: Predicts discrete categories or classes, for tasks such as spam detection, cancer cell identification, or speech recognition. The output is typically a label or a class from a set of predefined options.
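The difference between the two can be sketched in a few lines. The example below is only an illustration under simple assumptions: regression is shown as an ordinary least-squares line fit on one variable, and classification as a one-nearest-neighbor rule on one feature. The helper names (fit_line, nearest_neighbor_label) and all data values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor_label(train, x):
    """Return the label of the closest training example (classification)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


# Regression: the output is a continuous number.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data lying on y = 2x + 1
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept
print(predicted_value)  # 11.0

# Classification: the output is one label from a predefined set.
# The feature is a made-up "count of suspicious words" per email.
train = [(1.0, "ham"), (2.0, "ham"), (8.0, "spam"), (9.0, "spam")]
predicted_label = nearest_neighbor_label(train, 7.5)
print(predicted_label)  # spam
```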
Clustering
This technique groups similar data points together based on their inherent characteristics without predefined labels.
K-means, hierarchical clustering, and density-based clustering are prominent algorithms.
Used for: Customer segmentation, market basket analysis, identifying anomalies
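A minimal sketch of K-means, assuming two-dimensional points and starting centroids supplied by the caller (real implementations usually pick the starting centroids randomly). The function name kmeans and the sample points are illustrative only.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Made-up points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters[0])  # [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
print(clusters[1])  # [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
```

No labels are passed in at any point; the groups come only from the distances between the points.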
Association Rule Mining
Identifies relationships between variables in large datasets
For example, market basket analysis predicts customer purchasing behavior by finding associations between products.
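Such associations are commonly scored with support and confidence, two standard measures that these notes do not name explicitly. The sketch below computes both for the rule "bread -> milk" over a handful of made-up transactions; the function names are illustrative.

```python
# Hypothetical shopping baskets, one set of products per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support)               # 0.5 (2 of 4 baskets contain both)
print(round(rule_confidence, 3))  # 0.667 (2 of the 3 bread baskets also contain milk)
```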
Recommender Systems
Recommender systems are a type of predictive modeling and data filtering technology that aims to suggest items or content to users based on their preferences, behavior, or similarities with other users.
These systems predict the relevance of items (such as products, movies, or articles) to a particular user, helping to personalize their experience by recommending things they are likely to be interested in.
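One way to recommend based on "similarities with other users" is user-based collaborative filtering. The sketch below is a simplified version under stated assumptions: similarity is cosine similarity over rating vectors, and an unseen item is scored by summing other users' ratings weighted by their similarity. The user names, items, and ratings are invented.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "carol": {"A": 1, "B": 2, "D": 5},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


recommendations = recommend("alice", ratings)
print(recommendations)  # ['C', 'D']
```

Item C ranks first because it comes from bob, whose ratings closely match alice's, while D comes from carol, whose tastes differ.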
Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the emotional tone or sentiment expressed in a piece of text.
It involves classifying text into categories such as positive, negative, or neutral, based on the underlying emotions or opinions conveyed by the words and phrases.
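The classification step can be sketched with a very small word-list (lexicon) approach. This is only one simple technique, and the word lists below are invented for the example; practical systems typically rely on much larger lexicons or trained models.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def classify_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon words."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("I love this great phone!"))           # positive
print(classify_sentiment("Terrible battery and poor screen."))  # negative
print(classify_sentiment("It arrived on Tuesday."))             # neutral
```

A word-counting rule like this ignores negation and context ("not good" would count as positive), which is one reason trained classifiers are generally preferred.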
Uplift Modeling
Uplift modeling, also known as incremental modeling, is a predictive modeling technique used to estimate the causal impact of a specific action or treatment on an individual's behavior.
Uplift models predict the difference in outcomes caused by an intervention (e.g., how much more likely a customer is to buy a product after receiving a targeted marketing campaign than without it).
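The "difference in outcomes" can be illustrated with a simplified estimate: the conversion rate of treated customers minus that of untreated (control) customers, computed per customer segment. This is a sketch rather than a full uplift model; it assumes treatment was assigned at random and that every segment has both treated and control customers. The records and segment names are invented.

```python
# Hypothetical records: (segment, received_campaign, converted).
records = [
    ("new", True, 1), ("new", True, 1), ("new", True, 0), ("new", True, 1),
    ("new", False, 0), ("new", False, 1), ("new", False, 0), ("new", False, 0),
    ("loyal", True, 1), ("loyal", True, 1),
    ("loyal", False, 1), ("loyal", False, 1),
]


def conversion_rate(rows):
    """Share of rows in which the customer converted."""
    return sum(converted for _, _, converted in rows) / len(rows)


def uplift_by_segment(records):
    """Treated conversion rate minus control conversion rate, per segment."""
    result = {}
    for segment in sorted({s for s, _, _ in records}):
        treated = [r for r in records if r[0] == segment and r[1]]
        control = [r for r in records if r[0] == segment and not r[1]]
        result[segment] = conversion_rate(treated) - conversion_rate(control)
    return result


uplift = uplift_by_segment(records)
print(uplift)  # {'loyal': 0.0, 'new': 0.5}
```

In this made-up data the campaign lifts conversion among new customers by 50 percentage points but changes nothing for loyal customers, who buy either way, so targeting the "new" segment is where the campaign pays off.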