Data Science Methodology - English Template
1) Problem Characterization
Stage Description:
The first step in any AI project is understanding the business requirements so the
problem to be solved can be clearly defined. A comprehensive understanding of the problem
ensures that relevant data is collected and goals are set accurately; this often requires insights from a domain expert.
1. Define the Business Objective: Identify the specific business goal the project
addresses and how an AI solution can add value.
2. Identify Solution Components: Determine which parts of the solution require AI and
which do not, so resources are allocated effectively.
3. Set Success Criteria: Define clear and measurable criteria to evaluate the model’s
success.
4. Gather Stakeholder Requirements: Consult with stakeholders (business leaders,
domain experts, and technical teams) to ensure a comprehensive understanding of the
problem.
5. Outline Ethical and Bias Considerations: Identify any ethical, transparency, or
bias-related requirements for the model to ensure responsible AI practices.
6. Define Technical Constraints: Understand any limitations related to infrastructure,
deployment, or technology that may impact the solution.
7. Consider Iterative Development: Decide if the project can be split into smaller, iterative
sprints to manage complexity and adjust based on feedback.
Questions to Ask:
When faced with a problem, the optimal analytical approach is selected based on the nature of
the problem and the desired outcomes. This could involve predictive analysis,
classification analysis, descriptive analysis, or other statistical methods. Each approach
offers insights and results to guide informed decision-making.
● Optimal analytical approach: The most effective method for solving a problem using
available data.
● Problem nature: Includes data size, data type (quantitative, qualitative), relationships
between variables, and the statistical distribution of data.
● Desired outcomes: The goals of the analysis, such as predicting a future value of a
variable, classifying data into categories, or describing data in a clear and concise
manner.
● Predictive analysis: Uses statistical models to forecast future events based on
historical data.
● Classification analysis: Aims to categorize data into predefined classes, such as
classifying customers into different groups based on their purchasing behavior.
● Descriptive analysis: Focuses on describing and summarizing data using descriptive
statistics like mean, standard deviation, and frequency distribution.
In essence, selecting the appropriate analytical approach is crucial for achieving accurate
and valuable results.
Additional Notes:
● The choice of analytical approach depends on the specific context and goals of the
analysis.
● Data quality is essential for accurate results.
● There are various statistical software tools available to perform these analyses.
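As a minimal illustration of descriptive analysis, the summary statistics mentioned above (mean, standard deviation, frequency distribution) can be computed with pandas. The data here is invented purely for illustration:

```python
import pandas as pd

# Hypothetical sales records, for illustration only
df = pd.DataFrame({"region": ["North", "South", "North", "East"],
                   "revenue": [120.0, 95.5, 130.2, 88.0]})

# Descriptive analysis: summarize the data with basic statistics
mean_rev = df["revenue"].mean()
std_rev = df["revenue"].std()
freq = df["region"].value_counts()   # frequency distribution of categories

print(f"mean={mean_rev:.2f}, std={std_rev:.2f}")
print(freq)
```

Predictive or classification analysis would start from the same kind of dataframe but fit a model to it instead of only summarizing it.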
2) Data Understanding
Stage Description:
After defining the problem, the next step is to understand the data needs and determine if the
available data is suitable for model development. This stage focuses on data identification, initial
collection, quality assessment, and identifying areas for further exploration.
1. Identify Data Sources: Determine where relevant data can be obtained, whether from
internal databases, third-party APIs, or external datasets.
2. Initial Data Collection: Gather a preliminary dataset to analyze its structure, distribution,
and relevance to the business objective.
3. Assess Data Quality: Evaluate the data's quality, including checking for missing values,
inconsistencies, duplicates, and outliers.
4. Identify Data Requirements: Define the ideal data size, attributes, and types necessary
for training the model effectively.
5. Divide Data Sets: Plan how to split data into training, test, and validation sets.
6. Consider Labeling Needs: For supervised learning, assess if the data needs labeling
and if a labeling strategy is required.
7. Real-time Data Needs: Determine if real-time data access is required, especially if the
AI model needs continuous updates or operates in real-time environments.
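Step 3 above (assessing data quality) can be sketched in a few lines of pandas. The customer records below are hypothetical, chosen to show a missing value, a duplicate row, and an outlier:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records with typical quality issues
df = pd.DataFrame({
    "age": [34, np.nan, 29, 34, 120],           # a missing value and an outlier
    "city": ["Riyadh", "Jeddah", None, "Riyadh", "Dammam"],
})

missing_pct = df.isna().mean() * 100            # % of missing values per column
n_duplicates = df.duplicated().sum()            # count of exact duplicate rows

print(missing_pct)
print("duplicate rows:", n_duplicates)
```

Running checks like these on the preliminary dataset gives concrete numbers for the data-quality questions that follow.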
Questions to Ask:
● What are the data sources needed for training the model?
● What is the adequate data size for the AI project?
● What is the current quality and quantity of data for training?
● How will the training and test sets be divided?
● If it’s a supervised learning task, can the data be labeled?
● Can pre-trained models be used?
● Where will operational and training data be stored?
● Is there a need for real-time data on edge devices or in hard-to-access areas?
Supervised Learning:
● Do you have a labeled dataset where the input data is paired with corresponding output
labels?
● Is the task you're trying to solve a classification or regression problem, where you want
to predict a discrete category or a continuous value?
Unsupervised Learning:
● Do you have data without labeled output, or are you looking for patterns, structures, or
groupings within the data?
● Are you interested in exploring the inherent structure of the data or reducing its
dimensionality?
Semi-Supervised Learning:
● Do you have a partially labeled dataset with some data points having labels and others
without?
● Do you want to leverage both labeled and unlabeled data to improve your model's
performance?
Reinforcement Learning:
● Are you dealing with tasks that involve learning from interactions with an environment
to maximize a cumulative reward?
3) Data Preparation
Stage Description:
Data collection and preparation are often the most time-consuming steps, taking up to 80% of
the project time. This stage includes data cleansing, labeling, handling missing values, format
unification, and feature engineering to enhance data quality.
1. Data Collection: Gather data from all identified sources and consolidate it into a single
dataset for ease of use.
2. Data Cleaning: Handle missing, incorrect, or inconsistent values, and remove duplicates
to improve data quality.
3. Data Transformation: Standardize and normalize data formats to ensure consistency
across various sources.
4. Feature Engineering: Create new features or modify existing ones to improve the
dataset’s relevance and predictive power.
5. Data Augmentation: For tasks like image processing, enhance the dataset by
generating new data points (e.g., flipping or rotating images).
6. Data Splitting: Split the data into training, validation, and test sets to prevent overfitting
and enable model evaluation.
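Steps 2 and 3 above (cleaning and transformation) can be sketched with pandas. The raw data and column names below are hypothetical, combining two of the issues this stage targets, inconsistent category values and missing numbers:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with inconsistent labels and a missing value
raw = pd.DataFrame({
    "gender": ["Male", "M", "male", "F"],
    "score": [75.0, np.nan, 82.0, 90.0],
})

# Data cleaning: unify inconsistent category values ("Male"/"M"/"male" -> "M")
clean = raw.copy()
clean["gender"] = clean["gender"].str.upper().str[0]

# Handle missing values with a simple median imputation
clean["score"] = clean["score"].fillna(clean["score"].median())

# Data transformation: min-max normalize the numeric column to [0, 1]
s = clean["score"]
clean["score_norm"] = (s - s.min()) / (s.max() - s.min())
```

Median imputation is just one option; the right strategy depends on why the values are missing, which is exactly what the questions below probe.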
Questions to Ask:
● Where will data be collected from, and what are the various sources?
● How will data formats be standardized across different sources?
● Are there incorrect or missing values that need adjusting?
● Is there a need for additional dimensions or external data sources to enhance the
dataset?
● Should some data be duplicated, as in the case of image-based datasets?
● What data cleaning processes are necessary for model quality?
● How should the data be split into training, test, and validation sets for accuracy?
Tools:
● Data Cleaning Tools: OpenRefine and Pandas are excellent for cleaning datasets.
● Data Transformation Tools: PySpark and Databricks are suited for handling
large-scale data processing.
● Feature Engineering Libraries: Scikit-Learn provides feature engineering methods to
enhance data.
● Data Labeling Platforms: Labelbox and AWS SageMaker Ground Truth are effective
for labeling unlabeled data.
Exploratory Data Analysis Questions:
● Data Overview:
○ What is the size of the dataset (number of rows and columns)?
○ What are the data types of each column (numeric, categorical, date, text, etc.)?
○ Do you have a data dictionary or documentation to understand the meaning of
each variable?
● Missing Data:
○ Are there missing values in the dataset?
○ What is the percentage of missing data for each column?
○ Are missing values random, or is there a pattern to their occurrence?
● Data Distribution:
○ What is the distribution of numeric variables (mean, median, standard deviation,
min, max)?
○ Are there any outliers in the data?
○ Are there any variables that appear to follow a specific distribution (e.g., normal,
exponential)?
● Categorical Variables:
○ How many unique categories are there in categorical variables?
○ Are there any rare or infrequent categories?
○ Are there any misspelled or inconsistent category values?
● Text Data:
○ How long are text fields, and what is the average length?
○ Are there any special characters or formatting issues in text data?
○ Are there common stopwords or irrelevant text that should be removed?
● Data Consistency:
○ Are there inconsistencies in naming conventions (e.g., "Male" vs. "M" vs.
"male")?
○ Are there units of measurement discrepancies?
○ Are date formats consistent, and are there any date outliers?
● Time Series Data:
○ Are there any trends, seasonality, or cyclical patterns in time series data?
● Domain Knowledge:
○ Are there domain-specific rules or expectations that the data should adhere to?
○ Are there any external factors that might affect the data quality or interpretation?
● Data Visualization:
○ Can you create visualizations like histograms, box plots, scatter plots, or
heatmaps to explore the data?
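Several items in the checklist above can be answered programmatically. A minimal EDA sketch in pandas, using a hypothetical stand-in dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset standing in for the project data
df = pd.DataFrame({
    "age": [25, 31, np.nan, 44, 25],
    "plan": ["basic", "pro", "basic", "Basic", "pro"],
})

# Data overview: size and column types
n_rows, n_cols = df.shape
dtypes = df.dtypes

# Missing data: percentage of missing values per column
missing_pct = df.isna().mean() * 100

# Distribution of numeric variables: mean, std, min, quartiles, max
desc = df["age"].describe()

# Categorical consistency: "basic" vs "Basic" is the kind of
# inconsistency the checklist asks about
n_raw = df["plan"].nunique()                # 3 distinct raw values
n_norm = df["plan"].str.lower().nunique()   # 2 after normalization
```

A gap between the raw and normalized category counts is a quick signal that naming conventions need cleaning.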
4) Modeling
Stage Description:
Once data is prepared, the next step is selecting features that will enable the model to perform
well. This stage includes applying appropriate ML techniques, tuning hyperparameters, and
evaluating model performance.
1. Feature Selection: Identify the most relevant features from the dataset to improve
model performance and reduce complexity.
2. Algorithm Selection: Choose the machine learning algorithm that best fits the problem,
such as regression, classification, clustering, or neural networks.
3. Model Training: Train the model using the selected algorithm and dataset.
4. Hyperparameter Tuning: Adjust hyperparameters to optimize model performance and
avoid overfitting or underfitting.
5. Model Evaluation: Continuously evaluate model performance during training using
metrics such as accuracy, precision, and recall.
6. Ensemble Techniques: If necessary, combine multiple models to improve performance,
especially when a single model doesn’t meet the accuracy requirements.
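Steps 1, 3, and 5 above can be sketched with Scikit-Learn. Synthetic data stands in for the prepared dataset, and the specific choices (SelectKBest, a random forest) are illustrative, not prescribed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared dataset
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# 1. Feature selection: keep the 4 statistically most relevant features
X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)

# 2-3. Algorithm selection and model training
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.25,
                                          random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# 5. Model evaluation on held-out data
acc = model.score(X_te, y_te)
print(f"held-out accuracy: {acc:.3f}")
```

Hyperparameter tuning (step 4) would wrap the estimator in a search such as GridSearchCV rather than fixing `n_estimators` by hand.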
Questions to Ask:
● What algorithm best suits the learning objectives and data requirements?
● How can hyperparameters be optimized for ideal performance?
● What features yield the best results?
● Should multiple models be combined (ensemble models) to enhance performance?
● How can model explainability be improved if necessary?
● What are the requirements for running and deploying the model?
Tools:
● Feature Selection Tools: Use libraries like Scikit-Learn and Feature-engine for
selecting the most relevant features.
● Machine Learning Frameworks: TensorFlow, Keras, and PyTorch are powerful for
training ML models.
● Hyperparameter Tuning Tools: Optuna and Hyperopt help with tuning model
hyperparameters.
● Ensemble Methods: Libraries like XGBoost and CatBoost are useful for building
ensemble models.
5) Model Evaluation
Stage Description:
Model evaluation is the quality assurance step where the model’s performance is assessed
against business requirements. The model must meet performance metrics and business
objectives.
1. Select Evaluation Metrics: Choose relevant metrics based on the problem type
(classification, regression, etc.), such as accuracy, precision, recall, F1 score, or Mean
Squared Error.
2. Calculate Confusion Matrix: For classification problems, the confusion matrix helps
analyze true positives, false positives, true negatives, and false negatives.
3. Cross-Validation: Use techniques like K-Fold Cross-Validation to assess model stability
across different subsets of data.
4. Baseline Comparison: Compare the model’s performance to a baseline model or
heuristic to ensure it provides a significant improvement.
5. Hyperparameter Fine-Tuning: Adjust hyperparameters if the model’s performance
needs further improvement based on evaluation results.
6. Check for Overfitting/Underfitting: Ensure the model generalizes well on unseen data
by analyzing training and validation errors.
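Steps 1-3 above map directly onto Scikit-Learn utilities. A minimal sketch on synthetic data (the estimator choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=400, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Confusion matrix: [[TN, FP], [FN, TP]] for a binary problem
cm = confusion_matrix(y_te, y_pred)
f1 = f1_score(y_te, y_pred)

# 5-fold cross-validation to check stability across data subsets
cv_scores = cross_val_score(model, X, y, cv=5)
print(cm, f"F1={f1:.3f}, CV scores={cv_scores.round(3)}")
```

A large spread in the cross-validation scores is an early warning that the model's performance depends on which subset it sees, which feeds into the overfitting check in step 6.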
Questions to Ask:
● Which performance metrics will be used to evaluate the model, such as accuracy, false
positive rate, or logarithmic loss?
● How can a confusion matrix be calculated for classification tasks?
● Are techniques like K-Fold Cross-Validation needed to ensure accuracy?
● Is further tuning required to improve performance?
● How does the model’s performance compare to a baseline or heuristic?
Tools:
6) Model Deployment
Stage Description:
After quality verification, the model is deployed into a production environment. This stage
assesses real-world performance, and the model can be deployed on the cloud, on-premises, or
edge devices.
1. Select Deployment Environment: Choose where the model will be deployed, such as
on the cloud, on-premises, or on edge devices, depending on business requirements
and technical constraints.
2. Model Packaging: Package the model using tools like containers to make deployment
and scaling easier.
3. Set Up Continuous Monitoring: Implement monitoring to track the model’s
performance, detect drift, and ensure it continues to meet accuracy and other KPIs over
time.
4. Benchmarking: Set benchmarks to compare the current model’s performance with
future versions, enabling continuous improvement.
5. Implement Model Versioning: Keep track of model versions for easier updates,
rollback, and maintenance.
6. Schedule Model Updates: Define a process for regularly updating the model,
incorporating new data, and retraining as necessary.
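Steps 2 and 5 above (packaging and versioning) can be sketched with joblib, Scikit-Learn's recommended serialization helper. Embedding the version in the artifact name, as here, is one simple convention, not the only one:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model to stand in for the production candidate
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Model packaging + versioning: embed the version in the artifact name
version = "1.0.0"
path = f"model_v{version}.joblib"
joblib.dump(model, path)

# Later, or on another machine: load the exact versioned artifact
restored = joblib.load(path)
match = (restored.predict(X) == model.predict(X)).all()
print(f"loaded {path}, predictions identical: {match}")
```

In a real pipeline the same idea is usually handled by a model registry (e.g., MLflow), which adds metadata, stage labels, and rollback on top of this basic save/load cycle.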
Questions to Ask:
Tools:
7) Monitoring and Adjustment
Stage Description:
After deployment, the process doesn’t end. Continuous monitoring and adjustment ensure the
model adapts to changes in environment and data, enhancing performance over time. This
phase emphasizes iterative improvement.
Questions to Ask:
Tools:
● Experiment Tracking Tools: MLflow and Weights & Biases are effective for tracking
experiments and iterating models.
● Version Control: DVC (Data Version Control) or Git for versioning data and model
changes.
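One concrete monitoring check is detecting drift in a feature's distribution between training time and production. A minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy (the data and the 0.01 threshold are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Feature values seen at training time vs. in production (hypothetical)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
live_feature = rng.normal(loc=0.8, scale=1.0, size=1000)   # shifted mean

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = stats.ks_2samp(train_feature, live_feature)
drift_detected = p_value < 0.01
print(f"KS statistic={stat:.3f}, drift detected: {drift_detected}")
```

When a check like this fires, the usual responses are the ones this stage describes: investigate the data source, retrain on recent data, or roll back to a previous model version.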
8) Feedback
Stage Description:
Feedback is essential for understanding model effectiveness in the production environment. By
analyzing performance results, improvements can be made to boost model accuracy and value,
helping align the model with evolving business needs.
1. Collect User Feedback: Gather qualitative and quantitative feedback from end-users
who interact with the model to understand user satisfaction and identify issues.
2. Analyze Model Impact: Assess whether the model achieves the intended business
outcomes and provides value to stakeholders.
3. Identify Improvement Areas: Use feedback to pinpoint specific areas where the model
can be improved, such as accuracy, response time, or usability.
4. Establish Feedback Loops: Create a process for continuous feedback collection, so
insights are regularly integrated into model improvements.
5. Monitor User Satisfaction and Utility: Track metrics related to user engagement and
satisfaction to evaluate the model’s practical benefits.
Questions to Ask:
Tools:
9) Expansion
Stage Description:
After ensuring model stability, consider expanding its application to new domains or systems.
This involves adapting the model to work with new datasets, different environments, or
additional uses aligned with initial objectives.
1. Identify New Use Cases: Determine additional areas or applications where the model
could be beneficial.
2. Adapt to New Data Sources: Modify the model to work effectively with new or varied
data sources, ensuring accuracy and relevance.
3. Customize for Different Environments: Optimize the model for different deployment
environments, such as mobile devices, edge computing, or other platforms.
4. Test Adaptations: Ensure that the model functions correctly with any new data sources
or in new environments through thorough testing.
5. Scale Model Operations: Implement scalability options to handle increased load or
usage across different environments.
6. Allocate Additional Resources: Plan for the additional computational or data resources
that may be required for expanded applications.
Questions to Ask:
● Are there new environments or domains that could benefit from the model?
● Does the model need modifications for new data or requirements?
● What adjustments are necessary for adapting to market or business changes?
● How will the model be tested during expanded application?
● What additional resources may the model need during expansion?
Tools:
● Model Adaptation Platforms: Use TensorFlow Hub and Hugging Face Transformers
to adapt models to new datasets or environments.
● Testing Tools: PyTest and Selenium can validate model performance in new
environments.
● Data Storage: Amazon S3 and Google Cloud Storage to handle additional data
storage needs for expansion.
10) Storytelling
Stage Description:
The storytelling phase is where the model’s results and analyses are presented in an engaging
way to help stakeholders understand the data and insights. Effective storytelling ensures that
scientific analysis reaches the target audience clearly and makes the desired impact.
1. Define the Key Message: Identify the central message or insight you want to
communicate, ensuring it aligns with business goals.
2. Contextualize Insights: Relate the model’s findings to real-world scenarios or business
objectives to make the story relevant to the audience.
3. Use Data Visualizations: Employ charts, graphs, and interactive visuals to illustrate
complex data insights in an easily understandable way.
4. Highlight Challenges and Solutions: Showcase any challenges faced during model
development and how they were overcome, adding depth to the narrative.
5. Encourage Audience Action: End with a clear call-to-action, guiding the audience on
what steps they can take based on the model’s insights.
6. Maintain Simplicity and Clarity: Ensure the story is easy to follow, avoiding overly
technical language unless it’s appropriate for the audience.
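Step 3 above (data visualizations) can be sketched with matplotlib. The accuracy figures below are invented to illustrate the kind of "improvement over iterations" message step 1 asks you to define:

```python
import matplotlib
matplotlib.use("Agg")                # headless backend, safe for scripts
import matplotlib.pyplot as plt

# Hypothetical result: model accuracy across project iterations
stages = ["Baseline", "Tuned", "Ensemble"]
accuracy = [0.71, 0.79, 0.84]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(stages, accuracy, color="steelblue")
ax.set_ylabel("Accuracy")
ax.set_ylim(0, 1)
# The title carries the key message, not just a variable name
ax.set_title("Model accuracy improved at each iteration")
fig.tight_layout()
fig.savefig("model_story.png")
```

Note the design choice: the chart title states the insight itself, which is the storytelling habit this stage is about, rather than a neutral label the audience has to interpret.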
Questions to Ask:
Tools:
● Data Visualization Tools: Tableau and Power BI for crafting interactive data stories.
● Presentation Tools: Google Slides or Prezi to build visually engaging presentations.
● Infographics & Visuals: Use tools like Canva and Adobe Illustrator to design
impactful visuals.
● Collaboration and Feedback Tools: Slack and Microsoft Teams facilitate sharing
stories and gathering feedback from team members and stakeholders.
Case Study: Smart Educational System
The project is a smart educational system that customizes learning content and methods for
each student based on their individual needs, preferences, and progress. The system leverages
artificial intelligence algorithms to analyze student data, providing lessons and activities that
match each student’s learning pace and style.
1) Problem Characterization
Stage Description:
The project aims to design a personalized educational system that helps students achieve
optimal results based on their unique needs and progress. This requires understanding the
educational requirements and whether there is a need to customize content and assessments.
Key Questions:
● What is the main educational goal? Improving individual performance and increasing
student engagement in learning.
● What are the success criteria? Improving academic performance and achieving 90%
satisfaction from students and parents.
● Are there requirements to reduce bias in content? Yes, it’s important to avoid biases in
presenting educational materials.
● What are the expected inputs? Test results, learning preferences, and students' previous
activities.
● Is this a classification, recommendation, or adaptation problem? It’s primarily a
recommendation and adaptation problem with behavioral progress analysis.
2) Data Understanding
Stage Description:
Focus on collecting data about students, such as previous grades, preferred learning styles,
optimal study times, and content preferences. Understanding this data allows the system to
provide a personalized learning experience for each student.
Key Questions:
● What are the data sources? Learning Management System (LMS), assessment data,
and surveys on learning preferences.
● What is the required data size? Data from over 1,000 students to ensure adequate
training of the model.
● How will data be split? 70% for training, 15% for testing, and 15% for validation.
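The 70/15/15 split described above can be done with two calls to Scikit-Learn's `train_test_split`; the array below is a hypothetical stand-in for the student records:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical student records (1,000 rows, matching the data-size target)
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# First split off 70% for training, then halve the remainder
# into 15% test and 15% validation
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

sizes = (len(X_train), len(X_test), len(X_val))
print(sizes)
```

Fixing `random_state` makes the split reproducible, which matters when the same partitions must be reused across experiments.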
3) Data Preparation
Stage Description:
Collect data from various sources and organize it into a unified format. This stage includes data
cleaning, handling missing data, and identifying key features such as learning style, difficulty
level, and progress rate.
Key Questions:
● What steps are needed for data cleaning? Removing duplicate data, correcting errors,
and standardizing formats.
● Is data augmentation necessary? Yes, adding features like preferred content type (video,
text, interactive) could enhance personalization.
● How will data be split? Dividing it into training, test, and validation sets to improve model
performance.
● Do we need to duplicate data? Duplicate records for advanced students may help
balance the dataset.
4) Modeling
Stage Description:
Selecting suitable algorithms to build a personalized recommendation model. Machine learning
algorithms are applied to optimize the model’s parameters, achieving high accuracy in delivering
customized learning recommendations.
Key Questions:
● What are the appropriate algorithms? Collaborative Filtering is suitable for providing
tailored educational recommendations.
● How will model parameters (hyperparameters) be tuned? Techniques like Grid Search
can be used to optimize the parameters.
● What are the essential features? Student level, content preferences, previous
performance.
● Is there a need for ensemble models? Combining a recommendation model with a
classification model may help assess student progress more accurately.
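The Collaborative Filtering idea mentioned above can be illustrated with a toy user-based variant using cosine similarity. The rating matrix, students, and lessons are all invented; a production system would use a library and far more data:

```python
import numpy as np

# Hypothetical student-by-lesson rating matrix (0 = not yet taken)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Recommend for student 0: weight the other students by similarity
target = 0
sims = np.array([cosine_sim(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0                        # ignore self-similarity

scores = sims @ ratings                   # similarity-weighted ratings
scores[ratings[target] > 0] = -np.inf     # exclude lessons already taken
recommended = int(np.argmax(scores))      # best unseen lesson for student 0
print("recommended lesson index:", recommended)
```

Because student 0's ratings closely match student 1's rather than student 2's, the recommendation is driven by what similar students rated highly among the lessons student 0 has not yet taken.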
5) Model Evaluation
Stage Description:
Evaluate the model using various criteria such as recommendation accuracy and alignment with
the student's level and progress.
Key Questions:
6) Model Deployment
Stage Description:
Deploy the model in an educational platform and assess its performance with students, allowing
them to receive personalized learning recommendations in real-time.
Key Questions:
7) Monitoring and Adjustment
Stage Description:
Continuously monitor the model's performance and adjust it based on changes in students'
needs and learning levels.
Key Questions:
● Is there a need to update features? Yes, new features like monthly progress rate can be
added.
● How will data drift be managed? By retraining the model with recent data.
● Have student learning needs changed? If there are changes, the recommendations are
adjusted accordingly.
8) Feedback
Stage Description:
Collect feedback from students and teachers to evaluate the system and refine
recommendations, ensuring continuous improvement.
Key Questions:
9) Expansion
Stage Description:
Consider expanding the system to include more schools or add new subjects based on
teachers' and students' needs.
Key Questions:
● Can the system be extended to other schools? It can be scaled to include additional
institutions.
● What adjustments are needed for new subjects? Recommendations can be tailored to fit
different grade levels.
Task Objective:
This task aims to introduce students to the Data Science methodology through theoretical steps that help them understand how to analyze
business data and prepare analytical reports without using programming. Students will apply the core steps to understanding Customer
Data with the goal of improving the services provided.
Task Description:
Imagine that you are a Data Team working at a fictional company that offers a variety of services to its customers. The company aims to improve its services
by understanding Customer Behavior and customer needs. You will analyze customer data, such as Age, Geographic
Location, and service preferences, to identify patterns that may help the company improve the Customer Experience. The task will be
divided into several stages, including Problem Characterization, Data Understanding, Data
Preparation, Hypothesis Evaluation, and preparing a Deployment and Monitoring plan.