Roadmap for Data Science & Roles

Piyush Shankar Garg


Professor of Practice
Computer & Management Sciences
LIET, Greater Noida
Contribution and Valuable Inputs by:
Rahul Sharma, Senior Director, SD Global Services
Samyak Jain, AI Researcher @ RocketFrog.ai | MSc DS @ ETHZ | BTech CS @ NITK Surathkal (Gold Medallist)
Aditya Shankar Garg (IIT Delhi), PhD Student, Columbia University, New York, USA
Mythology and AI

• Vishwkarma

• Pushpak

• Mantra-Tantra-Yantra
Data Analyst vs Data Engineer vs Data Scientist
[Figure: role overview diagram. Labels recoverable from the slide: transaction data (structured and unstructured), OLTP; application development, POS, ERP; ETL (Extract/Transform/Load); data warehouse, OLAP; Data Engineer; Data Analyst (SQL, Python, Excel, enterprise business reporting tool); Data Scientists (predictive modelling, ML and maths).]
Role Definition
Who?
• Data Scientist: develops and implements data-driven solutions to overcome business challenges.
• Data Analyst: collects, cleans, stores and organises data.
• Data Engineer: builds and maintains the infrastructure and tools needed to collect and store large amounts of data.
Focus
• Data Scientist: a forward-looking (predictive) view of data.
• Data Analyst: technical analysis of present data.
• Data Engineer: continuously improving data consumption techniques.
Role Details
• Data Scientist: applies supervised, unsupervised and deep learning to data for classification and regression. Data scientists make heavy use of neural networks and machine learning for continuous regression analysis.
• Data Analyst: cleans data, organizes raw data, and analyses and visualizes data to interpret the results.
• Data Engineer: builds data in an appropriate format and works at the back end. A data engineer uses optimized machine learning algorithms to maintain data and make it available in the most appropriate manner.
Role Definition
Skills
Data Scientist
• Programming: proficient knowledge (Python, R)
• Statistics and Calculus
• Machine Learning, Deep Learning (Scikit-learn, TensorFlow)
• Data Visualization (Matplotlib, Seaborn)
• Big Data: working knowledge (Spark, Hadoop)
• SQL/NoSQL, Cloud Platforms (AWS, Google Cloud)
• Strong communication and domain skills
Data Analyst
• Expert in Excel
• Programming: basic knowledge (Python, SQL)
• Data Manipulation (Pandas), Data Visualization & Enterprise BI tools (Tableau, Power BI)
• Statistical Analysis, Reporting Tools (Excel, Google Sheets), SAS, SPSS, Business Acumen
• Strong communication, presentation and domain knowledge
Data Engineer
• Deep knowledge of programming (Python, R, Java)
• ETL & Data Modelling Tools, Big Data Technologies (Spark, Hadoop)
• SQL/NoSQL, Data Storage (Redshift, BigQuery)
• Cloud Services (AWS, Azure)
• Data Pipeline Tools (Airflow), Hadoop, Pig, Hive
Role Definition
Responsibilities
• Data Scientist: takes a data science project from inception to completion. Think of the data scientist as the solutions architect of the software world. They build models, reuse existing models, fine-tune them, tune hyperparameters, and so on. They combine domain knowledge, an understanding of customer requirements and technical skills to achieve the goal.
• Data Analyst: a good statistician who visualizes data, creates charts, reports and dashboards, is expert in data visualization tools such as Tableau, Power BI and Excel, and implements requests coming from data scientists or the business.
• Data Engineer: a data architect who brings data from various sources and formats into the required format and data source, so that it can be consumed for storage, analysis, reporting and archiving.
Example
• Data Scientist: building a predictive model to forecast customer attrition / retention rate; developing a recommendation system for products.
• Data Analyst: analysing sales data to identify trends and customer segments; creating dashboards to track key metrics.
• Data Engineer: designing a data warehouse to store customer data from various sources; ETL (Extract, Transform, Load) processes for data cleansing and integration.
Indicative Time Scale
(Indicative time in weeks, with 15-25 hours of effort per week)
Common Skills
• Excel (Basic and Advanced): 3-5 weeks
• Python: 12-15 weeks
• SQL: 3-5 weeks
• Statistics: 8-10 weeks
Data Engineer
• Cloud Pipeline, Cloud Platform and Engineering: 20-22 weeks
• ETL and Data Warehouse: 10-14 weeks
Data Analyst
• Data Cleaning and Understanding of Business Domain: 8-10 weeks
• Data Analysis and Visualization Tools: 8-10 weeks
• Business Intelligence: 6-8 weeks
Data Scientist
• Machine Learning & AI Concepts: 5-7 weeks
• Supervised Learning: 6-8 weeks
• Unsupervised Learning: 6-8 weeks
• Deep Learning and Generative AI: 7-9 weeks
• Model Optimization and Evaluation: 5-7 weeks


Role-Skill Level Mapping
Skill                              Data Engineer   Data Analyst   Data Scientist
Excel Course                       Good            Excellent      Very Good
Python                             Excellent       Very Good      Excellent
SQL                                Excellent       Very Good      Very Good
Statistics & Linear Algebra        Good            Very Good      Excellent
Cloud Platforms & Data Pipeline    Excellent       Good           Good
ETL & Data Warehouse               Excellent       Good           Good
Data Cleaning                      Good            Very Good      Good
Data Analysis & Visualization      Good            Excellent      Good
Business Intelligence              Good            Excellent      Good
Machine Learning                   Good            Good           Excellent
Unsupervised Learning              Good            Good           Excellent
Supervised Learning                Good            Good           Excellent
Advanced Machine Learning          Good            Good           Excellent
Model Optimization & Evaluation    Good            Good           Excellent
Note: some of the Data Scientist topics may also help for AI.
Phase 1: Foundations (For everyone)
Excel Course
Essential Excel Skills
▪ Interface and Navigation: Familiarize yourself with Excel's layout, ribbons, and functions
▪ Data Entry and Formatting: Learn to input data, apply formatting (fonts, colors, alignment), and create basic formulas
▪ Cell References and Ranges: Understand how to reference cells, create ranges, and use absolute and relative references
▪ Basic Formulas: Master essential formulas like SUM, AVERAGE, COUNT, IF, and VLOOKUP
Intermediate Excel Skills
▪ Advanced Formulas: Explore more complex formulas like nested IFs, SUMIFS, COUNTIFS, and INDEX-MATCH
▪ Data Validation: Implement data validation rules to ensure data integrity
▪ Pivot Tables: Create pivot tables to summarize and analyze large datasets efficiently
▪ Conditional Formatting: Apply conditional formatting to highlight specific data values or trends
Advanced Excel Skills
▪ Macros: Learn to automate repetitive tasks using VBA (Visual Basic for Applications)
▪ Power Query: Explore Power Query (Get & Transform) for data cleaning, transformation, and integration
▪ Power Pivot: Create data models and perform advanced data analysis using Power Pivot
▪ Data Analysis Tools: Utilize tools like the Data Analysis ToolPak for statistical analysis (t-tests, ANOVA, regression)
Practical Exercises and Projects
• Real-World Datasets: Practice Excel skills with real-world datasets from various sources, e.g. Kaggle (https://www.kaggle.com/datasets) and the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
• Data Cleaning and Preparation: Clean and prepare data for analysis, handling missing values, outliers, and inconsistencies.
• Data Analysis and Visualization: Create insightful visualizations (charts, graphs) to communicate findings effectively.
Learning Resource Details
• Excel Data Analytics: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-analytics-using-excel and https://www.codecademy.com/learn/analyze-data-with-microsoft-excel
• Business Analytics: https://www.coursera.org/learn/business-analytics-excel
• Full Project on Excel: https://www.youtube.com/watch?v=opJgMj1IUrc
• Excel Dashboards: https://www.youtube.com/watch?v=m13o5aqeCbM
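For readers moving between Excel and Python, the following is a minimal sketch of what a pivot table with SUM and a SUMIFS formula compute. The sales rows and the helper names `pivot_sum` and `sumifs` are hypothetical and exist only for illustration; they are not Excel or library functions.

```python
from collections import defaultdict

# Hypothetical sales rows: (region, product, amount)
rows = [
    ("North", "Pen", 120.0),
    ("North", "Book", 300.0),
    ("South", "Pen", 80.0),
    ("South", "Pen", 40.0),
    ("North", "Pen", 60.0),
]


def pivot_sum(records):
    """Sum amounts per (region, product), like a pivot table using SUM."""
    table = defaultdict(float)
    for region, product, amount in records:
        table[(region, product)] += amount
    return dict(table)


def sumifs(records, region=None, product=None):
    """Rough analogue of SUMIFS: add amounts matching every given criterion."""
    return sum(
        amount
        for r, p, amount in records
        if (region is None or r == region) and (product is None or p == product)
    )


pivot = pivot_sum(rows)
north_pens = sumifs(rows, region="North", product="Pen")
print(pivot)
print(north_pens)
```

Leaving a criterion as `None` means "do not filter on this column", which mirrors omitting a criteria pair in SUMIFS.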
Phase 1: Foundations (For everyone)
Learn Python
Fundamentals (1 week)
• Introduction: Syntax and control flow
• Functions: Understand how to define and call functions, passing arguments and returning values
• Modules and Packages: Explore how to import and use modules and packages from the Python Standard Library
Data Structures and Algorithms (part of your syllabus)
• Advanced Data Structures: Delve into linked lists, stacks, queues, trees, and graphs
• Algorithms: Study sorting algorithms (bubble, insertion, selection, merge, quick), searching algorithms (linear, binary), and graph algorithms (BFS, DFS)
• Problem-Solving: Practice solving coding problems on platforms like LeetCode or HackerRank
Learn Python Libraries (3-5 weeks)
• NumPy: Learn to perform numerical operations, create arrays, and manipulate matrices efficiently.
• Pandas: Master data manipulation, cleaning, and analysis using DataFrames and Series
• Matplotlib: Explore data visualization techniques, creating various types of plots and charts
• Seaborn: Enhance data visualizations with statistical plots and themes
Machine Learning (3-5 weeks)
• Scikit-learn: Understand machine learning concepts like supervised learning (regression, classification), unsupervised learning (clustering), and model evaluation
• TensorFlow or PyTorch: Choose a deep learning framework and learn about neural networks, backpropagation, and building models
• Natural Language Processing (NLP): Explore techniques for working with text data, including tokenization, stemming, and sentiment analysis
Other Topics (2 weeks)
• Web Scraping: Learn to extract data from websites using libraries like Beautiful Soup or Scrapy
• Web Development: Explore frameworks like Flask or Django for building web applications. Understand deployment as APIs or microservices, and how to analyze and show data in a UI using TensorFlow.js.
• Data Engineering Basics: Understand data pipelines, ETL processes, and cloud platforms like AWS, GCP, or Azure.
Practical Projects and Case Studies
• Kaggle Competitions: Participate in Kaggle competitions to apply your skills and learn from others.
• Personal Projects: Work on personal projects that interest you to reinforce your learning.
Learning Resource Details
• Python for Data Science: https://www.codecademy.com/learn/getting-started-with-python-for-data-science
• Python Basics: https://www.codecademy.com/learn/python-for-programmers
• Data Analysis with Python: https://www.freecodecamp.org/learn/data-analysis-with-python/
• Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein
• Algorithms Illuminated by Tim Roughgarden (www.algorithmsilluminated.org)
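Two of the algorithms named in the Data Structures and Algorithms topics, binary search and breadth-first search (BFS), can be sketched with the standard library alone. The function names and the sample graph below are illustrative assumptions, not part of the course material.

```python
from collections import deque


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1


def bfs_order(graph, start):
    """Return nodes in breadth-first order; graph is an adjacency dict."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical directed graph used only for this example
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(binary_search([1, 3, 5, 7, 9], 7))
print(bfs_order(graph, "A"))
```

Binary search needs its input to be sorted already; BFS uses a queue so that nodes closer to the start are visited first.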
Phase 1: Foundations (For everyone)
Learn SQL
Fundamentals (1 week)
• Database Basics: Understand database concepts like tables, columns, rows, and primary/foreign keys.
• Data Types: Learn about common data types (numeric, character, date/time, logical).
• DDL (Data Definition Language): Create, modify, and delete tables and databases.
• DML (Data Manipulation Language): Insert, update, and delete data from tables. ACID properties.
Querying Data (1-2 weeks)
• SELECT Statement: Master the SELECT statement to retrieve data from tables.
• Filtering Data: Use WHERE and HAVING clauses to filter data based on conditions.
• Grouping Data: Apply GROUP BY to aggregate data and use functions like COUNT, SUM, AVG, MIN, and MAX.
• Joining Tables: Learn to combine data from multiple tables using INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Advanced SQL (2-3 weeks)
• Subqueries: Understand how to use subqueries within SELECT, WHERE, and HAVING clauses.
• Common Table Expressions (CTEs): Explore CTEs for temporary result sets within a query
• Window Functions (Very Important): Learn to calculate values across rows within a result set using RANK, DENSE_RANK, ROW_NUMBER, LEAD, and LAG
• Stored Procedures: Create stored procedures for reusable code blocks
• Performance Tuning and Query Optimization
Practical Exercises
• Database Design: Practice designing database schemas based on real-world scenarios.
• Query Writing: Write complex SQL queries to extract specific data from databases.
• Data Analysis: Use SQL to analyze datasets and answer data-driven questions.
Learning Resource Details
• SQL for Data Science: https://www.edx.org/learn/data-science/ibm-sql-for-data-science
• SQL with Google BigQuery: https://www.edx.org/learn/data-science/ibm-sql-for-data-science
• Intro to SQL: https://www.codecademy.com/learn/intro-to-sql
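The CTE and window functions listed under Advanced SQL can be tried without installing a database server. The sketch below runs the SQL through Python's built-in `sqlite3` module; it assumes the bundled SQLite is version 3.25 or later (the first release with window functions), and the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Asha", "North", 500),
        ("Ben", "North", 300),
        ("Chen", "North", 500),
        ("Dev", "South", 200),
        ("Esha", "South", 400),
    ],
)

query = """
WITH regional AS (          -- CTE: a named, temporary result set
    SELECT rep, region, amount FROM sales
)
SELECT rep, region, amount,
       RANK()       OVER by_amount     AS rnk,
       DENSE_RANK() OVER by_amount     AS dense_rnk,
       ROW_NUMBER() OVER by_amount_rep AS row_num,
       LAG(amount)  OVER by_amount_rep AS prev_amount
FROM regional
WINDOW by_amount     AS (PARTITION BY region ORDER BY amount DESC),
       by_amount_rep AS (PARTITION BY region ORDER BY amount DESC, rep)
ORDER BY region, row_num
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In the North region two reps tie on 500, so RANK assigns 1, 1, 3 while DENSE_RANK assigns 1, 1, 2; ROW_NUMBER breaks the tie by name because its window also orders by `rep`.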
Phase 1: Foundations (For everyone)
Learn Statistics
Descriptive Statistics (2 weeks)
• Central Tendency: Understand measures like mean, median, and mode to summarize data.
• Variability: Learn about variance, standard deviation, range, and interquartile range to measure data dispersion.
• Distribution: Explore common probability distributions (normal, binomial, Poisson).
• Data Visualization: Create visualizations (histograms, box plots) to understand data distribution.
Inferential Statistics (3 weeks)
• Sampling: Learn about sampling methods (simple random, stratified, cluster) and their implications.
• Cramér-Rao bound (CRB)
• Hypothesis Testing: Understand hypothesis testing concepts (null hypothesis, alternative hypothesis, p-value, confidence intervals).
• T-tests and z-tests for comparing means
Regression Analysis (2 weeks)
• Simple Linear Regression: Model relationships between two variables.
• Multiple Linear Regression: Model relationships between a dependent variable and multiple independent variables.
• Model Evaluation: Assess model performance using metrics like R-squared, adjusted R-squared, and RMSE.
Probability Theory (2 weeks)
• Probability Concepts: Understand probability axioms, conditional probability, Bayes' theorem.
• Discrete and Continuous Distributions: Explore common probability distributions (Bernoulli, binomial, Poisson, normal).
• Joint Probability Distributions: Understand joint probability distributions and independence
• Expectation
Bayesian Statistics (2 weeks)
• Bayesian Inference: Learn Bayesian inference concepts (prior, likelihood, posterior).
• Bayesian Models: Explore Bayesian models for parameter estimation and inference.
Practical Exercises and Projects (Kaggle)
• Data Analysis: Apply statistical techniques to analyze real-world datasets.
• Hypothesis Testing: Conduct hypothesis tests to answer research questions.
• Regression Modeling: Build regression models to predict outcomes.
• Case Studies: Work on case studies to apply statistical concepts to solve problems from Kaggle or the UCI Machine Learning Repository.
• Statistical Software: Learn to use statistical software like R, or Python with libraries like NumPy, Pandas, SciPy, and Statsmodels.
Learning Resource Details
• Statistics for Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/statistics-for-data-science
• Introduction to Statistics: https://www.coursera.org/learn/stanford-statistics
• Statistics for Data Science: https://intellipaat.com/academy/course/statistics-for-data-science-free-course/
• Mathematical Statistics and Data Analysis (by John A. Rice) (Korivernon)
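As a small worked example of the descriptive statistics and simple linear regression topics, the sketch below uses only Python's `statistics` and `math` modules. The data (hours studied against test score) is invented for illustration, and `fit_line` and `r_squared_and_rmse` are hypothetical helper names; in practice the libraries named above (SciPy, Statsmodels) would normally be used.

```python
import math
import statistics

# Hypothetical paired observations: hours studied (x) and test score (y)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 55.0, 61.0, 64.0, 68.0]

# Descriptive statistics
mean_y = statistics.mean(y)
median_y = statistics.median(y)
stdev_y = statistics.stdev(y)  # sample standard deviation (divides by n - 1)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared_and_rmse(xs, ys, intercept, slope):
    """R-squared = 1 - SS_res / SS_tot; RMSE = sqrt(SS_res / n)."""
    my = statistics.mean(ys)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    ss_tot = sum((b - my) ** 2 for b in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))


intercept, slope = fit_line(x, y)
r2, rmse = r_squared_and_rmse(x, y, intercept, slope)
print(mean_y, median_y, round(stdev_y, 3))
print(round(intercept, 3), round(slope, 3), round(r2, 4), round(rmse, 4))
```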
Phase 1: Foundations (For everyone)
Understand Linear Algebra
Fundamentals (3 weeks)
• Vector spaces
• Rank of matrices
• Eigenvalues/eigenvectors
• Singular value decomposition (SVD)
• Matrix factorization
• Projection
• Inner product spaces
• Application of linear algebra concepts for the data scientist role (high-level understanding only):
  • Loss function and recommender system for ML
  • Word embedding for NLP
  • Image convolution for computer vision
Learning Resource Details
• Gilbert Strang's Linear Algebra Course (MIT OCW)
• Linear Algebra Done Right - Sheldon Axler
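Two of the fundamentals above, projection and eigenvalues/eigenvectors, can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, and power iteration is just one simple way to estimate the dominant eigenpair (NumPy's `numpy.linalg.eig` would be the usual tool).

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project(u, v):
    """Orthogonal projection of u onto the line spanned by v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * b for b in v]


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def power_iteration(m, steps=100):
    """Estimate the dominant eigenvalue and unit eigenvector of a square matrix.

    Assumes the start vector [1, 0, ...] is not orthogonal to the dominant
    eigenvector, which holds for the example matrix below.
    """
    v = [1.0] + [0.0] * (len(m) - 1)
    for _ in range(steps):
        w = mat_vec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [a / norm for a in w]
    return dot(v, mat_vec(m, v)), v  # Rayleigh quotient, unit vector


A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric matrix with eigenvalues 3 and 1
eigenvalue, eigenvector = power_iteration(A)
print(project([3.0, 4.0], [1.0, 0.0]))
print(round(eigenvalue, 6), [round(c, 6) for c in eigenvector])
```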
Phase 2: Data Engineer Roadmap
Learn Cloud Platforms and Data Pipeline (20-24 weeks)
Choose a cloud provider (AWS, GCP, or Azure) and gain hands-on experience with its services.
Fundamentals (2 weeks)
• Cloud Computing Concepts: Understand fundamental concepts like IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).
• Cloud Providers: Choose a cloud provider (AWS, GCP, or Azure) and familiarize yourself with its services and pricing models.
• Cloud Console: Learn to navigate the cloud provider's console and manage resources.
Data Storage Concepts (5-6 weeks)
• Object Storage: Explore services like S3 (AWS), Blob Storage (Azure), and Cloud Storage (GCP) for storing large amounts of unstructured data.
• Data Lakes: Understand the concept of data lakes and how to implement them using cloud-based services.
• Data Warehouses: Learn about cloud-based data warehouses like Redshift (AWS), BigQuery (GCP), and Synapse Analytics (Azure).
Data Processing (4-5 weeks)
• Serverless Computing: Explore services like Lambda (AWS), Cloud Functions (GCP), and Azure Functions for running code without managing servers.
• ETL Tools: Learn to use cloud-based ETL tools like AWS Glue, Dataflow (GCP), and Azure Data Factory.
• Data Pipelines: Design and implement data pipelines using orchestration tools like Airflow.
Data Analysis and ML part-1 (3-4 weeks)
• Managed Services: Utilize managed services like EMR (AWS), Dataproc (GCP), and HDInsight (Azure) for running big data analytics frameworks like Hadoop and Spark.
• Machine Learning Platforms: Explore platforms like SageMaker (AWS), AI Platform (GCP), and Azure Machine Learning for building and deploying machine learning models.
• Data Visualization: Use cloud-based visualization tools like QuickSight (AWS), Looker Studio (GCP), and Power BI (Azure).
Data Analysis and ML part-2 (3-5 weeks)
• Learn Data Pipelines: Explore tools like Apache Airflow or AWS Glue to build and manage data pipelines.
• Learn ETL Processes: Understand the process of extracting, transforming, and loading data into data warehouses or data lakes.
• Build Data Warehouse: Design and implement a data warehouse using cloud-based services.
• HDFS, MapReduce, Apache Spark
Practical Exercises and Projects (Kaggle)
• Cloud Migration: Migrate existing applications or data to the cloud.
• Data Pipeline Implementation: Build data pipelines to extract, transform, and load data into cloud-based storage.
• Machine Learning Deployment: Deploy machine learning models to the cloud for real-time predictions.
• Cloud Architecture Design: Design cloud architectures for various use cases
Learning Resource Details
• Data Engineering Basics: https://www.edx.org/learn/data-engineering/ibm-data-engineering-basics-for-everyone
• https://www.striim.com/blog/guide-to-data-pipelines/
• Ghislain Fourny's YouTube Lectures (Big Data Systems)
Phase 2: Data Engineer Roadmap
ETL and Data Warehouse (14 weeks)
Fundamentals (2 weeks)
• What is ETL? (Overview of Extract, Transform, Load)
• Difference between ETL and ELT
• Introduction to Data Warehousing (OLTP vs. OLAP)
• Star and Snowflake schema
ETL Tools and Platforms (4 weeks)
• Overview of ETL tools: Apache NiFi, Talend, Informatica, Alteryx
• Hands-on practice with Airflow or Prefect (scheduling and orchestrating ETL jobs)
• Data pipeline creation, error handling, and logging
Data Transformation Techniques (1 week)
• Data cleaning and normalization
• Deduplication and data validation
• Handling missing values and outliers
Data Loading into Data Warehouse (2 weeks)
• Loading data into relational databases (PostgreSQL, MySQL)
• Loading data into cloud warehouses (Snowflake, Redshift, BigQuery)
• Batch vs. streaming data loading
Introduction to Big Data and Distributed Processing (2 weeks)
• Introduction to Hadoop and Spark
• Distributed computing concepts
• Working with PySpark for ETL
Real-Time Data Processing
• Introduction to Kafka and event-driven architectures
• ETL in real-time using Kafka Streams or Apache Flink
• Data consistency in real-time pipelines
Learning Resource Details
• Migrate SQL to Azure SQL: https://learn.microsoft.com/en-us/credentials/applied-skills/migrate-sql-workloads-azure-sql-database/
• Azure Data Engineering: https://learn.microsoft.com/en-us/credentials/certifications/azure-data-engineer/?practice-assessment-type=certification#certification-prepare-for-the-exam
• ETL, Dataflows: https://www.edx.org/learn/data-engineering/ibm-building-etl-and-data-pipelines-with-bash-airflow-and-kafka
• Big Data Computing by Dr Rajiv Mishra (IIT Patna): https://onlinecourses.nptel.ac.in/noc24_cs130/preview
• Big Data Lectures (YouTube): Ghislain Fourny's YouTube Lectures (Big Data Systems)
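The extract, transform and load stages described in this roadmap can be sketched end to end with the standard library. Everything concrete here is an assumption for illustration: the CSV content, the `orders` table and the cleaning rules (drop rows with no amount, keep the first row per order ID) are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real pipeline would read a file, API or queue.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, drop duplicates and rows with no amount."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        amount = row["amount"].strip()
        if order_id in seen or not amount:
            continue
        seen.add(order_id)
        clean.append((order_id, row["customer"].strip().title(), float(amount)))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each stage as its own function mirrors how orchestration tools such as Airflow schedule pipeline steps as separate tasks.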
Phase 2: Data Analyst Roadmap
Learn Data Cleaning (12-14 weeks)
Understanding Data Quality Issues
• Common Data Problems: Identify common data issues like missing values, outliers, inconsistencies, duplicates, and incorrect data types. This is already covered in the Python course, but refresh your memory.
• Data Quality Metrics: Learn about metrics to assess data quality (accuracy, completeness, consistency, timeliness)
Data Exploration and Visualization
• Exploratory Data Analysis (EDA): Use EDA techniques to explore data, identify patterns, and uncover anomalies
• Visualization Tools: Learn to use visualization tools like Tableau, Power BI, and Python libraries (Matplotlib, Seaborn) to visualize data and identify anomalies
Handling Missing Data
• Missing Data Patterns: Understand different patterns of missing data (missing completely at random, missing at random, missing not at random)
• Imputation Techniques: Explore techniques like mean/median imputation, mode imputation, hot-deck imputation, and regression imputation
Dealing with Outliers
• Outlier Detection: Learn to identify outliers using statistical methods (z-scores, IQR) and visualization
• Outlier Handling: Decide whether to remove or correct outliers based on their impact on analysis.
Data Standardization, Normalization and Scaling Techniques
Dealing with Inconsistent Data, Duplicate Data
• Data consistency checks such as matching a value with related fields
• Data cleaning methods such as fuzzy matching and standardization
• Duplicate identification and resolution based upon business rules
Data Validation and Quality Assurance
• Data Validation Rules: Create data validation rules to ensure data integrity and consistency
• Data Quality Checks: Implement regular data quality checks to monitor data quality over time
• Practice: Assess data quality using various metrics and techniques
• Practice: Build data cleaning pipelines to automate common data cleaning tasks
Learning Resource Details
• Data Cleaning Basics: https://www.kaggle.com/learn/data-cleaning
• Data Cleaning: https://www.youtube.com/watch?v=ITy8R4278sk
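The missing-data and outlier topics above lend themselves to a short example. The sketch below shows median imputation, the IQR rule and z-scores using Python's `statistics` module; the readings, the helper names and the conventional multiplier `k=1.5` are illustrative assumptions, and quartile conventions differ slightly between tools.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


def z_scores(values):
    """Standardise values: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical readings with one missing value and one suspicious value
raw = [12, 14, None, 13, 15, 14, 95, 13]
filled = impute_median(raw)
outliers = iqr_outliers(filled)
scores = z_scores(filled)
print(filled)
print(outliers)
```

Whether a flagged value is removed, corrected or kept remains a judgment call that depends on its impact on the analysis, as the Outlier Handling point notes.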
Phase 2: Data Analyst Roadmap
Learn Data Analysis, Visualization (12 weeks)
Understanding Tableau and Power BI
• Basics of Tableau (connecting to data, creating charts)
• Connecting to data sources (Excel, SQL, cloud services)
• Creating charts, tables, and visualizations (bar charts, line charts, maps)
• Building interactive dashboards
• Using calculated fields, parameters, and filters
Building Dashboards: Calculated Fields and Parameters
Power BI: Data models, DAX, and interactive visualizations
• Importing and cleaning data with Power Query
• Building data models and relationships
• Creating reports and interactive visualizations
• Writing DAX (Data Analysis Expressions) for calculated measures and columns
Cloud-Based Data Platforms
• Google BigQuery: Learn to use BigQuery for large-scale data analysis in the cloud.
• Amazon Redshift: Explore Amazon Redshift for cloud-based data warehousing.
• Azure Synapse Analytics: Understand Azure Synapse Analytics for unified data analytics and machine learning
Advanced Tools
• NoSQL Databases: Explore NoSQL databases like MongoDB or Cassandra for unstructured data.
• Data Mining Tools: Explore data mining tools like RapidMiner or KNIME for advanced analytics.
Learning Resource Details
• Power BI: https://learn.microsoft.com/en-us/credentials/certifications/data-analyst-associate/?practice-assessment-type=certification#certification-prepare-for-the-exam
• Power BI Certification: https://learn.microsoft.com/en-us/credentials/certifications/data-analyst-associate/?practice-assessment-type=certification
• Data Analysis with Python: https://www.freecodecamp.org/learn/data-analysis-with-python/#data-analysis-with-python-course
• Tableau Free Videos: https://www.tableau.com/learn/training
Phase 2: Data Analyst Roadmap
Learn Business Intelligence (6-8 weeks)
Data Modelling for BI
• Data warehousing concepts (OLTP vs. OLAP)
• Star and snowflake schema design
• Data modeling in Power BI, Tableau, and Qlik
• Fact and dimension tables, data normalization
Automation with BI Tools
• Scheduling data refreshes and auto-updates in Tableau and Power BI
• Automation with Power Automate and Tableau Prep
• Setting alerts and triggers for data updates
Advanced DAX for Power BI
• Row context vs. filter context
• Time intelligence functions (YTD, MTD, moving averages)
• Nested functions and advanced calculations
Predictive Analysis with BI Tools
• Building forecasts in Power BI and Tableau
• Integrating R or Python for advanced analytics in BI tools
• Predictive modeling techniques (regression, time series forecasting)
Learning Resource Details
• Business Intelligence: https://www.simplilearn.com/free-business-intelligence-course-online-skillup
• Business Intelligence: https://www.youtube.com/watch?v=Hg8zBJ1DhLQ
• Advanced DAX: https://www.udemy.com/course/advanced-dax-for-power-bi/
• DAX: https://www.datacamp.com/courses/introduction-to-dax-in-power-bi
• Predictive Analysis Tutorial: https://www.datacamp.com/tutorial/predictive-analytics-with-power-bi
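The time intelligence calculations mentioned under Advanced DAX (YTD totals, moving averages) are easier to reason about once the arithmetic is clear. The following is a plain-Python sketch of that arithmetic, not DAX; the function names and the monthly figures are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; the first window-1 positions have no value."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def year_to_date(monthly):
    """Running total that resets each year.

    monthly is a list of (year, month, amount) tuples in any order.
    """
    totals, out = {}, []
    for year, month, amount in sorted(monthly):
        totals[year] = totals.get(year, 0) + amount
        out.append((year, month, totals[year]))
    return out


# Hypothetical monthly sales figures
sales = [(2024, 2, 20), (2023, 12, 5), (2024, 1, 10)]
print(moving_average([10, 20, 30, 40], window=3))
print(year_to_date(sales))
```

In Power BI the same results depend on filter context and a date table, which is why row context vs. filter context appears alongside time intelligence in the list above.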
Phase 2: Data Scientist Roadmap
Learn Machine Learning (8-12 weeks)
Introduction to ML
• Real-world examples: personalized recommendations, virtual assistants, smart home devices, etc.
• Supervised learning (Linear Regression, Logistic Regression, Decision Trees)
• Unsupervised learning (K-means clustering, PCA)
• Reinforcement Learning
• Introduction to Neural Networks
• Basic terms: feature, target, training, testing, overfitting, underfitting
• ML pipeline: data preprocessing, model training, evaluation
Learning Resource Details
• Introduction to ML: Andrew Ng's Stanford Course (CS229)
• A Basic Course in Machine Learning for All by S. Padmanabhan: https://onlinecourses.swayam2.ac.in/imb24_mg126/preview
Phase 2: Data Scientist Roadmap
Supervised Learning Algorithms (8-9 weeks)
Linear Regression
• Linear regression with one variable, multiple variables
• Gradient descent and cost function
• Performance metrics (RMSE, MAE)
Logistic Regression
• Sigmoid function and probability prediction
• Cost function and optimization in logistic regression
• Binary classification metrics: accuracy, precision, recall, F1-score
Decision Trees and Random Forests
• Decision tree structure (splits, nodes, leaf nodes)
• Gini index, entropy, information gain
• Random forest algorithm, bagging, and feature importance
Support Vector Machines (SVM)
• Hyperplanes and support vectors
• SVM kernels (linear, RBF)
• Regularization (C parameter) and handling non-linear data
Unsupervised Learning Algorithms (8-9 weeks)
K-Means Clustering
• Centroids, distance metrics, and clustering
• Elbow method and silhouette score to determine optimal clusters
• Applications of clustering (e.g., customer segmentation)
Hierarchical Clustering
• Agglomerative vs divisive clustering
• Dendrograms and linkage methods
• Distance metrics for clustering (Euclidean, Manhattan)
Principal Component Analysis
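The Linear Regression topics (cost function, gradient descent, RMSE and MAE) fit into one short sketch. It fits a one-variable model by batch gradient descent on mean squared error; the learning rate, epoch count and toy data are assumptions chosen so that this particular example converges, and libraries such as scikit-learn would normally do this work.

```python
import math


def gradient_descent(xs, ys, lr=0.05, epochs=5000):
    """Fit y = w*x + b by minimising mean squared error (batch updates)."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))  # dMSE/dw
        grad_b = 2 / n * sum(errors)                             # dMSE/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def rmse(xs, ys, w, b):
    """Root mean squared error of the fitted line."""
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def mae(xs, ys, w, b):
    """Mean absolute error of the fitted line."""
    return sum(abs(w * x + b - y) for x, y in zip(xs, ys)) / len(xs)


# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4), round(rmse(xs, ys, w, b), 6), round(mae(xs, ys, w, b), 6))
```

A learning rate that is too large makes the updates diverge, so on other data `lr` usually needs tuning or the features need scaling first.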
Phase 2: Data Scientist Roadmap
Advanced ML Topics
Advanced Machine Learning
• Boosting vs. bagging
• Non-parametric methods
• Support vector machines, boosting algorithms (XGBoost, LightGBM)
• Gradient boosting algorithm and loss function minimization
• Implementing GBM, XGBoost, and LightGBM for classification and regression
• Ensemble methods (Random Forest, Gradient Boosting)
Neural Networks and Deep Learning
• Introduction to neural networks (perceptron, activation functions)
• Forward and backward propagation, loss functions
• Deep learning concepts: convolutional neural networks (CNNs), recurrent neural networks (RNNs)
Natural Language Processing
• Text pre-processing (tokenization, stemming, lemmatization)
• Bag-of-words, TF-IDF, and word embeddings (Word2Vec, GloVe)
• Language models (n-grams, RNN, Transformers)
• NLP applications: sentiment analysis, named entity recognition (NER)
Computer Vision
• Image processing, image classification (CNNs)
• Object detection (YOLO, SSD)
• Object segmentation, transfer learning
• CNN architectures (ResNet, Inception)
Probabilistic AI
• Bayesian linear regression, Gaussian processes
• Bayesian networks, Bayesian neural networks, Bayesian optimization
Reinforcement Learning
• Markov Decision Processes (MDPs)
• Value iteration
• Policy gradient methods
• Q-learning, deep reinforcement learning (DQN, PPO)
Understanding Generative AI applications
Model Optimization
• Hyperparameter Tuning and Model Evaluation
• Feature Engineering
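To make the bag-of-words and TF-IDF item under Natural Language Processing concrete, here is a minimal sketch using only the standard library. It assumes whitespace tokenization and the unsmoothed textbook weighting (term frequency times ln(N / document frequency)); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The documents and function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenised = [tokenize(d) for d in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)  # bag-of-words counts for this document
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran fast"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("fast") gets the highest weight there, which is the intuition behind TF-IDF.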
Phase 2: Data Scientist Roadmap
Learning Resource Details
• Machine Learning with Python: https://www.freecodecamp.org/learn/machine-learning-with-python/
• ML for Beginners: https://microsoft.github.io/ML-For-Beginners/#/
• Machine Learning: https://learn.microsoft.com/en-us/collections/qrqzamz1nn2wx3?WT.mc_id=academic-77952-bethanycheum
• Probabilistic Machine Learning by Kevin P. Murphy (Book)
• Introduction to Statistical Learning by Tibshirani (Book)
• Reinforcement Learning by Sutton and Barto (Book)
• Data Science for Beginners: https://microsoft.github.io/Data-Science-For-Beginners/#/
• Data Science: https://developers.google.com/machine-learning/crash-course
• Bishop Textbook on Deep Learning (https://www.bishopbook.com/)
• Pattern Recognition and Machine Learning by Christopher M. Bishop (https://www.amazon.in/PATTERN-RECOGNITION-MACHINE-LEARNING-Christopher/dp/1493938436)
• Gradient Boost Video: https://www.youtube.com/watch?v=3CC4N4z3GJc
• XGBoost Videos: https://xgboost.readthedocs.io/en/stable/


Go to Kaggle for Code Contest
