Roadmap for Data Science & Roles

Piyush Shankar Garg

Professor of Practice
Computer & Management Sciences
LIET, Greater Noida
Contribution and Valuable Inputs by:
Rahul Sharma, Senior Director, SD Global Services
Samyak Jain, AI Researcher @ RocketFrog.ai | MSc DS @ ETHZ | BTech CS @ NITK Surathkal (Gold Medallist)
Aditya Shankar Garg (IIT Delhi), PhD Student, Columbia University, New York, USA
Mythology and AI

• Vishwkarma

• Pushpak

• Mantra-Tantra-Yantra
Data Analyst vs Data Engineer vs Data Scientist
[Figure: Transaction data (structured and unstructured) from application development sources such as POS and ERP (OLTP) flows through ETL (Extract/Transform/Load) into a data warehouse (OLAP). Role labels: Data Engineer; Data Analyst (SQL, Python, Excel, enterprise business reporting tool); Data Scientists (predictive modelling, ML and maths).]
Role Definition

Who?
• Data Scientist: Develops and implements data-driven solutions to overcome business challenges.
• Data Analyst: Collects, cleans, stores and organises data.
• Data Engineer: Builds and maintains the infrastructure and tools needed to collect and store large amounts of data.

Focus
• Data Scientist: A forward-looking view of the data.
• Data Analyst: Technical analysis of the data as it stands at present.
• Data Engineer: Continuously improving the way data is consumed.

Role Details
• Data Scientist: Applies supervised, unsupervised and deep learning to data for classification and regression. Data scientists rely heavily on neural networks and machine learning, for example for regression analysis.
• Data Analyst: Cleans data, organises raw data, and analyses and visualizes data to interpret the results.
• Data Engineer: Builds data in an appropriate format and works at the back end. A data engineer uses optimized machine learning algorithms to maintain data and make it available in the most appropriate manner.
Role Definition

Skills: Data Scientist
• Programming, proficient knowledge (Python, R)
• Statistics and Calculus
• Machine Learning, Deep Learning (Scikit-learn, TensorFlow)
• Data Visualization (Matplotlib, Seaborn)
• Big Data, working knowledge (Spark, Hadoop)
• SQL/NoSQL, Cloud Platforms (AWS, Google Cloud)
• Strong communication and domain skills

Skills: Data Analyst
• Expert in Excel
• Programming, basic knowledge (Python, SQL)
• Data Manipulation (Pandas), Data Visualization & Enterprise BI tools (Tableau, Power BI)
• Statistical Analysis, Reporting Tools (Excel, Google Sheets), SAS, SPSS, Business Acumen
• Strong communication, presentation and domain knowledge

Skills: Data Engineer
• Deep knowledge of programming (Python, R, Java)
• ETL & Data Modelling Tools, Big Data Technologies (Spark, Hadoop)
• SQL/NoSQL, Data Storage (Redshift, BigQuery)
• Cloud Services (AWS, Azure)
• Data Pipeline Tools (Airflow), Hadoop, Pig, Hive
Role Definition

Responsibilities
• Data Scientist: Takes a data science project from inception to end. Think of the data scientist as the solutions architect of the software world. They build models, use existing models, fine-tune them and tune hyperparameters. They combine domain knowledge, an understanding of customer requirements and technical skills to achieve the goal.
• Data Analyst: Data analysts are good statisticians. They visualize data, create charts, reports and dashboards, are expert in data visualization tools such as Tableau, Power BI and Excel, and implement requests coming from data scientists or the business.
• Data Engineer: Data engineers are data architects. They bring data from various sources and formats into the required format and data source, where it can be consumed for storage, analysis, reporting and archiving.

Example
• Data Scientist: Building a predictive model to forecast customer attrition / retention rate; developing a recommendation system for products.
• Data Analyst: Analysing sales data to identify trends and customer segments; creating dashboards to track key metrics.
• Data Engineer: Designing a data warehouse to store customer data from various sources; ETL (Extract, Transform, Load) processes for data cleansing and integration.
Indicative Time Scale
(Indicative time in weeks, with 15-25 hours per week effort)

Area            Learning Path                                    Time
Common Skills   Excel (Basic and Advanced)                       3-5 weeks
Common Skills   Python                                           12-15 weeks
Common Skills   SQL                                              3-5 weeks
Common Skills   Statistics                                       8-10 weeks
Data Engineer   Cloud Pipeline, Cloud Platform and Engineering   20-22 weeks
Data Engineer   ETL and Data Warehouse                           10-14 weeks
Indicative Time Scale
(Indicative time in weeks, with 15-25 hours per week effort)

Area             Learning Path                                          Time
Data Analyst     Data Cleaning and Understanding of business domain     8-10 weeks
Data Analyst     Data Analysis and Visualization Tools                  8-10 weeks
Data Analyst     Business Intelligence                                  6-8 weeks
Data Scientist   Machine Learning & AI concepts                         5-7 weeks
Data Scientist   Supervised Learning                                    6-8 weeks
Data Scientist   Unsupervised Learning                                  6-8 weeks
Data Scientist   Deep Learning and Generative AI                        7-9 weeks
Data Scientist   Model Optimization and Evaluation                      5-7 weeks

Role-Skill Level Mapping

Skill                             Data Engineer   Data Analyst   Data Scientist
Excel Course                      Good            Excellent      Very Good
Python                            Excellent       Very Good      Excellent
SQL                               Excellent       Very Good      Very Good
Statistics & Linear Algebra       Good            Very Good      Excellent
Cloud Platforms & Data Pipeline   Excellent       Good           Good
ETL & Data Warehouse              Excellent       Good           Good
Data Cleaning                     Good            Very Good      Good
Data Analysis & Visualization     Good            Excellent      Good
Business Intelligence             Good            Excellent      Good
Machine Learning                  Good            Good           Excellent
Unsupervised Learning             Good            Good           Excellent
Supervised Learning               Good            Good           Excellent
Advanced Machine Learning         Good            Good           Excellent
Model Optimization & Evaluation   Good            Good           Excellent

Note: Some of the Data Scientist topics may also help for AI.
Phase 1: Foundations (For everyone)
Excel Course
Essential Excel Skills
▪ Interface and Navigation: Familiarize yourself with Excel's layout, ribbons, and functions
▪ Data Entry and Formatting: Learn to input data, apply formatting (fonts, colors, alignment), and create basic formulas
▪ Cell References and Ranges: Understand how to reference cells, create ranges, and use absolute and relative references
▪ Basic Formulas: Master essential formulas like SUM, AVERAGE, COUNT, IF, and VLOOKUP
Intermediate Excel Skills
▪ Advanced Formulas: Explore more complex formulas like nested IFs, SUMIFS, COUNTIFS, and INDEX-MATCH
▪ Data Validation: Implement data validation rules to ensure data integrity
▪ Pivot Tables: Create pivot tables to summarize and analyze large datasets efficiently
▪ Conditional Formatting: Apply conditional formatting to highlight specific data values or trends
Advanced Excel Skills
▪ Macros: Learn to automate repetitive tasks using VBA (Visual Basic for Applications)
▪ Power Query: Explore Power Query (Get & Transform) for data cleaning, transformation, and integration
▪ Power Pivot: Create data models and perform advanced data analysis using Power Pivot
▪ Data Analysis Tools: Utilize tools like Data Analysis ToolPak for statistical analysis (t-tests, ANOVA, regression)
Practical Exercises and Projects
• Real-World Datasets: Practice Excel skills with real-world datasets from various sources (e.g., Kaggle (https://www.kaggle.com/datasets), UCI Machine Learning Repository (https://archive.ics.uci.edu/)).
• Data Cleaning and Preparation: Clean and prepare data for analysis, handling missing values, outliers, and inconsistencies.
• Data Analysis and Visualization: Create insightful visualizations (charts, graphs) to communicate findings effectively.
Learning Resource Details
Excel Data Analytics: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-analytics-using-excel
Excel Data Analytics: https://www.codecademy.com/learn/analyze-data-with-microsoft-excel
Business Analytics: https://www.coursera.org/learn/business-analytics-excel
Full Project on Excel: https://www.youtube.com/watch?v=opJgMj1IUrc
Excel Dashboards: https://www.youtube.com/watch?v=m13o5aqeCbM
Phase 1: Foundations (For everyone)
Learn Python
Fundamentals (1 week)
▪ Introduction: Syntax and control flow
• Functions: Understand how to define and call functions, passing arguments and returning values
• Modules and Packages: Explore how to import and use modules and packages from the Python Standard Library
Data Structures and Algorithms (part of your syllabus)
• Advanced Data Structures: Delve into linked lists, stacks, queues, trees, and graphs
• Algorithms: Study sorting algorithms (bubble, insertion, selection, merge, quick), searching algorithms (linear, binary), and graph algorithms (BFS, DFS)
• Problem-Solving: Practice solving coding problems on platforms like LeetCode or HackerRank
Learn Python Libraries (3-5 weeks)
• NumPy: Learn to perform numerical operations, create arrays, and manipulate matrices efficiently.
• Pandas: Master data manipulation, cleaning, and analysis using DataFrames and Series
• Matplotlib: Explore data visualization techniques, creating various types of plots and charts
• Seaborn: Enhance data visualizations with statistical plots and themes
Machine Learning (3-5 weeks)
• Scikit-learn: Understand machine learning concepts like supervised learning (regression, classification), unsupervised learning (clustering), and model evaluation
• TensorFlow or PyTorch: Choose a deep learning framework and learn about neural networks, backpropagation, and building models
• Natural Language Processing (NLP): Explore techniques for working with text data, including tokenization, stemming, and sentiment analysis
Other Topics (2 weeks)
• Web Scraping: Learn to extract data from websites using libraries like Beautiful Soup or Scrapy
• Web Development: Explore frameworks like Flask or Django for building web applications. Understand deployment as APIs or microservices, or how to analyze and show data on a UI using TensorFlow.js libraries.
• Data Engineering Basics: Understand data pipelines, ETL processes, and cloud platforms like AWS, GCP, or Azure.
Practical Projects and Case Studies
• Kaggle Competitions: Participate in Kaggle competitions to apply your skills and learn from others.
• Personal Projects: Work on personal projects that interest you to reinforce your learning.
Learning Resource Details
Python for Data Science: https://www.codecademy.com/learn/getting-started-with-python-for-data-science
Python Basics: https://www.codecademy.com/learn/python-for-programmers
Data Analysis with Python: https://www.freecodecamp.org/learn/data-analysis-with-python/
Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein
Algorithms Illuminated by Tim Roughgarden (www.algorithmsilluminated.org)
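As an illustration of the searching and graph algorithms named under Data Structures and Algorithms above, here is a minimal sketch of binary search and breadth-first search (BFS). The function names and the sample data are made up for this example.

```python
from collections import deque


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1      # target can only be in the right half
        else:
            high = mid - 1     # target can only be in the left half
    return -1


def bfs_order(graph, start):
    """Return the nodes in the order BFS visits them, starting from start."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited


# Hypothetical sample data
numbers = [2, 5, 8, 12, 16, 23]
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

print(binary_search(numbers, 12))
print(bfs_order(graph, "A"))
```

Binary search needs sorted input and halves the search range at every step; BFS uses a queue so that nodes closer to the start are visited first.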
Phase 1: Foundations (For everyone)
Learn SQL
Fundamentals (1 week)
• Database Basics: Understand database concepts like tables, columns, rows, and primary/foreign keys.
• Data Types: Learn about common data types (numeric, character, date/time, logical).
• DDL (Data Definition Language): Create, modify, and delete tables and databases.
• DML (Data Manipulation Language): Insert, update, and delete data from tables. ACID properties.

Querying Data (1-2 weeks)
• SELECT Statement: Master the SELECT statement to retrieve data from tables.
• Filtering Data: Use WHERE and HAVING clauses to filter data based on conditions.
• Grouping Data: Apply GROUP BY to aggregate data and use functions like COUNT, SUM, AVG, MIN, and MAX.
• Joining Tables: Learn to combine data from multiple tables using INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

Advanced SQL (2-3 weeks)
• Subqueries: Understand how to use subqueries within SELECT, WHERE, and HAVING clauses.
• Common Table Expressions (CTEs): Explore CTEs for temporary result sets within a query
• Window Functions (Very Important): Learn to calculate values across rows within a result set using RANK, DENSE_RANK, ROW_NUMBER, LEAD, and LAG
• Stored Procedures: Create stored procedures for reusable code blocks
• Performance Tuning and Query Optimization

Practical Exercises
• Database Design: Practice designing database schemas based on real-world scenarios.
• Query Writing: Write complex SQL queries to extract specific data from databases.
• Data Analysis: Use SQL to analyze datasets and answer data-driven questions.

Learning Resource Details
SQL for Data Science: https://www.edx.org/learn/data-science/ibm-sql-for-data-science
SQL with Google BigQuery: https://www.edx.org/learn/data-science/ibm-sql-for-data-science
Intro to SQL: https://www.codecademy.com/learn/intro-to-sql
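To make the CTE and window function topics above concrete, here is a minimal sketch that runs SQL from Python using the standard-library sqlite3 module. The `sales` table and its rows are invented for illustration, and the example assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Asha", 500),
        ("North", "Ravi", 300),
        ("South", "Meera", 400),
        ("South", "John", 400),
        ("South", "Kiran", 100),
    ],
)

query = """
WITH regional AS (              -- CTE: a named temporary result set
    SELECT region, rep, amount FROM sales
)
SELECT region, rep, amount,
       RANK()       OVER (PARTITION BY region ORDER BY amount DESC)      AS rnk,
       DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC)      AS dense_rnk,
       LAG(amount)  OVER (PARTITION BY region ORDER BY amount DESC, rep) AS prev_amount
FROM regional
ORDER BY region, amount DESC, rep
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note how the two tied South reps share rank 1, after which RANK skips to 3 while DENSE_RANK continues at 2; LAG returns the previous row's amount within the same region, or NULL for the first row.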
Phase 1: Foundations (For everyone)
Learn Statistics
Descriptive Statistics (2 weeks)
• Central Tendency: Understand measures like mean, median, and mode to summarize data.
• Variability: Learn about variance, standard deviation, range, and interquartile range to measure data dispersion.
• Distribution: Explore common probability distributions (normal, binomial, Poisson).
• Data Visualization: Create visualizations (histograms, box plots) to understand data distribution.

Inferential Statistics (3 weeks)
• Sampling: Learn about sampling methods (simple random, stratified, cluster) and their implications.
• Cramér-Rao bound (CRB)
• Hypothesis Testing: Understand hypothesis testing concepts (null hypothesis, alternative hypothesis, p-value, confidence intervals).
• T-tests and z-tests for comparing means

Regression Analysis (2 weeks)
• Simple Linear Regression: Model relationships between two variables.
• Multiple Linear Regression: Model relationships between a dependent variable and multiple independent variables.
• Model Evaluation: Assess model performance using metrics like R-squared, adjusted R-squared, and RMSE.

Probability Theory (2 weeks)
• Probability Concepts: Understand probability axioms, conditional probability, Bayes' theorem.
• Discrete and Continuous Distributions: Explore common probability distributions (Bernoulli, binomial, Poisson, normal).
• Joint Probability Distributions: Understand joint probability distributions and independence
• Expectation

Bayesian Statistics (2 weeks)
• Bayesian Inference: Learn Bayesian inference concepts (prior, likelihood, posterior).
• Bayesian Models: Explore Bayesian models for parameter estimation and inference.

Practical Exercises and Projects (Kaggle)
• Data Analysis: Apply statistical techniques to analyze real-world datasets.
• Hypothesis Testing: Conduct hypothesis tests to answer research questions.
• Regression Modeling: Build regression models to predict outcomes.
• Case Studies: Work on case studies to apply statistical concepts to solve problems from Kaggle or UCI Machine Learning Repository.
• Statistical Software: Learn to use statistical software like R or Python with libraries like NumPy, Pandas, SciPy, and Statsmodels.

Learning Resource Details
Statistics for Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/statistics-for-data-science
Introduction to Statistics: https://www.coursera.org/learn/stanford-statistics
Statistics for Data Science: https://intellipaat.com/academy/course/statistics-for-data-science-free-course/
Mathematical Statistics and Data Analysis (by John A. Rice) (Korivernon)
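The descriptive statistics and t-test topics above can be tried out with nothing more than Python's standard library. The sketch below computes central tendency and variability for a small sample and then a one-sample t statistic; the sample values and the hypothesised mean `mu0` are invented for illustration.

```python
import math
import statistics

sample = [52, 48, 55, 51, 49, 53, 50, 54]   # hypothetical measurements
mu0 = 50                                     # hypothesised population mean (assumption)

mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)             # sample standard deviation (divides by n - 1)

# One-sample t statistic: how many standard errors the sample mean lies from mu0
standard_error = stdev / math.sqrt(len(sample))
t_stat = (mean - mu0) / standard_error

print(mean, median, round(stdev, 3), round(t_stat, 3))
```

With 8 observations there are 7 degrees of freedom, for which the two-sided 5% critical value is about 2.365, so a t statistic below that would not reject the null hypothesis at that level. In practice a library such as SciPy would also report the p-value.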
Phase 1: Foundations (For everyone)

Understand Linear Algebra

Fundamentals (3 weeks)
• Vector spaces
• Rank of matrices
• Eigenvalues/eigenvectors
• Singular value decomposition (SVD)
• Matrix factorization
• Projection
• Inner Product Spaces
• Application of Linear Algebra concepts for data scientist role (high level understanding only):
  • Loss function and recommender system for ML
  • Word embedding for NLP
  • Image convolution for computer vision

Learning Resource Details
Gilbert Strang’s Linear Algebra Course (MIT OCW)
Linear Algebra Done Right - Sheldon Axler
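To give the eigenvalue/eigenvector topic above a concrete feel, here is a small sketch of power iteration, one simple way to estimate the dominant eigenvalue of a matrix. It is an illustration only, written in plain Python; the matrix `A` is a made-up example and in practice one would use NumPy.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, steps=100):
    """Estimate the dominant eigenvalue and a unit eigenvector of a square matrix."""
    vector = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(steps):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]      # renormalise to unit length
    # Rayleigh quotient v^T A v for a unit vector v
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


A = [[2.0, 1.0], [1.0, 2.0]]      # hypothetical symmetric matrix
eigenvalue, eigenvector = power_iteration(A)
print(eigenvalue, eigenvector)
```

For this matrix the eigenvalues are 3 and 1, so repeated multiplication pulls any starting vector towards the direction (1, 1), the eigenvector for the larger eigenvalue.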
Phase 2: Data Engineer Roadmap
Learn Cloud Platforms and Data Pipeline (20-24 weeks)
Choose a cloud provider (AWS, GCP, or Azure) and gain hands-on experience with its services.

Fundamentals (2 weeks)
• Cloud Computing Concepts: Understand fundamental concepts like IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).
• Cloud Providers: Choose a cloud provider (AWS, GCP, or Azure) and familiarize yourself with its services and pricing models.
• Cloud Console: Learn to navigate the cloud provider's console and manage resources.

Data Storage Concepts (5-6 weeks)
• Object Storage: Explore services like S3 (AWS), Blob Storage (Azure), and Cloud Storage (GCP) for storing large amounts of unstructured data.
• Data Lakes: Understand the concept of data lakes and how to implement them using cloud-based services.
• Data Warehouses: Learn about cloud-based data warehouses like Redshift (AWS), BigQuery (GCP), and Synapse Analytics (Azure).

Data Processing (4-5 weeks)
• Serverless Computing: Explore services like Lambda (AWS), Cloud Functions (GCP), and Azure Functions for running code without managing servers.
• ETL Tools: Learn to use cloud-based ETL tools like AWS Glue, Dataflow (GCP), and Azure Data Factory.
• Data Pipelines: Design and implement data pipelines using orchestration tools like Airflow.

Data Analysis and ML part-1 (3-4 weeks)
• Managed Services: Utilize managed services like EMR (AWS), Dataproc (GCP), and HDInsight (Azure) for running big data analytics frameworks like Hadoop and Spark.
• Machine Learning Platforms: Explore platforms like SageMaker (AWS), AI Platform (GCP), and Azure Machine Learning for building and deploying machine learning models.
• Data Visualization: Use cloud-based visualization tools like QuickSight (AWS), Looker Studio (GCP), and Power BI (Azure).

Data Analysis and ML part-2 (3-5 weeks)
• Learn Data Pipelines: Explore tools like Apache Airflow or AWS Glue to build and manage data pipelines.
• Learn ETL Processes: Understand the process of extracting, transforming, and loading data into data warehouses or data lakes.
• Build Data Warehouse: Design and implement a data warehouse using cloud-based services.
• HDFS, MapReduce, Apache Spark

Practical Exercises and Projects (Kaggle)
• Cloud Migration: Migrate existing applications or data to the cloud.
• Data Pipeline Implementation: Build data pipelines to extract, transform, and load data into cloud-based storage.
• Machine Learning Deployment: Deploy machine learning models to the cloud for real-time predictions.
• Cloud Architecture Design: Design cloud architectures for various use cases

Learning Resource Details
Data Engineering Basics: https://www.edx.org/learn/data-engineering/ibm-data-engineering-basics-for-everyone
Guide to Data Pipelines: https://www.striim.com/blog/guide-to-data-pipelines/
Ghislain Fourny’s YouTube Lectures (Big Data Systems)
Phase 2: Data Engineer Roadmap
ETL and Data Warehouse (14 weeks)
Fundamentals (2 weeks)
• What is ETL? (Overview of Extract, Transform, Load)
• Difference between ETL and ELT
• Introduction to Data Warehousing (OLTP vs. OLAP)
• Star and Snowflake schema

ETL Tools and Platforms (4 weeks)
• Overview of ETL tools: Apache NiFi, Talend, Informatica, Alteryx
• Hands-on practice with Airflow or Prefect (scheduling and orchestrating ETL jobs)
• Data pipeline creation, error handling, and logging

Data Transformation Techniques (1 week)
• Data cleaning and normalization
• Deduplication and data validation
• Handling missing values and outliers

Data Loading into Data Warehouse (2 weeks)
• Loading data into relational databases (PostgreSQL, MySQL)
• Loading data into cloud warehouses (Snowflake, Redshift, BigQuery)
• Batch vs. streaming data loading

Introduction to Big Data and Distributed Processing (2 weeks)
• Introduction to Hadoop and Spark
• Distributed computing concepts
• Working with PySpark for ETL

Real-Time Data Processing
• Introduction to Kafka and event-driven architectures
• ETL in real-time using Kafka Streams or Apache Flink
• Data consistency in real-time pipelines

Learning Resource Details
Migrate SQL to Azure SQL: https://learn.microsoft.com/en-us/credentials/applied-skills/migrate-sql-workloads-azure-sql-database/
Azure Data Engineering: https://learn.microsoft.com/en-us/credentials/certifications/azure-data-engineer/?practice-assessment-type=certification#certification-prepare-for-the-exam
ETL, Dataflows: https://www.edx.org/learn/data-engineering/ibm-building-etl-and-data-pipelines-with-bash-airflow-and-kafka
Big Data Computing by Dr Rajiv Mishra (IIT Patna): https://onlinecourses.nptel.ac.in/noc24_cs130/preview
Big Data Lectures (YouTube): Ghislain Fourny’s YouTube Lectures (Big Data Systems)
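The extract, transform and load steps described above can be sketched end to end in a few lines. This is a toy batch pipeline under stated assumptions: the CSV text, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse such as PostgreSQL or BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: note the duplicate row, stray spaces and a missing amount
RAW_CSV = """order_id,customer,amount
1, alice ,120.50
2,BOB,80
2,BOB,80
3,carol,
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, drop rows without an amount, deduplicate on order_id."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        if not row["amount"].strip() or order_id in seen:
            continue
        seen.add(order_id)
        clean.append((order_id, row["customer"].strip().title(), float(row["amount"])))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors how orchestration tools such as Airflow schedule each stage as its own task, with error handling and logging around it.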
Phase 2: Data Analyst Roadmap
Learn Data Cleaning (12-14 weeks)
Understanding Data Quality Issues
• Common Data Problems: Identify common data issues like missing values, outliers, inconsistencies, duplicates, and incorrect data types. This is already covered in the Python course, but refresh your memory.
• Data Quality Metrics: Learn about metrics to assess data quality (accuracy, completeness, consistency, timeliness)

Data Exploration and Visualization
• Exploratory Data Analysis (EDA): Use EDA techniques to explore data, identify patterns, and uncover anomalies
• Visualization Tools: Learn to use visualization tools like Tableau, Power BI, and Python libraries (Matplotlib, Seaborn) to visualize data and identify anomalies

Handling Missing Data
• Missing Data Patterns: Understand different patterns of missing data (missing completely at random, missing at random, missing not at random)
• Imputation Techniques: Explore techniques like mean/median imputation, mode imputation, hot-deck imputation, and regression imputation

Dealing with Outliers
• Outlier Detection: Learn to identify outliers using statistical methods (z-scores, IQR) and visualization
• Outlier Handling: Decide whether to remove or correct outliers based on their impact on analysis.

Data Standardization, Normalization and Scaling Techniques

Dealing with Inconsistent Data, Duplicate Data
• Data consistency checks such as matching a value with related fields
• Data cleaning methods such as fuzzy matching and standardization
• Duplicate identification and resolution based on business rules

Data Validation and Quality Assurance
• Data Validation Rules: Create data validation rules to ensure data integrity and consistency
• Data Quality Checks: Implement regular data quality checks to monitor data quality over time
• Practice: Assess data quality using various metrics and techniques
• Practice: Build data cleaning pipelines to automate common data cleaning tasks

Learning Resource Details
Data Cleaning Basics: https://www.kaggle.com/learn/data-cleaning
Data Cleaning: https://www.youtube.com/watch?v=ITy8R4278sk
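Two of the techniques named above, median imputation and IQR-based outlier detection, can be sketched with the standard library alone. The readings below are invented, and the 1.5 × IQR fence is the common rule of thumb rather than the only possible threshold.

```python
import statistics

raw = [10, 12, 11, None, 13, 12, 95, 11]    # hypothetical readings; None marks a missing value

# Median imputation: replace missing values with the median of the observed values
observed = [x for x in raw if x is not None]
fill_value = statistics.median(observed)
imputed = [fill_value if x is None else x for x in raw]

# IQR rule: flag values more than 1.5 * IQR below the first or above the third quartile
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < lower or x > upper]

print(fill_value, outliers)
```

The median is used for the fill value because, unlike the mean, it is barely affected by the extreme reading of 95. Whether that reading should then be removed or corrected depends on its impact on the analysis, as noted above.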
Phase 2: Data Analyst Roadmap
Learn Data Analysis, Visualization (12 weeks)
Understanding Tableau and Power BI
• Basics of Tableau (connecting to data, creating charts)
• Connecting to data sources (Excel, SQL, cloud services)
• Creating charts, tables, and visualizations (bar charts, line charts, maps)
• Building interactive dashboards
• Using calculated fields, parameters, and filters

Building dashboards - Calculated fields and Parameters

Power BI: Data models, DAX, and interactive visualizations
• Importing and cleaning data with Power Query
• Building data models and relationships
• Creating reports and interactive visualizations
• Writing DAX (Data Analysis Expressions) for calculated measures and columns

Cloud Based Data Platforms
• Google BigQuery: Learn to use BigQuery for large-scale data analysis in the cloud.
• Amazon Redshift: Explore Amazon Redshift for cloud-based data warehousing.
• Azure Synapse Analytics: Understand Azure Synapse Analytics for unified data analytics and machine learning

Advanced Tools
• NoSQL Databases: Explore NoSQL databases like MongoDB or Cassandra for unstructured data.
• Data Mining Tools: Explore data mining tools like RapidMiner or KNIME for advanced analytics.

Learning Resource Details
Power BI: https://learn.microsoft.com/en-us/credentials/certifications/data-analyst-associate/?practice-assessment-type=certification#certification-prepare-for-the-exam
Power BI Certification: https://learn.microsoft.com/en-us/credentials/certifications/data-analyst-associate/?practice-assessment-type=certification
Data Analysis with Python: https://www.freecodecamp.org/learn/data-analysis-with-python/#data-analysis-with-python-course
Tableau Free Videos: https://www.tableau.com/learn/training
Phase 2: Data Analyst Roadmap
Learn Business Intelligence (6-8 weeks)
Data Modelling for BI
• Data warehousing concepts (OLTP vs. OLAP)
• Star and snowflake schema design
• Data modeling in Power BI, Tableau, and Qlik
• Fact and dimension tables, data normalization

Automation with BI Tools
• Scheduling data refreshes and auto-updates in Tableau and Power BI
• Automation with Power Automate and Tableau Prep
• Setting alerts and triggers for data updates

Advanced DAX for Power BI
• Row context vs. filter context
• Time intelligence functions (YTD, MTD, moving averages)
• Nested functions and advanced calculations

Predictive Analysis with BI Tools
• Building forecasts in Power BI and Tableau
• Integrating R or Python for advanced analytics in BI tools
• Predictive modeling techniques (regression, time series forecasting)

Learning Resource Details
Business Intelligence: https://www.simplilearn.com/free-business-intelligence-course-online-skillup
Business Intelligence: https://www.youtube.com/watch?v=Hg8zBJ1DhLQ
Advanced DAX: https://www.udemy.com/course/advanced-dax-for-power-bi/
DAX: https://www.datacamp.com/courses/introduction-to-dax-in-power-bi
Predictive Analysis Tutorial: https://www.datacamp.com/tutorial/predictive-analytics-with-power-bi
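The time intelligence ideas listed above (year-to-date totals and moving averages) are easiest to understand from the arithmetic. The sketch below shows that arithmetic in Python rather than DAX; the monthly figures and the `moving_average` helper are made up for illustration, and in Power BI the same results would come from DAX measures.

```python
from itertools import accumulate

monthly_sales = [100, 120, 90, 150, 130, 110]   # hypothetical Jan..Jun figures

# Year-to-date (YTD): running total from the start of the year
ytd = list(accumulate(monthly_sales))


def moving_average(values, window):
    """Trailing moving average; None until a full window of values is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[i + 1 - window : i + 1]) / window)
    return result


ma3 = moving_average(monthly_sales, window=3)
print(ytd)
print(ma3)
```

A 3-month moving average smooths out the dip in the third month, which is why it is often plotted alongside the raw monthly values on a dashboard.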
Phase 2: Data Scientist Roadmap
Learn Machine Learning (8-12 weeks)

Introduction to ML
• Real-world examples: personalized recommendations, virtual assistants, smart home devices, etc.
• Supervised learning (Linear Regression, Logistic Regression, Decision Trees)
• Unsupervised learning (K-means clustering, PCA)
• Reinforcement Learning
• Introduction to Neural Networks
• Basic terms: feature, target, training, testing, overfitting, underfitting
• ML pipeline: data preprocessing, model training, evaluation

Learning Resource Details
Introduction to ML: Andrew Ng’s Stanford Course (CS229)
A Basic Course in Machine Learning for All by S. Padmanabhan: https://onlinecourses.swayam2.ac.in/imb24_mg126/preview
Phase 2: Data Scientist Roadmap
Supervised Learning Algorithms (8-9 weeks)

Linear Regression
• Linear regression with one variable, multiple variables
• Gradient descent and cost function
• Performance metrics (RMSE, MAE)

Logistic Regression
• Sigmoid function and probability prediction
• Cost function and optimization in logistic regression
• Binary classification metrics: accuracy, precision, recall, F1-score

Decision Trees and Random Forests
• Decision tree structure (splits, nodes, leaf nodes)
• Gini index, entropy, information gain
• Random forest algorithm, bagging, and feature importance

Support Vector Machines (SVM)
• Hyperplanes and support vectors
• SVM kernels (linear, RBF)
• Regularization (C parameter) and handling non-linear data

Unsupervised Learning Algorithms (8-9 weeks)

K-Means Clustering
• Centroids, distance metrics, and clustering
• Elbow method and silhouette score to determine optimal clusters
• Applications of clustering (e.g., customer segmentation)

Hierarchical Clustering
• Agglomerative vs divisive clustering
• Dendrograms and linkage methods
• Distance metrics for clustering (Euclidean, Manhattan)

Principal Component Analysis
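The "gradient descent and cost function" item under Linear Regression above can be illustrated with a short sketch for the one-variable case. This is a minimal batch gradient descent on the mean squared error cost, written in plain Python; the data, learning rate and epoch count are assumptions chosen so that this toy example converges, not recommended settings.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on the cost (1 / 2n) * sum(error^2)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n   # dJ/dw
        grad_b = sum(errors) / n                              # dJ/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def rmse(xs, ys, w, b):
    """Root mean squared error of the fitted line on the given points."""
    return (sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


# Hypothetical data generated from y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_linear_regression(xs, ys)
print(round(w, 4), round(b, 4), rmse(xs, ys, w, b))
```

Each step moves the parameters a small distance against the gradient of the cost, so the fitted line approaches slope 2 and intercept 1. A learning rate that is too large would make the updates diverge instead.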
Phase 2: Data Scientist Roadmap
Advanced ML Topics

Advanced Machine Learning
• Boosting vs. bagging
• Non-parametric methods
• Support vector machines, boosting algorithms (XGBoost, LightGBM)
• Gradient boosting algorithm and loss function minimization
• Implementing GBM, XGBoost, and LightGBM for classification and regression
• Ensemble methods (Random Forest, Gradient Boosting)

Neural Networks and Deep Learning
• Introduction to neural networks (perceptron, activation functions)
• Forward and backward propagation, loss functions
• Deep learning concepts: convolutional neural networks (CNNs), recurrent neural networks (RNNs)

Natural Language Processing
• Text pre-processing (tokenization, stemming, lemmatization)
• Bag-of-words, TF-IDF, and word embeddings (Word2Vec, GloVe)
• Language models (n-grams, RNN, Transformers)
• NLP applications: sentiment analysis, named entity recognition (NER)

Computer Vision
• Image processing, image classification (CNNs)
• Object detection (YOLO, SSD)
• Object segmentation, transfer learning
• CNN architectures (ResNet, Inception)

Probabilistic AI
• Bayesian linear regression, Gaussian processes
• Bayesian networks, Bayesian neural networks, Bayesian optimization

Reinforcement Learning
• Markov Decision Processes (MDPs)
• Value iteration
• Policy gradient methods
• Q-learning, deep reinforcement learning (DQN, PPO)

Understanding Generative AI applications

Model Optimization
• Hyperparameter Tuning and Model Evaluation
• Feature Engineering
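As a small illustration of the tokenization and TF-IDF items in the Natural Language Processing topic above, here is a plain-Python sketch using the basic definition tf × log(N / df). The three-document corpus is invented, and library implementations such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]   # hypothetical mini corpus
tokenized = [doc.split() for doc in docs]               # whitespace tokenization

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(tokenized)


def tf_idf(tokens):
    """Return {term: tf * idf} for one document, with idf = log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, w)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("dog") gets the highest weight, which is the intuition behind using TF-IDF instead of raw bag-of-words counts.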
Phase 2: Data Scientist Roadmap
Learning Resource Details
Machine Learning with Python: https://www.freecodecamp.org/learn/machine-learning-with-python/
ML for Beginners: https://microsoft.github.io/ML-For-Beginners/#/
Machine Learning: https://learn.microsoft.com/en-us/collections/qrqzamz1nn2wx3?WT.mc_id=academic-77952-bethanycheum
Probabilistic Machine Learning by Kevin P. Murphy (Book)
Introduction to Statistical Learning by Tibshirani (Book)
Reinforcement Learning by Sutton and Barto (Book)
Data Science for Beginners: https://microsoft.github.io/Data-Science-For-Beginners/#/
Data Science: https://developers.google.com/machine-learning/crash-course
Bishop Textbook on Deep Learning (https://www.bishopbook.com/)
Pattern Recognition and Machine Learning by Christopher M. Bishop (https://www.amazon.in/PATTERN-RECOGNITION-MACHINE-LEARNING-Christopher/dp/1493938436)
Gradient Boost Video: https://www.youtube.com/watch?v=3CC4N4z3GJc
XGBoost Videos: https://xgboost.readthedocs.io/en/stable/

Go to Kaggle for Code Contest

Ajay Resume VLaF
No ratings yet
Ajay Resume VLaF
2 pages
DP-200 Exam: Exam DP-200 Exam Title Implementing An Azure Data Solution 8.0 Product Type 120 Q&A With Explanations
No ratings yet
DP-200 Exam: Exam DP-200 Exam Title Implementing An Azure Data Solution 8.0 Product Type 120 Q&A With Explanations
156 pages
Open Source Data Engineering Landscape 2024 by Alireza Sadeghi Feb, 2024 Medium
No ratings yet
Open Source Data Engineering Landscape 2024 by Alireza Sadeghi Feb, 2024 Medium
25 pages
Single-Row Functions
No ratings yet
Single-Row Functions
3 pages
Unstructured Dataload Into Hive Database Through PySpark
No ratings yet
Unstructured Dataload Into Hive Database Through PySpark
9 pages
Data Engineering 6 Months Plan
No ratings yet
Data Engineering 6 Months Plan
3 pages
ADF Course Content
No ratings yet
ADF Course Content
11 pages
Python Questions With Solutions
No ratings yet
Python Questions With Solutions
3 pages
PSD02 - Data Science Overview
No ratings yet
PSD02 - Data Science Overview
64 pages
POA - Tracker
No ratings yet
POA - Tracker
60 pages
Azure Data Engineer: Venkata Krishna Rao Gundapu
No ratings yet
Azure Data Engineer: Venkata Krishna Rao Gundapu
2 pages
Introduction To MapReduce
No ratings yet
Introduction To MapReduce
43 pages
Hive Interview Questions Answers
No ratings yet
Hive Interview Questions Answers
6 pages
Apache Spark
No ratings yet
Apache Spark
62 pages
Hemanshu Kumar Saraf - Resume New
No ratings yet
Hemanshu Kumar Saraf - Resume New
1 page
Jarupula Praveen
No ratings yet
Jarupula Praveen
7 pages
(English (Auto-Generated) ) Building End-to-End Delta Pipelines On GCP (DownSub - Com)
No ratings yet
(English (Auto-Generated) ) Building End-to-End Delta Pipelines On GCP (DownSub - Com)
24 pages
Azure Cloud Intro
No ratings yet
Azure Cloud Intro
34 pages
4 Data-Testing PDF
No ratings yet
4 Data-Testing PDF
79 pages
C2 Databricks - Sparks - EE
No ratings yet
C2 Databricks - Sparks - EE
9 pages
Set Your Data in Motion
No ratings yet
Set Your Data in Motion
8 pages
Koustav BigData Resume
No ratings yet
Koustav BigData Resume
2 pages
Azure Synpase Analytics Service
No ratings yet
Azure Synpase Analytics Service
22 pages
Data Lake Bootcamp: Building Reliable Data Lakes
No ratings yet
Data Lake Bootcamp: Building Reliable Data Lakes
29 pages
ABD22 1st Exam - 6 January - Attempt Review
No ratings yet
ABD22 1st Exam - 6 January - Attempt Review
13 pages
Data Engineering Explanation
No ratings yet
Data Engineering Explanation
43 pages
Exam DP 100 Data Science Solution On Azure Skills Measured
No ratings yet
Exam DP 100 Data Science Solution On Azure Skills Measured
6 pages
Databricks
No ratings yet
Databricks
11 pages
Nagarjuna Hadoop Resume
No ratings yet
Nagarjuna Hadoop Resume
7 pages
Applied Coding Track
No ratings yet
Applied Coding Track
10 pages
De Mod 5 Deploy Workloads With Databricks Workflows
No ratings yet
De Mod 5 Deploy Workloads With Databricks Workflows
19 pages
4.2.4 - Data Source Architectural Patterns
No ratings yet
4.2.4 - Data Source Architectural Patterns
20 pages
MIS - Database Management Systems
100% (1)
MIS - Database Management Systems
33 pages
Finals CS-352-LEC-1913T
No ratings yet
Finals CS-352-LEC-1913T
20 pages
Using Hibernate in A Java Swing Application
No ratings yet
Using Hibernate in A Java Swing Application
22 pages
Access Part 1
No ratings yet
Access Part 1
5 pages
Adbms
No ratings yet
Adbms
3 pages
Traditional File Oriented Approach
No ratings yet
Traditional File Oriented Approach
6 pages
Coursera Ibm Data
No ratings yet
Coursera Ibm Data
1 page
Upgrading Oracle Application 11i To e
No ratings yet
Upgrading Oracle Application 11i To e
23 pages
SQL Server Theory
No ratings yet
SQL Server Theory
2 pages
PHP With Mysql
No ratings yet
PHP With Mysql
3 pages
DBMS MST-1 Paper New
No ratings yet
DBMS MST-1 Paper New
1 page
Quick Guide To Installing Oracle 9i Client On A Controller 8 Application Server, and Configuring Afterwards
No ratings yet
Quick Guide To Installing Oracle 9i Client On A Controller 8 Application Server, and Configuring Afterwards
18 pages
Mid-Term Review-Questions
No ratings yet
Mid-Term Review-Questions
7 pages
CIS Ubuntu Linux 18.04 LTS Benchmark v1.0.0
No ratings yet
CIS Ubuntu Linux 18.04 LTS Benchmark v1.0.0
410 pages
Active Directory BloodHound
No ratings yet
Active Directory BloodHound
139 pages
Talend Data Catalog Basics - Assessment
No ratings yet
Talend Data Catalog Basics - Assessment
13 pages
Project Assignment.2025
No ratings yet
Project Assignment.2025
2 pages
Doubly-Linked Lists
No ratings yet
Doubly-Linked Lists
16 pages
Gis History
No ratings yet
Gis History
1 page
Using Binary Search With SQL Injection
No ratings yet
Using Binary Search With SQL Injection
3 pages
MIS
No ratings yet
MIS
55 pages
SQL Database Project Class12
No ratings yet
SQL Database Project Class12
19 pages
11i 10gDB Migration
No ratings yet
11i 10gDB Migration
7 pages
Lec 3
No ratings yet
Lec 3
15 pages
PRACTICAL CS XII MySQL 2022-23
No ratings yet
PRACTICAL CS XII MySQL 2022-23
18 pages
Figures For Chapter 8 Introduction To Data Mining: by Tan, Steinbach, Kumar
No ratings yet
Figures For Chapter 8 Introduction To Data Mining: by Tan, Steinbach, Kumar
41 pages
SampleQs SAP 001042023
No ratings yet
SampleQs SAP 001042023
2 pages
2.1 - Multi-Dimensional Data Model
No ratings yet
2.1 - Multi-Dimensional Data Model
4 pages
ORACLe Backup Policy
No ratings yet
ORACLe Backup Policy
2 pages
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
From Everand
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
Eric Tome
No ratings yet
Getting Started with Greenplum for Big Data Analytics
From Everand
Getting Started with Greenplum for Big Data Analytics
Sunila Gollapudi
No ratings yet
Microsoft Certified: Power BI Data Analyst Associate PL 300 Practice Tests
From Everand
Microsoft Certified: Power BI Data Analyst Associate PL 300 Practice Tests
CertSquad Professional Trainers
No ratings yet

Piyush Data Science 3

Uploaded by

Piyush Data Science 3

Uploaded by

Roadmap for Data Science & Roles

Piyush Shankar Garg

Application Development SQL , Enterprise

Predictive Modelling, ML &

Skills • Programming Proficient Knowledge • Expert in Excel • Deep knowledge of

Area Learning Path Indicative Time in Weeks with

Common Skills Excel (Basic and Advanced) 3-5 weeks

Data Engineer Cloud Pipeline , Cloud Platform 20-22 weeks

Data Analyst Data Cleaning and Understanding of business 8-10 weeks

Business Intelligence 6-8 weeks

Data Scientist Machine Learning & AI concepts 5-7 weeks

Supervised Learning 6-8 weeks

Unsupervised Learning 6-8 weeks

Deep learning and Generative AI 7-9 weeks

Model Optimization and Evaluation 5-7 weeks

Querying Data (1-2 weeks) Learning Resource Details

Regression Analysis (2 weeks) Learning Resource Details

Understand Linear Algebra

Introduction to Big Data and Distributed Processing (2 week)

Handling Missing Data Learning Resource Details

Dealing with Outliers Data Cleaning https://www.youtube.com/watch?v=ITy8R4278

Data Standardization, Normalization and Scaling Techniques

Dealing with Inconsistent Data , Duplicate Data

Advanced DAX for Power BI Advanced DAX https://www.udemy.com/course/advanced-dax-

Introduction to ML Learning Resource Details

Linear Regression K-Means Clustering

Logistics Regression Hierarchical Clustering

Decision Trees and Random Forests

Support Vector Machines (SVM)

• Hyperplanes and support vectors

Neural Networks and Deep Learning Understanding Generative AI applications

ML for Beginners https://microsoft.github.io/ML-For-Beginners/#/

Probabilistic Machine Learning by Kevin P Murphy (Book)

Gradient Boost Video https://www.youtube.com/watch?v=3CC4N4z3GJc

XGBoost Videos https://xgboost.readthedocs.io/en/stable/

You might also like