Data science curriculum

The document provides a comprehensive overview of data science, covering its relevance in society, various disciplines such as business analytics, machine learning, and artificial intelligence, as well as common techniques and tools used in the field. It also includes detailed sections on SQL databases, statistics, version control with Git and GitHub, Power BI, Python programming, data visualization libraries like Matplotlib and Seaborn, and machine learning concepts. Each section is structured to guide learners through foundational knowledge to advanced applications in data science.

FOUNDATION

INTRODUCTION TO DATA SCIENCE

- Where data science fits into today’s society.


- Why are there so many business and data science buzzwords?
- Analysis vs Analytics
- Intro to Business Analytics, Data Analytics, and Data Science
- Adding Business Intelligence (BI), Machine Learning (ML), and Artificial Intelligence (AI) to the picture
- The Relationship Between Different Data Science Fields
- When are Traditional Data, Big Data, BI, Traditional Data Science and ML applied?
- What Is The Purpose Of Each Data Science Field?
- Why do we Need each of these Disciplines?
- Common Data Science Techniques
- Traditional Data: Techniques
- Traditional Data: Real-life Examples
- Big Data: Techniques
- Big Data: Real-life Examples
- Business Intelligence (BI): Techniques
- Business Intelligence (BI): Real-life Examples
- Traditional Methods: Techniques
- Traditional Methods: Real-life Examples
- Machine Learning (ML): Techniques
- Machine Learning (ML): Types of Machine Learning
- Machine Learning (ML): Real-life Examples
- Common Data Science Tools
- Programming Languages & Software Employed in Data Science - All the Tools You Need
- Data Science Job Positions: What Do They Involve And What To Look Out For?
- Dispelling Common Misconceptions

SQL AND DATABASES FOR DATA SCIENCE
Getting Started & Installation
​ What Is A Database?
​ SQL vs. MySQL
​ Installation
Creating Databases & Tables
​ Showing Databases
​ Creating Databases
​ Dropping and Using Databases
​ Introducing Tables
​ Data Types: The Basics
​ Creating Tables
​ How Do We Know It Worked?
​ Dropping Tables
​ Tables Basics Activity
​ MySQL Comments
Inserting Data
​ INSERT: The Basics
​ A Quick Preview of SELECT
​ Multi-inserts
​ Working With NOT NULL
​ Sidenote: Quotes In MySQL
​ Adding DEFAULT Values
​ Introducing Primary Keys
​ Working With AUTO_INCREMENT

CRUD Basics
​ Introducing CRUD
​ Getting Our New "Dataset"
​ Officially Introducing SELECT
​ The WHERE clause
​ Aliases
​ Using UPDATE
​ A Quick Rule Of Thumb
​ Introducing DELETE
String Functions
​ The World Of String Functions
​ Loading Our Books Data
​ CONCAT
​ SUBSTRING
​ Combining String Functions
​ Sidenote: SQL Formatting
​ REPLACE
​ REVERSE
​ CHAR_LENGTH
​ UPPER & LOWER
​ Other String Functions
Refining Selections
​ Adding Some New Books
​ DISTINCT
​ ORDER BY
​ More On ORDER BY
​ LIMIT
​ LIKE
​ Escaping Wildcards
Aggregate Functions
​ Count Basics
​ GROUP BY
​ MIN and MAX Basics
​ Subqueries
​ Grouping By Multiple Columns
​ MIN and MAX With GROUP BY
​ SUM
​ AVG
​ Aggregate Functions Docs
Revisiting Data Types
​ Surveying Other Data Types
​ CHAR vs. VARCHAR
​ INT, TINYINT, BIGINT, etc.
​ DECIMAL
​ FLOAT & DOUBLE
​ DATE and TIME
​ Working With Dates
​ CURDATE, CURTIME, & NOW
​ Date Functions
​ Time Functions
​ Formatting Dates
​ Date Maths
​ TIMESTAMPS
​ DEFAULT & ON UPDATE TIMESTAMPS
Comparison & Logical Operators
​ Not Equal
​ NOT LIKE
​ Greater Than
​ Less Than Or Equal To
​ Logical AND
​ Logical OR
​ Between
​ Comparing Dates
​ The IN Operator
​ CASE
​ IS NULL
Constraints & ALTER TABLE
​ UNIQUE Constraint
​ CHECK Constraints
​ Named Constraints
​ Multiple Column Constraints
​ ALTER TABLE: Adding Columns
​ ALTER TABLE: Dropping Columns
​ ALTER TABLE: Renaming
​ ALTER TABLE: Modifying Columns
​ ALTER TABLE: Constraints
One to Many & Joins
​ Data is Messy
​ Relationships Basics
​ One to Many Relationship
​ Working with FOREIGN KEY
​ Cross Joins
​ Inner Joins
​ Inner Joins With Group By
​ Left Join
​ Left Join With Group By
​ Right Join
​ On Delete Cascade
Many to Many
​ Many to Many Basics
​ Creating Our Many To Many Tables
​ TV Series Challenge #1
​ TV Series Challenge #2
​ TV Series Challenge #3
​ TV Series Challenge #4
​ TV Series Challenge #5
​ TV Series Challenge #6
​ TV Series Challenge #7
Views, Modes, & More!
​ Introducing Views
​ Updateable Views
​ Replacing/Altering Views
​ HAVING clause
​ WITH ROLLUP
​ SQL Modes Basics
​ STRICT_TRANS_TABLES
​ Slicer
​ Synchronising slicers to multiple pages
​ Slicer Warning
​ Adding more control to your visualisations - Filters and slicers
​ Sort visuals
​ Configure small multiples
​ Use Bookmarks for reports
​ Group and layer visuals by using the Selection pane
​ Adding more control to your visualisations
​ Drillthrough
​ Buttons and Actions
​ Page Navigation and Drill through actions
​ Enable Natural Language Queries (Ask A Question) and Page Formatting
​ Tooltip Pages
​ Page and Bookmark Navigator
​ Adding more control to your visualisations - Part
STATISTICS FOR DATA SCIENCE
- Introduction to Statistical Research Methods
- Data Visualization
- Measures of Central Tendency
- Variability
- Standardisation
- Normal Distribution
- Sampling Distributions
- Estimation
- Hypothesis Testing
- t-Tests
- One-way Analysis of Variance (ANOVA)
- Two-way Analysis of Variance (ANOVA)
- Correlation
- Regression
- Chi-Squared Tests

VERSION CONTROL - GIT AND GITHUB


- The Terminal
- Install Git Bash on Windows
- Introduction to Version Control and Git
- Version Control using Git and the Command Line
- Github and Remote Repositories
- Gitignore
- Cloning
- Branching and Merging
- Forking and Pull Requests
- Setting Up Comet

Power BI

1. Getting Started with Power BI:

- Understanding Power BI Desktop, Power BI Service, and Power BI Mobile

- Importing data from various sources (Excel, CSV, SQL Server, Web, etc.)

- Basic navigation and interface of Power BI Desktop

2. Data Preparation:

- Data cleaning and transformation using Power Query Editor

- Merging and appending queries

- Data types and error handling

3. Data Modeling:

- Creating relationships between tables

- Understanding and using star and snowflake schemas

- Managing relationships (one-to-one, one-to-many, many-to-many)

- Using calculated columns and tables

4. DAX (Data Analysis Expressions):

- Basics of DAX syntax and functions

- Creating calculated columns and measures


- Understanding row context and filter context

- Common DAX functions (SUM, COUNT, AVERAGE, MIN, MAX)

- Time intelligence functions (DATEADD, DATESYTD, SAMEPERIODLASTYEAR)

- Advanced DAX functions (CALCULATE, ALL, FILTER, RELATED)

5. Visualization:

- Creating and customizing basic charts (bar, line, pie, scatter, etc.)

- Using slicers for filtering data

- Creating and customizing tables and matrices

- Using maps and geographical data visualizations

- Custom visualizations from the marketplace

6. Advanced Visualization:

- Using bookmarks and selections for interactive reports

- Creating drill-through and drill-down reports

- Using tooltips for enhanced data presentation

- Implementing conditional formatting

7. Reports and Dashboards:

- Designing report layouts and themes

- Creating and managing dashboards in Power BI Service

- Pinning visuals to dashboards

- Using Q&A feature for natural language queries

8. Power BI Service:

- Publishing reports to Power BI Service

- Understanding workspaces, apps, and content packs


- Managing datasets and data refresh schedules

- Sharing reports and dashboards with stakeholders

- Collaborating with team members

9. Power BI Embedded:

- Integrating Power BI reports into applications

- Using Power BI REST API for automation

10. Security:

- Implementing row-level security (RLS)

- Managing roles and permissions

- Understanding and applying data protection and compliance measures

11. Performance Optimization:

- Optimizing data models for performance

- Using Performance Analyzer tool

- Best practices for efficient report design

12. Advanced Analytics:

- Using AI visuals (Key Influencers, Decomposition Tree, Q&A Visual)

- Integrating R and Python scripts in Power BI

- Implementing what-if parameters for scenario analysis

13. Power BI Integration:

- Connecting Power BI with other Microsoft services (Excel, Azure, SQL Server)

- Integrating with third-party tools and data sources

- Using Power Automate for workflow automation


14. Power BI Administration:

- Managing Power BI gateway for on-premises data sources

- Monitoring usage and performance

- Implementing governance and best practices for organisation-wide usage

15. Power BI Community and Resources:

- Participating in Power BI community forums and events

- Utilising Power BI documentation and learning resources

- Staying updated with new features and updates

PYTHON FOR DATA SCIENCE


Why Python Programming
- Introduction to Python and its popularity
- Python's use in various domains (Web development, Data science, Automation, etc.)
- Advantages of Python over other programming languages
- Python community and resources
Data Types and Operators
- Variables and data types (integers, floats, strings, booleans)
- Type conversion and casting
- Basic operators (arithmetic, comparison, logical)
- String manipulation and formatting
- Working with variables and constants
Data Structures in Python
- Lists: creation, indexing, slicing, and manipulation
- Tuples: immutability and use cases
- Dictionaries: key-value pairs and dictionary methods
- Sets: unique elements and set operations
- Lists vs. Tuples vs. Dictionaries vs. Sets
Control Flow
- Conditional statements (if, elif, else)
- Loops (for and while loops)
- Loop control statements (break, continue)
- Using loops for iteration and pattern printing
- Exception handling (try, except, finally)

Functions
- Defining and calling functions
- Parameters and arguments
- Return statements and function documentation (docstrings)
- Scope and lifetime of variables
- Lambda functions and built-in functions

Scripting:
- Reading and writing files
- Command-line arguments (sys.argv)
- Creating and running Python scripts
- Understanding shebang (#!/usr/bin/env python)
- Organising code into modules and packages
NUMPY FOR DATA SCIENCE
- Introduction to NumPy and its importance in data science
- Creating NumPy arrays
- Array indexing and slicing
- Array manipulation and broadcasting
- Mathematical operations with NumPy arrays
- Loading and saving data using NumPy
PANDAS FOR DATA WRANGLING
- Introduction to Pandas for data manipulation and analysis
- Series and DataFrame objects
- Loading data into Pandas
- Data exploration and basic statistics
- Data cleaning and handling missing values
- Data filtering, selection, and sorting
- Data visualisation with Pandas
- What is data wrangling and why is it important?
- Data acquisition methods (reading from files, web scraping, APIs)
- Data cleaning techniques (handling missing values, dealing with duplicates)
- Data transformation (reshaping data, merging and joining datasets)
- Data aggregation and grouping
- Data normalisation and scaling
- Dealing with outliers
- Handling categorical data (label encoding and one-hot encoding)
- Date and time data manipulation
- Introduction to data quality and validation
- Advanced Pandas techniques for data manipulation (pivot tables, melt, stack, unstack)
- Combining and merging DataFrames (concatenation, merging on keys)
- Data filtering and selection (loc, iloc)
- Using Pandas functions to clean and transform data
- Handling missing data with Pandas
- Applying custom functions to data using Pandas

MATPLOTLIB
- Introduction to Matplotlib and its role in data visualisation
- Basic plotting with Matplotlib (line plots, scatter plots, bar charts)
- Customising plots (labels, titles, legends)
- Subplots and figure customization
- Advanced plotting techniques (histograms, box plots, heatmaps)
- Saving and exporting plots in different formats
SEABORN
- Introduction to Seaborn and its advantages over Matplotlib
- Seaborn's aesthetics and built-in themes
- Creating statistical visualisations (distribution plots, categorical plots)
- Visualising relationships (scatter plots, pair plots, heatmaps)
- Advanced customization and styling in Seaborn
- Combining Seaborn with Pandas DataFrames for effective data exploration

VISUALIZATION
- Univariate Exploration of Data
- In this lesson, you will see how you can use Matplotlib and Seaborn to produce informative visualisations of single variables.
- Bivariate Exploration of Data
- Multivariate Exploration of Data
- Explanatory Visualisations

MACHINE LEARNING
ADVANCED REGRESSION
- Introduction To Machine Learning
- Predictive Modelling And Classification
- Assessing Accuracy And The Train-Test Split
- Statistical Learning
- Linear Models
- Least Squares Regression
- Splitting Datasets
- The Train/Test Split
- Multiple Linear Regression
- Variables And Variable Selection
- Feature Engineering
- Saving And Restoring Models
- Regularisation - Data Scaling
- Regularisation: Ridge Regression
- Regularisation: LASSO Regression
- Decision Trees
- Bias-Variance Tradeoff
- Parametric Methods, Ensembling And Bootstrapping
- Random Forests
ADVANCED CLASSIFICATION
- Advanced Classification
- Natural Language Processing
- How Machines Understand Language
- Logistic Regression
- Intro To Binary Classification Using Logistic Regression
- Classification Metrics
- Model Improvements
- Improving Classification Models
- Dealing With Imbalanced Data
- Tree-Based Classification Methods
- Training A Decision Tree
- Tree-Based Methods For Classification
- Support Vector Classification
- Support Vector Machines
- Nearest Neighbours And Naive Bayes
- KNNs And Naive Bayes
- Hyperparameter Tuning & Model Validation
- Hyperparameters And Model Validation
- Neural Network Classifiers
- Classifier Model Selection
- Build All The Classifiers

UNSUPERVISED LEARNING
- Principal Component Analysis
- Advanced Dimensionality Reduction
- Advanced Dimensionality Reduction Techniques
- K-Means Clustering
- Hierarchical Clustering
- Gaussian Mixture Models
- Clustering And Geospatial Analysis
- Recommender Systems
Introduction to Streamlit

○ What is Streamlit?
○ Installing Streamlit
○ Basic Streamlit Concepts: Widgets, Layouts, and State Management
○ Running and Sharing Streamlit Apps

Streamlit Components and Layouts

○ Advanced Layouts and Widgets


○ Creating Interactive User Interfaces
○ Integrating Plotly, Matplotlib, and Altair with Streamlit

Introduction to Big Data

○ What is Big Data?


○ Characteristics of Big Data (Volume, Velocity, Variety, Veracity)
○ Overview of Big Data Technologies (Hadoop, Spark, NoSQL)
○ Data Storage: HDFS, Cloud Storage

Data Wrangling with PySpark

● Topics Covered:
○ Introduction to Apache Spark
○ Working with PySpark DataFrames
○ Data Cleaning and Transformation with PySpark

Data Visualization for Big Data

● Topics Covered:
○ Visualisation Techniques for Large Datasets
○ Aggregation and Filtering in PySpark
○ Integrating PySpark with Streamlit for Real-Time Visualisations

Connecting Streamlit with Big Data Storage

● Topics Covered:
○ Connecting Streamlit to Cloud Storage (AWS S3, Google Cloud Storage)
○ Streaming Data into Streamlit from Big Data Sources
○ Real-time Data Processing with Kafka and Streamlit

Machine Learning on Big Data

● Topics Covered:
○ Introduction to Machine Learning on Big Data
○ Using MLlib with PySpark
○ Integrating Machine Learning Models in Streamlit

Advanced Streamlit Features

○ Custom Components in Streamlit


○ Deploying Streamlit Apps on Heroku, AWS, and Google Cloud
○ Streamlit Authentication and Security

Big Data Project Development

● Topics Covered:
○ Project Planning and Management
○ Integrating All Components: Data Ingestion, Processing, Visualization, and Machine Learning
○ Optimising Streamlit Apps for Performance
Projects
The Blackjack Capstone Project
Higher Lower Game
Data Analysis of a CSV File
Obtain a dataset in CSV format (e.g., from Kaggle or other open datasets).
Use Pandas to load and clean the data.
Perform exploratory data analysis (EDA) using Pandas and NumPy to answer
questions and visualise patterns in the data.
Generate summary statistics, histograms, and other visualisations to gain insights
from the dataset.
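The load, clean, and summarise steps above can be sketched without any third-party dependency. This is a minimal illustration only: the inline CSV and its `city` and `temp` columns are hypothetical stand-ins for a real dataset, and in the project itself Pandas (for example `read_csv` and `describe`) would do the same work on a file.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a downloaded CSV file.
RAW = """city,temp
Lagos,31.0
Nairobi,
Accra,29.0
Cairo,35.0
"""

def load_clean(text):
    """Parse CSV text and drop rows whose 'temp' value is missing."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"city": r["city"], "temp": float(r["temp"])}
        for r in rows
        if r["temp"].strip()
    ]

def summarise(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

records = load_clean(RAW)
summary = summarise([r["temp"] for r in records])
print(summary)
```

Dropping incomplete rows is only one cleaning policy; filling missing values with a mean or median is an equally common choice and depends on the question being asked.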

Stock Portfolio Analysis


Retrieve historical stock price data using pandas-datareader or a market-data API.
Create a Pandas DataFrame to store and manipulate the data.
Calculate and visualise portfolio statistics, such as returns, volatility, and risk-adjusted
performance.
Implement simple portfolio optimization strategies, such as the Markowitz Efficient
Frontier.
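As a rough sketch of the portfolio statistics mentioned above, the snippet below computes simple returns, volatility, and a Sharpe-style ratio for a single asset. The prices are made up, the 252 trading periods per year is an assumed convention, and the risk-free rate is assumed to be zero; a real project would work on a Pandas DataFrame of downloaded prices.

```python
import math
import statistics

# Hypothetical closing prices for one asset (not real market data).
prices = [100.0, 102.0, 101.0, 105.0, 107.0]

def simple_returns(series):
    """Period-over-period simple returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1 for prev, curr in zip(series, series[1:])]

def annualise(mean_ret, vol, periods_per_year=252):
    """Scale per-period mean and volatility to annual figures."""
    return mean_ret * periods_per_year, vol * math.sqrt(periods_per_year)

returns = simple_returns(prices)
mean_ret = statistics.mean(returns)
vol = statistics.stdev(returns)      # sample standard deviation of returns
ann_ret, ann_vol = annualise(mean_ret, vol)
sharpe = ann_ret / ann_vol           # assumes a risk-free rate of 0

print(f"mean={mean_ret:.4f} vol={vol:.4f} sharpe={sharpe:.2f}")
```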

Customer Segmentation
Obtain a customer dataset (e.g., retail sales data or online store data).
Use Pandas to preprocess and clean the dataset.
Utilise NumPy for clustering algorithms like k-means to segment customers based on
their purchase behaviour.
Visualise customer segments and analyse their characteristics.
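To make the clustering step concrete, here is a small pure-Python sketch of k-means (Lloyd's algorithm). The customer tuples and the choice of starting centroids are hypothetical, and the features are left unscaled for brevity; in the project the same loop would normally be vectorised with NumPy and run on standardised features.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:     # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical customers: (orders per year, average basket value).
customers = [(2, 20.0), (3, 25.0), (2, 22.0), (20, 90.0), (22, 95.0), (19, 88.0)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

Passing the starting centroids explicitly keeps the run deterministic; random initialisation with several restarts is the more usual practice.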

Time Series Forecasting


Collect a time series dataset (e.g., stock prices, weather data).
Load and manipulate the data with Pandas.
Use NumPy and Pandas to implement time series forecasting models like moving
averages, exponential smoothing, or ARIMA.
Visualise the time series data and the forecasted values.
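Two of the forecasting models named above, moving averages and simple exponential smoothing, are short enough to sketch directly. The `demand` series and the smoothing factor `alpha=0.5` are illustrative assumptions; ARIMA is not covered here and would typically come from a dedicated library.

```python
def moving_average(series, window):
    """Trailing moving average; one value per full window."""
    return [
        sum(series[i - window:i]) / window
        for i in range(window, len(series) + 1)
    ]

def exp_smooth(series, alpha):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10, 12, 14, 16, 18]          # hypothetical observations
ma = moving_average(demand, window=3)
ses = exp_smooth(demand, alpha=0.5)

# The last value of each smoother serves as a one-step-ahead forecast.
print(ma[-1], ses[-1])
```

Both smoothers lag behind a steadily rising series like this one, which is one reason trend-aware models are introduced later.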
Movie Recommender System
Acquire a movie ratings dataset (e.g., MovieLens dataset).
Clean and preprocess the data using Pandas.
Implement a basic movie recommender system using NumPy and Pandas, based on
user ratings and movie metadata.
Provide movie recommendations for a given user.
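One possible shape for a basic recommender is user-based collaborative filtering with cosine similarity, sketched below. The `ratings` dictionary is invented for illustration, and ranking by the sum of similarity times rating is just one simple scoring choice, not the only one; with a real dataset such as MovieLens the ratings would live in a Pandas DataFrame.

```python
import math

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Jaws": 4},
    "cat": {"Up": 5, "Coco": 5, "Alien": 1},
}

def cosine(a, b):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, k=2):
    """Rank unseen movies by the sum of similarity * rating over other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for movie, r in theirs.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("ann"))
```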

E-commerce Sales Analysis


Collect e-commerce sales data, including customer transactions and product
information.
Use Pandas for data cleaning and merging datasets.
Analyse sales trends, customer behaviour, and product performance using Pandas and
NumPy.
Create visualisations and reports to summarise the findings.

Data Cleaning and Transformation Tool


Develop a tool that allows users to upload messy datasets.
Use Pandas to clean and transform the data, addressing common data quality issues
like missing values, duplicates, and inconsistent formatting.
Provide options for data export in different formats (e.g., CSV, Excel) after cleaning.
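The cleaning and export steps could look roughly like the sketch below, which uses only the standard library. The `messy` rows, the `"unknown"` fill value, and the decision to lowercase every field are illustrative assumptions; a real tool would apply per-column rules and would more likely use Pandas (`drop_duplicates`, `fillna`, `to_csv`).

```python
import csv
import io

def clean_rows(rows, fill_value="unknown"):
    """Trim whitespace, normalise case, fill blanks, and drop duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        fixed = {
            key: (value.strip().lower() or fill_value)
            for key, value in row.items()
        }
        fingerprint = tuple(fixed.items())
        if fingerprint not in seen:      # keep the first occurrence only
            seen.add(fingerprint)
            cleaned.append(fixed)
    return cleaned

def to_csv(rows):
    """Serialise cleaned rows back to CSV text for export."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical upload with inconsistent formatting, a duplicate, and a blank.
messy = [
    {"name": " Alice ", "city": "PARIS"},
    {"name": "alice", "city": "Paris"},
    {"name": "Bob", "city": ""},
]
tidy = clean_rows(messy)
print(to_csv(tidy))
```

Normalising before deduplicating matters here: the first two rows only become identical once whitespace and case are fixed.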

House Price Prediction


Utilise a dataset of housing prices, including features like square footage, number of
bedrooms, and location.
Build regression models (linear regression, decision tree regression, or random forest
regression) to predict house prices.
Evaluate and compare model performance using metrics like Mean Absolute Error
(MAE) and Root Mean Squared Error (RMSE).
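The two evaluation metrics named above are easy to compute by hand, which helps when comparing models. In this sketch the prices and both sets of predictions are hypothetical; they are chosen so that the two models share the same MAE but differ in RMSE, because RMSE penalises a single large miss more heavily.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Hypothetical prices (in thousands) and predictions from two models.
actual = [200, 250, 300, 350]
model_a = [210, 240, 310, 340]   # off by 10 on every house
model_b = [200, 250, 300, 390]   # exact except for one large miss

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, mae(actual, pred), rmse(actual, pred))
```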
Energy Consumption Forecasting
Gather time-series data on energy consumption along with weather-related features.
Develop a time series forecasting model (e.g., ARIMA, LSTM) to predict future
energy consumption.
Assess the accuracy of the model's predictions.
Stock Price Prediction
Collect historical stock price data for a specific company or stock market index.
Implement a time series regression model to predict future stock prices.
Evaluate the model's performance using metrics like Mean Squared Error (MSE) and
visualise the predictions.

Customer Churn Prediction


Work with customer data from a business (telecom, subscription service, etc.).
Create a classification model (logistic regression, random forest, or support vector
machine) to predict customer churn.
Evaluate the model's accuracy, precision, recall, and F1-score.
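The evaluation metrics listed above all derive from the confusion-matrix counts. The following sketch computes them for a binary churn label, assuming `1` means "churned"; the label vectors are invented, and in practice a library such as scikit-learn provides equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical ground truth and predictions (1 = churned, 0 = stayed).
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With churn data the positive class is usually the minority, so accuracy alone can look good while recall stays poor; reporting all four numbers guards against that.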

Sentiment Analysis on Social Media


Collect social media data (e.g., tweets or reviews) related to a product or topic of
interest.
Build a text classification model using techniques like natural language processing
(NLP) and sentiment analysis.
Analyse sentiment trends and sentiment distribution.

Image Classification (e.g., MNIST, CIFAR-10)


Use popular image datasets like MNIST or CIFAR-10.
Create a convolutional neural network (CNN) for image classification tasks.
Visualise the model's performance and make predictions on new images.

Customer Segmentation
Apply clustering algorithms like k-means or hierarchical clustering to segment
customers based on their purchasing behaviour.
Analyse customer segments and develop targeted marketing strategies.
Anomaly Detection in Network Traffic
Work with network traffic data and focus on anomaly detection.
Implement unsupervised learning techniques (e.g., isolation forests or autoencoders)
to identify unusual patterns or attacks in network traffic.
Topic Modeling for Text Data
Use a dataset of text documents (e.g., news articles, research papers).
Apply topic modelling techniques like Latent Dirichlet Allocation (LDA) to discover
underlying topics within the documents.

Market Basket Analysis


Work with transaction data from a retail store.
Use association rule mining (e.g., Apriori algorithm) to identify patterns in customer
purchasing behaviour.
Suggest product recommendations based on frequent itemsets.
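The ideas of support and confidence behind association rules can be shown with a brute-force count of item pairs. Note that this is a simplification and not the full Apriori algorithm, which prunes candidate itemsets level by level; the transactions and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets from a retail store.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs meeting min_support."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also hold the consequent."""
    with_a = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_a) / len(with_a)

pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
print(confidence(transactions, "butter", "bread"))
```

A rule such as "butter implies bread" with high confidence is the kind of pattern that can feed a product recommendation.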

Introduction to Streamlit

● Practical Exercise:
○ Build a basic Streamlit app that displays text, images, and charts.
● Assignment:
○ Create a simple dashboard with user inputs (e.g., sliders, checkboxes).

Streamlit Components and Layouts

● Practical Exercise:
○ Develop a Streamlit app with complex layouts and multiple interactive charts.
● Assignment:
○ Design a multi-page Streamlit app.

Introduction to Big Data

● Practical Exercise:
○ Explore a small dataset using traditional methods.
● Assignment:
○ Write a brief report on the challenges and opportunities of Big Data.
Data Wrangling with PySpark

● Practical Exercise:
○ Process a medium-sized dataset using PySpark.
● Assignment:
○ Clean and transform a dataset using PySpark and load it into a Streamlit app.

Data Visualization for Big Data

● Practical Exercise:
○ Visualize a large dataset in Streamlit using PySpark.
● Assignment:
○ Build a data dashboard in Streamlit that visualizes trends in a large dataset.

Connecting Streamlit with Big Data Storage

● Practical Exercise:
○ Set up a connection between Streamlit and a cloud storage service.
● Assignment:
○ Create a Streamlit app that pulls data from a cloud storage service and
visualizes it.

Machine Learning on Big Data

● Practical Exercise:
○ Build and deploy a machine learning model using PySpark and Streamlit.
● Assignment:
○ Develop a Streamlit app that allows users to train and test a machine learning
model on large datasets.

Advanced Streamlit Features

● Practical Exercise:
○ Create and deploy a Streamlit app with custom components.
● Assignment:
○ Secure a Streamlit app and deploy it to a cloud platform.

Big Data Project Development


● Practical Exercise:
○ Start working on a capstone project that integrates Streamlit and Big Data.
● Assignment:
○ Submit a project proposal outlining the scope, objectives, and technologies
used.

Capstone Project Presentation

● Practical Exercise:
○ Complete and present the capstone project.
● Assignment:
○ Submit the final project and present it to the class.
