Data science curriculum
CRUD Basics
Introducing CRUD
Getting Our New "Dataset"
Officially Introducing SELECT
The WHERE clause
Aliases
Using UPDATE
A Quick Rule Of Thumb
Introducing DELETE
String Functions
The World Of String Functions
Loading Our Books Data
CONCAT
SUBSTRING
Combining String Functions
Sidenote: SQL Formatting
REPLACE
REVERSE
CHAR_LENGTH
UPPER & LOWER
Other String Functions
Refining Selections
Adding Some New Books
DISTINCT
ORDER BY
More On ORDER BY
LIMIT
LIKE
Escaping Wildcards
Aggregate Functions
Count Basics
GROUP BY
MIN and MAX Basics
Subqueries
Grouping By Multiple Columns
MIN and MAX With GROUP BY
SUM
AVG
Aggregate Functions Docs
Revisiting Data Types
Surveying Other Data Types
CHAR vs. VARCHAR
INT, TINYINT, BIGINT, etc.
DECIMAL
FLOAT & DOUBLE
DATE and TIME
Working With Dates
CURDATE, CURTIME, & NOW
Date Functions
Time Functions
Formatting Dates
Date Maths
TIMESTAMPS
DEFAULT & ON UPDATE TIMESTAMPS
Comparison & Logical Operators
Not Equal
NOT LIKE
Greater Than
Less Than Or Equal To
Logical AND
Logical OR
Between
Comparing Dates
The IN Operator
CASE
IS NULL
Constraints & ALTER TABLE
UNIQUE Constraint
CHECK Constraints
Named Constraints
Multiple Column Constraints
ALTER TABLE: Adding Columns
ALTER TABLE: Dropping Columns
ALTER TABLE: Renaming
ALTER TABLE: Modifying Columns
ALTER TABLE: Constraints
One to Many & Joins
Data is Messy
Relationships Basics
One to Many Relationship
Working with FOREIGN KEY
Cross Joins
Inner Joins
Inner Joins With Group By
Left Join
Left Join With Group By
Right Join
On Delete Cascade
Many to Many
Many to Many Basics
Creating Our Many To Many Tables
TV Series Challenge #1
TV Series Challenge #2
TV Series Challenge #3
TV Series Challenge #4
TV Series Challenge #5
TV Series Challenge #6
TV Series Challenge #7
Views, Modes, & More!
Introducing Views
Updateable Views
Replacing/Altering Views
HAVING clause
WITH ROLLUP
SQL Modes Basics
STRICT_TRANS_TABLES
Slicer
Synchronising slicers to multiple pages
Slicer Warning
Adding more control to your visualisations - Filters and slicers
Sort visuals
Configure small multiples
Use Bookmarks for reports
Group and layer visuals by using the Selection pane
Adding more control to your visualisations
Drillthrough
Buttons and Actions
Page Navigation and Drill through actions
Enable Natural Language Queries (Ask A Question) and Page Formatting
Tooltip Pages
Page and Bookmark Navigator
Adding more control to your visualisations - Part
STATISTICS FOR DATA SCIENCE
- Introduction to Statistical Research Methods
- Data Visualization
- Measures of Central Tendency
- Variability
- Standardisation
- Normal Distribution
- Sampling Distributions
- Estimation
- Hypothesis Testing
- t-Tests
- One-way Analysis of Variance (ANOVA)
- Two-way Analysis of Variance (ANOVA)
- Correlation
- Regression
- Chi-Squared Tests
Power BI
- Importing data from various sources (Excel, CSV, SQL Server, Web, etc.)
2. Data Preparation:
3. Data Modeling:
5. Visualization:
- Creating and customizing basic charts (bar, line, pie, scatter, etc.)
6. Advanced Visualization:
8. Power BI Service:
9. Power BI Embedded:
10. Security:
- Connecting Power BI with other Microsoft services (Excel, Azure, SQL Server)
Functions
- Defining and calling functions
- Parameters and arguments
- Return statements and function documentation (docstrings)
- Scope and lifetime of variables
- Lambda functions and built-in functions
Scripting:
- Reading and writing files
- Command-line arguments (sys.argv)
- Creating and running Python scripts
- Understanding shebang (#!/usr/bin/env python)
- Organising code into modules and packages
NUMPY FOR DATA SCIENCE
- Introduction to NumPy and its importance in data science
- Creating NumPy arrays
- Array indexing and slicing
- Array manipulation and broadcasting
- Mathematical operations with NumPy arrays
- Loading and saving data using NumPy
PANDAS FOR DATA WRANGLING
- Introduction to Pandas for data manipulation and analysis
- Series and DataFrame objects
- Loading data into Pandas
- Data exploration and basic statistics
- Data cleaning and handling missing values
- Data filtering, selection, and sorting
- Data visualisation with Pandas
- What is data wrangling and why is it important?
- Data acquisition methods (reading from files, web scraping, APIs)
- Data cleaning techniques (handling missing values, dealing with duplicates)
- Data transformation (reshaping data, merging and joining datasets)
- Data aggregation and grouping
- Data normalisation and scaling
- Dealing with outliers
- Handling categorical data (encoding and one-hot encoding)
- Date and time data manipulation
- Introduction to data quality and validation
- Advanced Pandas techniques for data manipulation (pivot tables, melt, stack, unstack)
- Combining and merging DataFrames (concatenation, merging on keys)
- Data filtering and selection (loc, iloc)
- Using Pandas functions to clean and transform data
- Handling missing data with Pandas
- Applying custom functions to data using Pandas
MATPLOTLIB
- Introduction to Matplotlib and its role in data visualisation
- Basic plotting with Matplotlib (line plots, scatter plots, bar charts)
- Customising plots (labels, titles, legends)
- Subplots and figure customization
- Advanced plotting techniques (histograms, box plots, heatmaps)
- Saving and exporting plots in different formats
SEABORN
- Introduction to Seaborn and its advantages over Matplotlib
- Seaborn's aesthetics and built-in themes
- Creating statistical visualisations (distribution plots, categorical plots)
- Visualising relationships (scatter plots, pair plots, heatmaps)
- Advanced customization and styling in Seaborn
- Combining Seaborn with Pandas DataFrames for effective data exploration
VISUALIZATION
- Univariate Exploration of Data
- Use Matplotlib and Seaborn to produce informative visualisations of single variables.
- Bivariate Exploration of Data
- Multivariate Exploration of Data
- Explanatory Visualisations
MACHINE LEARNING
ADVANCED REGRESSION
- Introduction To Machine Learning
- Predictive Modelling And Classification
- Assessing Accuracy And The Train-Test Split
- Statistical Learning
- Linear Models
- Least Squares Regression
- Splitting Datasets
- The Train/Test Split
- Multiple Linear Regression
- Variables And Variable Selection
- Feature Engineering
- Saving And Restoring Models
- Regularisation: Data Scaling
- Regularisation: Ridge Regression
- Regularisation: LASSO Regression
- Decision Trees
- Bias-Variance Tradeoff
- Parametric Methods, Ensembling And Bootstrapping
- Random Forests
ADVANCED CLASSIFICATION
- Advanced Classification
- Natural Language Processing
- How Machines Understand Language
- Logistic Regression
- Intro To Binary Classification Using Logistic Regression
- Classification Metrics
- Model Improvements
- Improving Classification Models
- Dealing With Imbalanced Data
- Tree-Based Classification Methods
- Training A Decision Tree
- Tree-Based Methods For Classification
- Support Vector Classification
- Support Vector Machines
- Nearest Neighbours And Naive Bayes
- KNNs And Naive Bayes
- Hyperparameter Tuning & Model Validation
- Hyperparameters And Model Validation
- Neural Network Classifiers
- Classifier Model Selection
- Build All The Classifiers
UNSUPERVISED LEARNING
- Principal Component Analysis
- Advanced Dimensionality Reduction
- Advanced Dimensionality Reduction Techniques
- K-Means Clustering
- Hierarchical Clustering
- Gaussian Mixture Models
- Clustering And Geospatial Analysis
- Recommender Systems
Introduction to Streamlit
○ What is Streamlit?
○ Installing Streamlit
○ Basic Streamlit Concepts: Widgets, Layouts, and State Management
○ Running and Sharing Streamlit Apps
● Topics Covered:
○ Introduction to Apache Spark
○ Working with PySpark DataFrames
○ Data Cleaning and Transformation with PySpark
● Topics Covered:
○ Visualisation Techniques for Large Datasets
○ Aggregation and Filtering in PySpark
○ Integrating PySpark with Streamlit for Real-Time Visualisations
● Topics Covered:
○ Connecting Streamlit to Cloud Storage (AWS S3, Google Cloud Storage)
○ Streaming Data into Streamlit from Big Data Sources
○ Real-time Data Processing with Kafka and Streamlit
● Topics Covered:
○ Introduction to Machine Learning on Big Data
○ Using MLlib with PySpark
○ Integrating Machine Learning Models in Streamlit
● Topics Covered:
○ Project Planning and Management
○ Integrating All Components: Data Ingestion, Processing, Visualisation, and
Machine Learning
○ Optimising Streamlit Apps for Performance
Projects
The Blackjack Capstone Project
Higher Lower Game
Data Analysis of a CSV File
Obtain a dataset in CSV format (e.g., from Kaggle or other open datasets).
Use Pandas to load and clean the data.
Perform exploratory data analysis (EDA) using Pandas and NumPy to answer
questions and visualise patterns in the data.
Generate summary statistics, histograms, and other visualisations to gain insights
from the dataset.
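A minimal sketch of the load, clean, and summarise steps described above. The column names (`city`, `price`, `rooms`) and the inline sample data are hypothetical stand-ins for a downloaded CSV; the cleaning choices (dropping rows without a price, filling missing room counts with the median) are illustrative assumptions, not a prescribed method.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded CSV file.
RAW_CSV = """city,price,rooms
Cape Town,1200,2
Durban,950,
Cape Town,1500,3
Durban,950,
Pretoria,,1
"""


def load_and_clean(source):
    """Load a CSV, drop exact duplicate rows, then handle missing values."""
    df = pd.read_csv(source)
    df = df.drop_duplicates()
    # Rows without a price cannot be analysed here, so drop them.
    df = df.dropna(subset=["price"])
    # Fill missing room counts with the column median (an illustrative choice).
    df = df.assign(rooms=df["rooms"].fillna(df["rooms"].median()))
    return df.reset_index(drop=True)


df = load_and_clean(io.StringIO(RAW_CSV))
summary = df["price"].describe()  # count, mean, std, min, quartiles, max
mean_by_city = df.groupby("city")["price"].mean()
print(summary)
print(mean_by_city)
```

To work with a real file, pass its path to `load_and_clean` in place of the `io.StringIO` object. With Matplotlib installed, `df["price"].hist()` would give a quick histogram of the same column.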
Customer Segmentation
Obtain a customer dataset (e.g., retail sales data or online store data).
Use Pandas to preprocess and clean the dataset.
Utilise NumPy for clustering algorithms like k-means to segment customers based on
their purchase behaviour.
Visualise customer segments and analyse their characteristics.
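One way the NumPy-based k-means step could look, sketched as plain Lloyd's algorithm. The `kmeans` helper, the two features (orders per year, average basket value), and the six sample customers are all hypothetical; a real project would first clean the data with Pandas and would probably scale the features.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical features per customer: [orders per year, average basket value].
customers = np.array([
    [2, 15.0], [3, 18.0], [2, 20.0],      # occasional, low-spend
    [20, 90.0], [22, 110.0], [19, 95.0],  # frequent, high-spend
])
centroids, labels = kmeans(customers, k=2)
print(labels)
```

The label numbers themselves are arbitrary (which group is called 0 depends on the random start), so segments are best interpreted through their centroids, for example by comparing the average basket value of each cluster.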
Customer Segmentation
Apply clustering algorithms like k-means or hierarchical clustering to segment
customers based on their purchasing behaviour.
Analyse customer segments and develop targeted marketing strategies.
Anomaly Detection in Network Traffic
Work with network traffic data and focus on anomaly detection.
Implement unsupervised learning techniques (e.g., isolation forests or autoencoders)
to identify unusual patterns or attacks in network traffic.
Topic Modelling for Text Data
Use a dataset of text documents (e.g., news articles, research papers).
Apply topic modelling techniques like Latent Dirichlet Allocation (LDA) to discover
underlying topics within the documents.
Introduction to Streamlit
● Practical Exercise:
○ Build a basic Streamlit app that displays text, images, and charts.
● Assignment:
○ Create a simple dashboard with user inputs (e.g., sliders, checkboxes).
● Practical Exercise:
○ Develop a Streamlit app with complex layouts and multiple interactive charts.
● Assignment:
○ Design a multi-page Streamlit app.
● Practical Exercise:
○ Explore a small dataset using traditional methods.
● Assignment:
○ Write a brief report on the challenges and opportunities of Big Data.
Data Wrangling with PySpark
● Practical Exercise:
○ Process a medium-sized dataset using PySpark.
● Assignment:
○ Clean and transform a dataset using PySpark and load it into a Streamlit app.
● Practical Exercise:
○ Visualise a large dataset in Streamlit using PySpark.
● Assignment:
○ Build a data dashboard in Streamlit that visualises trends in a large dataset.
● Practical Exercise:
○ Set up a connection between Streamlit and a cloud storage service.
● Assignment:
○ Create a Streamlit app that pulls data from a cloud storage service and
visualises it.
● Practical Exercise:
○ Build and deploy a machine learning model using PySpark and Streamlit.
● Assignment:
○ Develop a Streamlit app that allows users to train and test a machine learning
model on large datasets.
● Practical Exercise:
○ Create and deploy a Streamlit app with custom components.
● Assignment:
○ Secure a Streamlit app and deploy it to a cloud platform.
● Practical Exercise:
○ Complete and present the capstone project.
● Assignment:
○ Submit the final project and present it to the class.