INTRODUCTION TO PYTHON FOR DATA SCIENCE
Pasty Asamoah
+233 (0) 546 116 102
[email protected]
Kwame Nkrumah University of Science and Technology
School of Business
Supply Chain and Information Systems Dept.
Images used in this presentation are sourced from various online platforms. Credit goes to
the respective creators and owners. I apologize for any omission in attribution, and
appreciate the work of the original content creators.
THE CONCEPT OF DATA
DATA
Data is raw facts. It is anything you can measure or record.
STRUCTURED DATA
• Follows a rigid format.
• Well-defined structure.
• Organized into rows and columns (tabular).
• Stored in a well-defined schema such as databases.
STRUCTURED DATA
UNSTRUCTURED DATA
• Complex and mostly qualitative in nature.
• Not organized in rows and columns.
• Does not have an easily identifiable structure.
• Does not follow any particular rigid format or sequence.
UNSTRUCTURED DATA
SEMI-STRUCTURED DATA
• A mix of structured and unstructured data
• It lacks a rigid schema.
BIG DATA
THE BIG DATA PROBLEM
Data size is growing exponentially! It is becoming BIG
Image courtesy: IDC
But why is this a PROBLEM ?
Well,…
THE BIG DATA PROBLEM
The data coming in is
• Large (Volume),
• A mix of structured, unstructured, and semi-structured (Variety),
• The data is generated near real-time (Velocity)
• It contains BUSINESS VALUE
While big data is valuable,
it is becoming increasingly
difficult to manage using
traditional approaches –
Big Data
DEALING WITH BIG DATA
We develop capabilities to leverage big data to drive
performance!!
BIG DATA - FACEBOOK
https://www.youtube.com/watch?v=_r97qdyQtIk
THE DATA SCIENCE CONCEPT
DATA SCIENCE
• Data science is a multidisciplinary field of study whose
goal is to address the challenges of big data. Data
scientists solve problems with data!
• “Data science is an interdisciplinary field of scientific
methods, processes, algorithms, and systems to
extract knowledge or insights from data in various
forms, either structured or unstructured…” - Wikipedia
• Data scientists manage, manipulate, extract, and
interpret knowledge from tremendous amounts of
data.
DATA SCIENCE
• Some say: data science is a link between
computational, statistical, and substantive expertise.
DATA SCIENCE
• Others say
DATA SCIENCE VS BIG DATA
• Of course, data science is NOT the same as big data
• See big data as the “raw material”, and data science
as the “processing of the raw material” for insights
and better understanding (application).
• Data science is about “applications”, therefore any
domain with large data sets is a potential candidate.
DATA SCIENCE APPLICATIONS
Source: https://indico.ictp.it/event/7658/session/10/contribution/58/material/slides/0.pdf
DATA SCIENCE APPLICATIONS
Source: https://indico.ictp.it/event/7658/session/10/contribution/58/material/slides/0.pdf
DATA SCIENCE ECOSYSTEM
DATA SCIENCE ECOSYSTEM
• Data Engineering : building and managing data
pipelines
• Data Analytics: data exploration, building models
and algorithms, data visualization and storytelling
• Data Protection & Ethics: data security, privacy,
ethical concerns and regulatory issues
DATA ENGINEERING
• Data engineering - building of systems to enable the collection
and usage of data. This data is usually used to enable subsequent
analysis and data science- Wikipedia
• Data engineering in most cases is concerned with building data
pipelines leveraging varied platforms and tools.
DATA PIPELINES EXPLAINED
• From the concept of big data, we understand that organizational
data is stored or sourced from varied sources.
• These are
called sources
DATA PIPELINES EXPLAINED
• For analysis and data mining, the organization would like to
collect and store all data from the varied sources to a single place
– data warehouse or data lake. These are called destinations
DATA PIPELINE
• The organization will now need to build an automated process to
transfer the data from the source to the destination.
• The automated process involved in moving the data from the
varied sources to the destination (data warehouse/ data lake) is
called the data pipeline.
Source: Nischay Thapa
ACTIVITIES IN BUILDING A DATA
PIPELINE
• Building a data pipeline involves a series of activities to ensure the
quality of the data transferred.
Source: Nischay Thapa
WHY DATA QUALITY MATTERS
• Building a good data pipeline is an important aspect of data
science because the quality of the data affects predictions.
Source: https://data.cs.sfu.ca/QjZo/slides.pdf
DATA QUALITY DIMENSIONS
Source: https://data.cs.sfu.ca/QjZo/slides.pdf
COMMON TYPES OF DATA PIPELINES
1. ETL Pipelines (Most used in data warehouses)
• Extracts data from various sources
• Transforms it (e.g., remove duplicates, special chars, etc.)
• Loads it into the destination system.
2. ELT Pipelines (Most used in data lakes)
• Extracts data from various sources
• Loads it into the destination system.
• Transforms it (e.g., remove duplicates, special chars, etc.)
3. Batch Data Pipeline
• Process data in large chunks
• At specific intervals
• Used for non-time-sensitive data
4. Streaming Data Pipeline
• Process data in real-time
• Used for time-sensitive data (e.g., financial transactions)
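The pipeline types above can be illustrated with a tiny ETL-style sketch in Python (all source rows and field names here are made up for illustration):

```python
# Minimal ETL sketch: Extract -> Transform -> Load (hypothetical data).
def extract():
    # Pretend these rows arrived from different sources.
    return [
        {"name": "Ama", "age": "23"},
        {"name": "Ama", "age": "23"},   # a duplicate to be cleaned out
        {"name": "Kofi", "age": "30"},
    ]

def transform(rows):
    # Remove duplicates and cast age from text to an integer.
    seen, clean = set(), []
    for row in rows:
        key = (row["name"], row["age"])
        if key not in seen:
            seen.add(key)
            clean.append({"name": row["name"], "age": int(row["age"])})
    return clean

def load(rows, destination):
    # Stand-in for writing to a data warehouse table.
    destination.extend(rows)

warehouse = []            # stand-in for the destination
load(transform(extract()), warehouse)
print(warehouse)          # two unique, cleaned rows
```

An ELT pipeline would simply call load() before transform(); a streaming pipeline would run the same steps per record as it arrives.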
TOOLS AND PLATFORMS FOR
BUILDING DATA PIPELINES
DATA ANALYTICS
• Data analytics form part of the data science ecosystem.
• It is the application of statistical and machine learning
techniques to draw insights from data under study and to make
predictions about the behavior of the system under study - M.
TAMER ÖZSU
• Statistics draw inference about a population from a sample
• Machine learning finds generalizable predictive patterns and
models.
TYPES OF DATA ANALYTICS
LEVEL OF COMPLEXITY IN DATA
ANALYTICS
DATA ANALYTICS TASKS AND METHODS
1. Clustering – identifying groups and structures that are “similar”
DATA ANALYTICS TASKS AND METHODS
2. Outlier detection – identifying anomalies
DATA ANALYTICS TASKS AND METHODS
3. Regression– best fit model with least error margin
DATA ANALYTICS TASKS AND METHODS
4. Summarisation – a more compact representation of the data
DATA PROTECTION & ETHICS
• Data needs to be protected
Facebook's Cambridge Analytica data
scandal
https://www.youtube.com/watch?v=VDR8qGmyEQg
DATA PROTECTION & ETHICS
DIMENSIONS OF DATA PROTECTION
PRIVACY
• Proper handling, processing, storage and usage of information
• Privacy policies
• Data retention & deletion policies
• Third-party management
• User consent
SECURITY
• Protecting information from unauthorized access
• Encryption
• Infrastructure security
• Access control
• Monitoring
DATA SCIENCE LIFECYCLE
DATA SCIENCE LIFECYCLE
Source: https://data.cs.sfu.ca/QjZo/slides.pdf
DATA SCIENCE LIFECYCLE
Source: https://data.cs.sfu.ca/QjZo/slides.pdf
NEXT STEPS
TO THIS END…
We know that;
• big data holds value; however, it is challenging to manage with
traditional data management tools.
• organizations develop data science capabilities to leverage big data
for performance.
• the data science competence involves data engineering where data
pipelines are built to ensure quality and the continuous flow of
data (data management).
• the destinations for data pipelines are either data lakes or data
warehouses (there are of course others, like data marts), which are
simply databases.
• For the data analyst or scientist to build models, they will
sometimes need to query such databases, hence the need to
understand basic database concepts and SQL application. In some
cases, data is presented in files (e.g., csv).
NEXT STEPS
Database and SQL
for Data Science
ANY QUESTIONS??
RELATIONAL DATABASES
DATABASES
When data is generated, we need to save or keep it in a safe place. The
place where the data is stored and managed is called a “database”.
A database is simply a collection of organized and interconnected data
that is stored electronically.
A database is a repository for storing, managing, and retrieving vast
amounts of data.
DATABASES
But wait, I have student data, social media feeds, and
financial data.
Should I put them together?
Well, create a table for each data set.
DATABASES
Relational databases are a collection of tables and other elements
(e.g., views, stored procedures, etc.)
THE CONCEPT OF TABLES IN RDBMS
A table is a collection of data elements organized in terms of rows and
columns. A table is also considered a convenient representation of
relations. It is the simplest form of data storage.
CHARACTERISTICS OF TABLES
Records or Rows: A single entry in a table is called a record or row. A
record in a table represents a set of related data.
Attributes or Columns: A table consists of several records (rows); each
record can be broken down into several smaller parts of data known as
attributes. The table above has four attributes: ID, Name, Age and
Salary.
Data types: Data types are classifications that identify possible values
that columns in a table can store. (e.g., text, number, date)
DATA TYPES
Each column in a database table is required to have a data type.
The database developer must decide what type of data will be stored in each
column when creating a table. SQL data types can be broadly categorized as:
• String Data Type – text and strings
• Numeric Data Type – numerical values
• Date & Time Data Type – date and time
DATA TYPES – (STRING)
source: w3schools.com
DATA TYPES – (NUMERIC)
source: w3schools.com
DATA TYPES – (DATE & TIME)
source: w3schools.com
CONSTRAINTS
Constraints are the rules enforced on the data columns of a table. These are used to limit the type of data that can go into a
table. This ensures the accuracy and reliability of the data in the database.
Commonly used constraints in relational database design are:
NOT NULL − Column cannot have NULL value.
AUTO_INCREMENT – Increase column value automatically
DEFAULT − Default value for empty column.
UNIQUE − Ensures unique values in column.
PRIMARY Key − Uniquely identifies each record.
FOREIGN Key − Identifies a record in a related table.
INDEX − Used to create and retrieve data from the database very
quickly.
RELATIONAL DATABASE KEYS
Relationships between tables in a relational database can be created
with relational keys. A relation key is an attribute which can uniquely
identify a particular row or record in a relation. There are two (2) main
relational keys
Primary Key: A primary key is a special relational database table
column designated to uniquely identify each table record. It is used as a
unique identifier to quickly parse data within the table. Every table
should have a primary key, but no table can have more than one.
Foreign Key: A foreign key is a column or columns of data in one table
that refers to the unique data values (often the primary key data) in
another table.
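To see both keys in action, here is a sketch using Python's built-in sqlite3 module (an in-memory database; the table names and data are hypothetical, and SQLite's syntax differs slightly from MySQL's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""CREATE TABLE DEPARTMENTS (
    ID INTEGER PRIMARY KEY,        -- primary key: unique per row
    NAME TEXT NOT NULL)""")
conn.execute("""CREATE TABLE EMPLOYEES (
    ID INTEGER PRIMARY KEY,
    NAME TEXT NOT NULL,
    DEPT_ID INTEGER,               -- foreign key: points at DEPARTMENTS.ID
    FOREIGN KEY (DEPT_ID) REFERENCES DEPARTMENTS(ID))""")

conn.execute("INSERT INTO DEPARTMENTS VALUES (1, 'Supply Chain')")
conn.execute("INSERT INTO EMPLOYEES VALUES (1, 'Ama', 1)")   # valid reference

# An employee pointing at a department that does not exist is rejected:
try:
    conn.execute("INSERT INTO EMPLOYEES VALUES (2, 'Kofi', 99)")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```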
PRIMARY AND FOREIGN KEYS IN ACTION
DATA INTEGRITY
DATA INTEGRITY
• Data integrity is about ensuring the accuracy, completeness,
consistency, and validity of an organization's data.
• Designing databases to ensure data integrity guarantees accurate and
correct data.
• But why is data integrity important to us?
Source: https://data.cs.sfu.ca/QjZo/slides.pdf
ASPECTS OF DATA INTEGRITY
Entity Integrity: Ensures there are no duplicate rows in a table.
Domain Integrity: Enforces valid entries in tables using constraints.
Referential Integrity: Ensures rows that are used by other records
cannot be deleted.
User-Defined Integrity: Enforces specific business rules that do not
fall within entity, domain, or referential integrity.
DATABASE NORMALIZATION
Consider the class table below:
1. Do you see that student, course, and instructor information is put
together in a single table?
2. What happens to the records if instructor “Peter” changes his name to “John
Doe” ?
3. Can you spot redundancy?
DATABASE NORMALIZATION
To efficiently organize data in a database, we need to normalize it.
Database normalization is a database design technique which organizes
tables in a manner that reduces redundancy and dependency of data. It
divides larger tables into smaller tables and links them using relationships.
Reasons for database normalization
1. Eliminate redundant data (e.g., storing same data in multiple tables)
2. Ensure data dependencies make sense.
NORMAL FORMS (NF)
• Normal forms are guidelines that help in designing databases that
are efficient, organized, and free from data anomalies.
1NF
FIRST NORMAL FORM (1NF)
A table is in 1NF when every column holds atomic (single) values and
there are no repeating groups
ORIGINAL TABLE
Source: https://byjus.com/gate/first-normal-form-in-dbms/
2NF
SECOND NORMAL FORM (2NF)
A table is in 2NF when it is in 1NF with no partial
dependency (an attribute in a table depends on only a part
of the primary key and not the whole key)
ORIGINAL TABLE
Source: https://byjus.com/gate/second-normal-form-in-dbms/
3NF
THIRD NORMAL FORM (3NF)
A table is in 3NF when it is in 2NF and has no transitive
dependency
ORIGINAL TABLE
Source: https://byjus.com/gate/third-normal-form-in-dbms/
DENORMALIZATION
• Denormalization refers to the process of deliberately introducing
redundancy into a relational database by grouping data that is
distributed across multiple tables.
• This is done to improve the performance of certain queries at the
cost of data redundancy.
• Data scientists often work with denormalized databases, commonly
called data warehouses
ENTITY RELATIONSHIP MODELS (ERD)
ENTITY RELATIONSHIP DIAGRAM
ERD
• ER model forms the basis of an ER diagram
• An ERD represents the conceptual database as viewed by the end user
• ERDs depict database’s main components:
• Entities (the object)
• Attributes (the characteristics and constraints)
• Relationships (links)
ERD - ENTITY
• An entity in an ERD refers to an entity set, not a single entity occurrence
• It corresponds to a table, not a row, in the relational environment
• The entity name, a noun, is written in capital letters
ERD - ATTRIBUTE
• Characteristics of entities
ERD - RELATIONSHIP
• Association between entities
• Participants are entities that participate in a relationship
• Relationships between entities always operate in both directions
• Relationship can be classified as 1:M
What are these
weird symbols?
ERD – CONNECTIVITY & CARDINALITY
• Connectivity describes the relationship classification
• Cardinality expresses minimum and maximum number of entity
occurrences associated with one occurrence of related entity
• Established by very concise statements known as business rules
SIMPLE HOSPITAL ERD
CLASS ERD
STRUCTURED QUERY LANGUAGE - I
STRUCTURED QUERY LANGUAGE
(SQL)
• SQL is a programming language designed for managing and
manipulating relational databases.
• It provides a standardized way to communicate with relational
database management systems (RDBMS) and perform various
operations such as querying, inserting, updating, and deleting data.
• SQL allows users to define the structure of databases, create tables,
specify relationships between tables, and set constraints to ensure
data integrity. It also provides a set of commands, known as SQL
statements, to perform operations on the data stored in the database.
SQL
• The standard SQL commands to interact with relational databases are
• CREATE <databases and tables>
• SELECT <tables>
• INSERT <tables>
• UPDATE <tables>
• DELETE <tables>
• DROP <databases and tables>
• These commands can be categorized based on their nature as:
• Data Definition Language
• Data Manipulation Language
• Data Control Language
• Data Query Language
DATA DEFINITION LANGUAGE
DATA MANIPULATION LANGUAGE
Source: tutorials point
DATA CONTROL LANGUAGE
DATA QUERY LANGUAGE
Source: tutorials point
STRUCTURED QUERY LANGUAGE - II
SQL – CREATE, USE & DROP DATABASE
The SQL CREATE DATABASE statement is used to create a new SQL database
Syntax: CREATE DATABASE DatabaseName;
Example: CREATE DATABASE KSB
The SQL SHOW DATABASES statement is used to list all SQL databases
Syntax & Example: SHOW DATABASES;
The SQL USE command is used to select an SQL database
Syntax: USE DatabaseName;
Example: USE KSB;
The SQL DROP command is used to delete an SQL database
Syntax: DROP DATABASE DatabaseName;
Example: DROP DATABASE KSB
SQL – CREATE & DROP TABLES
The SQL CREATE TABLE statement is used to create a new table in an SQL database
Syntax:
CREATE TABLE TableName (
FirstColumnName datatype constraint,
SecondColumnName datatype constraint,
.....
LastColumnName datatype constraint,
PRIMARY KEY( ColumnName )
);
Example:
CREATE TABLE CUSTOMERS (
ID INT NOT NULL AUTO_INCREMENT,
NAME VARCHAR(20) NOT NULL,
AGE INT NOT NULL,
PRIMARY KEY (ID)
);
The SQL DROP command is also used to delete tables, just like in the case of databases.
Syntax: DROP TABLE TableName;
Example: DROP TABLE CUSTOMERS;
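The CREATE TABLE and DROP TABLE statements can be tried with Python's built-in sqlite3 module; note this is a sketch against SQLite, where auto-increment is spelled INTEGER PRIMARY KEY rather than MySQL's AUTO_INCREMENT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, INTEGER PRIMARY KEY auto-increments (MySQL uses AUTO_INCREMENT).
conn.execute("""CREATE TABLE CUSTOMERS (
    ID INTEGER PRIMARY KEY,
    NAME VARCHAR(20) NOT NULL,
    AGE INT NOT NULL)""")

# List the tables in the database to confirm creation.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)                          # ['CUSTOMERS']

conn.execute("DROP TABLE CUSTOMERS")   # and now it is gone
```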
SQL – INSERT QUERY
The SQL INSERT statement is used to add new records of data to a table in the database
Syntax:
INSERT INTO TableName ( FirstColumnName, SecondColumnName,….LastColumnName )
VALUES ( FirstValue, SecondValue, ….LastValue );
Example 1:
INSERT INTO CUSTOMERS (NAME, AGE)
VALUES ('Esther Ama Amoh', 23),
('John Doe', 30),
('Jane Smith', 25);
Example 2 (values must be supplied for every column, in table order; NULL lets AUTO_INCREMENT fill in the ID):
INSERT INTO CUSTOMERS
VALUES (NULL, 'Esther Ama Amoh', 23),
(NULL, 'John Doe', 30),
(NULL, 'Jane Smith', 25);
SQL – SELECT QUERY
The SQL SELECT statement is used to fetch data from a database table and returns data in
the form of a result table.
Syntax:
SELECT FirstColumnName, SecondColumnName,…,LastColumnName FROM TableName;
OR
SELECT * FROM TableName;
Example 1:
SELECT ID, NAME, AGE FROM CUSTOMERS;
Example 2:
SELECT * FROM CUSTOMERS;
Example 3:
SELECT * FROM CUSTOMERS WHERE ID > 1;
SQL – UPDATE QUERY
The SQL UPDATE statement is used to modify existing records in a table
Syntax:
UPDATE TableName SET
FirstColumnName = NewValue1,
SecondColumnName = NewValue2,
….
LastColumnName = NewValueN
WHERE [condition];
Example 1:
UPDATE CUSTOMERS SET
NAME = 'Esther Nana Ama Amoh'
WHERE ID = 1;
Example 2:
UPDATE CUSTOMERS SET
NAME = 'Emmanuel Ackah'
WHERE (AGE > 20) AND (AGE < 40);
SQL – DELETE QUERY
The SQL DELETE statement is used to delete existing record from a table
Syntax:
DELETE FROM TableName WHERE [condition];
Example 1:
DELETE FROM CUSTOMERS WHERE ID = 1;
Example 2:
DELETE FROM CUSTOMERS WHERE AGE < 18;
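The INSERT, SELECT, UPDATE, and DELETE statements can be exercised end to end with Python's built-in sqlite3 module (a sketch on an in-memory SQLite database, reusing the CUSTOMERS example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (ID INTEGER PRIMARY KEY, NAME TEXT, AGE INT)")

# INSERT: parameter placeholders (?) keep values safely separated from SQL.
conn.executemany("INSERT INTO CUSTOMERS (NAME, AGE) VALUES (?, ?)",
                 [("Esther Ama Amoh", 23), ("John Doe", 30), ("Jane Smith", 25)])

# SELECT with a WHERE condition.
rows = conn.execute("SELECT NAME, AGE FROM CUSTOMERS WHERE AGE > 24").fetchall()
print(rows)

# UPDATE one record by its primary key.
conn.execute("UPDATE CUSTOMERS SET NAME = 'Esther Nana Ama Amoh' WHERE ID = 1")

# DELETE all records matching a condition.
conn.execute("DELETE FROM CUSTOMERS WHERE AGE < 25")
remaining = conn.execute("SELECT NAME FROM CUSTOMERS").fetchall()
print(remaining)
```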
STRUCTURED QUERY LANGUAGE - III
ADVANCED CONCEPTS IN SQL
In this lecture, we’re focusing on the basics. For mastery of more
advanced concepts like stored procedures, JOINs, conditions,
aggregate functions, etc., these platforms offer impressive free
tutorials for your attention.
PERSONAL ASSIGNMENT
o https://www.programiz.com/SQL
o https://www.w3schools.com/sql/
o https://www.codecademy.com/learn/learn-sql
o https://www.tutorialspoint.com/sql/index.htm
o https://www.sqltutorial.org/
DATABASE MANAGEMENT SYSTEMS
DATABASE MANAGEMENT SYSTEMS
A DBMS serves as an interface between an end user and a database,
allowing users to create, read, update, and delete data in the database.
They also allow for database performance monitoring and tuning.
Examples include Microsoft SQL Server, MySQL Workbench, DBeaver, etc.
NEXT STEPS
TO THIS END…
We know that;
• Databases are simply collections of organized and interconnected
data that are stored electronically.
• In relational databases, data is stored in tables.
• Before we set out to create our databases, we leverage ERDs to
define the structure and rules of the database and related tables.
• Finally, we leverage the SQL programming language to interact
with and manage databases. Most often, we use database management
systems – GUI applications that allow users to interact with
databases with ease.
• Now that we know how to design databases, and data pipelines
using data engineering techniques, we’re ready to learn the
fundamentals of the python programming language – the primary
tool we will be using to create and manage our models.
NEXT STEPS
Introduction to
Python
Programming
ANY QUESTIONS??
DATABASES & SQL LAB WORK
DOWNLOAD MYSQL WORKBENCH
https://dev.mysql.com/get/Downloads/MySQLGUITools/mysql-workbench-community-8.0.34-winx64.msi
MYSQL WORKBENCH INSTALLATION
CREATE CONNECTION
START MYSQL SERVER
INTERFACE
LAB ACTIVITIES
CONTINUITY
DATA ENGINEERING LAB WORK
DOWNLOAD JAVA SE
https://www.oracle.com/java/technologies/downloads/
JAVA SE INSTALLATION
https://www.youtube.com/watch?v=SQykK40fFds
DOWNLOAD PENTAHO
https://privatefilesbucket-community-edition.s3.us-west-2.amazonaws.com/9.4.0.0-343/ce/client-tools/pdi-ce-9.4.0.0-343.zip
LAB ACTIVITIES
CONTINUITY
NEXT STEPS
Introduction to
Python
Programming
ANY QUESTIONS??
PYTHON PROGRAMMING
PYTHON – Of course not a snake!
Python is a popular general-purpose programming language created by
Guido van Rossum, and released in 1991.
Python can be used for:
• Software & Game Development
• Scientific Computing
• Data Analytics, Visualizations, and Machine Learning
DOWNLOAD & INSTALLATION
Python: https://www.python.org/ftp/python/3.12.1/python-3.12.1-amd64.exe
Anaconda: https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Windows-x86_64.exe
PYTHON SYNTAX
The set of rules that defines how a Python program will be written and
interpreted.
Indentation
Defines a block of
code
Syntax Error
We’ll discover other important rules as we progress
Credit: w3schools
HELLO WORLD! – Programming Tradition
!!! Notice the absence of a semi-colon after the closing bracket
VARIABLES
• Variables are containers for storing data values.
• Python has no command for declaring a variable.
• A variable is created the moment you first assign a value to it.
DO YOU REMEMBER BACK IN HIGH SCHOOL?
These are variables in python
VARIABLES
• Variables are containers for storing data values.
• Python has no command for declaring a variable.
• A variable is created the moment you first assign a value to it.
Notes:
Emmanuella is wrapped in quotation marks.
print(username) is not in quotation marks.
print('Female') is wrapped in quotation marks.
RULES IN NAMING VARIABLES
• A variable name:
• must start with a letter or the underscore character
• cannot start with a number
• can only contain alpha-numeric characters and underscores (A-Z, a-z, 0-9, and _ )
• is case-sensitive (age, Age and AGE are three different variables)
• cannot be any of the Python keywords.
Error but why?
Credit: w3schools
PYTHON KEYWORDS
Keywords are predefined, reserved words used in Python programming that
have special meanings to the interpreter.
!!! These keywords are reserved for the python programming languages
Credit:programiz
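Python can report its own reserved words through the standard keyword module, which is a handy way to check whether a name is safe to use as a variable:

```python
import keyword

print(keyword.kwlist)              # every reserved word in this Python version
print(keyword.iskeyword("class"))  # True: reserved, cannot be a variable name
print(keyword.iskeyword("age"))    # False: fine to use as a variable name
```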
COMMENTS
• In computer programming, comments are hints that we use to make our
code more understandable. They are completely ignored by the
interpreter. In python, we use the # symbol for commenting.
Notes:
Anything after the # symbol is ignored Comment
Credit: programiz
DATA TYPES
DATA TYPES
• In computer programming, data type refers to the type of value a variable
holds. The data type of a variable ensures that mathematical, relational or
logical operations can be applied without causing errors. Python
supports the following:
Credit: w3schools
DATA TYPES
Credit: w3schools
NUMERIC DATA TYPES
Notes:
Concatenation – we can also use the + symbol.
But how do I know the data type of a variable?
Credit: programiz
CHECK DATA TYPE
• In python programming, to know the data type of a variable, we use the
type() function. Of course you don’t know what functions are. We’ll talk
about them in moments. For now, understand that we use the type()
function to get the data type of a variable.
But how do I convert between data types? Well… Notice ‘complex’,’float’, ‘int’
Credit: programiz
DATA TYPE CONVERSION
• We can easily switch between data types. Pay close attention to the results
after we checked the data types. They returned something like:
<class ‘int’>. Now, to convert any numeric value to an integer, we use int()
Wrapped in float()
function
This applies to converting to int(), float(), complex().
Credit: programiz
Hands-on: 5 mins maximum
Create:
variable X = 23
Variable Y = “12”
Tasks:
1. Print the data type of X
2. Print the data type of Y
3. Compute Z = X+Y such that the result of Z = 35
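One possible solution to the hands-on task, using type() to inspect and int() to convert:

```python
X = 23      # an integer
Y = "12"    # a string, because it is wrapped in quotes

print(type(X))       # <class 'int'>
print(type(Y))       # <class 'str'>

# X + Y would raise a TypeError (int plus str), so convert Y first.
Z = X + int(Y)
print(Z)             # 35
```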
STRING DATA TYPE
• In Python programming, strings are enclosed in quotation
marks “”. For instance, 22 is an integer but “22” is a string. We can use
single or double quotes to create string variables in Python.
Not in quote
In quote
Credit: programiz
STRING MANIPULATION: LEN()
• There are several operations on strings. For instance, we can get the
length of the string, slice parts of the string, check values in a string, etc.
18 characters
The len() function counts the number of
characters, including white spaces
Credit: programiz
CHECK STRING EXISTENCE: IN
• To check if a sequence of characters is present in a string, we use the in
keyword. The result is a Boolean: True / False
Boolean result
The in keyword to check if the set of
characters in a sequential order “prog”
exists in the text
Credit: programiz
CHECK STRING NOT EXIST: NOT IN
• To check if a sequence of characters is not present in a string, we use the
not in keyword. Basically, the not keyword is for negation. The result is
a Boolean: True / False
Boolean result
The not keyword to check if the set of
characters is not in a text.
Credit: programiz
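Putting len(), in, and not in together in one short sketch:

```python
text = "python programming"

print(len(text))           # 18 characters, including the white space
print("prog" in text)      # True: "prog" occurs inside the string
print("java" not in text)  # True: "java" does not occur
```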
STRING SLICING
• Sometimes you may want to slice a portion of a string. Imagine you have
the string “Hello world”, but your interest is the text “world”. To slice
those characters from the string, we leverage the slicing technique
• Text = “Hello world”
-i -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
+i 0 1 2 3 4 5 6 7 8 9 10
char H e l l o w o r l d
• Variable[start index : end index] # the start defaults to 0 and the end defaults to the length of the string
• The length of your result will always be end index – start index
Credit: programiz
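Using the index table above for Text = “Hello world”, slicing out “world” looks like this:

```python
text = "Hello world"

print(text[6:11])   # 'world': from index 6 up to, but not including, 11
print(text[:5])     # 'Hello': a missing start index defaults to 0
print(text[-5:])    # 'world': negative indexes count from the back
```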
STRING SLICING
From the back, the
index starts from -1
Credit: programiz
STRING METHODS
• There are several other operations we can perform on strings. For instance,
we can convert from upper to lower case, check if a string is numeric, etc.
Kindly find a complete list for your reference at:
• https://www.w3schools.com/python/python_strings_methods.asp
Name of the string,
dot, method name
Credit: w3schools
STRING FORMATTING
• There are different approaches to formatting strings.
Use + to
concatenate strings
Use placeholders for the
variable names. It matches the
positions of the variables used
in the format method
Injected the variable names
• I will be using the 3rd approach for examples
Credit: w3schools
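The three formatting approaches side by side (the variable names here are just examples):

```python
username = "Emmanuella"
age = 21

# 1. Concatenate with +; every piece must already be a string.
print("User " + username + " is " + str(age))

# 2. format(): placeholders {} match the positions of the variables.
print("User {} is {}".format(username, age))

# 3. f-string: inject the variable names directly.
print(f"User {username} is {age}")
```

All three print the same text; the f-string form is usually the most readable.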
LIST DATA TYPE
• In python programming, lists are used to store multiple items in a single
variable. Have you realized we have been creating single items per
variable? There are instances where you may need to store more than one
item Variable name
items List accepts
duplicates
0 1 2 3 4
apple banana cherry apple cherry
• We can get the length, count items, slice, etc.
Credit: w3schools
REPLACE & ADD TO LIST
• Replace
Banana will be replaced
with blackcurrant
Index of item to replace
• Add
Item to be added to the
list
Call the method append() on the name of the
list or variable
Credit: w3schools
REMOVE FROM LIST
• Remove
Call the method remove() on the name of the
list or variable
• We can also achieve that with the del keyword
Credit: w3schools
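The list operations above (replace, append, remove, del) in one sketch:

```python
fruits = ["apple", "banana", "cherry", "apple", "cherry"]  # duplicates allowed

fruits[1] = "blackcurrant"   # replace the item at index 1 (banana)
fruits.append("orange")      # add an item to the end
fruits.remove("cherry")      # remove the FIRST matching item only
del fruits[0]                # delete by index with the del keyword

print(fruits)                # ['blackcurrant', 'apple', 'cherry', 'orange']
print(len(fruits))           # 4
```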
LIST METHODS
• Reference: https://www.w3schools.com/python/python_lists_methods.asp
Credit: w3schools
TUPLE DATA TYPE
• Tuples are used to store multiple items in a single variable. A tuple is a
collection which is ordered and unchangeable. Tuples are written with
round brackets.
Variable name
items
• How would you add “grapes” to the tuple?
• Remember we can convert a tuple to a list, add grapes, and convert it
back to tuple!
Tuple Methods Reference:
https://www.w3schools.com/python/python_tuples_methods.asp
Credit: w3schools
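The tuple-to-list workaround for adding “grapes” can be sketched as:

```python
fruits = ("apple", "banana", "cherry")

# fruits.append("grapes") would fail: tuples are unchangeable.
as_list = list(fruits)       # 1. convert the tuple to a list
as_list.append("grapes")     # 2. add the new item
fruits = tuple(as_list)      # 3. convert it back to a tuple

print(fruits)                # ('apple', 'banana', 'cherry', 'grapes')
```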
SET DATA TYPE
• Sets are unindexed data types used to store multiple items in a single
variable. Sets remove duplicates from a list of items
Variable name
items
It removed all
duplicated values
Credit: w3schools
MANIPULATING SETS
• ADD
Use the add method
• UPDATE
Set 1
Set 2
Call the update method to
Credit: w3schools
merge items into a single set
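De-duplication, add(), and update() in a short sketch (note that sets are unordered, so the display order may vary):

```python
fruits = {"apple", "banana", "cherry", "apple"}  # duplicate "apple" is dropped
print(len(fruits))                  # 3 unique items

fruits.add("orange")                # add one item
fruits.update({"mango", "banana"})  # merge another set; existing items ignored
print(sorted(fruits))               # sorted() gives a stable order for display
```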
SET METHODS
• Reference: https://www.w3schools.com/python/python_sets_methods.asp
Credit: w3schools
DICTIONARY DATA TYPE
• Dictionaries are used to store data values in key:value pairs. Dictionary
items are ordered, changeable, and do not allow duplicates.
Values
Use colon
Dictionary keys
• We can easily access the car brand as:
thisdict[“brand”]
Credit: w3schools, programiz
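A small dictionary sketch using the familiar w3schools car example:

```python
thisdict = {
    "brand": "Ford",     # key: value
    "model": "Mustang",
    "year": 1964,
}

print(thisdict["brand"])             # access a value by its key
thisdict["year"] = 2020              # dictionaries are changeable
print(thisdict.get("color", "n/a"))  # get() returns a default if the key is absent
```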
DICTIONARY METHODS
Methods Reference: https://www.w3schools.com/python/python_dictionaries_methods.asp
Credit: w3schools
OPERATORS
OPERATORS
• In programming, operators are used to perform operations on variables
and values. There are several of them in python
o Arithmetic operators
o Assignment operators
o Comparison operators
o Logical operators
o Etc.
Credit: w3schools
ASSIGNMENT OPERATORS
• Assignment operators are used to assign values to variables:
The statement
is not true
Credit: w3schools
COMPARISON OPERATORS
• Comparison operators are used to compare two values
The statement
is true
Credit: w3schools
LOGICAL OPERATORS
• Logical operators are used to combine conditional statements.
The statement
is not true
Credit: w3schools
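The operator families in one sketch:

```python
x, y = 10, 3

print(x + y, x - y, x * y)   # arithmetic: 13 7 30
x += 1                       # assignment: same as x = x + 1
print(x > y)                 # comparison: True, the result is a Boolean
print(x > y and y > 0)       # logical and: both conditions hold
print(not (x == y))          # logical not: negation
```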
CONDITIONAL CONTROLS
IF-STATEMENT
• In computer programming, the if statement is a conditional statement. It
is used to execute a block of code only when a specific condition is met.
For example, suppose we need to assign different grades to students
based on their scores:
• If a student scores above 90, assign grade A
• If a student scores above 75, assign grade B
• If a student scores above 65, assign grade C
colon
Indentation Execute this
block of code if
and only if the
Credit: programiz condition is true
IF-STATEMENT
• The logic
Credit: programiz
IF-STATEMENT
Pay attention to the little space: the print function is not directly under the
“if”. This indentation indicates that the print statement belongs to the body
of the if-statement. But what if the statement or condition returned False?
We can handle that with an else block.
Credit: w3schools
IF-ELSE-STATEMENT
Notice that it did not print anything on the
screen. That is because the condition is
False, and hence it did not execute the
block. Let's say we want to print some text if
the condition is false.
We included the else: block so
that if the condition fails, we
can still print some text on the
screen
Credit: w3schools
IF-ELIF-STATEMENT
• Sometimes you may want to test multiple conditions. In that case, we
employ the if-elif statement
We included the elif to check
two more conditions. You can
add as many as you wish
The else block
Credit: w3schools
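Putting if, elif, and else together for the grading example (variable names are assumptions):

```python
score = 70  # hypothetical score

if score > 90:
    grade = "A"
elif score > 75:   # checked only if the first condition fails
    grade = "B"
elif score > 65:   # you can add as many elif branches as you wish
    grade = "C"
else:              # runs when every condition above fails
    grade = "F"

print(grade)  # -> C
```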
LOOPS
LOOPS
• In programming, a loop is a control flow statement that is used to
repeatedly execute a group of statements as long as the condition is
satisfied. Such a type of statement is also known as an iterative statement
• Python has two (2) primitive loop commands:
• For loop
• While loop
FOR-LOOP
• A for loop is used for iterating over a sequence (a list, a tuple, a
dictionary, a set, or a string). The Python for-loop is less like the for
keyword in other programming languages and works more like an iterator method
Credit: programiz
FOR-LOOP
The loop iterates over the items in the sequence; the loop variable name can be anything.
Credit: w3schools
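A sketch of the classic fruit-list example the slide is based on:

```python
fruits = ["apple", "banana", "cherry"]

# `fruit` is just a variable name; it takes each item in turn
for fruit in fruits:
    print(fruit)
```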
FOR-LOOP: BREAK AND CONTINUE
• The break and continue keywords are important for stepping out of the
loop and skipping items respectively.
The break keyword in action
When it gets to the item where fruit is banana,
it breaks out of the loop, meaning it will not
loop through any item after banana (Cherry is never printed)
Credit: w3schools
FOR-LOOP: BREAK AND CONTINUE
• The break and continue keywords are important for stepping out of the
loop and skipping items respectively.
The continue keyword in action
When it gets to the item where fruit is banana,
it skips it and continues looping over the rest
of the items (Banana was skipped)
Credit: w3schools
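Both keywords can be sketched with the same fruit list; the helper lists `printed` and `skipped` are added here just to make the effect visible:

```python
fruits = ["apple", "banana", "cherry"]

printed = []
for fruit in fruits:
    if fruit == "banana":
        break            # step out of the loop entirely
    printed.append(fruit)
print(printed)           # cherry is never reached

skipped = []
for fruit in fruits:
    if fruit == "banana":
        continue         # skip this item and carry on
    skipped.append(fruit)
print(skipped)           # banana was skipped
```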
NESTED FOR-LOOP
Credit: w3schools
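A minimal nested-loop sketch (the adjective/fruit lists are assumed); the inner loop runs once for every iteration of the outer loop:

```python
adjectives = ["red", "big"]
fruits = ["apple", "banana"]

pairs = []
for adj in adjectives:
    for fruit in fruits:       # inner loop runs fully for each adj
        pairs.append(adj + " " + fruit)

print(pairs)
```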
WHILE-LOOP
• With the while loop we can execute a set of statements as long as a
condition is true. The implication is that we need to keep track of an
updating element.
Credit: programiz
WHILE-LOOP
Execute as long as the condition is true. Update the value of the conditioning item inside the loop; we will get into an infinite loop without this line of code.
The break and continue keywords work in while loops too.
Credit: w3schools
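A minimal while-loop sketch:

```python
i = 1
while i < 4:   # execute as long as this condition is true
    print(i)
    i += 1     # update the conditioning variable; without this
               # line we would get an infinite loop
```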
FUNCTIONS
FUNCTIONS
• A function is a block of code which only runs when it is called. You can
pass data, known as parameters, into a function. A function can return
data as a result.
Credit: programiz
FUNCTIONS
• Defining and calling a function: the function definition comes first, then the function call.
NB: Until you make a function call, it will never get executed
Credit: w3schools
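A minimal sketch of defining and calling a function (the function name `greet` is an assumption):

```python
def greet():
    print("Hello from a function")

greet()  # until you make this call, the body never runs
```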
FUNCTIONS WITH ARGUMENT
• Defining functions with argument
Parameter
argument
FUNCTIONS WITH ARGUMENT
• Defining functions with argument
The parameter acts as a placeholder. We append the person's name to the "Good morning" string; this will apply to any name provided as an argument to the function.
Credit: w3schools
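A sketch of the greeting function described above (the names are assumptions):

```python
def greet(name):                    # `name` is the parameter/placeholder
    return "Good morning " + name   # works for any name passed in

print(greet("Ama"))    # "Ama" is the argument
```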
FUNCTIONS WITH ARGUMENT
• Let's modify
Credit: w3schools
FUNCTIONS RETURN VALUES
• So far, our functions do not return values that are reusable. We can use
the return keyword to achieve that. Note that functions that return values
do not automatically print to the screen.
Returning the sum of the two numbers. No output? But we called the function…
Credit: w3schools
FUNCTIONS RETURN VALUES
• So far, our functions do not return values that are reusable. We can use
the return keyword to achieve that. Note that functions that return values
do not automatically print to the screen.
Assigning the result to a variable. This is possible because we are returning the result after the computation x + y.
Print the returned value
Credit: w3schools
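A sketch of a function that returns a value (the name `add` is an assumption):

```python
def add(x, y):
    return x + y      # return makes the result reusable

result = add(2, 3)    # nothing is printed by the call itself
print(result)         # printing the returned value shows 5
```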
OBJECT-ORIENTED PROGRAMMING (OOP)
CLASS AND OBJECT
• Python is an object oriented programming language. Almost everything
in Python is an object, with its properties and methods. A Class is like an
object constructor, or a "blueprint" for creating objects.
• In OOP, everything is considered an Object with certain properties
(nouns) and functionalities/methods (verbs).
Credit: programiz
CLASS DEFINITION
• It is very simple to create a class in Python
Class name
Class properties
Class methods
Credit: programiz
CLASS DEFINITION
Every method takes self as its first parameter. The name can be
anything, but self is the convention; it refers to the current instance of the class
We are trying to set the name value
We are trying to access the name property
using the dot
Credit: programiz
CLASS INSTANTIATION
We could decide to eliminate this code
Creating an instance of the class Human
Calling the method
Setting the human name
Credit: programiz
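A sketch of the `Human` class from the slides, with an assumed `set_name`/`get_name` pair to show `self`, setting a property, and dot access:

```python
class Human:
    def set_name(self, name):   # self refers to the current instance
        self.name = name        # set the name property

    def get_name(self):
        return self.name        # access the property with the dot

person = Human()                # create an instance of the class
person.set_name("Pasty")        # set the human name
print(person.get_name())        # call the method
```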
CLASS CONSTRUCTORS
constructor
We inject the name of the user when
instantiating the class
Credit: programiz
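The same class with a constructor, so the name is injected at instantiation:

```python
class Human:
    def __init__(self, name):   # the constructor runs at instantiation
        self.name = name        # inject the name when creating the object

person = Human("Pasty")   # no separate set_name call needed
print(person.name)
```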
CASE
Credit: programiz
MODULES
MODULE
• Consider a module to be the same as a code library. It is a file containing
a set of functions you want to include in your application. To create a
module just save the code you want in a file with the file extension .py
• Create a file and name it: mathematics.py
function
MODULE
• Create another file in the same directory as the mathematics.py and name
it: use.py
Importing the mathematics.py file
Using the add function in the
mathematics.py file
• Note that we imported all the code in the mathematics file but we used
only the add function.
• So lets see how to import only the add function
MODULE
See how we import the add function
See the use too
• Sometimes we can import and rename
MODULE
We introduce the “as” aliasing keyword
We access the functions in the
mathematics file with mt, the alias
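A runnable sketch of the module workflow. Here the `mathematics.py` file is written to a temporary folder instead of the working directory, and the body of `add` is assumed:

```python
import importlib
import pathlib
import sys
import tempfile

# Create mathematics.py with an add function, as on the slide
folder = tempfile.mkdtemp()
pathlib.Path(folder, "mathematics.py").write_text(
    "def add(x, y):\n    return x + y\n"
)

# use.py would simply do: import mathematics as mt
sys.path.insert(0, folder)
mt = importlib.import_module("mathematics")  # the "mt" alias
print(mt.add(2, 3))
```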
NEXT STEPS
TO THIS END…
We have learnt the basics of python programming specifically;
• Loops (e.g., for-loop and while-loop)
• Functions
• OOP
• Modules
• In the next session, we will learn about data manipulation using
pandas
READING ASSIGNMENT
Be sure to go through these tutorials for a practical experience on python
programming
• https://www.programiz.com/python-programming
• https://www.w3schools.com/python/
• https://www.javatpoint.com/python-tutorial
• https://www.youtube.com/watch?v=QXeEoD0pB3E&list=PLsyeobzWx
l7poL9JTVyndKe62ieoN-MZ3
NEXT STEPS
Data Pre-processing
ANY QUESTIONS??
DATA PREPROCESSING
REAL WORLD DATA CAN BE “MESSY”
Data preprocessing is the crucial first step in data analysis, where you
transform raw data into a clean and understandable format suitable for
further analysis.
DATA PREPROCESSING TECHNIQUES
Credit: w3schools
INTRODUCTION TO PANDAS
PANDAS
Pandas is a Python library used for working with data sets.
Data Visualization Data Cleaning Data Exploration
e.g., heatmap e.g., duplicates e.g., correlation
Data Manipulation
e.g., transformation
READING DATA
We need to import the pandas package to use it.
Import the pandas package and alias it as pd
Credit: w3schools
SERIES
In Pandas, a Series is just like a column in a table.
Column without label
Column with label
Credit: w3schools
DATAFRAME
DataFrame is like the whole table.
Table column values
Table column names
Credit: w3schools
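A minimal sketch of a Series and a DataFrame (the example values are assumed):

```python
import pandas as pd

# A Series is like a single table column
s = pd.Series([1, 7, 2], index=["x", "y", "z"])  # column with labels
print(s["y"])

# A DataFrame is like the whole table: column names -> column values
df = pd.DataFrame({"calories": [420, 380], "duration": [50, 40]})
print(df)
```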
LOCATING ROW VALUE
Locating row value is similar to indexing
First row value
First row or index zero. Notice one open and close square bracket. print(df.loc[0:1]) returns a range of rows.
Credit: w3schools
DATA TYPE CONVERSION
We can convert a column from one data type to another.
These are float data types
Convert column A to integer
data type
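A sketch of row location with `loc` and type conversion with `astype` (the column names are assumed):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.5, 3.7], "B": [4.0, 5.0, 6.0]})

print(df.loc[0])     # first row (index zero), one pair of brackets
print(df.loc[0:1])   # a range of rows

df["A"] = df["A"].astype(int)  # convert column A from float to int
print(df.dtypes)
```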
HANDLING DATA
IMPORT DATA
Often, you'd want to import and work with data rather than creating
it manually. In pandas, we can import from an array of sources
including CSV.
The directory and
name of the file
Reading file
Credit: w3schools
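A sketch of reading a CSV. An in-memory string stands in for the file on disk; normally you would pass a path such as `pd.read_csv("data.csv")`:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "Day,Temperature\n1,32\n2,35\n3,31\n"
df = pd.read_csv(io.StringIO(csv_text))  # read the file into a DataFrame
print(df)
```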
SNAPSHOT OF THE DATA
HEAD AND TAIL
We can use the head() and tail() methods to take a snapshot of the top n
and last n records of our data
Top 3 records
Last 3 records
Displaying top 3 records
DATA SHAPE
Data shape refers to the number of rows and columns in our data. The size (df.size) is the product of the rows and columns.
Columns
Rows
DATAFRAME COLUMNS
Dataframe columns are the headings in the table.
Columns
All columns in the
dataframe
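head, tail, shape, size, and columns at a glance (the data is assumed):

```python
import pandas as pd

df = pd.DataFrame({"Day": range(1, 6), "Temperature": [32, 35, 31, 40, 38]})

print(df.head(3))   # top 3 records
print(df.tail(3))   # last 3 records
print(df.shape)     # (rows, columns)
print(df.size)      # rows * columns
print(df.columns)   # the column headings
```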
FEATURE SELECTION
Selecting table column(s) in pandas is quite easy. The column selection
is on the premise that you may want to work with a fraction of the
table. For instance, you might be interested in just the Temperature column
and not the others.
For instance, these two queries are valid and will return the temperature
column. However, the first approach can be used only when the column
name does not contain a space:
df.Temperature
df['Temperature']
Returns the top 3 rows of the temperature column
Selecting
multiple
columns
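A sketch of the selection queries above, plus `.unique()` for distinct values (the data is assumed):

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [32, 35, 31], "Humidity": [80, 75, 90]})

print(df.Temperature)       # works only without spaces in the name
print(df["Temperature"])    # always works
print(df[["Temperature", "Humidity"]])  # multiple columns: a list in brackets
print(df["Temperature"].unique())       # distinct values only
```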
ALL vs UNIQUE VALUES
Sometimes you want all the values in a column; other times only the unique values.
All values in temperature column
Unique values in
temperature column
CONDITIONAL SELECTION
Sometimes, you may want to select columns or rows based on some conditions
Selecting all columns where
temperature is >= 40
Selecting specific columns
where temperature is >= 40
CONDITIONAL SELECTION
What is going on here?
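A sketch of conditional selection, assuming a `Temperature` column:

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [32, 45, 41], "Humidity": [80, 75, 90]})

hot = df[df["Temperature"] >= 40]   # all columns, matching rows only
print(hot)

hot_humidity = df[df["Temperature"] >= 40]["Humidity"]  # a specific column
print(hot_humidity)
```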
GET TO KNOW YOUR DATA
Basic descriptive statistics using pandas: Minimum and Maximum values
Maximum value in each
column?
MEAN / AVERAGE
Mean value of the temperature column
Means of each column
STANDARD DEVIATION
Standard deviation value of the temperature
column
Standard deviation of each
column
CORRELATION AND COVARIANCE
Covariance matrix
Correlation matrix
EVERYTHING AT A GLANCE
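The descriptive statistics above in one sketch (the data is assumed):

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [30, 40, 50], "Humidity": [80, 70, 60]})

print(df["Temperature"].min(), df["Temperature"].max())  # min and max
print(df["Temperature"].mean())  # average
print(df["Temperature"].std())   # sample standard deviation
print(df.cov())                  # covariance matrix
print(df.corr())                 # correlation matrix
print(df.describe())             # everything at a glance
```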
HANDLING MISSING VALUES
ANY MISSING VALUES?
Missing values can be handled in several ways. For instance, we may want to
drop, impute, interpolate, or even fill them with some value (e.g., the average)
Checks if any column has missing values
Have missing
values
WHEN ENTIRE ROW IS MISSING
We may want to drop
Drop row
Drop row if the entire row is missing
WHEN ENTIRE ROW IS MISSING
We may want to drop, but this is not the best approach
Drop row
Drop row if any of the columns has missing value
It's not the best solution. Our data is gone!
FILL ALL NULL VALUES
Fill every missing value with 12
These are supposed to be categories.
We got it wrong!
FILL NULL VALUES BY COLUMNS
Fill respective columns with specific value
Looks a bit better
FORWARD & BACKWARD FILL
forward
backward
Looks a bit better with the backward fill
INTERPOLATE
Interpolation estimates missing values from the neighbouring data points, either forward or backward along the column.
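The missing-value techniques above in one runnable sketch (the column names and values are assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temperature": [32.0, np.nan, 31.0, np.nan],
    "City": ["Accra", "Kumasi", None, "Tamale"],
})

print(df.isnull().any())        # which columns have missing values?

dropped = df.dropna(how="all")  # drop only fully-empty rows
cleaned = df.fillna({"Temperature": df["Temperature"].mean(),
                     "City": "Unknown"})  # fill per column
print(cleaned)

print(df["Temperature"].ffill())        # forward fill
print(df["Temperature"].interpolate())  # estimate from neighbours
```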
HANDLING DUPLICATES
DUPLICATES
Duplicated rows affect results. We handle them by deleting them.
Check if any row is
duplicated
No row is duplicated
DUPLICATES
Duplicated rows affect results. We handle them by deleting them.
Check if any row is
duplicated
No row is duplicated
Replace the original
copy of the data
Drop duplicates
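A sketch of detecting and dropping duplicates (the data is assumed):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 3], "B": ["x", "y", "y", "z"]})

print(df.duplicated().any())   # True: the third row repeats the second
df = df.drop_duplicates()      # replace the original copy
print(df.duplicated().any())   # False: no row is duplicated now
```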
NEXT STEPS
TO THIS END…
We know;
• The basics for creating and handling dataframes
• Basic data cleaning techniques
• In our next session, we will learn about building basic machine
learning models
NEXT STEPS
Introduction to
Machine Learning
ANY QUESTIONS??
DATA VISUALISATION & STORY TELLING
INTRODUCTION
Data is only as good as your ability to understand and communicate it.
It is essential to choose the right visualization to communicate and tell the story
behind the data.
Key insights and understandings are lost when data is ineffectively presented,
which affects the story behind the data.
VISUALISATION? WHAT IS IT??
DATA VISUALISATION
Data visualization is the process of representing data in visual
or graphical formats to facilitate understanding and
communication of complex information.
It involves transforming raw data into charts, graphs, maps,
and other visual elements that convey patterns, trends, and
insights.
IMPORTANCE OF
DATA VISUALISATION
Data visualization is important for several reasons:
1. Enhances data understanding: Enables users to grasp information
quickly and comprehend complex relationships.
IMPORTANCE OF DATA VISUALISATION
2. Identification of patterns and trends: Helps uncover hidden
patterns, trends, and relationships in data that may not be apparent
in raw form.
IMPORTANCE OF DATA VISUALISATION
3. Effective communication: Facilitates the communication of data
insights to a broad audience, making it easier to present findings,
tell stories, and convey messages.
4. Data-driven decision making: Well-designed visualizations
empower users to make informed decisions based on data analysis
and identification of trends, outliers, and correlations.
5. Data exploration: Data visualization helps in exploratory
analysis, and storytelling or reporting.
CHALLENGES OF DATA
VISUALISATION
• Data quality and preprocessing: Ensuring data accuracy,
consistency, and completeness is crucial before creating
visualizations.
• Interpretation and avoiding biases: Designing visualizations that
are intuitive and free from misleading interpretations or biases to
ensure accurate understanding by users.
CHALLENGES OF DATA
VISUALISATION
• Choosing appropriate visual representations: Selecting the right
chart types, graphs, or maps that effectively represent the data
and align with the message or analysis goals
• Handling large and complex data: Visualizing big datasets or
complex data structures can pose challenges in terms of
scalability, performance, and usability.
STEPS IN DATA VISUALISATION
1. Know the data: Understand the category and type of data.
2. Understand the information needed from the data.
3. Select the appropriate technique and visualization tool.
The steps can be complex in real-life implementation
1. KNOW THE DATA
Data can be quantitative or qualitative in nature. Let’s focus on
what’s measurable.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’
Commonly, the questions asked range from nominal comparisons,
time-series, correlations, ranking, deviations, and distributions to
part-to-whole relationships.
Nominal Comparisons
Simple comparison of quantitative values
(e.g.) Total employees, Average salary, etc.
Time-series
Changes in values of a consistent metric over equal
time-spaced interval. (e.g.) Monthly sales.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’’
Correlations
Determines whether there exists a relationship between
variables and the extent of the relationship. (e.g.)
Relationship between employee salary and
performance.
Ranking
Comparison of the relative magnitude of two or more values.
(e.g.) Employee performance ranked by educational level,
from Degree holders to PhD.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’’’
Deviations
Dispersion of the data points from each other,
especially from the average. (e.g.) Performance of
employees this year versus last year.
Distribution
Data distribution, often around a
central value. It shows how often values occur in a
dataset (e.g.) Age distribution of employees
Part-to-whole relationship
Subset of data compared to the larger whole. It shows a
breakdown of elements that add up to a whole. (e.g.)
Number of employees who were absent today.
3. SELECT APPROPRIATE TECHNIQUE
AND VISUALISATION TOOL
Understanding data and knowing the kind of information needed
(answers) influence the visualization technique and tool applied. We
explore commonly used tools, techniques, and visualizations in
industries.
Data Visualization Tools
Graphical User Interface Tools Programming Languages
TECHNIQUES AND VISUALISATIONS
Column and Bar Charts
Used to show change over time, compare categories, or compare parts of a whole. Each bar
or column represents a category, with the length of the bar or column proportional to the
value it represents.
Column charts are ideal for visualizing chronological data; bar charts are ideal for data
with long category names.
TECHNIQUES AND VISUALISATIONS
Stacked Bar Charts and 100% Stacked Bar Charts
Stacked bar chart 100% stacked bar chart
Stacked bar chart: ideal for visualizing chronological data while comparing multiple part-to-whole relations. 100% stacked bar chart: ideal when category totals are not the focus; it emphasizes the composition of each subcategory over time.
TECHNIQUES AND VISUALISATIONS
Double Bar Charts
Double bar chart
Ideal for comparing categories over time. What do you think about this column chart?
TECHNIQUES AND VISUALISATIONS
Pie and Donut Charts
Used to compare categories or compare parts of a whole. It is ideal for small data sets with
fewer categories.
Pie Chart Donut Chart
TECHNIQUES AND VISUALISATIONS
Pie and Donut Charts
What do you think about this Pie Chart??
TECHNIQUES AND VISUALISATIONS
Line Charts
Used to show changes over time (time-series) by using data points represented by dots that are
connected by a straight line. Put differently, it shows time-series relationships with continuous data.
They help show trend, acceleration, deceleration, and volatility.
Shows changes in data over time while
comparing different categories
TECHNIQUES AND VISUALISATIONS
Scatter Plots
Used to show the relationship between two variables. They are best used to show correlation in large
data sets and identifying outliers.
Scatter plot with an outlier
TECHNIQUES AND VISUALISATIONS
Funnel
Used to visualize a linear process that has connected sequential stages. The value of each stage in the
process is indicated by the funnel's width as it gets narrower.
TECHNIQUES AND VISUALISATIONS
Cards
Mostly used to display KPIs. (e.g.) turnover
TECHNIQUES AND VISUALISATIONS
Gauge
A gauge consists of a circular arc which shows a singular value that measures progress towards a KPI or
goal. The line on the arc represents the target or goal and the shading represents the progress made
towards it. The value inside of the arc shows the progress value.
TECHNIQUES AND VISUALISATIONS
Map
Used for visualizing data across different locations and distances. (e.g.) Answer questions on cities or
countries and the related data such as number of employees, sales, etc.
TECHNIQUES AND VISUALISATIONS
Treemap
Used to display large quantities of hierarchically structured data, using nested rectangles. The chart
shows different perspectives of the data by displaying the rectangles as different sizes and colors based
on the frequency of occurrence. It is not ideal for visualising large categories
What do you think about this
treemap??
DOs AND DON’Ts IN DATA
VISUALISATION
1. Avoid slanted labels if possible
DOs AND DON’Ts IN DATA
VISUALISATION
2. Include a Zero baseline
DOs AND DON’Ts IN DATA
VISUALISATION
3. Order your data
DOs AND DON’Ts IN DATA
VISUALISATION
4. Use suitable proportions
DOs AND DON’Ts IN DATA VISUALISATION
5. Keep it simple but insightful
PYTHON DATA VISUALISATION LIBRARIES
DATA VISUALISATION WITH PYTHON
• Python is widely used for data visualization due to its
simplicity, versatility, and rich ecosystem of libraries.
• Basic plots: Matplotlib or Seaborn.
• Interactive dashboards: Plotly
• Quick declarative visuals: ggplot
DATA VISUALISATION WITH MATPLOTLIB
INTRODUCTION TO MATPLOTLIB
• Matplotlib is “a comprehensive library for creating static,
animated, and interactive visualizations in Python. Matplotlib
makes easy things easy and hard things possible” (matplotlib,
2025).
• Most of the Matplotlib utilities lie under the pyplot submodule. We access just the pyplot module and give it an alias.
Credit: w3schools
BASIC PLOTS
x-axis values; y-axis values; the plot() method for plotting; show() to display the graph/output
Credit: google colab
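A minimal sketch of the basic plot; the `Agg` backend line is an addition so the script also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]       # x-axis values
y = [3, 8, 1, 10]      # y-axis values

plt.plot(x, y)         # the plotting method
plt.show()             # display the graph
```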
LINE CHARTS
Notice that we have
just single points
Credit: google colab
MARKERS
Markers
Credit: w3schools
MARKERS
Full list - https://www.w3schools.com/python/matplotlib_markers.asp
Credit: w3schools
LINES
Lines
Credit: w3schools
LINES
Full list - https://www.w3schools.com/python/matplotlib_line.asp
Credit: w3schools
LINES
• Lines have other properties that allow for modifying colors,
line width, etc.
color = 'red'
linewidth = 20.5
Credit: w3schools
MULTIPLE LINES
Credit: w3schools
LABELS
• Have you noticed that all our visuals do not communicate any
specific insight?
• Pyplot allows users to set labels to define the information
communicated, e.g., title, x-axis, y-axis
What is this visual
communicating?
Credit: w3schools
ADDING LABELS
xlabel
Credit: w3schools
Title
ADDING TITLE
Y-values
Title
Credit: w3schools X-values
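Adding a title and axis labels (the label text is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [50, 55, 60]

plt.plot(x, y)
plt.title("Daily Sales")   # what the visual communicates
plt.xlabel("Day")
plt.ylabel("Sales (GHS)")
plt.show()
```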
MULTIPLE PLOTS
• Sometimes, you may want to display multiple visuals on the
same graph. Matplotlib provides the subplot function to achieve
this.
• plt.subplot(rows, columns, position)
• rows indicate the number of rows (integer)
• columns indicate the number of columns (integer)
• position indicates whether the current graph should be displayed
first, second, …, nth
Credit: w3schools
MULTIPLE COLUMNS PLOT: 1 row, 2 columns subplots
First subplot
Second subplot
Credit: w3schools
MULTIPLE ROWS PLOT: 2 rows, 1 column subplots
First subplot
Second subplot
Credit: w3schools
MULTIPLE ROWS AND COLUMNS PLOT
Credit: w3schools Try this
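A sketch of a 1-row, 2-column subplot layout:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

x = [1, 2, 3]

plt.subplot(1, 2, 1)       # 1 row, 2 columns, first position
plt.plot(x, [1, 4, 9])

plt.subplot(1, 2, 2)       # 1 row, 2 columns, second position
plt.plot(x, [9, 4, 1])

plt.show()
```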
SCATTER PLOTS
• We can use the scatter method to achieve this.
Credit: w3schools
SCATTER PLOTS
Credit: w3schools
SCATTER PLOTS
• Note that the scatter method allows for setting marker colors as
well
Red color for the first plot
Blue color for the second plot
Credit: w3schools
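A scatter-plot sketch with a marker color (the data is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

ages = [25, 32, 41, 28, 52]
salary = [2500, 3200, 4100, 2600, 5000]

plt.scatter(ages, salary, color="red")   # marker colors can be set
plt.show()
```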
BAR PLOTS
• We can use the bar method for bar charts
• The bar plot has properties that allow for changing colors, bar
width, height, etc.
• Color = (string)
• Width = (float)
• Height = (float)
Credit: w3schools
BAR PLOTS
Using the bar function
Credit: w3schools
BAR PLOTS
What will this graph look like?
Credit: w3schools
HORIZONTAL BARS (barh)
Notice we used the barh
method
Credit: w3schools
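Vertical and horizontal bar sketches (the data is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [3, 8, 1]

plt.bar(categories, values, color="green", width=0.5)  # vertical bars
plt.show()

plt.barh(categories, values)                           # horizontal bars
plt.show()
```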
PIE CHARTS
• We can use the pie method to visualize data using a pie chart
The pie method
Credit: w3schools
PIE CHARTS
• We can override the default colors
List of colors for each label
Credit: w3schools
CHART LEGEND
• We can easily define a legend for the chart
Legend
Credit: w3schools
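A pie chart with overridden colors and a legend (the data is assumed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

sizes = [35, 25, 25, 15]
labels = ["Apples", "Bananas", "Cherries", "Dates"]
colors = ["red", "yellow", "pink", "brown"]   # override the defaults

plt.pie(sizes, labels=labels, colors=colors)
plt.legend(title="Four Fruits")               # legend for the chart
plt.show()
```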
NEXT STEPS
TO THIS END…
We know;
• The basics for creating and handling visualizations
• In our next session, we will learn about building basic predictive
machine learning models
NEXT STEPS
Introduction to
Machine Learning
ANY QUESTIONS??
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING
Machine learning is a field of AI that involves the development of
algorithms and statistical models that enable computers to learn and
improve their performance on a specific task without being explicitly
programmed.
Supervised learning learns from labeled data; unsupervised learning learns from unlabeled data; reinforcement learning learns to make decisions by interacting with an environment.
MACHINE LEARNING MODELS
Machine learning models can range from simple linear regression to
complex deep neural networks.
Simple linear regression
SIMPLE LINEAR REGRESSION MODEL
Data preprocessing (clean data, split data); Build model (select model); Evaluate (check accuracy)
OUR FIRST MACHINE LEARNING MODEL
Snapshot of the housing
dataset
DATA INGESTION
Import
packages
Load data
DATA CLEANING
Handle duplicates
There are no
missing values
DATA CLEANING
Column data
types
We will be
working with
the integer data
types at this
stage.
FEATURE SELECTION
Predictors
What we want
to predict
MODEL SELECTION
Define: What type of model will it be? A decision tree? Some other
type of model? Some other parameters of the model type are
specified too.
Fit: Capture patterns from provided data. This is the heart of
modeling.
Predict: Just what it sounds like
Evaluate: Determine how accurate the model's predictions are.
In this case we want to build a very basic
linear regression model using the scikit
learn library
MODEL SELECTION
Importing the linear regression model
Create the model
Train the model
MAKING PREDICTIONS
We predict with a set of predictors
The predictions
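A minimal end-to-end sketch with scikit-learn; the tiny area-to-price housing data is invented for illustration:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: area (sq ft) -> price
X = [[1000], [1500], [2000], [2500]]   # predictors
y = [100000, 150000, 200000, 250000]   # what we want to predict

model = LinearRegression()   # create the model
model.fit(X, y)              # train the model

pred = model.predict([[1800]])   # predict with a set of predictors
print(pred)
```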
DECISION TREE
SIMPLE DECISION TREE MODEL
Data preprocessing (clean data, split data); Build model (select model); Evaluate (check accuracy)
DECISION TREE MODEL
Machine learning models can range from simple linear regression to
complex deep neural networks.
Decision Tree
DECISION TREE
Import the decision tree model from sklearn
Train model
Make predictions
Predicted VS
Actual are the
same. That is a
100% accuracy.
BUT WHY??
LET'S MODIFY OUR MODEL BY
INTRODUCING TRAINING AND TEST
DATASETS
We realized that our model performed well with an accuracy of 100%.
This is unlikely in real-world scenarios.
The reason for the 100% accuracy is that we were trying to predict Y
values with X values the model had already seen. The model saw
them in the training stage.
What about testing our model on data that the model has not seen
before??
Let’s give it a shot!!!
INGESTION, CLEANING, AND SELECTING
VARIABLES
We import the
decision tree
model
Dependent variable; independent variables
SPLIT DATA
The method for
splitting the data
SPLIT DATA
data 80% for training and 20%
for testing
Dataset for
training
Dataset for
testing
MODEL SELECTION
Train dataset
Test dataset
MODEL PERFORMANCE
Checks error
margin
Error margin
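The split-train-evaluate flow in one sketch; the data is synthetic and the parameter values are assumptions:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: area -> price with a little structure
X = [[i] for i in range(100, 200)]
y = [100 * i + (i % 7) for i in range(100, 200)]

# Hold back 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)          # train on the training set only

preds = model.predict(X_test)        # predict on unseen data
print(mean_absolute_error(y_test, preds))  # the error margin
```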
LET'S MODIFY THE MODEL A BIT BY
SPECIFYING LEAVES
Error margin before updating parameter
Error margin after updating
parameter
PROBLEM OF UNDERFITTING AND
OVERFITTING
DIFFERENT LEVELS OF LEAVES
Error margin is high for 50 leaves
HANDLING CATEGORICAL DATA
CATEGORICAL DATA
Have you realized that we couldn’t include these attributes in the model?
HANDLE CATEGORICAL COLUMNS
Label Encoder One-Hot-Encoder Dummies
LABEL ENCODERS
Importing LabelEncoder
LABEL ENCODERS’
Columns of interest. We
believe that these columns
predict house prices. We
need to convert them to
numerical forms
TRANSFORMING CATEGORICAL COLUMNS
Instantiate Label encoder Transform values Categorical column
to convert
ADD TRANSFORMED COLUMNS TO
DATAFRAME
New column name Transformed values
ADD TRANSFORMED COLUMNS TO
DATAFRAME
New column name Transformed values
SNAPSHOT OF TRANSFORMED COLUMNS
New columns added
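A LabelEncoder sketch on an assumed furnishing-status column:

```python
from sklearn.preprocessing import LabelEncoder

furnishing = ["furnished", "unfurnished", "semi-furnished", "furnished"]

le = LabelEncoder()                   # instantiate the label encoder
codes = le.fit_transform(furnishing)  # text categories -> integer codes
print(codes)
print(list(le.classes_))              # classes are sorted alphabetically
```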
INDEPENDENT & DEPENDENT VARIABLES
Select columns based on data types: exclude columns with data type object. Drop the price column; by default it would be included because we are selecting all columns other than objects.
DUMMIES columns
Pandas method to handle
categorical columns
Note that it creates multiple columns for each of them
based on the number of unique values in the column
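A get_dummies sketch on an assumed `mainroad` column:

```python
import pandas as pd

df = pd.DataFrame({"mainroad": ["yes", "no", "yes"],
                   "price": [100, 80, 120]})

# One new column per unique value of each categorical column
dummies = pd.get_dummies(df, columns=["mainroad"])
print(dummies.columns.tolist())
```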
INDEPENDENT & DEPENDENT VARIABLES
Select columns based on data types: exclude columns with data type object. Drop the price column; by default it would be included because we are selecting all columns other than objects.
Task 1: Build a model with either linear
regression or decision tree and report on the
best model. Remember to apply all skills and
knowledge you have acquired especially
splitting data set into training and testing, and
encoding categorical columns
ENSEMBLE MODELS
RANDOM FOREST MODEL
Ensemble models combine multiple individual models to improve predictive
performance. A popular ensemble method is RandomForest, but there are
others like Gradient Boosting and AdaBoost.
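A RandomForest sketch on synthetic data; the parameter values are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

X = [[i] for i in range(50)]
y = [2 * i for i in range(50)]

# An ensemble of decision trees whose predictions are averaged
model = RandomForestRegressor(n_estimators=10, random_state=1)
model.fit(X, y)
print(model.predict([[25]]))
```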
ANY QUESTIONS??