Python For Data Science 2025 Slides

The document serves as an introduction to Python for Data Science, covering essential concepts such as data types, structured and unstructured data, and the importance of data quality. It discusses the data science ecosystem, including data engineering, analytics, and ethical considerations, while also explaining relational databases, normalization, and SQL. The content emphasizes the challenges of managing big data and the necessity of building data pipelines for effective data management and analysis.


INTRODUCTION TO PYTHON FOR DATA SCIENCE

Pasty Asamoah
+233 (0) 546 116 102
[email protected]

Kwame Nkrumah University of Science and Technology


School of Business
Supply Chain and Information Systems Dept.
Images used in this presentation are sourced from various online platforms. Credit goes to
the respective creators and owners. I apologize for any omissions in attribution and
appreciate the work of the original content creators.
THE CONCEPT OF DATA
DATA

Data is raw facts: anything you can measure or record.


STRUCTURED DATA

• Follows a rigid format.


• Well-defined structure.
• Organized into rows and columns (tabular).
• Stored in a well-defined schema such as databases.
STRUCTURED DATA
UNSTRUCTURED DATA

• Complex and mostly qualitative in nature.


• Not organized in rows and columns.
• Does not have an easily identifiable structure.
• Does not follow any particular rigid format or sequence.
UNSTRUCTURED DATA
SEMI-STRUCTURED DATA

• A mix of structured and unstructured data


• It lacks a rigid schema.
BIG DATA
THE BIG DATA PROBLEM

Data size is growing exponentially! It is becoming BIG

Image courtesy: IDC

But why is this a PROBLEM ?


Well,…
THE BIG DATA PROBLEM
The data coming in is:
• Large (Volume),
• A mix of structured, unstructured, and semi-structured (Variety),
• Generated in near real-time (Velocity),
• And it contains BUSINESS VALUE.

While big data is valuable, it is becoming increasingly difficult to manage using traditional approaches.
DEALING WITH BIG DATA
We develop capabilities to leverage big data to drive
performance!!
BIG DATA - FACEBOOK

https://www.youtube.com/watch?v=_r97qdyQtIk
THE DATA SCIENCE CONCEPT
DATA SCIENCE

• Data science is a multidisciplinary field of study whose goal is to address the challenges of big data. Data scientists solve problems with data!

• "Data science is an interdisciplinary field of scientific methods, processes, algorithms, and systems to extract knowledge or insights from data in various forms, either structured or unstructured..." - Wikipedia

• Data scientists manage, manipulate, extract, and interpret knowledge from tremendous amounts of data.
DATA SCIENCE

• Some say: data science is a link between computational, statistical, and substantive expertise.
DATA SCIENCE
• Others say
DATA SCIENCE VS BIG DATA

• Of course, data science is NOT the same as big data.

• See big data as the "raw material", and data science as the "processing of the raw material" for insights and better understanding (application).

• Data science is about "applications", therefore any domain with large data sets is a potential candidate.
DATA SCIENCE APPLICATIONS

Source: https://indico.ictp.it/event/7658/session/10/contribution/58/material/slides/0.pdf
DATA SCIENCE APPLICATIONS

Source: https://indico.ictp.it/event/7658/session/10/contribution/58/material/slides/0.pdf
DATA SCIENCE ECOSYSTEM
DATA SCIENCE ECOSYSTEM

• Data Engineering: building and managing data pipelines

• Data Analytics: data exploration, building models and algorithms, data visualization and storytelling

• Data Protection & Ethics: data security, privacy, ethical concerns and regulatory issues
DATA ENGINEERING

• Data engineering is the "building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science" - Wikipedia

• Data engineering is in most cases concerned with building data pipelines, leveraging varied platforms and tools.
DATA PIPELINES EXPLAINED

• From the concept of big data, we understand that organizational data is stored in or sourced from varied sources.

• These are called sources.
DATA PIPELINES EXPLAINED

• For analysis and data mining, the organization would like to collect and store all data from the varied sources in a single place: a data warehouse or data lake. These are called destinations.
DATA PIPELINE

• The organization will now need to build an automated process to transfer the data from the sources to the destination.

• The automated process involved in moving the data from the varied sources to the destination (data warehouse / data lake) is called the data pipeline.

Source: Nischay Thapa


ACTIVITIES IN BUILDING A DATA
PIPELINE

• Building a data pipeline involves a series of activities to ensure the quality of the data transferred.

Source: Nischay Thapa


WHY DATA QUALITY MATTERS

• Building a good data pipeline is an important aspect of data science because the quality of the data affects predictions.

Source: https://data.cs.sfu.ca/QjZo/slides.pdf
DATA QUALITY DIMENSIONS

Source: https://data.cs.sfu.ca/QjZo/slides.pdf
COMMON TYPES OF DATA PIPELINES

1. ETL Pipelines (Most used in data warehouses)


• Extracts data from various sources
• Transforms it (e.g., remove duplicates, special chars, etc.)
• Loads it into the destination system.

2. ELT Pipelines (Most used in data lakes)


• Extracts data from various sources
• Loads it into the destination system.
• Transforms it (e.g., remove duplicates, special chars, etc.)

3. Batch Data Pipeline


• Process data in large chunks
• At specific intervals
• Used for non-time-sensitive data

4. Streaming Data Pipeline


• Process data in real-time
• Used for time-sensitive data (e.g., financial transactions)
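The ETL order above can be sketched in plain Python. This is a minimal, illustrative batch ETL job with invented function and variable names; real pipelines use tools such as Airflow or Pentaho.

```python
# Minimal batch ETL sketch: Extract -> Transform -> Load.
# All names here are illustrative, not from any specific tool.

def extract():
    # Pretend these rows came from several sources (files, APIs, databases)
    return [
        {"name": "Ama ", "age": "23"},
        {"name": "Kofi", "age": "30"},
        {"name": "Kofi", "age": "30"},   # duplicate to be removed
    ]

def transform(rows):
    # Clean the data: strip whitespace, cast types, drop duplicates
    cleaned = [{"name": r["name"].strip(), "age": int(r["age"])} for r in rows]
    unique = []
    for r in cleaned:
        if r not in unique:
            unique.append(r)
    return unique

def load(rows, destination):
    # In a real ETL job this would write to a data warehouse
    destination.extend(rows)

warehouse = []                            # stands in for the destination system
load(transform(extract()), warehouse)     # ETL order: transform happens before load
print(len(warehouse))                     # 2 rows remain after de-duplication
```

In an ELT pipeline the same steps run in a different order: `load(extract(), lake)` first, with the transform applied later inside the destination system.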
TOOLS AND PLATFORMS FOR
BUILDING DATA PIPELINES
DATA ANALYTICS

• Data analytics forms part of the data science ecosystem.

• It is "the application of statistical and machine learning techniques to draw insights from data under study and to make predictions about the behavior of the system under study" - M. TAMER ÖZSU

• Statistics draws inferences about a population from a sample.

• Machine learning finds generalizable predictive patterns and models.
TYPES OF DATA ANALYTICS
LEVEL OF COMPLEXITY IN DATA
ANALYTICS
DATA ANALYTICS TASKS AND METHODS

1. Clustering – identifying groups and structures that are “similar”


DATA ANALYTICS TASKS AND METHODS

2. Outlier detection – identifying anomalies


DATA ANALYTICS TASKS AND METHODS

3. Regression – best-fit model with the least error margin


DATA ANALYTICS TASKS AND METHODS

4. Summarisation – a more compact representation of the data


DATA PROTECTION & ETHICS
• Data needs to be protected
Facebook's Cambridge Analytica data
scandal

https://www.youtube.com/watch?v=VDR8qGmyEQg
DATA PROTECTION & ETHICS
DIMENSIONS OF DATA PROTECTION

PRIVACY
• Proper handling, processing, storage and usage of information
• Privacy policies
• Data retention & deletion policies
• Third-party management
• User consent

SECURITY
• Protecting information from unauthorized access
• Encryption
• Infrastructure security
• Access control
• Monitoring


DATA SCIENCE LIFECYCLE
DATA SCIENCE LIFECYCLE

Source: https://data.cs.sfu.ca/QjZo/slides.pdf
DATA SCIENCE LIFECYCLE

Source: https://data.cs.sfu.ca/QjZo/slides.pdf
NEXT STEPS
TO THIS END…
We know that:

• big data holds value but presents a challenge when managed with traditional data management tools.

• organizations develop data science capabilities to leverage big data for performance.

• the data science competence involves data engineering, where data pipelines are built to ensure quality and the continuous flow of data (data management).

• the destinations of data pipelines are either data lakes or data warehouses (there are of course others, like data marts), which are simply databases.

• for the data analyst or scientist to build models, they will sometimes need to query such databases, hence the need to understand basic database concepts and SQL. In some cases, data is presented in files (e.g., CSV).
NEXT STEPS

Database and SQL


for Data Science
ANY QUESTIONS??
RELATIONAL DATABASES
DATABASES
When data is generated, we need to save or keep it in a safe place. The place where the data is stored and managed is called a "database".

• A database is simply a collection of organized and interconnected data that is stored electronically.

• A database is a repository for storing, managing, and retrieving vast amounts of data.
DATABASES
But wait, I have students
data, social media feed, and
financial data.

Should I put them together?

Well, create a
Table for
each data set.
DATABASES

Relational databases are a collection of tables and other elements (e.g., views, stored procedures, etc.).
THE CONCEPT OF TABLES IN RDBMS
A table is a collection of data elements organized in terms of rows and
columns. A table is also considered as a convenient representation of
relations. It is the simplest form of data storage.
CHARACTERISTICS OF TABLES

Records or Rows: A single entry in a table is called a record or row. A record in a table represents a set of related data.

Attributes or Columns: A table consists of several records (rows); each record can be broken down into several smaller parts of data known as attributes. The table above has four attributes: ID, Name, Age and Salary.

Data types: Data types are classifications that identify the possible values that columns in a table can store (e.g., text, number, date).
DATA TYPES

Each column in a database table is required to have a data type.

The database developer must decide what type of data will be stored in each column when creating a table. SQL data types can be broadly categorized as:

• String data types: text and strings
• Numeric data types: numerical values
• Date & Time data types: dates and times


DATA TYPES – (STRING)

source: w3schools.com
DATA TYPES – (NUMERIC)

source: w3schools.com
DATA TYPES – (DATE & TIME)

source: w3schools.com
CONSTRAINTS
Constraints are the rules enforced on the data columns of a table. They are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the database.

Commonly used constraints in relational database design are:

NOT NULL − Column cannot have a NULL value.
AUTO_INCREMENT − Increases the column value automatically.
DEFAULT − Default value for an empty column.
UNIQUE − Ensures unique values in a column.
PRIMARY KEY − Uniquely identifies each record.
FOREIGN KEY − Identifies a record in a related table.
INDEX − Used to create and retrieve data from the database very quickly.
RELATIONAL DATABASE KEYS
Relationships between tables in a relational database are created with relational keys. A relational key is an attribute that can uniquely identify a particular row or record in a relation. There are two (2) main relational keys:

Primary Key: A primary key is a special relational database table column designated to uniquely identify each table record. It is used as a unique identifier to quickly parse data within the table. A table cannot have more than one primary key.

Foreign Key: A foreign key is a column or set of columns in one table that refers to unique data values (often the primary key data) in another table.
PRIMARY AND FOREIGN KEYS IN ACTION
DATA INTEGRITY
DATA INTEGRITY
• Data integrity is about ensuring the accuracy, completeness,
consistency, and validity of an organization's data.

• Designing databases to ensure data integrity guarantees accurate and correct data.

• But why is data integrity important to us?

Source: https://data.cs.sfu.ca/QjZo/slides.pdf
ASPECTS OF DATA INTEGRITY

Entity Integrity: Ensures there are no duplicate rows in a table.

Domain Integrity: Enforces valid entries in tables using constraints.

Referential Integrity: Ensures rows that are used by other records cannot be deleted.

User-Defined Integrity: Enforces specific business rules that do not fall within entity, domain, or referential integrity.
DATABASE NORMALIZATION
Consider the class table below:

1. Do you see that students, courses, and instructors information are put
together in a single table?

2. What happens to the records if instructor “Peter” changes his name to “John
Doe” ?

3. Can you spot redundancy?


DATABASE NORMALIZATION

To efficiently organize data in a database, we need to normalize it.

Database normalization is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships.

Reasons for database normalization:

1. Eliminate redundant data (e.g., storing the same data in multiple tables)

2. Ensure data dependencies make sense.

NORMAL FORMS (NF)

• Normal forms are guidelines that help in designing database that


are efficient, organized, and free from data anomalies.
1NF
FIRST NORMAL FORM (1NF)
A table is in 1NF when it has unique values and no
repeating groups

ORIGINAL TABLE

Source: https://byjus.com/gate/first-normal-form-in-dbms/
2NF
SECOND NORMAL FORM (2NF)
A table is in 2NF when it is in 1NF with no partial
dependency (an attribute in a table depends on only a part
of the primary key and not the whole key)

ORIGINAL TABLE

Source: https://byjus.com/gate/second-normal-form-in-dbms/
3NF
THIRD NORMAL FORM (3NF)
A table is in 3NF when it is in 2NF with no transitive
dependency

ORIGINAL TABLE

Source: https://byjus.com/gate/third-normal-form-in-dbms/
DENORMALIZATION

• Denormalization refers to the process of deliberately introducing redundancy into a relational database by grouping data that is distributed across multiple tables.

• This is done to improve the performance of certain queries at the cost of data redundancy.

• Data scientists often work with denormalized databases, commonly called data warehouses.
ENTITY RELATIONSHIP MODELS (ERD)
ENTITY RELATIONSHIP DIAGRAM
ERD

• ER model forms the basis of an ER diagram


• ERD represents conceptual database as viewed by end user
• ERDs depict database’s main components:
• Entities (the object)
• Attributes (the characteristics and constraints)
• Relationships (links)
ERD - ENTITY

• An ERD refers to an entity set, not to a single entity occurrence
• An entity corresponds to a table, not to a row, in the relational environment
• The entity name, a noun, is written in capital letters
ERD - ATTRIBUTE

• Characteristics of entities
ERD - RELATIONSHIP

• Association between entities
• Participants are the entities that participate in a relationship
• Relationships between entities always operate in both directions
• A relationship can be classified by its connectivity (e.g., 1:M)
What are these
weird symbols?
ERD – CONNECTIVITY & CARDINALITY

• Connectivity describes the relationship classification


• Cardinality expresses minimum and maximum number of entity
occurrences associated with one occurrence of related entity
• Established by very concise statements known as business rules
SIMPLE HOSPITAL ERD
CLASS ERD
STRUCTURED QUERY LANGUAGE - I
STRUCTURED QUERY LANGUAGE
(SQL)

• SQL is a programming language designed for managing and manipulating relational databases.

• It provides a standardized way to communicate with relational database management systems (RDBMS) and perform various operations such as querying, inserting, updating, and deleting data.

• SQL allows users to define the structure of databases, create tables, specify relationships between tables, and set constraints to ensure data integrity. It also provides a set of commands, known as SQL statements, to perform operations on the data stored in the database.
SQL
• The standard SQL commands to interact with relational databases are
• CREATE <databases and tables>
• SELECT <tables>
• INSERT <tables>
• UPDATE <tables>
• DELETE <tables>
• DROP <databases and tables>

• These commands can be categorized based on their nature as:


• Data Definition Language
• Data Manipulation Language
• Data Control Language
• Data Query Language
DATA DEFINITION LANGUAGE

DATA MANIPULATION LANGUAGE

Source: tutorials point


DATA CONTROL LANGUAGE

DATA QUERY LANGUAGE

Source: tutorials point


STRUCTURED QUERY LANGUAGE - II
SQL – CREATE, USE & DROP DATABASE
The SQL CREATE DATABASE statement is used to create a new SQL database.

Syntax: CREATE DATABASE DatabaseName;
Example: CREATE DATABASE KSB;

The SQL SHOW DATABASES statement is used to list all SQL databases.

Syntax & Example: SHOW DATABASES;

The SQL USE command is used to select an SQL database.

Syntax: USE DatabaseName;
Example: USE KSB;

The SQL DROP command is used to delete an SQL database.

Syntax: DROP DATABASE DatabaseName;
Example: DROP DATABASE KSB;


SQL – CREATE & DROP TABLES
The SQL CREATE TABLE statement is used to create a new table in an SQL database.

Syntax:
CREATE TABLE TableName (
    FirstColumnName datatype constraint,
    SecondColumnName datatype constraint,
    .....
    LastColumnName datatype constraint,
    PRIMARY KEY ( ColumnName )
);

Example:
CREATE TABLE CUSTOMERS (
    ID INT NOT NULL AUTO_INCREMENT,
    NAME VARCHAR(20) NOT NULL,
    AGE INT NOT NULL,
    PRIMARY KEY (ID)
);

The SQL DROP command is also used to delete tables, just as with databases.

Syntax: DROP TABLE TableName;
Example: DROP TABLE CUSTOMERS;
SQL – INSERT QUERY

The SQL INSERT statement is used to add new records of data to a table in the database.

Syntax:
INSERT INTO TableName ( FirstColumnName, SecondColumnName, ..., LastColumnName )
VALUES ( FirstValue, SecondValue, ..., LastValue );

Example 1 (named columns; ID is filled in automatically by AUTO_INCREMENT):
INSERT INTO CUSTOMERS (NAME, AGE)
VALUES ('Esther Ama Amoh', 23),
       ('John Doe', 30),
       ('Jane Smith', 25);

Example 2 (no column list, so a value must be supplied for every column, including ID):
INSERT INTO CUSTOMERS VALUES
       (1, 'Esther Ama Amoh', 23),
       (2, 'John Doe', 30),
       (3, 'Jane Smith', 25);


SQL – SELECT QUERY
The SQL SELECT statement is used to fetch data from a database table; it returns the data in the form of a result table.

Syntax:
SELECT FirstColumnName, SecondColumnName, ..., LastColumnName FROM TableName;
OR
SELECT * FROM TableName;

Example 1:
SELECT ID, NAME, AGE FROM CUSTOMERS;

Example 2:
SELECT * FROM CUSTOMERS;

Example 3:
SELECT * FROM CUSTOMERS WHERE ID > 1;


SQL – UPDATE QUERY
The SQL UPDATE statement is used to modify existing records in a table.

Syntax:
UPDATE TableName SET
    FirstColumnName = NewValue1,
    SecondColumnName = NewValue2,
    ...
    LastColumnName = NewValueN
WHERE [condition];

Example 1:
UPDATE CUSTOMERS SET
    NAME = 'Esther Nana Ama Amoh'
WHERE ID = 1;

Example 2:
UPDATE CUSTOMERS SET
    NAME = 'Emmanuel Ackah'
WHERE (AGE > 20) AND (AGE < 40);
SQL – DELETE QUERY

The SQL DELETE statement is used to delete existing records from a table.

Syntax:
DELETE FROM TableName WHERE [condition];

Example 1:
DELETE FROM CUSTOMERS WHERE ID = 1;

Example 2:
DELETE FROM CUSTOMERS WHERE AGE < 18;
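The CREATE/INSERT/SELECT/UPDATE/DELETE statements above can be practised without installing a MySQL server by using Python's built-in sqlite3 module. SQLite's dialect differs slightly from MySQL (INTEGER PRIMARY KEY AUTOINCREMENT instead of AUTO_INCREMENT), so treat this as an adapted sketch of the same flow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
cur = conn.cursor()

# CREATE TABLE (SQLite spelling of the auto-increment primary key)
cur.execute("""
    CREATE TABLE CUSTOMERS (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        NAME VARCHAR(20) NOT NULL,
        AGE INT NOT NULL
    )
""")

# INSERT: ? placeholders keep the values safely separate from the SQL text
cur.executemany(
    "INSERT INTO CUSTOMERS (NAME, AGE) VALUES (?, ?)",
    [("Esther Ama Amoh", 23), ("John Doe", 30), ("Jane Smith", 25)],
)

# UPDATE the first customer's name
cur.execute("UPDATE CUSTOMERS SET NAME = 'Esther Nana Ama Amoh' WHERE ID = 1")

# DELETE customers younger than 25
cur.execute("DELETE FROM CUSTOMERS WHERE AGE < 25")

# SELECT what is left
rows = cur.execute("SELECT ID, NAME, AGE FROM CUSTOMERS").fetchall()
print(rows)
```

The same statements run unchanged (apart from the AUTO_INCREMENT spelling) in MySQL Workbench.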
STRUCTURED QUERY LANGUAGE - III
ADVANCED CONCEPTS IN SQL

In this lecture, we're focusing on the basics. For mastery of more advanced concepts like stored procedures, JOINs, conditions, aggregate functions, etc., these platforms offer impressive free tutorials for your attention.

PERSONAL ASSIGNMENT

o https://www.programiz.com/SQL
o https://www.w3schools.com/sql/
o https://www.codecademy.com/learn/learn-sql
o https://www.tutorialspoint.com/sql/index.htm
o https://www.sqltutorial.org/
DATABASE MANAGEMENT SYSTEMS
DATABASE MANAGEMENT SYSTEMS

A DBMS serves as an interface between an end user and a database, allowing users to create, read, update, and delete data in the database.

DBMSs also allow for database performance monitoring and tuning.

Examples include SQL Server, MySQL Workbench, DBeaver, etc.
NEXT STEPS
TO THIS END…
We know that:

• databases are simply collections of organized and interconnected data stored electronically.

• in relational databases, data is stored in tables.

• before we set out to create our databases, we leverage ERDs to define the structure and rules of the database and its related tables.

• finally, we leverage the SQL programming language to interact with and manage databases. Most often, we use database management systems: GUI applications that allow users to interact with databases with ease.

• now that we know how to design databases and data pipelines using data engineering techniques, we're ready to learn the fundamentals of the Python programming language, the primary tool we will be using to create and manage our models.
NEXT STEPS

Introduction to
Python
Programming
ANY QUESTIONS??
DATABASES & SQL LAB WORK
DOWNLOAD MYSQL WORKBENCH

https://dev.mysql.com/get/Downloads/MySQLGUITools/mysql-
workbench-community-8.0.34-winx64.msi
MYSQL WORKBENCH INSTALLATION
CREATE CONNECTION
START MYSQL SERVER
INTERFACE
LAB ACTIVITIES
CONTINUITY
DATA ENGINEERING LAB WORK
DOWNLOAD JAVA SE

Download the Java SE installer from Oracle's official Java downloads page.
JAVA SE INSTALLATION

https://www.youtube.com/watch?v=SQykK40fFds
DOWNLOAD PENTAHO

https://privatefilesbucket-community-edition.s3.us-west-
2.amazonaws.com/9.4.0.0-343/ce/client-tools/pdi-ce-9.4.0.0-343.zip
LAB ACTIVITIES
CONTINUITY
NEXT STEPS

Introduction to
Python
Programming
ANY QUESTIONS??
PYTHON PROGRAMMING
PYTHON – Of course, not a snake!
Python is a popular general-purpose programming language created by Guido van Rossum and first released in 1991.

Python can be used for:

• Software & Game Development
• Data Analytics, Visualizations, and Machine Learning
• Scientific Computing
DOWNLOAD & INSTALLATION

Python: https://www.python.org/ftp/python/3.12.1/python-3.12.1-amd64.exe

Anaconda: https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Windows-x86_64.exe
PYTHON SYNTAX

The set of rules that defines how a Python program will be written and
interpreted.

Indentation

Defines a block of
code

Syntax Error

We’ll discover other important rules as we progress

Credit: w3schools
HELLO WORLD! – Programming Tradition

!!! Notice the absence of a semicolon after the closing parenthesis
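Since the slide's screenshot is not reproduced here, a minimal version of the program it shows:

```python
# The traditional first program; note there is no semicolon at the end of the line
message = "Hello World!"
print(message)
```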


VARIABLES
• Variables are containers for storing data values.
• Python has no command for declaring a variable.
• A variable is created the moment you first assign a value to it.

DO YOU REMEMBER BACK IN HIGH SCHOOL?

These are variables in python


VARIABLES
• Variables are containers for storing data values.
• Python has no command for declaring a variable.
• A variable is created the moment you first assign a value to it.

Notes:

• Emmanuella is wrapped in quotation marks.

• print(username) is not in quotation marks.

• print('Female') is wrapped in quotation marks.
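A reconstruction of the example the notes describe (the variable name and values are taken from the notes themselves):

```python
username = "Emmanuella"  # text values are wrapped in quotation marks
print(username)          # no quotes: Python prints the variable's value
print('Female')          # quotes: Python prints the literal text itself
```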
RULES IN NAMING VARIABLES
• A variable name:

• must start with a letter or the underscore character


• cannot start with a number
• can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
• is case-sensitive (age, Age and AGE are three different variables)
• cannot be any of the Python keywords.

Error but why?

Credit: w3schools
PYTHON KEYWORDS
Keywords are predefined, reserved words used in Python programming that have special meanings to the interpreter.

!!! These keywords are reserved by the Python programming language
Credit:programiz
COMMENTS
• In computer programming, comments are hints that we use to make our code more understandable. They are completely ignored by the interpreter. In Python, we use the # symbol for commenting.

Notes: anything after the # symbol is ignored (a comment).

Credit: programiz
DATA TYPES
DATA TYPES
• In computer programming, a data type refers to the type of value a variable holds. The data type of a variable ensures that mathematical, relational or logical operations can be applied without causing errors. Python supports the following:

Credit: w3schools
DATA TYPES

Credit: w3schools
NUMERIC DATA TYPES

Notes: for concatenation, we can also use the + symbol.

But how do I know the data type of a variable?
Credit: programiz
CHECK DATA TYPE
• In Python programming, to know the data type of a variable, we use the type() function. Of course, you don't know what functions are yet; we'll talk about them in a moment. For now, understand that we use the type() function to get the data type of a variable.

But how do I convert between data types? Well… notice 'complex', 'float', 'int'.
Credit: programiz
DATA TYPE CONVERSION
• We can easily switch between data types. Pay close attention to the results after we checked the data types: they returned something like <class 'int'>. Now, to convert any numeric value to an integer, we use int().

(Wrapped in the float() function)

The same applies when converting with int(), float(), and complex().


Credit: programiz
Hands-on: 5 minutes maximum

Create:

variable X = 23
variable Y = "12"

Tasks:
1. Print the data type of X
2. Print the data type of Y
3. Compute Z = X + Y such that the result is Z = 35
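One possible solution to the hands-on task (try it yourself before reading):

```python
X = 23
Y = "12"

print(type(X))  # <class 'int'>
print(type(Y))  # <class 'str'>

Z = X + int(Y)  # cast the string "12" to an integer before adding
print(Z)        # 35
```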
STRING DATA TYPE
• In Python programming, strings are enclosed in quotation marks. For instance, 22 is an integer but "22" is a string. We can use single or double quotes to create string variables in Python.

Not in quote

In quote

Credit: programiz
STRING MANIPULATION: LEN()
• There are several operations on strings. For instance, we can get the length of a string, slice parts of it, check for values in it, etc.

("python programming" has 18 characters.)

The len() function counts the number of characters, including white spaces.

Credit: programiz
CHECK STRING EXISTENCE: IN
• To check if a set of characters is present in a string, we use the in keyword. The result is a Boolean: True / False.

(Boolean result)

The in keyword checks whether the set of characters, in sequential order ("prog"), exists in the text.

Credit: programiz
CHECK STRING NOT EXIST: NOT IN
• To check if a set of characters is not present in a string, we use the not in keyword. Basically, the not keyword is for negation. The result is a Boolean: True / False.

(Boolean result)

The not in keyword checks whether the set of characters is absent from the text.

Credit: programiz
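The len(), in, and not in operations from the last three slides, using the slides' sample text:

```python
text = "python programming"

print(len(text))           # 18: every character is counted, including the space
print("prog" in text)      # True: "prog" occurs in the text
print("java" not in text)  # True: "java" does not occur
```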
STRING SLICING
• Sometimes you may want to slice a portion of a string. Imagine you have
the string “Hello world”, but your interest is the text “world”. To slice
those characters from the string, we leverage the slicing technique

• Text = “Hello world”

-i -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
+i 0 1 2 3 4 5 6 7 8 9 10

char H e l l o w o r l d

• Variable[start index : end index]  # start defaults to 0; end defaults to the length of the string

• The length of the result will be end index − start index (when both indices are within range)

Credit: programiz
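The index table above can be tried directly:

```python
text = "Hello world"

print(text[6:11])  # world (index 6 up to, but not including, 11)
print(text[-5:])   # world again, counting from the back
print(text[:5])    # Hello (a missing start index defaults to 0)
```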
STRING SLICING

From the back, the


index starts from -1
Credit: programiz
STRING METHODS
• There are several other operations we can carry out on strings. For instance, we can convert from upper to lower case, check if a string is numeric, etc. Kindly find a complete reference list at:

• https://www.w3schools.com/python/python_strings_methods.asp

Name of the string,


dot, method name

Credit: w3schools
STRING FORMATTING
• There are different approaches to formatting strings:

1. Use + to concatenate strings.

2. Use placeholders with the format() method: each {} placeholder matches the position of a variable passed to format().

3. Inject the variable names directly (f-strings).

• I will be using the 3rd approach in examples.

Credit: w3schools
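The three formatting approaches side by side (the sample name and age are made up for illustration):

```python
name = "Ama"   # sample values, not from the slides
age = 23

print("Name: " + name)                        # 1. concatenation with +
print("Name: {}, Age: {}".format(name, age))  # 2. placeholders with format()
print(f"Name: {name}, Age: {age}")            # 3. f-string (the 3rd approach)
```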
LIST DATA TYPE
• In Python programming, lists are used to store multiple items in a single variable. Have you realized we have been storing a single item per variable? There are instances where you may need to store more than one item.

• A list accepts duplicates, and every item has an index:

  index: 0      1       2       3      4
  item:  apple  banana  cherry  apple  cherry

• We can get the length, count items, slice, etc.

Credit: w3schools
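A sketch of the list above with a few of the operations mentioned:

```python
thislist = ["apple", "banana", "cherry", "apple", "cherry"]  # duplicates are allowed

print(len(thislist))            # 5 items
print(thislist[1])              # banana (indexing starts at 0)
print(thislist[1:3])            # ['banana', 'cherry'] (slicing works as with strings)
print(thislist.count("apple"))  # 2
```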
REPLACE & ADD TO LIST
• Replace: assign a new value at the index of the item to replace (e.g., banana is replaced with blackcurrant).

• Add: call the append() method on the list variable, passing the item to be added to the end of the list.
Credit: w3schools
REMOVE FROM LIST
• Remove: call the remove() method on the list variable.

• We can also achieve this with the del keyword.

Credit: w3schools
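The replace, add, and remove operations from the last two slides in one runnable sketch (the fruit values follow the slides' example):

```python
thislist = ["apple", "banana", "cherry"]

thislist[1] = "blackcurrant"  # replace: banana is replaced by index
thislist.append("orange")     # add: append to the end of the list
thislist.remove("cherry")     # remove by value
del thislist[0]               # remove by index with the del keyword

print(thislist)               # ['blackcurrant', 'orange']
```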
LIST METHODS
• Reference: https://www.w3schools.com/python/python_lists_methods.asp

Credit: w3schools
TUPLE DATA TYPE
• Tuples are used to store multiple items in a single variable. A tuple is a
collection which is ordered and unchangeable. Tuples are written with
round brackets.
Variable name

items

• How would you add "grapes" to the tuple?

• Remember: we can convert the tuple to a list, add "grapes", and convert it back to a tuple!

Tuple Methods Reference:


https://www.w3schools.com/python/python_tuples_methods.asp
Credit: w3schools
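Answering the slide's question, a sketch of the list round-trip trick:

```python
thistuple = ("apple", "banana", "cherry")

# Tuples are unchangeable, so: convert to a list, modify it, convert back
aslist = list(thistuple)
aslist.append("grapes")
thistuple = tuple(aslist)

print(thistuple)  # ('apple', 'banana', 'cherry', 'grapes')
```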
SET DATA TYPE
• Sets are unordered, unindexed data types used to store multiple items in a single variable. Sets remove duplicates from a list of items.

Variable name

items

It removed all
duplicated values
Credit: w3schools
MANIPULATING SETS
• ADD: use the add() method to add a single item.

• UPDATE: call the update() method on set 1 to merge the items of set 2 into a single set.

Credit: w3schools
SET METHODS
• Reference: https://www.w3schools.com/python/python_sets_methods.asp

Credit: w3schools
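The duplicate removal, add(), and update() behaviour described above, in one sketch:

```python
thisset = {"apple", "banana", "cherry", "apple"}  # the duplicate "apple" is dropped
print(len(thisset))  # 3

thisset.add("orange")                    # add a single item
thisset.update({"mango", "pineapple"})   # merge another set in

print(len(thisset))  # 6
```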
DICTIONARY DATA TYPE
• Dictionaries are used to store data values in key:value pairs. Dictionary
items are ordered, changeable, and do not allow duplicates.

Values

Use colon
Dictionary keys

• We can easily access the car brand as:

thisdict["brand"]

Credit: w3schools, programiz


DICTIONARY METHODS
Methods Reference: https://www.w3schools.com/python/python_dictionaries_methods.asp

Credit: w3schools
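A sketch of the car dictionary from the slide (the model and year values are illustrative, following the w3schools example the slide credits):

```python
thisdict = {
    "brand": "Ford",      # key: value pairs separated by colons
    "model": "Mustang",
    "year": 1964,
}

print(thisdict["brand"])       # access a value by its key: Ford
thisdict["year"] = 2020        # dictionary values are changeable
thisdict["colour"] = "red"     # assigning to a new key adds an item
print(thisdict.get("owner"))   # get() returns None for a missing key
```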
OPERATORS
OPERATORS
• In programming, operators are used to perform operations on variables
and values. There are several of them in python
o Arithmetic operators
o Assignment operators
o Comparison operators
o Logical operators
o Etc.

Credit: w3schools
ASSIGNMENT OPERATORS
• Assignment operators are used to assign values to variables:

The statement
is not true
Credit: w3schools
COMPARISON OPERATORS
• Comparison operators are used to compare two values

The statement
is true
Credit: w3schools
LOGICAL OPERATORS
• Logical operators are used to combine conditional statements.

The statement
is not true

Credit: w3schools
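The assignment, comparison, and logical operators from these slides in one sketch:

```python
x = 5
x += 3                    # assignment operator: same as x = x + 3
print(x)                  # 8

print(x == 8)             # comparison: True
print(x > 3 and x < 10)   # logical and: True only if both sides hold
print(x < 3 or x == 8)    # logical or: True if either side holds
print(not x == 8)         # not inverts the result: False
```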
CONDITIONAL CONTROLS
IF-STATEMENT
• In computer programming, the if statement is a conditional statement. It
is used to execute a block of code only when a specific condition is met.
For example, Suppose we need to assign different grades to students
based on their scores:

• If a student scores above 90, assign grade A


• If a student scores above 75, assign grade B
• If a student scores above 65, assign grade C

colon

Indentation Execute this


block of code if
and only if the
Credit: programiz condition is true
IF-STATEMENT
• The logic

Credit: programiz
IF-STATEMENT

Pay attention to the little space. Note that the print function is
not directly under the “if”. This indicates that the print
statement belongs to the body of the if-statement.

But what if the statement or condition returned False?
We can handle that with an else block

Credit: w3schools
IF-ELSE-STATEMENT

Notice that it did not print anything on the


screen. That is because the condition is
False. And hence, it did not execute the
block. Lets say we want to print some text if
the condition is false.

We included the else: block so


that if the condition fails, we
can still print some text on the
screen
Credit: w3schools
IF-ELIF-STATEMENT
• Sometimes you may want to test multiple conditions. In that case, we
employ the if-elif statement

We included the elif to check


two more conditions. You can
add as many as you wish

The else block

Credit: w3schools
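The grading rules from the if-statement slide, as a runnable if-elif sketch (the else branch returning "F" is an assumption; the slide stops at grade C):

```python
def grade(score):
    # Conditions are checked top to bottom; the first True branch wins
    if score > 90:
        return "A"
    elif score > 75:
        return "B"
    elif score > 65:
        return "C"
    else:
        return "F"   # assumed fallback, not stated on the slide

print(grade(95))  # A
print(grade(80))  # B
print(grade(70))  # C
print(grade(50))  # F
```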
LOOPS
LOOPS
• In programming, a loop is a control flow statement that is used to
repeatedly execute a group of statements as long as the condition is
satisfied. Such a type of statement is also known as an iterative statement

• Python has two (2) primitive loop commands:

• For loop
• While loop
FOR-LOOP
• A for loop is used for iterating over a sequence (that is either a list, a
tuple, a dictionary, a set, or a string). The Python for-loop works less like
the counter-based for-loops in several other programming languages and
more like a for-each loop

Credit: programiz
FOR-LOOP
Items to loop
through

Variable names. It
can be anything

Credit: w3schools
FOR-LOOP: BREAK AND CONTINUE
• The break and continue keywords are important for stepping out of the
loop and skipping items respectively.

The break keyword in action

If it gets to the condition where fruit is banana, Cherry??


it breaks out of the loop. Meaning, it will not
loop through any item after banana

Credit: w3schools
FOR-LOOP: BREAK AND CONTINUE
• The break and continue keywords are important for stepping out of the
loop and skipping items respectively.

The continue keyword in action

If it gets to the condition where fruit is banana, Banana was skipped


it skips it. And continue looping over the rest
of the items

Credit: w3schools
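The break and continue behaviour described above, in one runnable sketch:

```python
fruits = ["apple", "banana", "cherry"]

visited = []
for fruit in fruits:          # fruit takes each item in turn
    if fruit == "banana":
        break                 # leave the loop; "cherry" is never reached
    visited.append(fruit)
print(visited)                # ['apple']

kept = []
for fruit in fruits:
    if fruit == "banana":
        continue              # skip "banana" and keep looping
    kept.append(fruit)
print(kept)                   # ['apple', 'cherry']
```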
NESTED -FOR-LOOP

Credit: w3schools
WHILE-LOOP
• With the while loop we can execute a set of statements as long as a
condition is true. The implication is that, we need to keep track of an
updating element.

Credit: programiz
WHILE-LOOP
Execute as long as this
condition is true

Updates the value of


the conditioning item.
We will get into an
infinite loop without
this line of code

The break keyword in action The continue keyword in action

Credit: w3schools
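A minimal while-loop sketch showing the updating element the slide warns about:

```python
i = 1
seen = []
while i < 6:       # runs as long as this condition is true
    seen.append(i)
    i += 1         # update the counter; without it the loop never ends
print(seen)        # [1, 2, 3, 4, 5]
```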
FUNCTIONS
FUNCTIONS
• A function is a block of code which only runs when it is called. You can
pass data, known as parameters, into a function. A function can return
data as a result.

Credit: programiz
FUNCTIONS
• Defining and calling a function Function definition

Calling the function

NB: Until you make a function call, it will never get executed

Credit: w3schools
FUNCTIONS WITH ARGUMENT
• Defining functions with argument
Parameter

argument
FUNCTIONS WITH ARGUMENT
• Defining functions with argument
Placeholder/ parameter

We append the person


Function in action. name to the Good
morning string. This
will apply to any name
provided as an
argument to the
function
Credit: w3schools
FUNCTIONS WITH ARGUMENT
• Lets modify

Credit: w3schools
FUNCTIONS RETURN VALUES
• So far, our functions do not return values that are reusable. We can use
the return keyword to achieve that. Note that, functions that return values
do not automatically print to the screen.

Returning the sum of No output? But we


the two numbers called the function…

Credit: w3schools
FUNCTIONS RETURN VALUES
• So far, our functions do not return values that are reusable. We can use
the return keyword to achieve that. Note that, functions that return values
do not automatically print to the screen.

Assigning result to a
variable. This is possible
because we are returning the
result after the computation x
+y

Print the returned value

Credit: w3schools
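Definition, arguments, and return values from these function slides together (the greeting text and names are illustrative):

```python
def greet(name):                   # name is the parameter
    return "Good morning " + name  # return hands the result to the caller

def add(x, y):
    return x + y

message = greet("Ama")             # "Ama" is the argument; nothing runs
print(message)                     # until the function is called
result = add(3, 4)
print(result)                      # 7
```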
OBJECT-ORIENTED PROGRAMMING (OOP)
CLASS AND OBJECT
• Python is an object oriented programming language. Almost everything
in Python is an object, with its properties and methods. A Class is like an
object constructor, or a "blueprint" for creating objects.

• In OOP, everything is considered an Object with certain properties


(nouns) and functionalities/methods (verbs).

Credit: programiz
CLASS DEFINITION
• It is very simple to create a class in python

Class name

Class properties

Class methods

Credit: programiz
CLASS DEFINITION

Every method receives self as its first parameter. The name can be
anything, but self is the convention; it refers to the instance of
the class

We are trying to set the name value

We are trying to access the name property


using the dot

Credit: programiz
CLASS INSTANTIATION

We could decide to eliminate this code

Creating an instance of the class Human

Calling the method

Setting the human name

Credit: programiz
CLASS CONSTRUCTORS

constructor

We inject the name of the user when


instantiating the class

Credit: programiz
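A sketch tying the class, property, method, and constructor slides together; the Human class name follows the slides, while the introduce method and "Pasty" value are illustrative assumptions:

```python
class Human:
    def __init__(self, name):
        # the constructor runs when the class is instantiated;
        # self refers to the instance being created
        self.name = name           # set the name property

    def introduce(self):
        return "My name is " + self.name

person = Human("Pasty")            # instantiation injects the name
print(person.name)                 # access the property with the dot
print(person.introduce())          # call the method
```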
CASE

Credit: programiz
MODULES
MODULE
• Consider a module to be the same as a code library. It is a file containing
a set of functions you want to include in your application. To create a
module just save the code you want in a file with the file extension .py

• Create a file and name it: mathematics.py

function
MODULE
• Create another file in the same directory as the mathematics.py and name
it: use.py

Importing the mathematics.py file

Using the add function in the


mathematics.py file

• Note that we imported all the code in the mathematics file but we used
only the add function.

• So lets see how to import only the add function


MODULE

See how we import the add function

See the use too

• Sometimes we can import and rename


MODULE

We introduce the “as” aliasing keyword

We access the functions in the


mathematics file with mt, the alias
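The module workflow from these slides as a single runnable sketch; to keep it in one script it writes mathematics.py programmatically, whereas in practice you would create the file by hand as shown above:

```python
import os
import sys

# Create the mathematics.py module described on the slides
with open("mathematics.py", "w") as f:
    f.write("def add(x, y):\n    return x + y\n")

sys.path.insert(0, os.getcwd())    # make the current folder importable

import mathematics                 # import the whole module
print(mathematics.add(2, 3))       # 5

from mathematics import add        # import only the add function
print(add(2, 3))                   # 5

import mathematics as mt           # "as" gives the module an alias
print(mt.add(10, 5))               # 15
```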
NEXT STEPS
TO THIS END…
We have learnt the basics of python programming specifically;

• Loops (e.g., for-loop and while-loop)


• Functions
• OOP
• Modules

• In the next session, we will learn about data manipulation using


pandas
READING ASSIGNMENT
Be sure to go through these tutorials for a practical experience on python
programming

• https://www.programiz.com/python-programming
• https://www.w3schools.com/python/
• https://www.javatpoint.com/python-tutorial
• https://www.youtube.com/watch?v=QXeEoD0pB3E&list=PLsyeobzWx
l7poL9JTVyndKe62ieoN-MZ3
NEXT STEPS

Data Pre-processing
ANY QUESTIONS??
DATA PREPROCESSING
REAL WORLD DATA CAN BE “MESSY”
Data preprocessing is the crucial first step in data analysis, where you
transform raw data into a clean and understandable format suitable for
further analysis.
DATA PREPROCESSING TECHNIQUES

Credit: w3schools
INTRODUCTION TO PANDAS
PANDAS
Pandas is a Python library used for working with data sets.

Data Visualization Data Cleaning Data Exploration

e.g., heatmap e.g., duplicates e.g., correlation

Data Manipulation

e.g., transformation
READING DATA
We need to import the pandas package to use it.

Pandas package and use as pd

Credit: w3schools
SERIES
In Pandas, a Series is just like a column in a table.

Column without label

Column with label


Credit: w3schools
DATAFRAME
DataFrame is like the whole table.

Table column values

Table column names

Credit: w3schools
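A minimal sketch of a Series and a DataFrame (the calories/duration values follow the w3schools-style example the slides credit):

```python
import pandas as pd

# A Series is like a single labelled column
calories = pd.Series([420, 380, 390], index=["day1", "day2", "day3"])
print(calories["day2"])    # 380

# A DataFrame is the whole table: column names mapped to column values
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})
print(df)
```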
LOCATING ROW VALUE
Locating row value is similar to indexing

First row value

First row, or index zero. Notice one
open and close square bracket.
A slice such as print(df.loc[0:1]) returns rows 0 and 1, because
loc slices are label-inclusive

Credit: w3schools
DATA TYPE CONVERSION
Columns can be converted from one data type to another

These are float data types

Convert column A to integer


data type

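Row lookup with loc and type conversion with astype, sketched on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5, 3.5], "B": [4.0, 5.0, 6.0]})

print(df.loc[0])       # first row, returned as a Series
print(df.loc[0:1])     # rows 0 and 1: loc slices include both ends

df["A"] = df["A"].astype(int)   # convert column A to integer (truncates)
print(df.dtypes)
```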
HANDLING DATA
IMPORT DATA
Often, you’d want to import and work with data other than creating
them manually. In pandas, we can import from array of sources
including CSV.

The directory and


name of the file

Reading file

Credit: w3schools
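With a real file you would write pd.read_csv("data.csv") with the directory and name of the file; in this self-contained sketch StringIO stands in for the file, and the column names are made-up weather-style stand-ins:

```python
import pandas as pd
from io import StringIO

csv_text = """Date,Temperature,Humidity
2025-01-01,31,60
2025-01-02,42,55
2025-01-03,38,58
"""

df = pd.read_csv(StringIO(csv_text))   # same call as reading from disk
print(df)
```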
SNAPSHOT OF THE DATA
HEAD AND TAIL
We can use the head() and tail() methods to have a snapshot of the top-n
and last-n rows of our data

Top 3 records

Last 3 records

Displaying top 3 records


DATA SHAPE
Data shape refers to the number of rows and columns in our data. The
size is the product of the rows and columns df.size

Columns

Rows
DATAFRAME COLUMNS
Dataframe columns are the headings in the table.

Columns

All columns in the


dataframe
FEATURE SELECTION
Selecting table column(s) in pandas is quite easy. The column selection
is on the premise that, you may want to work with a fraction of the
table. For instance, you might be just interested in Temperature column
and not the others.

For instance these two queries are valid and will return the temperature
column. However, the first approach can be used only when the column
name does not contain space:

df.Temperature Returns the top 3 rows


of the temperature
df['Temperature'] column
Selecting
multiple
columns
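The snapshot, shape, and column-selection slides together, on a small made-up Temperature/Humidity table:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [31, 42, 38, 40, 35],
    "Humidity":    [60, 55, 58, 52, 57],
})

print(df.head(3))     # top 3 records
print(df.tail(3))     # last 3 records
print(df.shape)       # (rows, columns): (5, 2)
print(df.size)        # rows x columns: 10
print(df.columns)     # all column headings

print(df.Temperature)                   # only if the name has no space
print(df["Temperature"])                # always works
print(df[["Temperature", "Humidity"]])  # selecting multiple columns
```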
ALL vs UNIQUE VALUES
A column may contain repeated values. The unique() method returns each
distinct value in the column exactly once, unlike selecting all values.

All values in temperature column

Unique values in
temperature column
CONDITIONAL SELECTION
Sometimes, you may want to select columns or rows based on some conditions

Selecting all columns where


temperature is >= 40

Selecting specific columns


where temperature is >= 40
CONDITIONAL SELECTION

What is going on here?
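Unique values and conditional selection together, sketched on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [31, 42, 38, 42, 35],
    "Humidity":    [60, 55, 58, 52, 57],
})

print(df["Temperature"].unique())   # each distinct value exactly once

hot = df[df["Temperature"] >= 40]   # all columns where the condition holds
print(hot)

# only specific columns where the condition holds
print(df.loc[df["Temperature"] >= 40, ["Humidity"]])
```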


GET TO KNOW YOUR DATA
Basic descriptive statistics using pandas: Minimum and Maximum values

Maximum value in each


column?
MEAN / AVERAGE

Mean value of the temperature column

Means of each column


STANDARD DEVIATION

Standard deviation value of the temperature


column

Standard deviation of each


column
CORRELATION AND COVARIANCE

Covariance matrix
Correlation matrix

Correlation matrix
EVERYTHING AT A GLANCE
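The descriptive statistics from these slides, including the at-a-glance summary, can be reproduced as follows (made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [31, 42, 38, 40, 35],
    "Humidity":    [60, 55, 58, 52, 57],
})

print(df["Temperature"].max())    # maximum value: 42
print(df["Temperature"].mean())   # mean/average: 37.2
print(df["Temperature"].std())    # sample standard deviation
print(df.corr())                  # correlation matrix
print(df.cov())                   # covariance matrix
print(df.describe())              # everything at a glance
```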
HANDLING MISSING VALUES
ANY MISSING VALUES?
Missing values can be handled in several ways. For instance, we may want to
drop, impute, interpolate, or even fill it with some values (e.g., average)

Checks if any column has missing values

Have missing
values
WHEN ENTIRE ROW IS MISSING
We may want to drop

Drop row

Drop row if the entire row is missing


WHEN ENTIRE ROW IS MISSING
We may want to drop, but this is not always the best approach

Drop row

Drop row if any of the columns has missing value

It's not the best solution. Our data is finished!


FILL ALL NULL VALUES

Fill every missing value with 12

These are supposed to be categories.


We got it wrong!
FILL NULL VALUES BY COLUMNS
Fill respective columns with specific value

Looks a bit better


FORWARD & BACKWARD FILL

forward

backward

Looks better with the backward fill


INTERPOLATE

Interpolation estimates each missing value
from the neighbouring data points
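The missing-value options from these slides side by side, on synthetic numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temperature": [31.0, np.nan, 38.0, np.nan, 35.0],
    "Humidity":    [60.0, 55.0, np.nan, 52.0, 57.0],
})

print(df.isnull().any())   # which columns contain missing values?

dropped = df.dropna()      # drop every row with any missing value
filled = df.fillna({       # fill each column with its own mean
    "Temperature": df["Temperature"].mean(),
    "Humidity": df["Humidity"].mean(),
})
forward = df.ffill()       # forward fill: copy the previous row's value
backward = df.bfill()      # backward fill: copy the next row's value
interp = df.interpolate()  # estimate from the neighbouring values
print(interp)
```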


HANDLING DUPLICATES
DUPLICATES
Duplicated rows affect results. We handle them by deleting them.

Check if any row is


duplicated

No row is duplicated
DUPLICATES
Duplicated rows affect results. We handle them by deleting them.

Check if any row is


duplicated

No row is duplicated

Replace the original


copy of the data

Drop duplicates
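Detecting and dropping duplicated rows, sketched on a made-up table:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ama", "Kofi", "Ama", "Esi"],
    "score": [80, 65, 80, 90],
})

print(df.duplicated())     # True for each fully repeated row
df = df.drop_duplicates()  # replace the original copy, duplicates removed
print(df.shape)            # (3, 2)
```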
NEXT STEPS
TO THIS END…
We know;

• The basics for creating and handling dataframes

• Basic data cleaning techniques

• In our next session, we will learn about building basic machine


learning models
NEXT STEPS

Introduction to
Machine Learning
ANY QUESTIONS??
DATA VISUALISATION & STORY TELLING
INTRODUCTION

Data is only as good as your ability to understand and communicate it.

It is essential to choose the right visualization to communicate and tell the story
behind the data.

Key insights and understandings are lost when data is ineffectively presented,
which affects the story behind the data.

VISUALISATION? WHAT IS IT??


DATA VISUALISATION
Data visualization is the process of representing data in visual
or graphical formats to facilitate understanding and
communication of complex information.

It involves transforming raw data into charts, graphs, maps,


and other visual elements that convey patterns, trends, and
insights.
IMPORTANCE OF
DATA VISUALISATION
Data visualization is important for several reasons:

1. Enhances data understanding: Enables users to grasp information


quickly and comprehend complex relationships.
IMPORTANCE OF DATA
VISUALISATION’

2. Identification of patterns and trends: Helps uncover hidden


patterns, trends, and relationships in data that may not be apparent
in raw form.
IMPORTANCE OF DATA
VISUALISATION’’

3. Effective communication: Facilitate the communication of data


insights to a broad audience, making it easier to present findings,
tell stories, and convey messages.

4. Data-driven decision making: Well-designed visualizations


empower users to make informed decisions based on data analysis
and identification of trends, outliers, and correlations.

5. Data exploration: Data visualization helps in exploratory


analysis, and storytelling or reporting.
CHALLENGES OF DATA
VISUALISATION
• Data quality and preprocessing: Ensuring data accuracy,
consistency, and completeness is crucial before creating
visualizations.
• Interpreting and avoiding biases: Designing visualizations that
are intuitive and free from misleading interpretations or biases to
ensure accurate understanding by users.

??
CHALLENGES OF DATA
VISUALISATION

• Choosing appropriate visual representations: Selecting the right


chart types, graphs, or maps that effectively represent the data
and align with the message or analysis goals

• Handling large and complex data: Visualizing big datasets or


complex data structures can pose challenges in terms of
scalability, performance, and usability.
STEPS IN DATA VISUALISATION
1. Know the data: Understand the category and type of data.
2. Understand the information needed from the data.
3. Select the appropriate technique and visualization tool.

The steps can be complex in real-life implementation


1. KNOW THE DATA
Data can be quantitative or qualitative in nature. Let’s focus on
what’s measurable.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’
Commonly, the questions asked ranges from nominal comparisons,
time-series, correlations, ranking, deviations, distributions, and
part-to-whole relationships.

Nominal Comparisons

Simple comparison of quantitative values


(e.g.) Total employees, Average salary, etc.

Time-series

Changes in values of a consistent metric over equally
spaced time intervals. (e.g.) Monthly sales.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’’

Correlations

Determines whether there exists a relationship between


variables and the extent of the relationship. (e.g.)
Relationship between employee salary and
performance.

Ranking

Comparison of two or more values' relative


magnitude. (e.g.) Performance ranked from educational
levels of employees from Degree holders to PhD.
2. UNDERSTANDING THE INFORMATION
NEEDED FROM THE DATA’’’
Deviations

Dispersion of the data points from each other


especially the average. (e.g.) Performance of
employees this year versus last year.

Distribution

Data distribution, often around a


central value. It shows how often values occur in a
dataset (e.g.) Age distribution of employees

Part-to-whole relationship

Subset of data compared to the larger whole. It shows a


breakdown of elements that add up to a whole. (e.g.)
Number of employees who were absent today.
3. SELECT APPROPRIATE TECHNIQUE
AND VISUALISATION TOOL
Understanding data and knowing the kind of information needed
(answers) influence the visualization technique and tool applied. We
explore commonly used tools, techniques, and visualizations in
industries.

Data Visualization Tools

Graphical User Interface Tools Programming Languages


TECHNIQUES AND VISUALISATIONS

Column and Bar Charts


Used to show change over time, compare categories or compare parts of a whole. Each bar
or column represents a category, with the length of the bar or column proportional to the
value it represents.

It’s a number line

Column: ideal for visualizing chronological data. Bar: ideal for visualizing data
with long category names
TECHNIQUES AND VISUALISATIONS

Stacked Bar Charts and 100% Stacked Bar Charts

Stacked bar chart 100% stacked bar chart


Stacked: ideal for visualizing chronological data while comparing multiple
part-to-whole relations. 100% stacked: ideal when the category total is not
the focus; it shows the composition of each subcategory over time
TECHNIQUES AND VISUALISATIONS

Double Bar Charts

Double bar chart


Ideal for comparing categories over time. What do you think about this column chart??
TECHNIQUES AND VISUALISATIONS

Pie and Donut Charts


Used to compare categories or compare parts of a whole. It is ideal for small data sets with
fewer categories.

Pie Chart Donut Chart


TECHNIQUES AND VISUALISATIONS

Pie and Donut Charts

What do you think about this Pie Chart??


TECHNIQUES AND VISUALISATIONS

Line Charts
Used to show changes over time (time-series) by using data points represented by dots that are
connected by a straight line. Put differently, it shows time-series relationships with continuous data.
They help show trend, acceleration, deceleration, and volatility.

Shows changes in data over time while


comparing different categories
TECHNIQUES AND VISUALISATIONS

Scatter Plots
Used to show the relationship between two variables. They are best used to show correlation in large
data sets and identifying outliers.

Scatter plot with an outlier


TECHNIQUES AND VISUALISATIONS

Funnel
Used to visualize a linear process that has connected sequential stages. The value of each stage in the
process is indicated by the funnel's width as it gets narrower.
TECHNIQUES AND VISUALISATIONS

Cards
Mostly used to display KPIs. (e.g.) turnover
TECHNIQUES AND VISUALISATIONS

Gauge
A gauge consists of a circular arc which shows a singular value that measures progress towards a KPI or
goal. The line on the arc represents the target or goal and the shading represents the progress made
towards it. The value inside of the arc shows the progress value.
TECHNIQUES AND VISUALISATIONS

Map
Used for visualizing data across different locations and distances. (e.g.) Answer questions on cities or
countries and the related data such as number of employees, sales, etc.
TECHNIQUES AND VISUALISATIONS

Treemap
Used to display large quantities of hierarchically structured data, using nested rectangles. The chart
shows different perspectives of the data by displaying the rectangles as different sizes and colors based
on the frequency of occurrence. It is not ideal for visualising large categories

What do you think about this


treemap??
DOs AND DON’Ts IN DATA
VISUALISATION

1. Avoid slanted labels if possible


DOs AND DON’Ts IN DATA
VISUALISATION

2. Include a Zero baseline


DOs AND DON’Ts IN DATA
VISUALISATION

3. Order your data


DOs AND DON’Ts IN DATA
VISUALISATION

4. Use suitable proportions


DOs AND DON’Ts IN DATA VISUALISATION
5. Keep it simple but insightful
PYTHON DATA VISUALISATION LIBRARIES
DATA VISUALISATION WITH PYTHON
• Python is widely used for data visualization due to its
simplicity, versatility, and rich ecosystem of libraries.
• Basic plots: Matplotlib or Seaborn.
• Interactive dashboards: Plotly
• Quick declarative visuals: ggplot
DATA VISUALISATION WITH MATPLOTLIB
INTRODUCTION TO MATPLOTLIB
• Matplotlib is “a comprehensive library for creating static,
animated, and interactive visualizations in Python. Matplotlib
makes easy things easy and hard things possible” (matplotlib,
2025).

• Most of the Matplotlib utilities lie under the pyplot submodule

Accessing just the pyplot module    Alias

Credit: w3schools
BASIC PLOTS
x-axis values
y-axis values
method for plotting
display the graph

Output/graph

Credit: google colab
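A minimal plotting sketch covering the basic plot, title, and axis labels from the slides; the Agg backend and savefig call are assumptions so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; safe without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]               # x-axis values
y = [3, 8, 1, 10]              # y-axis values

plt.plot(x, y)                 # method for plotting
plt.title("Sample data")       # chart title
plt.xlabel("x values")         # x-axis label
plt.ylabel("y values")         # y-axis label
plt.savefig("basic_plot.png")  # in a script, save instead of plt.show()
```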


LINE CHARTS

Notice that we have


just single points

Credit: google colab


MARKERS

Markers

Credit: w3schools
MARKERS

Full list - https://www.w3schools.com/python/matplotlib_markers.asp

Credit: w3schools
LINES

Lines

Credit: w3schools
LINES

Full list - https://www.w3schools.com/python/matplotlib_line.asp

Credit: w3schools
LINES
• Lines have other properties that allows for modifying colors,
line width, etc.

color = ‘red’
linewidth = '20.5'

Credit: w3schools
MULTIPLE LINES

Credit: w3schools
LABELS
• Have you noticed that all our visuals do not communicate any
specific insight?
• Pyplot allow users to set labels to define the information
communicated. E.g., Title, x-axis, y-axis

What is this visual


communicating?

Credit: w3schools
ADDING LABELS

xlabel
Credit: w3schools
Title
ADDING TITLE
Y-values

Title

Credit: w3schools X-values


MULTIPLE PLOTS
• Sometimes, you may want to display multiple visuals on the
same graph. Matplotlib provides the subplot function to achieve
this.

• plt.subplot(rows, columns, position)

• rows indicate the number of rows (integer)


• columns indicate the number of columns (integer)
• Position indicates whether the current graph should be displayed
first, second,…,nth

Credit: w3schools
MULTIPLE COLUMNS PLOT 1 row, 2 columns subplots

First subplot Second subplot

First subplot

Second subplot

Credit: w3schools
MULTIPLE ROWS PLOT 2 rows, 1 column subplots

First subplot

2 rows, 1 column

Second subplot
2 rows, 1 column

Credit: w3schools
MULTIPLE ROWS AND COLUMNS PLOT

Credit: w3schools Try this
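The subplot(rows, columns, position) pattern from these slides, sketched with the Agg backend (an assumption so it runs headless); the SALES/INCOME titles are illustrative:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

plt.subplot(1, 2, 1)            # 1 row, 2 columns, first position
plt.plot(x, [3, 8, 1, 10])
plt.title("SALES")

plt.subplot(1, 2, 2)            # 1 row, 2 columns, second position
plt.plot(x, [10, 20, 30, 40])
plt.title("INCOME")

plt.savefig("subplots.png")
```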


SCATTER PLOTS
• We can use the scatter method to achieve this.

Credit: w3schools
SCATTER PLOTS


Credit: w3schools
SCATTER PLOTS
• Note that the scatter method allows for setting marker colors as
well

Red color for the first plots

Blue color for the first plots

Credit: w3schools
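Two scatter plots with per-group marker colours, as described above (made-up observations; Agg backend assumed):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# two days of observations, plotted in different marker colours
x1, y1 = [5, 7, 8, 7, 2], [99, 86, 87, 88, 111]
x2, y2 = [2, 2, 8, 1, 15], [100, 105, 84, 105, 90]

plt.scatter(x1, y1, color="red")
plt.scatter(x2, y2, color="blue")
plt.savefig("scatter.png")
```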
BAR PLOTS
• We can use the bar method for bar charts
• The bar plots has properties that allow for changing colors, bar
width, height, etc.
• Color = (string)
• Width = (float)
• Height = (float)

Credit: w3schools
BAR PLOTS

Using the bar function

Credit: w3schools
BAR PLOTS

What will this graph look like?

Credit: w3schools
HORIZONTAL BARS (barh)

Notice we used the barh


method

Credit: w3schools
PIE CHARTS
• We can use the pie method visualize data using a pie chart

The pie method

Credit: w3schools
PIE CHARTS
• We can override the default colors

List of colors for each label

Credit: w3schools
CHART LEGEND
• We can easily define a legend for the chart

Legend

Credit: w3schools
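The bar, barh, pie, and legend slides in one sketch (labels, values, and colours are illustrative; Agg backend assumed):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

labels = ["apples", "bananas", "cherries", "dates"]
values = [35, 25, 25, 15]

plt.figure()
plt.bar(labels, values, color="red", width=0.5)   # vertical bars
plt.savefig("bar.png")

plt.figure()
plt.barh(labels, values)                          # horizontal bars
plt.savefig("barh.png")

plt.figure()
plt.pie(values, labels=labels,
        colors=["black", "hotpink", "blue", "green"])
plt.legend(title="Four Fruits:")                  # chart legend
plt.savefig("pie.png")
```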
NEXT STEPS
TO THIS END…
We know;

• The basics for creating and handling visualizations

• In our next session, we will learn about building basic predictive


machine learning models
NEXT STEPS

Introduction to
Machine Learning
ANY QUESTIONS??
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING
Machine learning is a field of AI that involves the development of
algorithms and statistical models that enable computers to learn and
improve their performance on a specific task without being explicitly
programmed.

[Figure: three learning paradigms: supervised learning learns from
labeled data; unsupervised learning learns from unlabeled data;
reinforcement learning learns to make decisions by interacting with
an environment]
MACHINE LEARNING MODELS
Machine learning models can range from simple linear regression to
complex deep neural networks.

Simple linear regression


SIMPLE LINEAR REGRESSION MODEL

Data preprocessing Build Model Evaluate

Clean data
Select model Check accuracy
Split data
OUR FIRST MACHINE LEARNING MODEL

Snapshot of the housing


dataset
DATA INGESTION

Import
packages

Load data
DATA CLEANING

Handle duplicates

There are no
missing values
DATA CLEANING

Column data
types

We will be
working with
the integer data
types at this
stage.
FEATURE SELECTION
Predictors

What we want
to predict
MODEL SELECTION

Define: What type of model will it be? A decision tree? Some other
type of model? Some other parameters of the model type are
specified too.

Fit: Capture patterns from provided data. This is the heart of


modeling.

Predict: Just what it sounds like

Evaluate: Determine how accurate the model's predictions are.

In this case we want to build a very basic


linear regression model using the scikit
learn library
MODEL SELECTION

Importing the linear regression model

Create the model

Train the model
We predict with a
MAKING PREDICTIONS set of predictors

The predictions
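A sketch of the create/fit/predict steps with scikit-learn; the area/rooms/price numbers are made up for illustration (price is an exact linear function of area), not the housing dataset from the slides:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "area":  [1000, 1500, 2000, 2500, 3000],
    "rooms": [2, 3, 3, 4, 5],
    "price": [200000, 290000, 380000, 470000, 560000],
})

X = df[["area", "rooms"]]        # predictors
y = df["price"]                  # what we want to predict

model = LinearRegression()       # create the model
model.fit(X, y)                  # train the model
predictions = model.predict(X)   # predict with a set of predictors
print(predictions)
```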
DECISION TREE
SIMPLE DECISION TREE MODEL

Data preprocessing Build Model Evaluate

Clean data
Select model Check accuracy
Split data
DECISION TREE MODEL
Machine learning models can range from simple linear regression to
complex deep neural networks.

Decision Tree
DECISION TREE Import decision tree from sklearn

model
Train model

Make predictions

Predicted VS
Actual are the
same. That is a
100% accuracy.
BUT WHY??
LETS MODIFY OUR MODEL BY
INTRODUCING TRAINING AND TEST
DATASETS
We realized that our model performed well with an accuracy of 100%.
This is unlikely in real-world scenarios.

The reason for the 100% accuracy is that we were trying to predict Y
values with X values that the model had already seen before, in the
training stage

What about testing our model on data that the model has not seen
before??

Let’s give it a shot!!!


INGESTION, CLEANING, AND SELECTING
VARIABLES
We import the
decision tree
model

Dependent Independent variable


variable
SPLIT DATA

The method for


splitting the data
SPLIT DATA

data 80% for training and 20%


for testing

Dataset for
training

Dataset for
testing
MODEL SELECTION

Train dataset
Test dataset
MODEL PERFORMANCE

Checks error
margin

Error margin
LETS MODIFY THE MODEL A BIT BY
SPECIFYING LEAVES

Error margin before updating parameter

Error margin after updating


parameter
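The split/train/evaluate workflow with the max_leaf_nodes parameter, sketched on synthetic data that stands in for the housing dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, size=(200, 1))           # e.g. house area
y = 100 * X[:, 0] + rng.normal(0, 5000, size=200)   # noisy price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)            # 80% train, 20% test

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=1)
model.fit(X_train, y_train)              # the model only sees training data
preds = model.predict(X_test)            # evaluate on unseen data
mae = mean_absolute_error(y_test, preds)
print(mae)                               # the error margin
```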
PROBLEM OF UNDERFITTING AND
OVERFITTING
DIFFERENT LEVELS OF LEAVES

Error margin is high for 50 leaves


HANDLING CATEGORICAL DATA
CATEGORICAL DATA
Have you realized that we couldn’t include these attributes in the model?
HANDLE CATEGORICAL COLUMNS

Label Encoder One-Hot-Encoder Dummies


LABEL ENCODERS

Importing LabelEncoder
LABEL ENCODERS’

Columns of interest. We
believe that these columns
predict house prices. We
need to convert them to
numerical forms
TRANSFORMING CATEGORICAL COLUMNS
Instantiate Label encoder Transform values Categorical column
to convert
ADD TRANSFORMED COLUMNS TO
DATAFRAME

New column name Transformed values


SNAPSHOT OF TRANSFORMED COLUMNS

New columns added


INDEPENDENT & DEPENDENT VARIABLES

Select columns based on data types; exclude columns with data type object.
Drop the price column: by default it would be included because we are selecting all columns other than objects.
DUMMIES columns
Pandas method to handle
categorical columns

Note that it creates multiple columns for each of them


based on the number of unique values in the column
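Both encodings sketched together; the furnishing column and its values are made-up stand-ins for the categorical housing columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "furnishing": ["furnished", "semi-furnished", "unfurnished", "furnished"],
    "price": [100, 80, 60, 95],
})

le = LabelEncoder()                          # instantiate the label encoder
df["furnishing_encoded"] = le.fit_transform(df["furnishing"])
print(df["furnishing_encoded"].tolist())     # one integer code per category

dummies = pd.get_dummies(df, columns=["furnishing"])
print(dummies.columns.tolist())              # one new column per unique value
```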
Task 1: Build a model with either linear
regression or decision tree and report on the
best model. Remember to apply all skills and
knowledge you have acquired especially
splitting data set into training and testing, and
encoding categorical columns
ENSEMBLE MODELS
RANDOM FOREST MODEL
Ensemble models combine multiple individual models to improve predictive
performance. A popular ensemble method is RandomForest, but there are
others like Gradient Boosting and AdaBoost.
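A minimal RandomForest sketch on synthetic data; each tree trains on a bootstrap sample and the forest averages their predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)       # 100 trees, each on a bootstrap sample
mae = mean_absolute_error(y_test, forest.predict(X_test))
print(mae)
```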
ANY QUESTIONS??
