Chapter 1

The document outlines a course on Data Science Fundamentals with Python, focusing on K-means clustering and basic Python programming. It introduces key concepts of data science, including types of data, analytics methods, and machine learning techniques. The course aims to equip learners with the skills to analyze data and derive actionable insights for real-world applications.

Uploaded by

yosefmuluye42
Copyright
© © All Rights Reserved

Data Science Fundamentals with Python

Code: ITec 5032

By: Tinsae D.
Course description

• The objective of this course is to introduce you to one of the emerging technologies: data science. There are many data science techniques, and this course introduces one of the most commonly used basic techniques, K-means clustering, from different perspectives (machine learning, mathematics, statistics and programming).
• In parallel, you will be introduced to the basics of Python programming for data science.

Chapter 1:
Foundations of Data Science

1.1. Introduction - Data Science
Chapter Objectives:
The objective of this chapter is to gently introduce you to data science through some real-world examples of where it is used, and by highlighting some of the main concepts involved. The chapter covers:

• How data science might be applied to real-world situations.
• How the K-means clustering algorithm works.
• What we mean by data.
• What we mean by machine learning.
• How to use Python programming for data science.
Data science
• We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data.
• The data can be structured, semi-structured, or unstructured, and its volume increases day by day.
• Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze actual phenomena with data. Its output can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system.
… Data science
• With our lives being increasingly digitized, through constant smartphone use and electronic payments, we have all become data-generating machines. This data says a lot about you.
• When gathered from large populations, this data becomes even more valuable. It gives the companies and governments collecting it the ability not only to analyze what we have done, but also to predict what we are about to do.
Data science
• Data science is all about extracting intelligence from data.
• Data science draws on methods from statistics and computer science to process the vast amounts of data that we generate in our daily lives and to turn it into something meaningful and valuable.
• Whatever you do in today's world, you generate data.
… Data science
• Most of us have no idea exactly how our data is being used. The algorithms used by social media companies, banks, and health insurers can seem like mysterious black boxes.
• What are the different applications of data science? One of the most commonly used is data clustering with the K-means algorithm.
What do we mean by data?
• The word "data" is often used interchangeably with "information".
• With respect to data science, “data can refer to a collection of factual information that can be used by a computer as a basis for calculation.”
• Data can be represented either as a number, such as an age or a price, or as a category, such as hair color or type of fruit.
… Data
• Data becomes particularly interesting when we're dealing with large volumes of it. Take information on the hair color and age of a handful of people. With only a few examples, not much can be said. But if this is scaled up to 10,000 people, then patterns emerge in the data which can be useful to, for example, a marketing company that's trying to market new hair care products.
• They might use this to decide which age group they should market their products to.
Types of data
• There are two broad types of data:
i. Categorical and
ii. Numeric.
i. Categorical
• Categorical data is when a textual description or label is used to represent specific categories of objects. Categories include things like
primary colors: red, green, and blue; or
fruit: banana, apple, orange.
ii. Numeric
• Numeric data can be continuous: something that is measured on a continuous scale, like air temperatures at the North Pole or the weight of a cow.
• Numeric data can also be discrete: something that is countable, like the number of beans in a jar, or the number of people who survived the sinking of the Titanic.
• Ultimately, all data is stored on a computer as a discrete binary number. This means that when we're working with categorical data or continuous numeric data, the computer is actually mapping this to some discrete value behind the scenes.
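The distinction between categorical and numeric data can be illustrated with a minimal Python sketch. The example values and variable names below are hypothetical and chosen only for illustration. The last part shows one simple way categories might be mapped to discrete integer codes, in the spirit of the mapping a computer performs behind the scenes.

```python
# Hypothetical toy values; the names are illustrative only.
fruits = ["banana", "apple", "orange", "apple"]   # categorical
temps_c = [-31.5, -28.0, -33.2]                   # continuous numeric
bean_counts = [120, 98, 143]                      # discrete numeric

# Map each distinct category to an integer code, in alphabetical order.
codes = {name: i for i, name in enumerate(sorted(set(fruits)))}
encoded = [codes[f] for f in fruits]

print(codes)    # {'apple': 0, 'banana': 1, 'orange': 2}
print(encoded)  # [1, 0, 2, 0]
```

Note that the integer codes are only labels here: "orange" is not "larger" than "apple" in any meaningful sense.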
Check point
Note: You can select more than one.

Question 1
• Which of the following is best represented as categorical data?
o Volvo, Citroen, Honda, Toyota
o 1.02, 0.30, 0.03, 12.10
o The length of a swimming pool
o A person's age in months
… Check point
Question 2
• Which of the following might best be represented by continuous numeric data?
o Types of fruit
o Cost of coffee
o A, B, C, D ,E
o Height of a bird in flight
o 1.03, 2.01, 13/2, 19, 101.10, 1/3
Data science
• Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze actual phenomena with data. Its output can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system.
• The popularity of “data science” is increasing day by day, as shown in the next figure of Google Trends data over the last 5 years.
In the figure, the average popularity score is 71 for machine learning, 60 for data science, 30 for data analytics, and 12 for data mining. This shows how popular data science has become, and that machine learning, a recent advanced analytics technology used in data science, is more popular still.
Data science
• Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data.
• Advanced analytics is a step forward in offering a deeper understanding of data and helping to analyze granular data, while basic analytics offers a description of data in general.
Data science: Types of analytics

• In the field of data science, several types of analytics are popular:
i. "Descriptive analytics", which answers the question of what happened;
ii. "Diagnostic analytics", which answers the question of why it happened;
iii. "Predictive analytics", which predicts what will happen in the future; and
iv. "Prescriptive analytics", which prescribes what action should be taken.

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas.
…. Types of analytics
Data science
• Neural network or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications.
• More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are playing a major role.
Data science related terms
• “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes.
• “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight. Statistical and mathematical analysis of the data is the major concern in this process.
Data science related terms
• “Data mining” is also referred to as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. It is the process of discovering interesting patterns and knowledge from large amounts of data.
• “Big data” is data that is “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous”. Several unique features, including volume, velocity, variety, veracity, value (the 5Vs), and complexity, are used to understand and describe big data.
Data science related terms

• “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics and can automate analytical model building. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement.
• “Deep learning” is a subfield of machine learning concerned with algorithms inspired by the structure and function of the human brain, called artificial neural networks.
…… Data science
• Unlike the above data-related terms, “data science” is an umbrella term that encompasses advanced data analytics, data mining, machine learning and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies.
• From the disciplinary perspective, data science can be defined as “a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.”
Data science
• How can data science play a significant role in the real-world business process?
How data science can play a significant role …
• Understanding business problems: this means getting a clear understanding of the problem. To understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders.
• Understanding data: real-world datasets are often noisy, have missing values or inconsistencies, or have other data issues that need to be handled effectively. The first step in data understanding could be to establish what data is available and how it aligns with the business problem, what data is most needed, and the best ways to acquire it.
How data science can play a significant role ..
• Data pre-processing and exploration: this step examines a broad data collection in an unstructured manner to discover initial trends, attributes, points of interest, etc., and to construct meaningful summaries of the data. It includes visualizing and interpreting the data through graphical representations such as charts, plots, and histograms, and using data summarization and visualization to audit the quality of the data.
• Machine learning modeling and evaluation: once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on the type of analytics required. Data scientists typically separate the given dataset into training and test subsets, usually in the ratio 80:20. This makes it possible to observe whether the model performs well on data it was not trained on, and to maximize model performance. (Model validation and assessment metrics include: error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, F-score, ROC (receiver operating characteristic) curve analysis, and applicability analysis.)
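As a rough illustration of the 80:20 split and some of the metrics listed above, here is a minimal sketch in plain Python. The function names `train_test_split` and `binary_metrics`, and the example labels, are assumptions made for this sketch and do not come from the course material. Libraries such as scikit-learn provide their own versions of these utilities.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def binary_metrics(actual, predicted, positive=1):
    """Return accuracy, precision, recall and F-score for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# 100 hypothetical records split 80:20.
train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20

# Hypothetical true labels and model predictions for 8 test records.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so accuracy, precision, recall and F-score all come out at 0.75.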
More on ML

• “Advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area.
• A wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, can be applied.
ML

Fig. A general structure of a machine learning based predictive model.
… Machine Learning
• Machine learning is the term used to describe a series of processes in which a computer learns from evidence, or from lots of examples of data, to help it perform certain data-based tasks.
• Common to all machine learning algorithms is a training step. Training is where the computer learns something about the world or a particular problem, based on data drawn from that world.
… Machine Learning

• Training may allow the computer to build some internal representation, or model, of that world.
• Alternatively, the computer can be trained to search for patterns in the data to help structure the world.
• The outcome of training is an algorithm that can be used for a variety of tasks such as predicting future events, automatically recognizing objects, or structuring data in a manageable way.
Examples of machine learning
Example 1 (regression)
• Machine learning can be used to predict the future. Take earthquake prediction. This has important applications in predicting natural disasters and helping people plan to minimize their impact.
• When trained on information such as the time, location, and magnitude of historical earthquakes in a region, a machine learning algorithm may be used by geologists to work out the probability of another earthquake occurring at a certain time and place in the future. This sort of machine learning, where the output is a continuous value such as a probability, is known as regression.
Example 2 (classification)
• Machine learning can also be used to recognize and classify previously unseen objects.
• Given a large database of possible road signs, a machine learning algorithm in a self-driving car may be able to correctly recognize a stop sign in an unfamiliar location, or recognize a stop sign that is drawn slightly differently from the others it has seen.
• This information can be used to direct the car to take appropriate action, such as stopping. This sort of machine learning, where the output is a discrete label such as "stop sign", is known as classification.
Example 3 (clustering)

• Machine learning can also be used to structure lots of data into manageable chunks or clusters.
• Information from thousands of people's shopping habits, for example, can be used to link some groups of those people into specific clusters for more targeted marketing. Clustering is an example of unsupervised learning. It allows us to find patterns of behavior in data, based on similarities in that data.
• The same approach may make use of data drawn from social media posts to automatically cluster people into groups of similar political affiliation, or use sequences of information from samples of DNA to cluster people into groups with similar genetics.
• In summary, machine learning is really just about building algorithms that can help a computer to learn from a body of data so that it can make sense of, or make predictions about, new and previously unseen data.
• Put simply, machine learning is when a machine learns from examples.
1.3. Supervised vs. Unsupervised Learning
• Machine learning algorithms fall into two categories:
i. supervised learning and
ii. unsupervised learning.
• What do I mean by supervised?
Consider a parent teaching an infant the difference between dogs and cats. Every time the child sees a dog, the parent will point out "dog", and similarly for a cat.
…. i. Supervised learning

• All that is needed are lots of examples of data: images of animals that have been pre-labelled as either cat or dog by the child's supervisor or parent.
• Supervised learning is essentially this: providing the computer with lots of training data (images of dogs and cats in this example) alongside the class labels that we have assigned (either dog or cat).
…. i. Supervised learning

• When completely new data is then presented, the trained algorithm can infer the most likely corresponding class or value. The trained algorithm will make a decision on whether it is a dog or a cat. This is an example of classification.
• Alternatively, the algorithm can give some measure indicating the degree of dogishness or catishness. This is an example of regression, where a continuous-valued output or a probability is given.
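To make the difference between a discrete label and a continuous-valued output concrete, here is a minimal sketch that uses a k-nearest-neighbours rule. This technique is swapped in purely for illustration and is not named in the course material. The training data, the two features (weight and height), and the function names are all hypothetical.

```python
import math

# Hypothetical labelled training data: (weight in kg, height in cm) -> label.
training = [
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
    ((3.5, 24.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((30.0, 60.0), "dog"),
    ((8.0, 30.0), "dog"),
]

def nearest_labels(point, k=3):
    """Labels of the k training examples closest to point."""
    ranked = sorted(training, key=lambda item: math.dist(item[0], point))
    return [label for _, label in ranked[:k]]

def classify(point, k=3):
    """Discrete output: the most common label among the k neighbours."""
    labels = nearest_labels(point, k)
    return max(set(labels), key=labels.count)

def dogishness(point, k=3):
    """Continuous output: fraction of the k neighbours labelled 'dog'."""
    labels = nearest_labels(point, k)
    return labels.count("dog") / k

big_animal = (25.0, 58.0)
small_animal = (6.0, 27.0)
print(classify(big_animal), dogishness(big_animal))
print(classify(small_animal), dogishness(small_animal))
```

For the large animal all three nearest neighbours are dogs, so it is classified as "dog" with a dogishness of 1.0. The small animal is classified as "cat", but one of its three neighbours is a small dog, so its dogishness is about 0.33 rather than 0.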
ii. Unsupervised learning
• Unsupervised learning is when we don't know the structure of the data beforehand.
• We don't use any labels. In our child learning example, the parent never tells the child what it is they're looking at.
• The child may be exposed to lots of dogs and cats, but they have to learn the similarities and differences between examples of the creatures based on some other criteria. Over time, the child may well learn the traditional dichotomy of dog versus cat. But equally, they could form a different clustering, for example furry versus non-furry. They might not even be limited to just two classes or clusters. Perhaps the data suggests that three or more clusters may be appropriate.
… ii. Unsupervised learning
• For example, the similarity criterion could be color of fur. This division of data into groups based on some measure of similarity is why this type of unsupervised learning is referred to as data clustering.
• Data clustering is a very powerful tool and exemplifies many of the most important aspects of machine learning and data science.
• In this course, we will explore clustering in greater detail, and in particular the most common and useful clustering algorithm, K-means clustering.
Check point
Select all that apply (note: you can select more than one).
Question 1
• Which of the following are commonly associated with supervised learning?
o classification
o clustering
o regression
Check point

Question 2
• Which of the following is true:
o Clustering organizes data using pre-selected
labelling information.
o Regression is a supervised method for modelling
and predicting continuous valued data.
o Classification is the process of making a decision
based on data, and returning a categorical or
discrete output.
Neural Networks and Deep Learning

• Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers.
• The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets.
• The most common deep learning algorithms are the multilayer perceptron (MLP), the convolutional neural network (CNN or ConvNet), and the long short-term memory recurrent neural network (LSTM-RNN).
Neural Networks …..

Fig. An artificial neural network modeling with multiple processing layers.
How data science can play a significant role

• Data product and automation: a data product is typically the output of any data science activity. A data product is data-enabled or data-guided, and can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis.
1.4. K-Means Clustering

• Data clustering is a method of unsupervised machine learning, where data is separated into groups or clusters based on some similarity measure.
• K-means clustering is probably the most common example of such a method.
• To show how K-means clustering works, let's start with an example.
… K-Means clustering (Example)

• Imagine a one-dimensional axis, a line representing income. Each dot shown on this line represents a population of people with that level of income.
• Say we want to uncover some pattern in this data. Specifically, can we find out something about the relationships between these people based only on their level of income? Using traditional statistics, we can compute something like the overall average income.
• But the average income fails to capture the fact that there are clearly two groupings of income here, which we might label as wealthy and everyone else.
• The K-means algorithm is pretty good at finding such patterns without us having to tell it beforehand.
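The point about the average can be checked with a small sketch. The income values below are hypothetical (in thousands) and were chosen to contain one large low-income group and one small high-income group.

```python
from statistics import mean

# Hypothetical incomes (in thousands): six lower earners and three high earners.
incomes = [18, 20, 22, 25, 27, 30, 180, 200, 220]

overall = mean(incomes)
print(round(overall, 1))  # 82.4

# How far is the closest actual income from the overall average?
closest_gap = min(abs(x - overall) for x in incomes)
print(round(closest_gap, 1))  # 52.4
```

The overall average of about 82 describes nobody in the data: the nearest actual income is more than 50 away from it. The two groupings are invisible in this single number.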
Steps of K-means
• K-means has five basic steps and works as follows.
• Step one.
First, we select the number of clusters we want to look for. This is the k in K-means. Here we choose k equals two. The algorithm then randomly selects k points on our data axis. Note that it doesn't matter that these points do not necessarily correspond to existing data. These points are called our data centers, or centroids.
• Step two.
The distance from each data point to each of our k centroids is calculated. In this case, distance is simply the difference in income between the points.
• Step three.
Clusters are formed by assigning each data point to either centroid one or centroid two, depending on which is closest.
• Step four.
This is the update step. The average value calculated over the members of each cluster is set as the new centroid. The previous centroid value is discarded.
• Finally, step five.
We then repeat steps two to four, recalculating the centroids until eventually the centroid positions do not change. When the centroids remain stable like this, the algorithm is said to have converged. In our example, we were able to discover two clusters. But if we were to set k equal to three, looking for three clusters, we could find another pattern in the data, which we could then map to wealthy, average, and poor.
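The five steps above can be sketched in a few lines of plain Python for the one-dimensional income example. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans_1d`, the income values, and the handling of empty clusters are all choices made for this sketch.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster 1-D values into k groups, following the five steps above."""
    rng = random.Random(seed)
    # Step 1: pick k random starting centroids within the data range.
    lo, hi = min(values), max(values)
    centroids = [rng.uniform(lo, hi) for _ in range(k)]

    for _ in range(max_iter):
        # Steps 2 and 3: measure the distance to every centroid and
        # assign each point to the closest one.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)

        # Step 4: each new centroid is the mean of its cluster.
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]

        # Step 5: repeat until the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, clusters

# Hypothetical incomes (in thousands).
incomes = [18, 20, 22, 25, 27, 30, 180, 200, 220]
centroids, clusters = kmeans_1d(incomes, k=2)
print(sorted(round(c, 2) for c in centroids))  # [23.67, 200.0]
print(sorted(clusters, key=len))
```

With k equal to two, the sketch separates the three high incomes from the six lower ones, with centroids at about 23.67 and 200. Calling `kmeans_1d(incomes, k=3)` looks for three groups instead.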
• Most useful cases of K-means involve more than one dimension or feature.
• The same basic principle can be applied to two dimensions.
• The distance measure between points here might be a simple Euclidean distance.
• It turns out that K-means can be applied to any number of dimensions, provided there is sufficient data to train the algorithm.
• K-means converges to what is known as a local minimum. This basically means that although the algorithm seems to have found the best groupings, a better result may yet be found if the algorithm were started again with different initial centroid positions. It turns out that the selection of initial centroids in step one is crucial to finding a good solution.
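The effect of the starting centroids can be seen in a small two-dimensional sketch using Euclidean distance. The data (four corners of a wide rectangle), the function names, and the "cost" measure (total squared distance from each point to its centroid) are assumptions chosen for this illustration. Restarting from several random initializations and keeping the cheapest result is one common remedy, shown here in `best_of_restarts`.

```python
import math
import random

def kmeans(points, centroids, max_iter=100):
    """Run K-means from the given starting centroids; return (centroids, cost)."""
    k = len(centroids)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster, one dimension at a time.
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Cost: total squared distance from each point to its assigned centroid.
    cost = sum(math.dist(p, centroids[i]) ** 2
               for i, c in enumerate(clusters) for p in c)
    return centroids, cost

def best_of_restarts(points, k, restarts=10, seed=0):
    """Repeat K-means from random data points and keep the cheapest result."""
    rng = random.Random(seed)
    runs = [kmeans(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])

# Four corners of a wide rectangle (hypothetical data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, cost_bad = kmeans(points, [(0, 0), (0, 1)])   # both starts on the left edge
_, cost_good = kmeans(points, [(0, 0), (4, 0)])  # one start on each side
print(cost_bad, cost_good)  # 16.0 1.0

best_centroids, best_cost = best_of_restarts(points, k=2)
print(best_centroids, best_cost)
```

Starting with both centroids on the left edge, the algorithm converges to a "top versus bottom" split with cost 16. Starting with one centroid on each side gives the more natural "left versus right" split with cost 1. Both are stable, but only the second is the better grouping.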
Standardization
• Having good data to begin with is crucial to the success of any data science analysis.
• If the data going in is bad, then the algorithms won't work as well as you'd like.
• Much of the work carried out by data scientists is spent cleaning and adjusting the data to make it usable.
• For K-means, it's particularly important that the data used is compatible between different features. Continuous-valued data such as income, times, and weights can be compared using the Euclidean distance.
… Standardization
• However, when there are two or more features with different ranges, for example income levels between 1,000 and a million and weights from 20 kilograms to 150 kilograms, it's important to scale the data so that the features are comparable. Usually, this means adjusting all values to fall between zero and one. This scaling of the data is sometimes referred to as standardization (rescaling to the range zero to one is also known as min-max normalization).
• Categorical feature data like oranges, apples, or cat, dog, are not as easily handled by K-means. If the categories fall on some kind of scale, like very dog versus slightly dog versus less dog, they may be converted into a number range like 1, 0.8, 0.6, 0.4, and K-means can then be used.
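A minimal sketch of the zero-to-one scaling described above follows. The function name `min_max_scale`, the feature values, and the particular category-to-number mapping are hypothetical and used only for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features with very different ranges.
incomes = [1_000, 250_000, 1_000_000]
weights_kg = [20, 85, 150]

scaled_incomes = min_max_scale(incomes)
scaled_weights = min_max_scale(weights_kg)
print(scaled_incomes)
print(scaled_weights)  # [0.0, 0.5, 1.0]

# Ordered categories mapped onto a number range (an illustrative mapping).
dog_scale = {"very dog": 1.0, "slightly dog": 0.8, "less dog": 0.6}
labels = ["less dog", "very dog", "slightly dog"]
print([dog_scale[label] for label in labels])  # [0.6, 1.0, 0.8]
```

After scaling, a difference of 65 kilograms and a difference of roughly half a million in income both count as about half of the available range, so neither feature dominates the Euclidean distance.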
Check point
• Question 1
These are the final cluster assignments from a run of K-means on 12 data points (the red marks). What value of K was used in the algorithm?
• Question 2
• K-means clustering is run on the data shown below. During the first round of the algorithm, the cluster centroids are placed in the positions indicated by the large blue and green crosses.

o (a,b,c,d) and (e,f,g)
o (a,b,c) and (d,e,f,g)
o (a,b,c) and (d) and (e,f,g)
o (a,b) and (c,d) and (e,f) and (g)
1.5. Real-world dataset

• A publicly available real-world dataset is used. It is based on two sources of information: the World Bank income inequality index and the Gallup poll of happiness, covering over 120 countries.

Assignment (Group)
List some real-world datasets available to data scientists, with their locations.
Assignment #1 - 5% individual - select all that apply
Question 1
• The K-means algorithm is an example of:
o unsupervised learning
o supervised learning
o data clustering
o classification
Question 2
• The "k" in K-means represents:
o the number of data points in a cluster
o the number of clusters to find in a dataset
o none of the above
o the number of steps in the K-means algorithm
Question 3
• Before running K-means, data scaling is applied to the data in order to:
o standardize features to make them more comparable.
o make the plots look nicer.
o remove noisy data.
Question 4 (select the two that apply)
• Supervised machine learning includes:
o clustering
o regression
o none of the above
o classification
Question 5
• K-means is run on a dataset. One of the clusters contains 6 items of data with the following values: 2, 3, 2, 3, 1, 1. What number corresponds to the data center, or centroid, of this cluster? (Answer this with justification.)
End of Chapter One
