Data Science Notes - Hamza
Subject: Data Science
Class: BSCS 6th Semester
Course Outline: Data Science
Written by: (CR) Kashif Malik
Hamza Zahoor (WhatsApp 0341-8377-917)
Example:
Suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will get us to the destination fastest, which route is likely to have no traffic jam, and which will be cost-effective. All of these decision factors act as input data, and we get an appropriate answer from them. This analysis of data is called data analysis, and it is a part of data science.
1. Statistics:
Statistics is one of the most important components of data science. It is a way to collect and analyze numerical data in large amounts and to find meaningful insights from it.
2. Domain Expertise:
Domain expertise is what binds data science together. It means specialized knowledge or skill in a particular area, and data science projects need domain experts for the various areas they touch.
3. Data engineering:
6. Mathematics:
Data Science Lifecycle
Data science projects are not built the same, so their life cycles vary as well. Still, we can picture a general lifecycle that includes some of the most common data science steps. A general data science lifecycle process includes the use of machine learning algorithms and statistical practices that result in better prediction models. Some of the most common steps involved in the entire process are data extraction, preparation, cleansing, modeling, and evaluation. The world of data science refers to this general process as the "Cross Industry Standard Process for Data Mining" (CRISP-DM).
We will go through these steps individually in the subsequent
sections and understand how businesses execute these steps
throughout data science projects. But before that, let us take a
look at the data science professionals involved in any data
science project.
Who Is Involved in the Projects?
• Domain Expert:
Data science projects are applied in different domains or industries of real life, such as banking, healthcare, and the petroleum industry. A domain expert is a person who has experience working in a particular domain and knows it inside and out.
• Business analyst:
A business analyst is required to understand the business needs in the identified domain. This person can help devise the right solution and a timeline for it.
• Data Scientist:
2. Business Understanding
Business understanding means understanding exactly what the customer wants from a business perspective. Whether the customer wishes to make predictions, improve sales, minimize losses, or optimize a particular process, these form the business goals. During business understanding, two important steps are followed:
KPI (Key Performance Indicator)
For any data science project, key performance indicators define the performance or success of the project. The customer and the data science project team need to agree on the business-related indicators and the corresponding data science project goals. The business indicators are devised according to the business need, and the data science team then decides the goals and indicators accordingly. To understand this better, consider an example: if the business need is to optimize the company's overall spending, the data science goal might be to use the existing resources to manage double the number of clients. Defining the key performance indicators is crucial for any data science project, because the cost of the solution differs for different goals.
SLA (Service Level Agreement)
Once the performance indicators are set, finalizing the service level agreement is important. The service level agreement terms are decided according to the business goals. For example, an airline reservation system may be required to handle, say, 1000 simultaneous users; the requirement that the product satisfy this load then becomes part of the service level agreement.
4. Pre-processing data
Large amounts of data are collected from archives, daily transactions, and intermediate records. The data is available in various formats and forms; some may even exist only in hard copy. The data is scattered across various places and servers. All of this data is extracted, converted into a single format, and then processed. Typically, a data warehouse is constructed, where the Extract, Transform and Load (ETL) operations are carried out. In a data science project this ETL operation is vital. The data architect plays an important role at this stage, deciding the structure of the data warehouse and performing the ETL operations.
5. Analyzing data
Now that the data is available and ready in the required format, the next important step is to understand the data in depth. This understanding comes from analysis of the data using the various statistical tools available, and a data engineer plays a vital role here. This step is also called Exploratory Data Analysis (EDA). The data is examined using various statistical functions, and the dependent and independent variables or features are identified through careful analysis of the data.
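A minimal EDA sketch in Python with pandas might look like the following; the file name prepared_data.csv is a placeholder for whatever the pre-processing stage produced:

# A small EDA sketch with pandas (illustrative).
import pandas as pd

df = pd.read_csv("prepared_data.csv")   # hypothetical output of the ETL stage

print(df.shape)                     # number of rows and columns
print(df.dtypes)                    # data type of each feature
print(df.describe())                # basic statistics for numeric features
print(df.isna().sum())              # missing values per column
print(df.corr(numeric_only=True))   # correlations help spot dependent/independent variables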
On the surface, the simple answer is "the use of data for decision making or problem solving". That answer, however, leaves us looking for context, tangible examples, and the implications that this section exists to address.
These are common questions that you or others have asked and that data teams have been tasked with answering. You will likely find that these questions come up again and again, which increases the importance of the reports and dashboards created to address them and creates necessary maintenance work down the line.
Much like any other product, updates are needed to keep pace
with the market, potentially including:
• Adding or replacing data sources, such as new business
applications
• Creating filters for drilldowns that stakeholders are
interested in
• Deprecating or updating aspects of your dashboards to
reflect the changing nature of questions
Go-to-Market Strategy
Data products represent a substantial investment, signaling an
internal commitment by a company to drive adoption. This shifts
the paradigm from neglected dashboards with minimal usage to
actively promoting data products to prevent wasted investments.
Iterative Updates
Similar to ad-hoc reports addressing new questions, existing
dashboards often become neglected. Treating data products as
products emphasizes the need for regular review and
improvement, ensuring their continued usefulness to the
business.
What is data?
Like many topics, data practice has its own language. Here are some of the terms it is useful to know:
There are several types of data sets. What determines the type of
the data set is the information within it. Below are the types of
data sets you may see:
Numerical
A numerical data set is one in which all the data are numbers.
You can also refer to this type as a quantitative data set, as the
numerical values can apply to mathematical calculations when
necessary. Many financial analysis processes also rely on
numerical data sets, as the values in the set can represent
numbers in dollar amounts. Examples of a numerical data set
may include:
• The number of cards in a deck.
• A person's height and weight measurements.
• The measurements of interior living spaces.
• The number of pages in a book
Categorical
Categorical data sets contain information relating to the
characteristics of a person or object. Data scientists also refer to
categorical data sets as qualitative data sets because they contain
information relating to the qualities of an object. There are two
types of categorical data sets: dichotomous and polytomous.
Multivariate
Unlike a bivariate data set, a multivariate data set contains more than two variables. For example, the height, width, length and weight of a package you ship through the mail require more than two variable inputs to create a data set. Each measurement is represented by its own variable; for the example package, the values of these measurements are the variables.
Correlation
When there is a relationship between variables within a data set, it becomes a correlation data set. This means that the values depend on one another and change together. For example, a restaurant may find a correlation between the number of iced teas customers purchase in a day and that day's high temperature.
• DATA QUALITY
What is data quality?
Data quality, data integrity and data profiling are all interrelated
with one another. Data quality is a broader category of criteria
that organizations use to evaluate their data for accuracy,
completeness, validity, consistency, uniqueness, timeliness, and
fitness for purpose. Data integrity focuses on only a subset of
these attributes, specifically accuracy, consistency, and
completeness. It also approaches these attributes more from the lens of data security, implementing safeguards to protect against data corruption by malicious actors.
Data profiling, on the other hand, focuses on the process of
reviewing and cleansing data to maintain data quality standards
within an organization. This can also encompass the technology that supports these processes.
Data cleansing
Techniques for cleaning up messy data include the following:
Feature engineering
Feature engineering involves techniques used by data scientists to organize the data in ways that make it more efficient to train models and run inferences against them. These techniques include the following:
Data aggregation
Data aggregation is the process where raw data is gathered and
expressed in a summary form for statistical analysis.
For example, raw data can be aggregated over a given time
period to provide statistics such as average, minimum,
maximum, sum, and count. After the data is aggregated and
written to a view or report, you can analyze the aggregated data
to gain insights about particular resources or resource groups.
There are two types of data aggregation:
Time aggregation
All data points for a single resource over a specified time period.
Spatial aggregation
All data points for a group of resources over a specified time
period.
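A small pandas sketch of both kinds of aggregation is shown below; the readings table (resource, timestamp, value) is invented for illustration:

# Illustrative time and spatial aggregation with pandas.
import pandas as pd

readings = pd.DataFrame({
    "resource": ["server-1", "server-1", "server-2", "server-2"],
    "timestamp": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00",
                                 "2024-01-01 10:00", "2024-01-01 11:00"]),
    "value": [10, 30, 20, 40],
})

# Time aggregation: all data points for a single resource over a time period
time_agg = (readings[readings["resource"] == "server-1"]
            .resample("D", on="timestamp")["value"]
            .agg(["mean", "min", "max", "sum", "count"]))

# Spatial aggregation: all data points for a group of resources over a time period
spatial_agg = (readings
               .resample("D", on="timestamp")["value"]
               .agg(["mean", "min", "max", "sum", "count"]))

print(time_agg)
print(spatial_agg)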
What is Sampling?
1 – NumPy
NumPy: Also known as Numerical Python, NumPy is an open-source Python library used for scientific computing. NumPy gives both speed and higher productivity through arrays and matrices. This basically means it is super useful when analyzing basic mathematical data and calculations. It was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of your mathematical problems with useful functions that are cleaner and faster.
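A short sketch of what working with NumPy looks like (the numbers are made up):

# Array math is vectorized, so it is cleaner and faster than looping over plain lists.
import numpy as np

heights = np.array([1.65, 1.80, 1.72, 1.90])   # metres (example data)
weights = np.array([60.0, 82.5, 70.0, 95.0])   # kilograms (example data)

bmi = weights / heights ** 2                   # element-wise operations, no loop
print(bmi.mean(), bmi.std())                   # basic statistics come built in

matrix = np.arange(12).reshape(3, 4)           # create and reshape a matrix
print(matrix.T @ matrix)                       # matrix multiplication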
2 – SciPY
SciPy: Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It is an advanced form of NumPy and lets users carry out tasks such as solving differential equations, using special functions, optimization, and integration. SciPy can be viewed as a library that saves time, with predefined complex algorithms that are fast and efficient. However, there is such a plethora of SciPy tools that they might confuse users more than help them.
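A small sketch of SciPy's optimization and integration routines (the function being minimized is just an example):

import numpy as np
from scipy import optimize, integrate

# Find the minimum of f(x) = (x - 3)^2 + 1
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(result.x)   # approximately 3.0

# Numerically integrate sin(x) from 0 to pi (the exact answer is 2)
area, error = integrate.quad(np.sin, 0, np.pi)
print(area)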
3 – Pandas
Pandas is a key data manipulation and analysis library in Python. Its strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been many comparisons between pandas and R packages because of their similarities in data analysis, but the general consensus is that pandas is very easy for anyone to pick up.
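A brief pandas sketch on an invented table:

import pandas as pd

df = pd.DataFrame({
    "city": ["Lahore", "Karachi", "Lahore", "Islamabad"],
    "sales": [250, 400, 310, 150],
})

print(df.head())                           # inspect the first rows
print(df.groupby("city")["sales"].sum())   # group and summarize
print(df[df["sales"] > 200])               # filter rows with a condition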
4 – Matplotlib
Matplotlib is a visualization powerhouse for Python
programming, and it offers a large library of customizable
tools to help visualize complex datasets. Providing
appealing visuals is vital in the fields of research and data
analysis. Python’s 2D plotting library is used to produce
plots and make them interactive with just a few lines of
code. The plotting library additionally offers a range of
graphs including histograms, bar charts, error charts,
scatter plots, and much more.
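As a small illustration (the random data here is made up), the snippet below produces two of the plot types mentioned above with a few lines of code:

import matplotlib.pyplot as plt
import numpy as np

x = np.random.normal(size=500)     # example data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)               # histogram
ax1.set_title("Histogram")
ax2.scatter(x[:-1], x[1:], s=10)   # scatter plot
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()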
5 – scikit-learn
scikit-learn is Python’s most comprehensive machine
learning library and is built on top of NumPy and SciPy.
One of the advantages of scikit-learn is its all-in-one approach: it contains various tools for carrying out machine learning tasks, such as supervised and unsupervised learning.
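As a sketch of a supervised learning task in scikit-learn, the snippet below fits a simple classifier on the library's built-in iris dataset (the choice of dataset and model is illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # train on the training split
print(accuracy_score(y_test, model.predict(X_test)))             # evaluate on the test split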
6 – IPython
Fundamental Operators
These are the basic/fundamental operators used in
Relational Algebra.
• Selection(σ)
• Projection(π)
• Union(U)
• Set Difference(-)
• Set Intersection(∩)
• Rename(ρ)
• Cartesian Product(X)
• SQL
What is SQL
SQL is a standard database language used to communicate
with databases. It allows easy access to the database and
is used to manipulate database data.
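As a minimal sketch of how SQL is used to query a database, the snippet below creates a throwaway SQLite table using Python's built-in sqlite3 module; the students table and its columns are invented for illustration. The column list in the SELECT plays the role of projection (π) and the WHERE clause plays the role of selection (σ) from the relational algebra above:

# Illustrative only: a temporary in-memory SQLite database queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, cgpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Ali", 3.4), (2, "Sara", 3.8), (3, "Bilal", 2.9)])

# Projection (choose columns) and selection (filter rows) expressed in SQL
for row in conn.execute("SELECT name, cgpa FROM students WHERE cgpa > 3.0"):
    print(row)
conn.close()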
• DATA WRANGLING
• Discovery
• Transformation
• Validation
• Publishing
1. Discovery
In the discovery stage, you essentially prepare yourself for the rest of the process. Here, you think about the questions you want to answer and the type of data you will need in order to answer them. You also locate the data you plan to use and examine its current form.
2. Transformation
Data structuring
When you structure data, you make sure that your various
datasets are in compatible formats. This way, when you combine
or merge data, it's in a form that's appropriate for the analytical
model you want to use to interpret the data.
Data cleaning
During the cleaning process, you remove errors that might
distort or damage the accuracy of your analysis. This
includes tasks like standardizing inputs, deleting duplicate
values or empty cells, removing outliers, fixing
inaccuracies, and addressing biases. Ultimately, the goal is
to make sure the data is as error-free as possible.
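A rough pandas sketch of these cleaning steps is shown below; the file name raw.csv and the column names are placeholders, not part of the notes:

import pandas as pd

df = pd.read_csv("raw.csv")                        # hypothetical raw extract
df = df.drop_duplicates()                          # delete duplicate rows
df = df.dropna(how="all")                          # drop completely empty rows
df["city"] = df["city"].str.strip().str.lower()    # standardize text inputs

# Remove numeric outliers more than 3 standard deviations from the mean
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df = df[z.abs() <= 3]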
Enriching data
Once you've transformed your data into a more usable
form, consider whether you have all the data you need for
your analysis. If you don't, you can enrich it by adding
values from other datasets. You also may want to add
metadata to your database at this point.
3. Validation
During the validation step, you essentially check the work
you did during the transformation stage, verifying that
your data is consistent, of sufficient quality, and secure.
This step may be completed using automated processes
and can require some programming skills.
4. Publishing
After you've finished validating your data, you're ready to
publish it. When you publish data, you'll put it into
whatever file format you prefer for sharing with other team
members for downstream analysis purposes.
1. Prediction:
Definition: Prediction involves using existing data to forecast future data points. It focuses on building models that can make accurate predictions on new, unseen data.
Algorithms:
1. K-Means Clustering: Partitions the data into K distinct clusters based on feature similarity.
2. Hierarchical Clustering: Builds a tree of clusters.
3. Principal Component Analysis (PCA): Reduces the dimensionality of the data while retaining most of the variation in the data.
4. t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduces dimensionality for visualization purposes.
5. Anomaly Detection Algorithms: Identify data points that deviate significantly from the majority of the data.
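As a sketch, the snippet below runs two of these algorithms, K-Means and PCA, with scikit-learn on synthetic two-cluster data (the data and parameters are purely illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])  # two synthetic blobs

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster assignments for a few points

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)             # variation retained by the 2 components
X_2d = pca.transform(X)                          # reduced-dimensionality data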
What is variance?
Variance stands in contrast to bias; it measures how much a model's results differ when it is trained on several different sets of data values. The most common approach to measuring variance is to perform cross-validation experiments and look at how the model performs on different random splits of your training data.
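A short scikit-learn sketch of this idea (the dataset and model are illustrative); the spread of the fold scores gives a rough sense of how much the model's performance changes across different splits:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # accuracy on each random split
print(scores.std())   # a large spread suggests high variance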
A model with a high level of variance depends heavily on the training data and, consequently, has a limited ability to generalize to new, unseen data. This can result in excellent performance on training data but significantly higher error rates when the model is evaluated on the test data. Nonlinear machine learning algorithms often have high variance due to their high flexibility.
A complex model can learn complicated functions, which
leads to higher variance. However, if the model becomes
too complex for the dataset, high variance can result in
overfitting. Low variance indicates a limited change in
the target function in response to changes in the training
data, while high variance means a significant difference.
High-variance model features
• Low testing accuracy. Despite high accuracy on
training data, high variance models tend to perform
poorly on test data.
• Overfitting. A high-variance model often leads to
overfitting as it becomes too complex.
• Overcomplexity. As researchers, we expect that
increasing the complexity of a model will result in
improved performance on both training and testing
data sets. However, when a model becomes so complex that a simpler model would provide the same level of accuracy, it is better to choose the simpler one.
Precision
The precision metric is used to overcome the limitations of accuracy. Precision determines the proportion of positive predictions that were actually correct. It is calculated as the ratio of true positives (predictions that are actually true) to the total number of positive predictions (true positives plus false positives).
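In symbols (the standard definition):
Precision = True Positives / (True Positives + False Positives)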
F-Scores
The F-score or F1 Score is a metric for evaluating a binary classification model on the basis of the predictions made for the positive class. It is calculated from Precision and Recall and is a single score that represents both. The F1 Score is the harmonic mean of Precision and Recall, assigning equal weight to each of them.
The formula for calculating the F1 score is given below:
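F1 Score = 2 × (Precision × Recall) / (Precision + Recall)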
• MAP REDUCE PARADIGM
What is MapReduce?
MapReduce is a data processing tool used to process data in parallel in a distributed fashion. It was introduced in 2004 in the Google paper "MapReduce: Simplified Data Processing on Large Clusters."
MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is over. The reducer also takes input in key-value format, and the output of the reducer is the final output.
Steps in MapReduce
o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
o Using the output of the map, sort and shuffle are applied by the Hadoop architecture. The sort and shuffle act on the list of <key, value> pairs and send out each unique key together with the list of values associated with it: <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored or displayed.
Sort and Shuffle
The sort and shuffle occur on the output of the mapper and before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each mapper, <k2, v2>, we collect all the values for each unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.
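To make the phases concrete, the plain-Python sketch below imitates map, sort and shuffle, and reduce on a word-count example; it is an illustration of the paradigm, not actual Hadoop code:

from collections import defaultdict

documents = ["data science is fun", "big data needs map reduce"]

# Map phase: emit <key, value> pairs; keys are not unique here
mapped = [(word, 1) for line in documents for word in line.split()]

# Sort and shuffle phase: group all values for each unique key
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)          # <key, list(values)>

# Reduce phase: apply a function to each key's list of values
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)                            # e.g. {'data': 2, 'science': 1, ...}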
Usage of MapReduce
o It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
• INTRODUCTION TO R
What is R Programming Language?
R programming is a leading tool for machine learning,
statistics, and data analysis, allowing for the easy creation
of objects, functions, and packages. Designed by Ross
Ihaka and Robert Gentleman at the University of
Auckland and developed by the R Development Core
Team, the R language is platform-independent and open-source, making it accessible for use across all
operating systems without licensing costs. Beyond its
capabilities as a statistical package, R integrates with
other languages like C and C++, facilitating interaction
with various data sources and statistical tools. With a
growing community of users and high demand in the Data
Science job market, R is one of the most sought-after
programming languages today. Originating as an
implementation of the S programming language with
influences from Scheme, R has evolved since its
conception in 1992, with its first stable beta version
released in 2000.
Why Use R Language?
The R Language is a powerful tool widely used for data
analysis, statistical computing, and machine learning.
Here are several reasons why professionals across various
fields prefer R:
1. Comprehensive Statistical Analysis:
• R language is specifically designed for statistical
analysis and provides a vast array of statistical
techniques and tests, making it ideal for data-driven
research.
2. Extensive Packages and Libraries:
• The R Language boasts a rich ecosystem of packages
and libraries that extend its capabilities, allowing
users to perform advanced data manipulation,
visualization, and machine learning tasks with ease.
3. Strong Data Visualization Capabilities:
• R language excels in data visualization, offering
powerful tools like ggplot2 and plotly, which enable
the creation of detailed and aesthetically pleasing
graphs and plots.
4. Open Source and Free:
• As an open-source language, R is free to use, which
makes it accessible to everyone, from individual
researchers to large organizations, without the need
for costly licenses.
5. Platform Independence:
• The R Language is platform-independent, meaning it
can run on various operating systems, including
Windows, macOS, and Linux, providing flexibility in
development environments.
6. Integration with Other Languages:
• R can easily integrate with other programming
languages such as C, C++, Python, and Java,
allowing for seamless interaction with different data
sources and statistical packages.
7. Growing Community and Support:
• R language has a large and active community of
users and developers who contribute to its continuous
improvement and provide extensive support through
forums, mailing lists, and online resources.
8. High Demand in Data Science:
• R is one of the most requested programming
languages in the Data Science job market, making it
a valuable skill for professionals looking to advance
their careers in this field.
Features of R Programming Language
The R Language is renowned for its extensive features
that make it a powerful tool for data analysis, statistical
computing, and visualization. Here are some of the key
features of R:
1. Comprehensive Statistical Analysis:
• R language provides a wide array of statistical
techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis,
classification, and clustering.
2. Advanced Data Visualization:
• With packages like ggplot2, plotly, and lattice, R
excels at creating complex and aesthetically pleasing
data visualizations, including plots, graphs, and
charts.
3. Extensive Packages and Libraries:
• The Comprehensive R Archive Network (CRAN)
hosts thousands of packages that extend R’s
capabilities in areas such as machine learning, data
manipulation, bioinformatics, and more.
4. Open Source and Free:
• R is free to download and use, making it accessible to
everyone. Its open-source nature encourages
community contributions and continuous
improvement.
5. Platform Independence:
• R is platform-independent, running on various
operating systems, including Windows, macOS, and
Linux, which ensures flexibility and ease of use
across different environments.
6. Integration with Other Languages:
• R language can integrate with other programming
languages such as C, C++, Python, Java, and SQL,
allowing for seamless interaction with various data
sources and computational processes.
7. Powerful Data Handling and Storage:
• R efficiently handles and stores data, supporting
various data types and structures, including vectors,
matrices, data frames, and lists.
8. Robust Community and Support:
• R has a vibrant and active community that provides
extensive support through forums, mailing lists, and
online resources, contributing to its rich ecosystem of
packages and documentation.
9. Interactive Development Environment (IDE):
• RStudio, the most popular IDE for R, offers a
user-friendly interface with features like syntax
highlighting, code completion, and integrated tools
for plotting, history, and debugging.
10. Reproducible Research:
• R supports reproducible research practices with tools
like R Markdown and Knitr, enabling users to create
dynamic reports, presentations, and documents that
combine code, text, and visualizations.
File reading in R
One of the important formats in which to store data is a text file. R provides various methods for reading data from a text file, such as read.delim().
Parameters:
file: the path to the file containing the data to be read into R.
header: a logical value. If TRUE, read.delim() assumes that your file has a header row, so row 1 contains the name of each column. If that is not the case, you can add the argument header = FALSE.
sep: the field separator character. "\t" is used for a tab-delimited file.
dec: the character used in the file for decimal points.
DATA FRAME
A data frame is a table or a two-dimensional array-like
structure in which each column contains values of one
variable and each row contains one set of values from
each column.
Following are the characteristics of a data frame.
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric,
factor or character type.
• Each column should contain the same number of data items.
Pie Charts
Pie charts represent a graph in the shape of a circle. The
whole chart is divided into subparts, which look like a
sliced pie.
Donut Chart
Doughnut Charts are pie charts that do not contain any
data inside the circle.
Drill Down Pie charts
Drill down Pie charts are used for representing detailed
descriptions for a particular category.
Bar Charts
A bar chart is the type of chart in which data is
represented in vertical series and used to compare trends
over time.
Stacked Bar
In a stacked bar chart, parts of the data are stacked on top of one another within each bar, displaying a total amount broken down into sub-amounts.
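As a small illustration, the Matplotlib snippet below draws a pie chart and a stacked bar chart from invented numbers (the categories and values are not from the notes):

import matplotlib.pyplot as plt

labels = ["Product A", "Product B", "Product C"]
q1 = [30, 45, 25]
q2 = [35, 40, 30]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.pie(q1, labels=labels, autopct="%1.0f%%")   # pie: a circle split into slices
ax1.set_title("Pie chart")

ax2.bar(labels, q1, label="Q1")                 # stacked bar: sub-amounts
ax2.bar(labels, q2, bottom=q1, label="Q2")      # stacked on top of Q1
ax2.set_title("Stacked bar")
ax2.legend()
plt.tight_layout()
plt.show()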
Gauges
The gauge component renders graphical representations of data.
Solid Gauge
Creates a gauge that indicates its metric value along a
180-degree arc.
Activity Gauge
Creates a gauge that shows the development of a task.
The inner rectangle shows the current level of a measure
against the ranges marked on an outer rectangle.
Heat and Treemaps
Heatmaps are useful for presenting variation across
different variables, revealing any patterns, displaying
whether any variables are related to each other, and
identifying if any associations exist in-between them.
3D Charts
A 3D chart can be rotated and viewed from different angles, which helps in representing the data.
3D Column
A 3D chart of type columns will draw each column as a
cuboid and create a 3D effect.