Data Science Unit 1
UNIT-1
INTRODUCTION
SYLLABUS: UNIT I
Introduction: Need for data science – benefits and uses – facets of data – data science
process – setting the research goal – retrieving data – cleansing, integrating, and transforming
data – exploratory data analysis – build the models – presenting and building applications.
Data Science
Data science involves using methods to analyze massive amounts of data
and extract the knowledge it contains.
Data science is an evolutionary extension of statistics capable of dealing
with the massive amounts of data produced today. It adds methods from
computer science to the repertoire of statistics.
Data Science is a discipline that merges concepts from computer science
(algorithms, programming, machine learning, and data mining),
mathematics (statistics and optimization), and domain knowledge
(business, applications, and visualization) to extract insights from data
and transform it into actions that have an impact in the particular domain
of application.
Data science is the field of study that combines domain expertise,
programming skills, and knowledge of math and statistics to extract
meaningful insights from data. Data science practitioners apply machine
learning algorithms to numbers, text, images, video, audio, and more to
produce artificial intelligence (AI) systems that perform tasks which
ordinarily require human intelligence. In turn, these systems generate
insights that analysts and business users translate into tangible business
value.
Data science is a field that studies data and how to extract meaning from it,
using a series of methods, algorithms, systems, and tools to extract insights
from structured and unstructured data. That knowledge then gets applied to
business, government, and other bodies to help drive profits, innovate
products and services, build better infrastructure and public systems, and
more.
In short, Data Science “uses scientific methods, processes, algorithms
and systems to extract knowledge and insights from data in various
forms”.
Python is a great language for data science because it has many data
science libraries available, and it’s widely supported by specialized
software. For instance, almost every popular NoSQL database has a
Python-specific API. Because of these features and the ability to
prototype quickly with Python while keeping acceptable performance, its
influence is steadily growing in the data science world.
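As a small, hypothetical illustration of how quickly one can prototype in Python, the standard library alone is enough to summarize a data set; no third-party libraries are assumed here, and the page-view numbers are made up:

```python
import statistics

# A small, made-up sample of daily page views for a website;
# 3150 looks suspiciously large compared to the rest.
page_views = [120, 135, 128, 142, 3150, 131, 138]

mean = statistics.mean(page_views)      # pulled upward by the large value
median = statistics.median(page_views)  # robust to the large value
stdev = statistics.stdev(page_views)

print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}")
```

Note how the median stays near the typical values while the mean is dragged upward, a first hint that the data contains an outlier.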
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
4-2 B.Tech IT Regulation: R19 Data Science-UNIT-1
Machine-Generated Data:
The analysis of machine data relies on highly scalable tools because of its
high volume and speed.
Examples of machine data are web server logs, call detail records,
network event logs, and telemetry.
Graph-based or Network Data:
“Graph data” can be a confusing term because any data can be shown in a
graph. “Graph” in this case points to mathematical graph theory.
In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects. Graph or network data is, in short, data that
focuses on the relationship or adjacency of objects.
Graph structures use nodes, edges, and properties to represent and
store graph data.
Graph-based data is a natural way to represent social networks, and its
structure allows calculating specific metrics such as the influence of a
person and the shortest path between two people.
Examples of graph-based data can be found on many social media
websites. For instance, on LinkedIn we can see who we know at which
company. Our follower list on Twitter is another example of graph-based
data.
The power and sophistication come from multiple, overlapping graphs of
the same nodes. For example, imagine one graph whose edges connect
“friends” on Facebook. Imagine another graph over the same people that
connects business colleagues via LinkedIn, and a third graph based on
movie interests on Netflix. Overlaying these three different-looking
graphs makes more interesting questions possible.
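A metric like the shortest path between two people can be computed with a breadth-first search over an adjacency list. The sketch below is illustrative only; the names and friendships are hypothetical:

```python
from collections import deque

# A tiny social graph as an adjacency list; all names are hypothetical.
friends = {
    "Ann":  ["Bob", "Carl"],
    "Bob":  ["Ann", "Dana"],
    "Carl": ["Ann", "Dana"],
    "Dana": ["Bob", "Carl", "Eve"],
    "Eve":  ["Dana"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of people from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection between start and goal

print(shortest_path(friends, "Ann", "Eve"))
```

Here Ann reaches Eve through two intermediaries, so the shortest chain has four people.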
Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
MLBAM (Major League Baseball Advanced Media) announced in 2014
that they’ll increase video capture to approximately 7 TB per game for
the purpose of live, in-game analytics. High-speed cameras at stadiums
will capture ball and athlete movements to calculate in real time, for
example, the path taken by a defender relative to two baselines.
Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games. This algorithm takes
the video screen as input and learns to interpret everything via a complex
process of deep learning.
It’s a remarkable feat that prompted Google to buy the company for their
own Artificial Intelligence (AI) development plans. The learning
algorithm takes in data as it’s produced by the computer game; it’s
streaming data.
Streaming Data
While streaming data can take almost any of the previous forms, it has an
extra property: the data flows into the system when an event happens
instead of being loaded into a data store in a batch. Although this isn’t
really a different type of data, we treat it here as such because we need to
adapt our process to deal with this type of information.
Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
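The defining property, processing events as they arrive rather than loading everything at once, can be sketched with a Python generator. The event source here is simulated and purely illustrative:

```python
import random

def event_stream(n):
    """Simulate events arriving one at a time, as in a live feed."""
    for i in range(n):
        yield {"id": i, "value": random.randint(1, 100)}

# Maintain running statistics per event instead of storing the whole stream.
count, total = 0, 0
for event in event_stream(1000):
    count += 1
    total += event["value"]

print(f"processed {count} events, running mean = {total / count:.1f}")
```

The loop never holds more than one event in memory, which is exactly why streaming pipelines scale to feeds that would not fit in a batch load.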
Data Science Process
(Steps in Data science process or Flow of Data science Process or phases of data science process)
The data science process typically consists of six steps.
1. Setting the research goal
2. Gathering data (or) Retrieving data
3. Data preparation (or) Data Preprocessing
4. Data exploration (or) Exploratory data analysis
5. Data Modeling (or) Model building
6. Presentation and Automation
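The six steps above can be sketched as a pipeline of functions, each passing its results to the next. Everything below is a toy illustration; the goal, data, and "model" are all hypothetical:

```python
# A toy end-to-end sketch of the six-step data science process.
def set_research_goal():
    return {"goal": "flag unusually old customers"}        # step 1 (hypothetical goal)

def retrieve_data(ctx):
    ctx["raw"] = [25, None, 41, 300, 33]                   # step 2: raw ages, with issues
    return ctx

def prepare_data(ctx):
    # step 3: drop missing values and impossible ages
    ctx["clean"] = [a for a in ctx["raw"] if a is not None and 0 <= a <= 120]
    return ctx

def explore_data(ctx):
    ctx["mean"] = sum(ctx["clean"]) / len(ctx["clean"])    # step 4: a summary statistic
    return ctx

def build_model(ctx):
    ctx["model"] = lambda age: age > ctx["mean"]           # step 5: a trivial "model"
    return ctx

def present(ctx):
    return f"mean age of clean data: {ctx['mean']:.1f}"    # step 6: report the result

ctx = set_research_goal()
for step in (retrieve_data, prepare_data, explore_data, build_model):
    ctx = step(ctx)
print(present(ctx))
```

Real projects iterate back and forth between these steps, but the chaining of outputs to inputs is the same idea.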
The next step in data science is to retrieve the required data. This data is
either found within the company or retrieved from a third party.
The main goal of this phase is finding suitable data and getting access to
it from the data owner.
Data can be stored in many forms, ranging from simple text files to tables
in a database. The data can be stored in official data repositories such as
databases, data marts, data warehouses, and data lakes maintained by a
team of IT professionals.
Finding data even within one's own company can sometimes be a challenge. As
companies grow, their data becomes scattered across many places.
Knowledge of the data may be dispersed as people change positions and
leave the company.
Getting access to data is another difficult task. It may take time and
involve company politics.
Although data is considered an asset more valuable than oil by certain
companies, more and more governments and organizations share their
data for free with the world. This data can be of excellent quality.
Step 3: Cleansing, integrating, and transforming data
This phase sanitizes and prepares the data received from the data retrieval
phase for use in the modeling and reporting phases.
It involves checking and remediating data errors, enriching the
data with data from other sources, and transforming it into a suitable
format for models.
Data collected by machines or computers isn’t free from errors either.
Some errors arise from human sloppiness, while others are due to
machine or hardware failure. Examples of errors originating from
machines are transmission errors or bugs in the extract, transform, and
load (ETL) phase.
For small data sets we can check every value by hand. When the variables
under study don’t have many classes, data errors can be detected by
tabulating the data with counts.
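Tabulating with counts can be done with the standard-library `Counter`. The column and its values below are made up for illustration; the point is that rare classes stand out immediately:

```python
from collections import Counter

# A hypothetical "gender" column containing an inconsistent entry.
records = ["F", "M", "F", "M", "M", "Fem", "F", "M"]

counts = Counter(records)
print(counts)
```

Low-frequency classes such as `'Fem'` here are likely data-entry errors to be recoded or investigated.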
Redundant Whitespace
Whitespace characters tend to be hard to detect but, like other
redundant characters, they cause errors.
If we know to watch out for them, fixing redundant whitespace is luckily
easy enough in most programming languages: they all provide string
functions that remove leading and trailing whitespace. For
instance, in Python we can use the strip() method to remove leading and
trailing spaces.
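A quick demonstration of strip() and its one-sided variants on a made-up value:

```python
raw = "  Guntur \t"

print(repr(raw.strip()))   # removes leading and trailing whitespace
print(repr(raw.lstrip()))  # removes leading whitespace only
print(repr(raw.rstrip()))  # removes trailing whitespace only
```

Values like `"  Guntur"` and `"Guntur"` would otherwise be counted as two different categories when tabulating.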
Impossible Values and Sanity Checks
Sanity checks are another valuable type of data check.
Here we check the value against physically or theoretically impossible
values such as people taller than 3 meters or someone with an age of 299
years.
Sanity checks can be directly expressed with rules: check = 0 <= age <= 120
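The rule above translates directly into a small function that flags impossible values; the ages below are invented for illustration:

```python
def sane_age(age):
    """Sanity check: a human age must lie between 0 and 120."""
    return 0 <= age <= 120

ages = [34, 299, -5, 87]
flagged = [a for a in ages if not sane_age(a)]
print(flagged)  # the impossible values
```

Flagged values can then be corrected, set to missing, or investigated back at the source.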
Outliers
An outlier is an observation that seems to be distant from other observations
or, more specifically, one observation that follows a different logic or
generative process than the other observations.
The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.
Outliers can gravely influence the data modeling, so investigate them
first.
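The minimum/maximum check, plus a simple distance-from-the-mean rule, can be sketched with the standard library. The heights below are made up, and the two-standard-deviation cutoff is one common convention, not the only one:

```python
import statistics

# Hypothetical heights in cm; 1100 is clearly a data-entry error.
heights_cm = [160, 172, 168, 175, 180, 1100, 165]

# The easiest check: the extremes reveal the outlier immediately.
print(min(heights_cm), max(heights_cm))

# A simple rule: flag values more than two standard deviations from the mean.
mean = statistics.mean(heights_cm)
sd = statistics.stdev(heights_cm)
outliers = [h for h in heights_cm if abs(h - mean) > 2 * sd]
print(outliers)
```

Note that the outlier itself inflates both the mean and the standard deviation, which is why robust statistics (such as the median) are often preferred for this kind of screening.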
Dealing with Missing Values
Missing values aren’t necessarily wrong, but still need to handle them
separately.
Certain modeling techniques can’t handle missing values.
They might be an indicator that something went wrong in data collection
or that an error happened in the ETL process.
Common techniques data scientists use include omitting the affected
observations, setting the value to null, imputing a static value such as 0
or the mean, imputing a value from an estimated distribution, and
modeling the missing value from the other variables.
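A minimal sketch of two common approaches, omitting the missing observations and imputing the mean, in pure Python with invented data:

```python
import statistics

# Hypothetical ages with missing entries represented as None.
ages = [25, None, 41, None, 33, 52]

# Option 1: omit the missing observations.
omitted = [a for a in ages if a is not None]

# Option 2: impute a static value, here the mean of the observed values.
mean_age = statistics.mean(omitted)
imputed = [a if a is not None else mean_age for a in ages]

print(omitted)
print(imputed)
```

Omission is safe but shrinks the data set; imputation keeps every row but biases the variance downward, so the choice depends on the model being built.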
The visualization techniques used in this phase range from simple line
graphs or histograms to more complex diagrams such as Sankey diagrams and
network graphs.
Sometimes it’s useful to compose a composite graph from simple graphs
to get even more insight into the data. Other times the graphs can be
animated or made interactive to make exploration easier.
Another technique called brushing and linking can also be used. With
brushing and linking we combine and link different graphs and tables (or
views) so changes in one graph are automatically transferred to the other
graphs. This interactive exploration of data facilitates the discovery of
new insights.
Tabulation, clustering, and other modeling techniques can also be a part
of exploratory analysis.
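As a tiny example of tabulation as exploration, the scores below (invented for illustration) can be binned into decades and drawn as a quick text histogram without any plotting library:

```python
from collections import Counter

# Hypothetical exam scores.
scores = [55, 62, 47, 71, 68, 90, 58, 63, 74, 49, 66, 81]

# Bin each score into its decade and count the bins.
bins = Counter((s // 10) * 10 for s in scores)

# A quick text histogram: one '#' per observation in the bin.
for low in sorted(bins):
    print(f"{low:3d}-{low + 9}: {'#' * bins[low]}")
```

Even this crude picture shows where the bulk of the distribution lies; in practice the same exploration would use a proper histogram from a plotting library.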
The last step of the data science process is presenting the results to the
stakeholders and, if needed, automating the analysis.
The main aim of this phase is presenting the results and automating or
industrializing the analysis process for repetitive reuse and
integration with other tools.
These results can take many forms, ranging from presentations to research
reports.
Sometimes it is necessary to automate the execution of the process because
the business will want to use the insights gained in another project or enable
an operational process to use the outcome from the model.
Certain projects require performing the business process over and over
again, so automating the project will save time.
The last stage of the data science process is where soft skills will be most
useful, and they’re extremely important.
Unit-1 Tutorial Questions
1. What is data science? Explain briefly the need for data science with
real time applications.
3. Briefly explain the steps in the data science process with neat
diagram.
4. How do we set the research goal, retrieve data, and prepare data in the
data science process?