Emerging - 2021 - Module 2 PDF

The document provides an overview of the key concepts in data science including defining data and information, explaining the data value chain and basic data analysis processes. It also outlines the data science process from framing problems, collecting and processing data, exploring data through analysis, and communicating results. The document is intended to introduce foundational data science topics for a class on emerging exponential technologies.

SRI KRISHNADEVARAYA EDUCATIONAL TRUST

SIR M. VISVESVARAYA INSTITUTE OF TECHNOLOGY


BENGALURU

DEPARTMENT OF MBA

[Diagram: Overall skill development through workshops & seminars, certification courses, sports/cultural & literary clubs, forum activities & events, projects/consultancy & placements, academics & training, and industry & research exposure]
MBA 3RD SEMESTER

Sub: EMERGING EXPONENTIAL TECHNOLOGIES


Sub code: 20MBA301
Module 2: Data Science
Overview of data science; definition of data and information; data types
and representation; data value chain; data acquisition; data analysis; data
curation; data storage; data usage; basic concepts of big data.
Subject faculty name: Prof. Deepthi J R
Designation: Assistant Professor
Email id: [email protected]
• Data: Raw, unorganized facts that need to be processed. Data can be simple and seemingly random, and is of little use until it is organized.
• Units of information that are collected through observation.
• In the technical sense, data are sets of values of qualitative or quantitative variables about persons or objects.
• Science: The pursuit and application of knowledge and understanding of the natural and social world, following a methodical approach grounded in evidence.
DATA SCIENCE
• Area of study that combines domain expertise, programming skills, and knowledge of statistics and mathematics to obtain meaningful insights from data.
• Gives insights that analysts and business users can use to generate value for the business.
✓E-marketing websites suggesting where and what to buy
✓Filtering e-mails into spam and non-spam categories
✓Websites/apps predicting the shows you will like
OVERVIEW OF DATA SCIENCE
• Data science uses techniques such as machine learning and
artificial intelligence to extract meaningful information and to
predict future patterns and behaviors.
• Advances in technology, the internet, and social media have all increased access to big data.
• The field of data science is growing as technology advances and big data collection and analysis techniques become more sophisticated.
• Data science is the study of data, just as the biological sciences study biology and the physical sciences study physical phenomena. Data is real, data has real properties, and we need to study those properties if we are going to work with data.
• It is a process, not an event. It is the process of using data to understand many different things, to understand the world. Suppose you have a model or proposed explanation of a problem; you then try to validate that proposed explanation or model with your data.
• It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It is when you translate data into a story, using storytelling to generate insight. With these insights, you can make strategic choices for a company or an institution.
HISTORY OF DATA SCIENCE
• The term DATA SCIENCE has emerged only recently to specifically designate
a new profession that is expected to make sense of the vast stores of Big
Data.
• Over the past two decades, tremendous progress has been made in the field of information technology.
• Data science has been officially accepted as a field of study since 2011, though different, related names had been in use since 1962.
DATA SCIENCE DEFINITION
• Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics
to extract meaningful insights from data.
• Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks ordinarily requiring human intelligence. In turn, these systems generate insights which analysts and business users can translate into tangible business value.
DATA SCIENCE PROCESS

1. Frame the problem
2. Collect the raw data needed for your problem
3. Process the data for analysis
4. Explore the data
5. Perform in-depth analysis
6. Communicate results of the analysis
Step 1: Framing the Problem
Before solving a problem, the pragmatic thing to do is to know what exactly the problem is. Data questions must first be translated into actionable business questions. People will, more often than not, give ambiguous inputs on their issues, and in this first step you will have to learn to turn those inputs into actionable outputs.
A great way to work through this step is to ask questions like:
• Who are the customers?
• How can they be identified?
• What does the sales process look like right now?
• Why are they interested in your products?
• Which products are they interested in?
You will need much more context before numbers become insights. At the end of this step, you must have as much information at hand as possible.
Step 2: Collect the raw data needed for your problem
• After defining the problem, you will
need to collect the requisite data to
derive insights and turn the business
problem into a probable solution.
The process involves thinking
through your data and finding ways
to collect and get the data you need.
It can include scanning your internal
databases or purchasing databases
from external sources.
• Many companies store the sales data
they have in customer relationship
management systems. The CRM data
can be easily analyzed by exporting
it to more advanced tools using data
pipelines.
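As a rough sketch of what this collection step yields once a CRM export reaches an analysis environment, the following Python snippet parses a tiny, made-up export using only the standard library; the column names and values (customer_id, region, product, amount) are illustrative assumptions, not any real CRM schema:

```python
import csv
import io

# Hypothetical CRM export -- in practice this would come from a file or a
# data pipeline; the columns here are illustrative only.
crm_export = io.StringIO(
    "customer_id,region,product,amount\n"
    "C001,North,Widget,120.50\n"
    "C002,South,Gadget,75.00\n"
    "C003,North,Widget,98.25\n"
)

# Load each row into a dictionary keyed by column name.
rows = list(csv.DictReader(crm_export))

# A quick sanity check: how many records were collected, and from where?
regions = {row["region"] for row in rows}
print(len(rows))
print(sorted(regions))
```

In a real project the `io.StringIO` stand-in would simply be an open file handle or a database cursor; the parsing step stays the same.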
STEP 3: PROCESSING THE DATA TO ANALYZE
• After the first and second steps, when you have all the data you need, you will have to process it before going further and analyzing it. Data can be messy if it has not been appropriately maintained, leading to errors that easily corrupt the analysis. These issues can be values set to null when they should be zero or the exact opposite, missing values, duplicate values, and many more. You will have to go through the data and check it for problems to get more accurate insights.
• The most common errors that you can encounter and should look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even started
• You will also have to look at the aggregates of all the rows and columns in the file and see if the values you obtain make sense. If they don't, you will have to remove or replace the data that doesn't make sense. Once you have completed the data cleaning process, your data will be ready for an exploratory data analysis (EDA).
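The cleaning checks listed above can be sketched in plain Python; the records, field names, and the assumed sales start date below are all hypothetical:

```python
from datetime import date

# Hypothetical raw sales records -- names and values invented for illustration.
raw_records = [
    {"sale_date": "2021-03-01", "amount": 250.0},
    {"sale_date": "2021-03-02", "amount": None},    # missing value
    {"sale_date": "2021-03-03", "amount": -999.0},  # corrupted (invalid) entry
    {"sale_date": "2020-12-15", "amount": 80.0},    # sale before sales started
    {"sale_date": "2021-03-01", "amount": 250.0},   # duplicate row
    {"sale_date": "2021-03-05", "amount": 120.0},
]

SALES_START = date(2021, 1, 1)  # assumed launch date for the date-range check

cleaned, seen = [], set()
for rec in raw_records:
    if rec["amount"] is None:                           # drop missing values
        continue
    if rec["amount"] < 0:                               # drop corrupted entries
        continue
    if date.fromisoformat(rec["sale_date"]) < SALES_START:  # date-range errors
        continue
    key = (rec["sale_date"], rec["amount"])
    if key in seen:                                     # drop duplicates
        continue
    seen.add(key)
    cleaned.append(rec)

print(len(cleaned))  # records surviving the cleaning pass
```

Each check mirrors one of the error types above; only the records that pass all of them reach the EDA stage.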
STEP 4: EXPLORING THE DATA
• In this step, you will have to develop
ideas that can help identify hidden
patterns and insights.
• You will have to find more interesting
patterns in the data, such as why sales
of a particular product or service have
gone up or down.
• You must analyze this kind of data more thoroughly. This is one of the most crucial steps in a data science process.
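A minimal illustration of this kind of exploration, with made-up monthly sales figures, might aggregate by product and flag which products moved up or down:

```python
from collections import defaultdict

# Hypothetical monthly sales per product -- purely illustrative numbers.
sales = [
    ("Widget", "2021-01", 120), ("Widget", "2021-02", 95),
    ("Gadget", "2021-01", 60),  ("Gadget", "2021-02", 88),
]

# Aggregate by product, in month order, so movement stands out.
by_product = defaultdict(list)
for product, month, amount in sorted(sales, key=lambda s: s[1]):
    by_product[product].append(amount)

# Flag whether each product rose or fell between the first and last month.
trend = {p: ("up" if v[-1] > v[0] else "down") for p, v in by_product.items()}
print(trend)
```

A result like this is exactly the kind of hidden pattern (why did Widget sales drop?) that the in-depth analysis of the next step digs into.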
STEP 5: PERFORMING IN-DEPTH ANALYSIS
• This step will test your mathematical, statistical, and technological knowledge. You must use all the data science tools at hand to crunch the data successfully and discover every insight you can.
• You might have to prepare a predictive model that can compare your
average customer with those who are underperforming.
• You might find several reasons in your analysis, like age or social
media activity, as crucial factors in predicting the consumers of a
service or product.
• You might find several aspects that affect the customer, like some
people may prefer being reached over the phone rather than social
media.
• These findings can prove helpful as most of the marketing done
nowadays is on social media and only aimed at the youth.
• How the product is marketed hugely affects sales, and you may find that demographics you had written off are not a lost cause after all. Once you are done with this step, you can combine the quantitative and qualitative data you have and turn them into action.
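As one small sketch of such a predictive model, the nearest-centroid rule below classifies a prospect by whether their (age, social-media-activity) profile is closer to the average buyer or the average non-buyer. The customer data is invented for illustration, and a real project would use a proper modeling library and far more data:

```python
from statistics import mean

# Hypothetical customer records: (age, social_media_hours, bought_product).
customers = [
    (22, 4.0, True), (25, 3.5, True), (45, 0.5, False),
    (52, 1.0, False), (30, 3.0, True), (48, 0.8, False),
]

buyers = [(a, s) for a, s, b in customers if b]
others = [(a, s) for a, s, b in customers if not b]

def centroid(pts):
    """Mean profile (average age, average social-media hours) of a group."""
    return (mean(p[0] for p in pts), mean(p[1] for p in pts))

c_buy, c_not = centroid(buyers), centroid(others)

def predict(age, social):
    """True if the prospect's profile is closer to the average buyer."""
    def dist2(c):
        return (age - c[0]) ** 2 + (social - c[1]) ** 2
    return dist2(c_buy) < dist2(c_not)

print(predict(26, 3.2))  # a young, socially active prospect
print(predict(50, 0.6))  # an older, less active prospect
```

This is only a toy stand-in for the "compare your average customer with those who are underperforming" idea; it shows how age and social media activity can act as predictive factors.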
STEP 6: COMMUNICATING RESULTS OF THE ANALYSIS
• After all these steps, it is vital to convey your insights and findings to the sales head and make them
understand their importance. It will help if you communicate appropriately to solve the problem you have
been given.
• Proper communication will lead to action; improper communication may lead to inaction.
• You need to link the data you have collected and your insights with the sales head’s knowledge so that
they can understand it better.
• You can start by explaining why a product was underperforming and why specific demographics were not
interested in the sales pitch.
• After presenting the problem, you can move on to the solution to that problem.
• You will have to make a strong narrative with clarity and strong objectives.
DATA ANALYTICS LIFE CYCLE
➢The data analytics lifecycle is designed for big data problems and data science projects. The cycle is iterative, to reflect a real project.
➢To address the distinct requirements of performing analysis on big data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.
PHASE 1: DISCOVERY
• The data science team learns about and investigates the problem.
• Develops context and understanding.
• Identifies the data sources needed and available for the project.
• The team formulates initial hypotheses that can later be tested with data.
PHASE 2: DATA PREPARATION
• Steps to explore, preprocess, and condition data prior to modeling and analysis.
• Requires the presence of an analytic sandbox; the team performs extract, load, and transform (ELT) to get data into the sandbox.
• Data preparation tasks are likely to be performed multiple times, and not in a predefined order.
• Tools commonly used for this phase: Hadoop, Alpine Miner, OpenRefine, etc.

PHASE 3: MODEL PLANNING
• The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
• In this phase, the data science team develops data sets for training, testing, and production purposes.
• Tools commonly used for this phase: MATLAB, Statistica.

PHASE 4: MODEL BUILDING
• The team builds and executes models based on the work done in the model planning phase.
• The team develops datasets for testing, training, and production purposes.
• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
• Free or open-source tools: R and PL/R, Octave, Weka.
• Commercial tools: MATLAB, Statistica.


PHASE 5: COMMUNICATE RESULTS
• After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
• The team considers how best to articulate findings and outcomes to the various team members and stakeholders, taking caveats and assumptions into account.
• The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
PHASE 6: OPERATIONALIZE
• Analytics operationalization is the process of bringing the right data, at the right time, to the right users, all in a repeatable and collaborative fashion, where the data can be trusted for business insights and actions.
• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and make adjustments before full deployment.
• The team delivers final reports, briefings, and code.
TYPES OF DATA ANALYTICS
1. Prescriptive analytics
▪The motivation behind prescriptive analytics is to prescribe what action to take to eliminate a future issue or take full advantage of a promising trend. Prescriptive analytics utilizes advanced tools and technologies, such as machine learning, business rules, and algorithms, which makes it complex to implement and manage.
2. Predictive analytics
▪As the name hints, predictive analytics is about what is going to happen. It uses the discoveries of descriptive and diagnostic analytics to identify clusters and exceptions and to predict future trends, which makes it a significant tool for forecasting.
3. Diagnostic analytics
▪At this stage, historical information can be compared against other data to answer the question of why something happened. Diagnostic analytics provides in-depth insight into a specific issue.
4. Descriptive analytics
▪Descriptive analytics simply describes what happened; it transforms raw information from numerous data sources to give important insight into the past. These outcomes, however, merely signal that something is wrong or right, without explaining why.
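To make the descriptive and diagnostic stages concrete, here is a small sketch with invented regional sales figures: the summary answers "what happened", and the first-to-last-week comparison gestures at "why":

```python
from statistics import mean

# Hypothetical weekly sales for two regions -- illustrative numbers only.
sales = {"North": [100, 95, 80, 70], "South": [60, 62, 65, 64]}

# Descriptive analytics: summarize what happened.
summary = {region: {"total": sum(v), "avg": mean(v)} for region, v in sales.items()}

# Diagnostic analytics (very roughly): compare the first and last weeks to
# locate where a decline came from.
change = {region: v[-1] - v[0] for region, v in sales.items()}

print(summary["North"]["total"])
print(change)  # North fell over the month; South rose slightly
```

Predictive analytics would extrapolate those week-over-week changes forward, and prescriptive analytics would recommend an action for the declining region.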
DATA TYPES AND REPRESENTATION
• As numerical calculations are performed using various compilers, it is necessary to understand the various data types.
• The most important types of data used are characters, integers, and floats.
1. Char Type
➢Stores strings of numbers, letters, and symbols.
➢The data types "CHAR" and "VARCHAR" are referred to as "character string types", along with their corresponding values.
2. Int Type
➢An integer is a whole number (not a fractional number) that can be positive, negative, or zero. Examples of integers are: -5, 1, 5, 8, 97, and 3,043. Examples of numbers that are not integers are: -1.43, 1 3/4, 3.14, 0.09, and 5,643.1.
3. Float Type
➢A float data type in Java stores a decimal value with 6-7 total digits of precision. So, for example, 12.12345 can be saved as a float, but 12.123456789 cannot.
➢When writing a float literal in Java, we should append the letter f to the end of the value; otherwise it will be treated as a double.
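The same three types exist in Python, the language most commonly used for data science work; a short sketch (note that Python's float is double-precision, unlike Java's 32-bit float, so no f suffix is needed):

```python
# The three data types from the slides, in Python terms:
# str (character strings), int (whole numbers), float (decimal values).
name, count, price = "Widget", 42, 12.5

print(type(name).__name__, type(count).__name__, type(price).__name__)

# Integers can be positive, negative, or zero; floats carry a fractional part.
assert isinstance(-5, int) and isinstance(0, int)
assert not isinstance(3.14, int)

# Floating-point precision is still finite, even at double precision:
print(0.1 + 0.2 == 0.3)  # False -- classic floating-point rounding
```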
THE DATA VALUE CHAIN
• From the perspective of big data and the private sector, data sources are discovered, ingested, analyzed, stored, and eventually exploited by organizations to add value.
• Exploring high-value uses and building a procedure to convert raw data into useful information is the core of the data value chain.
• When data are used, they influence decisions and affect someone's well-being.
1. Collection
• The initial phase of the data value chain process is collection. The collection stage begins by probing into a question:
• What kind of data is required to solve the problem?
• The answer to this question leads to the collection of appropriate data for the subsequent process.
2. Publication
• The second phase of the data value chain process.
• Once data is collected, this can be published or presented in a way the
users can use.
• This stage involves three activities: publishing data in appropriate format,
distributing the data to the eventual users, and finally analyzing the data to
derive the required information.
3. Uptake
• This stage consists of three activities: linking the data to the prospective users, encouraging the users to integrate data into the decision-making process, and persuading them to value data.
• The linking of the data to prospective users can be done in numerous ways, such as press releases, online broadcasting, training, seminars, or other educational events.

4. Impact
This final phase of the data value chain process involves three activities: using data to comprehend a problem and take a decision, altering the consequences of a project or improving a situation, and reusing the data by combining them with other data and sharing them freely.
• Data acquisition
Gathering ,filtering and cleaning data before it is put in a data warehouse or any other
storage solution.
• Data analysis
Exploring, transforming, and modelling data with the goal of highlighting relevant data
synthesizing and extracting useful hidden information with high potential from a business
point of view.
• Data curation
Data curation is the organization and integration of data collected from various sources. It involves annotation (explanation), publication, and presentation of the data such that the value of the data is maintained over time and the data remains available for reuse and preservation.
• Data storage
✓Persistence and management of data in a scalable way that provides fast access to the data.
✓Parallel data storage solutions like the Hadoop framework, NoSQL, cloud storage, etc. are some well-known big data storage options.

• Data usage
✓Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
• McDonald's mission is to provide customers with low-
priced food items
• McDonald's value chain is a component of the industry's
value system. The value system is composed of various
other value chains of the business units of all
organizations involved, such as the company's beverage
suppliers and the rest of the supply chain.
• With the advent of the internet of things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance. The emergence of machine learning has produced still more data.
• While big data has come far, its usefulness is only
just beginning. Cloud computing has expanded big
data possibilities even further.
• The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to
test a subset of data.
• And graph databases are becoming increasingly
important as well, with their ability to display
massive amounts of data in a way that makes
analytics fast and comprehensive.
BIG DATA
Big data can help you address a range of business
activities, from customer experience to analytics.
Here are just a few.
• Product development
• Predictive maintenance
• Customer experience
• Fraud and compliance
• Machine learning
• Operational efficiency
• Drive innovation
BIG DATA EXAMPLES
• Personalized e-commerce shopping experiences
• Financial market modelling
• Compiling trillions of data points to speed up cancer research
• Media recommendations from streaming services like Spotify, Hulu and Netflix
• Predicting crop yields for farmers
• Analyzing traffic patterns to lessen congestion in cities
• Data tools recognizing retail shopping habits and optimal product placement
• Big data helping sports teams maximize their efficiency and value
• Recognizing trends in education habits from individual students, schools and districts
BIG DATA STATISTICS
❑Google gets over 3.5 billion searches daily.
❑Data interactions went up by 5000% between 2011 and 2021.
❑In 2020, every person generated 1.7 megabytes of data per second.
❑WhatsApp users exchange up to 65 billion messages daily.
❑95% of businesses cite the need to manage unstructured data as a problem for their business.
❑Twitter users send over half a million tweets every minute.
CHARACTERISTICS
OF BIG DATA
1. Volume
• The most characteristic property of big data, volume, highlights the amount of data that passes through the
business day in and day out and how each of the data items needs to be captured to make a holistic sense
of the business to derive value out of it.
• Data here will refer to anything that can be captured, structured, unstructured, semi-structured, arriving in
batches or real-time.
• Without the huge volume, it would not be fair to call the data ecosystem big data, even if it captures every
business aspect. The whole premise of value in big data is based on this first V, volume.
• To give you a small sense of how big the data should be to classify it as big data, consider this: the year 2016 saw global mobile traffic of 6.2 exabytes per month, and it was estimated that by the year 2020 this would easily top 40,000 exabytes.
2. Velocity
• Big data refers to the crucial characteristic of capturing data coming in at any speed. In
today’s world, data is almost stateless.
• It is coming and leaving the business ecosystem at great speeds.
• Big data systems are equipped to capture this data at the rate at which it is coming in.
If the speed does not match up to the speed at which data is coming in, there will be
frequent backlogs, ultimately choking the system.
• Big data systems are designed to handle a massive and continuous flow of data—
methods like sampling help in dealing with velocity issues in a big data system.
• As an example of the velocity that a big data system has to endure, more than 3.5 billion searches per day are made through the Google search engine. With an ever-increasing number of active accounts on Facebook, the number of likes, updates, shares, and comments coming into Facebook increases by 22% every year.
3. Variety

• It is characteristic of big data to capture anything and everything of value in the business
ecosystem.
• This includes data with no immediate value to derive but can be processed further with advanced
tools to gain insights into building intelligence into the system.
• Apart from the structured data that a business is used to, there are unstructured data buckets like images, videos, sounds, flat files, email bodies, log files, and more.
• These contain data that can be mined with advanced tools.
• A big data system is designed to capture the unstructured and semi-structured data passing through the business in a timely and efficient manner.
• This also means that, apart from storing a variety of heterogeneous data, the big data system should hook onto these various data types and data sources efficiently, without compromising on speed.
4. Veracity
• With the volume, variety, and velocity that big data allows for, the models built on the data will not be of true value without this characteristic.
• Veracity is the trustworthiness of the source's data and the quality of the data derived after processing.
• The system should allow mitigation against data biases, abnormalities or inconsistencies, volatility, and duplication, among other factors.

5. Value
▪ Value is one of the first properties discussed in business, and a certain degree of value will be projected at
the outset of a big data project.
▪ Big data helps to build the infrastructure that machine learning and artificial intelligence can be based on.
▪ Businesses that start today into big data can tomorrow easily transition to machine learning and artificial
intelligence to augment their decision-making processes.
BIG DATA LIFECYCLE
TOP 5 SOURCES OF BIG DATA
