
Module 3

Introduction to Data Science


Benefits and uses of data science and big data, Facets of data, The data
science process: defining goals, retrieving data, cleansing and transforming
data, exploratory data analysis, building models, visualization
Big Data & Data Science
• Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.
• Its size and complexity are so large that no traditional data
management tool, such as an RDBMS, can store or process it
efficiently.
• In short, big data is data of such huge size that it cannot be managed by
common software tools.
• Data science involves methods to analyze massive
amounts of data and extract the knowledge it contains.
• It focuses on extracting knowledge from data sets that
are typically huge in volume.
Big Data….
• Big data is a term that describes the large volume of data – both structured
and unstructured.
• But it’s not the amount of data that’s important. It’s what organizations do
with the data that matters.
• Big data is a big deal for industries. The emergence of IoT and other
connected devices has created massive growth in the amount of
information organizations collect, manage and analyze. Along with big data
comes the potential to unlock big insights – for every industry, large and
small.
• For a business organization, it enables
1) Cost reductions
2) Time reduction
3) New product development and optimized offerings
4) Smart decision making.
Examples of Big Data
• NYSE: About 1 TB of new trade data in a single day is generated alone by
The New York Stock Exchange.

• Social Media: People add more than 500 TB of new data on various social
media applications such as Facebook, Instagram every single day in the
form of videos, photos, messages, comments.

• Black box data: More than 10 TB of data is generated during a 30-minute
flight of a single jet engine, and there are several jet engine flights every
single day. Black box data includes flight crew voices, microphone
recordings, and aircraft performance information.
Characteristics Of Big Data – 5 Vs
Big data can be described by the following characteristics- 3 V’s
Volume, Variety, Velocity
Volume: Organizations collect data from a variety of sources, including
business transactions, smart (IoT) devices, industrial equipment, videos,
social media and more. In the past, storing it would have been a problem –
but cheaper storage on platforms like data lakes and Hadoop have eased the
burden.
Velocity: refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential
in the data. RFID tags, sensors and smart meters are driving the need to deal
with these torrents of data in near-real time.
Variety: Data comes in all types of formats – from structured, numeric data
to unstructured text documents, emails, videos, audio, stock ticker data,
photos, monitoring devices, PDFs and financial transactions.
2 extra V’s- Veracity & Variability
Often these characteristics are complemented with 2 more
V’s – Veracity, Variability:
Veracity: How accurate is the data? It refers to the quality of data.
Because data comes from so many different sources, it’s difficult to
link, match, cleanse and transform data across systems.
Variability: Data flows are unpredictable – changing often and varying
greatly. It’s challenging.

These properties make big data different from the data
found in traditional data management tools.
The challenges they bring can be felt in almost every
aspect: data capture, curation, storage, search, sharing,
transfer, and visualization. In addition, big data needs
specialized techniques to extract the insights.
Benefits and uses of Data science and big data
• Data science and big data are used almost everywhere in both
commercial and noncommercial settings.
Benefits & uses..

• Commercial companies in almost every industry use data
science and big data to gain insights into their
customers, processes, staff, competition, and products.
Many companies use data science to offer customers a
better user experience, as well as to cross-sell, up-sell,
and personalize their offerings.
• A good example of this is Google AdSense, which collects
data from internet users so relevant commercial
messages can be matched to the person browsing the
internet.
• People Analytics: Data science and big data are used to
monitor millions of individuals by collecting data records
from widespread applications such as Google Maps,
Angry Birds, email, and text messages, among many
other data sources. Data science techniques are then
applied to extract insights from these records.
• Governmental organizations are also aware of data’s
value. The Open Government Data (OGD) Platform
India, or data.gov.in, is a platform supporting the Open
Data initiative of the Government of India. This portal is a
single-point access to datasets, documents, services,
tools and applications published by ministries,
departments and organizations of the Government of
India.

• Universities use data science in their research but also to
enhance the study experience of their students.
The rise of massive open online courses (MOOCs)
produces a lot of data, which allows universities to study
how this type of learning can complement traditional
classes.
Facets of big data
In data science and big data there are different types of data, and each
of them tends to require different tools and techniques.
The main categories of data are these:
• Structured
• Unstructured
• Semi structured
• Natural language
• Machine-generated
• Graph-based/Network data
• Audio, video, and images
• Streaming
Structured data
Data stored in a fixed format is known as structured data.
For example, data stored in relational database management
systems (RDBMS) as tables.
SQL (Structured Query Language) is used to query and manage structured data.
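As a small illustration (the table name and values below are made up), structured data can be queried declaratively with SQL, here via Python's built-in sqlite3 module:

```python
# Minimal sketch (hypothetical table and values): querying structured,
# tabular data with SQL using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "India"), (2, "Ben", "Brazil")],
)

# Structured data has a fixed schema, so a declarative SQL query is enough.
for row in conn.execute("SELECT name FROM customers WHERE country = 'India'"):
    print(row)                              # ('Asha',)
conn.close()
```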
Unstructured Data
Data that is not stored in a fixed format is called unstructured data,
for example, a mix of text files, images, videos and numbers held
within an organization's systems.
• Google Search data in its raw form is an example of an unstructured format.
Semi-structured data - Email
Semi-structured data can contain both forms of data.
• Data stored in files such as XML files appears structured, but in
actuality much of it is present in an unstructured format. For example, personal data
stored in an XML file.
Natural language
• Natural language is a special type of unstructured data.
• It’s challenging to process because it requires knowledge
of specific data science techniques and linguistics.
• A human-written email is a perfect example of natural
language data.
• Although email contains structured elements such as the
sender, title, and body text, it's a challenge to analyze the
contents as there are thousands of different languages
and dialects and ways to refer to things.
Machine-generated data
• Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention. Machine-generated data is becoming a major data
resource.
• Nowadays there are more connected devices (IoT) than
people.
• The analysis of machine data relies on highly scalable
tools, due to its high volume and speed.
• Examples of machine data are web server logs, call detail
records etc.
Graph-based or Network data
• Graph-based data is a natural way to represent social
networks.
• Graph databases are used to store graph-based data and
are queried with specialized query languages such as
SPARQL.
• Example: Friends in social networks
Audio, image, and video
• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for
computers.
• Examples include data from satellites, sensors and surveillance devices.
Streaming data
Streaming data can take almost any of the
video/audio/image forms, but it has an extra property: the
data flows into the system continuously.
Examples are the live sporting or music events,
surveillance camera and the stock market.
Data Science Process- The flow of Data Science Process
Following a structured approach helps…

• To maximize success in a data science project
• Increased impact of research results
• Work in prototype mode, which attains better business value
• Reduce cost
• Reduce time
• Makes it possible to take up a project as a team of experts in various fields.
A typical data science project consists of 6 steps.
Data science projects also use the
Agile methodology.
The data science life cycle essentially comprises data
collection, data cleaning, exploratory data analysis, model
building and model deployment.
6 steps of Data Science Process
Step 1: Setting Research Goal – What, How, Why & Project Charter.

• Data science is mostly applied in the context of an organization
when the business requires a data science project to be performed.
• The main purpose here is making sure all the stakeholders
understand the what, how, and why of the project. In every serious
project this will result in a project charter.
• A project starts by understanding the what, the why, and the how of
your project.

• What does the company expect you to do?
• And why does management place such a value on your research?
• Is it part of a bigger strategic picture or an independent one?
• Answering these three questions (what, why, how) is the goal of
the first phase.
• The outcome should be a clear research goal, a good understanding
of the context, well-defined deliverables, and a plan of action with
a timetable.
• This information is then best placed in a project charter.
1. Spend time to understand the goals & context of research
• Continue questioning and devise examples to grasp exact business
expectations.
• Ensure the project fits in a bigger picture
• Keep stakeholders engaged by clarifying the expected project results

2. Project Charter includes-
• A clear research goal
• Mission and context
• Modus operandi of analysis
• Resource details(data)
• Proof of concepts/ achievability of project
• Deliverables and success measures
• Time line for operations
Helps to make an estimation of cost, data, and people required.
Step 2: Retrieving Data

• Sometimes you need to go into the field and design a data collection
process yourself, but most of the time you won’t be involved in this step.
Many companies will have already collected and stored the data for you,
and what they don’t have can often be bought from third parties.
• Data can be stored in many forms, ranging from simple text files to tables
in a database. The objective now is acquiring all the data you need.
• Finding and getting access to data needed in your project.
• This data is either found within the company or retrieved from a
third party.
• Project Charter states which data you need and where you can
find it.
• Data takes many forms ranging from text files, Excel
spreadsheets to different types of databases.
1. Start with data stored within the company –
• This data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes.
• A data mart is a subset of the data warehouse and geared toward
serving a specific business unit.
• While data warehouses and data marts are home to preprocessed
data, data lakes contain data in its natural or raw format.
• Finding data even within your own company can sometimes be a
challenge.
• Getting access to data is another difficult task. Organizations
understand the value and sensitivity of data and often have policies in
place so everyone has access to what they need and nothing more.
2. Don’t be afraid to shop around
• If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. Other companies provide data so that you, in
turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
Data.gov.in – The home of the Indian Government's open data
https://open-data.europa.eu/ – The home of the European Commission's open data
Freebase.org – An open database that retrieves its information
from sites like Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org – Open data initiative from the World Bank
Aiddata.org – Open data for international development
Open.fda.gov – Open data from the US Food and Drug Administration

3. Do quick data quality checks now to prevent problems later.

• Expect to spend a good portion of your project time doing data correction and cleansing, sometimes
up to 80%. The retrieval of data is the first time you'll inspect the data in the data science process.
• During data retrieval, you check to see if the data is equal to the data in the source document and
look to see if you have the right data types.
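A minimal sketch of such quick checks with pandas, assuming the retrieved data sits in a hypothetical file named retrieved_data.csv:

```python
# Minimal sketch of the quick quality checks mentioned above, assuming the
# retrieved data sits in a hypothetical CSV file named "retrieved_data.csv".
import pandas as pd

df = pd.read_csv("retrieved_data.csv")

print(df.shape)           # number of observations and variables
print(df.dtypes)          # are the data types what the source promises?
print(df.head())          # eyeball a few rows against the source document
print(df.isna().sum())    # missing values per column
print(df.describe())      # ranges that reveal impossible values early
```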
Step 3: Data Preparation - Cleansing , Integration & Transforming Data - ETL

To enhance the quality of the data collected in Retrieval phase.


This phase consists of three subphases:
Data cleansing removes false values from a data source and
inconsistencies across data sources- Data entry errors, Impossible
values, missing values, typos, Outliers, white spaces etc.
Data integration enriches data sources by combining information from
multiple data sources:
joining tables (enriching an observation from one table with
information from another table) and appending/stacking tables (adding the
observations of one table to those of another table).
Data transformation ensures that the data is in a suitable format for use
in your models – reducing features, creating dummy variables.
Cleansing-
• Data cleansing is a subprocess of the data science process that focuses on removing errors in data so the data
becomes a true and consistent representation of the processes it originates from.
• "True and consistent representation" implies that at least two types of errors exist.
• The 1st type is the interpretation error, e.g. age = 300.
• The 2nd type of error points to inconsistencies between data sources or against your company's standardized
values. An example of this class of errors is putting "Female" in one table and "F" in another when they
represent the same thing: that the person is female.
1. Data entry errors – human sloppiness, transmission errors (ETL phase)
2. Redundant whitespace (leading or trailing spaces)
3. Capital letter mismatches. Ex: "Brazil" and "brazil"
4. Impossible values (height = 10 meters) – sanity checks. Ex: check if 0 <= age <= 120
5. Outliers
6. Missing values – omit the value, set to null, put 0 or the mean value, put an estimated value
7. Deviation from the code book (metadata) – standard values
8. Different measurement units
9. Different levels of aggregation
7. DEVIATIONS FROM A CODE BOOK
• Detecting errors in larger data sets against a code book or against standardized
values can be done with the help of set operations. A code book is a description of
your data, a form of metadata. It contains things such as the number of variables per
observation, the number of observations, and what each encoding within a variable
means.
8. DIFFERENT UNITS OF MEASUREMENT
• When integrating two data sets, you have to pay attention to their respective units
of measurement.
9. DIFFERENT LEVELS OF AGGREGATION
• Having different levels of aggregation is similar to having different types of
measurement. An example of this would be a data set containing data per week versus
one containing data per work week. This type of error is generally easy to detect, and
summarizing (or the inverse, expanding) the data sets will fix it.
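A minimal sketch of a few of the cleansing steps listed above, using pandas on a small made-up table (column names and values are illustrative):

```python
# Minimal sketch of common cleansing steps, on a small hypothetical DataFrame.
import pandas as pd

df = pd.DataFrame({
    "country": [" Brazil", "brazil ", "India"],
    "age": [34, 300, None],          # 300 is an impossible value, None is missing
})

# 2. Redundant whitespace and 3. capital-letter mismatches
df["country"] = df["country"].str.strip().str.lower()

# 4. Impossible values: sanity check 0 <= age <= 120
df.loc[~df["age"].between(0, 120), "age"] = None

# 6. Missing values: here imputed with the mean (omitting or estimating also work)
df["age"] = df["age"].fillna(df["age"].mean())

print(df)
```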
• Correct errors as early as possible
• A good practice is to mediate data errors as early as possible in the data
collection chain and to fix as little as possible inside your program while
fixing the origin of the problem.
• Data should be cleansed when acquired for many reasons:
• ■ Not everyone spots the data anomalies. Decision-makers may make costly
mistakes on information based on incorrect data from applications that fail
to correct for the faulty data.
• ■ If errors are not corrected early on in the process, the cleansing will have
to be done for every project that uses that data.
• ■ Data errors may point to a business process that isn’t working as
designed.
• ■ Data errors may point to defective equipment, such as broken
transmission lines and defective sensors.
• ■ Data errors can point to bugs in software or in the integration of software
that may be critical to the company.
Integration
• Data to be integrated from different sources varies in size, type, and structure,
ranging from databases and Excel files to text documents.
Different ways of combining data
• The first operation is joining: enriching an observation from one table with
information from another table.
• The second operation is appending or stacking: adding the observations of one
table to those of another table.
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid duplication of data, you virtually combine data with views.
ENRICHING AGGREGATED MEASURES
• Data enrichment can also be done by adding calculated information to the table,
such as the total number of sales or what percentage of total stock has been sold in
a certain region.
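A minimal sketch of joining, appending, and enriching with an aggregated measure, using pandas on two hypothetical tables:

```python
# Minimal sketch of the two combining operations, with hypothetical tables.
import pandas as pd

orders  = pd.DataFrame({"client_id": [1, 2], "amount": [250, 80]})
clients = pd.DataFrame({"client_id": [1, 2], "region": ["North", "South"]})

# Joining: enrich each observation in `orders` with client information.
joined = orders.merge(clients, on="client_id", how="left")

# Appending/stacking: add the observations of one table to those of another.
more_orders = pd.DataFrame({"client_id": [3], "amount": [120]})
stacked = pd.concat([orders, more_orders], ignore_index=True)

# Enriching with an aggregated measure, e.g. total sales per region.
joined["region_total"] = joined.groupby("region")["amount"].transform("sum")
print(joined, stacked, sep="\n")
```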
Transformation
• Certain models require their data to be in a certain shape
• Relationships between an input variable and an output variable aren’t always
linear.

REDUCING THE NUMBER OF VARIABLES

TURNING VARIABLES INTO DUMMIES
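A minimal sketch of both transformations on illustrative data, using pandas for dummy variables and scikit-learn's PCA as one possible way to reduce the number of variables:

```python
# Minimal sketch: turning a categorical variable into dummies and reducing
# the number of variables with PCA (data and column names are illustrative).
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Mon"],
    "x1": [1.0, 2.0, 3.0],
    "x2": [2.0, 4.0, 6.1],
})

dummies = pd.get_dummies(df, columns=["weekday"])   # weekday_Mon, weekday_Tue

reduced = PCA(n_components=1).fit_transform(df[["x1", "x2"]])  # fewer features
print(dummies)
print(reduced)
```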


Step 4: Data Exploration (EDA – Exploratory Data Analysis)

• Information becomes much easier to grasp when shown in a picture,
therefore you mainly use graphical techniques to gain an understanding
of your data and the interactions between variables.
• The goal isn't to cleanse the data, but it's common that you'll still
discover anomalies you missed before, forcing you to take a step back
and fix them.
• Brushing and linking- to combine and link different graphs
and tables so changes in one graph are automatically
transferred to the other graphs.
• Tabulation, clustering and other non-graphical modeling
techniques can also be a part of exploratory analysis.
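A minimal sketch of graphical exploration with pandas and matplotlib, again assuming the hypothetical retrieved_data.csv from earlier:

```python
# Minimal sketch of graphical EDA (hypothetical CSV file from the retrieval step).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("retrieved_data.csv")

df.hist(figsize=(10, 6))                                 # distribution of each variable
pd.plotting.scatter_matrix(df.select_dtypes("number"))   # pairwise interactions
df.boxplot()                                             # boxplots often reveal outliers
plt.show()
```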
Step 5: Data modelling
• With clean data in place and a good understanding of the content, models
can be designed with the desired goals.
• This phase use models, domain knowledge, and insights about the data
found in the previous steps to answer the research question.
• A technique is selected from the fields of statistics, machine learning,
operations research, and so on.
• Building a model is an iterative process that involves selecting the
variables for the model, executing the model, and model diagnostics.
Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model - choosing
the right model for a problem requires judgment.
2. Execution of the model - need to implement it in code using languages
such as Python.
3. Diagnosis and model comparison - build multiple models from which you then
choose the best one based on multiple criteria. The model should work on
both hold-out data (test data) and unseen data. Error measures are
calculated to evaluate it.
• Model diagnostics- verifying that the assumptions and requirements are
indeed met.
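A minimal sketch of the execution and diagnosis steps with scikit-learn, on a tiny made-up dataset (linear regression stands in for whatever technique was selected):

```python
# Minimal sketch of the modelling steps with scikit-learn (illustrative data).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = [[1], [2], [3], [4], [5], [6]], [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# 1. Technique and variable selection happened above; 2. execute the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 3. Diagnosis: evaluate an error measure on hold-out (test) data.
print(mean_squared_error(y_test, model.predict(X_test)))
```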
Step 6: Visualization

• After successfully analyzing the data and building a well-
performing model, present the results to business
stakeholders.
• These results can take many forms, ranging from
presentations to research reports.
• This is the stage of the data science process where soft skills
are most needed: you have to convince the stakeholders
that this model will indeed change their business process.
• You may also need to automate the model at business sites.
• This structured six-step approach pays off in
terms of a higher project success ratio and increased
impact of research results.
• This process ensures you have a well-defined research
plan, a good understanding of the business question, and
clear deliverables before you even start looking at data.
• The first steps of your process focus on getting high-quality
data as input for your models. This way your
models will perform better later on.
• Reference: Introducing Data Science (Manning liveBook) –
https://livebook.manning.com/book/introducing-data-science/chapter-2
Understanding Big Data
What is big data; why big data – convergence of key trends – unstructured data
– Industry examples of big data – web analytics – big data and marketing –
fraud and big data – risk and big data – credit risk management – big data and
algorithmic trading – big data and healthcare – big data in medicine –
advertising and big data – big data technologies.
What is Big Data
• Big Data is a collection of data that is huge in volume, yet growing exponentially
with time.
• Big data is a term that describes large, hard-to-manage volumes of data – both
structured and unstructured – that inundate businesses on a day-to-day basis.
• Big data analytics examines large amounts of data to uncover hidden patterns,
correlations and other insights.
• The new benefits that big data analytics brings - Speed and Efficiency.
• A few years ago a business would have gathered information, run analytics and
unearthed information for making future decisions.
• Today that business can identify insights for immediate decisions.
• The ability to work faster – and stay agile – gives organizations a competitive edge
they didn’t have before.
Why it is important
Helps organizations to use their data for
More efficient operation
For smarter business moves
Higher profits and happier customers.

1.Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data – plus they
can identify more efficient ways of doing business.
2.Faster, better decision making. With the speed of Hadoop and in-memory analytics,
combined with the ability to analyze new sources of data, businesses are able to
analyze information immediately – and make decisions based on what they’ve learned.
3.New products and services. With the ability to gauge customer needs and satisfaction
through analytics comes the power to give customers what they want.
• Walmart handles more than one million customer transactions every
hour.
• Facebook handles 40 billion photos from its user base
• Decoding the human genome originally took 10 years to process, but now
it can be achieved in one week.
Why Big Data
Companies use big data in their systems to improve operations,
provide better customer service, create personalized marketing
campaigns and take other actions that, ultimately, can increase
revenue and profits.
Several factors have contributed to the current interest in Big Data:
• Computing perfect storm
• Data perfect storm
• Convergence perfect storm
• Computing perfect storm: Big Data analytics is the natural result of four
major global trends:
• Moore's Law (technology always gets cheaper):
as microprocessors and sensors have become smaller and cheaper,
they are being incorporated into many common devices, such as appliances,
cars, and even light bulbs.
(Moore's Law refers to Moore's perception that the number of transistors on a microchip doubles every two years,
though the cost of computers is halved.)
Major data contributors – IoT devices.
• Mobile computing (that smartphone or mobile tablet in your hand)
• Social networking (Facebook, Instagram, Pinterest, etc.)
• Cloud computing (you don't even have to own hardware or software
anymore) – SaaS, on-demand services.
Computer memory has also become much cheaper and easier to search.
• Data perfect storm: Volumes of transactional data have been around for
decades at most big firms, but now
• the gates have opened with more volume, plus velocity and variety—
the three Vs.
• The three Vs make the data extremely complex and cumbersome for current
data management and analytics technology and practices.

• Convergence perfect storm: New alternatives for IT and business executives
to address Big Data analytics. A merging of:
• Traditional DBMS and analytics software and hardware technologies
• Open-source technology
• Commodity hardware (affordable devices that are generally compatible
with other such devices).
Convergence of Key trends
• Change in the way we access the data and use it to create value.

• New technologies like Hadoop, Cloud computing, Machine learning, IoT, Artificial
Intelligence etc. – access a tremendous amount of data and extract value from it.

• So, there is now more data and less expensive faster hardware.

• Ability to do real time analysis on complex data sets.

• Now real time analytics have become affordable.

• Companies are using big data analytics to improve sales revenue, increase profits
and give a better service to customers.
3 V’s of Big data: V3



V3 : V for Volume
• Volume of data, which needs to be
processed is increasing rapidly
• More storage capacity
• More computation
• More tools and techniques

Exponential increase in
collected/generated data



V3: V for Variety
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series,
social media data, multi-
dimensional arrays, etc…

• Static data vs. streaming data

• A single application can be
generating/collecting many types
of data.

To extract knowledge ➔ all these types of
data need to be linked together.



V3: V for Velocity
• Data is being generated fast and needs to be
processed fast
• For time-sensitive processes such as
catching fraud, big data must be used as it
streams into your enterprise in order to
maximize its value

• Scrutinize 5 million trade events created
each day to identify potential fraud

• Analyze 500 million daily call detail records in
real-time to predict customer churn faster

• Sometimes, 2 minutes is too late!


• The latest we have heard is that a 10 ns (nanosecond)
delay is too much



Types of Data
• Structured
• Unstructured
• Semi structured
Unstructured data
• Unstructured data is information that is not arranged according to a
pre-set data model or schema, and therefore cannot be stored in a
traditional relational database or RDBMS.
• Text and multimedia are two common types of unstructured content.
• Many business documents are unstructured, as are email messages,
videos, photos, webpages, and audio files.
• From 80 to 90 percent of the data generated and collected by
organizations is unstructured, and its volume is growing rapidly.
Human-generated unstructured data
• Email: Email message fields are unstructured and cannot be parsed by traditional
analytics tools. That said, email metadata affords it some structure, and explains why
email is sometimes considered semi-structured data.
• Text files: This category includes word processing documents, spreadsheets,
presentations, email, and log files.
• Social media and websites: Data from social networks like Twitter, LinkedIn, and
Facebook, and websites such as Instagram, photo-sharing sites, and YouTube.
• Mobile and communications data: Text messages, phone recordings, collaboration
software, Chat, and Instant Messaging.
• Media: Digital photos, audio, and video files.

Unstructured data generated by machines


• Scientific data: Oil and gas surveys, space exploration, seismic imagery, and atmospheric data.
• Digital surveillance: Reconnaissance photos and videos.
• Satellite imagery: Weather data, land forms, and military movements.
How is unstructured data structured?
• Unstructured types of data can actually have internal structural elements.
• They’re considered “unstructured” because their information doesn’t lend
itself to the kind of table formatting required by a relational database.
• Unstructured data can be textual or non-textual (such as audio, video, and
images), and generated by people or by machines.
• Non-relational databases such as MongoDB are the preferred choice for
storing many kinds of unstructured data.
• Unstructured data can be stored in a number of ways: NoSQL (non-
relational) databases, data lakes, and data warehouses.
• Platforms like MongoDB Atlas are especially well suited for housing,
managing, and using unstructured data.
Industry Examples of Big data
• Digital Marketing:
Engage the customers at the right moment with the right message is the
biggest issue for marketers.
Big data helps marketers to create targeted and personalized campaigns.

Reasons why big data is important for digital marketers


• Real-time customer insights
• Personalized targeting
• Increasing sales
• Improves the efficiency of a marketing campaign
• Budget optimization
• Measuring campaign's results more accurately
New terms…
• Sentiment Analysis (opinion analysis)
• Predictive Analytics & Visualization – Whom to contact? When to contact? How to
contact? And what to offer?
• Predictive analytics helps marketers to determine which customer or customer
segments to target and the right content for each customer. It also helps to
discover the right channels and the right timings for the campaign.
• Data visualization is the process of presentation of information or data in visual
formats such as graphs, charts, tables, diagrams, and maps. It helps to understand
huge data sets more easily and fast because humans are visual by nature.
• CTR stands for click-through rate: a metric that measures the number of clicks
advertisers receive on their ads per number of impressions. Percentage of people
who view your ad (impressions) and then actually go on to click the ad (clicks).
• Digital Marketing channels- Email Marketing, Pay-Per-Click Advertising (PPC),
Search Engine Optimization (SEO) , Display Advertising, Facebook, Linkedin etc.
• PPC – Facebook Ads, Twitter Ads etc.
Web Analytics (Digital Analytics)
• Web analytics is a term that applies to all forms of online measurement.
• Web Analytics is the measurement, collection, analysis, and reporting of
website data for the purposes of understanding and optimizing Web usage.
• Allow you to understand exactly how people engage with your website.
Web analytics allows you to understand things like
How people find your website
How people engage with your website
What your website’s strengths and weaknesses are.
This is incredibly valuable information, which allows you to improve your
website, as well as improve your digital marketing efforts.
• Analyse, understand, decide
• Analyse your web & mobile traffic. Understand user behavior. Boost your
business by making quick and effective decisions.
Benefits of Web Analytics
1. Measure online traffic
• How many users and visitors you have on your website at any given time.
• Where do they come from?
• What are they doing on the website?
• How much time are they spending on the website?
2. Tracking Bounce Rate
• Bounce Rate in analytics means that a user who has visited the website leaves
without interacting with it.
• A high bounce rate indicates a weak user experience.
• When a high bounce rate occurs on a website, it’s hard to expect a website to
produce quality leads or sales.
3. Finding the Right Target Audience
4. Improves and Optimizes Website and Web Services
Web analytics parameters
• Web traffic sources
• Total number of visits or sessions
• New/returning visitors
• Online conversion rate is the percentage of all user sessions in which
a website visitor takes a desired action on your website like subscribing to a
newsletter, signing up for a membership etc. By analyzing your online
conversion rate, you can see how effective your internet marketing strategy
is.
• Value per visit - This parameter shows how much value you get out of each
website visit.
• Top pages – most popular pages in your website
• Interactions per visit
• Exit pages -This parameter shows which pages your visitors leave your site
from.
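A minimal sketch of computing two of these parameters from hypothetical session records (real analytics platforms compute them from tracking data):

```python
# Minimal sketch: bounce rate and conversion rate from made-up session records.
sessions = [
    {"pages_viewed": 1, "converted": False},   # bounced
    {"pages_viewed": 4, "converted": True},
    {"pages_viewed": 2, "converted": False},
]

bounce_rate = sum(s["pages_viewed"] == 1 for s in sessions) / len(sessions)
conversion_rate = sum(s["converted"] for s in sessions) / len(sessions)

print(f"Bounce rate: {bounce_rate:.0%}, Conversion rate: {conversion_rate:.0%}")
```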
Big Data & New school of Marketing
Old-school marketing is fading and a new school is on the rise, because consumers have changed.
Customer & Marketers Changed
• Previously Television & Direct mail – Major marketing channels
• Now – Internet changed the way people sell and buy things.
• Customer is king – Have more options & expectations. Uses more interactive
channels.
Right Approach
• So Marketers need to send the Right Message to the Right People at the Right
Time through Right channel
• Marketers need to find many interaction points.
• Cross channel Marketing - is the strategy of using multiple channels to reach
consumers. This could include email marketing, social media, television, online
video, podcast ads, and any number of other marketing channels.
Cross Channel Life cycle Marketing
• A new approach called “Cross-Channel Lifecycle Marketing,” guided
by a set of strategic “loops.”
• The approach is integrated, customer-driven, and help marketers be
ready with the right marketing messages.
• Marketers always have to be “listening” – collecting actionable
customer data – so they know which loop a customer is in at any
given time and can respond accordingly with the best marketing.
• Loops: conversion, repurchase, stickiness, win-back, and re-
permission.
• Acquisition Phase: Cross-Channel Lifecycle Marketing really starts
with the capture of customer permission, contact information, and
preferences for multiple channels to send marketing messages
through mail or call. Now marketers categorize the contacted
customers as Interested and Not interested customers.
• Conversion Loop
Permission capture creates an immediate opportunity to recapture the
97 to 99 percent of people who were not convinced by generic
acquisition-marketing tactics.
• Repurchase Loop
Any single purchase creates more immediate and recurring
opportunities to influence additional purchases – without spending
more dollars on acquisition.
• Stickiness Loop
Even if, after marketing, a customer is not interested in buying the products,
keeping them brand-engaged is paramount. In the "stickiness" loop, ask
customers to participate in polls and surveys or provide ratings or reviews for
products and services. Identify brand enthusiasts and enable them to easily
share content via multiple channels. These types of "sticky" programs lead to
increased brand affinity.
• Win-back Loop
If interactions with a customer over a certain period of time haven’t resulted in
conversion or purchase, move the customer to the “win-back” loop. Here, discount,
free-shipping, bonus, limited-time, or other special-incentive promotions are
proven tactics used to accelerate conversion.
• Re-permission Loop
If a customer has been inactive for an extensive period of time – hasn’t clicked on
links in emails, taken advantage of special offers, or responded to campaigns in any
way – we recommend asking again for permission to continue communication.
Re-permission campaigns should be sent via multiple channels to ensure response.
Fraud & Big data Analytics
• Big data fraud detection is a cutting-edge way to use consumer trends to detect and
prevent suspicious activity.
• Fraud is wrongful or criminal activity carried out for economic or personal benefit.
• Fraud detection is finding actual or expected fraud that takes place in an
organization. Analyzing crimes related to fraudulent activities is difficult, and
traditional data mining techniques fail to address all of them.
• Big data analytics is used to identify an unusual pattern to detect and prevent fraud .
Various predictive analytics tools are used to handle massive data and their pattern.
• Today, fraud is prevalent across almost all industries and has become more
complicated. Enterprises are constantly struggling to implement effective and efficient
fraud detection systems. Businesses lose around 5% of revenue to fraud every year.
• Increase in transactional channels (online, mobile, etc.) , there is a pressing need for
real-time fraud detection solutions that are able to detect patterns over multiple
channels.
• One of the most challenging aspects of Big Data analytics is real-time monitoring of
data.
Effective data analysis requires:
• Translating knowledge of organization and common fraud indicators
into analytics tests
• Effectively using technological tools
• Resolving errors in data output due to incorrect logic or scripts
• Applying fraud investigation skills to the data analysis results in order
to detect potential instances of fraud
Fraud detection…
• Data to be analysed – Structured & Unstructured Data
• Types of data to be analysed- 3 V’s
• Population Analytics - Although testing a sample of data is a valid
audit approach, it is not as effective for fraud detection purposes.
To detect fraud, data analysis techniques must be performed
on the full data population. Need to define population boundaries,
including amount of historical data to include.
Use cases
• Fraud is everywhere — wherever a transaction, especially an online one, is
involved.
• Credit card fraud is probably the best-known case, ranging from stealing or using
stolen cards to aggressive forms such as account takeover.
• Insurance frauds: Some estimates suggest that as much as 10% of
health insurance claims can be attributed to fraud.
• Retail store frauds, Telecom frauds, Real estate frauds, Banking frauds,
health care frauds, e- auction..
There is no specific tool. The nature of the problem is different in every
case and every industry. Therefore every solution is carefully tailored
within the domain of each industry.
Big data analytics solution
• Outlier detection techniques. Outlier detection tools have their own way
of tackling the problem, such as time series analysis, cluster analysis,
real-time monitoring of transactions, etc. (see the sketch after this list).
• Descriptive statistical methods such as mean/median, standard deviation,
etc.
• SNA - The study of networks of social relationships, typically to extract
useful information, such as patterns and anomalies.
• SNA is also used for fraud detection. Fraud is often organized by groups of
people loosely connected to each other. Such a network mapping will
enable financial institutions to identify customers who may have relations
to individuals or organizations on their criminal watchlist (network) and
take precautionary measures.
• Sentimental analysis, Web analytics etc.
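A minimal sketch of one such outlier test, a simple z-score check on transaction amounts; the threshold and data are illustrative, and real systems combine many signals:

```python
# Minimal sketch of a simple outlier test (z-score on transaction amounts).
import numpy as np

amounts = np.array([25.0, 30.0, 27.5, 29.0, 26.0, 950.0])   # one suspicious value

z_scores = (amounts - amounts.mean()) / amounts.std()
suspicious = amounts[np.abs(z_scores) > 2]      # flag values far from the mean

print(suspicious)                               # [950.]
```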
Big data techniques - Fraud detection
• Apache Hadoop. Apache Hadoop, the popular data storage and analysis platform, is an open-
source software framework for distributed storage of very large datasets. Besides storage, it
supports sophisticated analysis of that data easily and quickly.
• Online businesses that are vulnerable to fraud and theft use Hadoop to monitor and fight
criminal behavior.
• Hadoop is a powerful platform for dealing with fraudulent and criminal activity like this. It is
flexible enough to store all of the data: message content, relationships among people and
computers, patterns of activity, etc.
• MapReduce is a programming paradigm for processing large datasets in distributed
environments (a toy sketch of the idea follows this list).
• MapReduce takes large datasets, extracts and transforms useful data, distributes the data to the
various servers where processing occurs, and assembles the results into a smaller, easier-to-
analyze file. MapReduce is also used as a feasible technique to detect credit card frauds
effectively.
• Apache Spark is an open-source cluster computing system that can be programmed quickly and
runs fast. It is an alternative to Hadoop MapReduce. Combining Spark with HDFS provides
opportunities to solve credit card fraud using big data analytics.
• Apache Flink is an open-source stream processing framework developed by the Apache Software
Foundation that is also suitable for fraud detection.
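A toy sketch of the MapReduce idea in plain Python (word count over two records); the real frameworks distribute the map and reduce phases across a cluster:

```python
# Toy sketch of the MapReduce idea: word count in a single process.
from collections import defaultdict

def map_phase(record):
    # emit (key, value) pairs for every word in one input record
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    return key, sum(values)

records = ["card declined card", "card ok"]
grouped = defaultdict(list)
for record in records:                      # map + shuffle/group by key
    for key, value in map_phase(record):
        grouped[key].append(value)

print([reduce_phase(k, v) for k, v in grouped.items()])
# [('card', 3), ('declined', 1), ('ok', 1)]
```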
Big Data & Risk Management
Risk management is the process of identifying, assessing and controlling threats to an organization.
Big data helps to identify and forecast risks that can harm your business. It can also detect patterns that indicate a potential
threat to your business.

Risk Management Applications of Big Data


Vendor Risk Management (VRM)
Third-party relationships can produce regulatory, reputational, and operational risk nightmares. VRM allows you to select
vendors, assess the severity of risks, and establish internal controls to mitigate the risk.
Fraud and Money Laundering Prevention
Predictive analytics supply an accurate and detailed method to prevent and minimize fraudulent or suspicious activity.
Identifying Churn
A significant risk to organizations is churn; the loss of customers deeply affects the business.
Customer loyalty can be assessed using big data as a risk management tool. Based on the data, companies can expedite
measures to decrease churn and prevent customer defections.
Credit Risk
Risk in credit management can be mitigated by analyzing data pertaining to recent and historical spending, as well as
repayment patterns.
Operational Risk in Manufacturing Sectors
Big data can supply metrics that assess supplier quality levels and dependability. Internally, costly defects in production can
be detected early using sensor technology data analytics.
Market Risk - Market risk is defined as any shift in the valuation of a portfolio due to a change in any of four market
parameters: interest rates, foreign exchange, commodity prices and equity. Should we sell this product now?
Big Data & Credit Risk
Credit Cards
Credit risk assessment is key to the success of financial companies. They
need to evaluate the creditworthiness of individuals as well as
the corporations to whom they provide credit.
Credit risk assessment uses Big data for the
• Analysis of the credit history of the applicant
• Evaluating the capacity to pay back the borrowed capital
• Putting into perspective the amount of capital to be borrowed
• Taking into account governmental and organizational regulations
• The worth of collaterals, if any.
Big data & Algorithmic trading
• Algorithmic trading is the use of computer programs for entering trading
orders, in which the computer program decides on almost every aspect of the
order, including the timing, price, and quantity.
Role of Big Data in Algorithmic Trading
1.Technical Analysis : Technical Analysis is the study of prices and price
behavior, using charts as the primary tool.
2. Real Time Analysis : The automated process enables computer to execute
financial trades at speeds and frequencies that a human trader cannot.
3. Machine Learning : With Machine Learning, algorithms are constantly fed
data and actually get smarter over time by learning from past mistakes,
logically deducing new conclusions based on past results and creating new
techniques that make sense based on thousands of unique factors.
Big data techniques in Algorithmic trading
• Various techniques are used in trading strategies to extract actionable
information from the data, including rules, fuzzy rules, statistical methods,
time series analysis, machine learning, as well as text mining.
• Uses Technical Analysis and Decision Rules
• Uses of Statistics
• Artificial Intelligence, Machine Learning
• Text Mining
Sentiment analysis - Algorithms scrape the language millions of people use
on Twitter and in Google searches, determining whether people are thinking
positively or negatively about a company or product.
These algorithms also watch the sentiment of real-time news coverage. Bad
press can trigger machines to sell stocks off in a flurry.
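An illustrative-only sketch of keyword-based sentiment scoring driving a trade signal; the lexicon, headlines, and threshold are made up, and production systems use far richer NLP models and market data:

```python
# Illustrative-only sketch: naive keyword-based sentiment over news headlines
# driving a buy/sell/hold signal (lexicon and threshold are made up).
POSITIVE = {"beats", "record", "growth", "upgrade"}
NEGATIVE = {"fraud", "lawsuit", "recall", "downgrade"}

def sentiment(headline: str) -> int:
    words = set(headline.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

headlines = ["Company beats earnings, record growth", "Regulator opens fraud lawsuit"]
score = sum(sentiment(h) for h in headlines)

signal = "buy" if score > 0 else "sell" if score < 0 else "hold"
print(score, signal)   # 1 buy  (3 positive hits minus 2 negative)
```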
Big Data & Health Care
Diversity of Big data in health care industry.

1.Web and social media data - such as interaction data
from Facebook, Twitter, LinkedIn, blogs, health plan websites,
and smartphone apps
2.Machine-to-machine data - such as information from sensors,
meters, and other devices.
3.Transaction data - such as healthcare claims and billing
records in both semi-structured and unstructured formats
4.Biometric data - such as fingerprints, genetics, retinal scans, X-
rays and other medical images
5.Human-generated data - such as Electronic Medical Records
(EMRs), physicians’ notes, email, and paper documents
Health care Data - Healthcare Big Data Lakes Become “Oceans”

Currently three characteristics
distinguish Health care data:
• It is available in extraordinarily
high volume
• It moves at high velocity and
spans the health industry’s
massive digital universe
• It is derived from many sources,
it is highly variable in structure
and nature.
This is known as the 3Vs of Big
Data.
• The application of big data analytics in healthcare has a lot of
positive and also life-saving outcomes.
• Applied to healthcare, it will use specific health data of a population
(or of a particular individual) and potentially help to prevent
epidemics, cure disease, cut down costs, etc.
• Data-driven treatment models
• Dashboards for hospital management and patients.
Big data & Medicine
• Experts believe that big data is going to increase the efficacy of
personal medicines significantly. A number of initiatives are under
way to find out ways to improve the effectiveness of personal
medicines.
• Precision Medicine - This also enables precision medicine, where
diagnosis and treatment of disorders are carried out using relevant data
about a patient’s genetic make-up, behavioral patterns, etc. With this
approach, pharma companies can develop personalized medicine that is
suitable for an individual patient’s genes and current lifestyle.
• Clinical Trials: With the help of big data, pharma companies can recruit the
right patients for clinical trials, using data such as genetic information,
personality traits, and disease status, which will, in turn, increase the
success rate of the drug.
• Drug discovery - Big data analytics helps the pharmaceutical industry in drug
discovery. Predictive modelling enables researchers to predict drug
interactions, toxicity, and inhibition and thus speeds up the whole process.
• Controlling Drug Reaction - Predictive modeling. Data Analytics on
social media platforms and medical forums are performed along with
sentiment analysis to gain insight into adverse drug reactions (ADRs).
Big Data & Advertising
• Big Data is changing the way advertisers address three related needs.
• (i) How much to spend on advertisements.
• (ii) How to allocate amount across all the marketing communication touch points.
• (iii) How to optimize advertising effectiveness. Given these needs, advertisers need to measure their
advertising end to end in terms of Reach, Resonance & Reaction.
• Reach, Resonance, and Reaction
• Reach: First part of reach is to identify the people who are most volumetrically responsive to their
advertising and then answer questions such as what do those people watch? What do they do online? How
to develop media plan against intended audience. The second part of reach is delivering advertisements to
the right audience.
• Resonance: If we know whom we want to reach and we're reaching them efficiently with our media spend,
the next question is, are our ads breaking through? Do people know they're from our brand? Are they
changing attitudes? Are they making consumers more likely to want to buy our brand? This is what is called
"resonance".
• Reaction: Advertising must drive a behavioural "reaction" or it isn't really working. We have to measure the
actual behavioural impact.
Big Data & Advertising: Big Data = Big Opportunities

• Ads featuring products and services we might actually want and use to better our lives.
The predictive analytics marketing used in the advertising industry has bridged the
communication gap between advertisers and consumers.
• Personalized and targeted ads - And these more personalized and targeted ads are all
based on massive amounts of personal data we constantly provide about what we’re
doing, saying, liking, sharing.
• Hyper-localized advertising - ads to the right people at the right time through right
channels.
• Using Big Data to Optimize Advertising
The big promise of big data to advertising is improved accuracy of communication.
• Big Data and Branding- Branding campaigns frequently aim to improve brand image or
recognition. Socio demographics like age and gender determine the relevant segments.
Advertising is delivered only to its target group, driving down wastage significantly.
• Predict customer interest.
• Expanded Customer Acquisition & Retention
BIG DATA TECHNOLOGY
• Hadoop Parallel World
• Hadoop Distributed File System(HDFS)
• Map Reduce
• Old Vs New Approaches
• Data Discovery
• Open Source Technology for Big Data Analytics
• The cloud and Big Data
• Predictive Analytics
• Software as a Service BI
• Mobile Business Intelligence
• Crowdsourcing Analytics
• Inter and Trans Firewall Analytics
Hadoop
• Hadoop is
• An Apache project
• A distributed computing platform
• Hadoop is an open-source platform for
storage and processing of diverse data types
that enables data-driven enterprises to rapidly
derive value from all their data.

[Hadoop stack diagram: Cloud Applications → MapReduce →
Hadoop Distributed File System (HDFS) → A Cluster of Machines]
History (2002-2004)

• In 2003, Google released a
whitepaper called "The Google
File System."
• Subsequently, in 2004, Google
released another whitepaper
called “MapReduce: Simplified
Data Processing on Large
Clusters.”
