Unit - Big - Data

Uploaded by

sudhanshu8m
Big Data

Extremely large data sets

1
Syllabus
Unit-1

2
Evolution of Technology

3
IOT Devices (Sources of Big Data)

4
Social Media (Sources of Big Data)

5
What is Big Data?

6
What is Big Data?
•Big data is a term used to describe data that is too large and complex to store in traditional
databases or to process efficiently.
•Big Data is difficult to collect, store, maintain, analyze, and visualize.
•Big Data is a collection of data that is huge in volume and still growing exponentially with time.

7
What is Big Data?
•Such data is difficult to manage and process using traditional data processing tools.
•Big data is the term for a collection of data sets so large and complex that they become
difficult to process using on-hand database management tools or traditional data processing
applications.

8
Data
•Data consists of numbers, letters, or special symbols.
•Data is the quantities, characters, or symbols on which a computer performs operations, and which
may be stored and transmitted as electrical signals and recorded on magnetic, optical, or
mechanical recording media.
•Data growth has accelerated exponentially since the advent of the computer and the Internet.
•In fact, the computer and the Internet together have given data its digital form.
•Information is processed data.

9
Digital Data
•Digital data is information stored on a computer system as a series of 0s and 1s, that is, in
binary form.

Example: Whenever we send an email, read a social media post, or take pictures with our digital
camera, we are working with digital data.

10
Types of Digital Data
•Digital data can be classified into three forms:
1. Unstructured
2. Semi-structured
3. Structured

11
Types of Digital Data
[Figure: Digital Data branches into Unstructured, Semi-structured, and Structured.]


12
1. Un-structured Data
•Any data whose form or structure is unknown is classified as unstructured data.
•Data that does not conform to a data model, or is not in a form that a computer program can use
easily, is categorized as unstructured data.
•About 80-90% of an organization’s data is in this format.
Example: Images, videos, letters, research reports, white papers, the body of an email, memos, chat
rooms, PowerPoint presentations, etc.

13
1. Un-structured Data
•Unstructured data is data that doesn’t adhere to any definite schema or set of rules. Its
arrangement is unplanned and haphazard.
•Photos, videos, text documents, and log files can generally be considered unstructured data.
•Even though the metadata accompanying an image or a video may be semi-structured, the actual
data being dealt with is unstructured.
•Unstructured data is sometimes also called “dark data” because it cannot be analyzed
without the proper software tools.

14
1. Un-structured Data
•Unstructured data is information that either is not organized in a pre-defined manner or does not
have a pre-defined data model.
•Unstructured information is typically text-heavy but may also contain data such as numbers, dates, and
facts.
•Videos, audio, and binary data files might not have a specific structure. They are classified
as unstructured data.

15
2. Structured Data
•Structured data is generally tabular data that is represented by columns and rows in a database.
Databases that hold tables in this form are called relational databases.
•Any data that can be stored, accessed, and processed in a fixed format is termed
“structured data”.
•Structured data is easy to store, manage, and analyze, because all of the data follows the same format.
•Relationships exist between entities of data, such as classes and their objects.
Example: Data stored in databases.

16
2. Structured Data
•Structured data can be defined as data that resides in a fixed field within a record.
•It is the type of data most familiar from everyday life, for example a birthday or an address.
•A schema binds it, so all the data has the same set of properties.
•Structured data is also called relational data.
•It is split into multiple tables to enhance the integrity of the data by creating a single record to
depict an entity. Relationships are enforced by the application of table constraints.
•The business value of structured data lies in how well an organization can utilize its existing
systems and processes for analysis purposes.
•Structured Query Language (SQL) is the language used to query and manage structured data.

17
2. Structured Data
•Examples of structured data include numbers, dates, strings, etc. The business data of an
e-commerce website can be considered structured data.

Name    Class  Section  Roll No  Grade
Ajay    11     A        1        A
Vijay   11     A        2        B
Ramesh  11     A        3        A
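
To illustrate how a fixed schema makes structured data easy to query, the student table can be loaded into a relational database. This is a minimal sketch using Python's built-in sqlite3 module; the table name `students` and the column names are assumptions chosen for this example, not part of the original slides.

```python
import sqlite3

# In-memory relational database; column names mirror the example table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "name TEXT, class_level INTEGER, section TEXT, "
    "roll_no INTEGER PRIMARY KEY, grade TEXT)"
)
rows = [
    ("Ajay", 11, "A", 1, "A"),
    ("Vijay", 11, "A", 2, "B"),
    ("Ramesh", 11, "A", 3, "A"),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Every row follows the same schema, so one declarative SQL query is enough.
grade_a = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM students WHERE grade = 'A' ORDER BY roll_no"
    )
]
print(grade_a)  # ['Ajay', 'Ramesh']
```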

18
3. Semi-structured Data
•Semi-structured data is information that is not held as structured data (in a relational database)
but still has some structure to it.
•The data is not in the relational format and is not neatly organized into rows and columns like that
in a spreadsheet.
•However, there are some features, like key-value pairs, that help in discerning the different entities
from each other.

•Semi-structured data includes documents held in JavaScript Object Notation (JSON) format.

•It also includes data held in key-value stores and graph databases.
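
A short sketch of what "some structure but no rigid schema" can look like in practice. The two JSON records below are hypothetical and were made up for this illustration: they share the `name` key but otherwise carry different fields, which a relational table would not allow without schema changes.

```python
import json

# Two hypothetical records: they share some keys but not a fixed schema.
raw = """
[
  {"name": "Ajay", "email": "ajay@example.com", "tags": ["sports", "music"]},
  {"name": "Vijay", "phone": {"type": "mobile", "number": "555-0100"}}
]
"""
records = json.loads(raw)

# Keys label each field, so entities can still be told apart even though
# the records do not have the same set of properties.
all_keys = sorted({key for record in records for key in record})
emails = [record.get("email") for record in records]  # None where the key is absent

print(all_keys)  # ['email', 'name', 'phone', 'tags']
print(emails)    # ['ajay@example.com', None]
```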

19
3. Semi-structured Data
•Semi-structured data is information that does not reside in a relational database or any other data
table.
•Semi-structured data is not bound by any rigid schema for data storage and handling.

20
3. Semi-structured Data
•Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
•Data that does not conform to a data model but has some structure is categorized as
semi-structured data.
•However, it is not in a form that can be used easily by a computer program.

21
3. Semi-structured Data
•Semi-structured content is often used to store metadata about a business process but it can also
include files containing machine instructions for computer programs.
•This type of information typically comes from external sources such as social media platforms or
other web-based data feeds.
Example: Emails, XML, markup languages like HTML, etc. Metadata for this data is available but
is not sufficient.

22
Structure Vs Unstructured

23
Structured Vs Semi-Structured Vs Unstructured
Differences between Structured, Semi-structured and Unstructured data

Technology
• Structured data: Based on relational database tables.
• Semi-structured data: Based on XML/RDF (Resource Description Framework).
• Unstructured data: Based on character and binary data.

Transaction management
• Structured data: Mature transaction support and various concurrency techniques.
• Semi-structured data: Transactions are adapted from DBMS and are not mature.
• Unstructured data: No transaction management and no concurrency.

Version management
• Structured data: Versioning over tuples, rows, and tables.
• Semi-structured data: Versioning over tuples or graphs is possible.
• Unstructured data: Versioned as a whole.

Flexibility
• Structured data: Schema dependent and less flexible.
• Semi-structured data: More flexible than structured data but less flexible than unstructured data.
• Unstructured data: More flexible; there is no schema.

24
Structured Vs Semi-Structured Vs Unstructured
Differences between Structured, Semi-structured and Unstructured data

Scalability
• Structured data: Scaling the database schema is very difficult.
• Semi-structured data: Scaling is simpler than for structured data.
• Unstructured data: More scalable.

Robustness
• Structured data: Very robust.
• Semi-structured data: New technology, not very widespread.
• Unstructured data: (no entry)

Query performance
• Structured data: Structured queries allow complex joins.
• Semi-structured data: Queries over anonymous nodes are possible.
• Unstructured data: Only textual queries are possible.

25
History of Big Data
•The 21st century is characterized by rapid advancement in the field of information technology.
•IT has become an integral part of daily life and of industries such as health,
education, entertainment, science and technology, genetics, and business operations. These
industries generate a lot of data, which can be called Big Data.
•Big Data consists of large datasets that cannot be managed efficiently by common database
management systems.

26
History of Big Data
•These datasets range from terabytes to exabytes.
•Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and social
networking platforms create huge amounts of data that may reside unutilized on unknown servers
for many years.
•With the evolution of Big Data, this data can be accessed and analysed on a regular basis to
generate useful information.
•“Big Data” is a relative term that depends on who is discussing it. For example, Big Data to Amazon
or Google is very different from Big Data to a medium-sized insurance organization.

27
Introduction to Big Data Platform
•A big data platform is a type of IT solution that combines the features and capabilities of several
big data applications and utilities within a single solution, which is then used for managing and
analysing Big Data.
•It is an enterprise-class IT platform that enables an organization to develop, deploy, operate,
and manage a big data infrastructure environment.
•It focuses on providing its users with efficient analytics tools for massive datasets.
•The users of such platforms can custom-build applications for their use case, for example to
calculate customer loyalty (an e-commerce use case).

28
Introduction to Big Data Platform
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability, Performance,
and Security.
Example: Some of the most commonly used Big Data Platforms are:
• Hadoop Delta Lake Migration Platform
• Data Catalog Platform
• Data Ingestion Platform
• IOT Analytics Platform

29
Drivers for Big Data
Big Data has quickly risen to become one of the most desired topics in the industry.
The main business drivers for such rising demand for Big Data Analytics are:
1. The digitization of society
2. The drop in technology costs
3. Connectivity through cloud computing
4. Increased knowledge about data science
5. Social media applications
6. The rise of Internet-of-Things(IOT)
Example: A number of companies that have Big Data at the core of their strategy like:
Apple, Amazon, Facebook and Netflix have become very successful at the beginning of the 21st
30
Big Data Architecture
•Big data architecture is designed to handle the ingestion, processing, and analysis of data that is
too large or complex for traditional database systems.

31
Big Data Architecture
The big data architectures include the following components:
Data sources: All big data solutions start with one or more data sources.
Example
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is stored in a distributed file store that can hold
high volumes of large files in various formats (also called data lake).
Example: Azure Data Lake Store or blob containers in Azure Storage.

32
Big Data Architecture
Batch processing: Because the data sets are so large, a big data solution must often process data
files using long-running batch jobs to filter, aggregate, and prepare the data for analysis.

Real-time message ingestion: If a solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing.

Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and preparing the data for analysis. The processed stream data is then written
to an output sink. We can use open-source Apache streaming technologies like Storm and Spark
Streaming for this.
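
The filter, aggregate, and write-to-sink steps of stream processing can be sketched in plain Python. This is only an illustration of the idea, not how Storm or Spark Streaming are actually programmed; the message fields `sensor` and `value` and the function `process_stream` are hypothetical names invented for this sketch.

```python
from collections import defaultdict

def process_stream(messages, min_value=0.0):
    """Filter and aggregate messages one at a time, as they arrive."""
    totals = defaultdict(float)  # running sum per sensor
    for message in messages:
        if message["value"] < min_value:  # filter step: drop invalid readings
            continue
        totals[message["sensor"]] += message["value"]  # aggregate step
        yield dict(totals)  # emit a snapshot of the running totals

readings = [
    {"sensor": "s1", "value": 2.0},
    {"sensor": "s2", "value": -1.0},  # filtered out
    {"sensor": "s1", "value": 3.0},
]
sink = list(process_stream(readings))  # a list stands in for the output sink
print(sink[-1])  # {'s1': 5.0}
```

Unlike a batch job, the generator never needs the whole data set at once: each message updates the running state and a result is emitted immediately.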
33
Big Data Architecture
Analytical data store: Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. Example: Azure
Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing.

Analysis and reporting: The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyse the data, the architecture may include
a data modelling layer. Analysis and reporting can also take the form of interactive data exploration
by data scientists or data analysts.

34
Big Data Architecture
Orchestration: Most big data solutions consist of repeated data processing operations that
transform source data, move data between multiple sources and sinks, load the processed data into
an analytical data store, or push the results straight to a report. To automate these workflows, we
can use an orchestration technology such as Azure Data Factory.

35
Characteristics of Big-Data
Also known as the 5 Vs of Big Data
Volume: The large amount of data.
Velocity: The rate at which data is generated.
Variety: The different types of data.
Value: The meaningful data obtained by processing raw data.
Veracity: The consistency and trustworthiness of the data generated.

36
Characteristics of Big-Data
1. Volume:
• An incredible amount of data.
• The amount of data generated every second.
• Volume is the factor that determines whether data is big data or not.
• Big data technologies can handle large amounts of data.
• Big Data is the vast volume of data generated daily from many sources, such as business
processes, machines, social media platforms, networks, human interactions, and so on.
Example: Data from cell phones, social media, online transactions, etc.
Each day, Facebook generates approximately a billion messages, records 4.5 billion clicks of the
“Like” button, and receives more than 350 million new posts.
37
Characteristics of Big-Data
2. Velocity:
• Speed at which data is:
   • Generated
   • Collected
   • Analyzed
• Velocity refers to the speed or rate at which data is generated in real time.
• Velocity plays an important role compared to the other characteristics.
• It covers the speed of incoming data sets, their rate of change, and bursts of activity.
• A primary aim of Big Data is to make in-demand data available rapidly.
Example of data generated with high velocity: Twitter messages or Facebook posts.
38
Characteristics of Big-Data
3. Variety:
• Big Data can be structured, semi-structured, or unstructured, and is
collected from different sources.
• About 80% of it is unstructured data.
• In the past, data was collected only from databases and spreadsheets, but these days data
comes in an array of forms, e.g. PDFs, emails, audio, social media posts, photos, videos, etc.
• Types of data are:
1. Structured
2. Semi-Structured
3. Un-Structured
39
Characteristics of Big-Data
4. Value:
• Raw data is processed to obtain meaningful data.
• Value is an essential characteristic of big data.
• What matters is not simply the data that we process or store, but the valuable and reliable
data that we store, process, and analyse.

40
Characteristics of Big-Data
5. Veracity
• Trustworthiness of data.
• Veracity refers to the quality of the data that is being analysed.
• It also concerns being able to handle and manage data efficiently.
Example: Facebook posts with hashtags.

41
Characteristics of Big-Data
9 Vs of Big Data
1. Volume
2. Velocity
3. Variety
4. Value
5. Veracity
6. Validity
7. Variability
8. Volatility
9. Visualization

42
Big Data Technology Components

1. Ingestion:
• The ingestion layer is the very first step: pulling in raw data.
• The data comes from internal sources, relational databases, non-relational databases, social media,
emails, phone calls, etc.
• There are two kinds of ingestion:
Batch, in which large groups of data are gathered and delivered together.
Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.
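
The difference between the two kinds of ingestion can be shown with a small sketch. The helper names `batches` and `stream`, the batch size, and the event labels are all assumptions made for this illustration; real ingestion tools add buffering, retries, and persistence on top of the same idea.

```python
from itertools import islice

def batches(records, size):
    """Batch ingestion: gather records into groups and deliver them together."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def stream(records):
    """Streaming ingestion: deliver each record as soon as it is available."""
    for record in records:
        yield record

events = ["e1", "e2", "e3", "e4", "e5"]
batched = list(batches(events, size=2))
streamed = list(stream(events))

print(batched)   # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
print(streamed)  # ['e1', 'e2', 'e3', 'e4', 'e5']
```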
43
Big Data Technology Components
2. Storage:
• Storage is where the converted data is stored in a data lake or warehouse and eventually
processed.
• The data lake/warehouse is the most essential component of a big data ecosystem.
• It needs to contain only thorough, relevant data to make insights as valuable as possible.
• It must be efficient with as little redundancy as possible to allow for quicker processing.

44
Big Data Technology Components
3. Analysis:
• In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
• There are four types of analytics on big data:
Diagnostic: Explains why a problem is happening.
Descriptive: Describes the current state of a business through historical data.
Predictive: Projects future results based on historical data.
Prescriptive: Takes predictive analytics a step further by projecting best future efforts.
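
As a rough illustration of the difference between descriptive and predictive analytics, the sketch below summarizes a short sales history and then extrapolates it with a least-squares straight line. The sales figures are invented for this example, and a linear trend is only one simple choice of predictive model.

```python
from statistics import mean

monthly_sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical historical data

# Descriptive: summarize what has already happened.
average_sales = mean(monthly_sales)

# Predictive: fit a least-squares line to the history and extrapolate one month.
months = list(range(len(monthly_sales)))
x_bar, y_bar = mean(months), mean(monthly_sales)
slope = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(months, monthly_sales)
) / sum((x - x_bar) ** 2 for x in months)
intercept = y_bar - slope * x_bar
next_month_forecast = intercept + slope * len(monthly_sales)

print(average_sales)        # 115.0
print(next_month_forecast)  # 140.0
```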

45
Big Data Technology Components
4. Consumption:
• The final big data component is presenting the information in a format digestible to the end-user.
• This can be in the form of tables, advanced visualizations, and even single numbers if requested.
• The most important thing in this layer is making sure the intent and meaning of the output are
understandable.

46
Sources of Big-Data
•E- Commerce Website

•Social Media

•IOT devices

•Banks

•Stock Market

47
Uses of Big-Data
•Prediction systems

•Recommendation Engine

•Fraud Detection

•Sentiment Analysis, etc.

48
Big Data importance
•Cost Saving
•Time Reductions
•Understand the Market condition
•Social Media Listening
•Provide better customer service
•Solve advertisers’ problems
•Create personalized marketing, which can increase revenue and profits
•Fast decision making
•Predicting future needs

49
Big Data importance
Big Data Importance:
The importance of Big Data doesn’t revolve around the amount of data a company has but lies in
how the company utilizes the gathered data.
Every company uses its collected data in its own way. The more effectively a company uses its data,
the more rapidly it grows.
By analysing big data pools effectively, companies can obtain benefits such as:
Cost Savings:
• Some tools of Big Data like Hadoop can bring cost advantages to business when large
amounts of data are to be stored.
• These tools help in identifying more efficient ways of doing business.
50
Big Data importance
Big Data Importance:
Time Reductions:
• The high speed of tools like Hadoop and in-memory analytics makes it easy to identify new
sources of data, which helps businesses analyse data immediately.
• This helps us to make quick decisions based on the learnings.
Understand the market conditions:

• By analysing big data we can get a better understanding of current market conditions.
For example, by analysing customers’ purchasing behaviours, a company can find out which
products sell the most and produce products according to this trend. In this way, it can get
ahead of its competitors.
51
Big Data importance
Big Data Importance:
Control online reputation:
• Big data tools can do sentiment analysis.
• Therefore, you can get feedback about who is saying what about your company.
• If you want to monitor and improve the online presence of your business, then big data tools
can help in all this.
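
To give a feel for what sentiment analysis does, here is a deliberately tiny lexicon-based sketch. The word lists, the scoring rule, and the sample posts are all assumptions made for this illustration; production tools use far larger lexicons or trained models.

```python
# Toy sentiment lexicons (hypothetical, chosen only for this example).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    """Return (# positive words) - (# negative words) for one post."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "Great service and excellent support",
    "Terrible delivery and poor packaging",
    "Order arrived today",
]
scores = [sentiment_score(post) for post in posts]
print(scores)  # [2, -2, 0]
```

A positive score suggests favourable feedback, a negative score unfavourable feedback, and zero a neutral post.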

52
Big Data importance
Big Data Importance:
Using Big Data Analytics to Boost Customer Acquisition(purchase) and Retention:

• The customer is the most important asset any business depends on.
• No single business can claim success without first having to establish a solid customer base.
• If a business is slow to learn what customers are looking for, then it is very likely to deliver
poor quality products.
• The use of big data allows businesses to observe various customer-related patterns and
trends.

53
Big Data importance
Big Data Importance:
Using Big Data Analytics to Solve Advertisers’ Problems and Offer Marketing Insights:

• Big data analytics can help change all business operations.
• This includes the ability to match customer expectations, change the company’s product line, etc.
• It also helps ensure that marketing campaigns are powerful.

54
Application or Uses of Big-Data
•In today’s world, big data has several applications. Some of them are listed below:
1. Tracking Customer Spending Habit, Shopping Behaviour
2. Recommendation
3. Smart Traffic System
4. Secure Air Traffic System
5. Auto Driving Car
6. Virtual Personal Assistant Tool
7. IOT
8. Education Sector Energy Sector
9. Media and Entertainment Sector
55
Application or Uses of Big-Data
1. Tracking Customer Spending Habit, Shopping Behaviour:
In big retail stores, the management team has to keep data on customers’ spending habits, shopping
behaviour, most-liked products, and which products are searched for or sold the most. Based on that
data, the production/collection rate of each product is fixed.

56
Application or Uses of Big-Data
2. Recommendation:
By tracking customer spending habits, shopping behaviour, big retail stores provide
recommendations to the customers.
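
One simple way such recommendations can be derived from purchase data is to count which items are bought together. The sketch below is an assumption-laden illustration (the baskets and the function `recommend` are hypothetical), not a description of any particular retailer's system.

```python
from collections import Counter

# Hypothetical purchase baskets, one set of items per customer order.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

def recommend(item, baskets, top_n=2):
    """Rank other items by how often they were bought together with `item`."""
    co_purchases = Counter()
    for basket in baskets:
        if item in basket:
            co_purchases.update(basket - {item})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(co_purchases, key=lambda other: (-co_purchases[other], other))
    return ranked[:top_n]

print(recommend("phone", baskets))  # ['case', 'charger']
```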

57
Application or Uses of Big-Data
3. Smart Traffic System:
Data about traffic conditions on different roads is collected through cameras and GPS devices
placed in vehicles.
All such data is analysed, and jam-free or less congested, less time-consuming routes are recommended.
One more benefit is that fuel consumption can be reduced.

58
Application or Uses of Big-Data
4. Secure Air Traffic System:
Sensors are present at various places on an aircraft.
These sensors capture data like the speed of the flight, moisture, temperature, and other environmental
conditions.

59
Application or Uses of Big-Data
4. Secure Air Traffic System:
Based on analysis of such data, environmental parameters within the aircraft are set and varied.
By analysing the aircraft’s machine-generated data, it can be estimated how long a machine can operate
flawlessly and when it should be replaced or repaired.

60
Application or Uses of Big-Data
5. Auto Driving Car:
Cameras and sensors are placed at various spots on the car to gather data like the size of
surrounding cars, obstacles, the distance from them, etc.
This data is analysed, and then various calculations are carried out.
These calculations help the car take action automatically.

61
Application or Uses of Big-Data
6. Virtual Personal Assistant Tool:
Big data analysis helps virtual personal assistant tools like Siri, Cortana and Google Assistant to
provide the answer to the various questions asked by users.
Such a tool tracks the location of the user, their local time, the season, other data related to the
questions asked, etc.
Analysing all such data provides an answer.

62
Application or Uses of Big-Data
6. Virtual Personal Assistant Tool:
Example: Suppose a user asks, “Do I need to take an umbrella?” The tool collects data like the location
of the user and the season and weather conditions at that location, then analyses this data to conclude
whether there is a chance of rain, and provides the answer.

63
Application or Uses of Big-Data
7. IOT :
Manufacturing companies install IoT sensors in machines to collect operational data.
By analysing such data, it can be predicted how long a machine will work without any problem and when
it will require repair.
Thus, the cost of replacing the whole machine can be saved.

64
Application or Uses of Big-Data
8. Education Sector Energy Sector:
Organizations that conduct online educational courses utilize big data to find candidates interested
in their courses.
If someone searches for a YouTube tutorial video on a subject, then an online or offline course
provider for that subject sends that person an online ad about their course.

65
Application or Uses of Big-Data
9. Media and Entertainment Sector:
Media and entertainment service providers like Netflix, Amazon Prime, and Spotify
analyse data collected from their users.
Data such as what types of video and music users watch and listen to most, and how long users
spend on the site, is collected and analysed to set the next business strategy.

66
Big Data features
Big Data features –security, compliance, auditing and protection
BIG DATA SECURITY :

• Big data security is the collective term for all the measures and tools used to guard both the
data and analytics processes from attacks, theft, or other malicious activities that could harm or
negatively affect them.

• For companies that operate on the cloud, big data security challenges are multi-faceted.

• When customers give their personal information to companies, they trust them with personal
data which can be used against them if it falls into the wrong hands.

67
Big Data features
Big Data features –security, compliance, auditing and protection
BIG DATA COMPLIANCE :

• Data compliance is the practice of ensuring that sensitive data is organized and managed in such
a way as to enable organizations to meet enterprise business rules along with legal and
governmental regulations.

• Organizations that don’t implement these regulations can be fined up to tens of millions of
dollars and even receive a 20-year penalty.

68
Big Data features
Big Data features –security, compliance, auditing and protection
BIG DATA AUDITING :

• Auditors can use big data to expand the scope of their projects and draw comparisons over larger
populations of data.

• Big data also helps financial auditors to streamline the reporting process and detect fraud.

• These professionals can identify business risks in time and conduct more relevant and accurate
audits.

69
Big Data features
Big Data features –security, compliance, auditing and protection
BIG DATA PROTECTION :

• Big data security is the collective term for all the measures and tools used to guard both the data
and analytics processes from attacks, theft, or other malicious activities that could harm or
negatively affect them. That’s why data privacy is there to protect those customers but also
companies and their employees from security breaches.

• When customers give their personal information to companies, they trust them with personal
data which can be used against them if it falls into the wrong hands.

• Data protection is also important as organizations that don’t implement these regulations can be
70
Features of Big-Data
•It should support a variety of data formats
•It should provide data analysis and reporting tools
•It should provide real-time data analysis software
•It should have tools for searching through large data sets
•It should have the capability for rapid development

71
Big Data privacy and ethics
• Privacy is a state in which one is not observed or disturbed by other people.
• Privacy is the ability of an individual or group to keep themselves (or information related
to them) away from other groups or individuals.
• Digital Privacy refers to the protection of an individual’s information that is used or created
while using the internet on a computer or personal device.

72
Big Data privacy and ethics
• Most data is collected through surveys, interviews, or observation.

• When customers give their personal information to companies, they trust them with personal data which can
be used against them if it falls into the wrong hands.

• That’s why data privacy is there to protect those customers but also companies and their employees
from security breaches.

• One of the main reasons why companies comply with data privacy regulations is to avoid fines.

• Organizations that don’t implement these regulations can be fined up to tens of millions of dollars and even
receive a 20-year penalty.

73
Big Data privacy and ethics
• Reasons, why we need to take data privacy seriously, are :
• Data breaches could hurt your business.
• Protecting your customers’ privacy
• Maintaining and improving brand value
• It gives you a competitive advantage
• It supports the code of ethics

74
Big Data Analytics
•Big Data Analytics is the process of collecting large chunks of structured/unstructured data,
segregating and analysing it, and discovering patterns and other useful business insights from it.
•Big data analytics is a complex process of examining big data to uncover information such as
hidden patterns, correlations, market trends, and customer preferences.
•This can help organizations make informed business decisions, for example in risk management.
•Data Analytics technologies and techniques give organizations a way to analyse data sets and
gather new information.
•Big Data Analytics enables enterprises to analyse their data in full context quickly, and some tools also
offer real-time analysis.

75
Big Data Analytics
Importance of Big Data Analytics :
• Organizations use big data analytics systems and software to make data-driven decisions that can
improve business-related outcomes.
• The benefits include more effective marketing, new revenue opportunities, customer
personalization and improved operational efficiency.
• With an effective strategy, these benefits can provide competitive advantages over rivals.
• Big Data Analytics tools also help businesses save time and money and aid in gaining insights to
inform data-driven decisions.
• Big Data Analytics enables enterprises to narrow their Big Data to the most relevant information
and analyse it to inform critical business decisions.
76
Challenges of Big-Data Analytics
•Complex to Store and Manage
•Complex to analyse
•Integrating data from a variety of sources
•Low Quality and Inaccurate Data
•Hardware failure
•Searching
•Sharing
•Transfer
•Presentation or visualization

77
Challenges of Conventional Systems
•Big data involves the storage and analysis of large data sets.
•They are so large that it is not possible to work on them with traditional analytical tools.
•These are complex data sets that can be structured or unstructured.
•One of the major challenges of conventional systems is the uncertainty of the data management
landscape.
•Big data is continuously expanding, and new companies and technologies are being
developed every day.

78
Challenges of Conventional Systems
•A big challenge for companies is to find out which technology works best for them without
introducing new risks and problems.
•These days, organizations are realising the value they get out of big data analytics and hence they
are deploying big data tools and processes to bring more efficiency in their work environment.

79
Intelligent data analysis
•Intelligent Data Analysis (IDA) is one of the most important approaches in the field of big data.
It discloses hidden facts that were not previously known and provides potentially important
information from large quantities of data.
•It also helps in making decisions.
•Based on the basic principles of IDA and the features of datasets that IDA handles, the
development of IDA is briefly summarized from three aspects :
• Algorithm principle
• The scale
• Type of the dataset

• IDA is one of the major issues in artificial intelligence and information.

80
Intelligent data analysis
•Based on machine learning, artificial intelligence, pattern recognition, and records and
visualization technology, IDA helps to obtain useful information, necessary data, and interesting
models from the large amount of data available online in order to make the right choices.
•IDA includes three stages:
(1) Preparation of data
(2) Data mining
(3) Data validation and Explanation

81
Nature of Data
Data
• Data are known facts or things used as a basis for inference or reckoning.
• Data can be categorized in two distinct ways:
1. Categorical : qualitative
▪ Nominal

▪ Binary

▪ Ordinal

2. Numerical : quantitative
▪ Interval-scaled

▪ Ratio-scaled
Big Data (KCS-061) Ratish Srivastava 83
83
Categorical Attribute Types
▪ Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation
▪ Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
o e.g., gender
• Asymmetric binary: outcomes not equally important
o e.g., medical test (positive vs. negative)
o Convention: assign 1 to the most important outcome (e.g., COVID positive)
▪ Ordinal
• Values have a meaningful order (ranking), but the magnitude of the difference between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings
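
The difference between the three categorical types shows up in which operations make sense on them. The following is a minimal sketch with made-up values; the variable names and the `SIZE_RANK` mapping are assumptions introduced only for illustration.

```python
# Nominal: values are labels, so only equality comparisons are meaningful.
hair_colors = ["black", "brown", "black", "red"]
same_color = hair_colors[0] == hair_colors[2]

# Asymmetric binary: code the more important outcome (a positive test) as 1.
test_results = ["negative", "positive", "negative"]
encoded = [1 if r == "positive" else 0 for r in test_results]

# Ordinal: values can be ranked, but the gap between ranks is undefined,
# so the integers below express order only, not distance.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}
sizes = ["large", "small", "medium"]
sorted_sizes = sorted(sizes, key=SIZE_RANK.get)

print(same_color, encoded, sorted_sizes)
```

Sorting is valid for the ordinal attribute, whereas averaging the rank numbers would not be, because "medium" is not known to lie exactly halfway between "small" and "large".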
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
▪ Measured on a scale of equal-sized units
▪ Values have order
o e.g., temperature in °C or °F, calendar dates
▪ No true zero-point
◼ Ratio
▪ Inherent zero-point
▪ Values can be expressed as multiples of one another (10 kg is twice as heavy as 5 kg).
o e.g., length, counts, monetary quantities
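
A short numeric check illustrates why the missing zero-point matters. The sketch below uses the standard Celsius-to-Fahrenheit formula; the sample temperatures and weights, and the approximate conversion factor `KG_TO_LB`, are chosen only for illustration.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between the two interval scales."""
    return c * 9 / 5 + 32

# Interval scale: the ratio of two temperatures depends on the unit,
# so a statement like "20 degrees is twice as hot as 10 degrees" is not meaningful.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Differences, however, compare consistently in either unit.
diff_ratio_c = (30 - 20) / (20 - 10)
diff_ratio_f = (celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)) / (
    celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
)

# Ratio scale: weight has a true zero, so the ratio is the same in any unit.
KG_TO_LB = 2.20462  # approximate conversion factor
ratio_kg = 10 / 5
ratio_lb = (10 * KG_TO_LB) / (5 * KG_TO_LB)

print(ratio_c, ratio_f, diff_ratio_c, diff_ratio_f, ratio_kg, ratio_lb)
```

The temperature ratio changes from 2.0 in Celsius to 1.36 in Fahrenheit, while the weight ratio stays at 2 regardless of the unit.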

Analytic processes and tools

•Storing, processing, and analyzing big data became difficult using traditional methods.


•Big data analytics is used to improve customer experience.
•Big data analytics helps in quicker and better decision making in organizations.

•These days, organizations are realising the value they get out of big data analytics and hence they
are deploying big data tools and processes to bring more efficiency in their work environment.
•Companies now use many big data tools and processes to discover insights and support decision
making.
•Big data processing is a set of techniques or programming models for accessing large-scale data
and extracting useful information to support decisions.

Stages in Big Data Analytics
The Big Data Analytics process involves the following stages:
1. Identifying the problem: determine which problem needs to be solved.
2. Designing data requirements: decide what kind of data is required to analyze the problem.
3. Pre-processing data: prepare the data before the actual analysis begins.
4. Performing analytics over data: analyze the processed data using various methods.
5. Visualizing data: represent the data or information in a graph, chart, or other visual format.
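
The five stages above can be sketched as a tiny end-to-end pipeline. Everything in this example is assumed for illustration: the business question, the `raw_orders` records, and the function names `preprocess`, `analyze` and `visualize`. A text bar chart stands in for a real plotting tool.

```python
from collections import defaultdict
from statistics import mean

# 1. Identifying the problem (hypothetical):
#    which product category has the highest average order value?
# 2. Designing data requirements: each record needs a category and an amount.
raw_orders = [
    {"category": "books", "amount": "12.50"},
    {"category": "toys", "amount": "30.00"},
    {"category": "books", "amount": ""},  # incomplete record
    {"category": "toys", "amount": "20.00"},
    {"category": "books", "amount": "17.50"},
]

def preprocess(records):
    """3. Pre-processing: drop incomplete records and convert amounts to numbers."""
    return [(r["category"], float(r["amount"])) for r in records if r["amount"]]

def analyze(rows):
    """4. Performing analytics: average order amount per category."""
    groups = defaultdict(list)
    for category, amount in rows:
        groups[category].append(amount)
    return {category: mean(amounts) for category, amounts in groups.items()}

def visualize(averages):
    """5. Visualizing: one text bar per category, one '#' per currency unit."""
    lines = []
    for category, avg in sorted(averages.items()):
        lines.append(f"{category:<6}{'#' * round(avg)} {avg:.2f}")
    return lines

averages = analyze(preprocess(raw_orders))
chart = visualize(averages)
print("\n".join(chart))
```

Keeping each stage in its own function mirrors the list above and makes it easy to swap in a different data source or analysis method without touching the other stages.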

Some of the data analytics tools most widely used in the industry:
• R Programming (Leading Analytics Tool in the industry)
• Python
• Excel
• SAS
• Apache Spark
• Splunk
• RapidMiner
• Tableau Public
• KNIME
Analysis vs reporting
Analysis:
• Analysis is the process of taking the organized data and examining it.
• This helps users to gain valuable insights into how businesses can improve their performance.
• Analysis transforms data and information into insights.
• The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.

Reporting:
• Once data is collected, it is organized using tools such as graphs and tables.
• The process of organizing this data is called reporting.
• Reporting translates raw data into information.
• Reporting helps companies to monitor their online business and be alerted when data falls
outside of expected ranges.
• Good reporting should prompt its end users to ask questions about the business.
Conclusion:
• Reporting shows us “what is happening”.
• Analysis focuses on explaining “why it is happening” and “what we can do about it”.
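
The contrast can be made concrete with a small sketch. The daily figures, the `EXPECTED_RANGE` threshold and the outage column below are invented for illustration. The `report` function only organizes the data and raises alerts, while `analyse` looks for a possible explanation of the alerted days.

```python
from statistics import mean

# Hypothetical daily data: (day, visits, minutes of site outage)
daily = [
    ("Mon", 1000, 0),
    ("Tue", 980, 0),
    ("Wed", 400, 180),
    ("Thu", 1020, 0),
    ("Fri", 450, 150),
]
EXPECTED_RANGE = (800, 1200)  # assumed acceptable range of daily visits

def report(rows, expected):
    """Reporting: organize raw data into a table and flag out-of-range values."""
    low, high = expected
    return [
        (day, visits, "ok" if low <= visits <= high else "ALERT")
        for day, visits, _ in rows
    ]

def analyse(rows, expected):
    """Analysis: compare flagged and normal days to look for an explanation."""
    low, high = expected
    flagged = [r for r in rows if not low <= r[1] <= high]
    normal = [r for r in rows if low <= r[1] <= high]
    return {
        "avg_outage_flagged": mean(r[2] for r in flagged),
        "avg_outage_normal": mean(r[2] for r in normal),
    }

table = report(daily, EXPECTED_RANGE)
findings = analyse(daily, EXPECTED_RANGE)
print(table)
print(findings)
```

The report says that Wednesday and Friday fell outside the expected range. The analysis suggests why: those are the only days with outages. Such a pattern is a lead to investigate rather than proof of the cause.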
Analysis Vs Analytics
Analysis:
• It is a way to interpret the data and derive meaningful insights from it.
• Analysis is a part of analytics.
• It is the process of examining in close detail the components of a given data set: separating
them out and studying the parts individually and their relationships with one another.
Analytics:
• Analytics includes data collection, preparation, processing and reporting.
• It applies statistical tools and techniques to bring out the hidden patterns and stories in the
data, e.g. data mining tools, APIs, ML algorithms.
• It is a broader term referring to a discipline that encompasses the complete management of
data, including collecting, cleaning, organizing, storing, governing and analysing data, as well as
the tools and techniques used to do so.

Modern data analytic tools
•Data Analytics tools are types of application software that retrieve data from one or more systems
and combine it in a repository, such as a data warehouse, to be reviewed and analysed.
•Most organizations use more than one analytics tool including spreadsheets with statistical
functions, statistical software packages, data mining tools, and predictive modelling tools.
•Together, these Data Analytics Tools give the organization a complete overview of the company to
provide key insights and understanding of the market/business so smarter decisions may be made.

• Data analytics tools not only report the results of the data but also explain why the results
occurred to help identify weaknesses, fix potential problem areas, alert decision-makers to
unforeseen events and even forecast future results based on decisions the company might make.
Some of the data analytics tools most widely used in the industry:
• Hadoop: helps in storing and analyzing big data.
• MongoDB: used on datasets that change frequently.
• Talend: a tool used for data integration and management.
• Cassandra: a distributed database used for handling large volumes of data.
• Spark: used for real-time processing and analysis of large amounts of data.
• Storm: an open-source real-time computation system.
• Kafka: a distributed streaming platform that is used for fault-tolerant storage.
Thank you
