
Big Data Analysis

Dr. Maryam Hazman


Content

 Course Description
 Introduction to Big Data
Course Description

The aim of this course is to provide students with the theoretical and practical skills related to the big data analysis process.
Course Description
Course Contents:

 Basic concepts in big data
 Cloud computing
 Introduction to big data analytics
 Introduction to Hadoop technology
 MapReduce
 Revision
 Final exam
Course Description

Grading (100%):
 Final exam: 70%
 Mid-term exam: 10%
 Practice exam: 10%
 Course work: 10%

Timing:
 Lecture: 3
 Practice: 3
Introduction

Buying Online
Introduction

What about offline buying?


Introduction
A personalized experience requires a lot of data to be collected:

 Shopping cart
 Wish list and Previous purchases
 Items rated and reviewed
 Geo-location
 Time-on-site and Duration of views
 Links clicked & Text Searched
 Telephone inquiries
 Responses to marketing materials
 Social media posting
Introduction

Customer Data
Introduction

What has changed to make digital technology so useful today?


What is big data?

Big Data

 is a volume of both structured and unstructured data.

 is so large that it is difficult to process using traditional database and software techniques.

 has a volume of data that is too big.

 moves too fast and exceeds current processing capacity.


What is big data?
 Big data is a term for a collection of data sets so large and complex that it often becomes difficult to process using traditional data processing applications.

 Large amounts of different types of data produced from various types of sources, such as

  People,
  Machines or
  Sensors.
What is big data?
The Big Data Framework organization attempts to categorize the development of Big data into three main phases:

 Phase 1.0 (1970-2000): Big data was mainly described by data storage and analytics, and it was an extension of modern database management systems and data warehousing technologies;

 Phase 2.0 (2000-2010): with the rise of Web 2.0 and the propagation of semi-structured and unstructured content, the notion of Big data changed to embody advanced technical solutions for extracting meaningful information from dissimilar and heterogeneous data formats;

 Phase 3.0 (2010-now): with the emergence of smartphones and mobile devices, sensor data, wearable devices, the Internet of Things (IoT), and many more data generators, Big data has entered a new era and has drawn a new horizon with a new range of opportunities.
Big Data Characteristics

 The following is a brief discussion of the 10 Vs of Big data.


1. Volume
2. Velocity
3. Variety
4. Veracity
5. Variability
6. Validity
7. Vulnerability
8. Volatility
9. Visualization
10. Value
Volume

 Amount of data.

 Refers to the vast growth in the amount of data.

 The size of data plays a very critical role in determining the value that can be extracted from it.

 This is evident as more than 90% of the data was produced recently.

 In fact, more than 2.5 exabytes (1 exabyte = 10^18 bytes) of data have been created daily since as early as 2013, from every post, share, search, click, stream, and many more data producers. This is expected to reach 463 exabytes per day by 2025.

 People share 500 terabytes of data per day on Facebook, and over 300 hours of video are uploaded to YouTube every minute.
Velocity

 The speed of generation of data.

 Data flow is often vast and continuous, so it requires platforms and capacities that can not only handle significant volumes but also deal with this stream in real time.

 Represents the accumulation of data at high speed, in near real time and real time, from dissimilar data sources.
Variety (Format)

 Different types of data


 Involves collecting data from various sources and in fuzzy and heterogeneous types.
 This includes importing data in dissimilar formats (a minimal sketch follows this list), namely
  Structured (tables residing in relational databases – RDBMS, etc.),
  Semi-structured (email, XML, JSON, and other markup languages, etc.) and
  Unstructured (text, pictures, audio files, video, sensor data, etc.).
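
A minimal Python sketch of the three flavors of data, assuming hypothetical local files orders.csv, events.json, and reviews.txt:

```python
import csv
import json

# Structured: fixed schema, e.g. a CSV export of an RDBMS table.
with open("orders.csv", newline="") as f:
    orders = list(csv.DictReader(f))    # each row -> dict keyed by column name

# Semi-structured: self-describing but flexible schema, e.g. JSON events.
with open("events.json") as f:
    events = json.load(f)               # nested fields may differ per record

# Unstructured: no schema at all, e.g. raw review text.
with open("reviews.txt") as f:
    reviews = f.read().splitlines()     # meaning must be extracted later (NLP, etc.)

print(len(orders), "orders,", len(events), "events,", len(reviews), "reviews")
```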
Veracity
 Refers to the source, accuracy, and correctness of data. Is it information or
misinformation?

 Being able to identify the relevance and accuracy of data and apply it to the
appropriate purpose.

 There are multiple factors to ensure the veracity of Big data:

 Trustworthiness of data source

 Reliability and security of data store

 Data availability

 Correctness and

 Consistency
Variability

 Refers to variance in meaning, the number of inconsistencies, the multitude of data dimensions, and inconsistent data-receiving speeds.
Validity
 Refers to the “data are shown (or known) to be an accurate indicator of the claim being made”.

 It differs from veracity in that validity does “mean the correctness and accuracy of data with regard to the intended usage”.

 In other words, data can be trustworthy and thus satisfy the veracity aspect, but poor interpretation of the data might lead to unintended use. Moreover, the same truthful data can be valid for use in one application and invalid for a different one.
Vulnerability

 Refers to the security of the collected datasets that will be used for later analysis.

 It also denotes errors in the system which permit harmful activities to be conducted on the collected datasets.

 Hence, the acquisition of datasets should ensure the capacity to provide safe systems able to protect the collected data from breaches.
Volatility

 Refers to the time for which data is valid to be stored/used before it becomes outdated or no longer relevant.

 It is a crucial dimension, since the cost of storage and maintenance grows the longer Big data is stored.
Visualization
 Refers to the ability to present Big data in a visual context, such as diagrams, graphs, maps, etc., toward better understanding and interpretation of data.

 It also assists people and organizations in discovering patterns, correlations, trends, relationships and dependencies.

 Big data visualization is a powerful tool for decision makers to access, evaluate and interpret massive data, even in real time, and act upon it.
Value

 Represents the outcome product of Big data analysis (i.e. new ideas, insights).

 Understanding the potential to create revenue or unlock opportunities through your data. This reflects the outcomes of using your data analysis results. If the data is not valuable, questions should be raised about why and for how long you store it.
Big data Challenges

 Storing and processing issues
 Privacy and Security
 Data access and sharing
 Analytical challenges
 Skills requirements
 Technical Issues
Storing and processing issues

 The rate of increase in data is much faster than existing processing systems can handle.

 Current storage systems are not capable enough to store all of this data.

 There is a need to develop a processing system that satisfies not only today's needs but also future needs.
Privacy and Security

 New devices and technologies like cloud computing provide a gateway to access and to store information for data analysis.

 This integration of IT architectures will lead to greater risks to data security and intellectual property.
Data access and sharing

Generally, data is used for making accurate decisions.

The data should be available in an accurate, complete and timely manner.
Analytical challenges

Traditional RDBMSs are suitable only for structured data.

What if the data volume gets so large that we do not know how to deal with it?

Does all data need to be stored?

Does all data need to be analyzed?

Which data points are important?

How can data be used to best advantage?
Skills requirements

 With the increase in the amount of (structured, semi-structured, and unstructured) data generated, there is a need for talent.

 The demand for people with good analytical skills in big data is increasing.
Technical Issues

 Fault Tolerance

 Scalability

 Quality of Data

 Heterogeneous Data
Technical Issues: Fault Tolerance

A system's ability to continue operating uninterrupted despite the failure of one or more of its components.

Fault-tolerant systems use backup components that automatically take the place of failed components, ensuring no loss of service.
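
A minimal Python sketch of this backup-component idea, where fetch_from() and the node names are hypothetical stand-ins for real remote reads:

```python
def fetch_from(node: str, key: str) -> str:
    # Stand-in for a real remote read; here the "primary" node is down.
    if node == "primary":
        raise ConnectionError(f"{node} is unreachable")
    return f"value-of-{key}@{node}"

def fault_tolerant_read(key: str, nodes=("primary", "replica-1", "replica-2")) -> str:
    for node in nodes:
        try:
            return fetch_from(node, key)   # first healthy node serves the request
        except ConnectionError:
            continue                       # failed component: fall back to the next backup
    raise RuntimeError("all replicas failed")

print(fault_tolerant_read("user:42"))      # served by replica-1 despite the primary failure
```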
Technical Issues: Scalability

 The property of a system to handle a growing amount of work by adding resources to the system.

 Vertical Scalability (Scale-up)

 Horizontal Scalability (Scale-out)
Technical Issues: Scalability
Vertical Scalability (Scale-up)

 In this type of scalability, we increase the power of the existing resources in the working environment (e.g. a more powerful CPU, more RAM), scaling in an upward direction.
Technical Issues: Scalability

Horizontal Scalability (Scale-out)

 In this kind of scaling, resources are added in a horizontal row, i.e. more machines (nodes) are added to the system.
Technical Issues: Quality of Data

 Data Quality:
  Completeness
  Validity
  Accuracy
  Consistency
  Integrity
  Timeliness
Technical Issues: Heterogeneous Data
 Data is collected from different sources with different formats.

 Data sources include:
  Databases
  Websites
  Social Networks
  Files
  Ontologies
  APIs
  ….
Big Data Analytics
 A set of fundamental concepts/principles that underlie techniques for
extracting useful knowledge from large datasets containing a variety of data
types.

 Big data analytics is a term that describes the process of using data to
discover trends, patterns, and other correlations, as well as using them to
make data-driven decisions.
Types of Big Data Analytics
There are four main types of big data analytics: descriptive,
diagnostic, predictive, and prescriptive analytics.

They use various tools for processes such as cleaning, integration, visualization, data mining, and many others, to improve the process of analyzing data and ensure that the company benefits from the data it gathers.
Descriptive Analytics
 Answers the question, “What happened?”

 It is one of the first steps in analyzing raw data: performing simple mathematical operations and producing statements about samples and measurements.

 It allows you to know the trends from raw data and describe what is currently happening.

 Data visualization is a natural fit for descriptive analysis, since charts, graphs, and maps can show trends in data in a clear, easily understandable way.
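
A minimal sketch of descriptive analytics in Python with pandas, using invented monthly revenue figures purely for illustration:

```python
import pandas as pd

# Hypothetical monthly sales; descriptive analytics summarizes "what happened".
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120_000, 135_000, 128_000, 150_000],
})

print(sales["revenue"].describe())     # count, mean, std, min/max, quartiles
print("total:", sales["revenue"].sum())
print(sales["revenue"].pct_change())   # month-over-month trend statement
```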
Diagnostic Analytics
 Answers the question, “Why did it happen?”

 Used to investigate data and content to answer “Why did it happen?”. By analyzing the data, we understand the reasons for certain behaviors and events related to a specific situation.

 It includes comparing coexisting trends or movements, uncovering correlations between variables, and determining causal relationships where possible.

 Some tools and techniques used for such a task include searching for patterns in the data sets, filtering the data, probability theory, regression analysis, and more.
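
As a sketch of the correlation-and-regression step, with invented numbers (ad spend, discounts, revenue are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: was the revenue jump driven by ad spend or by discounts?
df = pd.DataFrame({
    "ad_spend": [10, 12, 9, 15, 18, 20],
    "discount": [5, 5, 10, 0, 0, 5],
    "revenue":  [100, 110, 95, 130, 145, 160],
})

# Uncover correlations between variables (correlation is not causation,
# but it narrows down which factors deserve a causal investigation).
print(df.corr(numeric_only=True)["revenue"])

# A simple regression quantifies the relationship found above.
slope, intercept = np.polyfit(df["ad_spend"], df["revenue"], 1)
print(f"revenue ~= {slope:.1f} * ad_spend + {intercept:.1f}")
```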
Predictive Analytics
 Answers the question, “What might happen in the future?”

 Used to make predictions about future outcomes based on analyzing historical data.

 In order to get the best results, it uses many sophisticated predictive tools and models, such as machine learning and statistical modeling.

 Making predictions for the future can help your organization formulate strategies based on likely scenarios.
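
A minimal statistical-modeling sketch using scikit-learn, with a synthetic six-month revenue history (all figures invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: revenue for months 1..6.
months = np.array([[1], [2], [3], [4], [5], [6]])   # feature matrix, one column
revenue = np.array([100, 110, 95, 130, 145, 160])

# Fit a model to historical data, then predict future outcomes.
model = LinearRegression().fit(months, revenue)
print("forecast for months 7-8:", model.predict([[7], [8]]))
```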
Prescriptive Analytics
 Answers the question, “What should we do next?”

 It takes into account all possible factors in a scenario and suggests actionable takeaways.

 It takes the results from descriptive and predictive analysis and finds solutions for optimizing decisions through various simulations and techniques.
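
A toy simulation-based sketch of the idea: evaluate candidate actions under simulated scenarios and recommend the best one. The demand model, unit cost, and prices below are all assumptions made up for illustration:

```python
import random

random.seed(0)

def simulate_profit(price: float, n_scenarios: int = 1000) -> float:
    """Average profit for a price over randomly simulated demand scenarios."""
    total = 0.0
    for _ in range(n_scenarios):
        demand = max(0.0, random.gauss(1000 - 8 * price, 50))  # assumed demand model
        total += (price - 20) * demand                         # unit cost assumed = 20
    return total / n_scenarios

candidate_prices = [40, 50, 60, 70]
best = max(candidate_prices, key=simulate_profit)
print("recommended price:", best)   # the actionable takeaway
```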
Assignment

Discuss the difference between a data warehouse and a data lake.


Thanks
