CH1 - Big Data Introduction-En

The document provides an overview of big data, highlighting its rapid growth and the challenges associated with managing and analyzing vast amounts of diverse data. It discusses various use cases across industries such as finance, healthcare, telecommunications, and retail, illustrating how big data analytics can address specific problems. Additionally, it emphasizes the importance of modern tools and technologies in handling big data's volume and variety.

Uploaded by

Hayder Melki

Big Data Introduction

• Real cases and facts: Big data tsunami!
• Big data use cases
• Big Data & industries
• Big Data Vs
• Exercises

MUST, FSB, Anis Ben Aicha 1


Jeff Reed (2017), Data Analytics: Applicable Data Analysis to Advance Any Business Using the Power of Data Driven Analytics


Real cases and facts!

• Every 2 days we create as much information as we did from the beginning of time until 2003
• Over 90% of all data in the world was created in the past 2 years
• Amount of digital information in 2020 = 40 zettabytes (1 zettabyte = 10^21 bytes, close to 2^70 bytes)
• The amount of data doubles every 1.2 years
• Every minute: 204 million emails, 1.8 million Facebook likes
• Google: processes 40,000 search queries per second ≈ 3.5 × 10^9 per day
• YouTube: 100 hours of video are uploaded per minute
• Data created in one day: burned onto DVDs, the stack would reach the Moon
• Largest volume of data: AT&T → 312 terabytes
• 1,570 new websites per minute
• Companies monitoring Twitter for sentiment analysis: 12 terabytes per day
• More than 50 × 10^9 connected devices
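
A few of the figures quoted on the "Real cases and facts" slide can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the variable names are ours, and the decimal definition of the terabyte (10^12 bytes) is an assumption. It shows that 40,000 queries per second is about 3.46 × 10^9 per day, consistent with the rounded 3.5 × 10^9, and that 2^70 is about 18% larger than 10^21.

```python
# Back-of-the-envelope checks for figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

# Google: 40,000 search queries per second, converted to queries per day.
queries_per_day = 40_000 * SECONDS_PER_DAY

# One zettabyte (decimal) compared with the nearby power of two.
ZETTABYTE = 10**21
TWO_POW_70 = 2**70
ratio = TWO_POW_70 / ZETTABYTE

# 40 zettabytes expressed in terabytes (assumption: 1 TB = 10**12 bytes).
digital_info_2020_tb = 40 * ZETTABYTE // 10**12

print(f"Queries per day: {queries_per_day:.3e}")
print(f"2**70 / 10**21 = {ratio:.3f}")
print(f"40 ZB = {digital_info_2020_tb:.1e} TB")
```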


Data tsunami

• We are witnessing a tsunami of data:
- Huge volumes
- Data of different types and formats
- New data arriving at increasing speeds

• The challenges:
- Capturing, transporting, and moving the data
- Managing the data → the hardware involved, and the software
- Processing: managing & programming → to provide insight into the data
- Storing: safeguarding and securing




Big Data examples

• Science
- Astronomy
- Atmospheric science
- Genomics
- Biogeochemical
- Biological
• Social
- Social networks
- Social data
- Medical records
• Commercial
- Web / event / database logs
- Sensor networks
- Internet text and documents
- Internet search indexing
- Photographic archives
- Video / audio archives
- Large scale eCommerce
• Government
- Regular government business and commerce needs
- Military and homeland security surveillance
• ……


Big Data examples: use case (Financial)

• Problem:
Manage several petabytes of data, growing at 40-100% per year, under increasing pressure to prevent fraud and complaints to regulators

• How big data analytics can help:
- Fraud detection
- Credit issuance
- Risk management
- 360° view of the customer


Big Data examples: use case (Financial)

• Problem (Visa card fraud)
- Credit card fraud costs a lot of money per year
- Fraud schemes are constantly changing
- Understanding the fraud pattern months after the fact is only partially helpful
- Fraud detection models need to evolve faster

• If only Visa could …
- Reinvent how to detect the fraud patterns
- Stop new fraud patterns before they can rack up significant losses

• Solution
- Revolutionize the speed of detection
Visa loaded two years of test records, or 73 billion transactions, amounting to 36 terabytes of data, into Hadoop. The processing time fell from one month with traditional methods to a mere 13 minutes.
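
To put the Visa figures in perspective, the numbers on the slide can be turned into a speedup factor and an average record size. This is a rough sketch with assumptions stated in the comments: "one month" is taken as 30 days and a terabyte as 10^12 bytes. Under those assumptions, the speedup is a little over 3,300 times and each transaction record averages a bit under 500 bytes.

```python
# Figures taken from the slide.
transactions = 73e9      # two years of test records
data_bytes = 36e12       # 36 TB (assumption: decimal terabytes)
after_minutes = 13       # processing time on Hadoop

# Assumption: "one month" means 30 days.
before_minutes = 30 * 24 * 60

speedup = before_minutes / after_minutes
bytes_per_transaction = data_bytes / transactions
throughput_per_second = transactions / (after_minutes * 60)

print(f"Speedup: about {speedup:,.0f}x")
print(f"Average record size: about {bytes_per_transaction:.0f} bytes")
print(f"Throughput: about {throughput_per_second:,.0f} transactions/s")
```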


Big Data examples: use case (Healthcare)

• Problem:
- Vast quantities of real-time information are starting to come from wireless monitoring devices that postoperative patients and those with chronic diseases wear at home and in their daily lives.
- Example: The U.S. produces 1.2 billion clinical care documents each year. These documents contain information about a patient's medical history, doctor's visits, hospital visits, previous treatments, procedures, test results and prescription medications.

• How big data analytics can help:
- Epidemic early warning
- Intensive Care Unit and remote monitoring
- A complete picture of patients for effective care
- An accurate patient profile for correct care
- A growing data laboratory for precise and practice-based care

The Data Is In: 3 Ways Analytics Will Improve Healthcare
http://dataconomy.com/the-data-is-in-3-ways-analytics-will-improve-healthcare


Big Data examples: use case (Telecommunications)

• Problem:
- Legacy systems used to gain insights from internally generated data face high storage costs, long data loading times, and long administration and processing times…

• How big data analytics can help:
- Combat fraud
- Churn prediction
- Geomapping / marketing
- Network monitoring


Big Data examples: use case (Transportation)

• Problem:
- Traffic congestion has increased worldwide as a result of urbanization and population growth, reducing the efficiency of transportation infrastructure and increasing travel time and fuel consumption.

• How big data analytics can help:
- Urban planning & monitoring
- Real-time analysis of weather and traffic congestion data streams to identify traffic patterns and reduce transportation costs.


Big Data examples: use case (Retail & social media)

• Problem:
- Retailers want to use "big data" to predict trends, prepare for demand, pinpoint customers, optimize pricing & promotions, and monitor real-time analytics & results by combining data from web browsing patterns, social media, industry forecasts, existing customer records, etc. → a huge amount of data

• How big data analytics can help:
- Access social media to gain insight
- Federate data between Big Data and RDBMSs
- Apply graph analysis to the available data
- Work to understand demand and engage customers

The Impact of Big Data on The Retail Sector: Examples And Use-Cases
https://www.datapine.com/blog/big-data-in-retail-examples/


Big Data examples: use case (Retail & social media)

• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
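
The graph analyses listed above can be illustrated on a toy social graph. The sketch below uses only the Python standard library; the graph, the names in it, and the function names are hypothetical. It covers path analysis (breadth-first shortest path), connectivity analysis (connected components) and centrality analysis (degree centrality). Real community analysis uses dedicated algorithms such as modularity optimization; here connected components serve only as a very coarse stand-in.

```python
from collections import deque

# Hypothetical undirected friendship graph: node -> set of neighbors.
GRAPH = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy"},
    "cy": {"ana", "ben", "dee"},
    "dee": {"cy"},
    "eli": {"fay"},
    "fay": {"eli"},
}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search for a shortest path."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


def connected_components(graph):
    """Connectivity analysis: group nodes that can reach each other."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def degree_centrality(graph):
    """Centrality analysis: share of the other nodes each node is linked to."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}


path = shortest_path(GRAPH, "ana", "dee")
components = connected_components(GRAPH)
centrality = degree_centrality(GRAPH)
most_central = max(centrality, key=centrality.get)
print(path, components, most_central)
```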


Big Data & industries

• Problem:
- The world of production will become more and more networked until everything is interlinked with everything else. The complexity of production and supplier networks has grown enormously. Previously, networks and processes were limited to one factory, but the boundaries of individual factories will most likely disappear in favor of interconnecting multiple factories or even geographical regions.

• How big data analytics can help:
- The Internet of Things (IoT)
- Industry 4.0


Big Data & industries

• Fourth industrial revolution
- Industry 1.0: Water/steam power
- Industry 2.0: Electric power
- Industry 3.0: Computing power
- Industry 4.0: Internet of Things (IoT) power


Big Data & industries

• The Eras of Data
- 0: Flat files
- 1: Relational databases (RDBMS) - 1970s - OLTP (Online Transaction Processing)
- 2: Data warehouses - 1990s - OLAP (Online Analytical Processing) or DSS (Decision Support Systems) workloads
- 3: Big Data - 2000s - batch, with a movement towards real-time




Big Data & industries

• Different types of data
• Each of them requires different tools and techniques.
• The main categories of data:
- Structured
- Semi-structured
- Unstructured
- Natural language
- Machine-generated
- Graph-based
- Audio, video, and image
- Streaming
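
The claim that each category needs different tools can be illustrated with the first three categories. The sketch below is a minimal example using only the Python standard library; the sample data is invented for illustration. Structured data is read with a CSV parser, semi-structured data with a JSON parser that must tolerate missing fields, and unstructured text has to be tokenized before anything can be counted.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same shape.
structured = io.StringIO("customer_id,amount\n1,20.5\n2,13.0\n")
rows = list(csv.DictReader(structured))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing, but fields may vary from record to record.
semi_structured = '[{"id": 1, "tags": ["vip"]}, {"id": 2}]'
records = json.loads(semi_structured)
tags = [tag for record in records for tag in record.get("tags", [])]

# Unstructured: free text, so structure has to be extracted first.
unstructured = "Loved the delivery speed, but the app crashed twice!"
words = re.findall(r"[a-z']+", unstructured.lower())

print(total, tags, words)
```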


Big Data & industries

How?

[Figure: big data platform]

[Figure: big data architecture]


Big Data Vs

4 classic dimensions of big data


Big Data Vs

V1: Volume

• Big data's main attribute is its huge volume, collected from several sources.
• Data is collected from diverse sources: business transactions, social media, sensors, browsing history, etc.
• Data is often measured in gigabytes or terabytes. However, many analyses indicate that the total amount of big data generated to date is measured in zettabytes → an enormous amount of data accessible for company research and analysis.
• Data is expanding dramatically with each new day: every minute, data worth millions of TBs is generated globally through Facebook, tweets, instant messaging, emails, mobile usage, product evaluations, etc. Hundreds of new Twitter accounts are established every minute, tens of thousands of apps are downloaded, and thousands of fresh tweets and advertisements are published. Every two years, the quantity of big data generated globally will double.


Big Data Vs

V1: Volume

• Traditional database technology cannot meet the demand for effective data management, including storage and analysis, because the volume of data is increasing so quickly.
• Wide-scale adoption of modern tools like Hadoop and MongoDB is crucial right now. They rely on distributed systems to make it easier to store and analyze this massive amount of data across several databases.
• The modern era now has a wider range of opportunities thanks to the information explosion.
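
The idea of spreading storage and analysis over several machines can be sketched with the map/reduce pattern that Hadoop popularized. The example below is not the Hadoop API: it is a single-process illustration with hypothetical function names, where each "shard" stands in for the data held by one node. Each shard is processed independently (map), and the partial results are then merged (reduce).

```python
from collections import Counter
from functools import reduce


def split_into_shards(records, shard_count):
    """Spread records round-robin over shards (stand-ins for cluster nodes)."""
    return [records[i::shard_count] for i in range(shard_count)]


def map_shard(shard):
    """Map step: each shard counts its own words independently."""
    return Counter(word for line in shard for word in line.split())


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


LINES = ["big data", "data tsunami", "big big data"]
shards = split_into_shards(LINES, shard_count=2)
# On a real cluster the map step would run in parallel, one task per node.
partials = [map_shard(shard) for shard in shards]
totals = reduce(reduce_counts, partials, Counter())
print(totals)
```

Because the map step never needs data from another shard, adding nodes lets the same job handle more data, which is the property the slide attributes to distributed tools.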


Big Data Vs

V2: Variety

• Big data is collected and created in various formats and from various sources. It includes structured data as well as unstructured data like text, multimedia, social media, business reports, etc.
• Structured data: traditional data management and analysis techniques may be used to store and analyze structured data, such as bank records, demographic data, inventory databases, company data, and product data streams.
• Unstructured data: collected information such as photos, tweets or Facebook status updates, instant messenger conversations, blogs, uploaded videos, voice recordings, and sensor data. There is no clear pattern in these kinds of data. Unstructured data frequently reflects human ideas, sentiments, and emotions that are sometimes difficult to articulate in precise terms.


Big Data Vs

V2: Variety

• One of the main objectives of big data is to collect all this unstructured data and analyze it using the appropriate technology.
• Variety of data helps to get insights from different sets of samples, users and demographics.
- It helps to bring different perspectives to the same information.
- It also allows analyzing and understanding the impact of different forms and sources of data collection from a "larger picture" point of view.


Big Data Vs

V3: Velocity

• Speed is one of the key drivers of business success. Fast turnaround is one of the prerequisites for staying alive in fierce competition. Expectations of quick results and quick deliverables create great pressure.
• In these situations, it becomes essential to quickly collect and analyze huge amounts of heterogeneous data in order to make accurate decisions.
• Even high-quality data may hinder business decision making if it arrives too slowly.
• Velocity is the speed or frequency at which data is collected in various forms and from different sources for processing.
• It ranges from batch updates, to periodic updates, to real-time flow of the data.
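
The difference between the two ends of the velocity range, batch and real-time, can be shown on a small computation. The sketch below is only an illustration with hypothetical names and invented readings: the batch version waits for the full data set before answering, while the streaming version keeps an up-to-date average after every new reading without storing the history.

```python
def batch_average(readings):
    """Batch: wait for the full data set, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Streaming: update the result incrementally as each reading arrives."""
    count, mean = 0, 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


READINGS = [10.0, 20.0, 30.0, 40.0]
running = list(streaming_average(READINGS))
print(batch_average(READINGS), running)
```

Both approaches agree once all the data has arrived; the streaming version simply makes an answer available earlier, which is what a high-velocity use case needs.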


Big Data Vs

V4: Veracity

• It is very likely that vast amounts of data include some ambiguity.
• Big data has to be filtered for clean and pertinent information if we want to provide the company with insights that will help it grow → the data used as input should be properly prepared, conformed, verified, and made consistent in order to make reliable judgments.
• Causes: there are several causes of data contamination, including incorrect references or associations, waste data, fake data, data entry mistakes or typos (primarily in structured data), etc.
• In automated data collection, analysis, report generation, and decision making processes, it is essential to have a foolproof system in place to avoid any lapses.
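
The preparation and verification step described above can be sketched as a small validation pass. The records, field names and rules below are invented for illustration; a real pipeline would have many more checks. The sketch rejects rows with non-numeric amounts (a typo), missing identifiers and duplicates, and normalizes inconsistent spelling in the rows it keeps.

```python
# Hypothetical raw input with typical contamination.
RAW = [
    {"id": "1", "country": "Tunisia", "amount": "120.50"},
    {"id": "1", "country": "Tunisia", "amount": "120.50"},  # duplicate row
    {"id": "2", "country": "tunisia ", "amount": "80"},     # inconsistent spelling
    {"id": "3", "country": "Tunisia", "amount": "12O"},     # typo: letter O, not zero
    {"id": "", "country": "Tunisia", "amount": "40"},       # missing identifier
]


def clean(records):
    """Return (valid_rows, rejected_rows) after basic veracity checks."""
    valid, rejected, seen_ids = [], [], set()
    for record in records:
        record_id = record["id"].strip()
        try:
            amount = float(record["amount"])
        except ValueError:
            rejected.append((record, "amount is not a number"))
            continue
        if not record_id:
            rejected.append((record, "missing id"))
        elif record_id in seen_ids:
            rejected.append((record, "duplicate id"))
        else:
            seen_ids.add(record_id)
            valid.append({
                "id": int(record_id),
                "country": record["country"].strip().title(),
                "amount": amount,
            })
    return valid, rejected


valid_rows, rejected_rows = clean(RAW)
reasons = [reason for _, reason in rejected_rows]
print(valid_rows)
print(reasons)
```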


Big Data Vs

More Vs

• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?

Understanding the Many V's of Healthcare Big Data Analytics
https://healthitanalytics.com/news/understanding-the-many-vs-of-healthcare-big-data-analytics


Exercises

Exercise 1:
Analyze the following use cases with respect to the four Vs

• Case 1: Facebook
• Case 2: Skype
• Case 3: Fraud detection in banking transactions
• Case 4: Jumia


Exercises

Exercise 2:
- Problem statement: Health organizations, such as the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC), need to monitor and predict disease outbreaks to take timely preventive actions. Traditional methods of disease surveillance may not provide real-time insights.
1- What constraints does a big data solution have to face?
2- Propose an architecture for a big data solution.
3- What are the expected benefits?


Annex A: Byte multiples
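
The byte multiples used throughout the course, from kilobytes up to the zettabytes quoted earlier, can be generated programmatically. The sketch below is our own illustration and does not reproduce the content of the original annex. It lists each decimal (SI) multiple together with the nearby power of two that is often used interchangeably with it.

```python
# SI prefixes in increasing order; each step multiplies by 1000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def byte_multiples():
    """Yield (name, decimal_exponent, binary_exponent) for each prefix."""
    for step, prefix in enumerate(PREFIXES, start=1):
        yield f"{prefix}byte", 3 * step, 10 * step


table = {name: (dec, binary) for name, dec, binary in byte_multiples()}

for name, (dec_exp, bin_exp) in table.items():
    print(f"{name:<10} 10^{dec_exp:<2} bytes (close to 2^{bin_exp})")
```

The power of two is always slightly larger than the decimal multiple, and the gap widens with each step: about 2.4% at the kilobyte level and about 18% at the zettabyte level.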


Annex B: OLTP vs OLAP

https://bugssiufam.wixsite.com/bugados/single-post/2016/08/18/OLTPOnline-Transaction-Processing-e-OLAPOnline-Analytical-Processing
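
The contrast between the two workloads can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `sales` table and its data are hypothetical, and in practice OLTP and OLAP usually run on separate, specialized systems rather than in one in-memory database. The OLTP part performs short transactions that each touch a few rows; the OLAP part runs one read-only query that scans and aggregates the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: short transactions, each touching one or a few rows.
with conn:
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("north", 120.0))
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("south", 80.0))
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("north", 50.0))
with conn:
    conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (90.0, 2))

# OLAP-style work: a read-only query that scans and aggregates many rows.
report = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(report)
conn.close()
```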
Annex: Data professions

• Data engineer profile (skills)
- Networking (infrastructure, administration, security, …)
- System administration
- Programming languages (Python, Java, Scala, etc.)
- Scripting languages (Bash, shell scripting)
- Database technologies (SQL, NoSQL, data warehousing)
- Cloud computing platforms (AWS, Azure, GCP)
- Big data technologies (Hadoop, Spark, Kafka)
- Data modeling and ETL (Extract, Transform, Load) tools
- Problem-solving and analytical skills

• Data engineer responsibilities
- Designing and building data pipelines
- Developing and maintaining data storage solutions
- Data cleaning and preparation
- Building data processing tools and scripts
- Monitoring and performance optimization


Annex: Data professions

• Data scientist profile (skills)
- Programming languages (Python, R, SQL)
- Statistics and probability
- Machine learning algorithms and libraries (e.g., TensorFlow, Scikit-learn)
- Data visualization tools (e.g., Tableau, Power BI)
- Database technologies (SQL, NoSQL)
- Cloud computing platforms (AWS, Azure, GCP)
- Strong analytical and problem-solving skills
- Excellent communication and presentation skills
- Curiosity and passion for data
- Creativity and critical thinking
- Team player with strong collaboration skills

• Data scientist responsibilities
- Formulating data-driven questions and hypotheses
- Data acquisition and wrangling
- Exploratory data analysis (EDA)
- Modeling and machine learning
- Data visualization and storytelling
- Evaluation and interpretation
- Collaboration and communication
Annex: Data professions

• Machine learning engineer profile (skills)
- Programming languages (Python, Java, C++, etc.)
- Machine learning libraries and frameworks (TensorFlow, PyTorch, etc.)
- Deep learning expertise for complex models
- Cloud computing platforms (AWS, Azure, GCP)
- Software engineering concepts and principles
- Data engineering tools and pipelines
- Version control systems (Git)
- DevOps, MLOps

• Machine learning engineer responsibilities
- Deployment and monitoring
- Software engineering and automation
- Data engineering and infrastructure
