Bda (Chapter 1)


BDA

CHAPTER 1

Introduction to Big Data (Simplified)

What is Data? Data refers to the quantities, characters, or symbols that a computer uses to perform
operations. It can be stored and shared through electrical signals or different storage media like
magnetic, optical, or mechanical devices.

Where Does Data Come From? Data comes from various sources, such as documents, images, audio,
software programs, and more.

Computer Data as Information: Computer data is any information processed or stored by a computer.
It includes text files, images, audio, and software. The computer’s CPU processes this data, and it is
saved as files on a storage device such as a hard disk.

Definition of Big Data: Big Data refers to an extremely large and growing collection of data that is
too large or complex to be handled by regular data management systems. While regular data is
typically measured in megabytes (MB) or gigabytes (GB), Big Data can reach petabytes (PB); one
petabyte is 1,000,000,000,000,000 bytes.
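
The size comparison above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where each step up is a factor of 1,000; the names `UNITS`, `to_bytes`, and `convert` are illustrative and not part of any library.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(value, unit):
    """Convert a size such as (1, "PB") into a number of bytes."""
    return value * UNITS[unit]


def convert(value, from_unit, to_unit):
    """Convert a size between two units, e.g. petabytes to gigabytes."""
    return to_bytes(value, from_unit) / UNITS[to_unit]


one_pb_in_bytes = to_bytes(1, "PB")    # bytes in one petabyte
one_pb_in_gb = convert(1, "PB", "GB")  # gigabytes in one petabyte
print(f"{one_pb_in_bytes:,} bytes")
print(f"{one_pb_in_gb:,.0f} GB")
```

In other words, one petabyte is a million gigabytes, which is why data at this scale does not fit on a single ordinary machine.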

Interesting Fact: A frequently quoted estimate is that about 90% of the world's data was created in just the last few years.

Sources of Big Data

• Weather stations and satellites: Produce massive amounts of data for forecasting.

• Emails, blogs, and news websites: Continuously generate large data volumes.

• Social media: Posts, photos, videos, likes, and comments contribute to Big Data.

• Traffic data and GPS signals: Data from vehicles and maps.

• Digital pictures and videos: Cameras and smartphones produce a huge amount of data.

Characteristics of Big Data (Simplified)

Big Data has several key characteristics, often referred to as the 5 Vs:

1. Volume

o Definition: The amount of data is huge.

o Example: Social media generates tons of posts, videos, and photos every second.

2. Velocity

o Definition: Data is created and processed very quickly.

o Example: Online shopping sites process thousands of transactions every minute.

3. Variety

o Definition: Data comes in different formats like text, images, videos, and numbers.

o Example: A single app might store user messages, photos, and videos all in different
formats.

4. Veracity

o Definition: The data can sometimes be uncertain or incorrect.

o Example: Social media posts may contain false information, which needs to be
filtered.

5. Value

o Definition: The importance of the data in making decisions.

o Example: Companies analyze customer feedback to improve their products and services.

These characteristics make Big Data challenging but also valuable for gaining insights.

Explanation of Big Data Characteristics (the 5 Vs, plus Visualization and Virality):

1. Volume (Data at Rest)

o What it means: Big Data is huge. We're talking about data in terabytes or even
petabytes, not just megabytes or gigabytes.

o Example: The Internet of Things (IoT) generates enormous amounts of data, which
keeps growing.

2. Variety (Data in Many Forms)

o What it means: Data comes in many formats: structured (like databases) and
unstructured (like text, videos, or images).

o Example: Emails, social media posts, and videos all create different types of data that
need to be stored and analyzed.

3. Veracity (Data in Doubt)

o What it means: This refers to the accuracy and trustworthiness of data. Large
volumes of data can sometimes be incomplete or inaccurate.

o Example: Social media posts may contain incorrect information, which makes it
difficult to ensure data quality.

4. Velocity (Data in Motion)

o What it means: The speed at which data is generated, processed, and made
accessible. It’s important for real-time data analysis.

o Example: Data from social media, sensors, and mobile devices is generated and
shared continuously at high speeds.

5. Value (Data into Money)

o What it means: The goal is to turn raw data into something valuable, like insights or
revenue for businesses.

o Example: Analyzing customer data to understand behavior and make personalized offers.

6. Visualization (Data Readable)

o What it means: Presenting data in an easy-to-understand way using graphs, charts,
and other visual tools.

o Example: Companies use charts to spot trends or patterns in their sales data.

7. Virality (Data Spread)

o What it means: How fast data or information spreads from one person to another,
often through social media.

o Example: A viral video that quickly spreads across the internet through social media
platforms.

These characteristics show what makes Big Data unique and challenging to manage but also very
powerful.

Challenges of Conventional Systems with Big Data

1. Volume of Data

o What it means: Data is growing rapidly from various sources like machines,
telecommunication, and sensors.

o Example: IBM estimated that the world's data volume would reach about 35
zettabytes by 2020. Managing such vast amounts of data is challenging.

2. Processing and Analyzing

o What it means: Handling and analyzing large amounts of data is difficult and
time-consuming.

o Example: Extracting meaningful insights from huge data sets requires significant time
and effort, and it can be expensive due to the complexity and different formats of
data.

3. Management of Data

o What it means: Data comes in various forms: structured (like databases),
semi-structured (like XML files), and unstructured (like emails or social media posts).

o Example: Managing and integrating these different types of data is complex and
requires sophisticated systems.

In essence, conventional systems struggle to keep up with the growing volume of data, the
complexity of processing and analyzing it, and the challenge of managing diverse data formats.

Types of Big Data

1. Unstructured Data

• What it is: Data that doesn’t have a predefined format or structure.

• Characteristics: Often large and complex, making it difficult to process and analyze.

• Examples: Search results from Google, social media posts, emails, images, and videos.

• Challenges: Hard to derive value from this raw, unstructured data without advanced tools
and techniques.

2. Structured Data

• What it is: Data that is organized in a fixed format and can be easily stored, accessed, and
processed.

• Characteristics: Data is well-defined and fits neatly into tables or spreadsheets.

• Examples: Employee records in a database (like a table with Employee_ID, Name, Gender,
etc.).

• Advantages: Easy to manage and analyze using traditional database systems and tools.

3. Semi-structured Data

• What it is: Data that combines elements of both structured and unstructured data.

• Characteristics: Contains tags or markers to separate data elements, but doesn’t fit into a
rigid structure.

• Examples: XML files with tags (like <name>, <age>, etc.), web logs, and transaction histories.

• Advantages: More flexible than structured data, but still organized enough to be useful.
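
The three types can be compared side by side with a small Python sketch. The sample records below are made up for illustration; only the Employee_ID/Name/Gender columns and the <name>/<age> tags come from the examples above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, and every row has the same fields.
structured = "Employee_ID,Name,Gender\n101,Asha,F\n102,Ravi,M\n"
rows = list(csv.DictReader(io.StringIO(structured)))
names_from_table = [row["Name"] for row in rows]

# Semi-structured: tags mark the fields, but records may differ in shape.
# The second employee has no <age> tag, and the data is still valid XML.
semi_structured = """
<employees>
  <employee><name>Asha</name><age>31</age></employee>
  <employee><name>Ravi</name></employee>
</employees>
""".strip()
root = ET.fromstring(semi_structured)
ages = [emp.findtext("age") for emp in root.findall("employee")]

# Unstructured: free text with no fields, so even a simple question
# needs text processing instead of a column or tag lookup.
unstructured = "Asha joined last year. Ravi says the new office is great!"
mentions_ravi = "ravi" in unstructured.lower()

print(names_from_table)
print(ages)
print(mentions_ravi)
```

The missing age shows up as `None` in the semi-structured case, whereas a fixed table schema would require a value (or an explicit empty cell) in every row.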

Differences Between Data Types

Factor | Structured Data | Semi-structured Data | Unstructured Data
Flexibility | Less flexible; fixed schema | More flexible; some structure and tags | Highly flexible; no predefined schema
Transaction Management | Matured techniques for handling transactions | Less mature; adapted from DBMS | No transaction management; no concurrency
Query Performance | Complex queries and joins are possible | Queries possible but less complex | Mainly text-based queries; less efficient
Technology | Based on relational databases | Based on XML, RDF | Based on text and character data

In summary, structured data is organized and easy to manage, semi-structured data offers some
flexibility with a bit of structure, and unstructured data is highly variable and challenging to process.

Intelligent Data Analysis (IDA) - Simple Explanation

What is IDA?

• IDA helps us find hidden patterns and useful information from large amounts of data. It uses
smart techniques to uncover insights that are not obvious at first glance.

Steps in IDA:

1. Data Preparation:

o What it means: Collect and clean the data you need from different sources.

o Example: If you're studying customer reviews, you collect all reviews and remove
any errors or irrelevant information.

2. Rules Finding or Data Mining:

o What it means: Look for patterns or rules in the cleaned data.

o Example: Discover that customers who buy running shoes often buy sports socks
too.

3. Result Validation and Explanation:

o What it means: Check if the patterns you found are accurate and explain them
clearly.

o Example: Confirm that your discovery about shoe and sock purchases is correct and
explain it in simple terms.
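
The rule-finding and validation steps can be sketched with the shoes-and-socks example. The code below is a minimal illustration using two standard association-rule measures, support and confidence; the baskets are made-up data, and the function names are illustrative.

```python
# Hypothetical shopping baskets (made-up data for illustration).
baskets = [
    {"running shoes", "sports socks"},
    {"running shoes", "sports socks", "water bottle"},
    {"running shoes"},
    {"sports socks"},
    {"water bottle"},
]


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


shoes, socks = {"running shoes"}, {"sports socks"}

# Rule finding: how often do shoes and socks appear together?
rule_support = support(shoes | socks, baskets)

# Validation: among shoe buyers, how many also bought socks?
rule_confidence = confidence(shoes, socks, baskets)

print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule is usually kept only if both numbers clear thresholds chosen by the analyst; a high confidence with very low support may just reflect a handful of baskets.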

IDA Process:

• Collect Data: Gather information from different places.

• Analyze Data: Use methods to find patterns or trends.

• Explain Results: Make sure the findings are accurate and easy to understand.

Where is IDA Used?

• Banking: To find fraud or manage risks.

• Media: To understand what content people like and improve advertisements.

• Healthcare: To predict illnesses and improve patient care.

How Does It Work?

• Machine Learning: Teaches computers to learn from data and make predictions.

• Deep Learning: Handles complex data and recognizes intricate patterns.

In short, Intelligent Data Analysis helps us turn lots of data into useful information, making it easier
to make decisions and understand trends.

Traditional Data vs. Big Data - Simple Explanation

1. Confidentiality & Data Accuracy:

• Traditional Data: Easier to manage confidentiality with access control rules.

• Big Data: More complex, needs special mechanisms to ensure data confidentiality and
accuracy.

2. Data Relationship:

• Traditional Data: Relationships between data are clear and stable.

• Big Data: Relationships are often unknown or constantly changing.

3. Data Storage Size:

• Traditional Data: Stored in gigabytes to terabytes.

• Big Data: Stored in petabytes to zettabytes (very large amounts of data).

4. Types of Data:

• Traditional Data: Mostly structured (stored in databases like tables).

• Big Data: Includes structured, semi-structured, and unstructured data (like text, images,
videos).

5. Flexibility:

• Traditional Data: Based on fixed schemas (data models don’t change easily).

• Big Data: Dynamic, adaptable to different types of data without fixed structures.

6. Real-Time Analytics:

• Traditional Data: Data is processed periodically (hourly, daily).

• Big Data: Data is processed in real-time (every second).

7. Distributed Architecture:

• Traditional Data: Managed centrally.

• Big Data: Managed in a distributed system (spread across multiple locations).
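
The difference between periodic and real-time processing (point 6) can be sketched in a few lines of Python. This is a toy illustration with made-up transaction amounts, not how a production streaming system is built; the function names are illustrative.

```python
def batch_total(transactions):
    """Batch style: wait until all records are collected, then process them once."""
    return sum(transactions)


def streaming_totals(transactions):
    """Streaming style: update the result as each record arrives."""
    running = 0
    for amount in transactions:
        running += amount
        yield running  # an up-to-date answer is available after every record


amounts = [20, 5, 75]  # hypothetical transaction amounts

batch_result = batch_total(amounts)
stream_results = list(streaming_totals(amounts))

print(batch_result)
print(stream_results)
```

Both styles reach the same final total; the streaming version simply has a current answer at every step instead of only at the end of the period.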

Key Differences Between Traditional Data and Big Data

Traditional Data | Big Data
Generated in enterprise systems (like ERP, CRM) | Generated from social media, sensors, etc.
Smaller volume (Gigabytes-Terabytes) | Larger volume (Petabytes-Zettabytes)
Deals with structured data | Deals with all types of data (structured, unstructured)
Centralized storage and management | Distributed storage and management
Easier to process | Requires special tools and processing methods
Schema is fixed and static | Schema is flexible and dynamic

Importance of Big Data:

• Big data helps organizations process and analyze massive amounts of information that
traditional systems can’t handle.

• By using big data, businesses can gain insights to improve decision-making and create value.

Case Study: Big Data Solutions (Easy Explanation)

Big Data helps companies handle huge amounts of data to improve their services and make smarter
decisions. Here's a simple case study to explain Big Data solutions.

E-Commerce Site XYZ

Situation: An online shopping site with 100 million users wants to:

• Give $100 vouchers to its top 10 customers who spent the most in the last year.

• Understand what these customers like to buy, so they can recommend similar products.

Problems:

• There’s a huge amount of customer data, and it’s difficult to store and analyze it all.

Solution:

1. Storage:

o Use Hadoop to store all the data across multiple computers (through its distributed
file system, HDFS). Hadoop can store a lot of data cheaply.

2. Processing:

o Use MapReduce to go through all the data and find the top 10 customers quickly.

3. Analysis:

o Use tools like Pig and Hive to figure out the buying trends of these customers.

4. Cost:

o Hadoop is open-source and free to use, and it runs on ordinary (commodity) hardware, so it costs relatively little to set up and run.
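
The processing step can be illustrated with a small MapReduce-style sketch in plain Python. It runs on one machine and only imitates the map, shuffle, and reduce phases that Hadoop would spread across many computers; the purchase records, the chunking, and the function names are all made up for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical (customer_id, amount) purchase records, already split into
# chunks the way a distributed file system would split a large file.
chunks = [
    [("c1", 120.0), ("c2", 40.0), ("c1", 30.0)],
    [("c3", 200.0), ("c2", 15.0)],
    [("c4", 5.0), ("c3", 50.0)],
]


def map_phase(chunk):
    """Map: emit one (customer_id, amount) pair per purchase record."""
    for customer_id, amount in chunk:
        yield customer_id, amount


def shuffle(pairs):
    """Shuffle: group all amounts that share the same customer_id."""
    grouped = defaultdict(list)
    for customer_id, amount in pairs:
        grouped[customer_id].append(amount)
    return grouped


def reduce_phase(grouped):
    """Reduce: add up each customer's amounts into a yearly total."""
    return {customer_id: sum(amounts) for customer_id, amounts in grouped.items()}


def top_customers(chunks, n):
    """Run map, shuffle, and reduce, then keep the n biggest spenders."""
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    totals = reduce_phase(shuffle(pairs))
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


top_two = top_customers(chunks, 2)
print(top_two)
```

For the case study, `n` would be 10 and each chunk would live on a different machine, so the map work happens in parallel before the totals are combined.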

Real-World Examples of Big Data Solutions

1. Walmart:

o Walmart uses Big Data to understand what products customers usually buy together.
With this information, they suggest related products to increase sales.

o They use tools like Hadoop to handle real-time data from their many stores around
the world.

2. Uber:

o Uber uses Big Data to track where their services are in high demand, adjusting prices
accordingly (surge pricing).

o This helps them make sure drivers are available where people need them most.

3. Netflix:

o Netflix uses Big Data to recommend shows and movies based on what users watch
and like. They even use this data to decide what new content to create.

o They use tools like Hadoop and Hive to analyze user data and improve
recommendations.

In simple terms, Big Data helps companies like Walmart, Uber, and Netflix understand customer
behavior, improve services, and make better business decisions.
