Bda (Chapter 1)
CHAPTER 1
What is Data? Data refers to the quantities, characters, or symbols that a computer uses to perform
operations. It can be stored and shared through electrical signals or different storage media like
magnetic, optical, or mechanical devices.
Where Does Data Come From? Data comes from various sources, such as documents, images, audio,
software programs, and more.
Computer Data as Information Computer data is any information processed or stored by a computer.
It includes text files, images, audio, or software. The computer’s CPU processes this data, and it’s
saved in files on the hard disk.
Definition of Big Data Big Data refers to an extremely large and growing collection of data that is
too large and complex to be handled by regular data management systems. While regular data is
typically measured in megabytes (MB) or gigabytes (GB), Big Data can reach petabytes (PB), where
one petabyte is 1,000,000,000,000,000 (10^15) bytes.
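To make the size difference concrete, here is a minimal Python sketch of the unit arithmetic. It assumes decimal (SI) units, where each unit is 1,000 times the previous one; the function names to_bytes and human_readable are made up for this illustration.

```python
# Decimal (SI) storage units; each one is 1,000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a size given in a decimal unit to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(to_bytes(1, "PB"))                       # 1000000000000000
print(human_readable(5 * 10**9))               # 5 GB
print(to_bytes(1, "PB") // to_bytes(1, "GB"))  # 1000000
```

So a single petabyte holds as much as one million gigabytes, which is why ordinary tools that work well at the gigabyte scale stop being practical.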
Interesting Fact It is said that 90% of the world's data has been created in just the past three years.
Weather stations and satellites: Produce massive amounts of data for forecasting.
Emails, blogs, and news websites: Continuously generate large data volumes.
Social media: Posts, photos, videos, likes, and comments contribute to Big Data.
Traffic data and GPS signals: Data from vehicles and maps.
Digital pictures and videos: Cameras and smartphones produce a huge amount of data.
Big Data has several key characteristics, often referred to as the 5 Vs:
1. Volume
o Example: Social media generates tons of posts, videos, and photos every second.
2. Velocity
3. Variety
o Definition: Data comes in different formats like text, images, videos, and numbers.
o Example: A single app might store user messages, photos, and videos all in different
formats.
4. Veracity
o Example: Social media posts may contain false information, which needs to be
filtered.
5. Value
These characteristics make Big Data challenging but also valuable for gaining insights.
1. Volume
o What it means: Big Data is huge. We're talking about data in terabytes or even
petabytes, not just megabytes or gigabytes.
o Example: The Internet of Things (IoT) generates enormous amounts of data, which
keeps growing.
2. Variety
o Example: Emails, social media posts, and videos all create different types of data that
need to be stored and analyzed.
3. Veracity
o What it means: This refers to the accuracy and trustworthiness of data. Large
volumes of data can sometimes be incomplete or inaccurate.
o Example: Social media posts may contain incorrect information, which makes it
difficult to ensure data quality.
4. Velocity
o What it means: The speed at which data is generated, processed, and made
accessible. It’s important for real-time data analysis.
o Example: Data from social media, sensors, and mobile devices is generated and
shared continuously at high speeds.
5. Value
o What it means: The goal is to turn raw data into something valuable, like insights or
revenue for businesses.
o Example: Companies use charts to spot trends or patterns in their sales data.
6. Virality
o What it means: How fast data or information spreads from one person to another,
often through social media.
o Example: A viral video that quickly spreads across the internet through social media
platforms.
These characteristics show what makes Big Data unique and challenging to manage but also very
powerful.
1. Volume of Data
o What it means: Data is growing rapidly from various sources like machines,
telecommunication, and sensors.
o Example: IBM estimated that the world's data volume would reach about 35
zettabytes by 2020. Managing such vast amounts of data is challenging.
2. Processing and Analysis of Data
o What it means: Handling and analyzing large amounts of data is difficult and
time-consuming.
o Example: Extracting meaningful insights from huge data sets requires significant time
and effort, and it can be expensive due to the complexity and different formats of
data.
3. Management of Data
o Example: Managing and integrating these different types of data is complex and
requires sophisticated systems.
In essence, conventional systems struggle to keep up with the growing volume of data, the
complexity of processing and analyzing it, and the challenge of managing diverse data formats.
1. Unstructured Data
Characteristics: Often large and complex, making it difficult to process and analyze.
Examples: Search results from Google, social media posts, emails, images, and videos.
Challenges: Hard to derive value from this raw, unstructured data without advanced tools
and techniques.
2. Structured Data
What it is: Data that is organized in a fixed format and can be easily stored, accessed, and
processed.
Examples: Employee records in a database (like a table with Employee_ID, Name, Gender,
etc.).
Advantages: Easy to manage and analyze using traditional database systems and tools.
3. Semi-structured Data
What it is: Data that combines elements of both structured and unstructured data.
Characteristics: Contains tags or markers to separate data elements, but doesn’t fit into a
rigid structure.
Examples: XML files with tags (like <name>, <age>, etc.), web logs, and transaction histories.
Advantages: More flexible than structured data, but still organized enough to be useful.
Property                 Structured                  Semi-structured         Unstructured
Transaction Management   Mature techniques for       Less mature; adapted    No transaction management;
                         handling transactions       from DBMS               no concurrency
Query Performance        Complex queries and joins   Queries possible but    Mainly text-based queries;
                         are possible                less complex            less efficient
In summary, structured data is organized and easy to manage, semi-structured data offers some
flexibility with a bit of structure, and unstructured data is highly variable and challenging to process.
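The three data types described above can be illustrated with a short Python sketch. The sample records (names, ages, and the review text) are invented for this example; only the Employee_ID, Name, and Gender columns and the <name> and <age> tags come from the notes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, like a database table (hypothetical sample rows).
table = io.StringIO("Employee_ID,Name,Gender\n101,Asha,F\n102,Ravi,M\n")
rows = list(csv.DictReader(table))
names = [row["Name"] for row in rows]

# Semi-structured: tags mark each field, but records need not share a layout.
xml_text = """
<people>
  <person><name>Asha</name><age>31</age></person>
  <person><name>Ravi</name></person>
</people>
""".strip()
root = ET.fromstring(xml_text)
ages = [person.findtext("age") for person in root.findall("person")]

# Unstructured: free text; without extra tools only text search is available.
post = "Loved the new running shoes, but delivery took ages!"
mentions_delivery = "delivery" in post.lower()

print(names)              # ['Asha', 'Ravi']
print(ages)               # ['31', None]
print(mentions_delivery)  # True
```

Every structured row has the same fields. The second XML record simply has no age, which the tags tolerate. The free-text post has no fields at all, so any meaning has to be extracted by searching or by more advanced analysis.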
What is Intelligent Data Analysis (IDA)?
Intelligent Data Analysis (IDA) helps us find hidden patterns and useful information in large
amounts of data. It uses techniques such as machine learning to uncover insights that are not
obvious at first glance.
Steps in IDA:
1. Data Preparation:
o What it means: Collect and clean the data you need from different sources.
o Example: If you're studying customer reviews, you collect all reviews and remove
any errors or irrelevant information.
2. Pattern Discovery:
o Example: Discover that customers who buy running shoes often buy sports socks
too.
3. Validation and Explanation:
o What it means: Check if the patterns you found are accurate and explain them
clearly.
o Example: Confirm that your discovery about shoe and sock purchases is correct and
explain it in simple terms.
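The shoe-and-sock example above can be checked with two simple measures, support and confidence, which are commonly used to describe "bought together" patterns. The notes do not name a specific method, so this is only a minimal sketch; the baskets are made-up data that is assumed to be already cleaned.

```python
# Hypothetical, already-cleaned shopping baskets (data preparation is assumed done).
baskets = [
    {"running shoes", "sports socks"},
    {"running shoes", "sports socks", "water bottle"},
    {"running shoes"},
    {"sports socks"},
    {"water bottle"},
]


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


shoes, socks = {"running shoes"}, {"sports socks"}
print(support(shoes | socks, baskets))              # 0.4
print(round(confidence(shoes, socks, baskets), 2))  # 0.67
```

In plain terms: 40% of all baskets contain both items, and about two out of three shoe buyers also bought socks. Stating the pattern with numbers like these is one way to confirm it and explain it simply.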
IDA Process:
Explain Results: Make sure the findings are accurate and easy to understand.
Machine Learning: Teaches computers to learn from data and make predictions.
In short, Intelligent Data Analysis helps us turn lots of data into useful information, making it easier
to make decisions and understand trends.
Big Data: More complex, needs special mechanisms to ensure data confidentiality and
accuracy.
2. Data Relationship:
4. Types of Data:
Big Data: Includes structured, semi-structured, and unstructured data (like text, images,
videos).
5. Flexibility:
Traditional Data: Based on fixed schemas (data models don’t change easily).
Big Data: Dynamic, adaptable to different types of data without fixed structures.
6. Real-Time Analytics:
7. Distributed Architecture:
Big data helps organizations process and analyze massive amounts of information that
traditional systems can’t handle.
By using big data, businesses can gain insights to improve decision-making and create value.
Case Study: Big Data Solutions (Easy Explanation)
Big Data helps companies handle huge amounts of data to improve their services and make smarter
decisions. Here's a simple case study to explain Big Data solutions.
Situation: An online shopping site with 100 million users wants to:
Give $100 vouchers to its top 10 customers who spent the most in the last year.
Understand what these customers like to buy, so they can recommend similar products.
Problems:
There’s a huge amount of customer data, and it’s difficult to store and analyze it all.
Solution:
1. Storage:
o Use Hadoop to store all the data across multiple computers. Hadoop can store a lot
of data cheaply.
2. Processing:
o Use MapReduce to go through all the data and find the top 10 customers quickly.
3. Analysis:
o Use tools like Pig and Hive to figure out the buying trends of these customers.
4. Cost:
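The processing step in the solution above (finding the top customers with MapReduce) can be sketched in plain Python. This is not Hadoop code: it only imitates the map, shuffle, and reduce phases on one machine, and the order records and the function names are invented for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical order records: (customer_id, amount_spent).
orders = [("c1", 120.0), ("c2", 80.0), ("c1", 30.0), ("c3", 200.0), ("c2", 45.0)]


def map_phase(records):
    """Map: emit one (customer_id, amount) pair per order."""
    for customer_id, amount in records:
        yield customer_id, amount


def shuffle(pairs):
    """Shuffle: group all amounts that belong to the same customer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: add up the spending of each customer."""
    return {customer: sum(amounts) for customer, amounts in groups.items()}


def top_customers(totals, n):
    """Return the n customers with the highest total spending, highest first."""
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


totals = reduce_phase(shuffle(map_phase(orders)))
print(top_customers(totals, 2))  # [('c3', 200.0), ('c1', 150.0)]
```

For the case study, n would be 10. On a real cluster the map and reduce steps would run in parallel on many computers, each working on its own part of the data.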
1. Walmart:
o Walmart uses Big Data to understand what products customers usually buy together.
With this information, they suggest related products to increase sales.
o They use tools like Hadoop to handle real-time data from their many stores around
the world.
2. Uber:
o Uber uses Big Data to track where their services are in high demand, adjusting prices
accordingly (surge pricing).
o This helps them make sure drivers are available where people need them most.
3. Netflix:
o Netflix uses Big Data to recommend shows and movies based on what users watch
and like. They even use this data to decide what new content to create.
o They use tools like Hadoop and Hive to analyze user data and improve
recommendations.
In simple terms, Big Data helps companies like Walmart, Uber, and Netflix understand customer
behavior, improve services, and make better business decisions.