BI Module 2
To analyze data properly, we need to check whether the data is ready for analysis. Below are some important
qualities (or metrics) that define whether the data is good for an analytics study:
1. Data Source Reliability
This checks whether the data comes from a trusted and original source.
If data is copied or moved through multiple steps, it might get changed or lost, affecting its accuracy.
It is always best to use data from the original source to avoid errors.
2. Data Content Accuracy
The data should be correct and match what is needed for the analysis.
Example: A customer's phone number in the database should be exactly what the customer provided.
3. Data Accessibility
The data should be easy to retrieve when it is needed for the study.
4. Data Security and Privacy
The data should be protected from unauthorized access while staying available to the people who need it.
5. Data Richness
The data should contain all the variables and detail needed to answer the question being studied.
6. Data Consistency
Data from different sources should be combined correctly without mixing things up.
Example: If we merge medical records and patient contact details, we must not accidentally assign the wrong contact
details to a patient.
7. Data Currency / Timeliness
The data should be up to date at the time of the analysis.
8. Data Granularity
The data should be at the right level of detail for the analysis (e.g., daily sales vs. yearly totals).
9. Data Validity
The data values should fall within the defined, acceptable range or format.
10. Data Relevancy
The data should be directly related to the problem being analyzed.
Summary
For a good data analytics study, the data should be reliable, accurate, accessible, secure, rich, consistent,
timely, detailed, valid, and relevant. Checking these qualities ensures that the analysis gives the best and
most useful results.
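As a small illustration of checking accuracy and validity in practice, here is a hedged pandas sketch; the table, the column names, and the NNN-NNNN phone format are assumptions made for this example, not part of the notes.

    import pandas as pd

    # Hypothetical customer table; all names and values are illustrative
    df = pd.DataFrame({
        "customer": ["A", "B", "C"],
        "phone": ["555-0101", "55501", "555-0199"],
    })

    # Validity check: does each phone match the assumed NNN-NNNN format?
    valid = df["phone"].str.fullmatch(r"\d{3}-\d{4}")
    print(df[~valid])  # rows that break the validity rule (customer B)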
Types of Data
1. Categorical Data
Data that represents labels or categories rather than measured quantities. Nominal and ordinal
data are its two subtypes.
2. Nominal Data
A type of categorical data where there is no ranking or order.
Examples:
o Marital Status: Single, Married, Divorced
o Eye Color: Brown, Blue, Green
o Yes/No, True/False choices
3. Ordinal Data
A type of categorical data where there is a ranking or order, but the difference between
values is not measurable.
Examples:
o Credit Score: Low, Medium, High
o Education Level: High School, Bachelor's, Master's, PhD
4. Numeric Data
Data that represents measurable quantities expressed as numbers. Interval and ratio data are its
two subtypes.
5. Interval Data
A type of numeric data where differences between values are meaningful, but there is
no true zero.
Example: Temperature in Celsius or Fahrenheit (0°C doesn’t mean “no temperature”).
6. Ratio Data
A type of numeric data where differences between values are meaningful, and there is
a true zero.
Examples: Height, Weight, Distance, Time.
Temperature in Kelvin is a ratio data type because 0 K means "no heat."
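To make the nominal vs. ordinal distinction concrete, here is a short, hedged pandas sketch (pandas and the sample values are illustrative assumptions, not part of the notes). An ordered categorical supports comparisons such as min/max; an unordered one does not.

    import pandas as pd

    # Nominal: categories with no order (eye color)
    eye_color = pd.Categorical(["brown", "blue", "green"], ordered=False)

    # Ordinal: categories with a defined order (a rating scale)
    rating = pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    )

    print(rating.min(), rating.max())  # "low high" -- order is meaningful
    # eye_color.min() would raise a TypeError because no order is defined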
Big Data
Big Data is often explained using three main "V"s: Volume, Variety, and Velocity. Over
time, more "V"s have been added, such as Veracity, Variability, and Value Proposition.
About 80-85% of company data is unstructured, but it is still valuable for decision-making.
If data is not processed quickly, it loses its value. Real-time analytics (data stream analytics) helps companies make
fast decisions.
Data preprocessing
Real-world data is often messy, incomplete, and unstructured. Before using it for analysis, we need to clean and
organize it. This process is called data preprocessing, and it is a crucial step in data analytics. It involves four
main phases: data consolidation, data cleaning, data transformation, and data reduction. Common
transformation techniques include:
Normalization: Data is scaled to a common range to avoid bias (e.g., large values like income should not dominate small
values like years of experience).
Discretization: Continuous data is converted into categories (e.g., age 18-30 = "young", 31-50 = "middle-aged").
Aggregation: Groups similar values together to reduce complexity.
Feature Engineering: New useful variables are created from existing ones.
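Below is a minimal pandas sketch of these transformation techniques; the dataset, column names, and bin edges are assumptions chosen only for illustration.

    import pandas as pd

    # Hypothetical raw data (columns and values are made up)
    df = pd.DataFrame({
        "income": [30000, 85000, 52000, 120000],
        "years_experience": [2, 10, 5, 20],
        "age": [22, 45, 33, 58],
    })

    # Normalization: min-max scaling puts both columns on a 0-1 range,
    # so large incomes do not dominate small experience values
    for col in ["income", "years_experience"]:
        df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    # Discretization: continuous age becomes categories
    df["age_group"] = pd.cut(df["age"], bins=[17, 30, 50, 120],
                             labels=["young", "middle-aged", "senior"])

    # Aggregation: one average income per age group reduces complexity
    summary = df.groupby("age_group", observed=True)["income"].mean()

    # Feature engineering: a new variable derived from existing ones
    df["income_per_year_of_experience"] = df["income"] / df["years_experience"]

    print(df)
    print(summary)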
Hadoop
What is Hadoop?
Hadoop is an open-source system that helps store, process, and analyze huge amounts of data. It
was created by Doug Cutting at Yahoo! and is now managed by the Apache Software
Foundation.
Instead of using one powerful computer to process big data, Hadoop splits the data into smaller
parts and processes them on multiple machines at the same time. This makes it faster and more
efficient.
How does Hadoop work?
1. Data comes from different sources like log files, social media, and internal records.
2. Hadoop stores this data using Hadoop Distributed File System (HDFS).
3. The data is divided into multiple parts and stored across different computers (nodes).
4. Each part is copied multiple times so that if one machine fails, the data is still safe.
5. The data parts are then processed in parallel on the machines where they are stored, and the
results are combined.
This method of breaking down tasks and working on them in parallel makes Hadoop powerful
and efficient.
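As a toy illustration of the storage idea (not the real HDFS code), here is a sketch that splits data into fixed-size blocks and assigns each block to several nodes; the block size, node names, and round-robin placement are all simplifying assumptions.

    import itertools

    BLOCK_SIZE = 16    # bytes per block here; real HDFS defaults to 128 MB
    REPLICATION = 3    # each block is copied to 3 nodes, HDFS's default
    NODES = ["node1", "node2", "node3", "node4", "node5"]

    def store(data: bytes):
        """Toy model: map each block of data to REPLICATION distinct nodes."""
        placement = {}
        node_cycle = itertools.cycle(range(len(NODES)))
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = i // BLOCK_SIZE
            start = next(node_cycle)
            placement[block_id] = [NODES[(start + r) % len(NODES)]
                                   for r in range(REPLICATION)]
        return placement

    # If one node fails, every block still has two other copies
    print(store(b"log line 1\nlog line 2\nlog line 3\n"))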
What is MapReduce?
MapReduce is a programming model developed by Google to process very large data sets
efficiently. It is used inside Hadoop to handle big data.
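The classic way to see the model is a word count. The single-process sketch below imitates the three MapReduce steps (map, shuffle, reduce); in real Hadoop these same steps run distributed across many machines.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document
        for word in document.lower().split():
            yield (word, 1)

    def shuffle(pairs):
        # Shuffle: group all values that share the same key
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        # Reduce: combine each key's values into one result
        return {word: sum(counts) for word, counts in grouped.items()}

    docs = ["big data is big", "data moves fast"]
    pairs = [p for doc in docs for p in map_phase(doc)]
    print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 2, ...}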
Hadoop vs. Spark
Hadoop and Spark are both big data technologies, but they work in different ways. Here's how they compare:
1. Performance
Spark is faster because it processes data in memory (RAM), avoiding slow disk operations.
Hadoop is slower because it reads and writes data to a hard drive, making it less efficient for real-time tasks.
2. Cost
Hadoop is cheaper since it works with regular hard drives and doesn’t need much RAM.
Spark is more expensive because it requires a lot of RAM to process data quickly in real-time.
3. Parallel Processing
Hadoop is better for batch processing (processing large data in chunks). It works well for tasks that don’t require instant
results.
Spark is better for real-time processing (analyzing live data as it comes in). It’s great for streaming data from sources like
social media or sensors.
4. Scalability
Hadoop scales easily when data grows because of HDFS (Hadoop Distributed File System), which spreads data across
multiple machines.
Spark also scales well but still depends on HDFS for handling very large data.
5. Security
Hadoop is more secure because it has strong authentication and access control features.
Spark has basic security but can be combined with Hadoop to improve it.
6. Analytics and Machine Learning
Spark is better for analytics because it has MLlib, a built-in machine learning library.
It can handle tasks like regression, classification, and model evaluation faster than Hadoop.
Choose Hadoop if you need cheap storage and batch processing for large data.
Choose Spark if you need real-time data analysis, speed, and machine learning capabilities.
Both technologies can work together for better performance and security!
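For a feel of MLlib, here is a minimal PySpark sketch that trains a classifier; it assumes pyspark is installed, and the tiny dataset and column names are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # Tiny hypothetical dataset: two numeric features and a binary label
    df = spark.createDataFrame(
        [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.5, 3.1, 1), (0.5, 0.3, 0)],
        ["f1", "f2", "label"],
    )

    # MLlib expects the features packed into a single vector column
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(df)

    # Fit a logistic regression model and inspect its predictions
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()

    spark.stop()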
What is NoSQL?
NoSQL (short for "Not Only SQL") is a new type of database designed to handle huge amounts of data in a
flexible way. Unlike traditional databases, NoSQL can work with different types of data (structured, semi-
structured, and unstructured).
Sometimes, NoSQL and Hadoop are used together. For example, HBase, a popular NoSQL database, runs on
Hadoop’s HDFS (Hadoop Distributed File System). This allows quick lookups of data stored in Hadoop.
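To show what such a lookup can look like, here is a hedged sketch using the happybase Python client for HBase; it assumes an HBase Thrift server running on localhost and a pre-created 'patients' table with an 'info' column family, all of which are illustrative.

    import happybase

    # Connect to HBase via its Thrift gateway (assumed to run on localhost)
    connection = happybase.Connection("localhost")
    table = connection.table("patients")  # assumed to exist already

    # Write one row under a row key, then read it back instantly
    table.put(b"patient:1001", {b"info:name": b"Alice", b"info:city": b"Pune"})
    row = table.row(b"patient:1001")
    print(row[b"info:name"])  # b'Alice'

    connection.close()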
Challenges of NoSQL
NoSQL databases sacrifice some traditional database features to improve speed and scalability:
They don’t fully follow ACID (Atomicity, Consistency, Isolation, Durability) rules, which ensure data accuracy in traditional
databases.
Many NoSQL databases lack proper tools for management and monitoring.
However, the open-source community and companies are working to improve these issues.
Stream Analytics
Stream analytics is a way of analyzing data that is constantly being created and updated in real time. It is also
called real-time data analytics or data-in-motion analytics. Instead of analyzing data that has been stored for
a long time, stream analytics focuses on making quick decisions based on live data.
A stream is a continuous flow of data. Each piece of data in a stream is called a tuple, which is similar to a row
in a database. However, in cases where a single tuple doesn’t provide enough information, multiple tuples are
grouped together in a window for better analysis.
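Here is a small sketch of the window idea: each incoming value plays the role of a tuple, and a sliding window over the last few tuples is analyzed together. The window size, threshold, and readings are made-up values for illustration.

    from collections import deque
    from statistics import mean

    WINDOW_SIZE = 3
    window = deque(maxlen=WINDOW_SIZE)  # oldest tuple falls out automatically

    def on_new_tuple(value):
        window.append(value)
        if len(window) == WINDOW_SIZE:
            avg = mean(window)  # analyze the whole window, not one tuple
            if avg > 100:       # react in real time, e.g. raise an alert
                print(f"ALERT: rolling average {avg:.1f} exceeds threshold")

    # Simulated live stream of sensor readings
    for reading in [90, 95, 110, 120, 130, 80]:
        on_new_tuple(reading)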
1. e-Commerce
Websites like Amazon and eBay track customer behavior in real time. Every click, search, or product view is
analyzed instantly to suggest better product recommendations and deals. This increases sales by converting
casual visitors into buyers.
2. Telecommunications
Telecom companies collect huge amounts of data from customer calls and messages. By analyzing this data in
real time, they can:
Predict customer behavior – Identify customers who might stop using their services.
Understand customer networks – Identify influencers who affect others’ choices.
Improve marketing – Combine call data with social media trends to create better campaigns.
3. Law Enforcement and Cyber Security
Crime prevention – Analyzing video surveillance, social media, and online activity.
Cybersecurity – Detecting and stopping online threats, hacking, and fraud in real time.
4. Power Industry
Smart meters and sensors in power grids send real-time data to electricity providers. This helps them balance
electricity supply with demand and spot outages as they happen.
5. Financial Services
Stream analytics helps financial companies make fast decisions by analyzing stock market trends. It is also used
to detect fraudulent transactions the moment they occur.
6. Health Sciences
Hospitals use real-time data from medical devices to monitor patients and detect health issues early. Examples
include tracking vital signs from bedside monitors and alerting staff when readings turn abnormal.
7. Government Services