
BI Module 2

To analyze data properly, we need to check if the data is ready for analysis. Below are some important
qualities (or metrics) that define whether the data is good for an analytics study:

1. Data Source Reliability

 This checks if the data comes from a trusted and original source.
 If data is copied or moved through multiple steps, it might get changed or lost, affecting its accuracy.
 It’s always best to use data from the original source to avoid errors.

2. Data Content Accuracy

 The data should be correct and match what is needed for the analysis.
 Example: A customer’s phone number in the database should be exactly what the customer provided.

3. Data Accessibility

 The data should be easy to access when needed.


 If data is stored in multiple places or different formats, it can be difficult to retrieve and use.
 New technologies like data lakes and Hadoop help in making data more accessible.

4. Data Security and Privacy

 Only authorized people should be able to access the data.


 Sensitive data, like medical records, should be protected from unauthorized access (e.g., by laws like HIPAA).

5. Data Richness

 The data should have enough details to be useful for analysis.


 Example: If we are studying customer behavior, we should have details like age, purchase history, and location to get a
clear picture.

6. Data Consistency

 Data from different sources should be combined correctly without mixing things up.
 Example: If we merge medical records and patient contact details, we must not accidentally assign the wrong contact
details to a patient.

7. Data Timeliness (or Data Currency)

 The data should be recent and updated.


 If we use old data, the analysis might not be accurate or useful.
 Example: A company using last year’s sales data to predict next month’s trends might not get the best results.

8. Data Granularity (Level of Detail)

 Data should be recorded at the right level of detail.


 Example: A hospital should record a patient’s test results with exact decimal values, instead of rounding them off.
 Once data is summarized (aggregated), it cannot be broken down into its original detailed form.

9. Data Validity

 Data values should match what is expected or allowed.


 Example: If we collect gender data, valid values could be Male, Female, or Other. If someone enters "XYZ," it would be
invalid.
10. Data Relevancy

 The data should be important and useful for the study.


 Example: If we are studying customer buying behavior, including weather data might not be relevant.
 Irrelevant data can make the analysis confusing and misleading.

Summary

For a good data analytics study, the data should be reliable, accurate, accessible, secure, rich, consistent,
timely, detailed, valid, and relevant. Checking these qualities ensures that the analysis gives the best and
most useful results.
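A minimal sketch of how a few of these qualities (validity, completeness, consistency, and timeliness) could be checked in Python with pandas. The table and column names here are made up purely for illustration:

import pandas as pd

# Hypothetical customer table used only for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "gender": ["Male", "Female", "XYZ", "Other"],
    "phone": ["555-0101", None, "555-0102", "555-0103"],
    "last_updated": pd.to_datetime(["2025-01-05", "2023-02-10",
                                    "2025-03-01", "2022-11-20"]),
})

# Validity: values must come from the allowed set.
valid_genders = {"Male", "Female", "Other"}
invalid_rows = customers[~customers["gender"].isin(valid_genders)]

# Accuracy / completeness: count missing phone numbers.
missing_phones = customers["phone"].isna().sum()

# Consistency: duplicate customer IDs left over from a bad merge.
duplicate_ids = customers["customer_id"].duplicated().sum()

# Timeliness: flag records not updated in the last 12 months as stale.
cutoff = pd.Timestamp.now() - pd.DateOffset(months=12)
stale = customers[customers["last_updated"] < cutoff]

print(len(invalid_rows), missing_phones, duplicate_ids, len(stale))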

A simple taxonomy of data

1. Categorical Data (Discrete Data)


 Represents labels or groups.
 Examples: Gender (Male/Female), Age Group (Child/Teen/Adult), Education Level (High
School/College).
 Even if numbers are used (like 1 = Male, 2 = Female), they are just symbols and don't
have numerical meaning.

2. Nominal Data
 A type of categorical data where there is no ranking or order.
 Examples:
o Marital Status: Single, Married, Divorced
o Eye Color: Brown, Blue, Green
o Yes/No, True/False choices

3. Ordinal Data
 A type of categorical data where there is a ranking or order, but the difference between
values is not measurable.
 Examples:

o Credit Score: Low, Medium, High


o Education Level: High School, College, Graduate School
o Age Group: Child, Young Adult, Middle-aged, Senior

4. Numeric Data (Continuous Data)


 Represents measurable values that can have fractions.
 Examples: Age, Income, Temperature, Travel Distance.
 Two types:

o Integer Data – Whole numbers (e.g., number of children).


o Real Data – Can have decimal points (e.g., height, weight).

5. Interval Data
 A type of numeric data where differences between values are meaningful, but there is
no true zero.
 Example: Temperature in Celsius or Fahrenheit (0°C doesn’t mean “no temperature”).

6. Ratio Data
 A type of numeric data where differences between values are meaningful, and there is
a true zero.
 Examples: Height, Weight, Distance, Time.
 Temperature in Kelvin is a ratio data type because 0 K means “no heat.”
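A short illustration (in Python with pandas; the column names are invented) of how nominal, ordinal, and numeric data can be declared so that analysis tools treat each type correctly:

import pandas as pd

df = pd.DataFrame({
    "eye_color": ["Brown", "Blue", "Green"],       # nominal: no order
    "credit_score": ["Low", "High", "Medium"],     # ordinal: ordered labels
    "temperature_c": [21.5, 19.0, 23.2],           # interval: no true zero
    "distance_km": [12.0, 0.0, 4.5],               # ratio: true zero exists
})

# Nominal data: categorical with no ordering.
df["eye_color"] = pd.Categorical(df["eye_color"])

# Ordinal data: categorical with an explicit ranking.
df["credit_score"] = pd.Categorical(
    df["credit_score"], categories=["Low", "Medium", "High"], ordered=True
)

# Ordered categories can be compared, but the differences between them
# are still not measurable, which is what makes them ordinal.
print(df["credit_score"].max())   # High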

3.6 Keys to Success with Big Data Analytics

1. Clear Business Purpose


 Big Data should help the business, not just be used for the sake of technology.
 The main goal should be solving business problems, whether at a strategic, tactical, or operational
level.
2. Strong Leadership Support
 A successful Big Data project needs support from top executives.
 If the project is small, department-level support may be enough.
 For company-wide changes, leaders across the organization must support and promote it.

3. Business and IT Working Together


 The business goals should drive the use of analytics, not the other way around.
 IT and business teams must align their strategies to make analytics useful.

4. Decision-Making Based on Facts


 Decisions should be based on data and analytics, not just guesses or gut feelings.
 A culture of testing and experimenting should be encouraged.
 To create this culture, senior management should:

o Support data-driven decisions.


o Stop using outdated methods.
o Encourage employees to use analytics when making choices.
o Offer rewards for using data correctly.

5. A Strong Data Infrastructure


 Traditional data warehouses store business data, but new Big Data technologies are making them
better.
 Companies need a modern system that combines both old and new technologies.
 Since Big Data is large and complex, advanced computing techniques are required to process it quickly
and efficiently. These techniques are known as high-performance computing.
3.5 Big Data Definition
The "V"s That Define Big Data

Big Data is often explained using three main "V"s: Volume, Variety, and Velocity. Over
time, more "V"s have been added, such as Veracity, Variability, and Value Proposition.

1. Volume (Amount of Data)


 Big Data means huge amounts of data.
 This data comes from sources like social media, sensors, GPS, and business transactions.
 Storing large data was once a problem, but now storage is cheaper. The main challenge is finding useful information
in massive data.
 Data volume is growing fast—what was once measured in petabytes (PB) is now in zettabytes (ZB) (1 ZB = 1 trillion
GB).

2. Variety (Different Types of Data)

 Data comes in many formats, not just organized tables.


 Examples:

o Structured Data (databases, spreadsheets)


o Semi-structured Data (emails, XML, JSON)
o Unstructured Data (videos, images, social media posts)

 About 80-85% of company data is unstructured, but it is still valuable for decision-making.

3. Velocity (Speed of Data)


 Big Data is generated very fast and needs quick processing.
 Examples:

o Stock market prices change every second.


o Social media updates happen instantly.
o Sensors in smart devices collect data in real-time.

 If data is not processed quickly, it loses its value. Real-time analytics (data stream analytics) helps companies make
fast decisions.

4. Veracity (Accuracy & Trustworthiness)


 Big Data is not always accurate or reliable.
 Incorrect or low-quality data can lead to bad decisions.
 Tools and techniques are used to clean, filter, and verify data for accuracy.
5. Variability (Changing Data Patterns)
 Data flow is inconsistent—sometimes huge amounts come in suddenly.
 Example:

o A celebrity tweet can make a product trend suddenly.


o A big event (like a major sale or product launch) can cause a spike in online traffic.

 Businesses must handle these fluctuations efficiently.

6. Value Proposition (Importance & Benefits)


 The real reason for Big Data is the value it provides.
 Analyzing large datasets helps companies find patterns, predict trends, and make better decisions.
 Example:

o E-commerce sites analyze customer behavior to recommend products.


o Healthcare providers use Big Data to predict diseases.

 Big Data leads to better insights and smarter decision-making.

Data preprocessing

The Art and Science of Data Preprocessing

Real-world data is often messy, incomplete, and unstructured. Before using it for analysis, we need to clean and
organize it. This process is called data preprocessing, and it is a crucial step in data analytics. It involves four
main phases:

1. Data Collection & Integration (Gathering and Combining Data)

 Data is collected from different sources.


 Only useful data is selected, and unnecessary details are removed.
 Data from multiple sources is merged carefully to avoid duplication or errors.
 This merging process is called data blending.
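A minimal sketch of the data blending described above, using pandas; the patient tables are hypothetical:

import pandas as pd

# Hypothetical source tables used only to illustrate data blending.
medical = pd.DataFrame({"patient_id": [101, 102, 103],
                        "diagnosis": ["A", "B", "C"]})
contacts = pd.DataFrame({"patient_id": [101, 102, 104],
                         "phone": ["555-0101", "555-0102", "555-0104"]})

# Merge on a reliable key so contact details are never assigned to the
# wrong patient, and drop exact duplicates created by overlapping sources.
blended = (medical.merge(contacts, on="patient_id", how="left")
                  .drop_duplicates())
print(blended)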

2. Data Cleaning (Fixing Errors and Missing Data)

 Data is often incomplete or incorrect (dirty data).


 Missing values are either filled with the most likely value or ignored.
 Outliers (unusual values) are identified and corrected.
 Inconsistent values are adjusted using expert knowledge.
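A small sketch of the cleaning steps above in Python with pandas; the "dirty" values are invented:

import pandas as pd
import numpy as np

# Hypothetical dirty data: a missing age, a missing income, and an
# impossible age of 200.
df = pd.DataFrame({"age": [25, np.nan, 31, 200, 28],
                   "income": [40000, 52000, np.nan, 61000, 45000]})

# Outliers / impossible values: use domain knowledge (an age above 120
# cannot be real), so treat it as missing before imputing.
df.loc[df["age"] > 120, "age"] = np.nan

# Missing values: fill with the most likely value (here, the column median).
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())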

3. Data Transformation (Adjusting Data for Better Use)

 Normalization: Data is scaled to a common range to avoid bias (e.g., large values like income should not dominate small
values like years of experience).
 Discretization: Continuous data is converted into categories (e.g., age 18-30 = "young", 31-50 = "middle-aged").
 Aggregation: Groups similar values together to reduce complexity.
 Feature Engineering: New useful variables are created from existing ones.
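A brief sketch of these transformations in Python with pandas (the numbers and bin edges are arbitrary examples):

import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 63],
                   "income": [30000, 58000, 72000, 41000]})

# Normalization: scale income to the 0-1 range so its large values do not
# dominate a small-range variable like age.
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())

# Discretization: convert continuous age into categories.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle-aged", "senior"])

# Feature engineering: derive a new variable from existing ones.
df["income_per_year_of_age"] = df["income"] / df["age"]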

4. Data Reduction (Making Data Manageable)

 Too much data can slow down processing.


 Dimensionality Reduction: Reducing the number of variables while keeping important information (e.g., using Principal
Component Analysis (PCA)).
 Sampling: Selecting a smaller, representative set of records instead of using the full dataset.
 Balancing Data: If data is skewed (e.g., more data for one category than another), we adjust it by oversampling (adding
more of the smaller category) or undersampling (reducing the larger category).
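A compact sketch of these reduction techniques, assuming scikit-learn is available; the dataset is randomly generated for illustration:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical dataset: 20 numeric columns and a heavily skewed label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 20)),
                 columns=[f"x{i}" for i in range(20)])
y = pd.Series(rng.choice(["common", "rare"], size=1000, p=[0.95, 0.05]))

# Dimensionality reduction: keep 5 principal components instead of 20 columns.
X_reduced = PCA(n_components=5).fit_transform(X)

# Sampling: work with a 10% representative subset of the records.
sample = X.sample(frac=0.10, random_state=0)

# Balancing: oversample the rare class until it matches the common class.
rare_rows = X[y == "rare"]
oversampled_rare = rare_rows.sample(n=(y == "common").sum(),
                                    replace=True, random_state=0)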

Hadoop

What is Hadoop?
Hadoop is an open-source system that helps store, process, and analyze huge amounts of data. It
was created by Doug Cutting at Yahoo! and is now managed by the Apache Software
Foundation.

Instead of using one powerful computer to process big data, Hadoop splits the data into smaller
parts and processes them on multiple machines at the same time. This makes it faster and more
efficient.

How Does Hadoop Work?

Data Collection & Storage

1. Data comes from different sources like log files, social media, and internal records.
2. Hadoop stores this data using Hadoop Distributed File System (HDFS).
3. The data is divided into multiple parts and stored across different computers (nodes).
4. Each part is copied multiple times so that if one machine fails, the data is still safe.

Processing the Data (MapReduce Framework)

1. Step 1: "Map" Job

1. A request (query) is sent to find specific information in the data.


2. Hadoop distributes the job to different machines.
3. Each machine processes its assigned data separately.

2. Step 2: "Reduce" Job


1. The results from all machines are collected and combined to get the final answer.
2. This data is then stored for further analysis.

This method of breaking down tasks and working on them in parallel makes Hadoop powerful
and efficient.

What is MapReduce?

MapReduce is a programming model developed by Google to process very large data sets
efficiently. It is used inside Hadoop to handle big data.
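A toy illustration of the map/reduce idea in plain Python (this is not Hadoop itself, just the pattern): the "map" step emits key-value pairs from each piece of data independently, and the "reduce" step combines all values that share a key.

from collections import defaultdict

documents = ["big data needs big storage",
             "spark and hadoop process big data"]

# "Map" job: each document is processed separately and emits (word, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# "Reduce" job: pairs with the same key are grouped and their counts summed,
# producing the final answer.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # e.g. {'big': 3, 'data': 2, ...}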

Comparison Between Hadoop and Spark

Hadoop and Spark are both big data technologies, but they work in different ways. Here’s how they compare:

1. Performance ⚡

 Spark is faster because it processes data in memory (RAM), avoiding slow disk operations.
 Hadoop is slower because it reads and writes data to a hard drive, making it less efficient for real-time tasks.

2. Cost
 Hadoop is cheaper since it works with regular hard drives and doesn’t need much RAM.
 Spark is more expensive because it requires a lot of RAM to process data quickly in real-time.

3. Parallel Processing

 Hadoop is better for batch processing (processing large data in chunks). It works well for tasks that don’t require instant
results.
 Spark is better for real-time processing (analyzing live data as it comes in). It’s great for streaming data from sources like
social media or sensors.

4. Scalability

 Hadoop scales easily when data grows because of HDFS (Hadoop Distributed File System), which spreads data across
multiple machines.
 Spark also scales well but still depends on HDFS for handling very large data.

5. Security

 Hadoop is more secure because it has strong authentication and access control features.
 Spark has basic security but can be combined with Hadoop to improve it.

6. Analytics & Machine Learning

 Spark is better for analytics because it has MLlib, a built-in machine learning library.
 It can handle tasks like regression, classification, and model evaluation faster than Hadoop.
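As an illustration of the kind of workflow MLlib supports, here is a minimal PySpark sketch (the data, column names, and app name are made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 3.5, 0), (2.0, 1.0, 1), (0.5, 4.0, 0), (3.0, 0.5, 1)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Fit a logistic regression model; Spark keeps the data in memory while training.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients)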

Which One Should You Choose?

 Choose Hadoop if you need cheap storage and batch processing for large data.
 Choose Spark if you need real-time data analysis, speed, and machine learning capabilities.

Both technologies can work together for better performance and security!


NoSQL: A New Type of Database

NoSQL (short for "Not Only SQL") is a new type of database designed to handle huge amounts of data in a
flexible way. Unlike traditional databases, NoSQL can work with different types of data (structured, semi-
structured, and unstructured).

How NoSQL is Different from Hadoop

 Hadoop is great for analyzing large amounts of historical data in batches.


 NoSQL is designed for fast access to specific pieces of data from large datasets, making it useful for real-time applications.

How NoSQL and Hadoop Work Together

Sometimes, NoSQL and Hadoop are used together. For example, HBase, a popular NoSQL database, runs on
Hadoop’s HDFS (Hadoop Distributed File System). This allows quick lookups of data stored in Hadoop.

Challenges of NoSQL

NoSQL databases sacrifice some traditional database features to improve speed and scalability:
 They don’t fully follow ACID (Atomicity, Consistency, Isolation, Durability) rules, which ensure data accuracy in traditional
databases.
 Many NoSQL databases lack proper tools for management and monitoring.

However, the open-source community and companies are working to improve these issues.

Popular NoSQL Databases

There are several types of NoSQL databases, including:

 HBase (works with Hadoop)


 Cassandra (good for high-speed data processing)
 MongoDB (popular for web applications)
 DynamoDB (Amazon’s NoSQL database)
 CouchDB, Riak, Accumulo (other widely used options)
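As a small example of the flexible, fast-lookup style NoSQL is built for, here is a sketch using MongoDB through the pymongo driver (the connection string, database name, and documents are assumptions for illustration):

from pymongo import MongoClient

# Connect to a local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents in the same collection can have different fields (flexible schema).
db.orders.insert_one({"customer": "A101", "total": 49.50,
                      "items": ["keyboard", "mouse"]})
db.orders.insert_one({"customer": "B202", "total": 12.00,
                      "coupon": "WELCOME10"})

# Fast lookup of one specific document, the real-time access pattern
# NoSQL databases are designed for.
print(db.orders.find_one({"customer": "A101"}))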


What is Stream Analytics?

Stream analytics is a way of analyzing data that is constantly being created and updated in real time. It is also
called real-time data analytics or data-in-motion analytics. Instead of analyzing data that has been stored for
a long time, stream analytics focuses on making quick decisions based on live data.

A stream is a continuous flow of data. Each piece of data in a stream is called a tuple, which is similar to a row
in a database. However, in cases where a single tuple doesn’t provide enough information, multiple tuples are
grouped together in a window for better analysis.
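A toy sketch of windowed stream processing in plain Python: tuples arrive one at a time, a small sliding window holds the most recent ones, and a decision is made immediately (the sensor readings and threshold are invented):

from collections import deque

# Each tuple in the stream: (timestamp, sensor reading).
stream = [(1, 20.1), (2, 20.4), (3, 35.9), (4, 20.2), (5, 20.3)]

window = deque(maxlen=3)   # keep only the 3 most recent readings

for ts, value in stream:
    window.append(value)
    rolling_avg = sum(window) / len(window)
    # React right away if the rolling average crosses a threshold.
    if rolling_avg > 25:
        print(f"t={ts}: alert, rolling average {rolling_avg:.1f} is unusually high")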

Stream analytics is becoming more important because:

1. Speed is crucial – Businesses and organizations need to act quickly.


2. Technology has improved – We now have better tools to collect and analyze data in real time.

Where is Stream Analytics Used?

1. e-Commerce (Online Shopping Websites)

Websites like Amazon and eBay track customer behavior in real time. Every click, search, or product view is
analyzed instantly to suggest better product recommendations and deals. This increases sales by converting
casual visitors into buyers.

2. Telecommunications (Mobile and Internet Companies)

Telecom companies collect huge amounts of data from customer calls and messages. By analyzing this data in
real time, they can:

 Predict customer behavior – Identify customers who might stop using their services.
 Understand customer networks – Identify influencers who affect others’ choices.
 Improve marketing – Combine call data with social media trends to create better campaigns.

3. Law Enforcement and Cybersecurity


Law enforcement agencies use stream analytics for:

 Crime prevention – Analyzing video surveillance, social media, and online activity.
 Cybersecurity – Detecting and stopping online threats, hacking, and fraud in real time.

4. Power and Energy Industry

Smart meters and sensors in power grids send real-time data to electricity providers. This helps them:

 Predict power demand – Adjust supply based on electricity usage trends.


 Optimize renewable energy – Use weather data to decide when to increase or decrease solar/wind energy production.
 Prevent outages – Detect issues before they cause blackouts.

5. Financial Services (Banking and Stock Markets)

Stream analytics helps financial companies make fast decisions by analyzing stock market trends. It is also used
to:

 Detect fraud – Identify unusual transactions that may indicate fraud.


 Improve trading strategies – Analyze market trends in real time to maximize profits.

6. Healthcare and Medical Fields

Hospitals use real-time data from medical devices to monitor patients and detect health issues early. Examples
include:

 Detecting heart problems – Using real-time ECG data to alert doctors.


 Monitoring patients remotely – Devices that send live health data to hospitals.

7. Government Services

Governments use stream analytics for:

 Disaster management – Tracking storms, wildfires, and floods in real time.


 Traffic control – Using GPS and camera data to adjust traffic lights and reduce congestion.
 Environmental monitoring – Checking water and air quality to detect pollution early.
