Unit 1 Big Data

The document provides an extensive overview of Big Data, including its definition, types, history, architecture, and importance in modern analytics. It discusses the evolution of Big Data technologies, the drivers behind its growth, and the various platforms and tools used for data management and analysis. Additionally, it highlights the differences between OLTP and OLAP systems, as well as the components of a Big Data platform.

Uploaded by

n69659205

BIG DATA (BCS-061)

UNIT-1 : INTRODUCTION TO BIG DATA

Contents
Introduction
Types of digital data
History of Big Data Innovation
Introduction to Big Data platform
Drivers for Big Data
Big Data architecture
Characteristics and 5 Vs of Big Data
Big Data technology components
Big Data Importance and applications
Big Data features
Challenges of conventional systems
Intelligent data analysis
Nature of data
Analytic processes and tools
Analysis vs reporting
Modern data analytic tools
What is Big Data?

Definition: Big Data is a collection of data that is huge in volume and still growing
exponentially with time. Its size and complexity are so great that traditional data
management tools cannot store or process it efficiently. In short, Big Data is still data,
only at a very large scale.
Big Data is used in machine learning, predictive modeling, and other advanced analytics
to solve business problems and make informed decisions.
Example: Social Media
One widely quoted statistic is that more than 500 terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is generated mainly by
photo and video uploads, message exchanges, comments, and similar activity.
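To make the scale concrete, the 500 terabytes per day figure can be converted into a sustained ingest rate. The short calculation below is only a back-of-the-envelope sketch: it assumes decimal (SI) terabytes and a perfectly even load across the day, and the function name is made up for illustration.

```python
# Back-of-the-envelope: what does 500 TB/day mean as a sustained rate?
TB = 10**12                      # decimal terabyte in bytes (assumption: SI units)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def sustained_rate_gb_per_s(terabytes_per_day: float) -> float:
    """Average ingest rate in gigabytes per second, assuming an even load."""
    bytes_per_second = terabytes_per_day * TB / SECONDS_PER_DAY
    return bytes_per_second / 10**9


rate = sustained_rate_gb_per_s(500)
print(f"{rate:.2f} GB/s")  # about 5.79 GB/s
```

Real traffic is bursty, so peak rates would be higher than this average.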
Types of Digital Data

1. Structured Data (10-15%): This is data stored in a fixed format, typically relational tables, so it is easy to access and process.
Examples include: financial transactions, customer records (CRM), enterprise databases (SQL, data warehouses).
Traditional enterprises rely on structured data, but its share is shrinking due to the rise of unstructured data.

2. Semi-structured Data (20-25%): This is a hybrid of structured and unstructured data. It has no fixed schema, but tags or keys describe its fields.
Examples include: JSON, XML, and data held in NoSQL databases (MongoDB, Cassandra); emails and log files; IoT sensor data (structured metadata but unstructured content).

3. Unstructured Data (60-65%): This is data of different formats, such as document files, multimedia files, images, backup files, social media posts, audio files, and open-ended customer comments.
Examples include: social media data (posts, images, videos), multimedia (videos, audio files, images), IoT-generated sensor data, medical imaging (X-rays, MRIs).
This is the dominant form of Big Data, driven by multimedia, social media, and IoT devices.
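The difference between the three types shows up in how a program has to read them. The sketch below uses only the Python standard library, and all sample values are invented for illustration: structured data parses into fixed columns, semi-structured data is navigated by keys that may or may not be present, and unstructured text needs some form of interpretation (here a naive keyword check).

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = "id,name,amount\n1,Asha,250\n2,Ravi,400\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total = sum(int(row["amount"]) for row in rows)

# Semi-structured: self-describing keys, but fields may be nested or absent.
semi = '{"user": "asha", "tags": ["sale", "new"], "device": {"os": "android"}}'
record = json.loads(semi)
os_name = record.get("device", {}).get("os")   # tolerate a missing "device" key

# Unstructured: free text with no schema; meaning must be inferred.
unstructured = "Loved the phone, but the battery drains too fast!"
mentions_battery = "battery" in unstructured.lower()

print(total, os_name, mentions_battery)
```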
History of Big Data Innovation
1. Early Foundations (1960s–1980s): Birth of Data Management

2. Rise of the Internet & Distributed Computing (1990s)

1990s: Data generation increases with enterprise computing, web applications, and online transactions.
Mid-to-late 1990s: The term "Big Data" comes into use; it is commonly credited to John Mashey at SGI, referring to the
challenges of large-scale data processing.
1998: Google is founded, eventually revolutionizing data indexing and search.
1998: The PageRank algorithm (Google) ranks search results using link analysis.

3. Big Data Boom & Open Source Technologies (2000s)

2001: Doug Laney defines the "3Vs" of Big Data (Volume, Velocity, Variety).
2003–2004: Google publishes the GFS (Google File System) and MapReduce frameworks, enabling
large-scale distributed processing.
2006: Hadoop is born as an open-source implementation of Google’s GFS and MapReduce designs, with much of the early
development done at Yahoo!.
2007–2009: NoSQL databases (Cassandra, MongoDB) emerge, handling semi-structured and unstructured
data.
4. Cloud Computing & AI-Driven Big Data (2010s)
2009–2010: Apache Spark is developed at UC Berkeley and open-sourced, offering faster, in-memory Big Data analytics than Hadoop MapReduce.
2012: Big Data reaches mainstream recognition, with companies adopting data-driven strategies.
2014: Deep Learning (AI) revolutionizes data analysis, leveraging GPUs for large-scale processing.
2015: Cloud-based Big Data platforms (AWS, Google BigQuery, Azure) gain adoption.
2016: Streaming analytics (Apache Kafka, Flink) enable real-time data processing.
2018: Edge Computing grows, allowing IoT devices to process data closer to the source.

5. AI-Integrated Big Data & Future Trends (2020s & Beyond)

2020: COVID-19 accelerates AI-driven Big Data applications in healthcare, supply chain
management, and remote work analytics.
2022: AutoML and AI-augmented analytics democratize data science, reducing reliance on human
expertise.
2023-Present: Quantum computing, federated learning, and blockchain are shaping the next
wave of secure and decentralized Big Data processing.
SOURCES TOWARDS DATA SCIENCE
Big data refers to storing, managing, and processing large volumes of data, while data
science focuses on analyzing and interpreting data to gain insights and make informed
decisions.
Introduction to Big Data platform
A Big Data platform is an approach to data management that combines servers, Big Data tools,
and analytical and machine-learning capabilities in a single cloud platform, both for managing
data and for delivering real-time insights.
The Big Data platform workflow is divided into the following stages:
1. Data Collection
2. Data Storage
3. Data Processing
4. Data Analytics
5. Data Management and Warehousing
6. Data Catalog and Metadata Management
7. Data Observability
8. Data Intelligence

Note: Stream computing is a way to analyze and process Big Data in real time, to gain
current insights for making appropriate decisions or to predict new trends in the immediate future.
Need for a Big Data Platform
A Big Data platform provides users with efficient analytics tools specifically designed for
handling massive datasets.
Data engineers often use these platforms to aggregate, clean, and
prepare data for insightful business analysis.
Data scientists, on the other hand, use the platform to uncover
valuable relationships and patterns within large datasets using advanced
machine learning algorithms.
Furthermore, users have the flexibility to build custom applications
tailored to their specific use cases, such as calculating customer loyalty in
the e-commerce industry, among countless other possibilities.
Different Types of Big Data Platforms and Tools
Big Data platforms are judged on four properties, abbreviated S, A, P, and S: Scalability, Availability,
Performance, and Security. Various tools are responsible for managing the hybrid data of IT systems. The
platforms are listed below:
Hadoop- Delta Lake Migration Platform
Hadoop is an open-source software platform managed by the Apache Software Foundation. It is used to
collect and store large data sets cheaply and efficiently.
Note: Delta Lake is an open-source storage layer that keeps table data in the Parquet file format and
helps bring reliability to data lakes. It provides ACID transactions, scalable metadata handling,
and unified streaming and batch data processing.
Data Catalog and Data Observability Platform
It provides a single self-service environment that helps users find, understand, and
trust the data source. It also helps users discover new data sources, if any. Seeing and
understanding data sources are the initial steps for registering the sources. Users search the
data catalog tools and filter the results based on their needs. In enterprises, a data
lake serves Business Intelligence teams, Data Scientists, and ETL Developers, all of whom need the
correct data. They use catalog discovery to find the data that fits their needs.
Data Ingestion and Integration Platform

This layer is the first step in the journey of data arriving from various sources. Here the data is
prioritized and categorized, so that it flows smoothly through the later layers of the process.
Big Data and IoT Analytics Platform
It provides a wide range of tools to work with, which is particularly useful in IoT use cases.
Data Discovery and Management Platform
A data mesh introduces the concept of a self-serve data platform to avoid duplication of effort. Data
engineers set up technologies so that all business units can process and store their data products.
Cloud ETL Data Transformation Platform
This platform can be used to build data transformation pipelines and to schedule their execution.
What is a Data Warehouse?
A data warehouse (DW or DWH) is a centralized repository that stores large volumes of
structured data collected from multiple sources. It is designed specifically for querying,
analysis, and reporting, rather than transactional processing.
Data warehouses help organizations store, process, and analyze historical data to support
business intelligence (BI) and decision-making.
Key Characteristics of a Data Warehouse
Subject-Oriented
Data is organized around business subjects like sales, finance, marketing, or customers.
Integrated
Combines data from multiple sources (databases, cloud storage, CRM, IoT, etc.).
Time-Variant (Historical Data)
Stores historical data over time for trend analysis.
Non-Volatile
Data remains unchanged once entered, ensuring consistent analytics.
How a Data Warehouse Works
Extract – Collects data from various sources (databases, APIs, cloud, etc.).
Transform – Cleans, formats, and structures data for analysis.
Load (ETL/ELT Process) – Stores processed data in the warehouse.
Query & Analysis – Uses SQL queries, OLAP, or BI tools to generate reports.
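The Extract, Transform, Load, and Query steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a production pipeline: the standard-library sqlite3 module stands in for the warehouse, and the source records, table name, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from an operational source.
raw_orders = [
    {"order_id": "1", "region": " north ", "amount": "120.50"},
    {"order_id": "2", "region": "South", "amount": "80"},
    {"order_id": "3", "region": "north", "amount": "n/a"},  # bad record
]


def extract():
    """Extract: collect records from the source (here, an in-memory list)."""
    return list(raw_orders)


def transform(records):
    """Transform: cast types, normalise text, and drop unparseable rows."""
    clean = []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append((int(r["order_id"]), r["region"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: store the processed rows in the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse
load(conn, transform(extract()))

# Query & Analysis: an aggregate report over the loaded data.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)
```

In an ELT variant, the raw rows would be loaded first and the cleaning done inside the warehouse with SQL.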
Types of Data Warehouses
Enterprise Data Warehouse (EDW)
A central repository for the entire organization.
Example: Amazon Redshift, Snowflake, Google BigQuery.
Operational Data Store (ODS)
A near-real-time system used for operational reporting.
Example: Healthcare, banking transactions.
Data Mart
A smaller, department-specific warehouse (subset of EDW).
Example: Marketing Data Mart, Sales Data Mart.
OLAP stands for Online Analytical Processing.
OLAP systems can analyze database information from multiple systems at the same time.
The primary goal of an OLAP service is data analysis, not data processing.

OLTP stands for Online Transaction Processing.
OLTP systems administer the day-to-day transactions of an organization.
The main goal of OLTP is data processing, not data analysis.
Difference between OLTP & OLAP
Feature          | OLTP                                                 | OLAP
Purpose          | Transaction processing                               | Data analysis and reporting
Data Type        | Current, real-time                                   | Historical, aggregated
Operations       | CRUD (Create, Read, Update, Delete)                  | Complex reads and aggregations
Normalization    | Highly normalized                                    | Denormalized
Processing Speed | Fast, low latency                                    | Slower, but optimized for queries
Queries          | Simple, fast queries                                 | Complex, multi-dimensional queries
Data Volume      | Small to medium                                      | Large (TB to PB)
Users            | Operational staff                                    | Business analysts, executives
Examples         | Banking, e-commerce, CRM                             | Data warehouses, BI tools
Use Case         | Banking transactions, e-commerce orders, CRM systems | Business intelligence, data mining, trend analysis
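The contrast between the two workloads can be illustrated with a small sketch. It uses the standard-library sqlite3 module purely for convenience (real OLTP and OLAP systems are usually separate products), and the tables, account data, and `transfer` helper are hypothetical. The transfer is an OLTP-style operation: a short transaction touching a few rows. The final query is OLAP-style: a read-only aggregation over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("CREATE TABLE transfers (src INTEGER, dst INTEGER, amount REAL, day TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)", [(1, "asha", 500.0), (2, "ravi", 300.0)]
)
conn.commit()


def transfer(src, dst, amount, day):
    """OLTP-style: a small, atomic unit of work on current data."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)", (src, dst, amount, day))


transfer(1, 2, 100.0, "2024-01-01")
transfer(2, 1, 40.0, "2024-01-01")
transfer(1, 2, 10.0, "2024-01-02")

# OLAP-style: aggregate historical data to answer an analytical question.
daily_totals = conn.execute(
    "SELECT day, COUNT(*), SUM(amount) FROM transfers GROUP BY day ORDER BY day"
).fetchall()
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(daily_totals, balances)
```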
Components of Big Data Platform
1. Data Ingestion, Integration and ETL – Provides the resources for effective
data management and data warehousing, treating data as a valuable resource.
2. Stream Computing – Processes streaming data for real-time analytics.
3. Big Data Analytics Platform / Machine Learning – Provides analytics and
machine-learning tools, with MLOps features, for advanced analytics.
4. Data Integration and Warehouse – Lets users integrate data from any source
with ease.
5. Data Governance – Provides comprehensive security, data governance, and data
protection solutions.
6. Provides Accurate Data – Delivers analytic tools that help filter out
inaccurate data that has not been analyzed, so the business can make the
right decisions using accurate information.
7. Cloud Data Warehouse for Scalability – Scales with ever-growing data volumes
to keep analysis efficient, and offers scalable storage capacity.
8. Data Discovery Platform for Price Optimization – Data analytics on a big data
platform gives B2C and B2B enterprises insights that help them optimize the
prices they charge.
9. Data Observability – With the warehouse, analytics tools, and efficient data
transformation in place, it helps reduce data latency and provide high throughput.
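Of the components above, stream computing is the one whose mechanics differ most from batch processing: each event updates a running result as it arrives, instead of the whole data set being re-read. A minimal sketch of that idea is a sliding-window average. The class name and sample values are made up for illustration, and real stream engines add distribution, fault tolerance, and time-based windows on top.

```python
from collections import deque


class SlidingAverage:
    """Keeps the average of the most recent `window_size` values."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0.0

    def update(self, value):
        """Consume one event and return the current windowed average."""
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # oldest value is about to be evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


avg = SlidingAverage(window_size=3)
results = [avg.update(v) for v in [10, 20, 30, 40]]
print(results)  # [10.0, 15.0, 20.0, 30.0]
```

Each update costs constant time regardless of how many events have been seen, which is what makes this style suitable for real-time analytics.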
Drivers for Big Data

Big Data emerged in the last decade from a combination of business needs and
technology innovations. A number of companies that put Big Data at the core of their
strategy became very successful at the beginning of the 21st century. Famous
examples include Apple, Amazon, Facebook and Netflix. A number of business drivers are
at the core of this success and explain why Big Data has quickly risen to become one of
the most sought-after topics in the industry. Six main business drivers can be identified:
1. The digitization of society;
2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT)
1. The digitization of society
Big data is largely consumer driven and consumer oriented. Most people now spend 4 to 6 hours per
day consuming and generating data through a variety of devices and applications. Some studies
estimate that 60% of data was generated within the last three years, which is a good indication
of the rate at which society has digitized.
*Example: The adoption of electronic Bills of Lading (eBL) will enable the trade industry to
benefit from faster transactions, cost savings, and lowered fraud risks.
2. The plummeting of technology costs
The costs of data storage and processors keep declining, making it
possible for small businesses and individuals to become involved
with Big Data.
Besides the plummeting of the storage costs, a second key
contributing factor to the affordability of Big Data has been the
development of open source Big Data software frameworks.
The most popular software framework is Apache Hadoop, for
distributed storage and processing.
Because these frameworks are freely available as open source,
it has become increasingly inexpensive to start Big Data
projects in organizations.
3. Connectivity through cloud computing
Cloud computing environments have made it possible to quickly scale IT infrastructure up or
down and to adopt a pay-as-you-go model. Instead of buying their own infrastructure,
organizations can license the storage and processing capacity they need and pay only for the
amounts they actually use. As a result, most Big Data solutions leverage the possibilities of
cloud computing to deliver their solutions.
4. Increased knowledge about data science

The demand for data scientists has increased tremendously, and many people have become
actively engaged in the domain of data science.
5. Social media applications
Social media data provides insights into the behaviors, preferences and opinions of ‘the public’
on a scale that has never been known before.
Due to this, it is immensely valuable to anyone who is able to derive meaning from these large
quantities of data.
Social media data can be used to identify customer preferences for product development, target
new customers for future purchases, or even target potential voters in elections.
Social media data might even be considered one of the most important business drivers of Big Data.
6. The upcoming Internet-of-Things (IoT)

The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances and
other items embedded with electronics, software, sensors, actuators, and network connectivity,
which enables these objects to connect and exchange data.
It is increasingly gaining popularity as consumer goods providers start including ‘smart’
sensors in household appliances.
Big Data architecture
It is a comprehensive system for processing vast amounts of data.
The Big Data architecture framework lays out the blueprint for the
solutions and infrastructure needed to handle big data,
depending on an organization’s needs.
It clearly defines the components of big data analytics,
the layers to be used, and the flow of information.
Its reference points are ingesting, processing, storing, managing,
accessing, and analyzing the data.
A typical Big Data architecture framework has the following layers.
Big Data Architecture Layers
There are four main layers in a Big Data architecture:
1. Data Ingestion
This layer is responsible for collecting data from various sources. In Big Data, data ingestion is the
process of extracting data from various sources and loading it into a data repository. Data
ingestion is a key component of a Big Data architecture because it determines how data will be ingested,
transformed, and stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and preparing the data for
analysis. This layer is critical for ensuring that the data is high quality and ready to be used in the future.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily accessed and
analyzed. This layer is essential for ensuring that the data is accessible and available to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data that humans
can easily understand. This layer is important for making the data accessible.
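The four layers can be pictured as functions chained in order, each consuming the output of the previous one. The sketch below is a toy, single-process illustration only: the event data and function names are invented, a `Counter` stands in for the storage layer, and a text bar chart stands in for a visualization tool.

```python
from collections import Counter


def ingest(sources):
    """Layer 1: collect raw events from every source into one list."""
    return [event for source in sources for event in source]


def process(events):
    """Layer 2: clean the data (drop empty values, normalise case)."""
    return [e.strip().lower() for e in events if e and e.strip()]


def store(events):
    """Layer 3: keep the data in an easily queried form (counts per key)."""
    return Counter(events)


def visualize(counts):
    """Layer 4: render a text bar chart, one row per key."""
    return [f"{key:<8}{'#' * n}" for key, n in sorted(counts.items())]


web_logs = ["Login", "search ", "login"]   # hypothetical source 1
app_logs = ["", "SEARCH", "login"]         # hypothetical source 2
chart = visualize(store(process(ingest([web_logs, app_logs]))))
print("\n".join(chart))
```

Keeping each layer behind its own function mirrors the architectural point: a layer can be replaced (for example, swapping the in-memory store for a distributed one) without changing the others.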
Characteristics of Big Data (5 Vs of Big Data)

Big Data is a volume of data too large to be handled by traditional data storage or
processing units. Many multinational companies use it to process their data and run
their business. The data flow is estimated to exceed 150 exabytes per day before
replication.
Five V's explain the characteristics of Big Data.

1. Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, handles approximately a billion messages, records about 4.5 billion clicks
of the "Like" button, and receives more than 350 million new posts each day. Big data technologies
can handle such large amounts of data.
2. Variety
Data can be structured, unstructured, or semi-structured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these
days it arrives in many forms, such as PDFs, emails, audio, social media posts, photos, and videos.

3. Veracity
Veracity means how reliable the data is. Because much of the data is noisy, there must be ways to
filter or translate it so that it can be handled and managed efficiently. Reliable data is
essential in business development.
For example, Facebook posts with hashtags.

4. Value
Value is an essential characteristic of big data. What matters is not
simply the data that we process or store, but the valuable and reliable
data that we store, process, and analyze.

5. Velocity
Velocity plays an important role compared to the others. Velocity
is the speed at which data is created, often in real time. It
covers the speed of incoming data sets, the rate of change,
and bursts of activity. A primary aim of Big Data is to deliver
the data that is in demand rapidly.
Big Data Technology Components

We can categorize the leading big data technologies into the following four sections:
Data Storage
Data Mining
Data Analytics
Data Visualization
1. Data storage

Big data technology that deals with data storage has the capability to fetch, store, and
manage big data.
It is made up of infrastructure that allows users to store the data so that it is convenient
to access.
Most data storage platforms are compatible with other programs.
Two commonly used tools are Apache Hadoop and MongoDB.
Example 1- Apache Hadoop
Apache Hadoop is one of the most widely used big data tools.
It is an open-source software platform that stores and processes big data
in a distributed computing environment across hardware clusters.
This distribution allows for faster data processing.
The framework is designed to be fault-tolerant and scalable, and to
process all data formats.

Example 2- MongoDB
•MongoDB is a NoSQL database that can be used to store large
volumes of data.
•MongoDB stores data as documents made up of key-value pairs (a basic
unit of data) and groups documents into collections.
•It is written mainly in C++ and is one of the most
popular big data databases because it can manage and store
unstructured data with ease.
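The document-and-collection model described above can be imitated with plain Python dictionaries and lists. To be clear, this sketch does not use MongoDB or its driver: `add_document` and `find_documents` are hypothetical helpers written only to show why a schema-free document store handles records with different fields without any table redesign.

```python
# A "database" as a dict of collections; each collection is a list of documents.
collections = {"customers": []}


def add_document(collection, document):
    """Store a document (a dict of key-value pairs) in a collection."""
    collections[collection].append(document)


def find_documents(collection, **criteria):
    """Return documents whose fields match every given key-value pair."""
    return [
        doc
        for doc in collections[collection]
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Documents in the same collection need not share the same fields.
add_document("customers", {"name": "Asha", "city": "Pune", "orders": [101, 102]})
add_document("customers", {"name": "Ravi", "city": "Delhi", "email": "ravi@example.com"})

pune_customers = find_documents("customers", city="Pune")
print(pune_customers)
```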
2. Data mining

Data mining extracts the useful patterns and trends from the raw data.
Big data technologies such as Rapidminer and Presto can turn unstructured and structured data
into usable information.
Example 1- Rapidminer
Rapidminer is a data mining tool that can be used to build predictive models.
It has two main strengths: processing and preparing data, and
building machine learning and deep learning models.
Its end-to-end model allows both functions to drive impact across the
organization.
Example 2- Presto
Presto is an open-source query engine that was originally developed by Facebook
to run analytic queries against their large datasets.
It is now widely available. One query on Presto can combine data from multiple
sources within an organization and perform analytics on them in a matter of
minutes.
3. Data analytics

In big data analytics, technologies are used to clean and transform data into information that can
be used to drive business decisions.
This next step (after data mining) is where users apply algorithms, models, and predictive
analytics using tools such as Apache Spark and Splunk.
Example 1- Apache Spark
•Spark is a popular big data tool for data analysis because it is fast and efficient at
running applications.
•It is often faster than Hadoop MapReduce because it keeps intermediate data in random access
memory (RAM) instead of writing it to disk between batch stages.
•Spark supports a wide variety of data analytics tasks and queries.
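The MapReduce model that Spark is compared against can be sketched in plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce) using the classic word-count task; it does not use Hadoop or Spark, and the function names are only illustrative. In a real cluster each phase runs in parallel across machines, and Hadoop MapReduce writes the intermediate pairs to disk, which is the overhead that in-memory engines aim to avoid.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```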
Example 2- Splunk
•Splunk is another popular big data analytics tool for deriving insights from large
datasets.
•It has the ability to generate graphs, charts, reports, and dashboards.
•Splunk also enables users to incorporate artificial intelligence (AI) into data
outcomes.
4. Data visualization

Finally, big data technologies can be used to create stunning visualizations from the data.
In data-oriented roles, data visualization is a skill that is beneficial for presenting
recommendations to stakeholders for business profitability and operations—to tell an
impactful story with a simple graph.
Example 1- Tableau
Tableau is a very popular tool in data visualization because its drag-and-
drop interface makes it easy to create pie charts, bar charts, box plots,
Gantt charts, and more.
It is a secure platform that allows users to share visualizations and
dashboards in real time.
Example 2- Looker
Looker is a business intelligence (BI) tool used to make sense of big data
analytics and then share those insights with other teams.
Charts, graphs, and dashboards can be configured with a query, such as
monitoring weekly brand engagement through social media analytics.
Big Data Applications

Large companies use Big Data for business growth. By analyzing this data,
useful decisions can be made in various cases, as discussed below:
1. Tracking Customer Spending Habits and Shopping Behavior:
In big retail stores (like Amazon, Walmart, Big Bazar etc.), the management team has to keep
data on customers’ spending habits, shopping behavior, and most-liked products.
Based on which products are searched for or sold most, the production or stocking rate
of those products is set.
2. Recommendation:
By tracking customers’ spending habits and shopping behavior, big retail stores provide
recommendations to the customer.
E-commerce sites like Amazon, Walmart, and Flipkart make product recommendations.
They track what products a customer searches for and, based on that data, recommend that
type of product to the customer.
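One simple way to turn tracked behavior into recommendations is to count which products appear together in past purchases. The sketch below shows that co-occurrence idea only; it is not how any of the retailers named above actually implement recommendations, and the basket data and `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical purchase history: one set of products per past order.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]


def recommend(product, baskets, top_n=2):
    """Suggest the products most often bought together with `product`."""
    co_counts = Counter()
    for basket in baskets:
        if product in basket:
            co_counts.update(basket - {product})
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(co_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _count in ranked[:top_n]]


print(recommend("phone", baskets))
print(recommend("laptop", baskets))
```

At Big Data scale the same counting would be distributed across many machines, but the logic per basket stays this simple.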

3. Smart Traffic System:

Data about traffic conditions on different roads is collected through cameras placed
beside the road and at the entry and exit points of the city, and through GPS devices placed
in vehicles (Ola, Uber cabs, etc.).
All such data is analyzed, and jam-free or less congested, faster routes are
recommended.
In this way a smart traffic system can be built in the city through Big Data analysis. A further
benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System:
Sensors are present at various places in an aircraft (such as the propeller).
These sensors capture data such as the speed of the flight, moisture, temperature, and other
environmental conditions.
Based on analysis of this data, environmental parameters within the aircraft are set and
adjusted.

5. Auto Driving Car:

Big data analysis helps drive a car without human intervention.
Cameras and sensors placed at various spots on the car gather data such as the size of
surrounding cars, obstacles, and the distance to them.
This data is analyzed, and various calculations are carried out, such as what angle to turn,
what the speed should be, and when to stop. These calculations help the car take action
automatically.
6. Virtual Personal Assistant Tool:
Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana
on Windows, and Google Assistant on Android) answer the various questions
asked by users.
The tool tracks the user's location, local time, the season, and other data related to the
question asked. By analyzing all such data, it provides an answer.

7. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data.
By analyzing such data, it can be predicted how long a machine will work without any
problem and when it will require repair, so that the company can take action before the
machine develops serious issues or breaks down completely. Thus, the cost of replacing the
whole machine can be saved.
In the healthcare field, Big Data is making a significant contribution.
Using big data tools, data about patient experience is collected and used by doctors
to give better treatment.
8. Education Sector:
Organizations that run online educational courses use big data to find candidates
interested in those courses.

9. Energy Sector:
Smart electric meters read the power consumed every 15 minutes and send the readings to
a server, where the data is analyzed to estimate the times of day when the
power load across the city is lowest.
Using this system, manufacturing units and householders are advised to run their heavy
machines at night, when the power load is low, so that they benefit from a lower
electricity bill.
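The analysis described for smart meters amounts to aggregating the 15-minute readings by hour and picking the hour with the lowest total. A minimal sketch follows; the readings are invented sample values for one meter on one day, and a real system would aggregate across all meters in the city.

```python
from collections import defaultdict

# Hypothetical readings: (hour, minute, kWh consumed in that 15-minute interval).
readings = [
    (2, 0, 0.10), (2, 15, 0.12), (2, 30, 0.11), (2, 45, 0.09),
    (13, 0, 0.30), (13, 15, 0.35), (13, 30, 0.30), (13, 45, 0.25),
    (19, 0, 0.60), (19, 15, 0.65), (19, 30, 0.70), (19, 45, 0.55),
]


def hourly_load(readings):
    """Sum the 15-minute readings into a total load per hour of the day."""
    totals = defaultdict(float)
    for hour, _minute, kwh in readings:
        totals[hour] += kwh
    return dict(totals)


load = hourly_load(readings)
off_peak_hour = min(load, key=load.get)   # hour with the lowest total load
peak_hour = max(load, key=load.get)       # hour with the highest total load
print(off_peak_hour, peak_hour)
```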
10. Media and Entertainment Sector:
Media and entertainment companies like Netflix, Amazon Prime, and
Spotify analyze data collected from their users.
Data such as what types of video and music users watch and listen to most, and how long users
spend on the site, is collected and analyzed to set the next business strategy.
Importance of Big Data

1- The importance of big data doesn’t revolve around the amount of data a company
has.
2- Companies in the present market need to collect and analyze it because it brings:
. Cost savings
. Time savings
. An understanding of market conditions
. Solutions to advertisers’ problems
. A driver of innovation and product development.
Big Data Characteristics
Characteristics describe the inherent qualities or attributes that define something, while
features refer to the specific functionalities or properties that enhance its performance.
Big Data Characteristics (5 V’s)
1. Volume
Volume refers to the unimaginable amounts of information generated every second from
social media, cell phones, cars, credit cards, M2M sensors, images, video, and so on.
Distributed systems are now used to store data in several locations, brought
together by a software framework like Hadoop.
Facebook alone handles about a billion messages, records about 4.5 billion clicks of the “like”
button, and receives over 350 million new posts each day. Such a huge
amount of data can only be handled by Big Data technologies.

2. Variety
As discussed before, Big Data is generated in multiple varieties. Compared to
traditional data like phone numbers and addresses, data now increasingly takes the form
of photos, videos, audio, and more, making about 80% of the data
completely unstructured.
3. Veracity
Veracity means the degree of reliability that the data has to offer. Since a major part of the
data is unstructured and irrelevant, Big Data needs a way to filter or translate it,
because reliable data is crucial in business development.
4. Value
Value is the major issue that we need to concentrate on. It is not just the amount of data that
we store or process. It is the amount of valuable, reliable, and trustworthy data that
needs to be stored, processed, and analyzed to find insights.
5. Velocity
Velocity plays a major role compared to the others: there is no point in investing so much only to
end up waiting for the data. So a major aspect of Big Data is to provide data on demand
and at a faster pace.
Big Data Features

1. Data wrangling and Preparation

2. Data exploration
3. Scalability
4. Support for various types of Analytics
5. Version control
6. Data management
7. Data Integration
8. Data Governance
9. Data security
10. Data visualization
Challenges of Conventional Systems

•Difficulty in compiling multiple file types from various sources into a
single point of access with conventional tools.
•Data often ends up in silos (for example, information silos arise when
departments don't properly record, share, and integrate new
information). Silos are easier to manage but limit visibility,
security, and accuracy.
•Lack of proper understanding of Big Data.
•Data growth issues, including storing all these huge sets of data
properly.
•Confusion when selecting Big Data tools.
Data and its Types and Issues
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner.
Differences between Small Data, Medium Data and Big Data
Data can be small, medium or big.
Small data is data in a volume and format that makes it accessible, informative and
actionable.
Medium data refers to data sets that are too large to fit on a single machine but don’t
require enormous clusters of thousands of machines.
Big data is extremely large data sets that may be analysed computationally to reveal
patterns, trends, and associations, especially relating to human behaviour and interactions.
Traditional Database shows the challenges of Conventional Systems.
Intelligent Data Analysis

Intelligent Data Analysis (IDA) is an interdisciplinary study that is concerned with the
extraction of useful knowledge from data, drawing techniques from a variety of fields, such as
artificial intelligence, high-performance computing, pattern recognition, and statistics.
Intelligent Data Analysis (Cont..)
IDA includes three stages:
1. Preparation of data
2. Data mining
3. Data validation and explanation
The main goal of IDA is to obtain knowledge.
The preparation of data involves selecting the required data from the relevant data source and
incorporating it into a data set that can be used for data mining.
Data analysis combines extracting data from the data set, analyzing and classifying it,
organizing it, reasoning about it, and so on. It is challenging to choose suitable methods to
resolve the complexity of the process.
As for terminology, the term charting is used here in place of the term visualization.
The term analysis is used for the method of incorporating, influencing, filtering and scrubbing the
data, which certainly includes, but is not limited to, interacting with the data through charts.
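The three IDA stages listed above can be sketched as three small Python functions. The study-hours data, the threshold rule used as the "mining" step, and all function names are assumptions chosen to keep the example short; they are not a method prescribed by the slides.

```python
from statistics import mean

# Hypothetical records: (hours studied, passed the exam). None marks a missing value.
raw = [(1.0, False), (2.0, False), (None, True), (6.0, True), (7.0, True), (3.0, False)]
holdout = [(1.5, False), (6.5, True), (5.5, True), (2.5, False)]


def prepare(records):
    """Stage 1: keep only complete records so they can be mined."""
    return [(hours, passed) for hours, passed in records if hours is not None]


def mine(records):
    """Stage 2: learn a simple rule, the midpoint between the two group means."""
    passed_hours = [hours for hours, passed in records if passed]
    failed_hours = [hours for hours, passed in records if not passed]
    return (mean(passed_hours) + mean(failed_hours)) / 2


def validate(threshold, records):
    """Stage 3: check the rule against data that was not used to learn it."""
    correct = sum((hours >= threshold) == passed for hours, passed in records)
    return correct / len(records)


clean = prepare(raw)
threshold = mine(clean)
accuracy = validate(threshold, holdout)
print(threshold, accuracy)
```

The explanation part of stage 3 is the readable rule itself: a student is predicted to pass when the hours studied reach the learned threshold.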
Data Analysis
Breaking up any data into parts and examining these parts to learn about
their nature, proportion, function, interrelationship, etc.
A process in which the analyst moves laterally and recursively between three
modes: describing data (profiling, correlation, summarizing), assembling data
(scrubbing, translating, synthesizing, filtering) and creating data (deriving,
formulating, simulating, i.e., producing data that seems real but is not).
It is the process of making sense of data: finding and identifying the meaning of
data.
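The three modes named above (describing, assembling, and creating data) can be shown on a tiny data set. The order records and field names below are hypothetical and serve only to give each mode one concrete operation.

```python
from statistics import mean

# Hypothetical order records.
orders = [
    {"item": "pen", "qty": 10, "unit_price": 2.0},
    {"item": "book", "qty": 0, "unit_price": 15.0},   # empty order, to be scrubbed
    {"item": "bag", "qty": 2, "unit_price": 30.0},
]

# Describing: summarize what the data looks like.
summary = {"count": len(orders), "mean_qty": mean(o["qty"] for o in orders)}

# Assembling: scrub and filter the records that should not be analyzed.
assembled = [o for o in orders if o["qty"] > 0]

# Creating: derive a new field that was not in the original data.
derived = [{**o, "total": o["qty"] * o["unit_price"]} for o in assembled]

print(summary)
print(derived)
```

In practice the analyst moves back and forth between these modes; for example, a derived total may prompt a new summary or a further filter.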
Data Visualization
It is a process of revealing already existing data and/or its features (origin, metadata,
allocation), which includes everything from tables to charts and multidimensional
animation.
To visualize is to form a mental image of something that is not visible.
Visual data analysis is another form of data analysis, in which some or all forms of
data visualization may be used to give feedback to the analyst. Visual cues such as
charts, interactive browsing, and workflow process cues
help the analyst move through the modes of data analysis.
The main advantage of visual representations is that they help to discover, make sense of,
and communicate data. Data visualization is a central part of, and an essential means
of carrying out, data analysis; once the meanings have been identified and
understood, it is easy to communicate those meanings to others.
Importance of Intelligent Data Analysis
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and
information. Intelligent data analysis discloses hidden facts that were not previously known and
provides potentially important information or facts from large quantities of data.
It also helps in making decisions.
Based mainly on machine learning, artificial intelligence, pattern recognition, and records and
visualization technology, IDA helps to obtain useful information, necessary data and
interesting models from the large amount of data available online in order to make the right choices.
Intelligent data analysis also helps with problems that are already solved as a matter of routine.
If data is collected from past cases together with the results that were finally achieved, such
data can be used to revise and optimize the strategy presently used to arrive at a conclusion.
In certain cases, when a question arises for the first time and there is only a little knowledge
about it, data from related situations helps us to solve the new problem, or unknown
relationships can be discovered from the data to gain knowledge in an unfamiliar area.
Nature of Data
Categorical data are values or observations that can be sorted into groups or
categories.
There are two types of categorical values, nominal and ordinal.
A nominal variable has no intrinsic ordering to its categories. For example,
housing is a categorical variable having two categories (own and rent).
An ordinal variable has an established ordering. For example, age can be treated as a
variable with three ordered categories (young, adult, and elder).
Numerical data are values or observations that can be measured.
There are two kinds of numerical values, discrete and continuous.
Discrete data are values or observations that can be counted and are distinct
and separate. For example, the number of lines in a program.
Continuous data are values or observations that may take on any value
within a finite or infinite interval. For example, an economic time series such as
historic gold prices.
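The four kinds of data described above determine which operations are meaningful: counting for nominal data, ordering for ordinal data, summing counts for discrete data, and averaging for continuous data. The records below reuse the slide's own examples (housing, age group, lines of code, gold price); the specific values and the AGE_ORDER name are invented for illustration.

```python
from collections import Counter

# Ordinal categories need an explicit order; this list is an assumed encoding.
AGE_ORDER = ["young", "adult", "elder"]

people = [
    {"housing": "rent", "age_group": "elder", "lines_of_code": 120, "gold_price": 1932.45},
    {"housing": "own", "age_group": "young", "lines_of_code": 45, "gold_price": 1875.10},
    {"housing": "rent", "age_group": "adult", "lines_of_code": 300, "gold_price": 1901.00},
]

# Nominal: categories can be counted but not ranked.
housing_counts = Counter(p["housing"] for p in people)

# Ordinal: categories can be sorted by their established order.
by_age = sorted(people, key=lambda p: AGE_ORDER.index(p["age_group"]))

# Discrete: whole-number counts can be added up.
total_lines = sum(p["lines_of_code"] for p in people)

# Continuous: measured values can take any value in a range, so a mean is meaningful.
mean_gold = sum(p["gold_price"] for p in people) / len(people)

print(housing_counts, [p["age_group"] for p in by_age], total_lines, mean_gold)
```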
Analytic Processes
Analytic Tools
Analysis vs Reporting

Analytics is the technique of examining data and reports to obtain actionable insights
that can be used to comprehend and improve business performance. Business users
may gain insights from data, recognize trends, and make better decisions with
workforce analytics.

The steps involved in data analytics are as follows:

1. Developing a data hypothesis
2. Data collection and transformation
3. Creating analytical research models to analyze and provide insights
4. Utilization of data visualization, trend analysis, deep dives, and other tools
5. Making decisions based on data and insights
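The analytics steps above can be walked through on a toy example. The hypothesis (weekend sales are higher than weekday sales), the sales figures, and the decision text are all assumptions made for illustration, and the comparison of means stands in for a real analytical model.

```python
from statistics import mean

# Step 1: hypothesis (assumed): average weekend sales exceed average weekday sales.
WEEKEND = {"Sat", "Sun"}

# Step 2: collect raw data and transform it (amounts arrive as text).
raw_sales = [("Mon", "120"), ("Sat", "200"), ("Tue", "110"), ("Sun", "180")]
sales = [(day, float(amount)) for day, amount in raw_sales]

# Step 3: a very simple analytical model, comparing the two group means.
weekend_mean = mean(amount for day, amount in sales if day in WEEKEND)
weekday_mean = mean(amount for day, amount in sales if day not in WEEKEND)

# Step 4: a text summary stands in for visualization and trend analysis.
print(f"weekend mean: {weekend_mean:.1f}, weekday mean: {weekday_mean:.1f}")

# Step 5: make a decision based on the insight.
hypothesis_supported = weekend_mean > weekday_mean
decision = "add weekend staff" if hypothesis_supported else "keep current staffing"
print(decision)
```

With only four data points this comparison proves nothing statistically; it only shows how the steps connect.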
Analysis vs Reporting (Cont…)

Reporting is the process of presenting data from numerous sources clearly and simply.
The procedure is always carefully set out to report correct data and avoid
misunderstandings.
In general, the procedures needed to create a report are as follows:
Determining the business requirement
Obtaining and compiling essential data
Technical data translation
Recognizing the data context
Building dashboards for reporting
Providing real-time reporting
Allowing users to drill down into reports
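The last reporting steps above (a summary view plus the ability to drill down) can be sketched with plain dictionaries. The regions, cities, amounts, and the drill_down helper are hypothetical; a real dashboard tool would provide this through its own interface.

```python
from collections import defaultdict

# Hypothetical compiled data: (region, city, sales amount).
rows = [
    ("North", "Delhi", 100),
    ("North", "Lucknow", 50),
    ("South", "Chennai", 70),
]

# Top-level report: total per region.
summary = defaultdict(int)
# Detail behind each summary line: total per city within a region.
detail = defaultdict(dict)

for region, city, amount in rows:
    summary[region] += amount
    detail[region][city] = detail[region].get(city, 0) + amount


def drill_down(region):
    """Return the city-level figures behind one line of the summary report."""
    return dict(detail.get(region, {}))


print(dict(summary))
print(drill_down("North"))
```

The report only presents what is in the data; interpreting why one region outsells another would be the job of analysis.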
Modern Data Analytic Tools
THANK YOU
