Big Data Unit-I

Big Data refers to large datasets that exceed the capabilities of traditional data processing tools, often generated from sources like social media, e-commerce, and IoT devices. It encompasses structured, semi-structured, and unstructured data, and has applications across various industries including retail, healthcare, and finance. The growth of Big Data is driven by data explosion, business needs for insights, and advancements in technology, leading to the development of specialized platforms for storage, processing, and analysis.

Uploaded by Anand Raj Ashwin
© All Rights Reserved


Unit-I

Big Data

Introduction to Big Data:


The term big data has been in use since the 1990s, with some giving credit to John Mashey for
popularizing the term. Big data usually includes data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and process within a tolerable elapsed time.
Normally we work on data of size MB (Word documents, Excel sheets) or at most GB (movies,
code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is often stated
that almost 90% of today's data has been generated in the past 3 years.
Sources of Big Data
These data come from many sources like

 Social networking sites: Facebook, Google, and LinkedIn generate huge amounts
of data on a day-to-day basis, as they have billions of users worldwide.
 E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge numbers of logs
from which users' buying trends can be traced.
 Weather stations: Weather stations and satellites produce very large volumes of data,
which are stored and processed to forecast the weather.
 Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of millions of users.
 Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
 IoT appliances: Electronic devices connected to the internet create data for their
smart functionality; examples are smart TVs, smart washing machines, smart coffee machines,
smart ACs, etc. This is machine-generated data created by sensors kept in various devices.
For example, a smart printing machine is connected to the internet, and a number of such
printing machines connected to a network can transfer data among themselves.
 Global Positioning System (GPS): GPS in a vehicle helps monitor the movement
of the vehicle and shorten the path to a destination, cutting fuel and time consumption. This
system creates huge amounts of data on vehicle position and movement.
 Machine data: Machine data is produced automatically, either in reaction to an event or
according to a set timetable. It is compiled from a variety of sources, including satellites,
desktop computers, mobile phones, industrial machines, smart sensors, SIEM logs, medical
and wearable devices, road cameras, IoT devices, and more.

Types of Digital Data: -

1. Structured Data

 Structured data can be crudely defined as the data that resides in a fixed
field within a record.

 It is the type of data most familiar in our everyday lives, for example: birthdays
and addresses.
 A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to
enhance the integrity of the data by creating a single record to depict an
entity. Relationships are enforced by the application of table constraints.

 The business value of structured data lies within how well an organization
can utilize its existing systems and processes for analysis purposes.

2. Semi-Structured Data: -
 Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities
from each other.
 Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
 A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.
 Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for
computer programs.
 This type of information typically comes from external sources such as
social media platforms or other web-based data feeds.

3. Unstructured Data:
 Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
 Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a
video may be semi-structured, the actual data being dealt with is
unstructured.
 Additionally, Unstructured data is also known as “dark data” because it
cannot be analysed without the proper software tools.
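The three types can be illustrated with a short Python sketch (the sample records below are invented for illustration): structured data is accessed through a fixed schema, semi-structured data through key-value pairs, and unstructured data only through ad hoc parsing.

```python
import csv
import io
import json

# Structured: fixed fields in a record (rows and columns)
csv_text = "name,birthday\nAsha,1990-04-12\nRavi,1988-11-02\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: key-value pairs, no rigid schema (JSON)
json_text = '{"user": "Asha", "tags": ["retail", "iot"], "age": 34}'
record = json.loads(json_text)

# Unstructured: free text with no predefined schema
log_line = "ERROR 2024-05-01 payment service timed out after 30s"

print(rows[0]["name"])      # field access via the schema
print(record["tags"][0])    # key-value navigation
print("ERROR" in log_line)  # only ad hoc parsing is possible
```

Note how the structured and semi-structured records can be queried by field name, while the log line can only be searched as raw text.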

Applications of big data: -


1) Retail: -
 Leading online retail platforms are wholeheartedly deploying big data
throughout a customer’s purchase journey, to predict trends, forecast
demands, optimize pricing, and identify customer behavioural patterns
 Big data is helping retailers implement clear strategies that minimize risk
and maximize profit.
2) Healthcare: -
 Big data is revolutionizing the healthcare industry, especially the way
medical professionals diagnose and treat diseases.
 In recent times, effective analysis and processing of big data by
machine learning algorithms has provided significant advantages for the
evaluation and assimilation of complex clinical data, preventing
deaths and improving quality of life by enabling healthcare workers to
detect early warning signs and symptoms.
3) Financial Services and Insurance: -
 The increased ability to analyse and process big data is dramatically
impacting the financial services, banking, and insurance landscape.
 In addition to using big data for swift detection of fraudulent
transactions, lowering risks, and supercharging marketing efforts, a few
companies are taking these applications to the next level.
4) Manufacturing: -
 With advancements in robotics and automation technologies, modern-day
manufacturers are becoming more and more data-focused, heavily
investing in automated factories that exploit big data to streamline
production and lower operational costs.
 Top global manufacturers are also integrating sensors into their products,
capturing big data to provide valuable insights on product performance
and its usage.
5) Energy: -
 To combat the rising costs of oil extraction and exploration difficulties
because of economic and political turmoil, the energy industry is turning
toward data-driven solutions to increase profitability.
 Big data is optimizing every process, from drilling and exploring new
reserves to production and distribution, while cutting down energy waste.
6) Logistics & Transportation: -
 State-of-the-art warehouses use digital cameras to capture stock level
data, which, when fed into ML algorithms, facilitates intelligent
inventory management with prediction capabilities that indicate when
restocking is required.
 In the transportation industry, leading transport companies now promote
the collection and analysis of vehicle telematics data, using big data to
optimize routes, driving behaviour, and maintenance.
7) Government: -
 Cities worldwide are undergoing large-scale transformations to become
“smart”, through the use of data collected from various Internet of
Things (IoT) sensors.
 Governments are leveraging this big data to ensure good governance via
the efficient management of resources and assets, which increases urban
mobility, improves solid waste management, and facilitates better
delivery of public utility services.

The History of Big Data: -

Although the concept of big data itself is relatively new, the origins of large data sets go
back to the 1960s and '70s when the world of data was just getting started with the first data
centres and the development of the relational database.

Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open-source framework created
specifically to store and analyse big data sets) was developed that same year. NoSQL also
began to gain popularity during this time.

The development of open-source frameworks, such as Hadoop (and more recently, Spark)
was essential for the growth of big data because they make big data easier to work with and
cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are
still generating huge amounts of data, but it's not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to
the internet, gathering data on customer usage patterns and product performance. The
emergence of machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.

Benefits of Big Data and Data Analytics: -


 Big data makes it possible for you to gain more complete answers because you have
more information.
 More complete answers mean more confidence in the data—which means a
completely different approach to tackling problems.

Big Data Platform: -

Big data platforms are specialized tools and software designed to efficiently
store, process, and analyse large datasets, enabling organizations to gain
valuable insights and make data-driven decisions.

Key Features and Examples:


 Scalability and High-Speed Processing: Big data platforms must be able
to handle massive volumes of data and process it quickly.
 Real-time Analytics: Some platforms offer real-time data processing
capabilities, enabling organizations to make immediate decisions based on
current data.
 Data Integration: They should seamlessly integrate with various data
sources and tools.
 Diverse Data Types and Formats: A good platform should support various
data types, including structured, semi-structured, and unstructured data.
 Security: Robust security measures are essential to protect sensitive data.
 User-Friendly Interfaces: Platforms should provide user-friendly
interfaces for easier data management and analysis.

Examples of Big Data Platforms:


 Apache Hadoop:
A distributed computing framework for storing and processing large
datasets.
 Apache Spark:
A fast, unified analytics engine for processing large datasets, both batch and
streaming data.
 Google BigQuery:
A cloud-based data warehouse service for storing and analyzing large
datasets.
 Microsoft Azure HDInsight:
A managed Hadoop service on the Azure cloud platform.
 Databricks:
A cloud-based data analytics platform built on Apache Spark, offering a
collaborative environment for data science and machine learning.
 Cloudera:
A leading big data and analytics platform that helps organizations gain
valuable insights from their data.
 Snowflake:
A cloud-based data warehouse platform that provides a unified data
platform for analytical workloads.
 Amazon Web Services (AWS):
Offers a wide range of services for big data management, including data
warehousing, clickstream analytics, and IoT processing.
 Oracle:
Provides fully integrated cloud applications and platform services, including
a flagship database for big data analytics.
 Microsoft Azure:
A powerful cloud platform that offers a variety of services to help
businesses manage and make sense of their data.

Drivers for Big Data:

Several factors drive the growth and adoption of Big Data solutions:

 Data Explosion: The exponential growth in data volume generated by
various sources, including social media, IoT devices, sensors, and mobile
devices.
 Business Needs: Organizations increasingly rely on data-driven insights to
gain a competitive edge, improve decision-making, and enhance customer
experiences.
 Technological Advancements: Advances in storage, processing, and
analytics technologies have made it more cost-effective and feasible to
store, process, and analyse large datasets.
 Regulatory Requirements: Compliance with data protection and privacy
regulations, such as GDPR and HIPAA, also drives the adoption of Big
Data solutions to ensure data governance and compliance.

Big Data Architecture and Characteristics: -

There is more than one workload type involved in big data systems, and they
are broadly classified as follows:
1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. Interactive exploration of big data with new technologies and tools.
4. Predictive analytics and machine learning.

• Data Sources: All of the sources that feed into the data extraction pipeline fall
under this definition, so this is the starting point of the big data pipeline. Data
sources, both open and third-party, play a significant role in the architecture.
They include relational databases, data warehouses, cloud-based data
warehouses, SaaS applications, real-time data from company servers and
sensors such as IoT devices, third-party data providers, and static files such as
Windows logs. The data managed here can arrive for both batch processing
and real-time processing.

• Data Storage: Data is held in distributed file stores that can hold big files in a
variety of formats. Large numbers of big files in different formats can also be
stored in a data lake. This layer holds the data that is managed for batch
operations and saved in the file stores. Options include HDFS and blob-storage
containers on Microsoft Azure, AWS, and GCP, among others.

• Batch Processing: Long-running jobs filter, aggregate, and otherwise prepare
chunks of data for analysis. These jobs typically read from the sources, process
the data, and deliver the output to new files. Multiple approaches to batch
processing are employed, including Hive jobs, U-SQL jobs, Sqoop or Pig, and
custom map-reduce jobs written in Java, Scala, or other languages such as
Python.
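The map-reduce style of batch job mentioned above can be sketched in plain Python (a toy word count, not a distributed implementation; the input lines are invented):

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    # Map: emit (word, 1) pairs for each word in a line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

batch = ["big data needs batch jobs", "batch jobs process data at rest"]
pairs = chain.from_iterable(map_phase(line) for line in batch)
word_counts = reduce_phase(pairs)
print(word_counts["data"])  # "data" appears in both lines -> 2
```

In a real framework such as Hadoop MapReduce, the map and reduce phases would run in parallel across many machines, but the logical shape of the job is the same.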

• Real-Time Message Ingestion: In contrast to batch processing, this layer
covers the real-time streaming systems that cater to data at the moment it is
generated, in a sequential and uniform fashion. In the simplest case, a data
store or folder receives all incoming messages and holds them for data
processing. If message-based processing is required, however, message
ingestion stores such as Apache Kafka, Apache Flume, or Event Hubs from
Azure must be used; these make the delivery process more reliable and add
other message-queuing semantics.

• Stream Processing: Real-time message ingestion and stream processing are
different. Ingestion captures the incoming data and makes it available, often
through a publish-subscribe tool; stream processing then handles that
streaming data in the form of windows or streams and writes the results to a
sink. Tools include Apache Spark, Flink, Storm, etc.
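The idea of handling streaming data "in the form of windows" can be sketched in plain Python (a toy tumbling-window average over a simulated stream; the sensor readings are invented):

```python
def tumbling_window_averages(stream, window_size):
    """Group a stream of readings into fixed-size windows and average each."""
    window, results = [], []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            results.append(sum(window) / window_size)
            window = []  # tumbling: windows do not overlap
    return results

sensor_stream = [10, 12, 14, 20, 22, 24]  # simulated sensor readings
print(tumbling_window_averages(sensor_stream, 3))  # [12.0, 22.0]
```

Engines like Spark Structured Streaming or Flink provide the same windowing concept, but with fault tolerance and distribution across a cluster.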

• Analytics-Based Datastore: To analyse already processed data, analytical
tools use a data store based on HBase or another NoSQL data warehouse
technology. The data can be presented through a Hive database, which
provides metadata abstraction over the data store and supports interactive use.
NoSQL databases like HBase, and engines such as Spark SQL, are also
available.

• Reporting and Analysis: The generated insights must then be presented,
which is accomplished by reporting and analysis tools that use embedded
technology to produce useful graphs, analyses, and insights that are beneficial
to the business. Examples include Cognos, Hyperion, and others.

• Orchestration: Big data solutions consist of repetitive data-related tasks,
contained in workflow chains that transform the source data, move data
between sources and sinks, and load it into stores. Sqoop, Oozie, Azure Data
Factory, and others are just a few examples.

5Vs of Big Data: -


1. Volume: - The amount of data is the most defining characteristic of Big Data.
Enterprises collect data from various sources including business transactions,
smart (IoT) devices, industrial equipment, videos, social media, and more.
Dealing with potential petabytes or exabytes of data requires specialized
storage, management, and analysis technologies.
2. Velocity: - Data is being generated at unprecedented speeds and must be dealt
with in a timely manner. Velocity refers to the rate at which data flows from
various sources like business processes, machines, networks, social media
feeds, mobile devices, etc. The ability to manage this speed is crucial for real-
time decision-making and processing.
3. Variety: - Data comes in various formats – structured data, semi-structured
data, and unstructured data. Structured data follows a model and is easily
searchable, whereas unstructured data, such as emails, video, and audio, lacks a
defined model. Semi-structured data lies in between and includes formats like
XML and JSON. Handling this variety involves extracting data and
transforming it into a cleaner format for analysis.
4. Veracity: - The quality of collected data can vary greatly, affecting accurate
analysis. Veracity refers to the uncertainty of data, which can be due to
inconsistency and incompleteness, ambiguities, latency, deception, and model
approximations. Ensuring the veracity of data is critical as it affects the
decision-making process in businesses.
5. Value: - The final V stands for value. It’s critical to assess whether the data that
is being gathered is actually valuable in decision-making processes. The main
goal of businesses investing in big data technologies is to extract meaningful
insights from collected data that lead to better decisions and strategic business
moves. The value is all about turning data into a competitive advantage.
Big Data Importance:-

The importance of big data does not revolve around how much data a company
has but how a company utilizes the collected data. Every company uses data in
its own way; the more efficiently a company uses its data, the more potential it
has to grow. A company can take data from any source and analyse it to find
answers that enable:

1) Cost Savings: - Some tools of Big Data like Hadoop and Cloud-Based
Analytics can bring cost advantages to business when large amounts of data
are to be stored and these tools also help in identifying more efficient ways of
doing business.
2) Time Reductions: - The high speed of tools like Hadoop and in-memory
analytics can easily identify new sources of data, which helps businesses
analyse data immediately and make quick decisions based on what they learn.
3) Understand the market conditions: - By analysing big data you can get a
better understanding of current market conditions. For example, by
analysing customers' purchasing behaviour, a company can find out which
products sell the most and produce products according to this trend.
In this way, it can get ahead of its competitors.
4) Control online reputation: - Big data tools can do sentiment analysis.
Therefore, you can get feedback about who is saying what about your
company. If you want to monitor and improve the online presence of your
business, big data tools can help with all of this.
5) Using Big Data Analytics to Boost Customer Acquisition and Retention: -
The customer is the most important asset any business depends on.
No business can claim success without first establishing a solid
customer base. However, even with a customer base, a
business cannot afford to disregard the high competition it faces. If a business
is slow to learn what customers are looking for, it is very easy to begin
offering poor-quality products. In the end, loss of clientele results, which
has an adverse overall effect on business success. The use of big data
allows businesses to observe various customer-related patterns and trends.
Observing customer behaviour is important for triggering loyalty.
6) Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights: - Big data analytics can help change all business
operations. This includes the ability to match customer expectations, change
the company's product line, and of course ensure that the marketing campaigns
are powerful.
7) Big Data Analytics As a Driver of Innovations and Product Development: -
Another huge advantage of big data is the ability to help companies innovate
and redevelop their products.

Big Data and Application: -


Big Data is crucial for organizations across various industries to gain insights, make
informed decisions, and drive innovation. Some key applications include:
 Retail: Customer segmentation, personalized recommendations, supply chain
optimization.
 Healthcare: Predictive analytics for disease diagnosis and treatment, patient
monitoring, drug discovery.

 Finance: Fraud detection, risk management, algorithmic trading, customer
analytics.
 Manufacturing: Predictive maintenance, quality control, supply chain
management.
 Telecommunications: Network optimization, customer churn prediction,
targeted marketing.
 Media and Entertainment: Content recommendation, audience segmentation,
sentiment analysis.

Big Data Features: -


Security, Compliance, Auditing and Protection:
 Security: Ensuring the confidentiality, integrity and availability of data
through encryption, access controls, and authentication mechanisms.
 Compliance: Adhering to regulatory requirements such as GDPR, HIPAA,
and PCI-DSS to protect sensitive data and ensure legal and ethical use.
 Auditing: Tracking and monitoring data access, usage and modification for
accountability and compliance purposes.
 Protection: Implementing measures to safeguard data against unauthorized
access, data breaches and cyber threats using firewalls, intrusion detection
systems, and data encryption.
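As one small, hedged illustration of the protection measures above, a cryptographic hash can detect tampering with a stored record (Python's standard hashlib; the record contents are invented):

```python
import hashlib

def fingerprint(record: str) -> str:
    # A SHA-256 digest acts as an integrity check for a stored record
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

original = "patient_id=102,diagnosis=ok"
stored_hash = fingerprint(original)

tampered = "patient_id=102,diagnosis=altered"
print(fingerprint(original) == stored_hash)   # True: data intact
print(fingerprint(tampered) == stored_hash)   # False: modification detected
```

Hashing alone proves integrity, not confidentiality; encryption and access controls, as listed above, address the other goals.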

Big Data Privacy and Ethics: -


Privacy concerns arise from the collection, storage, and use of personal data in
Big Data analytics. Ethical considerations include:
 Transparency: Providing clear information about data collection and
usage practices to users.
 Consent: Obtaining explicit consent from individuals before collecting
and processing their data.
 Fairness: Ensuring fair and unbiased data analysis to avoid
discrimination or unfair treatment.
 Accountability: Holding organizations accountable for the responsible
use of data and compliance with privacy regulations.

Big-Data Analytics: -

Big Data Analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It's like sifting through a giant mountain of
data to find the gold nuggets of insight.

Here's a breakdown of what it involves:

 Collecting Data: Data comes from various sources such as social
media, web traffic, sensors, and customer reviews.

 Cleaning the Data: Imagine having to assess a pile of rocks that included
some gold pieces in it. You would have to clean the dirt and the debris first.
When data is being cleaned, mistakes must be fixed, duplicates must be
removed and the data must be formatted properly.

 Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the
same thing as looking for a specific pattern in all those rocks that you sorted
through.

How does big data analytics work?

Big Data Analytics is a powerful tool which helps to find the potential of large and
complex datasets. To get better understanding, let's break it down into key steps:
 Data Collection: Data is the core of Big Data Analytics. It is the gathering of
data from different sources such as the customers’ comments, surveys, sensors,
social media, and so on. The primary aim of data collection is to compile as
much accurate data as possible. The more data, the more insights.

 Data Cleaning (Data Preprocessing): The next step is to process this
information, which often requires some cleaning. This entails replacing
missing data, correcting inaccuracies, and removing duplicates. It is
like sifting through a treasure trove, separating out the rocks and debris and
leaving only the valuable gems behind.
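A minimal sketch of the cleaning step just described (pure Python; the sample records and the choice of a default value for missing ages are assumptions for illustration):

```python
raw_records = [
    {"user": "asha", "age": "34"},
    {"user": "asha", "age": "34"},    # duplicate to remove
    {"user": "ravi", "age": None},    # missing value to fill
    {"user": "MEENA ", "age": "29"},  # inconsistent formatting to fix
]

def clean(records, default_age=0):
    seen, cleaned = set(), []
    for r in records:
        user = r["user"].strip().lower()                   # normalize formatting
        age = int(r["age"]) if r["age"] else default_age   # fill missing values
        key = (user, age)
        if key not in seen:                                # drop duplicates
            seen.add(key)
            cleaned.append({"user": user, "age": age})
    return cleaned

print(clean(raw_records))  # 3 records: duplicate removed, gaps filled
```

Real pipelines use tools like pandas or Spark for this, but the operations (normalize, fill, deduplicate) are the same.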

 Data Processing: Next comes data processing, which involves organizing,
structuring, and formatting the data so that it is usable for analysis. It is
like a chef gathering the ingredients before cooking: data processing turns
the data into a format suited for analytics tools to work with.

 Data Analysis: Data analysis applies statistical, mathematical, and
machine learning methods to extract the most important findings from the
processed data. For example, it can uncover customer preferences, market
trends, or patterns in healthcare data.

 Data Visualization: The results of data analysis are usually presented in
visual form, for example charts, graphs, and interactive dashboards.
Visualizations simplify large amounts of data and allow decision-makers
to quickly detect patterns and trends.

 Data Storage and Management: Storing and managing the analysed data is of
utmost importance. It is like digital scrapbooking: you may want to go back
to those lessons later, so how you store them matters greatly. Moreover,
data protection and adherence to regulations are key issues to be addressed
at this crucial stage.

 Continuous Learning and Improvement: Big data analytics is a continuous
process of collecting, cleaning, and analysing data to uncover hidden insights.
It helps businesses make better decisions and gain a competitive edge.

Types of Big Data Analytics

Big Data Analytics comes in many different types, each serving a different purpose:

1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.

2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high patient
re-admissions.
3. Predictive Analytics: Predictive analytics forecasts future events based on past
data. Weather forecasting, for example, predicts tomorrow's weather by
analyzing historical patterns.

4. Prescriptive Analytics: This category not only predicts results but
also offers recommendations for action to achieve the best outcome. In
e-commerce, it may suggest the best price for a product to achieve the highest
possible profit.

5. Real-time Analytics: The key function of real-time analytics is processing
data as it arrives. It allows traders, for example, to make swift decisions based
on real-time market events.

6. Spatial Analytics: Spatial analytics is about location data. In urban
management, it optimizes traffic flow using data from sensors and
cameras to minimize traffic jams.

7. Text Analytics: Text analytics delves into unstructured text data. In the
hotel business, it can use guest reviews to enhance services and guest
satisfaction.
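Descriptive and predictive analytics can be made concrete with a toy sketch (pure Python, invented sales figures): descriptive analytics summarizes what happened, while a naive moving-average forecast illustrates the predictive idea.

```python
sales = [100, 120, 110, 130, 140, 150]  # past daily sales (invented)

# Descriptive: what happened?
average_sales = sum(sales) / len(sales)
best_day = max(sales)

# Predictive (naive): forecast tomorrow as the mean of the last 3 days
forecast = sum(sales[-3:]) / 3

print(average_sales)  # 125.0
print(best_day)       # 150
print(forecast)       # 140.0
```

Real predictive analytics uses proper statistical or machine learning models, but even this toy forecast shows the shift from summarizing the past to projecting the future.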

Big Data Analytics Technologies and Tools

Big Data Analytics relies on various technologies and tools that might sound
complex, let's simplify them:

 Hadoop: Imagine Hadoop as an enormous digital warehouse. It's used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it's because Hadoop helps manage
your shopping history.

 Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.

 NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform can
provide you with the right information when you need it.

 Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.

 Python and R: Python and R are like magic tools for data scientists. They use
these languages to solve tricky problems. For example, Kaggle uses them to
predict things like house prices based on past data.
 Machine Learning Frameworks (e.g., TensorFlow): Machine learning
frameworks are the tools that make predictions. Airbnb
uses TensorFlow to predict which properties are most likely to be booked in
certain areas. It helps hosts make smart decisions about pricing and availability.

These tools and technologies are the building blocks of Big Data Analytics and help
organizations gather, process, understand, and visualize data, making it easier for
them to make decisions based on information.

Benefits of Big Data Analytics

Big Data Analytics offers a host of real-world advantages, and let's understand with
examples:

1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps
them make smart choices about what products to stock. This not only reduces
waste but also keeps customers happy and profits high.

2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics
is what makes those product suggestions so accurate. It's like having a personal
shopper who knows your taste and helps you find what you want.

3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It's like having a guardian
that watches over your money and keeps it safe.

4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver
your packages faster and with less impact on the environment. It's like taking
the fastest route to your destination while also being kind to the planet.

Challenges of Big data analytics

While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:

 Data Overload: Consider Twitter, where approximately 6,000 tweets are
posted every second. The challenge is sifting through this avalanche of data to
find valuable insights.

 Data Quality: If the input data is inaccurate or incomplete, the insights
generated by Big Data Analytics can be flawed. For example, incorrect sensor
readings could lead to wrong conclusions in weather forecasting.

 Privacy Concerns: With the vast amount of personal data used, like in
Facebook's ad targeting, there's a fine line between providing personalized
experiences and infringing on privacy.
 Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect
fraudulent activities, but they must also protect this information from breaches.

 Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but
they need to ensure that the benefits outweigh the costs.

Usage of Big Data Analytics

Big Data Analytics has a significant impact in various sectors:

 Healthcare: It aids in precise diagnoses and disease prediction, elevating
patient care.

 Retail: Amazon's use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored and
enjoyable shopping experience.

 Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of your
financial assets.

 Transportation: Companies like Uber use Big Data Analytics to optimize
drivers' routes and predict demand, reducing wait times and improving overall
transportation experiences.

 Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.

 Manufacturing: Companies like General Electric (GE) use Big Data Analytics
to predict machinery maintenance needs, reducing downtime and enhancing
operational efficiency.

CHALLENGES OF CONVENTIONAL SYSTEM IN BIG DATA

Big data has revolutionized the way businesses operate, but it has also
presented a number of challenges for conventional systems. Big data is a term
used to describe the large volumes of data that can be stored and analysed by
computers, and it is widely used in business, science and government. Although
big data has been around for several years, it is only recently that businesses
have realized how important it is for improving their operations and providing
better services to customers. Many companies have already adopted big data
analytics tools because of the potential these systems offer when used
effectively. However, while there are many benefits, including faster processing
times and increased accuracy, there are also challenges in implementing such
systems correctly. The main ones are listed below.

Challenges of Conventional System in big data

 Scalability
 Speed
 Storage
 Data Integration
 Security

Scalability: -

A common problem with conventional systems is that they can't scale. As the
amount of data increases, so does the time it takes to process and store it. This
can cause bottlenecks and system crashes, which are not ideal for businesses
looking to make quick decisions based on their data. Conventional systems also
lack flexibility in handling new types of information: for example, adding another
column (columns are like fields) or row (rows are like records) often means
rewriting your code from scratch.
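The flexibility point can be sketched in plain Python: dictionary-based records (the style document databases use) let you add a new "column" later without rewriting existing code. The names and values below are hypothetical, and this is an illustration of the idea rather than a real database.

```python
# Sketch: dict-based records can gain a new "column" without breaking
# code that was written before the column existed.

records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# A later record adds a new field; earlier records are untouched.
records.append({"id": 3, "name": "Carol", "city": "Pune"})

names = [r["name"] for r in records]                  # old query still works
cities = [r.get("city", "unknown") for r in records]  # new field, safe default
```

A fixed-schema table, by contrast, would need every row (and often the code reading it) updated before the new field could be used.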

Speed: -

Speed is a critical component of any data processing system. It matters because
processing and analysing data faster lets you make better-informed decisions
about how to proceed with your business, and make more accurate predictions
about future events based on past performance.

Storage: -

The amount of data being created and stored is growing exponentially, with
estimates that it would reach 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add
more data. This leads to huge amounts of wasted storage space and lost
information due to corruption or security breaches.

Data Integration: -

The challenges of conventional systems in big data are numerous. Data
integration is one of the biggest, as it takes a lot of time and effort to combine
different sources into a single database. This is especially true when you are
integrating data from multiple sources with different schemas and formats.
Another challenge is errors and inaccuracies in analysis caused by a lack of
understanding of what exactly happened during an event or transaction. For
example, if there was an error while transferring money from one bank account to
another, there would be no way to know what actually happened unless someone
reports it later (which may not happen).

Security: -
Security is a major challenge for enterprises that depend on conventional
systems to process and store their data. Traditional databases are designed to be
accessed by trusted users within an organization, but this makes it difficult to
ensure that only authorized people have access to sensitive information.
Security measures such as firewalls, passwords and encryption help protect
against unauthorized access and attacks by hackers who want to steal data or
disrupt operations. But these measures have limitations: they are expensive; they
require constant monitoring and maintenance; they can slow down performance
if implemented too extensively; and they often don't prevent breaches altogether,
because there is always some way around them (such as phishing emails).

In short, conventional systems are not equipped for big data. They were designed
for a different era, when the volume of information was much smaller and more
manageable. Now that we are dealing with huge amounts of data, conventional
systems are struggling to keep up. They are also expensive and time-consuming
to maintain, requiring constant upkeep and upgrades to meet new demands from
users who want faster access speeds and more features than ever before.

Intelligent Data Analysis (IDA): -


Intelligent Data Analysis (IDA) refers to advanced methods for analysing large
datasets to identify patterns, trends, and relationships. It combines techniques from
fields such as statistics, machine learning, and artificial intelligence to extract
meaningful insights from raw data.
 Analysing data in detail.
 Extracting useful data and analysing it using artificial intelligence,
machine learning, high-performance computing, pattern recognition,
statistics, databases and visualization.
 Intelligent data analysis helps us analyse and understand the data, then
extract knowledge from it.
Stages in Intelligent data analysis: -
1. Data Preparation: -
 Collect the required data from the available data sources and
integrate it into a data set.
 The collected data must contain past data relevant to the problem.
 Extract the data from multiple sources.
2. Data Mining: -
 Here we examine the collected data and extract the useful data or
knowledge from it.
 Using the extracted data or knowledge, we generate information.
3. Result Validation & Explanation: -
 We validate or verify the generated information using verification
patterns produced by the data mining algorithms.
 Result explanation means drawing a conclusion from the
validation, i.e. whether the result is valid, invalid, or acceptable.
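The three stages above can be sketched as a toy Python pipeline on made-up sensor readings. The prepare/mine/validate functions are illustrative stand-ins, not a real IDA library.

```python
# Toy sketch of the three IDA stages on hypothetical sensor readings.

def prepare(raw_sources):
    """Data preparation: integrate multiple sources into one data set."""
    data = []
    for source in raw_sources:
        data.extend(source)
    return [x for x in data if x is not None]  # drop missing readings

def mine(data):
    """Data mining: extract a simple pattern (here, just the average)."""
    return sum(data) / len(data)

def validate(result, data, tolerance=0.5):
    """Result validation: accept the pattern only if it fits the data."""
    return all(abs(x - result) <= tolerance * result for x in data)

sources = [[10, 11, None], [9, 10]]   # two sources, one missing value
clean = prepare(sources)
pattern = mine(clean)
accepted = validate(pattern, clean)
```

Here the integrated data set is [10, 11, 9, 10], the mined pattern (average) is 10.0, and validation accepts it because every reading lies close to that value.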
Types of Intelligent analysis: -
1. Descriptive Analysis: -
 It is a preliminary stage of data processing that summarizes the data.
 As the name says, it describes what happened, based on past data.
 To know what happened to the data, we use descriptive analysis.
 It yields only superficial insights rather than deep ones.
2. Predictive Analysis: -
 It predicts future results by analysing past or historical data.
 Predictive analysis builds on the data summarized in descriptive
analysis.
 It identifies trends by looking at how users behave.
 For predictive analysis we use techniques such as linear and
non-linear regression and decision trees.
3. Prescriptive Analysis: -
 We use the output of predictive analysis as the input of
prescriptive analysis.
 Here we make decisions using the predicted data to determine the
best outcome.
 It is a statistical strategy to determine the best course of action.
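A minimal Python sketch of the three analysis types, using made-up monthly sales figures and a deliberately naive trend forecast (real predictive work would use proper regression models):

```python
# Toy illustration of descriptive, predictive and prescriptive analysis.

sales = [100, 110, 120, 130]           # past (historical) monthly sales

# Descriptive: what happened?
average = sum(sales) / len(sales)

# Predictive: what is likely next? (naive trend: last value + average step)
step = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + step

# Prescriptive: what should we do, given the prediction?
decision = "increase stock" if forecast > average else "hold stock"
```

With this toy data, descriptive analysis reports an average of 115, predictive analysis forecasts 140 for the next month, and prescriptive analysis recommends increasing stock.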

Key Features of Intelligent Data Analysis


1. Pattern Recognition: IDA helps identify trends, patterns, and anomalies in
datasets that might be overlooked by traditional analysis methods.
2. Forecasting: Based on historical data, IDA enables the prediction of future
events, which is crucial for areas such as production planning and predictive
maintenance.
3. Decision Support: Through IDA, businesses can gain data-driven insights that
support more informed decision-making, providing a solid foundation for
operational strategies.
Applications: -
1. Health/Medical field
2. Cyber Security
3. Business
4. Finding patterns and making decisions

Benefits of IDA
1. Better Decisions: Companies can make informed decisions based on accurate
and up-to-date data analysis.
2. Competitive Advantage: By identifying market opportunities, trends, and
risks, businesses can gain a competitive edge.
3. Increased Efficiency: IDA helps optimize business processes by identifying
inefficiencies and improving overall operations.

Nature of data: -
The "nature of data" refers to the inherent characteristics and attributes that
define data in terms of its type, structure, quality, and how it can be used,
analyzed, or interpreted. Understanding the nature of data is crucial for
selecting the right analysis techniques and tools. Here’s an overview of the key
aspects that make up the nature of data:
1. Type of Data
 Qualitative (Categorical) Data: Data that describes qualities or
characteristics. It can be divided into categories but cannot be measured
numerically.
o Example: Gender, color, type of product.
 Quantitative (Numerical) Data: Data that is expressed in numerical terms and
can be measured or counted.
o Example: Height, weight, age, or sales revenue.
2. Measurement Levels
Data can be classified into different levels of measurement based on how the data is
structured:
 Nominal: Data used to label or categorize without any order or ranking.
o Example: Colors, gender, country names.
 Ordinal: Data that has a meaningful order, but the differences between values
are not consistent.
o Example: Ranking of preferences (1st, 2nd, 3rd), education levels (high
school, bachelor's, master's).
 Interval: Data with a consistent difference between values, but no true zero
point.
o Example: Temperature in Celsius or Fahrenheit.
 Ratio: Data with a true zero point and consistent intervals, making it possible
to calculate ratios.
o Example: Weight, height, income, age.
3. Structure of Data
 Structured Data: Data that is organized into a predefined format, typically in
rows and columns. It can be easily analyzed and stored in databases.
o Example: Data in SQL databases or Excel spreadsheets.
 Unstructured Data: Data that does not have a fixed format and is not easily
categorized. It includes things like text, images, audio, and video.
o Example: Social media posts, images, video files, emails.
 Semi-Structured Data: Data that does not follow a strict structure but still has
some organizational properties, often in the form of tags or markers.
o Example: JSON files, XML documents.
4. Scale and Size
 Small-Scale Data: Data that is limited in scope, usually handled by simple
tools or small-scale software.
o Example: A single store's transaction data.
 Big Data: Extremely large data sets that are too complex to be processed by
traditional data processing tools.
o Example: Data from social media platforms, IoT sensors, or global
weather data.
5. Source of Data
 Primary Data: Data collected directly from a source for a specific purpose,
often through surveys, experiments, or observations.
o Example: Survey results, experimental data.
 Secondary Data: Data that was collected for a different purpose but is being
used for a new analysis.
o Example: Government reports, academic research data.
6. Nature of Data Representation
 Discrete Data: Data that takes distinct, separate values, often in whole
numbers.
o Example: Number of students in a class.
 Continuous Data: Data that can take any value within a range, with infinite
possibilities.
o Example: Height, weight, temperature.
7. Data Quality and Reliability
 Accuracy: How close the data is to the true value.
 Completeness: Whether all required data is present.
 Consistency: Whether the data is consistent across different sources or over
time.
 Timeliness: Whether the data is up-to-date and relevant for the analysis.
 Validity: Whether the data is suitable for the intended purpose.
8. Contextual Nature of Data
 Contextual Relevance: The meaning of data can change based on the context in
which it is being used. For example, the number "100" could refer to dollars,
points, or units, depending on the context.
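To make the structured vs. semi-structured distinction above concrete, the Python sketch below parses a small JSON document (semi-structured), flattens part of it into a structured, row-like form, and applies simple completeness/validity checks of the kind listed under data quality. All data here is hypothetical.

```python
import json

# Semi-structured data: JSON carries tags/keys but no rigid table schema.
doc = '{"user": "asha", "posts": [{"likes": 3}, {"likes": 5}]}'
record = json.loads(doc)

# Flatten part of it into a structured, row-like form (user, total likes).
row = (record["user"], sum(p["likes"] for p in record["posts"]))

# Simple quality checks on hypothetical rows:
# completeness (no missing value) and validity (non-negative count).
rows = [("asha", 8), ("ravi", None), ("meena", -2)]
clean = [r for r in rows if r[1] is not None and r[1] >= 0]
```

Only the complete, valid row survives the checks; the row with a missing value and the one with a negative count are filtered out.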

Analytic processes and tools: -


Big Data Analytics is the process of collecting large volumes of structured and
unstructured data, segregating and analysing it, and discovering patterns and
other useful business insights from it.
These days, organizations are realizing the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency in their work environment.
Many big data tools and processes are being utilized by companies these days in
the process of discovering insights and supporting decision making.
Big data processing is a set of techniques or programming models to access
large-scale data to extract useful information for supporting and providing
decisions.
Below is the list of some of the data analytics tools used most in the industry:
 R Programming (Leading Analytics Tool in the industry)
 Python
 Excel
 SAS
 Apache Spark
 Splunk
 RapidMiner
 Tableau Public
 KNime

Analysis vs Reporting: -
Analysis: -
 Analytics is the process of taking the organized data and analysing it.
 This helps users to gain valuable insights on how businesses can improve
their performance.
 Analysis transforms data and information into insights.
 The goal of analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.

Reporting: -
 Once data is collected, it will be organized using tools such as graphs and
tables.
 The process of organizing this data is called reporting.
 Reporting translates raw data into information.
 Reporting helps companies to monitor their online business and be alerted
when data falls outside of expected ranges.
 Good reporting should raise questions about the business from its end users.
Conclusion:
 Reporting shows us “what is happening”.
 The analysis focuses on explaining “why it is happening” and “what we
can do about it”.
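The distinction can be sketched in a few lines of Python: the "reporting" step organizes raw orders into a summary, while the "analysis" step interprets that summary to suggest an action. The data and channel names are made up for illustration.

```python
# Toy example: reporting answers "what is happening";
# analysis digs into "why" and "what to do about it".

orders = [
    {"channel": "web", "amount": 120},
    {"channel": "web", "amount": 80},
    {"channel": "store", "amount": 40},
]

# Reporting: organize raw data into a summary (revenue per channel).
report = {}
for o in orders:
    report[o["channel"]] = report.get(o["channel"], 0) + o["amount"]

# Analysis: interpret the summary and recommend an action.
best = max(report, key=report.get)
insight = f"'{best}' drives most revenue; investigate why and invest there"
```

Here the report shows web revenue of 200 versus store revenue of 40, and the analysis step turns that summary into a recommendation.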

Reporting vs Analytics:

Purpose
  Reporting: Summarize and present data for informational purposes.
  Analytics: Unearth insights and patterns for strategic decision-making.

Benefits
  Reporting: Enables informed decision-making, tracks performance trends, and
  fosters transparency and accountability.
  Analytics: In addition, analytics helps you understand why things are
  happening and know what to do next.

Users
  Reporting: Primarily operational managers and executives.
  Analytics: Primarily data analysts, data scientists, and executives.

Data Presentation
  Reporting: Focus on simplicity and clarity, using visual aids to convey
  information efficiently.
  Analytics: In addition, analytics may employ advanced statistical methods and
  models for more in-depth analysis.

Data Source & Type
  Reporting: Typically relies on structured data from established sources.
  Analytics: May encompass a broader range, including unstructured, big, and
  real-time data.

Process
  Reporting: Data collection, organization, and presentation.
  Analytics: Data collection, organization, and presentation, plus data
  exploration, hypothesis testing, and advanced analysis.

Tool Complexity
  Reporting: Reporting tools are usually user-friendly and straightforward,
  making them accessible to a wide range of users without extensive technical
  training.
  Analytics: Self-service analytics tools are user-friendly, but advanced
  analysis and predictive modeling can require a higher level of technical
  expertise.

Modern data analytic tools: -


 These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency to their work environment.
 Many big data tools and processes are being utilised by companies these days
in the processes of discovering insights and supporting decision making.
 Data Analytics tools are types of application software that retrieve data from
one or more systems and combine it in a repository, such as a data warehouse,
to be reviewed and analysed.
 Most organizations use more than one analytics tool including spreadsheets
with statistical functions, statistical software packages, data mining tools, and
predictive modelling tools.
 Together, these Data Analytics Tools give the organization a complete
overview of the company to provide key insights and understanding of the
market/business so smarter decisions may be made.
 Data analytics tools not only report the results of the data but also explain why
the results occurred to help identify weaknesses, fix potential problem areas,
alert decision-makers to unforeseen events and even forecast future results
based on decisions the company might make.
Below is a list of some data analytics tools: -
o R Programming (Leading Analytics Tool in the industry)
o Python
o Excel
o SAS
o Apache Spark
o Splunk
o RapidMiner
o Tableau Public
o KNime
