Big Data Unit-I
Big Data
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts
of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge numbers of logs
from which users' buying trends can be traced.
Weather Stations: All weather stations and satellites produce very large volumes of data,
which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly; for this they store the data of millions of users.
Share Market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
IoT Appliances: Electronic devices that are connected to the internet create data for their
smart functionality; examples are a smart TV, smart washing machine, smart coffee machine,
smart AC, etc. This is machine-generated data created by sensors kept in various devices.
For example, a smart printing machine is connected to the internet, and a number of such
printing machines connected to a network can transfer data among each other.
Global Positioning System (GPS): GPS in a vehicle helps monitor its movement and
shorten the route to a destination, cutting fuel and time consumption. This system
creates huge amounts of data on vehicle position and movement.
Machine Data: Machine data is generated automatically, either in reaction to an event or
according to a set schedule. It is compiled from a variety of sources, including satellites,
desktop computers, mobile phones, industrial machines, smart sensors, SIEM logs, medical
and wearable devices, road cameras, IoT devices, and more.
1. Structured Data
Structured data can be crudely defined as the data that resides in a fixed
field within a record.
It is the type of data most familiar from everyday life, for example a birthday
or an address.
A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to
enhance the integrity of the data by creating a single record to depict an
entity. Relationships are enforced by the application of table constraints.
The business value of structured data lies within how well an organization
can utilize its existing systems and processes for analysis purposes.
2. Semi-Structured Data: -
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities
from each other.
Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for
computer programs.
This type of information typically comes from external sources such as
social media platforms or other web-based data feeds.
3. Unstructured Data:
Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a
video may be semi-structured, the actual data being dealt with is
unstructured.
Unstructured data is also known as “dark data” because it
cannot be analysed without the proper software tools.
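The three data types can be contrasted in a short sketch (the records below are invented for illustration, not taken from any real system):

```python
import json

# Structured: fixed fields; every record shares the same schema.
structured = [
    {"id": 1, "name": "Asha", "birthday": "1990-05-01"},
    {"id": 2, "name": "Ravi", "birthday": "1988-11-23"},
]

# Semi-structured: key-value pairs, but no rigid schema binds the record.
semi_structured = json.loads(
    '{"user": "Asha", "likes": ["tea", "cricket"], "profile": {"city": "Pune"}}'
)

# Unstructured: free text with no schema at all.
unstructured = "Loved the product! Delivery was late though..."

# Every structured record has the same set of properties; the semi-structured
# record is discerned through its key-value pairs instead.
assert all(set(r) == {"id", "name", "birthday"} for r in structured)
assert semi_structured["profile"]["city"] == "Pune"
```

Note how the semi-structured record would still parse correctly if the "profile" key were missing from one user and present in another, which is exactly what a rigid relational schema would not allow.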
Although the concept of big data itself is relatively new, the origins of large data sets go
back to the 1960s and '70s when the world of data was just getting started with the first data
centres and the development of the relational database.
Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open-source framework created
specifically to store and analyse big data sets) was developed that same year. NoSQL also
began to gain popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark)
was essential for the growth of big data because they make big data easier to work with and
cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are
still generating huge amounts of data, but it's not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to
the internet, gathering data on customer usage patterns and product performance. The
emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.
Big data platforms are specialized tools and software designed to efficiently
store, process, and analyse large datasets, enabling organizations to gain
valuable insights and make data-driven decisions.
Several factors drive the growth and adoption of Big Data solutions:
There is more than one workload type involved in big data systems, and they
are broadly classified as follows:
1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. The exploration of new interactive big data technologies and tools.
4. The use of machine learning and predictive analysis.
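The first two workload types can be contrasted in miniature: batch processing runs one job over a complete data set at rest, while stream processing keeps a running result up to date as each record arrives. A pure-Python sketch (the readings are made up; real systems would use tools such as Hadoop or Spark):

```python
# Batch: the whole dataset is at rest before processing starts.
readings = [3, 7, 2, 9, 4]            # e.g. sensor values collected overnight
batch_total = sum(readings)           # one job over the complete data set

# Streaming: records arrive one at a time; the result stays current in motion.
def stream_total(source):
    running = 0
    for value in source:              # each value is processed as it arrives
        running += value
        yield running                 # an up-to-date answer at every step

assert batch_total == 25
assert list(stream_total(readings))[-1] == batch_total
```

Both paths reach the same final answer; the difference is that the streaming version has a usable partial answer at every step, which is what makes real-time dashboards and alerts possible.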
• Data Sources: All of the sources that feed into the data extraction pipeline are
subject to this definition, so this is where the starting point for the big data
pipeline is located. Data sources, open and third-party, play a significant role in
architecture. Relational databases, data warehouses, cloud-based data
warehouses, SaaS applications, real-time data from company servers and
sensors such as IoT devices, third-party data providers, and also static files
such as Windows logs. The data managed can be processed in both batch and
real-time modes.
• Data Storage: Data is stored in distributed file stores that can hold big files in a
variety of formats. Large numbers of such files can also be stored in a data lake.
This covers the data managed for batch operations, which is saved in the file
stores. Examples include HDFS and blob containers on Microsoft Azure, AWS,
and GCP.
• Batch Processing: Data is split into chunks and processed by long-running jobs,
which filter, aggregate, and otherwise prepare it for analysis. These jobs
typically read from sources, process the data, and write the output to new files.
Multiple approaches to batch processing are employed, including Hive jobs,
U-SQL jobs, Sqoop or Pig, and custom map-reduce jobs written in Java, Scala,
or other languages such as Python.
• Reporting and Analysis: The generated insights must then be presented, which
is accomplished by reporting and analysis tools that use embedded technology
to produce useful graphs, analyses, and insights beneficial to the business.
Examples include Cognos and Hyperion.
• Orchestration: Big data solutions consist of repetitive data-processing tasks,
contained in workflow chains, that transform source data, move it between
sources and sinks, and load it into stores. Sqoop, Oozie, Data Factory, and
others are just a few examples.
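The batch-processing layer above can be sketched as a miniature map-reduce job in plain Python: a map phase emits (key, value) pairs and a reduce phase aggregates them per key. Word count is the classic example (this is an illustration of the idea, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word, as a mapper would.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Aggregate the counts per key, as a reducer would.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big insights", "data at rest"]
counts = reduce_phase(map_phase(lines))
assert counts["big"] == 2 and counts["data"] == 2
```

In a real cluster the map and reduce phases run in parallel across many machines over files in HDFS; the logic per record, however, is exactly this simple.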
The importance of big data does not revolve around how much data a company
has but how a company utilizes the collected data. Every company uses data in
its own way; the more efficiently a company uses its data, the more potential it
has to grow. The company can take data from any source and analyse it to find
answers that enable:
1) Cost Savings: - Big Data tools like Hadoop and cloud-based analytics can
bring cost advantages to businesses when large amounts of data are to be
stored, and these tools also help in identifying more efficient ways of
doing business.
2) Time Reductions: - The high speed of tools like Hadoop and in-memory
analytics can easily identify new sources of data, which helps businesses
analyse data immediately and make quick decisions based on what they learn.
3) Understand the market conditions: - By analysing big data you can get a
better understanding of current market conditions. For example, by
analysing customers' purchasing behaviour, a company can find out which
products sell the most and produce products according to this trend.
By this, it can get ahead of its competitors.
4) Control online reputation: - Big data tools can do sentiment analysis.
Therefore, you can get feedback about who is saying what about your
company. If you want to monitor and improve the online presence of your
business, big data tools can help with all of this.
5) Using Big Data Analytics to Boost Customer Acquisition and Retention: -
The customer is the most important asset any business depends on.
There is no single business that can claim success without first having to
establish a solid customer base. However, even with a customer base, a
business cannot afford to disregard the high competition it faces. If a business
is slow to learn what customers are looking for, then it is very easy to begin
offering poor quality products. In the end, loss of clientele will result, and this
creates an adverse overall effect on business success. The use of big data
allows businesses to observe various customer related patterns and trends.
Observing customer behaviour is important to trigger loyalty.
6) Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights: - Big data analytics can help change all business
operations. This includes the ability to match customer expectations, change
the company's product line, and of course ensure that the marketing campaigns
are powerful.
7) Big Data Analytics As a Driver of Innovations and Product Development: -
Another huge advantage of big data is the ability to help companies innovate
and redevelop their products.
Big-Data Analytics: -
Big Data Analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It's like sifting through a giant mountain of
data to find the gold nuggets of insight.
Collecting Data: Data comes from various sources such as social
media, web traffic, sensors, and customer reviews.
Cleaning the Data: Imagine having to assess a pile of rocks that included
some gold pieces in it. You would have to clean the dirt and the debris first.
When data is being cleaned, mistakes must be fixed, duplicates must be
removed and the data must be formatted properly.
Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the
same thing as looking for a specific pattern in all those rocks that you sorted
through.
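The cleaning step can be sketched in plain Python (the records and rules below are invented for illustration): fix formatting, drop records that cannot be repaired, and remove duplicates.

```python
raw = [
    {"name": " Asha ", "age": "29"},
    {"name": "Ravi",   "age": "31"},
    {"name": " Asha ", "age": "29"},   # duplicate record
    {"name": "",       "age": "??"},   # broken record
]

def clean(records):
    seen, out = set(), []
    for r in records:
        name = r["name"].strip()        # fix formatting (stray whitespace)
        if not name or not r["age"].isdigit():
            continue                    # drop records that cannot be repaired
        key = (name, r["age"])
        if key in seen:
            continue                    # remove duplicates
        seen.add(key)
        out.append({"name": name, "age": int(r["age"])})
    return out

assert clean(raw) == [{"name": "Asha", "age": 29}, {"name": "Ravi", "age": 31}]
```

Real pipelines do the same three things, just at scale and with tools like Spark or pandas instead of hand-written loops.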
Big Data Analytics is a powerful tool which helps to unlock the potential of large and
complex datasets. To get a better understanding, let's break it down into key steps:
Data Collection: Data is the core of Big Data Analytics. It is the gathering of
data from different sources such as the customers’ comments, surveys, sensors,
social media, and so on. The primary aim of data collection is to compile as
much accurate data as possible. The more data, the more insights.
Data Processing: Next comes data processing. This stage involves cleaning,
structuring, and formatting the data so that it is usable for analysis. It is
like a chef gathering ingredients before cooking: data processing turns the
data into a format suited for analytics tools to process.
Data Storage and Management: Storing and managing the analyzed data properly is
of utmost importance. It is like digital scrapbooking: you may want to go back
to those insights later, so how you store them matters greatly. Moreover, data
protection and adherence to regulations are key issues to be addressed during
this crucial stage.
Big Data Analytics comes in many different types, each serving a different purpose:
4. Prescriptive Analytics: This category not only predicts results but also
offers recommendations for action to achieve the best outcome. In e-commerce,
it may suggest the best price for a product to achieve the highest possible
profit.
7. Text Analytics: Text analytics delves into the unstructured data of text. In the
hotel business, it can use the guest reviews to enhance services and guest
satisfaction.
Big Data Analytics relies on various technologies and tools that might sound
complex; let's simplify them:
Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.
NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform can
provide you with the right information when you need it.
Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.
Python and R: Python and R are like magic tools for data scientists. They use
these languages to solve tricky problems. For example, Kaggle uses them to
predict things like house prices based on past data.
Machine Learning Frameworks (e.g., TensorFlow): Machine learning
frameworks are the tools that make predictions. Airbnb
uses TensorFlow to predict which properties are most likely to be booked in
certain areas. It helps hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics and help
organizations gather, process, understand, and visualize data, making it easier for
them to make decisions based on information.
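As an illustration of the kind of prediction mentioned above, a toy version can be written in a few lines of pure Python: an ordinary least-squares fit of house price against size (the numbers are invented for the sketch, not real Kaggle data):

```python
# Toy house-price prediction by ordinary least-squares, in pure Python.
sizes  = [1000, 2000, 3000]      # square feet (invented data)
prices = [200, 300, 400]         # price in thousands (invented data)

n = len(sizes)
mx = sum(sizes) / n              # mean size
my = sum(prices) / n             # mean price
slope = sum((x - mx) * (y - my) for x, y in zip(sizes, prices)) \
        / sum((x - mx) ** 2 for x in sizes)
intercept = my - slope * mx

def predict(sqft):
    # Predicted price (in thousands) for a house of the given size.
    return intercept + slope * sqft

# The made-up data is exactly linear, so the fit recovers it perfectly.
assert abs(predict(1500) - 250) < 1e-9
```

Libraries like scikit-learn in Python, or lm() in R, do exactly this fit (and far more sophisticated ones) on datasets with thousands of features instead of one.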
Big Data Analytics offers a host of real-world advantages, and let's understand with
examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps
them make smart choices about what products to stock. This not only reduces
waste but also keeps customers happy and profits high.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It's like having a guardian
that watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver
your packages faster and with less impact on the environment. It's like taking
the fastest route to your destination while also being kind to the planet.
While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:
Privacy Concerns: With the vast amount of personal data used, like in
Facebook's ad targeting, there's a fine line between providing personalized
experiences and infringing on privacy.
Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect
fraudulent activities, but they must also protect this information from breaches.
Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of your
financial assets.
Manufacturing: Companies like General Electric (GE) use Big Data Analytics
to predict machinery maintenance needs, reducing downtime and enhancing
operational efficiency.
Big data has revolutionized the way businesses operate, but it has also
presented a number of challenges for conventional systems. Here are some of
the challenges faced by conventional systems in handling big data. Big data is a
term used to describe the large amount of data that can be stored and analysed
by computers. Big data is often used in business, science and government. Big
Data has been around for several years now, but it's only recently that people
have started realizing how important it is for businesses to use this technology
in order to improve their operations and provide better services to customers. A
lot of companies have already started using big data analytics tools because
they realize how much potential there is in utilizing these systems effectively!
However, while there are many benefits associated with using such systems,
including faster processing times as well as increased accuracy, there are also
some challenges involved with implementing them correctly:
● Scalability
● Speed
● Storage
● Data Integration
● Security
Scalability: -
A common problem with conventional systems is that they can't scale. As the
amount of data increases, so does the time it takes to process and store it. This
can cause bottlenecks and system crashes, which are not ideal for businesses
looking to make quick decisions based on their data. Conventional systems also
lack flexibility in how they handle new types of information: for example,
adding another column (columns are like fields) or row (rows are like records)
may require rewriting all your code from scratch.
Speed: -
Storage: -
The amount of data being created and stored is growing exponentially, with
estimates that it will reach 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add
more data. This leads to huge amounts of wasted storage space and lost
information due to corruption or security breaches.
Data Integration: -
Security: -
Security is a major challenge for enterprises that depend on conventional
systems to process and store their data. Traditional databases are designed to be
accessed by trusted users within an organization, but this makes it difficult to
ensure that only authorized people have access to sensitive information.
Security measures such as firewalls, passwords and encryption help protect
against unauthorized access and attacks by hackers who want to steal data or
disrupt operations. But these security measures have limitations: They're
expensive; they require constant monitoring and maintenance; they can slow
down performance if implemented too extensively; and they often don't
prevent breaches altogether because there's always some way around them
(such as through phishing emails). Conventional systems are not equipped for
big data. They were designed for a different era, when the volume of
information was much smaller and more manageable. Now that we're dealing
with huge amounts of data, conventional systems are struggling to keep up.
Conventional systems are also expensive and time-consuming to maintain; they
require constant maintenance and upgrades in order to meet new demands from
users who want faster access speeds and more features than ever before.
Benefits of IDA
1. Better Decisions: Companies can make informed decisions based on accurate
and up-to-date data analysis.
2. Competitive Advantage: By identifying market opportunities, trends, and
risks, businesses can gain a competitive edge.
3. Increased Efficiency: IDA helps optimize business processes by identifying
inefficiencies and improving overall operations.
Nature of data: -
The "nature of data" refers to the inherent characteristics and attributes that
define data in terms of its type, structure, quality, and how it can be used,
analyzed, or interpreted. Understanding the nature of data is crucial for
selecting the right analysis techniques and tools. Here’s an overview of the key
aspects that make up the nature of data:
1. Type of Data
Qualitative (Categorical) Data: Data that describes qualities or
characteristics. It can be divided into categories but cannot be measured
numerically.
o Example: Gender, color, type of product.
Quantitative (Numerical) Data: Data that is expressed in numerical terms and
can be measured or counted.
o Example: Height, weight, age, or sales revenue.
2. Measurement Levels
Data can be classified into different levels of measurement based on how the data is
structured:
Nominal: Data used to label or categorize without any order or ranking.
o Example: Colors, gender, country names.
Ordinal: Data that has a meaningful order, but the differences between values
are not consistent.
o Example: Ranking of preferences (1st, 2nd, 3rd), education levels (high
school, bachelor's, master's).
Interval: Data with a consistent difference between values, but no true zero
point.
o Example: Temperature in Celsius or Fahrenheit.
Ratio: Data with a true zero point and consistent intervals, making it possible
to calculate ratios.
o Example: Weight, height, income, age.
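The four levels determine which operations are meaningful on the data, which a short Python sketch can illustrate (the values are invented for illustration):

```python
from statistics import mode, median

colors  = ["red", "blue", "red"]   # nominal: only counting/mode is meaningful
ranks   = [1, 2, 2, 3]             # ordinal: order matters, so the median works
temps_c = [10, 20, 30]             # interval: differences meaningful, ratios not
weights = [50, 60, 70]             # ratio: true zero, so ratios make sense

assert mode(colors) == "red"             # most frequent category
assert median(ranks) == 2                # middle rank
assert temps_c[1] - temps_c[0] == 10     # a 10-degree difference is meaningful
assert weights[2] / weights[0] == 1.4    # "40% heavier" is valid for ratio data
```

Note that 20°C is not "twice as hot" as 10°C (Celsius has no true zero), but 70 kg genuinely is 1.4 times 50 kg; that asymmetry is exactly the interval-versus-ratio distinction.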
3. Structure of Data
Structured Data: Data that is organized into a predefined format, typically in
rows and columns. It can be easily analyzed and stored in databases.
o Example: Data in SQL databases or Excel spreadsheets.
Unstructured Data: Data that does not have a fixed format and is not easily
categorized. It includes things like text, images, audio, and video.
o Example: Social media posts, images, video files, emails.
Semi-Structured Data: Data that does not follow a strict structure but still has
some organizational properties, often in the form of tags or markers.
o Example: JSON files, XML documents.
4. Scale and Size
Small-Scale Data: Data that is limited in scope, usually handled by simple
tools or small-scale software.
o Example: A single store's transaction data.
Big Data: Extremely large data sets that are too complex to be processed by
traditional data processing tools.
o Example: Data from social media platforms, IoT sensors, or global
weather data.
5. Source of Data
Primary Data: Data collected directly from a source for a specific purpose,
often through surveys, experiments, or observations.
o Example: Survey results, experimental data.
Secondary Data: Data that was collected for a different purpose but is being
used for a new analysis.
o Example: Government reports, academic research data.
6. Nature of Data Representation
Discrete Data: Data that takes distinct, separate values, often in whole
numbers.
o Example: Number of students in a class.
Continuous Data: Data that can take any value within a range, with infinite
possibilities.
o Example: Height, weight, temperature.
7. Data Quality and Reliability
Accuracy: How close the data is to the true value.
Completeness: Whether all required data is present.
Consistency: Whether the data is consistent across different sources or over
time.
Timeliness: Whether the data is up-to-date and relevant for the analysis.
Validity: Whether the data is suitable for the intended purpose.
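These quality dimensions can be expressed as simple programmatic checks. A minimal sketch (the record and field names are invented for illustration):

```python
from datetime import date

record = {"sales": 120, "region": "west", "as_of": date(2024, 1, 31)}
required = {"sales", "region", "as_of"}

# Completeness: all required fields are present.
complete = required <= set(record)
# Validity: the value is of the right type and in a sensible range.
valid = isinstance(record["sales"], int) and record["sales"] >= 0
# Timeliness: the record is recent enough for the analysis at hand.
timely = record["as_of"] >= date(2024, 1, 1)

assert complete and valid and timely
```

Consistency and accuracy need a second source to compare against, which is why they are usually checked by reconciling two systems rather than by inspecting a single record.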
8. Contextual Nature of Data
Contextual Relevance: The meaning of data can change based on the context in
which it is being used. For example, the number "100" could refer to dollars,
points, or units, depending on the context.
Analysis vs Reporting: -
Analysis: -
Analytics is the process of taking the organized data and analysing it.
This helps users to gain valuable insights on how businesses can improve
their performance.
Analysis transforms data and information into insights.
The goal of analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.
Reporting: -
Once data is collected, it will be organized using tools such as graphs and
tables.
The process of organizing this data is called reporting.
Reporting translates raw data into information.
Reporting helps companies to monitor their online business and be alerted
when data falls outside of expected ranges.
Good reporting should raise questions about the business from its end users.
Conclusion:
Reporting shows us “what is happening”.
The analysis focuses on explaining “why it is happening” and “what we
can do about it”.
Reporting vs Analytics

Purpose
  Reporting: Summarize and present data for informational purposes.
  Analytics: Unearth insights and patterns for strategic decision-making.
Benefits
  Reporting: Enables informed decision-making, tracks performance trends, and fosters transparency and accountability.
  Analytics: In addition, analytics helps you understand why things are happening and know what to do next.
Data Source & Type
  Reporting: Typically relies on structured data from established sources.
  Analytics: May encompass a broader range, including unstructured, big data, and real-time data.
Tool Complexity
  Reporting: Reporting tools are usually user-friendly and straightforward, making them accessible to a wide range of users without extensive technical training.
  Analytics: Self-service analytics tools are user-friendly, but advanced analysis and predictive modeling can require a higher level of technical expertise.