BDU1
❖ Digital Data
▪ In the computing world, digital data is considered a collection of facts that is transmitted
and stored in an electronic format and processed through software systems.
▪ Digital data is data that represents other forms of data using specific machine language
systems that can be interpreted by various technologies.
▪ Digital Data is generated by various devices, like desktops, laptops, tablets, mobile
phones, and electronic sensors.
▪ Digital data is stored as strings of binary values (0s and 1s) on a storage medium that’s
either internal or external to the devices generating or accessing the information.
▪ For Example - Whenever you send an email, read a social media post, or take pictures
with your digital camera, you are working with digital data.
▪ Structured Data :- Structured data is one of the types of big data; by structured data, we
mean data that can be processed, stored, and retrieved in a fixed format.
▪ This is the data which is in an organized form (e.g., in rows and columns) and can be
easily used by a computer program.
▪ It refers to highly organized information that can be readily and seamlessly stored and
accessed from a database by simple search engine algorithms.
▪ For example - the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an organized
manner.
▪ Unstructured Data :- Data that does not conform to a data model or data schema is
known as unstructured data.
▪ Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze
unstructured data.
▪ About 80–90% of an organization's data is in this format; for example, memos, chat
rooms, PowerPoint presentations, images, videos, letters, research papers, white papers,
the body of an email, etc.
▪ For example - To play a video file, it is essential that the correct codec (coder-
decoder) is available. Unstructured data cannot be directly processed or queried using
SQL. Email is an example of unstructured data.
▪ Semi-Structured Data :- Semi-structured data has a defined level of structure and
consistency, but is not relational in nature. Instead, semi-structured data is
hierarchical or graph-based. This kind of data is commonly stored in files that contain
text.
▪ This is data that does not conform to a data model but has some structure.
Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze.
• Data analysis, data analytics and Big Data originate from the longstanding domain of
database management. They rely heavily on the storage, extraction, and optimization
techniques that are common for data stored in Relational Database Management
Systems (RDBMS).
• Database management and data warehousing are considered the core components of Big
Data Phase 1. It provides the foundation of modern data analysis as we know it today,
using well-known techniques such as database queries, online analytical processing and
standard reporting tools.
▪ Since the early 2000s, the Internet and the Web began to offer unique data collections
and data analysis opportunities. With the expansion of web traffic and online stores,
companies such as Yahoo, Amazon and eBay started to analyse customer behaviour by
analysing click-rates, IP-specific location data and search logs. This opened a whole new
world of possibilities.
• From a data analysis, data analytics, and Big Data point of view, HTTP-based web traffic
introduced a massive increase in semi-structured and unstructured data. Besides the
standard structured data types, organizations now needed to find new approaches and
storage solutions to deal with these new data types in order to analyse them effectively.
The arrival and growth of social media data greatly amplified the need for tools,
technologies and analytics techniques that were able to extract meaningful information
out of this unstructured data.
• Although web-based unstructured content is still the main focus for many organizations
in data analysis, data analytics, and big data, new possibilities for retrieving valuable
information are emerging from mobile devices.
• Mobile devices not only give the possibility to analyze behavioral data (such as clicks
and search queries), but also give the possibility to store and analyze location-based data
(GPS-data). With the advancement of these mobile devices, it is possible to track
movement, analyze physical behavior and even health-related data (number of steps you
take per day). This data provides a whole new range of opportunities, from
transportation, to city design and health care.
• A big data platform generally consists of big data storage, servers, databases, big data
management, business intelligence and other big data management utilities. It also
supports custom development, querying and integration with other systems.
• The primary benefit of a big data platform is that it reduces the complexity of multiple
vendors and solutions into one cohesive solution. Big data platforms are also delivered
through the cloud, where the provider offers all-inclusive big data solutions and
services.
✓ Ability to accommodate new applications and tools depending on the evolving business
needs.
✓ Provide the tools for scouring through massive data sets.
✓ 1. The digitization of society: - Big Data is largely consumer driven and consumer
oriented. Most of the data in the world is generated by consumers, who are nowadays
‘always-on’. Most people now spend 4-6 hours per day consuming and generating data
through a variety of devices and (social) applications. With every click, swipe or
message, new data is created in a database somewhere around the world. Because
everyone now has a smartphone in their pocket, data creation adds up to
incomprehensible amounts. Some studies estimate that 60% of data was generated within
the last two years, which is a good indication of the rate with which society has digitized.
therefore capacity) still doubles every two years. The plummeting of technology costs
has been depicted in the figure below.
Besides the plummeting of the storage costs, a second key contributing factor to the
affordability of Big Data has been the development of open source Big Data software
frameworks. The most popular software framework (nowadays considered the standard
for Big Data) is Apache Hadoop, for distributed storage and processing. Because these
software frameworks are freely available as open source, it has become increasingly
inexpensive to start Big Data projects in organizations.
✓ 4. Increased knowledge about data science :- In the last decade, the terms data science
and data scientist have become tremendously popular. In October 2012, Harvard
Business Review called data scientist the "sexiest job of the 21st century", and many
other publications have featured this new job role in recent years. The demand for data
scientists (and similar job titles) has increased tremendously and many people have
actively become engaged in the domain of data science. As a result, knowledge and
education about data science have greatly professionalized and more information becomes
available every day. While statistics and data analysis mostly remained an academic field
previously, it is quickly becoming a popular subject among students and the working
population.
✓ 5. Social media applications :- Everyone understands the impact that social media has
on daily life. However, in the study of Big Data, social media plays a role of paramount
importance. Not only because of the sheer volume of data that is produced every day
through platforms such as Twitter, Facebook, LinkedIn and Instagram, but also because
social media provides nearly real-time data about human behaviour.
Social media data provides insights into the behaviors, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable
to anyone who is able to derive meaning from these large quantities of data. Social media
data can be used to identify customer preferences for product development, target new
customers for future purchases, or even target potential voters in elections. Social media
data might even be considered one of the most important business drivers of Big Data.
✓ 6. The upcoming internet of things (IoT) :- The Internet of things (IoT) is the network
of physical devices, vehicles, home appliances and other items embedded with
electronics, software, sensors, actuators, and network connectivity which enables these
objects to connect and exchange data. It is increasingly gaining popularity as consumer
goods providers start including ‘smart’ sensors in household appliances. Whereas the
average household in 2010 had around 10 devices that connected to the internet, this
number is expected to rise to 50 per household by 2020. Examples of these devices
include thermostats, smoke detectors, televisions, audio systems and even smart
refrigerators.
❖ Data Sources :- Data is sourced from multiple inputs in a variety of formats, including
both structured and unstructured. Data sources, open and third-party, play a significant
role in architecture. Relational databases, data warehouses, cloud-based data warehouses,
SaaS applications, real-time data from company servers and sensors such as IoT devices,
third-party data providers, and also static files such as Windows logs, comprise several
data sources. Both batch processing and real-time processing of this data are possible.
❖ Data Storage:- Data is stored in file stores that are distributed in nature and can hold
large files in a variety of formats. Large numbers of big files in different formats can also
be stored in a data lake. This is the data that is managed for batch operations and saved in
the file stores. Storage options include HDFS and blob containers on Microsoft Azure,
AWS, and GCP.
❖ Batch Processing:- Chunks of data are processed by long-running jobs, which filter,
aggregate, and otherwise prepare the data for analysis. These jobs typically read source
files, process them, and deliver the processed output to new files.
Multiple approaches to batch processing are employed, including Hive jobs, U-SQL jobs,
Sqoop or Pig, and custom MapReduce jobs written in Java, Scala, or other languages
such as Python.
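As a rough illustration of such a batch job, the sketch below uses PySpark to read raw files from a data lake, filter and aggregate them, and write the processed output to new files. The paths, the event_type/user_id/amount columns, and the JSON input format are illustrative assumptions, not part of any particular platform.

```python
# Minimal PySpark batch-processing sketch: read raw source files, filter,
# aggregate, and write the processed result to new files.
# Paths and column names (event_type, user_id, amount) are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-aggregation").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/events/")            # source files in the data lake

purchases = (raw
             .filter(F.col("event_type") == "purchase")       # filter step
             .groupBy("user_id")                               # aggregate step
             .agg(F.count("*").alias("purchase_count"),
                  F.sum("amount").alias("total_spent")))

# Deliver the processed data to a new set of files for downstream analysis.
purchases.write.mode("overwrite").parquet("hdfs:///data/processed/purchases/")
spark.stop()
```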
❖ Real-time message ingestion:- If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer
for messages, and to support scale-out processing, reliable delivery, and other message
queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
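A minimal sketch of dropping messages into such an ingestion buffer is shown below, using the kafka-python client. The broker address, topic name, and event fields are illustrative assumptions; Event Hubs or IoT Hub would use their own SDKs.

```python
# Minimal sketch of real-time message ingestion into a Kafka topic (kafka-python).
# Broker address, topic name, and the event fields are illustrative assumptions.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A device or application drops each event onto the ingestion buffer;
# downstream stream processors consume from the same topic.
event = {"device_id": "sensor-42", "temperature": 21.7, "ts": time.time()}
producer.send("iot-events", value=event)
producer.flush()
```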
❖ Stream processing:- After capturing real-time messages, the solution must process them
by filtering, aggregating, and otherwise preparing the data for analysis. The processed
stream data is then written to an output sink. Azure Stream Analytics provides a managed
stream processing service based on perpetually running SQL queries that operate on
unbounded streams. You can also use open source Apache streaming technologies like
Storm and Spark Streaming in an HDInsight cluster.
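The sketch below shows one possible shape of such a stream processor, using Spark Structured Streaming to read the messages ingested above, aggregate them continuously, and write to an output sink. The broker, topic, schema, and console sink are illustrative assumptions; a real deployment would also need the Spark-Kafka connector package and a durable sink.

```python
# Minimal Spark Structured Streaming sketch: consume ingested messages,
# aggregate the unbounded stream, and write results to an output sink.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("stream-processing").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType()))

events = (spark.readStream
          .format("kafka")                                   # assumes the Kafka connector is available
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "iot-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Average temperature per device, continuously updated as new messages arrive.
averages = events.groupBy("device_id").agg(F.avg("temperature").alias("avg_temp"))

query = (averages.writeStream
         .outputMode("complete")
         .format("console")                                  # stand-in for a real analytical data store
         .start())
query.awaitTermination()
```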
❖ Analytics-Based Data-store:- To analyze the already processed data, analytical tools
query a data store based on HBase or another NoSQL data warehouse technology. The
data can be presented through a Hive database, which provides a metadata abstraction
over the files in the data store, or queried interactively. NoSQL databases such as HBase,
or engines such as Spark SQL, are also available.
❖ Analysis and reporting: - Most Big Data platforms are geared to extracting business
insights from the stored data via analysis and reporting. This requires multiple tools.
Structured data is relatively easy to handle, while more advanced and specialized
techniques are required for unstructured data. Data scientists may undertake interactive
data exploration using various notebooks and tool-sets. A data modeling layer might also
be included in the architecture, which may also enable self-service BI using popular
visualization and modeling techniques. Analytics results are sent to the reporting
component, which replicates them to various output systems for human viewers, business
processes, and applications. After visualization into reports or dashboards, the analytic
results are used for data-driven business decision making.
❖ Orchestration:- Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple
sources and sinks, load the processed data into an analytical data store, or push the results
straight to a report or dashboard. To automate these workflows, you can use an
orchestration technology such as Azure Data Factory or Apache Oozie with Sqoop.
❖ 1. Volume:
✓ The name ‘Big Data’ itself is related to a size which is enormous.
✓ Volume is a huge amount of data.
✓ To determine the value of data, the size of the data plays a very crucial role. If the volume of
data is very large, it is actually considered 'Big Data'. This means that whether particular
data can actually be considered Big Data or not depends upon the volume of data.
✓ Hence, while dealing with Big Data, it is necessary to consider the characteristic 'Volume'.
✓ Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2
billion GB) per month. Also, by the year 2020 there were expected to be almost 40,000
exabytes of data.
❖ 2. Velocity:
✓ Velocity refers to the high speed of accumulation of data.
✓ In Big Data velocity data flows in from sources like machines, networks, social media,
mobile phones etc.
✓ There is a massive and continuous flow of data. This determines the potential of the
data: how fast the data is generated and processed to meet demand.
✓ Sampling data can help in dealing with issues like velocity.
✓ Example: More than 3.5 billion searches per day are made on Google. Also,
Facebook users are increasing by approximately 22% year over year.
❖ 3. Variety:
✓ It refers to the nature of data, which can be structured, semi-structured, or unstructured.
✓ It also refers to heterogeneous sources.
✓ Variety is basically the arrival of data from new sources that are both inside and outside
of an enterprise. It can be structured, semi-structured and unstructured.
❖ 4. Veracity:
✓ It refers to inconsistencies and uncertainty in data: available data can sometimes be
messy, and its quality and accuracy are difficult to control.
✓ Big Data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources.
✓ Example: Data in bulk could create confusion, whereas too little data could convey
only half or incomplete information.
❖ 5. Value:
✓ After taking the four V's into account, there comes one more V, which stands for Value.
Bulk data that has no value is of no good to the company unless you turn it into
something useful.
o Data in itself is of no use or importance; it needs to be converted into
something valuable in order to extract information.
Given below are the main technology components of big data.
✓ 5. Boost Customer Acquisition and Retention :- Customers are a vital asset on
which any business depends. No single business can achieve success without
building a robust customer base. But even with a solid customer base, companies
cannot ignore the competition in the market. If we do not know what our customers
want, the company's success will suffer, resulting in a loss of clientele and an
adverse effect on business growth. Big data analytics helps businesses identify
customer-related trends and patterns. Customer behavior
analysis leads to a profitable business.
✓ 6. Solve Advertisers Problem and Offer Marketing Insights :- Big data analytics
shapes all business operations. It enables companies to fulfil customer expectations.
Big data analytics helps in changing the company’s product line. It ensures powerful
marketing campaigns.
✓ 7. The driver of Innovations and Product Development :- Big data makes
companies capable of innovating and redeveloping their products.
Big data has found many applications in various fields today. The major fields where
big data is being used are as follows.
✓ Government :- Big data analytics has proven to be very useful in the government
sector. Big data analysis played a large role in Barack Obama’s successful 2012 re-
election campaign. Also most recently, Big data analysis was majorly responsible for the
BJP and its allies to win a highly successful Indian General Election 2014. The Indian
Government utilizes numerous techniques to ascertain how the Indian electorate is
responding to government action, as well as ideas for policy augmentation.
✓ Social Media Analytics :- The advent of social media has led to an outburst of big
data. Various solutions have been built in order to analyse social media activity; for
example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights
Big Data platform, can make sense of the chatter. Social media can provide valuable real-
time insights into how the market is responding to products and campaigns. With the
help of these insights, the companies can adjust their pricing, promotion, and campaign
placements accordingly. Before utilizing big data, some pre-processing needs to be done
in order to derive intelligent and valuable results. Thus, to understand the consumer
mind-set, applying intelligent decisions derived from big data is necessary.
✓ Technology :- The technological applications of big data span companies that deal with
huge amounts of data every day and put it to use for business decisions as well. For
example, eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a
40 PB Hadoop cluster, for search, consumer recommendations, and merchandising;
together these systems make up eBay's roughly 90 PB of data warehousing.
Amazon.com handles millions of back-end operations every day, as well as queries from
more than half a million third-party sellers. The core technology that keeps Amazon
running is Linux-based and as of 2005, they had the world’s three largest Linux
databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion
photos from its user base. Windermere Real Estate uses anonymous GPS signals from
nearly 100 million drivers to help new home buyers determine their typical drive times to
and from work throughout various times of the day.
✓ Fraud Detection :- For businesses whose operations involve any type of claims or
transaction processing, fraud detection is one of the most compelling Big Data
application examples. Historically, fraud detection on the fly has proven an elusive goal.
In most cases, fraud is discovered long after the fact, at which point the damage has been
done and all that’s left is to minimize the harm and adjust policies to prevent it from
happening again. Big Data platforms that can analyze claims and transactions in real
time, identifying large-scale patterns across many transactions or detecting anomalous
behavior from an individual user, can change the fraud detection game.
✓ Call Center Analytics :- Now we turn to the customer-facing Big Data application
examples, of which call center analytics are particularly powerful. What’s going on in a
customer’s call center is often a great barometer and influencer of market sentiment, but
without a Big Data solution, much of the insight that a call center can provide will be
overlooked or discovered too late. Big Data solutions can help identify recurring
problems or customer and staff behavior patterns on the fly not only by making sense of
time/quality resolution metrics but also by capturing and processing call content itself.
✓ Banking :- The use of customer data invariably raises privacy issues. By uncovering
hidden connections between seemingly unrelated pieces of data, big data analytics could
potentially reveal sensitive personal information. Research indicates that 62% of bankers
are cautious in their use of big data due to privacy issues. Further, outsourcing of data
analysis activities or distribution of customer data across departments for the generation
of richer insights also amplifies security risks. There have been incidents in which customer
details such as earnings, savings, mortgages, and insurance policies ended up in the wrong
hands. Such incidents reinforce
concerns about data privacy and discourage customers from sharing personal information
in exchange for customized offers.
✓ Agriculture :- A biotechnology firm uses sensor data to optimize crop efficiency. It
plants test crops and runs simulations to measure how plants react to various changes in
condition. Its data environment constantly adjusts to changes in the attributes of various
data it collects, including temperature, water levels, soil composition, growth, output, and
gene sequencing of each plant in the test bed. These simulations allow it to discover the
optimal environmental conditions for specific gene types.
✓ Marketing :- Marketers have begun to use facial recognition software to learn how
well their advertising succeeds or fails at stimulating interest in their products. A recent
study published in the Harvard Business Review looked at what kinds of advertisements
compelled viewers to continue watching and what turned viewers off. Among their tools
was “a system that analyses facial expressions to reveal what viewers are feeling.” The
research was designed to discover what kinds of promotions induced watchers to share
the ads with their social network, helping marketers create ads most likely to “go viral”
and improve sales.
✓ Smart Phones :- Perhaps more impressive, people now carry facial recognition
technology in their pockets. Users of iPhone and Android smartphones have applications
at their fingertips that use facial recognition technology for various tasks. For example,
Android users with the remember app can snap a photo of someone and then bring up
stored information about that person based on their image when their own memory lets
them down, a potential boon for salespeople.
✓ Telecom :- Nowadays big data is used in many different fields, and in telecom it also
plays a very important role. Operators face an uphill challenge when they need to deliver new,
compelling, revenue-generating services without overloading their networks and keeping
their running costs under control. The market demands a new set of data management and
analysis capabilities that can help service providers make accurate decisions by taking
into account customer, network context and other critical aspects of their businesses.
Most of these decisions must be made in real time, placing additional pressure on the
operators. Real-time predictive analytics can help leverage the data that resides in their
multitude systems, make it immediately accessible and help correlate that data to
generate insight that can help them drive their business forward.
✓ Healthcare :- Traditionally, the healthcare industry has lagged behind other industries
in the use of big data. Part of the problem stems from resistance to change: providers are
accustomed to making treatment decisions independently, using their own clinical
judgment, rather than relying on protocols based on big data. Other obstacles are more
structural in nature. This is one of the best places to set an example for a Big Data
application. Even within a single hospital, payor, or pharmaceutical company, important
Application. Even within a single hospital, payor, or pharmaceutical company, important
information often remains siloed within one group or department because organizations
lack procedures for integrating data and communicating findings. Health care
stakeholders now have access to promising new threads of knowledge. This information
is a form of “big data,” so called not only for its sheer volume but for its complexity,
diversity, and timeliness. Pharmaceutical industry experts, payers, and providers are now
beginning to analyse big data to obtain insights. Recent technologic advances in the
industry have improved their ability to work with such data, even though the files are
enormous and often have different database structures and technical characteristics.
➢ 1). Easy Result Formats :- Results are an imperative part of a big data analytics model,
as they support the decision-making process used to set future strategy and goals.
Scientists prefer to get results in real time so that they can take better and more appropriate
decisions based on the analysis results. The tools must be able to
produce a result in such a way that it can provide insights into data analysis and decision-
making platform. The platform should be able to provide the real-time streams that can
help in making instant and quick decisions.
➢ 2). Raw Data Processing :- Here, data processing means collecting and organizing
data in a meaningful manner. Data modeling takes complex data sets and displays them
in a visual form such as a diagram or chart. The data should be interpretable and digestible
so that it can be used in making decisions. Big data analytics tools must be able to
import data from various data sources such as Microsoft Access, text files, Microsoft
Excel and other files. Tools must be able to collect data from multiple data sources and in
multiple formats. In this way the need for data conversion is reduced and overall
process speed is improved. Export capability also matters: the ability to visualize data
sets and handle formats like PDF, Excel, or Word files means the data can be used directly
for collection and transfer. The features listed below are essential for data processing
tools (a small sketch follows the list):
✓ Data Mining
✓ Data Modeling
✓ File Exporting
✓ Data File Sources
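As a small illustration of these features, the sketch below uses pandas to import data from a text (CSV) file and an Excel file, combine them, build a simple summary, and export the result. The file and column names are illustrative assumptions.

```python
# Minimal raw-data-processing sketch: import from several source formats,
# combine, summarise, and export. File and column names are assumptions.
import pandas as pd

csv_part = pd.read_csv("sales_2023.csv")           # text file source
excel_part = pd.read_excel("sales_2024.xlsx")      # Microsoft Excel source

combined = pd.concat([csv_part, excel_part], ignore_index=True)

# Simple data modeling step: summarise the combined records by region.
summary = combined.groupby("region")["revenue"].sum().reset_index()

summary.to_excel("revenue_by_region.xlsx", index=False)     # file exporting
```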
➢ 3). Prediction apps or Identity Management :- Identity management is also a required
and essential feature for any data analytics tool. The tool should be able to access any
system and all related information that may be related to the computer hardware,
software or any other individual computer. Here, the identity management system is also
related to managing all issues related to the identity, data protection, and access so that it
can support system and network passwords and protocols. It should be clear whether a
user can access the system or not, and to what level system access permission is granted.
Identity management applications and systems ensure that only authenticated users can
access system information, and the tool or system must be able to organize a security
plan that includes fraud analytics and real-time security.
➢ 4). Reporting Feature :- Businesses remain on top with the help of reporting features.
Data should be fetched from time to time and represented in a well-organized manner.
This way decision-makers can take timely decisions and handle critical situations as
well, especially in a rapidly moving environment. Data tools use dashboards to present
KPIs and metrics. The reports must be customizable and target data set oriented. The
expected capabilities of reporting tools are Real-time reporting, dashboard management,
and location-based insights.
➢ 5). Security Features :- For any successful business, it is essential to keep its data safe.
The tools used for big data analytics should offer safety and security for the data. For
this there should be an SSO (single sign-on) feature, with which a user does not need to
sign in multiple times during the same session; the same login works across the tools
while user activities and accounts are monitored. Moreover, data encryption is also an
imperative feature that should be provided by Big Data analytics tools. Encryption means
changing data from a readable form into an unreadable one by using algorithms and
codes (a small sketch follows). Sometimes automatic encryption is also offered by web
browsers, and comprehensive encryption capabilities are also offered by data analytics
tools. Single sign-on and data encryption are two of the most used and popular security
features.
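The sketch below illustrates the data encryption idea with the Python cryptography package's Fernet recipe: readable data is turned into an unreadable token and can only be restored by holders of the key. This is a generic example, not the mechanism of any specific analytics tool.

```python
# Minimal symmetric-encryption sketch using the `cryptography` package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, store this in a key-management service
cipher = Fernet(key)

record = b"account=12345; balance=9200.50"
token = cipher.encrypt(record)     # readable data becomes an unreadable token
restored = cipher.decrypt(token)   # only holders of the key can reverse it

assert restored == record
```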
➢ 6). Fraud Management :- A variety of fraud detection functionalities are involved in
fraud analytics. When it comes to fraud detection activities, businesses often focus on
how they will deal with fraud rather than on preventing it. Fraud detection can be
performed by data analytics tools. The tools should be able to perform repeated tests on
the data at any time just to ensure that nothing is amiss. In this way, threats can be
identified quickly and efficiently with effective fraud analytics and identity management
capabilities.
➢ 7). Technologies Support :- Your data analytics tool must support the latest tools and
technologies, especially those that are important for your organization. One of the most
important is A/B testing, also known as bucket or split testing, in which two webpage
versions are compared to determine which performs better. Both versions are compared
on the basis of how users interact with the webpage, and the better one is adopted.
Moreover, as far as technical support is concerned, your tool must be able to integrate
with Hadoop, a set of open-source programs that can work as the backbone of
data-analytics activities. Hadoop mainly involves the following four modules with which
integration is expected (a word-count sketch follows the list):
✓ MapReduce: Reads data from the file system and processes it into a form that can be
analysed and presented visually.
✓ Hadoop Common: The collection of Java tools and libraries required to read data stored
in the user's file system.
✓ YARN: Manages system resources so that data can be stored and analysis can be
performed.
✓ Distributed File System (HDFS): Allows data to be stored across the cluster in an easily
accessible format. If a tool's results are integrated with these Hadoop modules, the user
can easily send the results back to the user's system. In this way flexibility,
interoperability and two-way communication can be ensured between organizations.
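To make the MapReduce idea concrete, the sketch below shows the classic word count written as two small Python scripts that could be run through Hadoop Streaming (which pipes data through stdin/stdout). The file names mapper.py and reducer.py, and the exact job-submission command, are assumptions that depend on the cluster setup.

```python
# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - sums the counts for each word. Hadoop sorts the mapper output
# by key, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```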
➢ 8). Version Control :- Most data analytics work involves adjusting the parameters of
data analytics models, which can cause problems when changes are pushed into production.
A version control feature in big data analytics tools improves the ability to track changes
and makes it possible to restore previous versions whenever needed.
➢ 9). Scalability :- Data will not stay the same; it will grow as your organization grows.
With big data tools it is easy to scale up as new data is collected for the company, and
that data can be analyzed as expected. The meaningful insights derived from the data can
also be integrated with the existing data successfully.
➢ 10). Quick Integrations :- With integration capabilities, it is easy to share data and
results with developers and data scientists. Big data tools support quick integration with
cloud apps, data warehouses, other databases, etc.
Big data security is the collective term for all the measures and tools used to guard both
the data and analytics processes from attacks, theft, or other malicious activities that
could harm or negatively affect them. Big Data can come in a mix of structured formats
(organized into rows and columns containing numbers, dates, etc) or unstructured (social
media data, PDF files, emails, images, etc).
• 1. Cloud Security Monitoring :- Cloud computing generally offers more efficient
communication and increased profitability for all businesses. This communication needs
to be secure. Big data security offers cloud application monitoring. Such solutions host
sensitive data and also monitor cloud-hosted infrastructure. Solutions also offer support
across several relevant cloud platforms.
• 2. Network Traffic Analysis :- Traffic continually moves in and out of your network.
Due to the high volume of data over the network, it is difficult to maintain transactional
visibility over the network traffic. Security analytics allow your enterprise to watch over
this network traffic. It is used to establish baselines and detect anomalies. This also helps
in cloud security monitoring. It is used to analyze traffic in and out of cloud
infrastructure. It also illuminates dark spaces hidden in infrastructures and analyzes
encrypted sensitive data, thus ensuring the proper working of channels.
• 3. Insider Threat Detection :- Insider threats are as much of a danger to your enterprise
as external threats. An active malicious user can do as much damage as any malware
attack, although it is only in rare cases that an insider threat can destroy a network.
• 4. Threat Hunting :- Generally, the IT security team engages in threat hunting:
searching for potential indicators of dwelling threats and breaches that attempt to attack the
IT infrastructure. Security analytics helps to automate this threat hunting, acting as an
extra set of eyes for your threat hunting efforts. Threat hunting automation can help detect
malware beaconing activity and raise alerts so it can be stopped as soon as possible.
• 5. Incident Investigation :- Generally, the sheer number of security alerts from SIEM
solutions would overwhelm your IT security team. These continuous alerts can foster
burnout and frustration. Thus, to minimize this issue, security analytics
automates the incident investigation by providing contextualization to alerts. Thus your
team has more time to prioritize incidents and can deal with potential breach incidents
first.
• 6. User Behaviour Analysis :- An organization's users interact with the IT
infrastructure all the time, and it is largely user behavior that decides the success or
failure of your cybersecurity. There is therefore a need to track user behavior. Security
analytics monitors unusual behavior by employees and thus helps to detect an
insider threat or a malicious account. It can also detect suspicious patterns by correlating
malicious activities. One renowned security analytics use case is UEBA (user and entity
behaviour analytics), which provides visibility into the IT environment by compiling user
activities from multiple datasets into complete profiles; a small anomaly-detection sketch follows.
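The sketch below illustrates the idea: an Isolation Forest (scikit-learn) flags unusual rows in a table of user activity. The feature names and sample values are illustrative assumptions, not a prescribed UEBA schema.

```python
# Minimal user-behaviour anomaly detection sketch with an Isolation Forest.
import pandas as pd
from sklearn.ensemble import IsolationForest

activity = pd.DataFrame({
    "login_hour":    [9, 10, 9, 11, 10, 3],          # a 3 a.m. login stands out
    "mb_sent":       [120, 150, 130, 140, 160, 9000],
    "failed_logins": [0, 0, 1, 0, 0, 7],
})

model = IsolationForest(contamination=0.2, random_state=0)
activity["anomaly"] = model.fit_predict(activity)     # -1 marks suspected outliers

print(activity[activity["anomaly"] == -1])            # rows worth investigating
```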
• 7. Data Exfiltration Detection :- Data exfiltration is any unauthorized movement of
data into or out of a network. Unauthorized data movements can
cause theft and leakage of data. Thus there is a need to protect data from such
unauthorized access. Security analytics helps to detect the data exfiltration over a
network. It is generally used to detect data leakage in encrypted communications.
• 8. Advertisement :- With the help of security analytics, organizations can easily detect
insider threats, anticipated through behaviors such as abnormal login times,
unusual email usage, and unauthorized database access requests. Sometimes it also looks
for indicators that require visibility into third-party actors.
1. Informed Consent
To consent means that you give uncoerced permission for something to happen to
you. Informed consent is the most careful, respectful and ethical form of consent. It requires the
data collector to make a significant effort to give participants a reasonable and accurate
understanding of how their data will be used. In the past, informed consent for data collection
was typically taken for participation in a single study. Big data makes this form of consent
impossible as the entire purpose of big data studies, mining and analytics is to reveal patterns
and trends between data points that were previously inconceivable. In this way, consent cannot
possibly be ‘informed’ as neither the data collector nor the study participant can reasonably
know or understand what will be garnered from the data or how it will be used. Revisions to the
standard of informed consent have been introduced. The first is known as ‘broad consent’, which
pre-authorises secondary uses of data. The second is ‘tiered consent’, which gives consent to
specific secondary uses of data, for example, for cancer research but not for genomic research.
Some argue that these newer forms of consent are a watering down of the concept and leave
users open to unethical practices.
2. Privacy
The ethics of privacy involve many different concepts such as liberty, autonomy, security, and in
a more modern sense, data protection and data exposure. You can understand the concept of big
data privacy by breaking it down into three categories:
The scale and velocity of big data pose a serious concern as many traditional privacy processes
cannot protect sensitive data, which has led to an exponential increase in cybercrime and data
leaks.
One example of a significant data leak that caused a loss of privacy to over 200 million internet
users happened in January 2021. A rising Chinese social media site called Socialarks suffered a
breach due to a series of data protection errors that included an unsecured ElasticSearch
database. A hacker was able to access and scrape the database which stored:
• Names
• Phone numbers
• Email addresses
• Profile descriptions
• Follower and engagement data
• Locations
• LinkedIn profile links
• Connected social media account login names
A further concern is the growing analytical power of big data, i.e. how this can impact privacy
when personal information from various digital platforms can be mined to create a full picture of
a person without their explicit consent. For example, if someone applies for a job, information
can be gained about them via their digital data footprint to identify political leanings, sexual
orientation, social life, etc. All of this data could be used as a reason to reject an employment
application even though the information was not offered up for judgement by the applicant.
3. Ownership
When we talk about ownership in big data terms, we steer away from the traditional or legal
understanding of the word as the exclusive right to use, possess, and dispose of property. Rather,
in this context, ownership refers to the redistribution of data, the modification of data, and the
ability to benefit from data innovations.
In the past, legislators have ruled that as data is not property or a commodity, it, therefore,
cannot be stolen - this belief offers little protection or compensation to internet users and
consumers who provide valuable information to companies without personal benefit.
In this context, ownership involves two key rights:
• The right to control data - to edit, manage, share and delete data
• The right to benefit from data - to profit from the use or sale of data
Contrary to common belief, those who generate data, for example, Facebook users, do not
automatically own the data. Some even argue that the data we provide to use ‘free’ online
platforms is in fact a payment for that platform. But big data is big money in today’s world.
Many internet users feel that the current balance is tilted against them when it comes to
ownership of data and the transparency of companies who use and profit from the data we share.
In recent years, the idea of monetising personal data has gained traction; the ideology aims to
give ownership of data back to the user and balance the market by allowing users to sell their
data legally. This is a highly contentious field of legislation, and some argue that to designate
data as a commodity is to lose our autonomy and freedoms.
Algorithms are designed by humans, the data sets they study are selected and prepared by
humans, and humans have bias. So far, there is significant evidence to suggest that human
prejudices are infecting technology and algorithms, and negatively impacting the lives and
freedoms of humans. Particularly those who exist within the minorities of our societies.
The so-called “coded bias” has been identified in such high-profile cases as MIT lab researcher
Joy Buolamwini’s discovery of racial skin-type bias from commercial artificial intelligence
systems created by giant US companies. Buolamwini found that the software had been trained
on datasets of 77% male pictures and more than 83% white-skinned pictures. These biased
datasets created a situation wherein the programs misidentify white male faces at an error
rate of only 0.8%, whereas dark-skinned female faces are misidentified at an error rate of 20%
in one case and 34% in the other two. These biases extend beyond racial and gendered lines and into
the issues of criminal profiling, poverty and housing.
Algorithm biases have become such an ingrained part of everyday life that they have also been
documented as impacting our personal psyches and thought processes. The phenomenon occurs
when we perceive our reality to be a reflection of what we see online. However, what we view is
often a tailored reality created by algorithms and personalised using our previous viewing habits.
The algorithm shows us content that we are most likely to enjoy or agree with and discards the
rest. When filter bubbles like this exist they create echo chambers and, in extreme cases, can
lead to radicalisation, sectarianism and social isolation.
The big data divide seeks to define the current state of data access; the understanding and mining
capabilities of big data are concentrated in the hands of a few major corporations. These divides
create ‘haves’ and ‘have nots’ in big data and exclude those who lack the necessary financial,
educational and technological resources to access and analyse big datasets.
Tim Berners-Lee has argued that the big data divide separates individuals from data that could
be highly valuable to their wellbeing. And despite the growing industry of applications that use
data to enhance our lives in terms of health, finance, etc., there is currently no way for
individuals to mine their own data or connect potential data silos missed by commercial
software. Again, we face the ethical problem of who owns the data we generate; if our data is not
ours to modify, analyse and benefit from on our own terms, then indeed we do not own it.
The data divide creates further problems when we consider algorithm biases that place
individuals in categories based on a culmination of data that individuals themselves cannot
access. For example, profiling software can mark a person as a high-risk potential for
committing criminal activity, causing them to be legally stop-and-searched by authorities or
even denied housing in certain areas. The big data divide means that the ‘data poor’ cannot
understand the data or methods used to make these decisions about them and their lives.
Unlike organizations dealing with small data, organizations handling big data need
sophisticated analytics tools. You must also employ qualified professionals to analyze the
data, identify security threats, and suggest strategies for mitigating them. During the
compliance enforcement process, managing big data takes more resources than handling
small data. However, organizations benefit from managing big data, which helps to obtain
direct predictive information about the probability of an attack. In auditing, auditors
are likely to adopt more rigorous steps when using this type of data than when using
small data. As such, the use of big data analytics is one of the surest ways of building
some of the organization's most robust security systems.
How Big Data is Used in the Process of Compliance
Big data assists the creation of a comprehensive risk assessment framework by:
• Fraudulent Crime Prevention: The use of big data strengthens the approach to
predictive analysis, which is an effective way of detecting criminal activities such as
money laundering. If a compliance officer uses big data for internal audits, cyber risks
are discovered, and they intervene to avoid their occurrence. It speeds up the process of
compliance and builds trust among your clients.
• Managing Third-Party Threats: If you are in the process of obtaining compliance
certifications, you must appropriately manage the risk associated with sharing data with
vendors. Big data analytics can help you manage vendor-related risks. You accomplish
this by carefully evaluating vendors' ability to protect your data before sharing it with
them.
• Helps in Customer Service: You are required to prove that your customers are pleased
with how you treat their data before you get any compliance certification. If you apply
big data analytics, you will understand your customer’s behavior, which will directly
influence the decision-making process, thereby enabling the compliance process.
What Are Data Protection Principles?
Data protection principles help protect data and make it available under any
circumstances. It covers operational data backup and business continuity/disaster
recovery (BCDR) and involves implementing aspects of data management and data
availability.
• Data availability—ensuring users can access and use the data required to perform
business even when this data is lost or damaged.
• Data lifecycle management—involves automating the transmission of critical data to
offline and online storage.
• Information lifecycle management—involves the valuation, cataloging, and protection
of information assets from various sources, including facility outages and disruptions, application
and user errors, machine failure, and malware and virus attacks.
Data privacy is a guideline for how data should be collected or handled, based on its
sensitivity and importance. Data privacy is typically applied to personal health
information (PHI) and personally identifiable information (PII). This includes financial
information, medical records, social security or ID numbers, names, birthdates, and
contact information.
Data privacy concerns apply to all sensitive information that organizations handle,
including that of customers, shareholders, and employees. Often, this information plays a
vital role in business operations, development, and finances.
Data privacy helps ensure that sensitive data is only accessible to approved parties. It
prevents criminals from being able to maliciously use data and helps ensure that
organizations meet regulatory requirements.
Big data ethics is a branch of ethics that evaluates data practices: how data is collected,
generated, analyzed, and distributed. As the world expands its digital footprint,
data collected has the potential to impact people and thus society. With big data scandals
left and right, users are voicing their concerns about their online privacy. Companies
must adhere to a data ethics framework to maintain the trust of existing and future
customers as well as business partners. A framework exists to provide transparency, so
the public knows what data you collect and why. People everywhere want to feel
reassured that their data doesn’t fall into the wrong hands. Much of this data consists of
Personally identifiable information (PII):
✓ Full name
✓ Birthdate
✓ Street address
✓ Phone number
✓ Social security number
✓ Credit card information
✓ Bank account information
✓ Passport number
Once an organization fails to act ethically, it’s no secret that it damages the company’s
brand and reputation. Similarly, after the many data scandals that occurred in the past
couple of years, people lost trust in companies that manipulated customers' data. However,
these scandals don't just consist of data manipulation and sale. Housing data and keeping
it safe from harm’s way also is part of big data ethics. Some of the top data breaches ever
to occur had lasting effects on brand trustworthiness. Therefore, adopting a concrete big
data ethics framework is essential for the success of any large organization. Companies
must act as information protectors as long as they choose to collect it.
The bigger the data central to the business, the higher its risk of violating customer
privacy and individual rights. In 2022, the responsibility to actively manage data privacy
and security falls on roles within the large organization.
✓ Privacy
When users submit their information, it’s with the expectation that companies will keep it
to themselves. Two common scenarios exist when this information is no longer private:
o A data breach
o A sale of information to a third party
As data collection grows, users expect talented IT professionals to be able to protect their data. If a
data breach occurs, the company fails to meet privacy expectations. Furthermore, in the
21st century, consumers expect large companies to have the means to protect data if they
choose to collect it.
✓ Lack of transparency
Many users are unaware their information is being collected. In addition to being
unaware, companies go to great lengths to make their data collection inconspicuous.
Websites add cookie opt-ins on pop-ups so that users accept them quickly just to see the
page. After getting a user to submit information, some companies don’t disclose how
they use a person’s data. Long lists of legal documentation are formatted in a way no
user expects to read through it. It's only after scandals or some type of media reporting
that people discover the company's data collection methods are unsatisfactory.
✓ Lack of governance
Before big data, the method to collect information was simple. People either gave you a
physical copy of their information or they didn’t. Companies stored the physical files
under lock and key. Although someone could always potentially steal someone’s
identity, criminals had difficulty doing so to masses of people at one time. Now in a new
age of information abundance, we have users unknowingly submitting heaps of
information. The possibilities of using that information with AI and algorithms to
someone’s advantage are endless. In some countries, politicians are now creating laws to
hinder certain actions. However, because big data collection is still relatively new, many
laws don’t exist.
Algorithms make assumptions about users. Depending on the assumption, they can begin
to discriminate. For example, court systems started using algorithms to evaluate the
criminal risk of potential defendants and have used this data while sentencing. If the data
over-represents a certain gender, nationality, or race, then the results carry bias against
groups outside of those specific groups.
✓ Big Data Ethics Framework and Other Ways to Resolve Ethical Issues
The government and large businesses now create a big data ethics framework to avoid
ethical issues in the big data industry. While receiving initial consent, a company should
develop competencies that voice how they use the data in an easily digestible manner.
Most businesses have a set of mission statements and values. Now, many will also house
a big data ethics framework as well.
In an era of multi-cloud computing, data owners must keep up with both the pace of data
growth and the proliferation of regulations that govern it, especially regulations protecting
the privacy of sensitive data and personally identifiable information (PII). With more data
spread across more locations, the business risk of a privacy breach has never been higher,
and with it, consequences ranging from high fines to loss of market share. Big data privacy
is also a matter of customer trust. The more data you collect about users, the easier it gets to
"connect the dots:" to understand their current behavior, draw inferences about their future
behavior, and eventually develop deep and detailed profiles of their lives and preferences.
The more data you collect, the more important it is to be transparent with your customers
about what you're doing with their data, how you're storing it, and what steps you're taking
to comply with regulations that govern privacy and data protection. The volume and
velocity of data from existing sources, such as legacy applications and e-commerce, is
expanding fast. You also have new (and growing) varieties of data types and sources, such
as social networks and IoT device streams. To keep pace, your big data privacy strategy
needs to expand, too.
As organizations store more types of sensitive data in larger amounts over longer periods
of time, they will be under increasing pressure to be transparent about what data they
collect, how they analyze and use it, and why they need to retain it. The European Union's
General Data Protection Regulation (GDPR) is a high-profile example. More government
agencies and regulatory organizations are following suit. To respond to these growing
demands, companies need reliable, scalable big data privacy tools that encourage and help
people to access, review, correct, anonymize, and even purge some or all of their personal
and sensitive information.
✓ Prediction 2: New big data analytic tools will enable organizations to perform deeper
analysis of legacy data, discover uses for which the data wasn't originally intended, and
combine it with new data sources. Big data analytics tools and solutions can now dig into
data sources that were previously unavailable, and identify new relationships hidden in
legacy data. That’s a great advantage when it comes to getting a complete view of your
enterprise data especially for customer 360 and analytics initiatives. But it also raises
questions about the accuracy of aging data and the ability to track down entities for
consent to use their information in new ways.
Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer preferences
that can help organizations make informed business decisions. Big Data analytics deals
with collection, storage, processing and analysis of this massive scale data. Specialized
tools and frameworks are required for big data analysis when:
✓ (1) the volume of data involved is so large that it is difficult to store, process and
analyse data on a single machine,
✓ (2) the velocity of data is very high and the data needs to be analysed in real-time,
✓ (3) there is variety of data involved, which can be structured, unstructured or
semi-structured, and is collected from multiple data sources,
✓ (4) various types of analytics need to be performed to extract value from the data
such as descriptive, diagnostic, predictive and prescriptive analytics.
Big data analytics involves several steps starting from data cleansing, data munging (or
wrangling), data processing and visualization. Big data analytics life-cycle starts from the
collection of data from multiple data sources. Specialized tools and frameworks are required to
ingest the data from different sources into the big data analytics backend. The data is stored in
specialized storage solutions (such as distributed file systems and non-relational databases)
which are designed to scale.
Based on the analysis requirements (batch or real-time), and type of analysis to be performed
(descriptive, diagnostic, predictive, or prescriptive), specialized frameworks are used. Big data
analytics is enabled by several technologies such as cloud computing, distributed and parallel
processing frameworks, non-relational databases, in-memory computing, for instance. Some
examples of big data are listed as follows:
✓ Data generated by social networks including text, images, audio and video data
✓ Click-stream data generated by web applications such as e-Commerce to analyse user
behaviour
✓ Machine sensor data collected from sensors embedded in industrial and energy systems
for monitoring their health and detecting failures
✓ Healthcare data collected in electronic health record (EHR) systems
✓ Logs generated by web applications
✓ Stock markets data
✓ Transactional data generated by banking and financial applications
❖ Types of Analytics
➢ Descriptive Analytics :- Descriptive Analytics is the examination of data or content,
usually manually performed, to answer the question “What happened?” (or What is
happening?), characterized by traditional business intelligence (BI) and visualizations
such as pie charts, bar charts, line graphs, tables, or generated narratives. Descriptive
analytics is the interpretation of historical data to better understand changes that have
occurred in a business. For example, computing the total number of likes for a particular
post, computing the average monthly rainfall or finding the average number of visitors
per month on a website.
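As a small, concrete example of descriptive analytics, the sketch below (assuming Python with pandas; the figures are invented) computes the kind of historical summaries mentioned above, such as the average number of visitors per month.

# Descriptive analytics sketch: summarizing what happened (assumes pandas; data is invented).
import pandas as pd

visits = pd.DataFrame({
    "month": ["2023-01", "2023-01", "2023-02", "2023-02", "2023-03"],
    "visitors": [1200, 900, 1500, 1100, 1300],
})

total_visits = visits["visitors"].sum()
avg_visitors_per_month = visits.groupby("month")["visitors"].sum().mean()

print("Total visits:", total_visits)
print("Average visitors per month:", avg_visitors_per_month)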
Page 25
❖ Big data analytics advantages and disadvantages
➨Big data analysis derives innovative solutions. Big data analysis helps in
understanding and targeting customers. It helps in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health through the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast information via surveys and get answers to any query.
➨New data is added every second.
➨A single platform can carry an enormous amount of information.
Page 26
❖ Challenges of conventional systems
The challenges in Big Data are the real implementation hurdles. They require immediate
attention, because if they are not handled the technology may fail, which can also lead to
unpleasant results.
Page 27
• Fault tolerance is another technical challenge, and fault-tolerant
computing is extremely hard, involving intricate algorithms.
• New technologies such as cloud computing and big data are designed so
that, whenever a failure occurs, the damage stays within an acceptable
threshold, i.e. the whole task should not have to begin again from scratch.
– Scalability:
• Big data projects can grow and evolve rapidly. The scalability demands of Big
Data have led it towards cloud computing.
• This raises challenges such as how to run and schedule the various jobs so
that the goal of each workload is achieved cost-effectively.
• It also requires dealing with system failures in an efficient manner, which
again raises the question of what kinds of storage devices should be used.
IDA (intelligent data analysis), in general, includes three stages: (1) preparation of data; (2) data
mining; (3) data validation and explanation. The preparation of data involves selecting the
required data from the related data source and incorporating it into a data set that can be used for
data mining. The main goal of intelligent data analysis is to obtain knowledge. Data analysis is a
process that combines extracting data from a data set, analyzing it, classifying it, organizing it,
reasoning about it, and so on. It is challenging to choose suitable methods to resolve the
complexity of the process. Regarding the term visualization, we have moved away from it in
favour of the term charting. The term analysis is used for the process of incorporating,
influencing, filtering and scrubbing the data, which certainly includes, but is not limited to,
interacting with the data through charts.
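The three IDA stages can be illustrated with a minimal sketch, assuming Python with scikit-learn and a toy dataset (this is an illustrative choice, not a method prescribed by the text): prepare a data set, mine it with a simple model, and validate the knowledge obtained on held-out data.

# IDA sketch: (1) prepare data, (2) mine it, (3) validate and explain the result.
# Assumes scikit-learn; the dataset and the model choice are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# (1) Preparation: select the required data and split it for later validation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (2) Data mining: learn a model (here a decision tree) from the prepared data.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# (3) Validation and explanation: check the obtained knowledge on unseen data.
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))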
Page 28
❖ Nature of Data
➢ What is Data?
➢ Properties of data
• 1. Consistency
The element of consistency removes room for contradictory data. Rules will have to be
set around consistency metrics, which include range, variance, and standard deviation.
• 2. Accuracy
Quality data must remain error-free and precise, which means it should be
free of erroneous information, redundancy, and typing errors. Error ratio and deviation
are two examples of accuracy metrics.
• 3. Completeness
The data should be complete, without any missing data. For data quality tools to
work, all data entries should be complete, with no room for lapses. The completeness
metric is defined as the percentage of complete data records (a short computational
sketch follows this list of properties).
• 4. Auditability
The ability to trace data and analyze its changes over time adds to the auditability
dimension of data quality. Examples of auditability metrics are the percentage of
gaps in data sets, modified data, and untraceable or disconnected data.
• 5. Validity
Quality data in terms of validity indicates that all data is aligned with the existing
formatting rules. An example of a validity metric is the percentage of data records in the
required format.
• 6. Uniqueness
There will be no overlapping of data and it will be recorded only once. The same data
may be used in multiple ways, but it will remain unique. Uniqueness metrics are defined
by the percentage of repeated values.
• 7. Timeliness
Page 29
For data to retain its quality, it should be recorded promptly so that changes are captured.
Tracking data weekly rather than annually is the way to maintain timeliness. An example of a
timeliness metric is time variance.
• 8. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
• 9. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data is clearly
identified and readily available from manual, automated, or other systems and records.
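Several of the metrics above can be computed directly. Below is a minimal sketch, assuming Python with pandas and a hypothetical customer table; the column names and the e-mail format rule are illustrative, not taken from the text.

# Data quality metric sketch: completeness, uniqueness and validity percentages.
# Assumes pandas; the table and the e-mail format rule are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@x.com", "not-an-email", "c@x.com"],
})

# Completeness: percentage of records with no missing fields.
completeness = customers.notna().all(axis=1).mean() * 100

# Uniqueness: percentage of key values that are repeated (lower is better).
repeated_keys = customers["customer_id"].duplicated(keep=False).mean() * 100

# Validity: percentage of records whose e-mail matches the required format.
validity = customers["email"].str.match(r"^\S+@\S+\.\S+$", na=False).mean() * 100

print(f"Completeness: {completeness:.0f}%  Repeated keys: {repeated_keys:.0f}%  Validity: {validity:.0f}%")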
❖ Types of data
➢ Categorical Data
➢ Nominal Data
Nominal values represent discrete units and are used to label variables that have no
quantitative value. Just think of them as "labels". Note that nominal data has no
order: if you changed the order of its values, the meaning would not change.
Page 30
Thus, nominal data are observed but not measured; they are unordered, non-equidistant, and have
no meaningful zero. The only numerical activity you can perform on nominal data is to state that
one observation is (or is not) equal to another (equality or inequality), and you can use this to
group them. You cannot perform other numerical operations, as those are reserved for numerical
data. With nominal data, you can calculate frequencies, proportions, percentages, and central
points such as the mode (as the sketch after the example list below illustrates).
Examples of Nominal data:
• English
• German
• French
• Punjabi
• American
• Indian
• Japanese
• German
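As noted above, frequencies, proportions, percentages and the mode are the meaningful summaries for nominal data. A minimal sketch, assuming Python with pandas and reusing the example labels from the list above:

# Nominal data sketch: only counting-style summaries are meaningful (assumes pandas).
import pandas as pd

languages = pd.Series(["English", "German", "French", "German", "Punjabi", "English", "German"])

print(languages.value_counts())                # frequencies per label
print(languages.value_counts(normalize=True))  # proportions (multiply by 100 for percentages)
print("Mode (most frequent label):", languages.mode()[0])
# Ordering or averaging these labels would be meaningless for nominal data.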
➢ Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore almost the
same as nominal data, except that its ordering matters: the categories can be ranked,
e.g. 1st, 2nd, and so on. However, the relative distances between adjacent categories are
not necessarily equal.
Examples of Ordinal data:
✓ Opinion
o Agree
o Mostly agree
o Neutral
o Mostly disagree
o Disagree
✓ Time of day
o Morning
o Noon
o Night
➢ Numerical Data
This data type quantifies things by using numerical values that make them countable or
measurable. The price of a smartphone, the discount offered, the number of ratings on a
product, the processor frequency of a smartphone, or the RAM of that particular phone all
fall under the category of quantitative data types. It is one of the simplest data types to
understand. As the name says, it represents a numerical value and helps answer questions
like how many, how much, how long, etc. For example,
Page 31
✓ The number of apples
✓ The number of students in a class
✓ The height/weight of a person
Numerical data attempts to quantify items by measuring numerical variables. The key point is
that a continuous numerical variable can take an infinite number of values: for example, the
height of a person can vary from x cm to y cm and can be further broken down into fractional
values.
➢ Interval Data
Interval data are measured and ordered, with equal spacing between adjacent items, but have no
meaningful zero. The central point of an interval scale is that the word 'interval' signifies 'space in
between', which is the significant thing to recall: interval scales not only tell us about the order but
also about the value between each item.
Even though interval data can appear very similar to ratio data, the difference lies in their defined
zero-points. If the zero-point of the scale has been chosen arbitrarily, then the data cannot be ratio
data and must be interval data. Examples of interval data: temperature measured in degrees
Celsius or Fahrenheit, and calendar dates.
➢ Ratio Data
Ratio data are measured and ordered with equidistant items and a meaningful zero, and, unlike
interval data, can never be negative. An outstanding example of ratio data is the measurement of
height: it could be measured in centimetres, inches, metres, or feet, and it is not possible to have a
negative height. Ratio data tell us about the order of variables and the differences between them,
and they have an absolute zero. This permits a wide range of calculations and inferences to be
performed and drawn. Examples of ratio data: height, weight, and age.
In general, data is any set of characters that is gathered and translated for some purpose, usually
analysis. If data is not put into context, it means nothing to a human or a computer. There are
multiple types of data. Some of the more common types of data include the following:
✓ Single character
✓ Boolean (true or false)
✓ Text (string)
Page 32
✓ Number (integer or floating-point)
✓ Picture
✓ Sound
✓ Video
In a computer's storage, digital data is a sequence of bits (binary digits) that have the value
one or zero. Data is processed by the CPU, which uses logical operations to produce new
data (output) from source data (input).
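A short sketch in Python of the common data types listed above, and of the fact that every value is ultimately stored as a sequence of bits; the variable names are illustrative.

# Common data types and a peek at their bit representation (illustrative values).
single_char = "A"      # single character
flag = True            # Boolean (true or false)
text = "big data"      # text (string)
count = 42             # number (integer)
price = 3.14           # number (floating-point)

# Every value is ultimately stored as bits, e.g. the integer 42 and the
# character "A" (code point 65) shown as binary strings:
print(bin(count))             # 0b101010
print(bin(ord(single_char)))  # 0b1000001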
❖ Analytic processes
✓ 1. Deployment
✓ 2. Business Understanding
✓ 3. Data Exploration
✓ 4. Data Preparation
✓ 5. Data Modeling
✓ 6. Data Evaluation
Page 33
Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring and maintenance, and
– deploy the results of the analysis in this phase.
• Whenever any requirement occurs, we first need to determine the business objective, and
• the data collected from the various sources is described in terms of its application and the need
for the project in this phase.
Page 34
Step 4: Data Preparation
• From the data collected in the earlier steps,
– we need to select the data as per the need, clean it, and construct it to get useful information
(see the sketch after this step);
• the data is selected, cleaned, and integrated into the format finalized for the analysis in this phase;
– test cases are built for assessing the model, and the model is tested and implemented on the data
in this phase.
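A minimal data-preparation sketch, assuming Python with pandas and two hypothetical source tables (the column names are made up), showing selection, cleaning and integration into one analysis-ready format:

# Step 4 sketch: select, clean and integrate data for analysis.
# Assumes pandas; the source tables and columns are hypothetical.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2, None], "amount": [10.0, 25.5, None, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "wholesale"]})

# Select the needed columns and clean missing values.
orders_clean = orders[["customer_id", "amount"]].dropna()

# Construct: enforce the finalized types/format for analysis.
orders_clean = orders_clean.astype({"customer_id": "int64"})

# Integrate (join) the sources into the format used for the analysis phase.
prepared = orders_clean.merge(customers, on="customer_id", how="left")
print(prepared)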
❖ Analytic tools
Thus, BDA tools are used throughout the development of BDA applications.
❖ Analysis vs Reporting
❖ Reporting: The process of organizing data into informational summaries in order to
monitor how different areas of a business are performing.
❖ Analysis: The process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.
• Data reporting: Gathering data into one place and presenting it in visual representations
Page 35
❖ Comparing analysis with reporting
✓ Reporting is “the process of organizing data into informational summaries in order to
monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a slidedeck, or online
dashboard — falls under this category.
✓ Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.”
• Reporting helps companies to monitor their online business and be alerted to when data falls
outside of expected ranges.
✓ Good reporting
• should raise questions about the business from its end users.
✓ The goal of analysis is
• to answer questions by interpreting the data at a deeper level and providing actionable
recommendations.
• A firm may be focused on the general area of analytics (strategy, implementation, reporting,
etc.)
– but not necessarily on the specific aspect of analysis.
• It’s almost like some organizations run out of gas after the initial set-up-related activities and
don’t make it to the analysis stage
✓ Reports are like robots: they monitor and alert you. Analysis is like a parent: it can
figure out what is actually going on (hungry, dirty diaper, no pacifier, teething, tired, ear
infection, etc.).
Page 36
✓ Reporting and analysis can go hand-in-hand:
✓ Reporting provides limited context about what is happening in the data. Context is
critical to good analysis.
✓ Reporting translates raw data into information.
✓ Reporting usually raises a question – What is happening?
✓ Analysis transforms the data into insights – Why is it happening? What can you do
about it?
Thus, analysis and reporting complement each other: each serves its own need and is used in
the appropriate context (a small sketch of the contrast follows).
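To make the contrast concrete, here is a small sketch (assuming Python with pandas; the figures are invented): the report answers "what is happening?" with a summary, while the analysis digs into "why?" by segmenting the same data.

# Reporting vs. analysis sketch (assumes pandas; data is invented).
import pandas as pd

sales = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "region": ["north", "south", "north", "south"],
    "revenue": [100, 120, 40, 125],
})

# Reporting: summarize what is happening (weekly totals show a drop in week 2).
print(sales.groupby("week")["revenue"].sum())

# Analysis: dig deeper into why it is happening (the drop comes from the north region).
print(sales.groupby(["week", "region"])["revenue"].sum())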
The characteristics of the data analysis depend on different aspects such as volume, velocity, and
variety.
1. Programmatic
Because of the scale of the data, there might be a need to write a program for data analysis,
using code to manipulate or explore it.
2. Data-driven
Many data scientists rely on a hypothesis-driven approach to data analysis. However, one can
also let the data itself drive the analysis, which is a significant advantage when there is a large
amount of data. For example, machine learning approaches can be used in place of a purely
hypothesis-driven analysis (a brief sketch follows this list).
3. Attributes usage
Proper and accurate analysis of data can make use of a large number of attributes. In the past,
analysts dealt with hundreds of attributes or characteristics of the data source. With Big Data,
there are now thousands of attributes and millions of observations.
Page 37
4. Iterative
Because the whole data set is broken into samples and the samples are then analyzed, data
analytics can be iterative in nature. Better compute power enables iterating over the models until
the data analysts are satisfied. This has led to the development of new applications designed to
address such analysis requirements and time frames.
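A brief sketch of this data-driven, attribute-heavy, iterative style, assuming Python with scikit-learn and a synthetic dataset (an illustrative stand-in, not real big data): rather than starting from a hypothesis, a model is fitted over many attributes and the data itself indicates which attributes matter; the loop can be repeated until the analyst is satisfied.

# Data-driven analysis sketch: let the data reveal the important attributes.
# Assumes scikit-learn; the synthetic dataset stands in for real large-scale data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Many observations and many attributes (far smaller than real big data, for illustration).
X, y = make_classification(n_samples=5000, n_features=50, n_informative=5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The data, not a prior hypothesis, tells us which attributes carry signal.
top = np.argsort(model.feature_importances_)[::-1][:5]
print("Most informative attribute indices:", top)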
✓ It provides infrastructures and platforms for other specific Big Data applications.
Page 38
b) Apache Flink
• Apache Flink is
– an open-source platform,
– a streaming dataflow engine that provides data distribution, communication, and fault
tolerance for computations over data streams.
– Flink is a top-level Apache project and a scalable data analytics framework that is fully
compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce (a minimal sketch follows these points).
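A very small PyFlink DataStream sketch (an assumption on my part: the text shows no code, and the exact API details depend on the installed Flink version); it builds a stream from an in-memory collection, applies a per-record transformation, and prints the result.

# Minimal PyFlink DataStream sketch (assumes the apache-flink Python package;
# API details may vary between Flink versions).
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A toy bounded stream; in practice the source would be Kafka, files, sockets, etc.
events = env.from_collection(["click", "view", "click", "purchase"])

# A simple per-record transformation over the data stream.
events.map(lambda e: (e, 1)).print()

env.execute("event-count-sketch")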
c) Kinesis
– Kinesis is an out-of-the-box streaming data tool.
– Kinesis comprises shards, which Kafka calls partitions.
– Amazon Kinesis is great for organizations that take advantage of real-time or near
real-time access to large stores of data.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data, followed by loading the
aggregated data into a data warehouse.
– Data is put into Kinesis streams, which ensures durability and elasticity (a small sketch of
putting a record follows).
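A small sketch of putting a record into a Kinesis stream with boto3 (the text names no SDK, so this is an assumption; the stream name, region and event fields are hypothetical, and valid AWS credentials are required):

# Putting a record into a Kinesis stream (assumes boto3 and configured AWS credentials;
# the stream name, region and payload are hypothetical).
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "page_view", "ts": "2024-01-01T12:00:00Z"}

kinesis.put_record(
    StreamName="clickstream-demo",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # records with the same key go to the same shard
)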
Page 39