
UNIT 2

https://www.geeksforgeeks.org/what-is-data-analytics/

What is Data Analytics?


Data Analytics is a systematic approach that transforms raw data into valuable insights. This
process encompasses a suite of technologies and tools that facilitate data collection, cleaning,
transformation, and modelling, ultimately yielding actionable information. This information
serves as a robust support system for decision-making. Data analysis plays a pivotal role in
business growth and performance optimization. It aids in enhancing decision-making
processes, bolstering risk management strategies, and enriching customer experiences. By
presenting statistical summaries, data analytics provides a concise overview of quantitative data.
While data analytics finds extensive application in the finance industry, its utility is not
confined to this sector alone. It is also leveraged in diverse fields such as agriculture, banking,
retail, and government, among others, underscoring its universal relevance and impact. Thus,
data analytics serves as a powerful tool for driving informed decisions and fostering growth
across various industries.
Process of Data Analytics
Data analysts, data scientists, and data engineers together create data pipelines that help set up models and support further analysis. Data analytics is typically carried out in the following steps:
1. Data Collection: This is the first step, in which raw data is gathered for analysis. Collection can happen in two ways. If the data comes from different source systems, analysts use data integration routines to combine the different data; if the required data is only a subset of an existing data set, the analyst extracts the useful subset and transfers it to a separate area of the system.
2. Data Cleansing: After collection, the data must be cleaned, because raw data usually contains quality problems such as errors, duplicate entries, and stray white space that need to be corrected before moving to the next step. These issues are fixed by running data profiling and data cleansing tasks, after which analysts organise the data according to the needs of the analytical model (a short pandas sketch follows this list).
3. Data Analysis and Data Interpretation: Analytical models are created using software and tools such as Python, Excel, R, Scala, and SQL to interpret and understand the data. The model is tested repeatedly until it behaves as required, and then the full data set is run against it in production mode.
4. Data Visualisation: Data visualisation is the process of creating visual representations of data, using plots, charts, and graphs, to reveal patterns and trends and to extract valuable insights. By comparing and analysing datasets visually, data analysts extract useful information from raw data.
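To make the cleansing and summary steps above concrete, here is a minimal Python sketch using pandas; the column names and values are hypothetical, and a real pipeline would read from your own source systems instead of an in-memory table.

import pandas as pd

# Hypothetical raw sales data; column names and values are illustrative only.
raw = pd.DataFrame({
    "region": ["North", "North ", "South", "South", None],
    "sales":  [1200, 1200, 950, 875, 430],
})

# Data cleansing: trim stray white space, drop missing entries and duplicates.
clean = (
    raw.assign(region=raw["region"].str.strip())
       .dropna(subset=["region"])
       .drop_duplicates()
)

# Descriptive summary: a concise statistical overview of the cleaned data.
print(clean.groupby("region")["sales"].describe())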
Types of Data Analytics
There are different types of data analysis in which raw data is converted into valuable insights.
Some of the types of data analysis are mentioned below:
1. Descriptive Data Analytics: Descriptive analytics summarises a data set and is used to compare past results, distinguish strengths from weaknesses, and identify anomalies. Companies use it to spot problems in their data because it helps reveal patterns.
2. Real-time Data Analytics: Real-time analytics does not rely on data from past events; it works on data as soon as it is entered into the database. Companies use this type of analysis to identify trends and track competitors' operations.
3. Diagnostic Data Analytics: Diagnostic analytics uses past data sets to analyse the cause of an anomaly. Techniques used in diagnostic analysis include correlation analysis, regression analysis, and analysis of variance. The results of diagnostic analysis help companies provide accurate solutions to problems.
4. Predictive Data Analytics: Predictive analytics uses historical data and statistical modelling techniques to identify trends and patterns. It is also used in sales forecasting, risk estimation, and predicting customer behaviour.
5. Prescriptive Data Analytics: Prescriptive analytics is concerned with selecting the best solutions to problems. It is used in loan approval, pricing models, machine repair scheduling, decision analysis, and similar applications. Companies use prescriptive analysis to automate decision-making.
https://www.investopedia.com/terms/d/data-analytics.asp

What Is Data Analytics?

The term data analytics refers to the science of analyzing raw data to make conclusions about
information. Many of the techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data for human consumption. Data
analytics can be used by different entities, such as businesses, to optimize their performance and
maximize their profits. This is done by using software and other tools to gather and analyze raw
data.

Understanding Data Analytics

Data analytics is a broad term that encompasses many diverse types of data analysis. Any type of
information can be subjected to data analytics techniques to get insight that can be used to
improve things. Data analytics techniques can reveal trends and metrics that would otherwise be
lost in the mass of information. This information can then be used to optimize processes to
increase the overall efficiency of a business or system.

For example, manufacturing companies often record the runtime, downtime, and work queue for
various machines and then analyze the data to better plan workloads so the machines operate
closer to peak capacity.

Data analytics can do much more than point out bottlenecks in production. Gaming companies
use data analytics to set reward schedules for players that keep the majority of players active in
the game. Content companies use many of the same data analytics to keep you clicking, watching,
or re-organizing content to get another view or another click.

Steps in Data Analysis

The process of data analysis involves several steps:


1. Determine the data requirements or how the data is grouped. Data may be separated by
age, demographic, income, or gender. Data values may be numerical or divided by
category.
2. Collect the data. This can be done through a variety of sources such as computers, online
sources, cameras, environmental sources, or through personnel.
3. Organize the data after it's collected so it can be analyzed. This may take place on a
spreadsheet or other form of software that can take statistical data.
4. Clean up the data before it is analyzed. This is done by scrubbing it and ensuring there's
no duplication or error and that it is not incomplete. This step helps correct any errors
before the data goes on to a data analyst to be analyzed.

Types of Data Analytics

Data analytics is broken down into four basic types:

1. Descriptive analytics: This describes what has happened over a given period of time.
Have the number of views gone up? Are sales stronger this month than last?
2. Diagnostic analytics: This focuses more on why something happened. It involves more
diverse data inputs and a bit of hypothesizing. Did the weather affect beer sales? Did that
latest marketing campaign impact sales?
3. Predictive analytics: This moves to what is likely going to happen in the near term. What
happened to sales the last time we had a hot summer? How many weather models predict
a hot summer this year?

4. Prescriptive analytics: This suggests a course of action. For example, if the likelihood of a hot summer, measured as the average of five weather models, is above 58%, we should add an evening shift to the brewery and rent an additional tank to increase output (see the sketch just below).
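As an illustration of that prescriptive rule, here is a minimal Python sketch; the five model probabilities are made up, and the 58% threshold comes from the example above.

# Hypothetical hot-summer probabilities from five weather models.
model_forecasts = [0.62, 0.55, 0.71, 0.48, 0.66]

avg_likelihood = sum(model_forecasts) / len(model_forecasts)

# Prescriptive rule from the example above: act if the average exceeds 58%.
if avg_likelihood > 0.58:
    print("Recommendation: add an evening shift and rent an additional tank.")
else:
    print("Recommendation: keep current capacity.")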
https://aws.amazon.com/what-is/data-analytics/

What is data analytics?

Data analytics converts raw data into actionable insights. It includes a range of tools, technologies,
and processes used to find trends and solve problems by using data. Data analytics can shape
business processes, improve decision-making, and foster business growth.

Why is data analytics important?

Data analytics helps companies gain more visibility and a deeper understanding of their processes
and services. It gives them detailed insights into the customer experience and customer problems.
By shifting the paradigm beyond data to connect insights with action, companies can create
personalized customer experiences, build related digital products, optimize operations, and
increase employee productivity.

What is big data analytics?

Big data describes large sets of diverse data—structured, unstructured, and semi-structured—that
are continuously generated at high speed and in high volumes. Big data is typically measured in
terabytes or petabytes. One petabyte is equal to 1,000,000 gigabytes. To put this in perspective,
consider that a single HD movie contains around 4 gigabytes of data. One petabyte is the equivalent
of 250,000 films. Large datasets measure anywhere from hundreds to thousands to millions of
petabytes.

Big data analytics is the process of finding patterns, trends, and relationships in massive datasets.
These complex analytics require specific tools and technologies, computational power, and data
storage that support the scale.

How does big data analytics work?

Big data analytics follows five steps to analyze any large datasets:

1. Data collection
2. Data storage

3. Data processing

4. Data cleansing

5. Data analysis

Data collection

This includes identifying data sources and collecting data from them. Data collection follows ETL
or ELT processes.

ETL – Extract Transform Load

In ETL, the data generated is first transformed into a standard format and then loaded into storage.

ELT – Extract Load Transform

In ELT, the data is first loaded into storage and then transformed into the required format.
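The difference between the two orderings can be sketched in a few lines of Python; the in-memory DataFrame, file names, and CSV "storage" here are stand-ins for real source systems and a warehouse or data lake.

import pandas as pd

def extract() -> pd.DataFrame:
    # Hypothetical source; in practice this could be an API, database, or file export.
    return pd.DataFrame({"order_id": [1, 2], "amount_usd": ["10.5", "20.0"]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Standardize the data into the required format (here: fix the numeric type).
    return df.assign(amount_usd=df["amount_usd"].astype(float))

def load(df: pd.DataFrame, path: str) -> None:
    # Stand-in for loading into a warehouse or data lake.
    df.to_csv(path, index=False)

# ETL: transform before loading into storage.
load(transform(extract()), "orders_etl.csv")

# ELT: load the raw data first, then transform it inside the storage layer.
load(extract(), "orders_raw.csv")
transformed = transform(pd.read_csv("orders_raw.csv", dtype=str))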

Data storage

Based on the complexity of data, data can be moved to storage such as cloud data warehouses or
data lakes. Business intelligence tools can access it when needed.

Comparison of data lakes with data warehouses

A data warehouse is a database optimized to analyze relational data coming from transactional
systems and business applications. The data structure and schema are defined in advance to
optimize for fast searching and reporting. Data is cleaned, enriched, and transformed to act as the
“single source of truth” that users can trust. Data examples include customer profiles and product
information.

A data lake is different because it can store both structured and unstructured data without any
further processing. The structure of the data or schema is not defined when data is captured; this
means that you can store all of your data without careful design, which is particularly useful when
the future use of the data is unknown. Data examples include social media content, IoT device
data, and nonrelational data from mobile apps.

Organizations typically require both data lakes and data warehouses for data analytics. AWS Lake
Formation and Amazon Redshift can take care of your data needs.

Data processing

When data is in place, it has to be converted and organized to obtain accurate results from
analytical queries. Different data processing options exist to do this. The choice of approach
depends on the computational and analytical resources available for data processing.

Centralized processing

All processing happens on a dedicated central server that hosts all the data.

Distributed processing

Data is distributed and stored on different servers.

Batch processing

Pieces of data accumulate over time and are processed in batches.

Real-time processing

Data is processed continually, with computational tasks finishing in seconds.

Data cleansing

Data cleansing involves scrubbing for any errors such as duplications, inconsistencies,
redundancies, or wrong formats. It’s also used to filter out any unwanted data for analytics.
Data analysis

This is the step in which raw data is converted to actionable insights. The following are four types
of data analytics:

1. Descriptive analytics

Data scientists analyze data to understand what happened or what is happening in the data
environment. It is characterized by data visualization such as pie charts, bar charts, line graphs,
tables, or generated narratives.

2. Diagnostic analytics

Diagnostic analytics is a deep-dive or detailed data analytics process to understand why something
happened. It is characterized by techniques such as drill-down, data discovery, data mining, and
correlations. In each of these techniques, multiple data operations and transformations are used for
analyzing raw data.

3. Predictive analytics

Predictive analytics uses historical data to make accurate forecasts about future trends. It is
characterized by techniques such as machine learning, forecasting, pattern matching, and
predictive modeling. In each of these techniques, computers are trained to reverse engineer
causality connections in the data.

4. Prescriptive analytics
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to
happen but also suggests an optimum response to that outcome. It can analyze the potential
implications of different choices and recommend the best course of action. It is characterized by
graph analysis, simulation, complex event processing, neural networks, and recommendation
engines.
https://www.geeksforgeeks.org/life-cycle-phases-of-data-analytics/

Life Cycle Phases of Data Analytics


The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, reflecting how real projects work. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –
The data science team learns about and investigates the problem.
It develops context and understanding.
It identifies the data sources needed and available for the project.
The team formulates an initial hypothesis that can later be tested with data.
Phase 2: Data Preparation –
This phase covers the steps to explore, preprocess, and condition data before modeling and analysis.
It requires an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Tools commonly used for this phase include Hadoop, Alpine Miner, OpenRefine, etc.
Phase 3: Model Planning –
The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
In this phase, the data science team develops data sets for training, testing, and production purposes.
Tools commonly used for this phase include MATLAB and STATISTICA.
Phase 4: Model Building –
The team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models or whether a more robust environment is needed for executing them.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB and STATISTICA.
Phase 5: Communicate Results –
After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking warnings and assumptions into account.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach lets the team learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open-source tools – Octave, WEKA, SQL, MADlib.
https://www.rudderstack.com/learn/data-analytics/data-analytics-lifecycle/

What is the data analytics lifecycle?

The data analytics lifecycle is a series of six phases that have each been identified as vital for
businesses doing data analytics. This lifecycle is based on the popular CRISP-DM analytics process model, an open-standard process model originally developed by a consortium that included SPSS (now part of IBM). The phases of the data analytics lifecycle include defining your business objectives, cleaning your data, building models, and communicating with your stakeholders.

This lifecycle runs from identifying the problem you need to solve, to running your chosen models
against some sandboxed data, to finally operationalizing the output of these models by running
them on a production dataset. This will enable you to find the answer to your initial question and
use this answer to inform business decisions.

Phases of the data analytics lifecycle

Each phase in the data analytics lifecycle is influenced by the outcome of the preceding phase.
Because of this, it usually makes sense to perform each step in the prescribed order so that data
teams can decide how to progress: whether to continue to the next phase, redo the phase, or
completely scrap the process. By enforcing these steps, the analytics lifecycle helps guide the
teams through what could otherwise become a convoluted and directionless process with unclear
outcomes.
1. Discovery

This first phase involves getting the context around your problem: you need to know what problem
you are solving and what business outcomes you wish to see.

You should begin by defining your business objective and the scope of the work. Work out what data sources will be available and useful to you (for example, Google Analytics, Salesforce, your customer support ticketing system, or any marketing campaign information you might have available), and perform a gap analysis comparing the data required to solve your business problem with the data you have available, working out a plan to get any data you still need.

Once your objective has been identified, you should formulate an initial hypothesis. Design your
analysis so that it will determine whether to accept or reject this hypothesis. Decide in advance
what the criteria for accepting or rejecting the hypothesis will be to ensure that your analysis is
rigorous and follows the scientific method.

2. Data preparation

In the next stage, you need to decide which data sources will be useful for the analysis, collect the
data from all these disparate sources, and load it into a data analytics sandbox so it can be used for
prototyping. When loading your data into the sandbox area, you will need to transform it. The two
main types of transformations are preprocessing transformations and analytics
transformations. Preprocessing means cleaning your data to remove things like nulls, defective
values, duplicates, and outliers. Analytics transformations can mean a variety of things, such as
standardizing or normalizing your data so it can be used more effectively with certain machine
learning algorithms, or preparing your datasets for human consumption (for example, transforming
machine labels into human-readable ones, such as “sku123” → “T-Shirt, brown”).
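A minimal pandas sketch of the two kinds of transformation described above; the SKU labels, product names, and min-max scaling are hypothetical illustrations, not a prescribed pipeline.

import pandas as pd

# Hypothetical product events; SKU labels and values are illustrative only.
events = pd.DataFrame({
    "sku": ["sku123", "sku123", "sku456", None],
    "units": [3, 3, 10, 2],
})

# Preprocessing transformation: remove nulls and duplicates.
events = events.dropna(subset=["sku"]).drop_duplicates()

# Analytics transformations: map machine labels to human-readable ones
# and min-max normalize the numeric column for downstream modelling.
sku_names = {"sku123": "T-Shirt, brown", "sku456": "Hoodie, grey"}
events["product"] = events["sku"].map(sku_names)
events["units_scaled"] = (events["units"] - events["units"].min()) / (
    events["units"].max() - events["units"].min()
)
print(events)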

Depending on whether your transformations take place before or after the loading stage, this whole
process is known as either ETL (extract, transform, load) or ELT (extract, load, transform). You
can set up your own ETL pipeline to deal with all of this, or use an integrated customer data
platform to handle the task all within a unified environment.

It is important to note that the sub-steps detailed here don’t have to take place in separate systems.
For example, if you have all data sources in a data warehouse already, you can simply use a
development schema to perform your exploratory analysis and transformation work in that same
warehouse.

3. Model planning

A model in data analytics is a mathematical or programmatic description of the relationship between two or more variables. It allows us to study the effects of different variables on our data and to make statistical assumptions about the probability of an event happening.
The main categories of models used in data analytics are SQL models, statistical models, and
machine learning models. A SQL model can be as simple as the output of a SQL SELECT
statement, and these are often used for business intelligence dashboards. A statistical model shows
the relationship between one or more variables (a feature that some data warehouses incorporate
into more advanced statistical functions in their SQL processing), and a machine learning model
uses algorithms to recognize patterns in data and must be trained on other data to do so. Machine
learning models are often used when the analyst doesn’t have enough information to try to solve a
problem using easier steps.

You need to decide which models you want to test, operationalize, or deploy. To choose the most
appropriate model for your problem, you will need to do an exploration of your dataset, including
some exploratory data analysis to find out more about it. This will help guide you in your choice
of model because your model needs to answer the business objective that started the process and
work with the data available to you.

You may want to think about the following when deciding on a model:

How large is your dataset? While the more complex types of neural networks (with many hidden
layers) can solve difficult questions with minimal human intervention, be aware that with more
layers of complexity, a larger set of training data is required for the neural network's
approximations to be accurate. You may only have a small dataset available, or you may require
your dashboards to be fast, which generally requires smaller, pre-aggregated data.

How will the output be used? In the business intelligence use case, fast, pre-aggregated data is
great, but if the end users are likely to perform additional drill-downs or aggregations in their BI
solution, the prepared dataset has to support this. A big pitfall here is to accidentally calculate an
average of an already averaged metric.
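To see why averaging an already averaged metric is a pitfall, consider this small made-up example in Python:

# Two stores report an "average order value", but with very different order counts.
orders_a = [10, 10, 10, 10]   # store A: 4 orders, average 10
orders_b = [100]              # store B: 1 order, average 100

avg_of_averages = (sum(orders_a) / len(orders_a) + sum(orders_b) / len(orders_b)) / 2
true_average = (sum(orders_a) + sum(orders_b)) / (len(orders_a) + len(orders_b))

print(avg_of_averages)  # 55.0 -- misleading, ignores how many orders each store had
print(true_average)     # 28.0 -- correct overall average order value

Keeping the underlying counts (or a weighted average) in the prepared dataset avoids this problem when end users aggregate further.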

Is the data labeled? If each record includes a known outcome or target label, you could use supervised learning, but if not, unsupervised learning is your only option.
Do you want the outcome to be qualitative or quantitative? If your question expects a
quantitative answer (for example, “How many sales are forecast for next month?” or “How many
customers were satisfied with our product last month?”) then you should use a regression model.
However, if you expect a qualitative answer (for example, “Is this email spam?”, where the answer
can be Yes or No, or “Which of our five products are we likely to have the most success in
marketing to customer X?”), then you may want to use a classification or clustering model.
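A brief scikit-learn sketch of that choice, using tiny made-up datasets: a regression model for a quantitative question and a classification model for a qualitative one. The feature values and labels are purely illustrative.

from sklearn.linear_model import LinearRegression, LogisticRegression

# Quantitative question ("how many sales next month?") -> regression.
months = [[1], [2], [3], [4]]          # hypothetical month index
sales = [120, 135, 150, 170]           # hypothetical monthly sales
reg = LinearRegression().fit(months, sales)
print(reg.predict([[5]]))              # forecast for month 5

# Qualitative question ("is this email spam?") -> classification.
features = [[0, 1], [1, 0], [1, 1], [0, 0]]   # hypothetical email features
labels = [1, 0, 1, 0]                          # 1 = spam, 0 = not spam
clf = LogisticRegression().fit(features, labels)
print(clf.predict([[1, 1]]))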

Is accuracy or speed of the model particularly important? If so, check whether your chosen
model will perform well. The size of your dataset will be a factor when evaluating the speed of a
particular model.

Is your data unstructured? Unstructured data cannot be easily stored in either relational or graph
databases and includes free text data such as emails or files. This type of data is most suited to
machine learning.

Have you analyzed the contents of your data? Analyzing the contents of your data can include
univariate analysis or multivariate analysis (such as factor analysis or principal component
analysis). This allows you to work out which variables have the largest effects and to identify new
factors (that are a combination of different existing variables) that have a big impact.

4. Building and executing the model

Once you know what your models should look like, you can build them and begin to draw
inferences from your modeled data.

The steps within this phase of the data analytics lifecycle depend on the model you've chosen to
use.

SQL model

You will first need to find your source tables and the join keys. Next, determine where to build
your models. Depending on the complexity, building your model can range from saving SQL
queries in your warehouse and executing them automatically on a schedule, to building more
complex data modeling chains using tooling like dbt or Dataform. In that case, you should first
create a base model, and then create another model to extend it, so that your base model can be
reused for other future models. Now you need to test and verify your extended model, and then
publish the final model to its destination (for example, a business intelligence tool or reverse
ETL tool).
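The base-model/extended-model idea can be illustrated with plain SQL views; this sketch uses Python's built-in sqlite3 module and a hypothetical orders table rather than a real warehouse or dbt project.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table.
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'a', 10.0), (2, 'a', 25.0), (3, 'b', 40.0);

    -- Base model: a reusable, cleaned view over the source table.
    CREATE VIEW base_orders AS
    SELECT order_id, customer, amount FROM orders WHERE amount > 0;

    -- Extended model built on the base model, e.g. feeding a BI dashboard.
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(amount) AS total_revenue
    FROM base_orders GROUP BY customer;
""")

print(conn.execute("SELECT * FROM customer_revenue").fetchall())

Because the extended model selects from the base model rather than the raw table, future models can reuse the same cleaned base layer.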

Statistical model

You should start by developing a dataset containing exactly the information required for the
analysis, and no more. Next, you will need to decide which statistical model is appropriate for your
use case. For example, you could use a correlation test, a linear regression model, or an analysis
of variance (ANOVA). Finally, you should run your model on your dataset and publish your
results.
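A minimal sketch of those three statistical options using SciPy, with made-up numbers standing in for your prepared dataset:

from scipy import stats

# Hypothetical dataset: advertising spend and resulting sales.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 44, 68, 81, 105]

# Correlation test.
r, p_value = stats.pearsonr(ad_spend, sales)

# Simple linear regression.
result = stats.linregress(ad_spend, sales)
print(r, p_value, result.slope, result.intercept)

# One-way ANOVA across three hypothetical customer groups.
group_a, group_b, group_c = [5, 6, 7], [8, 9, 10], [5, 5, 6]
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, anova_p)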

Machine learning model

There is some overlap between machine learning models and statistical models, so you must begin
the same way as when using a statistical model and develop a dataset containing exactly the
information required for your analysis. However, machine learning models require you to create
two samples from this dataset: one for training the model, and another for testing the model.

There might be several good candidate models to test against the data — for example, linear
regression, decision trees, or support vector machines — so you may want to try multiple models
to see which produces the best result.

If you are using a machine learning model, it will need to be trained. This involves executing your
model on your training dataset, and tuning various parameters of your model so you get the best
predictive results. Once this is working well, you can execute your model on your real dataset,
which is used for testing your model. You can now work out which model gave the most accurate
result and use this model for your final results, which you will then need to publish.
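A short scikit-learn sketch of that workflow, assuming a synthetic dataset: the data is split into a training sample and a testing sample, several candidate models are fitted, and the test error indicates which one to take forward.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

# Synthetic dataset standing in for your prepared analytics data.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# One sample for training the model, another for testing it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Several candidate models; pick whichever produces the best result on the test set.
candidates = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "support vector machine": SVR(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    error = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: mean absolute error = {error:.1f}")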
Once you have built your models and are generating results, you can communicate these results to
your stakeholders.

5. Communicating results

You must communicate your findings clearly, and it can help to use data visualizations to achieve
this. Any communication with stakeholders should include a narrative, a list of key findings, and
an explanation of the value your analysis adds to the business. You should also compare the results
of your model with your initial criteria for accepting or rejecting your hypothesis to explain to
them how confident they can be in your analysis.

6. Operationalizing

Once the stakeholders are happy with your analysis, you can execute the same model outside of
the analytics sandbox on a production dataset.

You should monitor the results of this to check if they lead to your business goal being achieved.
If your business objectives are being met, deliver the final reports to your stakeholders, and
communicate these results more widely across the business.

https://www.javatpoint.com/life-cycle-phases-of-data-analytics

Life Cycle of Data Analytics

The data analytics lifecycle was designed to address Big Data problems and data science projects. The process is iterative, reflecting how real projects unfold. To address the specific demands of conducting analysis on Big Data, a step-by-step methodology is required to plan the various tasks associated with the acquisition, processing, analysis, and repurposing of data.

Phase 1: Discovery
o The data science team learns about and researches the issue.
o It creates context and gains understanding.
o It learns about the data sources that are needed and accessible for the project.
o The team comes up with an initial hypothesis, which can later be tested with data.

Phase 2: Data Preparation -

o This phase covers the methods used to explore, pre-process, and condition data before analysis and modelling.
o It requires an analytic sandbox. The team extracts, loads, and transforms data to bring it into the sandbox.
o Data preparation tasks can be repeated and do not follow a predetermined sequence.
o Some of the tools used commonly for this process include Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning -

o The team studies the data to discover the connections between variables. It then selects the most significant variables as well as the most effective models.
o In this phase, the data science team creates data sets that can be used for training, testing, and production purposes.
o Some of the tools used commonly for this stage are MATLAB and STATISTICA.

Phase 4: Model Building -

o The team creates datasets for training, testing, and production use.
o The team builds and implements models based on the work completed in the model planning phase.
o The team also evaluates whether its current tools are sufficient to run the models or whether an even more robust environment is required.
o Free or open-source tools - R and PL/R, Octave, WEKA.
o Commercial tools - MATLAB, STATISTICA.
Phase 5: Communicate Results -
o Following the execution of the model, team members need to compare the outcomes of the model against the criteria established for its success or failure.
o The team considers how best to present findings and outcomes to the various members of the team and other stakeholders, taking cautions and assumptions into consideration.
o The team should determine the most important findings, quantify their value to the business, and create a narrative to summarize the findings and present them to all stakeholders.

Phase 6: Operationalize -
o The team communicates the benefits of the project to a wider audience. It sets up a pilot project that deploys the work in a controlled manner prior to expanding the project to the entire enterprise of users.
o This technique allows the team to gain insight into the performance and constraints of the model within a production setting at a small scale and to make the necessary adjustments before full deployment.
o The team produces the final reports, presentations, and code.
o Open-source or free tools include WEKA, SQL, MADlib, and Octave.

https://intellipaat.com/blog/tutorial/data-analytics-tutorial/data-analytics-lifecycle/

What is Data Analytics Life Cycle?

Data is precious in today’s digital environment. It goes through several life stages, including
creation, testing, processing, consumption, and reuse. These stages are mapped out in the Data
Analytics Life Cycle for professionals working on data analytics initiatives. Each stage has its
own significance and characteristics.
Importance of Data Analytics Life Cycle

The data analytics life cycle encompasses the process of producing, collecting, processing, using, and analyzing data in order to meet corporate objectives. It offers a systematic way to turn data into useful information that can help achieve organizational or project goals; additionally, it provides guidance and strategies for extracting this information and moving in the appropriate direction in order to meet corporate objectives.

Data Analytics Life Cycle Phases


The structured framework of the data analytics life cycle involves six stages of architecture for data analytics. The framework is cyclical, and the big data analytics-related processes are generally completed in sequence.

Notably, because the phases form a cycle, they may also be revisited, moving forwards or backwards as needed. Below are the six data analytics phases that serve as fundamental processes in data science projects.

Phase 1: Data Discovery and Formation

Every good journey begins with a purpose in mind. In this phase, you will identify your desired
data objectives and how best to attain them through data analytics Life Cycle implementation.
Evaluations and assessments should also be undertaken during this initial phase to develop a
basic hypothesis capable of solving business issues or problems.

In the initial step, data will be evaluated for its potential uses and demands – such as where it
comes from, what message you wish for it to send and how this incoming information benefits
your business.
As a data analyst, you will need to explore case studies using similar data analytics and, most
crucially, examine current company trends. Then you must evaluate all in-house infrastructure
and resources, as well as time and technological needs, in order to match the previously acquired
data.

Following the completion of the evaluations, the team closes this stage with hypotheses that will
be tested using data later on. This is the first and most critical step in the life cycle of big data
analytics.

Phase 2: Data Preparation and Processing

Data preparation and processing involves gathering, sorting, processing and purifying collected
information to make sure it can be utilized by subsequent steps of analysis. An important element
of this step is making sure all necessary information is readily accessible before moving ahead
with processing it further.

The following are methods of data acquisition:

 Data Collection: Draw information from external sources.


 Data Entry: Within an organization, data entry refers to creating new points of information
using either digital technologies or manual input procedures.
 Signal Reception: Accumulating data from digital devices like the Internet of Things
devices and control systems.

An analytical sandbox is essential during the data preparation stage of the data analytics life cycle. This scalable platform is used by data analysts and scientists alike to process their data sets; once data has been extracted, loaded, or transformed, it resides inside the sandbox for later examination and modification.
Phase 3: Design a Model

After you’ve defined your business goals and gathered a large amount of data (formatted,
unformatted, or semi-formatted), it’s time to create a model that uses the data to achieve the goal.
Model planning is the name given to this stage of the data analytics process.

There are numerous methods for loading data into the system and starting to analyze it:

 ETL (Extract, Transform, and Load) converts the information before loading it into a system
using a set of business rules.
 ELT (Extract, Load, and Transform) loads raw data into the sandbox before transforming it.
 ETLT (Extract, Transform, Load, Transform) is a combination of two layers of
transformation.

This step also involves teamwork to identify the approaches, techniques, and workflow to be
used in the succeeding phase to develop the model. The process of developing a model begins
with finding the relationship between data points to choose the essential variables and,
subsequently, create a suitable model.

Phase 4: Model Building

This stage of the data analytics life cycle involves creating datasets for testing, training, and
production. The data analytics professionals develop and operate the model they designed in the
previous stage with proper effort.

They use tools and methods such as decision trees, regression techniques (for example, logistic regression), and neural networks to create and run the model. The experts also run the model through a trial run to see if it matches the datasets.

It assists them in determining whether the tools they now have will be enough to execute the
model or if a more robust system is required for it to function successfully.
Phase 5: Result Communication and Publication

Recall the objective you set for your company in phase 1. Now is the time to see if the tests you
ran in the previous phase matched those criteria.

The communication process begins with cooperation with key stakeholders to decide whether the
project’s outcomes are successful or not.

The project team is responsible for identifying the major conclusions of the analysis, calculating
the business value associated with the outcome, and creating a narrative to summarize and
communicate the results to stakeholders.

Phase 6: Measuring Effectiveness

As your data analytics life cycle comes to an end, the final stage is to offer stakeholders a
complete report that includes important results, coding, briefings, and technical papers or
documents.

Furthermore, to assess the effectiveness of the study, the data is transported from the sandbox to
a live environment and observed to see if the results match the desired business aim.

If the findings meet the objectives, the reports and outcomes are finalized. However, if the
conclusion differs from the purpose stated in phase 1, then you can go back in the data analytics
life cycle to any of the previous phases to adjust your input and get a different result.

https://www.geeksforgeeks.org/data-analytics-and-its-type/

Types of Data Analytics


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics (a minimal forecasting sketch follows the lists below) are:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining
Basic Cornerstones of Predictive Analytics
 Predictive modeling
 Decision Analysis and optimization
 Transaction profiling
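As a minimal illustration of the linear regression and forecasting idea (not a full predictive-analytics workflow), the sketch below fits a linear trend to made-up quarterly sales and extrapolates it:

import numpy as np

# Hypothetical quarterly sales history.
quarters = np.arange(1, 9)
sales = np.array([110, 118, 127, 133, 142, 150, 161, 170])

# Fit a linear trend as a minimal stand-in for predictive modelling.
slope, intercept = np.polyfit(quarters, sales, deg=1)

# Forecast the next two quarters.
future = np.array([9, 10])
forecast = slope * future + intercept
print(forecast)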

Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model, which focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
 Data Queries
 Reports
 Descriptive Statistics
 Data dashboard

Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk and illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or to solve a particular problem. We try to find dependencies and patterns in the historical data of that problem.
For example, companies use this analysis because it gives great insight into a problem, and it encourages them to keep detailed information at their disposal; otherwise, data collection may have to be repeated for every individual problem, which is very time-consuming. Common techniques used for Diagnostic Analytics are:
 Data discovery
 Data mining
 Correlations
Steps in Data Analysis
 Define Data Requirements: This involves determining how the data will be grouped or
categorized. Data can be segmented based on various factors such as age, demographic,
income, or gender, and can consist of numerical values or categorical data.
 Data Collection: Data is gathered from different sources, including computers, online
platforms, cameras, environmental sensors, or through human personnel.
 Data Organization: Once collected, the data needs to be organized in a structured format to
facilitate analysis. This could involve using spreadsheets or specialized software designed
for managing and analyzing statistical data.
 Data Cleaning: Before analysis, the data undergoes a cleaning process to ensure accuracy
and reliability. This involves identifying and removing any duplicate or erroneous entries, as
well as addressing any missing or incomplete data. Cleaning the data helps to mitigate
potential biases and errors that could affect the analysis results.

https://www.analytics8.com/blog/what-are-the-four-types-of-analytics-and-how-do-you-use-them/
What Are the Four Types of Analytics and How Do You Use Them?
Analytics is a broad term covering four different pillars in the modern analytics model: descriptive,
diagnostic, predictive, and prescriptive. Each type of analytics plays a role in how your business
can better understand what your data reveals and how you can use those insights to drive business
objectives. In this blog we will discuss what each type of analytics provides to a business, when to
use it and why, and how they all play a critical role in your organization’s analytics maturity.

Four Types of Analytics

What is Descriptive Analytics?

Descriptive analytics answers the question, “What happened?” This type of analytics is by far the most commonly used by customers, providing reporting and analysis centered on past events. It helps companies understand things such as:

 How much did we sell as a company?

 What was our overall productivity?

 How many customers churned in the last quarter?

Descriptive analytics is used to understand the overall performance at an aggregate level and is by
far the easiest place for a company to start as data tends to be readily available to build reports and
applications.

It’s extremely important to build core competencies first in descriptive analytics before attempting
to advance upward in the data analytics maturity model. Core competencies include things such
as:

 Data modeling fundamentals and the adoption of basic star schema best practices,

 Communicating data with the right visualizations, and

 Basic dashboard design skills.

Diagnostic Analytics
What is Diagnostic Analytics?

Diagnostic analytics, just like descriptive analytics, uses historical data to answer a question. But instead of focusing on “the what,” diagnostic analytics addresses the critical question of why an occurrence or anomaly happened within your data. Diagnostic analytics also happens to be the most overlooked and skipped step within the analytics maturity model. Anecdotally, I see most customers attempting to go from “what happened” to “what will happen” without ever taking the time to address the “why did it happen” step. This type of analytics helps companies answer questions such as:

 Why did our company sales decrease in the previous quarter?

 Why are we seeing an increase in customer churn?

 Why is a specific basket of products vastly outperforming its prior-year sales figures?

Diagnostic analytics tends to be more accessible and fit a wider range of use cases than machine
learning/predictive analytics. You might even find that it solves some business problems you
earmarked for predictive analytics use cases.

Predictive Analytics

What is Predictive Analytics?

Predictive analytics is a form of advanced analytics that determines what is likely to happen based
on historical data using machine learning. Historical data that comprises the bulk of descriptive
and diagnostic analytics is used as the basis of building predictive analytics models. Predictive
analytics helps companies address use cases such as:

 Predicting maintenance issues and part breakdown in machines.

 Determining credit risk and identifying potential fraud.

 Predicting and avoiding customer churn by identifying signs of customer dissatisfaction.


What is Prescriptive Analytics?

Prescriptive analytics is the fourth and final pillar of modern analytics. Prescriptive analytics
pertains to true guided analytics where your analytics is prescribing or guiding you toward a
specific action to take. It is effectively the merging of descriptive, diagnostic, and predictive
analytics to drive decision making. Existing scenarios or conditions (think your current fleet of
freight trains) and the ramifications of a decision or occurrence (parts breakdown on the freight
trains) are applied to create a guided decision or action for the user to take (proactively buy more
parts for preventative maintenance).

Prescriptive analytics requires strong competencies in descriptive, diagnostic, and predictive analytics, which is why it tends to be found in highly specialized industries (oil and gas, clinical healthcare, finance, and insurance, to name a few) where use cases are well defined. Prescriptive analytics helps to address use cases such as:

 Automatic adjustment of product pricing based on anticipated customer demand and external
factors.

 Flagging select employees for additional training based on incident reports in the field.

The primary aim of prescriptive analytics is to take the educated guess or assessment out of data analytics and streamline the decision-making process.

https://online.hbs.edu/blog/post/types-of-data-analysis

WHAT IS DATA ANALYTICS IN BUSINESS?

Data analytics is the practice of examining data to answer questions, identify trends, and extract
insights. When data analytics is used in business, it’s often called business analytics.

You can use tools, frameworks, and software to analyze data, such as Microsoft Excel and Power
BI, Google Charts, Data Wrapper, Infogram, Tableau, and Zoho Analytics. These can help you
examine data from different angles and create visualizations that illuminate the story you’re trying
to tell.
Algorithms and machine learning also fall into the data analytics field and can be used to gather,
sort, and analyze data at a higher volume and faster pace than humans can. Writing algorithms is
a more advanced data analytics skill, but you don’t need deep knowledge of coding and statistical
modeling to experience the benefits of data-driven decision-making.

WHO NEEDS DATA ANALYTICS?

Any business professional who makes decisions needs foundational data analytics knowledge.
Access to data is more common than ever. If you formulate strategies and make decisions without
considering the data you have access to, you could miss major opportunities or red flags that it
communicates.

Professionals who can benefit from data analytics skills include:

 Marketers, who utilize customer data, industry trends, and performance data from past
campaigns to plan marketing strategies
 Product managers, who analyze market, industry, and user data to improve their companies’
products
 Finance professionals, who use historical performance data and industry trends to forecast their
companies’ financial trajectories
 Human resources and diversity, equity, and inclusion professionals, who gain insights into
employees’ opinions, motivations, and behaviors and pair it with industry trend data to make
meaningful changes within their organizations
4 KEY TYPES OF DATA ANALYTICS

1. Descriptive Analytics

Descriptive analytics is the simplest type of analytics and the foundation the other types are built
on. It allows you to pull trends from raw data and succinctly describe what happened or is currently
happening.

Descriptive analytics answers the question, “What happened?”


For example, imagine you’re analyzing your company’s data and find there’s a seasonal surge in
sales for one of your products: a video game console. Here, descriptive analytics can tell you, “This
video game console experiences an increase in sales in October, November, and early December
each year.”

Data visualization is a natural fit for communicating descriptive analysis because charts, graphs,
and maps can show trends in data—as well as dips and spikes—in a clear, easily understandable
way.

2. Diagnostic Analytics

Diagnostic analytics addresses the next logical question, “Why did this happen?”

Taking the analysis a step further, this type includes comparing coexisting trends or movement,
uncovering correlations between variables, and determining causal relationships where possible.

Continuing the aforementioned example, you may dig into video game console users’ demographic
data and find that they’re between the ages of eight and 18. The customers, however, tend to be
between the ages of 35 and 55. Analysis of customer survey data reveals that one primary
motivator for customers to purchase the video game console is to gift it to their children. The spike
in sales in the fall and early winter months may be due to the holidays that include gift-giving.

Diagnostic analytics is useful for getting at the root of an organizational issue.

3. Predictive Analytics

Predictive analytics is used to make predictions about future trends or events and answers the
question, “What might happen in the future?”

By analyzing historical data in tandem with industry trends, you can make informed predictions
about what the future could hold for your company.
For instance, knowing that video game console sales have spiked in October, November, and early
December every year for the past decade provides you with ample data to predict that the same
trend will occur next year. Backed by upward trends in the video game industry as a whole, this is
a reasonable prediction to make.

Making predictions for the future can help your organization formulate strategies based on likely
scenarios.

4. Prescriptive Analytics

Prescriptive analytics takes into account all possible factors in a scenario and suggests actionable
takeaways. This type of analytics can be especially useful when making data-driven decisions.

Rounding out the video game example: What should your team decide to do given the predicted
trend in seasonality due to winter gift-giving? Perhaps you decide to run an A/B test with two ads:
one that caters to product end-users (children) and one targeted to customers (their parents). The
data from that test can inform how to capitalize on the seasonal spike and its supposed cause even
further. Or, maybe you decide to increase marketing efforts in September with holiday-themed
messaging to try to extend the spike into another month.

While manual prescriptive analysis is doable and accessible, machine-learning algorithms are
often employed to help parse through large volumes of data to recommend the optimal next step.
Algorithms use “if” and “else” statements, which work as rules for parsing data. If a specific
combination of requirements is met, an algorithm recommends a specific course of action. While
there’s far more to machine-learning algorithms than just those statements, they—along with
mathematical equations—serve as a core component in algorithm training.

https://aws.amazon.com/what-is/advanced-analytics/

What is advanced analytics?


Advanced analytics is the process of using complex machine learning (ML) and visualization
techniques to derive data insights beyond traditional business intelligence. Modern organizations
collect vast volumes of data and analyze it to discover hidden patterns and trends. They use the
information to improve business process efficiency and customer satisfaction. With advanced
analytics, you can take this one step further and use data for future and real-time decision-making.
Advanced analytics techniques also derive meaning from unstructured data like social media
comments or images. They can help your organization solve complex problems more efficiently.
Advancements in cloud computing and data storage have made advanced analytics more
affordable and accessible to all organizations.

What are the use cases of advanced analytics?

Your organization can use advanced analytics to solve complex challenges beyond traditional
business analysis and reporting. Here are some examples across industries.

Healthcare

Healthcare and life science companies analyze clinical and operational data to decrease care costs
while boosting diagnosis accuracy. For example, advanced analysis of medical images supports
precision diagnosis. Similarly, they use advanced analytics to turn patient, genomic,
transcriptomic, and other omics data into actionable insights. It accelerates clinical trials, enhances
research and innovation, and simplifies clinical multiomics.

Finance

Financial services can enhance operational processes and innovation using data-driven insights
from transformative technologies. For example, they can use advanced analytics for these
purposes:

 Optimize critical banking operations

 Drive transformation and reimagine business models in capital markets

 Modernize core systems and enhance risk modeling in insurance


The industry can perform data mining to transform experiences for stakeholders, employees,
intermediaries, and customers. Advanced analytics helps companies make better decisions for
profitability and customer satisfaction.

Manufacturing

The manufacturing industry uses advanced analytics to improve overall equipment effectiveness
(OEE). Diagnostic and predictive analytics improve equipment maintenance and monitoring.
Additionally, the manufacturing sector can do the following:

 Improve processes by identifying and remedying bottlenecks

 Detect real-time anomalies in equipment

 Automate inspection, verification, and other time-consuming manufacturing processes


Retail

The retail industry uses advanced analytics technologies to create smart stores, streamline digital
commerce, and build toward an intelligent supply chain. They can derive insights from customer
interaction and behavior for many purposes:

 Improve merchandising decisions and develop effective merchandising strategies

 Boost customer lifetime value by personalizing product recommendations

 Optimize internal business operations to lower costs and improve margins

 Democratize access to data to innovate and accelerate positive outcomes

What are the types of advanced analytics?

Advancements in data science have helped develop several distinct focus areas within the field of
analytics.
Cluster analytics

Cluster analysis organizes data points into groups based on similarities. It doesn't require initial
assumptions about the relationship between data points, so you can find new patterns and
associations in your data.

For instance, you can use cluster analysis to create demographic or psychographic categories
within customer bases. You can then plot the relationship between one quality and another. You
could trace whether there's a relationship between certain demographics of customers and their
buying habits.
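A minimal clustering sketch using scikit-learn's KMeans with hypothetical customer features; no labels or prior assumptions about the groups are supplied, the algorithm only groups similar points.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, average monthly spend].
customers = np.array([
    [22, 40], [25, 35], [27, 50],     # younger, lower spend
    [45, 160], [48, 150], [52, 170],  # older, higher spend
])

# Group customers into two segments without any prior labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment assignment for each customer
print(kmeans.cluster_centers_)  # average profile of each segment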

Cohort analytics

Like cluster analysis, cohort analysis divides large data sets into small segments. However, it tracks
a group's behavior over time. On the other hand, cluster analysis focuses on finding similarities in
the dataset without necessarily considering the temporal aspect.

Cohort analysis is often used in user behavior and retention studies. You can use it to trace how
each cohort responds to different events. This advanced analytics method improves customer
retention, user engagement, product adoption, and interaction.
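
The sketch below shows one common way to build a monthly retention cohort table with pandas. The orders table, its column names, and the date values are made-up assumptions used only to illustrate the mechanics.

# Illustrative sketch: monthly retention cohorts with pandas.
# The orders DataFrame and its columns are made-up assumptions.
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-10",                 # user 1
        "2024-01-20", "2024-03-02",                 # user 2
        "2024-02-03", "2024-02-25", "2024-03-15",   # user 3
    ]),
})

# Cohort = month of a user's first order; period = months since that first order.
orders["order_month"] = orders["order_date"].dt.to_period("M")
orders["cohort"] = orders.groupby("user_id")["order_month"].transform("min")
orders["period"] = (orders["order_month"] - orders["cohort"]).apply(lambda offset: offset.n)

# Count distinct active users per cohort and period, then pivot into a retention table.
retention = (orders.groupby(["cohort", "period"])["user_id"]
                   .nunique()
                   .unstack(fill_value=0))
print(retention)

Each row of the resulting table is a cohort (the month in which those users first ordered), and each column shows how many of them were still active N months later.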

Predictive analytics
Traditional descriptive analytics looks at historical data to identify trends and patterns. Predictive
modeling uses past data to predict future outcomes. You mainly use predictive analysis in risk-
related fields or when you want to find new opportunities. By seeing potential future scenarios,
you can make better decisions with confidence. It contributes to risk reduction and increases
operational efficiency.
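
A minimal predictive-modeling sketch in Python, assuming scikit-learn and a tiny, invented customer table (the feature names, values, and labels are illustrative only):

# Illustrative sketch: a simple churn-prediction model with scikit-learn.
# The features and the tiny training set are made-up assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features per customer: [months_active, support_tickets, monthly_spend]
X = np.array([
    [24, 0, 80], [36, 1, 120], [2, 5, 20], [4, 4, 25],
    [48, 0, 200], [3, 6, 15], [30, 2, 90], [5, 3, 30],
])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # 1 = churned, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Estimated churn probability for a new customer: 6 months tenure, 4 tickets, $28 spend.
print(model.predict_proba([[6, 4, 28]])[0, 1])

In practice the same workflow is applied to far larger feature sets, and the held-out test split is used to measure how well the model generalizes before its predictions are trusted for decisions.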

Prescriptive analytics

Prescriptive analysis recommends actions you can take to affect a desired outcome. Beyond just
showing future trends, prescriptive analytics suggests different courses of action to best take
advantage of the predicted future scenario. For instance, imagine a business scenario where
predictive analytics tells you which customers are most likely to churn in the next quarter.
Prescriptive analytics suggests specific retention strategies tailored to each at-risk customer
segment, such as special discount offers, loyalty programs, or personalized communication
campaigns.
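
Prescriptive systems usually rely on optimization or simulation, but even a simple rule layer on top of a predictive model conveys the idea. In this hedged sketch, the probability thresholds and the retention actions are invented assumptions:

# Illustrative sketch: a rule-based prescriptive layer on top of churn predictions.
# The thresholds and recommended actions are made-up assumptions.
def recommend_action(churn_probability: float) -> str:
    """Map a predicted churn probability to a suggested retention action."""
    if churn_probability >= 0.8:
        return "personal outreach plus special discount offer"
    if churn_probability >= 0.5:
        return "enroll in loyalty program"
    if churn_probability >= 0.3:
        return "send personalized re-engagement email"
    return "no action needed"

predicted = {"customer_a": 0.92, "customer_b": 0.55, "customer_c": 0.10}
for customer, probability in predicted.items():
    print(customer, "->", recommend_action(probability))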

https://www.techtarget.com/searchbusinessanalytics/definition/advanced-analytics

What is advanced analytics?

Advanced analytics is a data analysis methodology using predictive modeling, machine learning algorithms, deep learning, business process automation and other statistical methods to analyze business information from a variety of data sources.

Advanced analytics uses data science beyond traditional business intelligence (BI) methods to
predict patterns, estimate the likelihood of future events and find insights in data that experts might
miss. Predictive analytics capabilities can help an organization be more efficient and increase its
accuracy in decision-making.

Data scientists often use advanced analytics tools to combine prescriptive analytics and predictive
analytics. Using different analytics types together adds options for enhanced visualization and
predictive models.

What are the benefits of advanced analytics?


In addition to enabling more efficient use of data assets and providing decision-makers with higher
confidence in data accuracy, advanced analytics offers the following benefits:

 Accurate forecasting. Using advanced analytics can confirm or refute prediction and forecast
models with better accuracy than traditional BI tools, which still carry an element of
uncertainty.

 Faster decision-making. Improving the accuracy of predictions allows executives to act more quickly. They can be confident that their quicker business decisions will achieve the desired results and that favorable outcomes can be repeated.
 Deeper insight. Advanced analytics offers a deeper level of actionable insight from data,
including customer preference, market trends and key business processes. Better insights
empower stakeholders to make data-driven decisions with direct effects on their strategy.

 Improved risk management. The higher level of accuracy that advanced analytics provides for predictions can help businesses reduce their risk of costly mistakes.

 Anticipate problems and opportunities. Advanced analytics uses statistical models to reveal potential problems on the business's current trajectory or identify new opportunities. Stakeholders can quickly change course and achieve better outcomes.

What are some advanced analytics techniques?

Advanced analytics can help provide organizations with a competitive advantage. Techniques
range from basic statistical or trend analysis to more complex tasks requiring BI or specialized
tools. The most complex techniques can handle big data, apply machine learning techniques and
perform complex tasks. Some commonly used advanced analytics techniques include the
following:

Data mining. The data mining process sorts through large data sets to identify patterns and
establish relationships. It's a key part of successful analytics operations because BI and advanced
analytics applications use the data that mining generates to solve problems. It has applications
across a variety of industries including healthcare, government, scientific research, mathematics
and sports.

Sentiment analysis. At its core, sentiment analysis is about understanding emotions. It processes
text data to determine the attitude or emotion behind the words, which can be positive, negative or
neutral. In a business setting, sentiment analysis can help the business to understand how
customers feel about a brand based on their reviews, social media comments or direct feedback.
Tools used for sentiment analysis range from basic text analytics software to more advanced
natural language processing (NLP) tools, some of which use machine learning to improve
accuracy.
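
As a small, hedged example of how this might look in code, the sketch below uses NLTK's VADER lexicon, one of many possible sentiment tools; the review texts and the score thresholds are made-up assumptions.

# Illustrative sketch: lexicon-based sentiment scoring with NLTK's VADER analyzer.
# Assumes the nltk package is installed; the reviews are invented examples.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "The delivery was fast and the product works great!",
    "Terrible support, I waited two weeks for a reply.",
    "It's okay, nothing special.",
]

for review in reviews:
    scores = analyzer.polarity_scores(review)
    # The 'compound' score ranges from -1 (very negative) to +1 (very positive).
    if scores["compound"] > 0.05:
        label = "positive"
    elif scores["compound"] < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:8s} {scores['compound']:+.2f}  {review}")
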
Cluster analysis. Cluster analysis is a method of grouping. It brings together similar items in a
data set. Data groups, or clusters, contain items more similar to each other than items in other
clusters. For example, a telecom company could use cluster analysis to group customers based on
their usage patterns. Then, they can target each group with a specific marketing strategy.

Complex event processing. Complex event processing (CEP) involves analyzing multiple events
happening across various systems in real time to detect patterns. If CEP detects patterns of interest
or abnormal behaviors, it can trigger alerts for immediate action. A practical example is credit card
fraud detection: The system monitors transactions and flags any suspicious patterns for
investigation.
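
Full CEP engines (such as Flink's CEP library or Esper) handle this at scale, but the core idea of a sliding-window pattern rule can be sketched in plain Python. The rule used here, three or more transactions on one card within 60 seconds, is an invented assumption.

# Illustrative sketch: a tiny sliding-window check in the spirit of complex event processing.
# The threshold (>= 3 card transactions within 60 seconds) is a made-up rule.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_EVENTS_IN_WINDOW = 3
recent = defaultdict(deque)  # card_id -> timestamps of recent transactions

def process_transaction(card_id: str, timestamp: float) -> None:
    window = recent[card_id]
    window.append(timestamp)
    # Drop events that have fallen out of the 60-second window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_EVENTS_IN_WINDOW:
        print(f"ALERT: suspicious burst on card {card_id} ({len(window)} txns in {WINDOW_SECONDS}s)")

# Simulated event stream: (card_id, seconds since start).
events = [("card-1", 0), ("card-1", 10), ("card-2", 15), ("card-1", 20), ("card-1", 200)]
for card, ts in events:
    process_transaction(card, ts)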

Recommender systems. Recommender systems use past behavior analysis to predict what a user
might want, and then personalize suggestions. An everyday example is when an online shopping
site suggests products a customer might prefer based on their browsing history, or when a
streaming service suggests a show the user may want to watch next.
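
A production recommender typically uses collaborative filtering, matrix factorization, or deep learning, but the essence of item-based collaborative filtering fits in a short sketch. The users, items, and ratings below are made-up assumptions.

# Illustrative sketch: item-based recommendations from a small user-item rating
# matrix using cosine similarity. All data here is invented.
import numpy as np

items = ["laptop", "mouse", "keyboard", "monitor"]
# Rows = users, columns = items; 0 means the user has not rated/bought the item.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 5, 5],
    [0, 3, 4, 4],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Pairwise similarity between item columns.
n_items = ratings.shape[1]
similarity = np.array([[cosine(ratings[:, i], ratings[:, j]) for j in range(n_items)]
                       for i in range(n_items)])

# Recommend for user 0: score unseen items by their similarity to items the user rated.
user = ratings[0]
scores = similarity @ user
scores[user > 0] = -np.inf  # do not re-recommend items already rated
print("Recommend:", items[int(np.argmax(scores))])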

Time series analysis. Time series analysis focuses on data changes over time. It looks at patterns,
trends and cycles in the data to predict future points. For instance, a retailer might use time series
analysis to forecast future sales based on past sales data. The results can help the retailer plan stock
levels and manage resources efficiently.
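
The sketch below shows the simplest possible flavour of this idea with pandas: smoothing a monthly sales series and producing a naive forecast. The sales figures are made-up assumptions; real forecasting would typically use models such as ARIMA or exponential smoothing.

# Illustrative sketch: trend smoothing and a naive forecast of monthly sales.
# The numbers are invented for demonstration.
import pandas as pd

sales = pd.Series(
    [100, 120, 130, 150, 160, 180, 210, 200, 170, 160, 220, 260],
    index=pd.period_range("2023-01", periods=12, freq="M"),
)

# A 3-month rolling average smooths out short-term noise and exposes the trend.
trend = sales.rolling(window=3).mean()
print(trend.tail())

# Naive forecast for the next month: the average of the last three observed months.
next_month_forecast = sales.iloc[-3:].mean()
print("Forecast for 2024-01:", round(next_month_forecast, 1))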

Big data analytics. Big data analytics is the process of examining large volumes
of structured, semistructured and unstructured data to uncover information such as hidden
patterns, correlations, market trends and customer preferences. It uses analytics systems to power
predictive models, statistical algorithms and what-if analysis.

Machine learning. The development of machine learning has dramatically increased the speed of
data processing and analysis, facilitating disciplines such as predictive analytics. Machine learning
uses AI to enable software applications to predict outcomes more accurately. The inputs use
historical data to predict new outputs. Common use cases include recommendation engines, fraud
detection and predictive maintenance.
Data visualization. Data visualization is the process of presenting data in graphical format. It
makes data analysis and sharing more accessible across organizations. Data scientists use
visualizations after writing predictive analytics or machine learning algorithms to visualize
outputs, monitor results and ensure models perform as intended. It's also a quick and effective way
to communicate information to others.
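
A hedged, minimal example of this last step, plotting model output next to actual values with matplotlib (the revenue figures are invented):

# Illustrative sketch: comparing actual values with model predictions in a chart.
# The monthly revenue figures are made-up assumptions.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
actual = [100, 110, 125, 140, 150, 165]
predicted = [102, 108, 128, 138, 155, 160]

plt.figure(figsize=(6, 3))
plt.plot(months, actual, marker="o", label="Actual revenue")
plt.plot(months, predicted, marker="x", linestyle="--", label="Model prediction")
plt.title("Actual vs. predicted monthly revenue")
plt.ylabel("Revenue (in $1,000s)")
plt.legend()
plt.tight_layout()
plt.savefig("revenue_vs_prediction.png")  # or plt.show() in an interactive session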

https://www.gartner.com/en/information-technology/glossary/advanced-analytics

Advanced Analytics

Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations. Advanced analytic techniques include data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, graph analysis, simulation, complex event processing, and neural networks.

https://www.geeksforgeeks.org/data-analytics-tools/

What is Data Analytics

Data analytics is the process of examining large datasets to uncover patterns, trends, correlations,
and insights that can be used to make informed decisions. It involves various techniques and
Data Analysis Tools to analyze and interpret data, often with the goal of improving business
performance, understanding customer behavior, optimizing processes, or gaining competitive
advantages. Data analytics encompasses a range of approaches, including descriptive analytics
(summarizing data to understand its current state), diagnostic analytics (identifying reasons
behind past outcomes), predictive analytics (forecasting future trends or outcomes), and
prescriptive analytics (suggesting actions to achieve desired outcomes).

Top 10 Data Analysis Tools


Looking for the top and best Data Analysis Tools for beginners and businesses in 2024?
Explore our curated list of the most powerful and user-friendly tools that can help you unlock
the full potential of your data. Whether you’re a data analyst, data scientist, or business
professional, these tools offer cutting-edge features and capabilities to enhance your data
analysis and decision-making process. Discover the right tool for your needs and stay ahead
in the competitive world of data analytics.
1. Tableau
Tableau is an easy-to-use Data Analytics tool. Tableau has a drag-and-drop interface which
helps to create interactive visuals and dashboards. Organizations can use this to instantly develop
visuals that give context and meaning to the raw data, making the data very easy to understand.
Also, due to the simple and easy-to-use interface, one can easily use this tool regardless of their
technical ability. Furthermore, Tableau comes with a wide range of features and tools that help
you create the best visuals which are easy to understand.
Tableau's standout advantage is the quality of its interactive visuals. But this doesn't mean Tableau is perfect. Tableau is meant primarily for Data Visualisation, so we can't preprocess data using this tool. Also, it does have a bit of a learning curve and is known for its high cost.
Features:
 Easy Drag and Drop Interface
 Mobile support for both iOS and Android
 The Data Discovery feature allows you to find hidden data
 You can use various Data sources like SQL Server, Oracle, etc
2. Power BI
Power BI is Microsoft’s Data Analysis Tools. It provides enhanced Interactive Visualisation
and capabilities of Business Intelligence. Power BI achieves all this while providing a Simple
and intuitive User Interface. Being a product of Microsoft, you can expect seamless integration
with various Microsoft products. It allows you to connect with Excel spreadsheets, cloud-
based data sources and on-premises data sources.
Power BI is known and loved for its groundbreaking features like Natural Language
queries, Power Query Editor Support, and intuitive User Interface. But Power BI does have
its downsides. It cannot handle records that are bigger than 250 MB in size. Besides, it has limited sharing capabilities, and you would need to pay extra to scale as per your needs.
Features:
 Great connectivity with Microsoft products
 Powerful Semantic Models
 Can meet both Personal and Enterprise needs
 Ability to create beautiful paginated reports
3. Apache Spark
Apache Spark is a data analysis tool known for its speed in data processing. Spark has in-memory processing, which makes it incredibly fast. It is also open source, which results in trust
and interoperability. The ability to handle enormous amounts of Data makes Spark distinguished.
It is quite easy and straightforward to learn, thanks to its API. This doesn’t end here. It also has
support for Distributed Computing Frameworks.
But Apache Spark does have some drawbacks. It doesn’t have an integrated File Management
System and has fewer algorithms than its competitors. Also, it struggles when handling many tiny files.
Features:
 Incredible Speed and Efficiency
 Great connectivity with support of Python, Scala, R, and SQL shells
 Ability to handle and manipulate data in real-time
 Can run on many platforms like Hadoop, Kubernetes, Cloud, and also standalone
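
To give a feel for the API mentioned above, here is a minimal PySpark sketch. It assumes the pyspark package is installed and that a file named sales.csv with region and revenue columns exists; both are illustrative assumptions.

# Illustrative sketch: a small PySpark aggregation over a CSV file.
# The file name and column names are made-up assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Read a CSV file into a distributed DataFrame.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Total and average revenue per region, computed in parallel across the cluster.
summary = (sales.groupBy("region")
                .agg(F.sum("revenue").alias("total_revenue"),
                     F.avg("revenue").alias("avg_revenue"))
                .orderBy(F.col("total_revenue").desc()))
summary.show()

spark.stop()
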
4. TensorFlow
TensorFlow is a machine learning library that also counts among data analysis tools. This open-source library was developed by Google and is a popular choice for many businesses looking to add machine learning capabilities to their data analytics workflow, as TensorFlow can build and train machine learning models. TensorFlow is the first choice of
many due to its wide recognition, which results in an adequate amount of tutorials, and support
for many Programming Languages. TensorFlow can also run on GPUs and TPUs, making the
task much faster.
But TensorFlow can be very hard for beginners: you need coding knowledge to use it standalone, and it has a steep learning curve. TensorFlow can also be quite tricky to install and configure, depending on your system.
Features:
 Supports a lot of programming languages like Python, C++, JavaScript, and Java
 Can scale as needed with support for multiple CPUs, GPUs, or TPUs
 Offers a large community to solve problems and issues
 Features a built-in visualization tool for you to see how the model is performing
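
For orientation, a tiny, hedged TensorFlow/Keras sketch is shown below. It trains a toy classifier on random data purely to illustrate the build-compile-fit workflow; the layer sizes and data are arbitrary assumptions, not a recommended model.

# Illustrative sketch: the basic TensorFlow/Keras workflow on random data.
import numpy as np
import tensorflow as tf

# Fake dataset: 200 samples with 10 features each, binary labels.
X = np.random.rand(200, 10).astype("float32")
y = np.random.randint(0, 2, size=(200,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(X, y, epochs=5, batch_size=32, verbose=0)
loss, accuracy = model.evaluate(X, y, verbose=0)
print(f"Training-set accuracy: {accuracy:.2f}")
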
5. Hadoop
Hadoop by Apache is a distributed processing and storage solution that is also used as a data analysis tool. It is an open-source framework that stores and processes Big Data with the help
of the MapReduce Model. Hadoop is known for its scalability. It is also fault-tolerant and can
continue even after one or more nodes fail. Being Open Source, it can be used freely and
customized to suit specific needs, and Hadoop also supports various Data Formats.
But Hadoop does have some drawbacks. Hadoop requires powerful hardware for it to run
effectively. In addition, it features a steep learning curve making it hard for some users. This is
partly because some users find the MapReduce Model hard to grasp.
Features:
 Free to use as it is Open Source
 Can run on commodity hardware
 Built with fault-tolerance as it can operate even when some node fails
 Highly scalable with the ability to distribute data into multiple nodes
6. R
R is an open-source programming language widely used for statistical computing and data analysis, and it can be considered a data analysis tool. It is known for handling large datasets and its flexibility. R's package library offers a wide range of packages. Using these packages, R allows
the user to manipulate and visualize data. Besides, R also has packages for things like Data
cleaning, Machine Learning, and Natural Language Processing. These features make R very
capable.
Despite these features, R isn’t perfect. For example, R is significantly slower than languages
like C++ and Java. Besides, R is known to have a steep learning curve, especially if you are
unfamiliar with Programming.
Features:
 Ability to handle large Datasets
 Flexibility to be used in many areas like Data Visualisation, Data Processing
 Features built-in graphics capabilities for amazing visuals
 Offers an active community to answer questions and help in problem-solving
7. Python
Python is another programming language popular for data analysis and machine learning, and it is used extensively in data analysis tooling. Python is widely recognized to have
easy syntax which makes it easy to learn. Along with the easy syntax, the package manager of
Python features a lot of important packages and libraries. This makes it suitable for Data
Analysis and Machine Learning. Another reason to use Python is its scalability.
This doesn’t mean Python is flawless. It is quite slow when we compare it to languages
like Java or C++; this is because Python is an interpreted language while the others are compiled.
Besides, Python is also infamous for its high memory consumption.
Features:
 Easy to learn and user-friendly
 Scalable with the ability to handle large datasets
 Extensive packages and libraries that increase the functionality
 Open Source and widely adopted which ensures problems can be fixed easily.
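
A few lines of pandas, the kind of library that makes Python popular for analysis, can illustrate the point; the DataFrame contents here are invented.

# Illustrative sketch: quick exploratory analysis with pandas on made-up data.
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "South", "North", "West", "South"],
    "revenue": [120, 90, 150, 80, 110],
    "units":   [12, 9, 14, 8, 10],
})

print(df.describe())                          # quick statistical summary
print(df.groupby("region")["revenue"].sum())  # revenue totals per region
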
8. SAS
SAS stands for Statistical Analysis System. The SAS Software was developed by the SAS
Institute, and it is widely used for Business Analytics nowadays. SAS has both a Graphical
User Interface and a Terminal Interface. So, depending on the user’s skillsets, they can choose
either one. It also has the ability to handle large datasets. In addition, SAS is equipped with a lot
of Analytical Tools which makes it valid for a lot of applications.
Although SAS is very powerful, it has a big price tag and a steep learning curve, so it is quite
hard for beginners.
Features:
 Ability to handle large datasets
 Support for graphical and non-graphical interface
 Features tools to create high-quality visualizations
 Wide range of tools for predictive and statistical analysis
9. QlikSense
QlikSense is a business and data analysis tool that provides support for data visualisation and data analysis. QlikSense supports various data sources, from spreadsheets and databases to cloud services. You can create amazing dashboards and visualisations. It comes with machine learning features and uses AI to help the user understand the data. Furthermore, QlikSense also has features like Instant Search and Natural Language Processing.
But QlikSense does have some drawbacks. The data extraction of QlikSense is quite inflexible. The pricing model is quite complicated, and it is quite sluggish when it comes to large datasets.
Features:
 Tools for stunning and interactive Data Visualisation
 Conversational AI-powered analytics with Qlik Insight Bot
 Features tools to create high-quality visualizations
 Provides Qlik Big Data Index which is a Data Indexing Engine
10. KNIME
KNIME is an analytics platform and a data analysis tool. It is open source and features an intuitive user interface. KNIME is built with scalability in mind and also offers extensibility
via a well-defined API Plugin. You can also automate Spreadsheets, do Machine Learning, and
much more using KNIME. The best part is you don’t even need to code to do all this.
But KNIME does have its issues. The abundance of features can be overwhelming to some users.
Also, the Data Visualisation of KNIME is not the best and can be improved.
Features:
 Intuitive User Interface with drag and drop function
 Support for extensive analytics tools like Machine Learning, Data Mining, Big Data
Processing
 Provides tools to create high-quality visualizations

https://www.coursera.org/articles/big-data-technologies

4 types of big data technologies

Big data technologies can be categorized into four main types: data storage, data mining, data
analytics, and data visualization [2]. Each of these is associated with certain tools, and you’ll want
to choose the right tool for your business needs depending on the type of big data technology
required.
1. Data storage

Big data technology that deals with data storage has the capability to fetch, store, and manage big
data. It is made up of infrastructure that allows users to store the data so that it is convenient to access.
Most data storage platforms are compatible with other programs. Two commonly used tools are
Apache Hadoop and MongoDB.
 Apache Hadoop: Apache is the most widely used big data tool. It is an open-source software
platform that stores and processes big data in a distributed computing environment across hardware
clusters. This distribution allows for faster data processing. The framework is designed to reduce
bugs or faults, be scalable, and process all data formats.
 MongoDB: MongoDB is a NoSQL database that can be used to store large volumes of data. It stores documents made up of field-value pairs and organizes those documents into collections. It is written
in C, C++, and JavaScript, and is one of the most popular big data databases because it can manage
and store unstructured data with ease.
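
As a brief, hedged illustration of the MongoDB point above, the sketch below stores and queries JSON-like documents with the pymongo driver. It assumes a MongoDB server is running on localhost:27017; the database, collection, and fields are invented.

# Illustrative sketch: storing and querying documents with pymongo.
# Assumes a local MongoDB instance; all names and values are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["shop"]["orders"]

collection.insert_many([
    {"customer": "alice", "items": ["laptop", "mouse"], "total": 1250.0},
    {"customer": "bob",   "items": ["keyboard"],        "total": 45.5},
])

# Find all orders above a threshold; documents come back as Python dicts.
for order in collection.find({"total": {"$gt": 100}}):
    print(order["customer"], order["total"])

client.close()
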
2. Data mining

Data mining extracts the useful patterns and trends from the raw data. Big data technologies such as
Rapidminer and Presto can turn unstructured and structured data into usable information.
 Rapidminer: Rapidminer is a data mining tool that can be used to build predictive models. Its two core strengths are processing and preparing data, and building machine learning and deep learning models. This end-to-end approach allows both functions to drive impact across the organization [3].
 Presto: Presto is an open-source query engine that was originally developed by Facebook to run
analytic queries against their large datasets. Now, it is available widely. One query on Presto can
combine data from multiple sources within an organization and perform analytics on them in a matter
of minutes.
3. Data analytics

In big data analytics, technologies are used to clean and transform data into information that can be
used to drive business decisions. This next step (after data mining) is where users perform
algorithms, models, and predictive analytics using tools such as Apache Spark and Splunk.
 Apache Spark: Spark is a popular big data tool for data analysis because it is fast and efficient at
running applications. It is faster than Hadoop because it uses random access memory (RAM) instead
of being stored and processed in batches via MapReduce [4]. Spark supports a wide variety of data
analytics tasks and queries.
 Splunk: Splunk is another popular big data analytics tool for deriving insights from large datasets.
It has the ability to generate graphs, charts, reports, and dashboards. Splunk also enables users to
incorporate artificial intelligence (AI) into data outcomes.
4. Data visualization

Finally, big data technologies can be used to create stunning visualizations from the data. In data-
oriented roles, data visualization is a skill that is beneficial for presenting recommendations to
stakeholders for business profitability and operations—to tell an impactful story with a simple graph.
 Tableau: Tableau is a very popular tool in data visualization because its drag-and-drop interface
makes it easy to create pie charts, bar charts, box plots, Gantt charts, and more. It is a secure platform
that allows users to share visualizations and dashboards in real time.
 Looker: Looker is a business intelligence (BI) tool used to make sense of big data analytics and then
share those insights with other teams. Charts, graphs, and dashboards can be configured with a query,
such as monitoring weekly brand engagement through social media analytics.

https://www.geeksforgeeks.org/popular-big-data-technologies/
Popular Big Data Technologies

Big Data deals with data sets that are too large or too complex to be handled by traditional data processing application software. It has three key concepts: volume, variety, and velocity. Volume refers to the size of the data; variety refers to the type of data, such as images, PDFs, audio, and video; and velocity refers to the speed at which data is transferred, processed, and analyzed. Big data works on large data sets that can be unstructured, semi-structured, or structured. Key activities include capturing data, search, data storage, sharing of data, transfer, data analysis, visualization, and querying. For analysis, techniques such as A/B testing, machine learning, and natural language processing are used; for visualization, charts, graphs, and similar formats are used; and supporting technologies include business intelligence, cloud computing, and databases.

Some Popular Big Data Technologies:

Here, we will discuss these big data technologies in detail, focusing on a brief overview of each technology.
1. Apache Cassandra: It is one of the NoSQL databases which is highly scalable and has high availability. It supports replicating data across multiple data centers. In Cassandra, fault tolerance is one of the big factors: failed nodes can be easily replaced without any downtime.
2. Apache Hadoop: Hadoop is one of the most widely used big data technologies. It handles large-scale data and large file systems using the Hadoop Distributed File System (HDFS) and provides parallel processing through the MapReduce framework. Hadoop is a scalable system capable of handling large capacities of data. For example, in a real use case, NextBio uses Hadoop MapReduce and HBase to process multi-terabyte data sets of the human genome.
3. Apache Hive: It is used for data summarization and ad hoc querying which means for
querying and analyzing Big Data easily. It is built on top of Hadoop for providing data
summarization, ad-hoc queries, and the analysis of large datasets using SQL-like language called
HiveQL. It is not a relational database and not a language for real-time queries. It has many
features like: designed for OLAP, SQL type language called HiveQL, fast, scalable, and
extensible.
4. Apache Flume: It is a distributed and reliable system that is used to collect, aggregate, and
move large amounts of log data from many data sources toward a centralized data store.
5. Apache Spark: The main objective of Spark is to speed up Hadoop's computational processing, and it was introduced by the Apache Software Foundation. Apache Spark can work independently because it has its own cluster management; it is not an updated or modified version of Hadoop. If you delve deeper, running Spark with Hadoop is just one way to deploy it: in that setup Hadoop is used for storage, while Spark handles processing with its own cluster management and computation. Spark supports interactive queries and stream processing, and in-memory cluster computing is one of its key features.
6. Apache Kafka: It is a distributed publish-subscribe messaging system; more specifically, it provides a robust queue that can handle a high volume of data and pass messages from one point to another, that is, from a sender to a receiver. Message consumption can be performed in both offline and online modes. To prevent data loss, Kafka messages are replicated within the cluster. For real-time streaming data analysis, it integrates with Apache Storm and Spark and is built on top of the ZooKeeper synchronization service.
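
The producer-consumer flow described above can be sketched with the kafka-python client. This assumes a broker on localhost:9092 and a topic named "events"; both the setup and the message contents are illustrative assumptions.

# Illustrative sketch: publishing and reading JSON messages with kafka-python.
# Broker address, topic name and payloads are made-up assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize dicts to JSON and publish them to the "events" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "login"})
producer.flush()

# Consumer: read messages from the beginning of the topic and deserialize them.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no message arrives for 5 seconds
)
for message in consumer:
    print(message.value)
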
7. MongoDB: It is a cross-platform database that works on the concepts of collections and documents. It has document-oriented storage, which means data is stored in JSON-like form. Any attribute can be indexed. It has features like high availability, replication, rich queries, auto-sharding, and fast in-place updates.
8. ElasticSearch: It is a real-time distributed, open-source full-text search and analytics engine. It is highly scalable and can handle structured and unstructured data up to petabytes, and it can be used as a replacement for document-based stores such as MongoDB and RavenDB. To improve search performance, it uses denormalization. A common real use case is as an enterprise search engine, and big organizations such as Wikipedia and GitHub use it.

https://www.javatpoint.com/big-data-technologies

Big Data Technologies

Before big data technologies were introduced, data was managed by general programming languages and basic structured query languages. However, these languages were not efficient enough to handle the data because of the continuous growth in each organization's information and domain. That is why it became very important to introduce an efficient and stable technology that could handle such huge data and take care of the requirements of clients and large organizations responsible for data production and control. Big data technologies is the buzzword we hear a lot in recent times for all such needs.

In this article, we are discussing the leading technologies that have expanded their branches to help
Big Data reach greater heights. Before we discuss big data technologies, let us first understand
briefly about Big Data Technology.

What is Big Data Technology?


Big data technology is defined as a software utility. This technology is primarily designed to
analyze, process and extract information from a large data set and a huge set of extremely complex
structures. This is very difficult for traditional data processing software to deal with.

Among the technology concepts currently generating the most buzz, big data technologies are widely associated with many other technologies such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT), which they massively augment. In combination with these technologies, big data technologies are focused on analyzing and handling large amounts of real-time data and batch-related data.

Types of Big Data Technology

Before we start with the list of big data technologies, let us first discuss this technology's broad classification. Big Data technology is primarily classified into the following two types:

Operational Big Data Technologies

This type of big data technology mainly covers the basic day-to-day data that people process. Typically, operational big data includes daily data such as online transactions, social media activity, and the data from any particular organization or firm, which is usually needed for analysis using software based on big data technologies. This data can also be referred to as the raw data used as input for several analytical big data technologies.

Some specific examples that include the Operational Big Data Technologies can be listed as below:

o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart, Walmart,
etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Analytical Big Data Technologies

Analytical big data is commonly referred to as an improved version of big data technologies. This type of big data technology is a bit more complex than operational big data. Analytical big data is mainly used when performance criteria are involved and important real-time business decisions are made based on reports created by analyzing operational data. This means that the actual investigation of big data that is important for business decisions falls under this type of big data technology.

Some common examples that involve the Analytical Big Data Technologies can be listed as below:

o Stock marketing data


o Weather forecasting data and the time series analysis
o Medical health records where doctors can personally monitor the health status of an
individual
o Carrying out the space mission databases where every information of a mission is very
important

Top Big Data Technologies

We can categorize the leading big data technologies into the following four sections:

o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage

Let us first discuss leading Big Data Technologies that come under Data Storage:

o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based on the MapReduce architecture and is mainly used to process data in batches. The Hadoop framework was introduced to store and process data in a distributed data processing environment running on commodity hardware with a simple programming execution model.
Apart from this, Hadoop is also well suited to storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
o MongoDB: MongoDB is another important component of big data technologies in terms
of storage. No relational properties and RDBMS properties apply to MongoDb because it
is a NoSQL database. This is not the same as traditional RDBMS databases that use
structured query languages. Instead, MongoDB uses flexible, schema-less documents.
The structure of the data storage in MongoDB is also different from traditional RDBMS
databases. This enables MongoDB to hold massive amounts of data. It is based on a simple
cross-platform document-oriented design. The database in MongoDB uses JSON-like documents with flexible schemas. This ultimately helps with operational data storage options,
which can be seen in most financial organizations. As a result, MongoDB is replacing
traditional mainframes and offering the flexibility to handle a wide range of high-volume
data-types in distributed architectures.
MongoDB Inc. introduced MongoDB in Feb 2009. It is written with a combination of
C++, Python, JavaScript, and Go language.
o RainStor: RainStor is a popular database management system designed to manage and
analyze organizations' Big Data requirements. It uses deduplication strategies that help
manage storing and handling vast amounts of data for reference.
RainStor was designed in 2004 by a RainStor Software Company. It operates just like
SQL. Companies such as Barclays and Credit Suisse are using RainStor for their big data
needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. This helps us use the Splunk Search Processing Language (SPL) to analyze
data. Also, Hunk allows us to report and visualize vast amounts of data from Hadoop and
NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
o Cassandra: Cassandra is one of the leading big data technologies among the list of top
NoSQL databases. It is open-source, distributed and has extensive column storage options.
It is freely available and provides high availability without fail. This ultimately helps in the
process of handling data efficiently on large commodity groups. Cassandra's essential
features include fault-tolerant mechanisms, scalability, MapReduce support, distributed
nature, eventual consistency, query language property, tunable consistency, and multi-
datacenter replication, etc.
Cassandra was originally developed at Facebook in 2008 for its inbox search feature and is now maintained by the Apache Software Foundation. It is based on the Java programming language.

Data Mining

Let us now discuss leading Big Data Technologies that come under Data Mining:
o Presto: Presto is an open-source and a distributed SQL query engine developed to run
interactive analytical queries against huge-sized data sources. The size of data sources can
vary from gigabytes to petabytes. Presto helps in querying the data in Cassandra, Hive,
relational databases and proprietary data storage systems.
Presto is a Java-based query engine that was developed at Facebook and open sourced in 2013. Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology and making good use of it.
o RapidMiner: RapidMiner is defined as the data science software that offers us a very
robust and powerful graphical user interface to create, deliver, manage, and maintain
predictive analytics. Using RapidMiner, we can create advanced workflows and scripting
support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf Klinkenberg,
Ingo Mierswa, and Simon Fischer at the Technical University of Dortmund's AI unit. It
was initially named YALE (Yet Another Learning Environment). A few sets of companies
that are making good use of the RapidMiner tool are Boston Consulting Group, InFocus,
Domino's, Slalom, and Vivint.SmartHome.
o ElasticSearch: When it comes to finding information, elasticsearch is known as an
essential tool. It is typically used together with the other main components of the ELK stack (i.e., Logstash and Kibana). In simple words, ElasticSearch is a search engine based on the Lucene library
and works similarly to Solr. Also, it provides a purely distributed, multi-tenant capable
search engine. This search engine is completely text-based and contains schema-free JSON
documents with an HTTP web interface.
ElasticSearch is primarily written in a Java programming language and was developed in
2010 by Shay Banon. Now, it has been handled by Elastic NV since 2012. ElasticSearch
is used by many top companies, such as LinkedIn, Netflix, Facebook, Google, Accenture,
StackOverflow, etc.

Data Analytics

Now, let us discuss leading Big Data Technologies that come under Data Analytics:
o Apache Kafka: Apache Kafka is a popular streaming platform. This streaming platform is
primarily known for three core capabilities: publishing, subscribing to, and processing streams of records. It is
referred to as a distributed streaming platform. It is also defined as a direct messaging,
asynchronous messaging broker system that can ingest and perform data processing on
real-time streaming data. This platform is almost similar to an enterprise messaging system
or messaging queue.
Besides, Kafka also provides a retention period, and data can be transmitted through a
producer-consumer mechanism. Kafka has received many enhancements to date and includes additional features such as KTables, KSQL, and a schema registry. It is written in Java and Scala and was originally developed at LinkedIn before being open sourced through Apache in 2011. Some top companies using the Apache Kafka platform include Twitter, Spotify,
Netflix, Yahoo, LinkedIn etc.
o Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. released the first version of Splunk in 2004. It is written using a combination of AJAX, Python, C++ and XML. Companies such as Trustwave, QRadar, and 1Labs are making
good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze the
obtained models, results, and interactive views. It also allows us to execute all the analysis
steps altogether. It consists of an extension mechanism that can add more plugins, giving
additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was developed
in 2008 by KNIME Company. A list of companies that are making use of KNIME
includes Harnham, Tyler, and Paloalto.
o Spark: Apache Spark is one of the core technologies in the list of big data technologies. It
is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching and windowing techniques. Datasets and DataFrames are built on top of RDDs, the core abstraction provided by Spark Core. Components like Spark MLlib, GraphX, and SparkR help analyze and process machine learning and data science workloads. Spark is written using Java, Scala, Python and R. It was originally developed at UC Berkeley's AMPLab in 2009 and is now maintained by the Apache Software Foundation. Companies like Amazon, ORACLE, CISCO,
VerizonWireless, and Hortonworks are using this big data technology and making good
use of it.
o R-Language: R is defined as the programming language, mainly used in statistical
computing and graphics. It is a free software environment used by leading data miners,
practitioners and statisticians. The language is primarily beneficial in the development of statistics-based software and data analytics.
R 1.0.0 was released in February 2000. It is written primarily in C, Fortran, and R itself.
Companies like Barclays, American Express, and Bank of America use R-Language for
their data analytics needs.
o Blockchain: Blockchain is a technology that can be used in several applications related to
different industries, such as finance, supply chain, manufacturing, etc. It is primarily used
in processing operations like payments and escrow. This helps in reducing the risks of
fraud. Besides, it enhances the transaction's overall processing speed, increases financial
privacy, and internationalizes markets. Additionally, it is also used to fulfill the needs
of shared ledger, smart contract, privacy, and consensus in any Business Network
Environment.
Blockchain technology was first introduced in 1991 by two researchers, Stuart
Haber and W. Scott Stornetta. However, blockchain has its first real-world application
in Jan 2009 when Bitcoin was launched. It is a specific type of database based on Python,
C++, and JavaScript. ORACLE, Facebook, and MetLife are a few of those top companies
using Blockchain technology.
Data Visualization

Let us discuss leading Big Data Technologies that come under Data Visualization:

o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps in analyzing the data at a very faster speed.
Tableau helps in creating the visualizations and insights in the form of dashboards and
worksheets.
Tableau is developed and maintained by Tableau Software (now part of Salesforce); the company was founded in 2003 and went public in May 2013. The tool is written using multiple languages, such as Python, C, C++, and Java. Comparable BI products in this space include Cognos, Qlik, and ORACLE Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich libraries
and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js, etc. This
helps create interactively styled graphs in tools such as Jupyter Notebook and PyCharm.
Plotly was introduced in 2012 by Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of those companies that are making good use of Plotly.

Emerging Big Data Technologies

Apart from the above mentioned big data technologies, there are several other emerging big data
technologies. The following are some essential technologies among them:

o TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem tools, and community resources that help researchers implement the state of the art in Machine Learning. Besides, this ultimately allows developers to build and deploy machine learning-powered applications in specific environments.
TensorFlow was first released by the Google Brain team in 2015 (TensorFlow 2.0 followed in 2019). It is mainly based on C++, CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this
technology for their business requirements.
o Beam: Apache Beam consists of a portable API layer that helps build and maintain
sophisticated parallel-data processing pipelines. Apart from this, it also allows the
execution of built pipelines across a diversity of execution engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software Foundation. It is
written in Python and Java. Some leading companies like Amazon, ORACLE, Cisco, and
VerizonWireless are using this technology.
o Docker: Docker is defined as the special tool purposely developed to create, deploy, and
execute applications easier by using containers. Containers usually help developers pack
up applications properly, including all the required components like libraries and
dependencies. Typically, containers bind all components and ship them all together as a
package.
Docker was introduced in March 2013 by Docker Inc. It is based on the Go language.
Companies like Business Insider, Quora, Paypal, and Splunk are using this technology.
o Airflow: Airflow is a technology that is defined as a workflow automation and scheduling
system. This technology is mainly used to control, and maintain data pipelines. It contains
workflows designed using the DAGs (Directed Acyclic Graphs) mechanism and consisting
of different tasks. The developers can also define workflows in codes that help in easy
testing, maintenance, and versioning.
Airflow originated at Airbnb in 2014 and became an Apache Software Foundation top-level project in 2019. It is based on the Python language. Companies like Checkr and Airbnb are using this leading technology.
o Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container management
tool made open-source in 2014 by Google. It provides a platform for automation,
deployment, scaling, and application container operations in the host clusters.
Kubernetes 1.0 was released in July 2015, when Google donated the project to the newly formed Cloud Native Computing Foundation. It is written in the Go language. Companies like American Express, Pear Deck,
PeopleSource, and Northwestern Mutual are making good use of this technology.

These are some emerging technologies, but the list is not exhaustive because the big data ecosystem is constantly evolving. New technologies appear at a very fast pace based on the demands and requirements of the IT industry.
https://www.techtarget.com/searchdatamanagement/feature/15-big-data-tools-and-
technologies-to-know-about

Big data tools and technologies

1. Airflow

Airflow is a workflow management platform for scheduling and running complex data pipelines in
big data systems. It enables data engineers and other users to ensure that each task in a workflow
is executed in the designated order and has access to the required system resources. Airflow is also
promoted as easy to use: Workflows are created in the Python programming language, and it can
be used for building machine learning models, transferring data and various other purposes.

The platform originated at Airbnb in late 2014 and was officially announced as an open source
technology in mid-2015; it joined the Apache Software Foundation's incubator program the
following year and became an Apache top-level project in 2019. Airflow also includes the
following key features:

 A modular and scalable architecture built around the concept of directed acyclic graphs
(DAGs), which illustrate the dependencies between the different tasks in workflows.

 A web application UI to visualize data pipelines, monitor their production status and
troubleshoot problems.

 Ready-made integrations with major cloud platforms and other third-party services.
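
The DAG concept above is easiest to see in code. The following is a minimal sketch in the style of Airflow 2.x; the dag_id, schedule, and task bodies are invented, and older releases spell the schedule argument schedule_interval.

# Illustrative sketch of an Airflow DAG with two dependent tasks.
# All names and task logic are made-up assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data from the source system")

def transform():
    print("cleaning and aggregating the extracted data")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # schedule_interval in older Airflow 2.x releases
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator declares the dependency: extract must finish before transform.
    extract_task >> transform_task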

2. Delta Lake

Databricks Inc., a software vendor founded by the creators of the Spark processing engine,
developed Delta Lake and then open sourced the Spark-based technology in 2019 through the
Linux Foundation. The company describes Delta Lake as "an open format storage layer that
delivers reliability, security and performance on your data lake for both streaming and batch
operations."
Delta Lake doesn't replace data lakes; rather, it's designed to sit on top of them and create a single
home for structured, semistructured and unstructured data, eliminating data silos that can stymie
big data applications. Furthermore, using Delta Lake can help prevent data corruption, enable
faster queries, increase data freshness and support compliance efforts, according to Databricks.
The technology also comes with the following features:

 Support for ACID transactions, meaning those with atomicity, consistency, isolation and
durability.

 The ability to store data in an open Apache Parquet format.

 A set of Spark-compatible APIs.

3. Drill

The Apache Drill website describes it as "a low latency distributed query engine for large-scale
datasets, including structured and semi-structured/nested data." Drill can scale across thousands of
cluster nodes and is capable of querying petabytes of data by using SQL and standard connectivity
APIs.

Designed for exploring sets of big data, Drill layers on top of multiple data sources, enabling users
to query a wide range of data in different formats, from Hadoop sequence files and server logs to
NoSQL databases and cloud object storage. It can also do the following:

 Access most relational databases through a plugin.

 Work with commonly used BI tools, such as Tableau and Qlik.

 Run in any distributed cluster environment, although it requires Apache's ZooKeeper software
to maintain information about clusters.

4. Druid

Druid is a real-time analytics database that delivers low latency for queries, high concurrency,
multi-tenant capabilities and instant visibility into streaming data. Multiple end users can query
the data stored in Druid at the same time with no impact on performance, according to its
proponents.
Written in Java and created in 2011, Druid became an Apache technology in 2018. It's generally
considered a high-performance alternative to traditional data warehouses that's best suited to
event-driven data. Like a data warehouse, it uses column-oriented storage and can load files in
batch mode. But it also incorporates features from search systems and time series databases,
including the following:

 Native inverted search indexes to speed up searches and data filtering.

 Time-based data partitioning and querying.

 Flexible schemas with native support for semistructured and nested data.

5. Flink

Another Apache open source technology, Flink is a stream processing framework for distributed,
high-performing and always-available applications. It supports stateful computations over both
bounded and unbounded data streams and can be used for batch, graph and iterative processing.

One of the main benefits touted by Flink's proponents is its speed: It can process millions of events
in real time for low latency and high throughput. Flink, which is designed to run in all common
cluster environments, also includes the following features:

 In-memory computations with the ability to access disk storage when needed.

 Three layers of APIs for creating different types of applications.

 A set of libraries for complex event processing, machine learning and other common big data
use cases.

6. Hadoop

A distributed framework for storing data and running applications on clusters of commodity
hardware, Hadoop was developed as a pioneering big data technology to help handle the growing
volumes of structured, unstructured and semistructured data. First released in 2006, it was almost
synonymous with big data early on; it has since been partially eclipsed by other technologies but
is still widely used.
Hadoop has four primary components:

 The Hadoop Distributed File System (HDFS), which splits data into blocks for storage on the
nodes in a cluster, uses replication methods to prevent data loss and manages access to the
data.

 YARN, short for Yet Another Resource Negotiator, which schedules jobs to run on cluster
nodes and allocates system resources to them.

 Hadoop MapReduce, a built-in batch processing engine that splits up large computations and
runs them on different nodes for speed and load balancing.

 Hadoop Common, a shared set of utilities and libraries.

Initially, Hadoop was limited to running MapReduce batch applications. The addition of YARN
in 2013 opened it up to other processing engines and use cases, but the framework is still closely
associated with MapReduce. The broader Apache Hadoop ecosystem also includes various big
data tools and additional frameworks for processing, managing and analyzing big data.
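
To make the MapReduce idea concrete, the classic word-count example can be written as a pair of Hadoop Streaming scripts in Python. These would normally live in two separate files passed to the Hadoop Streaming jar as the mapper and reducer; combining them behind a command-line switch here is purely an illustrative convenience.

# Illustrative sketch: word count as Hadoop Streaming mapper/reducer logic.
# Run with an argument of "map" or "reduce"; Hadoop supplies stdin/stdout.
import sys

def run_mapper():
    # Emit "<word>\t1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Sum counts per word; Hadoop delivers mapper output sorted by key.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()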

7. Hive

Hive is SQL-based data warehouse infrastructure software for reading, writing and managing large
data sets in distributed storage environments. It was created by Facebook but then open sourced to
Apache, which continues to develop and maintain the technology.

Hive runs on top of Hadoop and is used to process structured data; more specifically, it's used for
data summarization and analysis, as well as for querying large amounts of data. Although it can't
be used for online transaction processing, real-time updates, and queries or jobs that require low-
latency data retrieval, Hive is described by its developers as scalable, fast and flexible.

Other key features include the following:

 Standard SQL functionality for data querying and analytics.

 A built-in mechanism to help users impose structure on different data formats.


 Access to HDFS files and ones stored in other systems, such as the Apache HBase database.

8. HPCC Systems

HPCC Systems is a big data processing platform developed by LexisNexis before being open
sourced in 2011. True to its full name -- High-Performance Computing Cluster Systems -- the
technology is, at its core, a cluster of computers built from commodity hardware to process,
manage and deliver big data.

A production-ready data lake platform that enables rapid development and data exploration, HPCC
Systems includes three main components:

 Thor, a data refinery engine that's used to cleanse, merge and transform data, and to profile,
analyze and ready it for use in queries.

 Roxie, a data delivery engine used to serve up prepared data from the refinery.

 Enterprise Control Language, or ECL, a programming language for developing applications.

9. Hudi

Hudi (pronounced hoodie) is short for Hadoop Upserts Deletes and Incrementals. Another open
source technology maintained by Apache, it's used to manage the ingestion and storage of large
analytics data sets on Hadoop-compatible file systems, including HDFS and cloud object storage
services.

First developed by Uber, Hudi is designed to provide efficient and low-latency data ingestion
and data preparation capabilities. Moreover, it includes a data management framework that
organizations can use to do the following:

 Simplify incremental data processing and data pipeline development.

 Improve data quality in big data systems.

 Manage the lifecycle of data sets.

10. Iceberg
Iceberg is an open table format used to manage data in data lakes, which it does partly by tracking
individual data files in tables rather than by tracking directories. Created by Netflix for use with
the company's petabyte-sized tables, Iceberg is now an Apache project. According to the project's
website, Iceberg typically "is used in production where a single table can contain tens of petabytes
of data."

Designed to improve on the standard layouts that exist within tools such as Hive, Presto, Spark
and Trino, the Iceberg table format has functions similar to SQL tables in relational databases.
However, it also accommodates multiple engines operating on the same data set. Other notable
features include the following:

 Schema evolution for modifying tables without having to rewrite or migrate data.

 Hidden partitioning of data that avoids the need for users to maintain partitions.

 A time travel capability that supports reproducible queries using the same table snapshot.

11. Kafka

Kafka is a distributed event streaming platform that, according to Apache, is used by more than
80% of Fortune 100 companies and thousands of other organizations for high-performance data
pipelines, streaming analytics, data integration and mission-critical applications. In simpler terms,
Kafka is a framework for storing, reading and analyzing streaming data.

The technology decouples data streams and systems, holding the data streams so they can then be
used elsewhere. It runs in a distributed environment and uses a high-performance TCP network
protocol to communicate with systems and applications. Kafka was created by LinkedIn before
being passed on to Apache in 2011.

The following are some of Kafka's key components and capabilities:

 A set of five core APIs for Java and the Scala programming language.

 Fault tolerance for both servers and clients in Kafka clusters.

 Elastic scalability to up to 1,000 brokers, or storage servers, per cluster.
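
As a small example of the producer and consumer APIs, the snippet below publishes and reads one event with the kafka-python client. It assumes a broker running on localhost:9092; the "events" topic and the payload are hypothetical.

# Minimal sketch: producing and consuming a streaming event with kafka-python
# (pip install kafka-python). Assumes a broker at localhost:9092; topic and payload
# are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user_id": 42, "action": "click"}')
producer.flush()  # make sure the event is written to the broker

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read the topic from the beginning
    consumer_timeout_ms=5000,      # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)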


12. Kylin

Kylin is a distributed data warehouse and analytics platform for big data. It provides an online
analytical processing (OLAP) engine designed to support extremely large data sets. Because Kylin
is built on top of other Apache technologies -- including Hadoop, Hive, Parquet and Spark -- it can
easily scale to handle those large data loads, according to its backers.

It's also fast, delivering query responses measured in milliseconds. In addition, Kylin provides an
ANSI SQL interface for multidimensional analysis of big data and integrates with
Tableau, Microsoft Power BI and other BI tools. Kylin was initially developed by eBay, which
contributed it as an open source technology in 2014; it became a top-level project within Apache
the following year. Other features it provides include the following:

 Precalculation of multidimensional OLAP cubes to accelerate analytics.

 Job management and monitoring functions.

 Support for building customized UIs on top of the Kylin core.
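
As a rough sketch of how an application might submit a query to Kylin's ANSI SQL interface, the snippet below posts SQL to Kylin's REST query API with the requests library. The host, credentials, project and table are hypothetical placeholders, and the exact endpoint and payload can vary between Kylin versions.

# Minimal sketch: submitting a SQL query to Apache Kylin's REST query API.
# Host, credentials, project and table names are placeholders; details may
# differ between Kylin versions.
import requests

resp = requests.post(
    "http://kylin.example.com:7070/kylin/api/query",
    json={
        "sql": "SELECT part_dt, SUM(price) AS revenue FROM sales GROUP BY part_dt",
        "project": "analytics_project",  # hypothetical Kylin project
        "limit": 100,
    },
    auth=("ADMIN", "KYLIN"),  # default demo credentials; replace in real deployments
    timeout=30,
)
resp.raise_for_status()
print(resp.json())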

13. Pinot

Pinot is a real-time distributed OLAP data store built to support low-latency querying by analytics
users. Its design enables horizontal scaling to deliver that low latency even with large data sets and
high throughput. To provide the promised performance, Pinot stores data in a columnar format and
uses various indexing techniques to filter, aggregate and group data. In addition, configuration
changes can be done dynamically without affecting query performance or data availability.

According to Apache, Pinot can handle trillions of records overall while ingesting millions of data
events and processing thousands of queries per second. The system has a fault-tolerant architecture
with no single point of failure and assumes all stored data is immutable, although it also works
with mutable data. Started in 2013 as an internal project at LinkedIn, Pinot was open sourced in
2015 and became an Apache top-level project in 2021.

The following features are also part of Pinot:


 Near-real-time data ingestion from streaming sources, plus batch ingestion from HDFS, Spark
and cloud storage services.

 A SQL interface for interactive querying and a REST API for programming queries.

 Support for running machine learning algorithms against stored data sets for anomaly
detection.
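
To illustrate Pinot's SQL interface, the snippet below posts a query to a Pinot broker's REST endpoint using the requests library. The broker address and the "events" table are hypothetical; the endpoint path follows Pinot's documented broker query API.

# Minimal sketch: a low-latency SQL query against a Pinot broker's REST endpoint.
# Broker host/port and the "events" table are placeholders.
import requests

resp = requests.post(
    "http://pinot-broker.example.com:8099/query/sql",  # 8099 is the default broker port
    json={"sql": "SELECT country, COUNT(*) AS views FROM events "
                 "GROUP BY country ORDER BY views DESC LIMIT 10"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())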

14. Presto

Formerly known as PrestoDB, this open source SQL query engine can simultaneously handle both
fast queries and large data volumes in distributed data sets. Presto is optimized for low-latency
interactive querying and it scales to support analytics applications across multiple petabytes of
data in data warehouses and other repositories.

Development of Presto began at Facebook in 2012. When its creators left the company in 2018,
the technology split into two branches: PrestoDB, which was still led by Facebook, and PrestoSQL,
which the original developers launched. That continued until December 2020, when PrestoSQL
was renamed Trino and PrestoDB reverted to the Presto name. The Presto open source project is
now overseen by the Presto Foundation, which was set up as part of the Linux Foundation in 2019.

Presto also includes the following features:

 Support for querying data in Hive, various databases and proprietary data stores.

 The ability to combine data from multiple sources in a single query.

 Query response times that typically range from less than a second to minutes.
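
As a minimal sketch of querying Presto from Python, the snippet below uses presto-python-client to run a federated query that joins tables from two catalogs. The coordinator host, catalogs and tables are hypothetical; it assumes a running Presto cluster with Hive and MySQL connectors configured.

# Minimal sketch: a federated Presto query across two catalogs using
# presto-python-client (pip install presto-python-client). Host, catalogs
# and tables are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Join a Hive table with a table from a hypothetical MySQL catalog in one query
cur.execute("""
    SELECT o.order_id, c.segment, o.total
    FROM hive.default.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)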

15. Samza

Samza is a distributed stream processing system that was built by LinkedIn and is now an open
source project managed by Apache. According to the project website, Samza enables users to build
stateful applications that can do real-time processing of data from Kafka, HDFS and other sources.

The system can run on top of Hadoop YARN or Kubernetes and also offers a standalone
deployment option. The Samza site says it can handle "several terabytes" of state data, with low
latency and high throughput for fast data analysis. Via a unified API, it can also use the same code
written for data streaming jobs to run batch applications. Other features include the following:

 Built-in integration with Hadoop, Kafka and several other data platforms.

 The ability to run as an embedded library in Java and Scala applications.

 Fault-tolerant features designed to enable rapid recovery from system failures.

16. Spark

Apache Spark is an in-memory data processing and analytics engine that can run on clusters
managed by Hadoop YARN, Mesos and Kubernetes or in a standalone mode. It enables large-
scale data transformations and analysis and can be used for both batch and streaming applications,
as well as machine learning and graph processing use cases. That's all supported by the following
set of built-in modules and libraries:

 Spark SQL, for optimized processing of structured data via SQL queries.

 Spark Streaming and Structured Streaming, two stream processing modules.

 MLlib, a machine learning library that includes algorithms and related tools.

 GraphX, an API that adds support for graph applications.

Data can be accessed from various sources, including HDFS, relational and NoSQL databases,
and flat-file data sets. Spark also supports various file formats and offers a diverse set of APIs for
developers.

But its biggest calling card is speed: Spark's developers claim it can perform up to 100 times faster
than traditional counterpart MapReduce on batch jobs when processing in memory. As a result,
Spark has become the top choice for many batch applications in big data environments, while also
functioning as a general-purpose engine. First developed at the University of California, Berkeley,
and now maintained by Apache, it can also process on disk when data sets are too large to fit into
the available memory.
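
As a brief illustration of the Spark SQL/DataFrame API mentioned above, here is a minimal PySpark sketch that reads a file and runs a batch aggregation; the input path and column names are hypothetical.

# Minimal sketch: a batch aggregation with the PySpark DataFrame API.
# The input path and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

(df.groupBy("country")
   .agg(F.sum("amount").alias("revenue"))
   .orderBy(F.desc("revenue"))
   .show(10))

spark.stop()
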
17. Storm

Another Apache open source technology, Storm is a distributed real-time computation system
that's designed to reliably process unbounded streams of data. According to the project website, it
can be used for applications that include real-time analytics, online machine learning and
continuous computation, as well as extract, transform and load jobs.

Storm clusters are akin to Hadoop ones, but applications continue to run on an ongoing basis unless
they're stopped. The system is fault-tolerant and guarantees that data will be processed. In addition,
the Apache Storm site says it can be used with any programming language, message queueing
system and database. Storm also includes the following elements:

 A Storm SQL feature that enables SQL queries to be run against streaming data sets.

 Trident and Stream API, two other higher-level interfaces for processing in Storm.

 Use of the Apache ZooKeeper technology to coordinate clusters.

18. Trino

As mentioned above, Trino is one of the two branches of the Presto query engine. Known as
PrestoSQL until it was rebranded in December 2020, Trino "runs at ludicrous speed," in the words
of the Trino Software Foundation. That group, which oversees Trino's development, was originally
formed in 2019 as the Presto Software Foundation; its name was also changed as part of the
rebranding.

Trino enables users to query data regardless of where it's stored, with support for natively running
queries in Hadoop and other data repositories. Like Presto, Trino also is designed for the following:

 Both ad hoc interactive analytics and long-running batch queries.

 Combining data from multiple systems in queries.

 Working with Tableau, Power BI, programming language R, and other BI and analytics tools.

Also available to use in big data systems: NoSQL databases


NoSQL databases are another major type of big data technology. They break with conventional
SQL-based relational database design by supporting flexible schemas, which makes them well
suited for handling huge volumes of all types of data -- particularly unstructured and
semistructured data that isn't a good fit for the strict schemas used in relational systems.

NoSQL software emerged in the late 2000s to help address the increasing amounts of diverse data
that organizations were generating, collecting and looking to analyze as part of big data initiatives.
Since then, NoSQL databases have been widely adopted and are now used in enterprises across
industries. Many are open source or source-available technologies that are also offered in
commercial versions by vendors, while some are proprietary products controlled by a single
vendor. Despite the name, many NoSQL technologies do support some SQL capabilities. As a
result, NoSQL more commonly means "not only SQL" now.

In addition, NoSQL databases themselves come in various types that support different big data
applications. These are the four major NoSQL categories, with examples of the available
technologies in each one:

 Document databases. They store data elements in document-like structures, using formats
such as JSON, BSON and XML. Examples of document databases include Couchbase Server,
CouchDB and MongoDB.

 Graph databases. They connect data "nodes" in graph-like structures to emphasize the
relationships between data elements. Examples of graph databases include AllegroGraph,
Amazon Neptune, ArangoDB, Neo4j and TigerGraph.

 Key-value stores. They pair unique keys and associated values in a relatively simple data
model that can scale easily. Examples of key-value stores include Aerospike, Amazon
DynamoDB, Redis and Riak.

 Wide column stores. They store data across tables that can contain very large numbers of
columns to handle lots of data elements. Examples of wide column stores include Accumulo,
Bigtable, Cassandra, HBase and ScyllaDB.
Multimodel databases have also been created with support for different NoSQL approaches, as
well as SQL in some cases; MarkLogic Server and Microsoft's Azure Cosmos DB are examples.
Many other NoSQL vendors have added multimodel support to their databases. For example,
MongoDB now supports graph, geospatial and time series data, and Redis offers document and
time series modules. Those two technologies and many others also now include vector database
capabilities to support vector search functions in generative AI applications.
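
To make the key-value model described above concrete, here is a minimal sketch using the redis-py client; the host, key and payload are hypothetical and assume a local Redis server.

# Minimal sketch: the key-value model with redis-py (pip install redis).
# Assumes a Redis server on localhost; the key and value are placeholders.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session object under a unique key and read it back
r.set("session:42", json.dumps({"user": "alice", "cart_items": 3}))
print(json.loads(r.get("session:42")))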

https://www.geeksforgeeks.org/10-most-popular-big-data-analytics-tools/

Big Data Analytics Tools


There are hundreds of data analytics tools on the market today, but selecting the right one depends
on your business needs, goals, and the variety of data you work with. Now, let's check out the top
10 big data analytics tools.

1. APACHE Hadoop

It’s a Java-based open-source platform used to store and process big data. It is built on a cluster
system that processes data efficiently and in parallel. It can handle both structured and
unstructured data and scales from a single server to many machines. Hadoop also offers
cross-platform support for its users. Today, it is one of the most widely used big data analytics
tools and is popular with tech giants such as Amazon, Microsoft, and IBM.
Features of Apache Hadoop:
 Free to use and offers an efficient storage solution for businesses.
 Offers quick access via HDFS (Hadoop Distributed File System).
 Highly flexible and works easily with data sources and formats such as MySQL and JSON.
 Highly scalable as it can distribute a large amount of data in small segments.
 It runs on commodity hardware, such as JBOD (just a bunch of disks) storage configurations.
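
As a small illustration of accessing HDFS from Python, the sketch below uses the HdfsCLI package's WebHDFS client; the namenode address, user and paths are hypothetical.

# Minimal sketch: uploading to and reading from HDFS via WebHDFS using HdfsCLI
# (pip install hdfs). The namenode URL, user and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Upload a local file into HDFS, then read part of it back
client.upload("/data/raw/sales.csv", "sales.csv")
with client.read("/data/raw/sales.csv", encoding="utf-8") as reader:
    print(reader.read()[:200])
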
2. Cassandra

APACHE Cassandra is an open-source, distributed NoSQL database designed to manage large
amounts of data. It’s one of the most popular tools for data analytics and has been praised by
many tech companies for its high scalability and availability without compromising speed and
performance. It can deliver thousands of operations every second and handle petabytes of data
with almost zero downtime. It was created at Facebook in 2008 and later released publicly as
open source.
Features of APACHE Cassandra:
 Data Storage Flexibility: It supports all forms of data, i.e. structured, unstructured, and
semi-structured, and allows users to change the data model as their needs evolve.
 Data Distribution System: Data is easy to distribute because it can be replicated across
multiple data centers.
 Fast Processing: Cassandra has been designed to run on efficient commodity hardware and
also offers fast storage and data processing.
 Fault-tolerance: If any node fails, it is replaced without delay.
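
The following is a minimal sketch of reading and writing with the DataStax Python driver for Cassandra; the contact point, keyspace and table are hypothetical and assume a running cluster with that schema already created.

# Minimal sketch: basic reads and writes with the DataStax Cassandra driver
# (pip install cassandra-driver). Contact point, keyspace and table are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumed contact point
session = cluster.connect("shop")  # hypothetical keyspace

session.execute(
    "INSERT INTO orders (user_id, order_id, total) VALUES (%s, %s, %s)",
    (42, 1001, 19.99),
)
rows = session.execute("SELECT order_id, total FROM orders WHERE user_id = %s", (42,))
for row in rows:
    print(row.order_id, row.total)

cluster.shutdown()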

3. Qubole

Qubole is a big data platform, built around open source engines, that helps extract value from
data through ad hoc analysis and machine learning. It is a data lake platform that offers
end-to-end services, reducing the time and effort required to move data through pipelines. It can
be configured across multiple clouds, such as AWS, Azure, and Google Cloud, and its vendor
claims it can lower cloud computing costs by up to 50%.

Features of Qubole:
 Supports ETL process: It allows companies to migrate data from multiple sources in one
place.
 Real-time Insight: It monitors users’ systems and allows them to view insights in real time.
 Predictive Analysis: Qubole offers predictive analysis so that companies can take actions
accordingly for targeting more acquisitions.
 Advanced Security System: To protect users’ data in the cloud, Qubole uses an advanced
security system designed to guard against future breaches, and it supports encrypting cloud data
against potential threats.

4. Xplenty

Xplenty is a data analytics tool for building data pipelines with minimal code. It offers a wide
range of solutions for sales, marketing, and support. With the help of its interactive graphical
interface, it provides solutions for ETL, ELT, and similar workflows. The best part of using
Xplenty is that it requires little investment in hardware and software, and it offers support via
email, chat, phone, and virtual meetings. Xplenty is a cloud platform for processing data for
analytics and bringing all of that data together.
Features of Xplenty:
 REST API: Nearly any operation can be performed through its REST API.
 Flexibility: Data can be sent to and pulled from databases, data warehouses, and Salesforce.
 Data Security: It offers SSL/TLS encryption, and the platform verifies algorithms and
certificates regularly.
 Deployment: It offers integration apps for both cloud and on-premises environments and
supports deploying integrations over the cloud.

5. Spark

APACHE Spark is another framework used to process data and perform numerous tasks at large
scale. It processes data across multiple computers using distributed computing. It is widely used
among data analysts because it offers easy-to-use APIs, simple methods for pulling data, and the
ability to handle multiple petabytes of data. Spark also set a record by sorting 100 terabytes of
data in just 23 minutes, breaking Hadoop's previous record of 71 minutes. This is one reason
many big tech companies are moving toward Spark, which is also highly suitable for ML and AI
workloads today.
Features of APACHE Spark:
 Ease of use: It allows users to work in their preferred language (Java, Python, etc.).
 Real-time Processing: Spark can handle real-time streaming via Spark Streaming.
 Flexible: It can run on Hadoop YARN, Mesos, Kubernetes, standalone, or in the cloud.

6. Mongo DB

MongoDB, which came into the limelight around 2010, is a free, open-source, document-oriented
(NoSQL) database used to store high volumes of data. It uses collections and documents for
storage, and its documents consist of key-value pairs, which are the basic unit of data in
MongoDB. It is popular among developers because of its support for multiple programming
languages such as Python, JavaScript, and Ruby.
Features of Mongo DB:
 Written in C++: It’s a schema-less database that can hold a variety of document types.
 Simplifies the Stack: With MongoDB, a user can easily store files without disrupting the rest
of the stack.
 Replication: Data is written to and read from a primary node, and copies kept on other nodes
can be used for backup and recovery.
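
As a quick sketch of the document model, the snippet below inserts and queries JSON-like documents with PyMongo; the connection string, database and collection names are hypothetical.

# Minimal sketch: storing and querying documents with PyMongo (pip install pymongo).
# Connection string, database and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # hypothetical database and collection

# Documents are schema-less: each one is a set of key-value pairs
orders.insert_one({"user": "alice", "items": ["book", "pen"], "total": 19.99})

for doc in orders.find({"total": {"$gt": 10}}):
    print(doc)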

7. Apache Storm

Storm is a robust, user-friendly tool used for data analytics, especially in smaller companies. A
key advantage of Storm is that it has no programming language barrier and can support any
language. It was designed to handle large pools of data in a fault-tolerant, horizontally scalable
way. When it comes to real-time data processing, Storm leads the pack thanks to its distributed
real-time big data processing system, which is why many tech giants use APACHE Storm in
their systems. Some of the most notable names are Twitter, Zendesk, and NaviSite.
Features of Storm:
 Data Processing: Storm processes data even if a node gets disconnected.
 Highly Scalable: It maintains performance even as the load increases.
 Fast: APACHE Storm is extremely fast and can process up to 1 million 100-byte messages
per second on a single node.
8. SAS

Today, SAS is one of the most widely used tools among data analysts for statistical modeling.
Using SAS, a data scientist can mine, manage, extract, or update data in different formats from
different sources. SAS, short for Statistical Analysis System, allows a user to access data in many
formats (such as SAS tables or Excel worksheets). It also offers a cloud platform for business
analytics called SAS Viya, and to strengthen its position in AI and ML, the company has
introduced new tools and products.
Features of SAS:
 Flexible Programming Language: It offers easy-to-learn syntax and vast libraries, which
make it approachable for non-programmers.
 Broad Data Format Support: It supports many programming languages, including SQL,
and can read data from virtually any format.
 Encryption: It provides end-to-end security with a feature called SAS/SECURE.

9. Data Pine

Datapine is an analytics tool used for BI and was founded in 2012 in Berlin, Germany. In a short
period of time, it has gained popularity in a number of countries, and it’s mainly used for data
extraction by small and medium-sized companies that fetch data for close monitoring. With its
polished UI design, anyone can explore the data they need, and the product is offered in four
different price brackets, starting at $249 per month. It also offers dashboards organized by
function, industry, and platform.

Features of Datapine:
 Automation: To cut down on manual work, Datapine offers a wide array of AI assistants and
BI tools.
 Predictive Tool: Datapine provides forecasting and predictive analytics, using historical and
current data to derive likely future outcomes.
 Add on: It also offers intuitive widgets, visual analytics & discovery, ad hoc reporting,
etc.
10. Rapid Miner

RapidMiner is a fully automated visual workflow design tool used for data analytics. It’s a
no-code platform, so users aren’t required to write code to work with their data. Today, it is
heavily used in many industries such as ed-tech, training, and research. Although it offers a free
edition, that edition is limited to 10,000 data rows and a single logical processor. With
RapidMiner, one can easily deploy ML models to the web or mobile, once a user interface is
ready to collect real-time data.
Features of Rapid Miner:
 Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL
 Storage: Users can access cloud storage services such as AWS and Dropbox.
 Data validation: RapidMiner enables the visual display of multiple historical results for
better evaluation.

https://www.simplilearn.com/what-is-big-data-analytics-article

Uses and Examples of Big Data Analytics

There are many different ways that Big Data analytics can be used in order to improve businesses
and organizations. Here are some examples:

 Using analytics to understand customer behavior in order to optimize the customer experience

 Predicting future trends in order to make better business decisions

 Improving marketing campaigns by understanding what works and what doesn't

 Increasing operational efficiency by understanding where bottlenecks are and how to fix them

 Detecting fraud and other forms of misuse sooner


These are just a few examples — the possibilities are really endless when it comes to Big Data
analytics. It all depends on how you want to use it in order to improve your business.

Benefits and Advantages of Big Data Analytics

1. Risk Management

Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify
fraudulent activities and discrepancies. The organization leverages it to narrow down a list of
suspects or root causes of problems.

2. Product Development and Innovations

Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed
forces across the globe, uses Big Data analytics to analyze how efficient the engine designs are
and if there is any need for improvements.

3. Quicker and Better Decision Making Within Organizations

Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the
company leverages it to decide if a particular location would be suitable for a new outlet or not.
They will analyze several different factors, such as population, demographics, accessibility of the
location, and more.

4. Improve Customer Experience

Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. They monitor
tweets to find out their customers’ experience regarding their journeys, delays, and so on. The
airline identifies negative tweets and does what’s necessary to remedy the situation. By publicly
addressing these issues and offering solutions, it helps the airline build good customer relations.
The Lifecycle Phases of Big Data Analytics

Now, let’s review how Big Data analytics works:

 Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.

 Stage 2 - Identification of data - Here, a broad variety of data sources are identified.

 Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.

 Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.

 Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets are
integrated.

 Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover
useful information.

 Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.

 Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where
the final results of the analysis are made available to business stakeholders who will take action.
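
As a compressed, hypothetical illustration of stages 3 through 6 (filtering, extraction, aggregation, and analysis), the pandas sketch below cleans and combines two small data sets; the file names and columns are placeholders.

# Minimal sketch: filtering, aggregating and analyzing data with pandas.
# File names and column names are hypothetical placeholders.
import pandas as pd

orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Stages 3/4: filter out corrupt rows and extract only the fields we need
orders = orders.dropna(subset=["customer_id", "amount"])
orders = orders[orders["amount"] > 0][["order_id", "customer_id", "amount"]]

# Stage 5: aggregate data sets that share a common field
combined = orders.merge(customers, on="customer_id", how="left")

# Stage 6: a simple statistical analysis of revenue by region
print(combined.groupby("region")["amount"].agg(["count", "sum", "mean"]))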

Different Types of Big Data Analytics

Here are the four types of Big Data analytics:


1. Descriptive Analytics

This summarizes past data into a form that people can easily read. This helps in creating reports,
like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media
metrics.

Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across
its office and lab space. Using descriptive analytics, Dow was able to identify underutilized space.
This space consolidation helped the company save nearly US $4 million annually.
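
A minimal, hypothetical sketch of descriptive analytics in pandas: summarizing past transactions into a readable monthly report. The file and column names are placeholders.

# Minimal sketch: descriptive analytics as a monthly summary report with pandas.
# File name and columns are hypothetical.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])  # e.g. date, revenue

monthly = (
    sales.assign(month=sales["date"].dt.to_period("M"))
         .groupby("month")
         .agg(total_revenue=("revenue", "sum"),
              avg_order_value=("revenue", "mean"),
              orders=("revenue", "count"))
)
print(monthly)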

2. Diagnostic Analytics

This is done to understand what caused a problem in the first place. Techniques like drill-
down, data mining, and data recovery are all examples. Organizations use diagnostic analytics
because they provide an in-depth insight into a particular problem.

Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form didn’t
load correctly, the shipping fee is too high, or there are not enough payment options available. This
is where you can use diagnostic analytics to find the reason.

3. Predictive Analytics

This type of analytics looks into the historical and present data to make predictions of the future.
Predictive analytics uses data mining, AI, and machine learning to analyze current data and make
predictions about the future. It works on predicting customer trends, market trends, and so on.

Use Case: PayPal determines what kind of precautions they have to take to protect their clients
against fraudulent transactions. Using predictive analytics, the company uses all the historical
payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
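
In the same spirit as the PayPal example, the snippet below is a toy, hypothetical sketch of training a fraud classifier on historical payment data with scikit-learn; the data file, feature columns and label are placeholders, not any company's actual method.

# Minimal sketch: a toy fraud-prediction model with scikit-learn.
# The data file, feature columns and label are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

payments = pd.read_csv("payments.csv")  # labeled historical payment data
X = payments[["amount", "hour_of_day", "country_mismatch", "device_changes"]]
y = payments["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
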
4. Prescriptive Analytics

This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works
with both descriptive and predictive analytics. Most of the time, it relies on AI and machine
learning.

Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of analytics
is used to build an algorithm that will automatically adjust the flight fares based on numerous
factors, including customer demand, weather, destination, holiday seasons, and oil prices.

Big Data Analytics Tools

Here are some of the key big data analytics tools :

 Hadoop - helps in storing and analyzing data

 MongoDB - used on datasets that change frequently

 Talend - used for data integration and management

 Cassandra - a distributed database used to handle large volumes of data

 Spark - used for real-time processing and analyzing large amounts of data

 STORM - an open-source real-time computational system

 Kafka - a distributed streaming platform that is used for fault-tolerant storage

Big Data Industry Applications

Here are some of the sectors where Big Data is actively used:
 Ecommerce - Predicting customer trends and optimizing prices are a few of the ways e-
commerce uses Big Data analytics

 Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in
improved sales

 Education - Used to develop new and improve existing courses based on market requirements

 Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict
how likely they are to have health issues

 Media and entertainment - Used to understand the demand for shows, movies, songs, and more
in order to deliver personalized recommendations to users

 Banking - Customer income and spending patterns help to predict the likelihood of choosing
various banking offers, like loans and credit cards

 Telecommunications - Used to forecast network capacity and improve customer experience

 Government - Big Data analytics helps governments in law enforcement, among other things
