Unit-5 DS
Volume: The amount of data matters. With big data, you’ll have to process high volumes of
low-density, unstructured data. This can be data of unknown value, such as
Twitter data feeds, clickstreams on a web page or a mobile app, or sensor-enabled
equipment. For some organizations, this might be tens of terabytes of data. For
others, it may be hundreds of petabytes.
Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on.
Normally, the highest velocity of data streams directly into memory versus being
written to disk. Some internet-enabled smart products operate in real time or near
real time and will require real-time evaluation and action.
Variety: Variety refers to the many types of data that are available. Traditional data types
were structured and fit neatly in a relational database. With the rise of big data,
data comes in new unstructured data types. Unstructured and semistructured data
types, such as text, audio, and video, require additional preprocessing to derive
meaning and support metadata.
Platform
What is a big data platform?
The constant stream of information from various sources is becoming more intense[4],
especially with advances in technology. This is where big data platforms come in: to
store and analyze the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that
solves all the data needs of a business regardless of the volume and size of the data at hand.
Due to their efficiency in data management, enterprises are increasingly adopting big data
platforms to gather tons of data and convert them into structured, actionable business
insights[5].
Currently, the marketplace is flooded with numerous open-source and commercially
available big data platforms. They boast different features and capabilities for use in a big
data environment.
Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially when
the data is in different formats) within legacy systems. Unstructured data cannot be stored in
traditional databases.
Processing
Processing big data refers to reading, transforming, extracting, and formatting useful
information from raw data. The input and output of information in unified formats
continue to present difficulties.
Security
As data volumes grow, so do privacy and security concerns; keeping data secure across many sources and formats remains a major challenge for big data platforms.

Intelligent Data Analysis
Intelligent data analysis refers to the use of analysis, classification, conversion, extraction,
organization, and reasoning methods to extract useful knowledge from data. This data
analytics intelligence process generally consists of the data preparation stage, the data mining
stage, and the result validation and explanation stage.
Data preparation involves the integration of required data into a dataset that will be used for
data mining; data mining involves examining large databases in order to generate new
information; result validation involves the verification of patterns produced by data mining
algorithms; and result explanation involves the intuitive communication of results.
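As a rough illustration of these stages, the short Python sketch below prepares a small dataset, mines it with a simple clustering step, and validates the result. The toy data, column names, and the choice of k-means are illustrative assumptions, not something prescribed by the text.

# A minimal sketch of the prepare -> mine -> validate -> explain stages described above.
# The toy data, column names, and choice of k-means are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Data preparation: integrate required data into one dataset and drop incomplete rows.
customers = pd.DataFrame({"id": [1, 2, 3, 4], "age": [23, 45, 31, 52]})
orders = pd.DataFrame({"id": [1, 2, 3, 4], "spend": [120.0, 480.0, 260.0, 700.0]})
dataset = customers.merge(orders, on="id").dropna()

# Data mining: examine the data to generate new information (here, customer clusters).
features = dataset[["age", "spend"]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Result validation: check that the discovered pattern is meaningful.
score = silhouette_score(features, model.labels_)

# Result explanation: communicate the result intuitively.
print(f"Found {model.n_clusters} customer groups (silhouette score: {score:.2f})")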
Big data intelligence involves the use of Artificial Intelligence and Machine Learning to
make big data analytics actionable and transform big data into insights, and provides
engagement capabilities for data scientists, enterprise analytics strategists, data intelligence
warehouse architects, and implementation and development experts. Enterprise data
intelligence is used in business intelligence operations, analyzing sales, evaluating
inventories, and building customer data intelligence.
4.Nature of Data
Categorical data are values or observations that can be sorted into groups or categories. There
are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic
ordering to its categories. For example, housing is a categorical variable having two
categories (own and rent). An ordinal variable has an established ordering. For example, age
as a variable with three ordered categories (young, adult, and elder).
Numerical data are values or observations that can be measured. There are two kinds of
numerical values, discrete and continuous. Discrete data are values or observations that can
be counted and are distinct and separate. For example, the number of lines in a piece of code. Continuous
data are values or observations that may take on any value within a finite or infinite interval.
For example, an economic time series such as historic gold prices.
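The following short pandas sketch, with made-up values, illustrates how these four kinds of data might be declared so that nominal, ordinal, discrete, and continuous columns are each handled appropriately.

# A small illustration of the data kinds above using pandas; the example values are invented.
import pandas as pd

df = pd.DataFrame({
    "housing": ["own", "rent", "rent", "own"],          # nominal: no intrinsic order
    "age_group": ["young", "adult", "elder", "adult"],  # ordinal: has an order
    "lines_of_code": [120, 87, 430, 15],                # discrete numerical
    "gold_price": [1812.35, 1799.10, 1825.60, 1840.02], # continuous numerical
})

# Declare the categorical columns explicitly, with an order only where one exists.
df["housing"] = pd.Categorical(df["housing"])
df["age_group"] = pd.Categorical(df["age_group"],
                                 categories=["young", "adult", "elder"], ordered=True)

print(df.dtypes)
print(df["age_group"].min())  # ordering makes comparisons like min() meaningful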
The kinds of datasets used in this book are as follows:
E-mails (unstructured, discrete)
Digital images (unstructured, discrete)
Stock market logs (structured, continuous)
Historic gold prices (structured, continuous)
5.Analytic Processes and Tools
1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud
storage to mobile applications to in-store IoT sensors and beyond. Some data will be stored in
data warehouses where business intelligence tools and solutions can access it easily. Raw or
unstructured data that is too diverse or complex for a warehouse may be assigned metadata
and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is
batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream processing
looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
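As a loose illustration of the difference, the sketch below contrasts a batch computation over a whole block of records with a streaming computation that updates as each record arrives. The records and the running-total aggregation are invented for the example.

# Minimal sketch of batch vs. stream processing; the records and the
# aggregation (a running total) are illustrative assumptions.
from typing import Iterable, Iterator

records = [{"sale": v} for v in (10, 25, 5, 40, 15, 30)]

def batch_process(all_records: Iterable[dict]) -> float:
    """Batch: look at the whole block of collected data at once."""
    return sum(r["sale"] for r in all_records)

def stream_process(incoming: Iterable[dict]) -> Iterator[float]:
    """Stream: update the result as each small piece of data arrives."""
    total = 0.0
    for r in incoming:
        total += r["sale"]
        yield total  # available immediately, shortening the delay before analysis

print("batch total:", batch_process(records))
for running_total in stream_process(records):
    print("stream running total:", running_total)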
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or
accounted for. Dirty data can obscure and mislead, creating flawed insights.
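A minimal scrubbing sketch in pandas, assuming made-up column names and cleaning rules, might look like this:

# A small data-scrubbing sketch with pandas; the column names and rules are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara", None],
    "amount": ["10.5", "10.5", "7.0", "not available", "3.2"],
})

clean = (
    raw.drop_duplicates()                                # remove duplicative rows
       .dropna(subset=["customer"])                      # drop rows missing key fields
       .assign(amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"))  # fix formats
       .dropna(subset=["amount"])                        # remove rows that could not be parsed
)
print(clean)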
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes
can turn big data into big insights. Some of these big data analysis methods include:
Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
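As a small illustration of the predictive-analytics idea above, the sketch below fits a simple model to invented historical sales figures and forecasts the next period; the data and model choice are assumptions made only for demonstration.

# A tiny predictive-analytics sketch: fit a model to historical data and
# forecast the next period. The monthly sales figures are made-up assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)          # historical months 1..12
sales = np.array([20, 22, 25, 24, 28, 30, 33, 35, 34, 38, 40, 43])  # past sales

model = LinearRegression().fit(months, sales)
next_month = model.predict(np.array([[13]]))       # predict the upcoming period
print(f"Forecast for month 13: {next_month[0]:.1f}")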
Big data analytics cannot be narrowed down to a single tool or technology. Instead, several
types of tools work together to help you collect, process, cleanse, and analyze big data. Some
of the major players in big data ecosystems are listed below.
Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed
schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not
only SQL,” and these databases can handle a variety of data models.
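A minimal sketch of working with one such database (MongoDB, via the pymongo driver) is shown below; it assumes a MongoDB server is running locally, and the database and collection names are made up for illustration.

# A minimal NoSQL sketch using MongoDB via pymongo; it assumes a MongoDB
# server is running locally, and the database/collection names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics_demo"]["click_events"]

# Documents need no fixed schema: each record can carry different fields.
events.insert_one({"user": "u1", "page": "/home", "device": "mobile"})
events.insert_one({"user": "u2", "page": "/pricing", "referrer": "newsletter"})

print(events.find_one({"user": "u2"}))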
MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters data to various nodes within the cluster. The second is
reducing, which organizes and reduces the results from each node to answer a query.
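The plain-Python sketch below mimics the two phases on a single machine to make the idea concrete; a real Hadoop job would distribute the same steps across the cluster's nodes.

# An illustrative word count in plain Python that mirrors the two MapReduce
# phases described above; real Hadoop jobs distribute these steps across nodes.
from collections import defaultdict

documents = ["big data needs big tools", "data tools process big data"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (Hadoop does this between the two phases).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key to answer the query.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'process': 1}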
YARN stands for “Yet Another Resource Negotiator.” It is another component of second-
generation Hadoop. The cluster management technology helps with job scheduling and
resource management in the cluster.
Spark is an open source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
Tableau is an end-to-end data analytics platform that allows you to prep, analyze,
collaborate, and share your big data insights. Tableau excels in self-service visual analysis,
allowing people to ask new questions of governed big data and easily share those insights
across the organization.
Big data brings big benefits, but it also brings big challenges, such as new privacy and security
concerns, accessibility for business users, and choosing the right solutions for your business
needs. To capitalize on incoming data, organizations will have to address the following:
Making big data accessible: Collecting and processing data becomes more difficult as the
amount of data grows. Organizations must make data easy and convenient for data owners of
all skill levels to use.
Maintaining quality data: With so much data to maintain, organizations are spending more
time than ever before scrubbing for duplicates, errors, absences, conflicts, and
inconsistencies.
Keeping data secure: As the amount of data grows, so do privacy and security concerns.
Organizations will need to strive for compliance and put tight data processes in place before
they take advantage of big data.
Finding the right tools and platforms: New technologies for processing and analyzing big
data are developed all the time. Organizations must find the right technology to work within
their established ecosystems and address their particular needs. Often, the right solution is
also a flexible solution that can accommodate future infrastructure changes.
6.Analysis vs Reporting
Key differences between analytics vs reporting
Understanding the differences between analytics and reporting can significantly benefit your
business. If you want to use both to their full potential and not miss out on essential parts of
either one, knowing the difference between the two is important. Some key differences are:
Analytics: the method of examining and analyzing summarized data to make business decisions.
Reporting: the action of putting together all the needed information and data in an organized way.

Analytics involves questioning the data, understanding it, investigating it, and presenting it to
the end users. Reporting involves identifying business events, gathering the required information,
and organizing, summarizing, and presenting existing data.

The purpose of analytics is to draw conclusions based on data. The purpose of reporting is to
organize the data into meaningful information.

Analytics is used by data analysts, scientists, and business people to make effective decisions.
Reporting is provided to the appropriate business leaders so they can perform effectively and
efficiently within a firm.
Analytics and reporting can be used to reach a number of different goals. Both of these can be
very helpful to a business if they are used correctly.
7.Big Data Analytic Tools
1. Apache Hadoop
It’s a Java-based open-source platform that is used to store and process big data. It is
built on a cluster system that allows the system to process data efficiently and run the data
in parallel. It can process both structured and unstructured data and scale from one server to
multiple computers. Hadoop also offers cross-platform support for its users. Today, it is
one of the most widely used big data analytic tools, adopted by many tech giants such as Amazon,
Microsoft, IBM, etc.
Features of Apache Hadoop:
2. Cassandra
3. Qubole
It’s an open-source big data tool that helps in fetching data in a value chain using ad-hoc
analysis and machine learning. Qubole is a data lake platform that offers end-to-end service
while reducing the time and effort required to move data pipelines. It is capable of
configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also
helps in lowering the cost of cloud computing by 50%.
Features of Qubole:
Supports ETL process: It allows companies to migrate data from multiple
sources into one place.
Real-time Insight: It monitors user’s systems and allows them to view real-time
insights
Predictive Analysis: Qubole offers predictive analysis so that companies can
take actions accordingly for targeting more acquisitions.
Advanced Security System: To protect users’ data in the cloud, Qubole uses an
advanced security system and also helps protect against future breaches.
Besides, it encrypts cloud data against any potential threat.
4. Xplenty
It is a data analytics tool for building a data pipeline with minimal code. It offers a
wide range of solutions for sales, marketing, and support. With the help of its interactive
graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty
is its low investment in hardware and software, and it offers support via email, chat,
phone, and virtual meetings. Xplenty is a platform to process data for analytics over
the cloud and segregates all the data together.
Features of Xplenty:
Rest API: A user can do virtually anything through its REST API
Flexibility: Data can be sent to and pulled from databases, warehouses, and
Salesforce.
Data Security: It offers SSL/TLS encryption, and the platform is capable of
verifying algorithms and certificates regularly.
Deployment: It offers integration apps for both cloud & in-house and supports
deployment to integrate apps over the cloud.
5. Spark
APACHE Spark is another framework that is used to process data and perform numerous
tasks on a large scale. It is also used to process data via multiple computers with the help
of distributing tools. It is widely used among data analysts as it offers easy-to-use APIs that
provide easy data pulling methods and it is capable of handling multi-petabytes of
data as well. Recently, Spark made a record of processing 100 terabytes of data in just 23
minutes which broke the previous world record of Hadoop (71 minutes). This is the
reason why big tech giants are moving towards spark now and is highly suitable for ML
and AI today.
Features of APACHE Spark:
Ease of use: It allows users to work in their preferred language (Java, Python,
etc.)
Real-time Processing: Spark can handle real-time streaming via Spark
Streaming
Flexible: It can run on Mesos, Kubernetes, or in the cloud.
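As referenced above, here is a short PySpark sketch of a batch-style aggregation; it assumes pyspark is installed and that a CSV file named sales.csv with a "region" column exists, both of which are illustrative assumptions.

# A short PySpark sketch of batch-style processing; it assumes pyspark is
# installed and that a CSV file named sales.csv (with a "region" column) exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # load a dataset
df.groupBy("region").count().show()                              # aggregate across the cluster

spark.stop()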
6. Mongo DB
7. Apache Storm
Apache Storm is a robust, user-friendly tool used for data analytics, especially in small
companies. The best part about Storm is that it has no programming-language barrier
and can support any of them. It was designed to handle large pools of data with fault
tolerance and horizontal scalability. When we talk about real-time data
processing, Storm leads the chart because of its distributed real-time big data processing
system, due to which many tech giants use Apache Storm in their systems today.
Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
Features of Storm:
Data Processing: Storm processes the data even if a node gets disconnected
Highly Scalable: It keeps up its performance even as the load
increases
Fast: The speed of Apache Storm is impeccable; it can process up to 1
million messages of 100 bytes each on a single node.
8. SAS
Today it is one of the best tools for statistical modeling used by data analysts. Using
SAS, a data scientist can mine, manage, extract, or update data in different variants
from different sources. Statistical Analysis System, or SAS, allows a user to access data
in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform
for business analytics called SAS Viya, and, to get a strong grip on AI & ML, they have
introduced new tools and products.
Features of SAS:
Flexible Programming Language: It offers an easy-to-learn syntax.
9. Data Pine
datapine is an analytics tool used for BI and was founded back in 2012 in Berlin, Germany. In
a short period of time, it has gained much popularity in a number of countries, and it’s
mainly used for data extraction (for small and medium companies fetching data for close
monitoring). With the help of its enhanced UI design, anyone can visit and check the data
as per their requirements. It is offered in 4 different price brackets, starting from $249 per
month. They also offer dashboards by function, industry, and platform.
Features of datapine:
Automation: To cut down on manual work, datapine offers a wide array of AI-based
automation features.

10. Rapid Miner
It’s a fully automated visual workflow design tool used for data analytics. It’s a no-code
platform and users aren’t required to code for segregating data. Today, it is being heavily
used in many industries such as ed-tech, training, research, etc. Though it’s an open-source
platform, it has a limitation of 10,000 data rows and a single logical processor.
With the help of Rapid Miner, one can easily deploy their ML models to the web or mobile
(only when the user interface is ready to collect real-time figures).
Features of Rapid Miner:
Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via
URL
Storage: Users can access cloud storage facilities such as AWS and Dropbox
Data validation: RapidMiner enables the visual display of multiple results in
history for better evaluation.
Conclusion:
Big data has been in the limelight for the past few years and will continue to dominate the
market in almost every sector and for every market size. The demand for big data is booming at
an enormous rate, and ample tools are available in the market today; all you need is the right
approach and to choose the best data analytics tool as per the project’s requirements.
8.Statistical Concepts:
Sampling Distribution
What is sampling distribution?
A sampling distribution is the probability distribution of a statistic (such as a mean or
proportion) computed from repeated small samples drawn from a large population. Its primary
purpose is to establish representative results from small samples of a comparatively larger
population. Since the
population is too large to analyze, you can select a smaller group and repeatedly sample or
analyze them. The gathered data, or statistic, is used to calculate the likely occurrence, or
probability, of an event.
Using a sampling distribution simplifies the process of making inferences, or conclusions,
about large amounts of data.
Types of distributions
There are three standard types of sampling distributions in statistics:
1. Sampling distribution of mean
The most common type of sampling distribution is the mean. It focuses on calculating the
mean of every sample group chosen from the population and plotting the data points. The
graph shows a normal distribution where the center is the mean of the sampling distribution,
which represents the mean of the entire population.
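The small NumPy simulation below illustrates this: it repeatedly samples from an invented skewed population and shows that the mean of the sample means sits close to the population mean. The population, sample size, and number of repetitions are assumptions chosen only for the demonstration.

# A quick simulation of the sampling distribution of the mean; the population
# and the sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=100_000)  # a skewed population

sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print("population mean:        ", round(population.mean(), 2))
print("mean of sample means:   ", round(np.mean(sample_means), 2))  # close to the population mean
print("std. error of the mean: ", round(np.std(sample_means), 2))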
2. Sampling distribution of proportion
This sampling distribution focuses on proportions in a population. You select samples and
calculate their proportions. The means of the sample proportions from each group represent
the proportion of the entire population.
3. T-distribution
A T-distribution is a sampling distribution that involves a small population or one where you
don't know much about it. It is used to estimate the mean of the population and other statistics
such as confidence intervals, statistical differences and linear regression. The T-distribution
uses a t-score to evaluate data that wouldn't be appropriate for a normal distribution.
The formula for t-score is:
t = [ x - μ ] / [ s / sqrt( n ) ]
In the formula, "x" is the sample mean and "μ" is the population mean and signifies standard
deviation.
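A quick sketch of the calculation, using made-up sample values and a hypothesised population mean, and checking the hand computation against scipy:

# Computing the t-score from the formula above; the sample values and the
# hypothesised population mean are illustrative assumptions.
import math
from scipy import stats

sample = [12.1, 11.6, 12.9, 12.4, 11.8, 12.7, 12.2, 11.9]
mu = 12.0                                    # hypothesised population mean

n = len(sample)
x_bar = sum(sample) / n
s = stats.tstd(sample)                       # sample standard deviation (n - 1 in the denominator)
t = (x_bar - mu) / (s / math.sqrt(n))

print("t-score by hand:", round(t, 3))
print("t-score via scipy:", round(stats.ttest_1samp(sample, mu).statistic, 3))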
Re-Sampling
What is resampling?
Resampling is a series of techniques used in statistics to gather more information about a
sample. This can include retaking a sample or estimating its accuracy. With these additional
techniques, resampling often improves the overall accuracy and estimates any uncertainty
within a population.
Sampling vs. resampling
Sampling is the process of selecting certain groups within a population to gather data.
Resampling often involves performing similar testing
methods on samples within that group. This can mean testing the same sample, or
reselecting samples that can provide more information about a population. There are several
differences between sampling and resampling, including:
Methods
Resampling uses methods like the bootstrapping technique and permutation tests (a bootstrap
sketch follows the list below). With sampling, there are four main methods:
Simple random sampling: Simple random sampling is when every person or data piece
within a population or a group has an equal chance of selection. You might generate random
numbers or have another random selection process.
Systematic sampling: Systematic sampling is often still random, but people might receive
numbers or values at the start. The person running the experiment then might select intervals
to divide the group, like every third person.
Stratified sampling: Stratified sampling is when you divide the main population into several
subgroups based on certain qualities. This can mean collecting samples from groups of
different ages, cultures or other demographics.
Cluster sampling: Cluster sampling is similar to stratified sampling, as you can divide
populations into separate subgroups. Rather than coordinated groups with similar qualities,
you select these groups randomly, often causing differences in results.
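As referenced above, the sketch below shows the bootstrapping technique: the same sample is resampled with replacement many times to estimate the uncertainty of its mean. The sample values and the number of resamples are illustrative assumptions.

# A minimal bootstrap resampling sketch: re-sample the same sample with
# replacement to estimate the uncertainty of its mean. The sample values are invented.
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([4.2, 5.1, 6.3, 4.8, 5.9, 5.5, 4.4, 6.0, 5.2, 4.9])

boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5_000)]

low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}")
print(f"bootstrap 95% interval for the mean: ({low:.2f}, {high:.2f})")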
Statistical Inference
Statistical inference is the process of using data from a sample to draw conclusions about a
wider population. It is applied in areas such as:
Business Analysis
Artificial Intelligence
Financial Analysis
Fraud Detection
Machine Learning
Share Market
Pharmaceutical Sector
Prediction Error
In statistics, prediction error refers to the difference between the predicted values made by
some model and the actual values.
Prediction error is often used in two settings:
1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known as
RMSE, which stands for root mean squared error.
It is calculated as:
RMSE = sqrt( Σ(ŷi – yi)² / n )
where:
Σ is a symbol that means “sum”
ŷi is the predicted value for the ith observation
yi is the observed value for the ith observation
n is the sample size
2. Logistic Regression: Used to predict the value of some binary response variable.
One common way to measure the prediction error of a logistic regression model is with a
metric known as the total misclassification rate.
It is calculated as:
Total misclassification rate = (# incorrect predictions / # total predictions)
The lower the value for the misclassification rate, the better the model is able to predict the
outcomes of the response variable.
The following examples show how to calculate prediction error for both a linear regression
model and a logistic regression model in practice.
Example 1: Calculating Prediction Error in Linear Regression
Suppose we use a regression model to predict the number of points that 10 players will score
in a basketball game.
The following table shows the predicted points from the model vs. the actual points the
players scored:

Predicted points: 14  15  18  19  25  18  12  12  15  22
Actual points:    12  15  20  16  20  19  16  20  16  16
We would calculate the root mean squared error (RMSE) as:
RMSE = sqrt( Σ(ŷi – yi)² / n )
RMSE = sqrt( ((14-12)² + (15-15)² + (18-20)² + (19-16)² + (25-20)² + (18-19)² + (12-16)² +
(12-20)² + (15-16)² + (22-16)²) / 10 )
RMSE = 4
The root mean squared error is 4. This tells us that the average deviation between the
predicted points scored and the actual points scored is 4.
This value is quite high, which indicates that the model doesn’t do a very good job of
predicting how many points a player will score.
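The short script below reproduces the RMSE from Example 1 and, with made-up binary labels, shows how the total misclassification rate from the logistic-regression setting would be computed.

# Reproducing the RMSE from Example 1 and showing the misclassification-rate
# formula; the binary labels further below are made-up assumptions.
import math

predicted = [14, 15, 18, 19, 25, 18, 12, 12, 15, 22]
actual    = [12, 15, 20, 16, 20, 19, 16, 20, 16, 16]

rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
print("RMSE:", rmse)  # 4.0, matching the worked example

# Total misclassification rate for a binary (logistic-regression style) model.
pred_labels   = [1, 0, 1, 1, 0, 1, 0, 0]
actual_labels = [1, 0, 0, 1, 0, 1, 1, 0]
misclassification_rate = sum(p != a for p, a in zip(pred_labels, actual_labels)) / len(actual_labels)
print("Misclassification rate:", misclassification_rate)  # 2 of 8 predictions wrong = 0.25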