What Is Big Data Analytics
Big data analytics is the use of advanced analytic techniques against very large, diverse data
sets that include structured, semi-structured and unstructured data, from different sources, and in
sizes ranging from terabytes to zettabytes.
What is big data exactly? It can be defined as data sets whose size or type is beyond the ability of
traditional relational databases to capture, manage and process the data with low latency.
Characteristics of big data include high volume, high velocity and high variety. Sources of data
are becoming more complex than those for traditional data because they are being driven by
artificial intelligence (AI), mobile devices, social media and the Internet of Things (IoT). For
example, the different types of data originate from sensors, devices, video/audio, networks, log
files, transactional applications, web and social media — much of it generated in real time and at
a very large scale.
With big data analytics, you can ultimately fuel better and faster decision-making, modelling and
prediction of future outcomes, and enhanced business intelligence. As you build your big data
solution, consider open source software such as Apache Hadoop, Apache Spark and the wider
Hadoop ecosystem: cost-effective, flexible data processing and storage tools designed to
handle the volume of data being generated today.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle these.
It involves dealing with missing data, noisy data and so on. Noisy data can be smoothed with
techniques such as regression and clustering:
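As a minimal sketch of handling missing data (assuming numeric readings in a plain Python list, with `None` marking missing entries), mean imputation is one common approach:

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

readings = [10.0, None, 14.0, 12.0, None]
print(fill_missing_with_mean(readings))  # -> [10.0, 12.0, 14.0, 12.0, 12.0]
```

Other strategies include dropping incomplete records or filling with the most frequent value; which one fits depends on how much data is missing.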
1. Regression:
Here data can be smoothed by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having
several independent variables).
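A least-squares linear fit over noisy one-dimensional data illustrates this kind of smoothing (a pure-Python sketch; the sample values are illustrative):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]        # noisy observations
slope, intercept = linear_fit(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values on the fitted line
```

The smoothed values lie exactly on the fitted line, replacing the noisy observations.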
2. Clustering:
This approach groups similar data into clusters. Outliers may go undetected,
or they may fall outside the clusters.
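A small sketch of cluster-based outlier detection (assuming, for brevity, that the cluster centres are already known rather than learned): a point that falls outside every cluster is flagged.

```python
def flag_outliers(points, centroids, radius):
    """A point is an outlier if it lies farther than `radius`
    from every cluster centroid."""
    return [p for p in points if min(abs(p - c) for c in centroids) > radius]

data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8, 55.0]
print(flag_outliers(data, centroids=[1.0, 10.0], radius=2.0))  # -> [55.0]
```

In practice the centroids would come from a clustering algorithm such as k-means rather than being fixed by hand.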
2. Data Transformation:
This step is taken to transform the data into forms suitable for the mining process.
It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
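Min-max scaling is the usual way to do this; a minimal sketch in pure Python:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(min_max_normalize([20, 30, 40, 50]))     # smallest -> 0.0, largest -> 1.0
print(min_max_normalize([20, 50], -1.0, 1.0))  # scaled into -1.0 to 1.0
```

Note that this simple version assumes the values are not all identical (otherwise `hi - lo` is zero).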
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
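For example, a derived attribute can expose structure that the raw attributes hide (a hypothetical sketch; the `population` and `area` fields are illustrative, not from the text):

```python
rows = [{"population": 1000, "area": 50},
        {"population": 300, "area": 10}]

# construct a new "density" attribute from the given attributes
for row in rows:
    row["density"] = row["population"] / row["area"]

print([row["density"] for row in rows])  # -> [20.0, 30.0]
```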
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels.
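A small sketch of discretization: raw ages are mapped to conceptual levels (the cut-off values are illustrative assumptions):

```python
def discretize_age(age):
    """Map a raw numeric age to a conceptual level."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

print([discretize_age(a) for a in [12, 30, 70]])  # -> ['minor', 'adult', 'senior']
```

Binning into fixed-width intervals (e.g. 0-9, 10-19, ...) works the same way, only with numeric interval labels instead of conceptual ones.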
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression
models.
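For instance, if the data are well described by a line, only the fitted coefficients need to be stored (a sketch with exactly linear data, chosen so the saving is obvious):

```python
xs = list(range(100))
ys = [3 * x + 7 for x in xs]          # 100 (x, y) pairs of raw data

# fit y = slope * x + intercept by least squares
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

model = (slope, intercept)            # 2 numbers stored instead of 100 pairs
print(model)  # -> (3.0, 7.0)
```

Any required y value can later be reconstructed (approximately, for real noisy data) from the two stored coefficients.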
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If
the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component
Analysis).
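A minimal PCA sketch using NumPy (assuming NumPy is available) projects 2-D points onto their leading principal component, halving the per-point storage; this is a lossy reduction:

```python
import numpy as np

X = np.array([[2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]])
Xc = X - X.mean(axis=0)                  # center each attribute
cov = np.cov(Xc, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # direction of largest variance
reduced = Xc @ pc1                       # one coordinate per point (lossy)
print(reduced.shape)  # -> (4,)
```

Because the example points lie close to a line, most of their variance survives the projection; reconstructing `Xc` from `reduced` and `pc1` would recover the data only approximately.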