1st Internal Solved
Q. No. | Answers | Marks
1a Describe data, web data and Big Data. Explain the 3Vs characteristics of Big Data. 7M
Definitions of Data
Data is information, usually in the form of facts or statistics that one can analyze or use
for further calculations.
Data is information that can be stored and used by a computer program.
Data is information presented in numbers, letters, or other form.
Data is information from a series of observations, measurements or facts.
Data is information from a series of behavioral observations, measurements or facts.
Volume refers to the size of the data, i.e., the quantity of data generated and stored. Size is what qualifies the data as Big Data or not.
Velocity refers to the speed of generation of data. It is a measure of how fast the data generates and processes. To meet the demands and the challenges of processing Big Data, the velocity of generation of data plays a crucial role.
Variety: Big Data comprises a variety of data. Data is generated from multiple sources. The variety is due to the availability of a large number of heterogeneous platforms in the industry. Variety is an important characteristic that needs to be known for proper processing of data, and it helps in the effective use of data according to its format.
Veracity takes into account the quality of the data captured, i.e., whether the data is uncertain or imprecise.
4M
1b Define Big Data architecture. Draw five layers in architecture design and explain functions 8M
in each layer.
Big Data architecture is defined as: “Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.”
2M
Architecture logically defines how a Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.
Figure shows the logical layers and the functions which are considered in Big Data architecture. A data processing architecture consists of five layers.
2M
Figure: Design of logical layers in a data processing architecture and the functions in the layers.
Logical layer 1 (L1) is for identifying data sources, which are external, internal or both. L1 considers the following aspects in a design:
- Amount of data needed at the ingestion layer 2 (L2)
- Push from L1 or pull by L2, as per the mechanism for the usages
- Source data-types: database, files, web or service
- Source formats, i.e., semi-structured, unstructured or structured
The layer 2 (L2) is for data-ingestion. Ingestion is the process of obtaining and importing data for immediate use or transfer. L2 considers the following aspects:
- Obtaining and importing data using ELT (Extract, Load and Transform)
- Data pre-processing (validation, transformation or transcoding) requirements
- Data semantics (such as replace, append, aggregate, compact)
- Ingestion and ETL processes run either in batches or in real time; real time means the data is stored and used as generated, while batch processing uses discrete datasets at scheduled or periodic intervals of time (a sketch of a batch step follows this list)
4M
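A minimal sketch of a batch, ETL-style ingestion step at layer L2, in Python. The CSV field names and validation rules below are illustrative assumptions, not from the source:

import csv
import io

RAW_BATCH = """date,store_id,units_sold
2023-01-01,S1,120
2023-01-01,S2,-5
2023-01-02,S1,abc
2023-01-02,S2,98
"""

def extract(raw_text):
    """Extract: read raw CSV rows from the incoming batch."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def validate(row):
    """Pre-processing: keep only rows with a non-negative integer count."""
    try:
        return int(row["units_sold"]) >= 0
    except ValueError:
        return False

def transform(row):
    """Transform: transcode the count field from text to integer."""
    return {"date": row["date"],
            "store_id": row["store_id"],
            "units_sold": int(row["units_sold"])}

def load(rows):
    """Load: here we just collect rows; a real L3 layer would write to
    HDFS or a NoSQL store instead."""
    return rows

batch = load([transform(r) for r in extract(RAW_BATCH) if validate(r)])
print(batch)  # the two invalid rows (-5 and 'abc') are filtered out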
The L3 layer is for storage of data from the L2 layer. L3 considers the following aspects:
- Data storage type (historical or incremental), formats, compression, frequency of incoming data, querying patterns and data consumption requirements for L4 or L5
- Data storage using the Hadoop distributed file system or NoSQL data stores, such as HBase, Cassandra and MongoDB
The L4 layer is for data processing, using software such as MapReduce, Hive, Pig or Spark for analysis of the stored data. The L5 layer is for data consumption, where the processed data is used for analytics, visualizations, reports, export to datastores and business intelligence.
2b Explain data noise, outliers, data anomaly and duplicate data with examples. Why is filtering required during pre-processing? 5M
Noise
Noise in data refers to data giving additional meaningless information besides the true/actual information. Noise refers to the difference between the measured value and the true value, due to additional influences. The result of data analysis is adversely affected by noisy data.
Ex: Consider noise in wind velocity and direction readings. The velocity at certain instances will appear too high and at others too low. The direction at certain instances will appear inclined towards the north and at others towards the south.
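A minimal sketch of filtering noisy wind-velocity readings with a simple moving average; the readings below are made-up illustrative values:

def moving_average(values, window=3):
    """Smooth a series by averaging each point with its recent neighbours."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        smoothed.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return smoothed

velocity = [10.2, 10.4, 19.8, 10.3, 2.1, 10.5, 10.1]  # the spikes are noise
print(moving_average(velocity))  # smoothed series is closer to true values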
Outliers
Outliers refer to data which appear not to belong to the dataset. Outliers need to be removed from the dataset; else the result will be affected by a small or large amount. If valid data is identified as an outlier, then also the results will be affected. Outliers are a result of human data-entry errors or programming bugs.
Ex: In a student's grade-sheet, the result in one subject in the 4th semester shows 9.0/10 in place of the actual 3.0/10. Data 9.0 is an outlier. The student's semester grade point average (SGPA) will be erroneously declared, and the student may even be wrongly declared to have failed in that semester.
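A minimal sketch of flagging grade outliers, assuming grade points on a 0-10 scale; the simple rule (flag values far from the median) and the threshold are illustrative choices, not a prescribed method:

def find_outliers(grades, max_deviation=4.0):
    """Flag grades that deviate from the median by more than a threshold."""
    ordered = sorted(grades)
    median = ordered[len(ordered) // 2]
    return [g for g in grades if abs(g - median) > max_deviation]

semester_grades = [3.5, 2.8, 3.0, 9.0, 3.2]  # 9.0 entered in place of 3.0
print(find_outliers(semester_grades))        # [9.0] is flagged for review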
4M
Missing Values
A missing value implies data not appearing in the dataset.
Ex: Consider missing values in the sales figures of chocolates. The values were not sent for certain dates. This may be due to a failure of the power supply at the machine or network problems on specific days in a month. The chocolate sales not added for a day can be added to the next day's sales data; the effect on the average sales per day is then not significant. However, if the failure occurred on the last day of a month, then the analysis will be erroneous.
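A minimal sketch of handling a missing daily sales value, assuming a hypothetical day-to-sales mapping where None marks a missing report:

daily_sales = {1: 500, 2: None, 3: 980, 4: 470}  # day 2 report is missing;
                                                 # its sales arrived merged
                                                 # into day 3's figure

# The month total stays correct because the missing day's sales were
# carried into the next day's value.
month_total = sum(v for v in daily_sales.values() if v is not None)

# Average per reporting day; the missing value slightly biases this figure.
reported = [v for v in daily_sales.values() if v is not None]
print(month_total, sum(reported) / len(reported))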
Duplicate Values
A duplicate value implies the same data appearing two or more times in a dataset.
Ex: Consider duplicate values in the sales figures of chocolates. This may be due to some problem in the system. When duplicate values are sent and added, the sales result analysis gets affected. It can even result in false alarms to a service, which affects the supply chain.
Assume network problems at certain instances: the machine may not get an acknowledgement of the sales figures from the server, leading to the sales record being resent. The sales figures of chocolates then get recorded twice at that instance, and the chocolate sales data gets added twice in a specific day's sales data. The calculation of monthly sales data is adversely affected.
Filtering is therefore required during pre-processing: it removes noise, outliers and duplicates and accounts for missing values before analysis, so that results are not adversely affected and false alarms are avoided.
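A minimal sketch of de-duplicating resent sales records, assuming each record carries a hypothetical (store_id, timestamp) pair that uniquely identifies one report:

records = [
    {"store_id": "S1", "timestamp": "2023-01-05T10:00", "units": 40},
    {"store_id": "S1", "timestamp": "2023-01-05T10:00", "units": 40},  # resent
    {"store_id": "S2", "timestamp": "2023-01-05T10:05", "units": 25},
]

seen = set()
unique = []
for rec in records:
    key = (rec["store_id"], rec["timestamp"])
    if key not in seen:  # drop the record resent after a lost acknowledgement
        seen.add(key)
        unique.append(rec)

print(sum(r["units"] for r in unique))  # 65, not the inflated 105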
2c Describe the pre-processing steps: data cleaning, transforming, modeling and visualizing data. 4M
Data Cleaning refers to the process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Ex: Correcting the grade outliers in students' grade-sheets.
Data Transforming
Data reduction enables the transformation of acquired information into an ordered, correct and simplified form. The reduction enables ingestion of meaningful data into the datasets. The basic concept is reducing the multitudinous amount of data and using only its meaningful parts.
Data wrangling refers to the process of transforming and mapping the data.
Ex: Mapping transforms data into another format, which makes it valuable for analytics and data visualizations (a sketch follows).
4M
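A minimal sketch of data wrangling: mapping cleaned records into JSON and CSV formats for analytics and visualization; the record fields are illustrative assumptions:

import csv
import io
import json

records = [{"subject": "Maths", "grade": 3.0},
           {"subject": "Physics", "grade": 3.5}]

# Map into JSON, a format widely supported by analytics and
# visualization tools.
as_json = json.dumps(records)

# Map the same records into CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["subject", "grade"])
writer.writeheader()
writer.writerows(records)

print(as_json)
print(buf.getvalue())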
2M
Figure: Data store export from machines, files, computers, web servers and web services.
The Data Store first pre-processes data from machine and file data sources. Pre-processing transforms the data into a table or partition schema or supported data formats, for example, JSON, CSV or AVRO. The data then exports in compressed or uncompressed data formats.
1M
Cloud offers various services: IaaS, PaaS and SaaS. These services can be accessed through a cloud client (client application), such as a web browser, SQL or other client. The figure shows the data-store export from machines, files, computers, web servers and web services. The data exports to clouds, such as IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata Communications or Hadoop cloud services.
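A minimal sketch of exporting pre-processed data in a compressed format before transfer to a cloud data store; gzip-compressed JSON is one illustrative choice, standing in for formats such as AVRO:

import gzip
import json

rows = [{"machine_id": "M1", "reading": 21.4},
        {"machine_id": "M2", "reading": 19.8}]

payload = json.dumps(rows).encode("utf-8")
compressed = gzip.compress(payload)  # compressed export format
print(len(payload), "->", len(compressed), "bytes")

# An uploader would now transfer `compressed` to the chosen cloud service;
# the receiver can restore the original rows losslessly.
restored = json.loads(gzip.decompress(compressed))
assert restored == rows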
Following are the five application areas for the popularity of Big Data:
1. Leading marketers use Customer Value Analytics (CVA) to deliver consistent customer experiences. CVA uses as inputs the evaluated purchase patterns, preferences, quality, price and post-sales servicing requirements.
2. Operational analytics for optimizing company operations.
3. Detection of frauds and compliances. Ex: Fraud is borrowing money on already mortgaged assets; compliance means returning the loan and interest installments by the borrowers.
4. New products and innovations in services. Ex: A company develops software and then offers services, like Uber.
5. Enterprise data warehouse optimization.
Big Data usage has the following features for enabling detection and prevention of frauds:
1. Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, thus enriching the existing data.
2. Using multiple sources of data and connecting with many applications.
3. Analyzing data, which enables structured reports and visualization.
4. Providing high-volume data mining and new innovative applications, thus leading to new business intelligence and knowledge discovery.
5. Faster detection of threats and prediction of frauds by using various data and information available publicly.
SQL
SQL (Structured Query Language) is a language for viewing or changing databases, for data access control, schema creation and data modifications.
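A minimal sketch of SQL for schema creation, data modification and viewing, using Python's built-in sqlite3 module; the table and column names are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema creation.
cur.execute("CREATE TABLE sales (store_id TEXT, units INTEGER)")

# Data modification.
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("S1", 120), ("S2", 98)])

# Viewing data.
cur.execute("SELECT store_id, SUM(units) FROM sales GROUP BY store_id")
print(cur.fetchall())
conn.close()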
Figure shows the co-existence of data at a server (SQL, RDBMS) with NoSQL and Big Data at Hadoop, Spark, Mesos, S3 or compatible clusters.
1M
Figure: Coexistence of RDBMS for traditional server data, NoSQL, and Hadoop, Spark and compatible Big Data clusters.
4b Explain Traditional and Big Data analytics architecture reference model. 5M
Data Analytics
Analysis brings order, structure and meaning to the collection of data. Analytics uses
historical data and forecasts new values or results. Data analysis helps in finding business
intelligence and in decision making.
Data Analytics Definition
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the
goal of discovering useful information, suggesting conclusions and supporting decision making.
Phases in Analytics
1. Descriptive analytics enables deriving additional value from visualizations and reports.
2. Predictive analytics is advanced analytics which enables extraction of new facts and knowledge, and then predicts or forecasts (a minimal sketch follows this list).
3. Prescriptive analytics enables derivation of additional value and undertaking better decisions for new options to maximize profits.
3M
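A minimal sketch of the predictive phase: fitting a least-squares line to past monthly sales and forecasting the next month; the sales figures are made-up illustrative data:

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

months = [1, 2, 3, 4, 5]
sales = [100, 110, 125, 130, 145]  # historical data
a, b = fit_line(months, sales)
print(round(a * 6 + b, 1))         # forecast for month 6: 155.0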
Figure shows an overview of a reference model for analytics architecture. The figure also shows the Big Data file systems, machine learning algorithms, query languages and the usage of the Hadoop ecosystem.
2M
Big Data driven approaches help research in medicine. Following are some findings: building the health profiles of individual patients and building predictive models for better diagnosis and better treatment.
2M
1. Aggregating a large volume and variety of information from multiple sources, from DNAs, proteins and metabolites to cells, tissues, organs, organisms and ecosystems, can enhance the understanding of the biology of diseases. Big Data creates patterns and models by data mining and helps in better understanding and research.
2. Deploying wearable-device data: the device records data during active as well as inactive periods, providing a better understanding of patient health and better risk profiling of the user for certain diseases.