1st Internal Solved


7th Semester (2018 Scheme)

SHREE DEVI INSTITUTE OF TECHNOLOGY
Kenjar, Mangalore - 574142
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
I Internal Test – November 2021

Sem/Sec: VII A & B                                  Max Marks: 30
Course Name: BIG DATA AND ANALYTICS                 Duration: 1 Hour
Course Code: 18CS72                                 Date: 15/11/2021

Note: Answer any One Full Question from each part. Marks, RBT Level and CO are indicated against each question.
PART - A

1 a. Describe data, web data and Big Data. Explain the 3Vs characteristics of Big Data. (7M, L2, CO1)

   b. Define Big Data architecture. Draw the five layers in the architecture design and explain the functions of each layer. (8M, L1, CO1)

OR

2 a. Explain Massively Parallel Processing and Cloud Computing in the Big Data scenario. (6M, L2, CO1)

   b. Explain data noise, outliers, data anomaly and duplicate data with examples. Why is filtering required during pre-processing? (5M, L2, CO1)

   c. Describe the pre-processing steps: data cleaning, transforming, modeling and visualizing data. (4M, L2, CO1)

PART - B

3 a. With a figure, show how a data store exports using machines, files, computers, web servers and web services. (7M, L2, CO1)

   b. Describe the ways Big Data analytics is used in marketing, sales and advertising. (8M, L2, CO1)

OR

4 a. Define distributed databases. How do they differ from distributed data stores? (5M, L1, CO1)

   b. Explain the Traditional and Big Data analytics architecture reference model. (5M, L1, CO1)

   c. Describe how Big Data analytics facilitates Healthcare & Medicine. (5M, L2, CO1)
SCHEME OF EVALUATION:

Sem: VII                                            Max. Marks: 30
Course Name / Code: BIG DATA AND ANALYTICS (18CS72)        Date: 15/11/2021
1a Describe data, web data and Big Data. Explain the 3Vs characteristics of Big Data. (7M)

Definitions of Data
 Data is information, usually in the form of facts or statistics that one can analyze or use
for further calculations.
 Data is information that can be stored and used by a computer program.
 Data is information presented in numbers, letters, or other form.
 Data is information from a series of observations, measurements or facts.
 Data is information from a series of behavioral observations, measurements or facts.

Definition of Web Data

Web data is the data present on web servers (or enterprise servers) in the form of text, images, videos, audio and multimedia files for web users. A user (client software) interacts with this data. A client can access the data of a response from a server. Internet applications, including websites, web services, web portals, online business applications, emails, chats, tweets and social networks, provide and consume web data. (3M)

Big Data Definitions


 Big Data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing for enhanced decision making, insight discovery and process optimization.
 A collection of data sets so large or complex that traditional data processing applications are inadequate.
 Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.
 Data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

The 3Vs Characteristics of Big Data

Volume relates to the size of the data. Size defines the amount or quantity of data which is generated from an application. The size determines the processing considerations needed for handling that data.

Velocity refers to the speed of generation of data. It is a measure of how fast the data is generated and processed. To meet the demands and the challenges of processing Big Data, the velocity of generation of data plays a crucial role.

Variety: Big Data comprises a variety of data, generated from multiple sources. The variety is due to the availability of a large number of heterogeneous platforms in the industry. Variety is an important characteristic that needs to be known for proper processing of the data, and it helps in the effective use of data according to format.

Veracity takes into account the quality of the data captured, which can be uncertain or imprecise. (4M)
1b Define Big Data architecture. Draw the five layers in the architecture design and explain the functions of each layer. (8M)

Big Data architecture is defined as: “Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.” (2M)
Architecture logically defines how a Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.

The figure below shows the logical layers and the functions which are considered in Big Data architecture. The data processing architecture consists of five layers:

1. Identification of data sources,
2. Acquisition, ingestion, extraction, pre-processing and transformation of data,
3. Data storage at files, servers, clusters or cloud,
4. Data processing, and
5. Data consumption in a number of programs and tools.

[Figure: Design of logical layers in a data processing architecture and functions in the layers.] (2M)

Logical layer 1 (L1) is for identifying data sources, which are external, internal or both. L1
considers the following aspects in a design:
 Amount of data needed at ingestion layer 2 (L2)
 Push from L1 or pull by L2, as per the mechanism for the usage.
 Source data types: database, files, web or service.
 Source formats, i.e., semi-structured, unstructured or structured.

The layer 2 (L2) is for data ingestion. Ingestion is the process of obtaining and importing data for immediate use or transfer. L2 considers the following aspects:
 Obtaining and importing data using ELT (Extract, Load and Transform).
 Data pre-processing (validation, transformation or transcoding) requirements.
 Data semantics (such as replace, append, aggregate, compact).
 Ingestion and ETL processes either in batches or in real time, which means storing and using the data as generated. Batch processing uses discrete datasets at scheduled or periodic intervals of time.

The L3 layer is for storage of data from the L2 layer. L3 considers the following aspects:
 Data storage type (historical or incremental), formats, compression, frequency of incoming data, querying patterns and data consumption requirements for L4 or L5.
 Data storage using the Hadoop distributed file system or NoSQL data stores (HBase, Cassandra, MongoDB).

The L4 layer is for data processing. L4 considers the following aspects:

 Data processing software such as MapReduce, Hive, Pig, Spark, Spark Mahout and Spark Streaming.
 Processing in scheduled batches, real time or hybrid.
 Processing as per synchronous or asynchronous processing requirements at L5.

The L5 layer considers the consumption of data for the following:

 Data integration.
 Dataset usage for reporting and visualization.
 Dataset usage for analytics (real time, near real time, scheduled batches), BPs, BIs and knowledge discovery.
 Export of datasets to cloud, web or other systems. (4M)
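
For illustration, the five layers can be traced end to end as a small pipeline. The following is only a minimal sketch; the file names ("sales.csv", "staged.json") and the simple aggregation are hypothetical stand-ins for real sources, HDFS/NoSQL stores and MapReduce/Spark jobs:

    # Minimal sketch of the five-layer flow; file names are hypothetical placeholders.
    import csv
    import json

    # L1: identify a data source (a local CSV file stands in for a machine/file source)
    SOURCE_FILE = "sales.csv"

    # L2: ingestion with pre-processing (validation and transcoding)
    def ingest(path):
        rows = []
        with open(path, newline="") as f:
            for rec in csv.DictReader(f):
                if rec.get("amount"):                     # validation: drop records missing 'amount'
                    rec["amount"] = float(rec["amount"])  # transcoding: string -> number
                    rows.append(rec)
        return rows

    # L3: storage (a JSON file stands in for HDFS/NoSQL storage)
    def store(rows, path="staged.json"):
        with open(path, "w") as f:
            json.dump(rows, f)
        return path

    # L4: processing (a simple aggregation stands in for MapReduce/Spark jobs)
    def process(rows):
        return {"total_sales": sum(r["amount"] for r in rows), "records": len(rows)}

    # L5: consumption (reporting)
    if __name__ == "__main__":
        staged = ingest(SOURCE_FILE)
        store(staged)
        print(process(staged))   # report consumed by a user or downstream tool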
2a Explain Massively Parallel Processing and Cloud Computing in the Big Data scenario. (6M)

Massively Parallel Processing Platforms

Many programs are so large and complex that it is impossible to execute them on a single computer system. It is then required to enhance (scale up) the computer system or use massively parallel processing platforms (MPPs).
Parallelization of tasks can be done at several levels:
1. Distributing separate tasks onto separate threads on the same CPU,
2. Distributing separate tasks onto separate CPUs on the same computer, and
3. Distributing separate tasks onto separate computers.
When making use of the advantage of multiple computers, software needs to be able to parallelize tasks. The computational problem is broken into discrete pieces of sub-tasks that can be processed simultaneously. The total time taken will be much less than with a single compute resource. (3M)
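
A minimal sketch of level 2 parallelization (separate tasks on separate CPU cores) in Python; the squaring sub-task and the chunk count are hypothetical examples, not tied to any particular MPP product:

    # Sketch: break a problem into discrete sub-tasks and run them simultaneously.
    from multiprocessing import Pool

    def subtask(chunk):
        """A discrete piece of the computational problem: sum of squares of one chunk."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]    # break the problem into 4 sub-tasks
        with Pool(processes=4) as pool:
            partials = pool.map(subtask, chunks)   # sub-tasks processed simultaneously
        print(sum(partials))                       # combine the partial results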
Cloud Computing
Cloud computing is a type of Internet-based computing that provides shared processing resources and data to computers and other devices on demand. Cloud usage circumvents single-point failure, since its multiple nodes perform automatically and interchangeably. It offers high data security compared to other distributed technologies.
Ex: Amazon Web Services (AWS), Elastic Compute Cloud (EC2), Microsoft Azure, Apache CloudStack and Amazon Simple Storage Service (S3).

Cloud computing features are:


 On-demand service
 Resource pooling
 Scalability
 Accountability and
 Broad network access

Cloud services can be classified into three fundamental types: (3M)

1. Infrastructure as a Service (IaaS): Providing access to resources, such as hard disks, network connections, database storage, data centers and virtual server spaces, is Infrastructure as a Service (IaaS).
Ex: Tata Communications, Amazon data centers and virtual servers. Apache CloudStack offers public cloud services and provides highly scalable Infrastructure as a Service (IaaS).

2. Platform as a Service (PaaS): Providing the runtime environment that allows developers to build applications and services is Platform as a Service (PaaS). Software at the cloud supports and manages the services: storage, networking, deploying, testing, collaborating, hosting and maintaining applications.
Ex: Hadoop Cloud Service, Oracle Big Data Cloud Services.

3. Software as a Service (SaaS): Providing software applications as a service to end users is


known as Software as a Service (SaaS). Software applications are hosted by a service
provider and made available to customers over the Internet.
Ex: SQL, Google SQL, Oracle Big Data SQL, IBM BigSQL, HPE Vertica, and Microsoft
Polybase.

2b Explain data noise, outliers, data anomaly and duplicate data with examples. Why is filtering required during pre-processing? (5M)

Noise
Noise in data refers to data giving additional meaningless information besides the true/actual information. Noise refers to the difference between the measured value and the true value, due to additional influences. The result of data analysis is adversely affected by noisy data.
Ex: Consider noise in wind velocity and direction readings. The velocity at certain instances will appear too high and at other times too low. The directions at certain instances will appear inclined towards the north and at other times towards the south.

Outliers
Outliers refer to data which appear not to belong to the dataset. Outliers need to be removed from the dataset; otherwise the result will be affected by a small or large amount. If valid data is identified as an outlier, the results are also affected. Outliers are a result of human data-entry errors or programming bugs.
Ex: In a student's grade sheet, the result in one subject in the 4th semester shows 9.0/10 in place of 3.0/10. The value 9.0 is an outlier. The student's semester grade point average (SGPA) will be erroneously declared, and the student may even be declared to have failed in that semester.

Missing Values
A missing value implies data not appearing in the dataset.
Ex: Consider missing values in the sales figures of chocolates, where values are not sent for certain dates. This may be due to a failure of the power supply at the machine or network problems on specific days in a month. The chocolate sales not added for a day can be added to the next day's sales data, and the effect on the average sales per day is not significant. However, if the failure occurs on the last day of a month, then the analysis will be erroneous.

Duplicate Values
A duplicate value implies the same data appearing two or more times in a dataset.
Ex: Consider duplicate values in the sales figures of chocolates. This may be due to some problem in the system. When duplicate values are sent and added, the sales result analysis is affected. It can even result in false alarms to a service, which affects the supply chain.
Assume network problems at certain instances, so an acknowledgement of the sales figures may not be received from the server, leading to the sales record being resent. The sales figures of chocolates then get recorded twice at that instance, and the chocolate sales data is added twice to that day's sales data. The calculation of monthly sales data is adversely affected. (4M)

Pre-processing needs are:

1. Dropping out-of-range, inconsistent and outlier values,
2. Filtering unreliable, irrelevant and redundant information,
3. Data cleaning, editing, reduction and/or wrangling,
4. Data validation, transformation or transcoding,
5. ELT processing. (1M)
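
A minimal filtering sketch using pandas, assuming a hypothetical chocolate-sales dataset; the column names and the IQR threshold rule are illustrative only:

    # Sketch: dropping duplicates, missing values and outliers from sales data.
    # The DataFrame contents and column names are hypothetical examples.
    import pandas as pd

    sales = pd.DataFrame({
        "day":    [1, 2, 2, 3, 4, 5],
        "amount": [120.0, 95.0, 95.0, None, 110.0, 9000.0],  # day 2 duplicated, day 3 missing, day 5 outlier
    })

    sales = sales.drop_duplicates()               # duplicate data: same record appearing twice
    sales = sales.dropna(subset=["amount"])       # missing values: drop rows with no amount

    # Outliers: drop values outside the usual range (simple interquartile-range rule)
    q1, q3 = sales["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    sales = sales[sales["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    print(sales)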

2c Describe the pre-processing steps: data cleaning, transforming, modeling and visualizing data. (4M)

Data Cleaning refers to the process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
Ex: In student grade sheets, correcting the grade outliers.

Data Transforming
Data reduction enables the transformation of acquired information into an ordered, correct and simplified form. The reduction enables the ingestion of meaningful data into the datasets. The basic concept is the reduction of a multitudinous amount of data, using only its meaningful parts.
Data wrangling refers to the process of transforming and mapping the data.
Ex: Mapping converts data into another format, which makes it valuable for analytics and data visualizations.

Data Modeling and Visualizing Data

Data modeling makes sure that the data stored in a database is accurately represented. It includes the data objects, associations and rules. Data modeling creates a clear picture of the data and identifies missing and redundant data.
Data visualization is becoming massively important in the world of Big Data. Representing data through graphs, charts and maps makes it insightful, so hidden trends and patterns can be seen. It makes data more understandable and usable. Data can be presented in the form of rectangular tables, or it can be presented in colorful graphs of various types. When preparing an Excel-format file for data visualization, the data is converted from a .csv file to .xlsx format. (4M)
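
A minimal sketch of the .csv to .xlsx conversion mentioned above, using pandas; the file names are hypothetical, and writing .xlsx assumes the openpyxl engine is installed:

    # Sketch: convert a CSV file to Excel format for visualization.
    # "sales.csv" / "sales.xlsx" are placeholder file names.
    import pandas as pd

    df = pd.read_csv("sales.csv")             # read the raw .csv data
    df.to_excel("sales.xlsx", index=False)    # write .xlsx (needs openpyxl installed)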
3a With a figure, show how a data store exports using machines, files, computers, web servers and web services. (7M)

[Figure: Data store export from machines, files, computers, web servers and web services.] (2M)

A data store first pre-processes data from machine and file data sources. Pre-processing transforms the data into a table or partition schema or supported data formats, for example, JSON, CSV or AVRO. The data then exports in compressed or uncompressed data formats. (1M)
Cloud offers various services: IaaS, PaaS and SaaS. These services can be accessed through a cloud client (client application), such as a web browser, SQL or another client. The figure shows the data store export from machines, files, computers, web servers and web services. The data exports to clouds, such as IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata Communications or Hadoop cloud services.

Export of Data to AWS and Rackspace Clouds

Following are the steps for export to an EC2 (AWS) instance:

1. A process pre-processes the data from a table in a MySQL database and creates a CSV file.
2. An EC2 instance provides an AWS Data Pipeline.
3. The CSV file exports to Amazon S3 using the pipeline; the file is copied into an S3 bucket.
4. The AWS Simple Notification Service (SNS) sends a notification on completion. (2M)
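
As an illustration of step 3, a minimal sketch of copying a CSV file into an S3 bucket using the boto3 library; the bucket, key and file names are hypothetical, and this simple upload stands in for the managed AWS Data Pipeline step:

    # Sketch: upload a CSV export into an S3 bucket with boto3.
    # Bucket name, object key and file name are placeholder values.
    import boto3

    s3 = boto3.client("s3")                       # uses credentials from the environment
    s3.upload_file("sales_export.csv",            # local CSV created by the export process
                   "example-export-bucket",       # destination S3 bucket (hypothetical)
                   "exports/sales_export.csv")    # object key inside the bucket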

Following are the steps for export to Rackspace:

1. One or more databases create a database instance. The process of creation can be configured to create an instance. Each database can have a number of users.
2. The default port number for binding of MySQL is port 3306.
3. The command
   mysqldump -u root -p database_name > database_name.sql
   exports to the Rackspace cloud.
4. When a database is at a remote host, the command
   mysqldump -h host_name -u user_name -p database_name > database_name.sql
   exports to the cloud database. (2M)
3b Describe the ways Big Data analytics is used in marketing, sales and advertising. (8M)

Big Data in Marketing and Sales

Marketing is the creation, communication and delivery of value to customers. Customer Value (CV) depends on three factors: quality, service and price.

Following are the five application areas behind the popularity of Big Data:
1. Leading marketers use Customer Value Analytics (CVA) to deliver consistent customer experiences. CVA uses as inputs the evaluated purchase patterns, preferences, quality, price and post-sales servicing requirements.
2. Operational analytics for optimizing company operations.
3. Detection of frauds and compliance. Ex: Fraud is borrowing money on already-mortgaged assets; compliance means the borrowers returning the loan and interest installments.
4. New products and innovations in service. Ex: A company develops software and then offers services, like Uber.
5. Enterprise data warehouse optimization.

Big Data provides marketing insights into:

1. The most effective content at each stage of a sales cycle,
2. Investment in improving the customer relationship management (CRM),
3. Additions to strategies for increasing customer lifetime value (CLTV),
4. Lowering of customer acquisition cost (CAC).

Big Data usage has the following features for enabling the detection and prevention of frauds:
1. Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, thus enriching existing data.
2. Using multiple sources of data and connecting with many applications.
3. Analyzing data, which enables structured reports and visualization.
4. Providing high-volume data mining and new innovative applications, thus leading to new business intelligence and knowledge discovery.
5. Faster detection of threats and prediction of frauds by using various publicly available data and information. (6M)

Big Data in Advertising

The impact of Big Data on the digital advertising industry is tremendous. Data technology and analytics provide insights, patterns and models which relate the media exposure to the purchase activity of consumers using digital channels.
Success from advertisements depends on data collection, analysis and mining. The new insights enable personalization and targeting of online, social media and mobile advertisements, called hyper-localized advertising.
Advertising nowadays is no longer limited to TV, radio and print. Along with these, advertisers use multiple devices and mediums.
Example: An advertisement for the introduction of a new course by an institution, or of new flights by an airline, needs media other than TV.
Advertising on digital media needs optimization. Too much usage can also have a negative effect. Phone-call, SMS and e-mail based advertisements can be a nuisance if sent without properly researching the potential targets. The analytics help in this direction. (2M)
4a Define distributed databases. How do they differ from distributed data stores? (5M)

Distributed Database Management System

A distributed database management system is a collection of logically interrelated databases at multiple systems over a computer network. The features of a distributed database system are:
1. A collection of logically related databases.
2. Cooperation between databases in a transparent manner, meaning each user within the system may access all of the data within all of the databases as if they were a single database.
3. Location independence, which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.

SQL
SQL (Structured Query Language) is a language for viewing or changing databases, for data access control, schema creation and data modification.
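
A minimal sketch of these SQL operations, run here through Python's built-in sqlite3 module; the table and column names are illustrative:

    # Sketch: schema creation, data modification and viewing with SQL.
    # Table and column names are hypothetical examples.
    import sqlite3

    conn = sqlite3.connect(":memory:")    # throwaway in-memory database
    cur = conn.cursor()

    cur.execute("CREATE TABLE sales (day INTEGER, amount REAL)")            # schema creation
    cur.execute("INSERT INTO sales VALUES (1, 120.0), (2, 95.0)")           # data modification
    for row in cur.execute("SELECT * FROM sales WHERE amount > 100"):       # viewing
        print(row)

    conn.close()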

Large Data Storage using RDBMS

RDBMS tables store data in a structured form. The tables have rows and columns. A set of keys and relational keys is used to access the fields in the tables and to retrieve data using queries (insert, modify, append, join or delete).

In-Memory Column Format Data

Data in a column are kept together in-memory in columnar format.

In-Memory Row Format Databases

Each row record has corresponding values in multiple columns, and the values are stored at consecutive memory addresses. The in-memory row format allows much faster data processing during OLTP (online transaction processing).
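
A small sketch contrasting the two in-memory layouts; the table contents are hypothetical:

    # Sketch: the same table kept in row format vs. columnar format.
    # Values are placeholder examples.

    # Row format: each record's values sit together (fast OLTP-style record access)
    rows = [
        {"day": 1, "amount": 120.0},
        {"day": 2, "amount": 95.0},
    ]

    # Columnar format: each column's values sit together (fast scans and aggregation)
    columns = {
        "day":    [1, 2],
        "amount": [120.0, 95.0],
    }

    print(rows[1])                 # fetch one full record from the row format
    print(sum(columns["amount"]))  # aggregate one column from the columnar format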

Enterprise Data Store & Data Warehouse

An enterprise data server uses data from several distributed sources. Enterprise data, after the data cleaning process, integrates with the server data at a data warehouse.

Big Data Storage

A Big Data store uses NoSQL. NoSQL is also used in cloud data stores. (4M)

The figure below shows the co-existence of data at servers, SQL and RDBMS with NoSQL, and Big Data at Hadoop, Spark, Mesos, S3 or compatible clusters.

[Figure: Coexistence of RDBMS for traditional server data, NoSQL, and Hadoop, Spark and compatible Big Data clusters.] (1M)
4b Explain the Traditional and Big Data analytics architecture reference model. (5M)

A DBMS or RDBMS manages traditional databases.

Data Analytics
Analysis brings order, structure and meaning to a collection of data. Analytics uses historical data and forecasts new values or results. Data analysis helps in finding business intelligence and in decision making.

Data Analytics Definition
Analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making.

Phases in Analytics
1. Descriptive analytics enables deriving additional value from visualizations and reports.
2. Predictive analytics is advanced analytics which enables the extraction of new facts and knowledge, and then predicts/forecasts.
3. Prescriptive analytics enables deriving additional value and undertaking better decisions for new options to maximize profits. (3M)

Analytics integrates with the enterprise server or data warehouse.

The figure below shows an overview of a reference model for analytics architecture. It also shows the Big Data file systems, machine learning algorithms, query languages and usage of the Hadoop ecosystem.

[Figure: Traditional and Big Data analytics architecture reference model.] (2M)

4c Describe how Big Data analytics facilitates Healthcare & Medicine. (5M)

Big Data and Healthcare

Big Data analytics in healthcare uses the following data sources: clinical records, pharmacy records, electronic medical records, diagnosis logs and notes, and additional data such as social interactions, medical leave from a job and deviations from a person's usual activities.

Healthcare analytics using Big Data can facilitate the following:

1. Provisioning of value-based and customer-centric healthcare: This means cost-effective care by improving healthcare quality using the latest knowledge, using electronic health and medical records, and improving coordination among the healthcare-providing agencies, which reduces avoidable overuse and healthcare costs.
2. Utilizing the “Internet of Things” for healthcare: This enables the monitoring of device data for patient parameters, such as glucose, BP and ECGs, and the necessity of visiting physicians.
3. Preventing fraud, waste and abuse in the healthcare industry and reducing healthcare costs: This uses Big Data predictive analytics and helps resolve excessive or duplicate claims. The analytics of patient records and billing helps in detecting anomalies such as overutilization of services in short intervals, claims from different hospitals in different locations simultaneously, or identical prescriptions for the same patient.
4. Improving outcomes by accurately diagnosing patient conditions: early diagnosis, predicting problems such as congestive heart failure, anticipating and avoiding complications, matching treatments with outcomes, and predicting patients at risk of disease or readmission.
5. Monitoring patients in real time: Machine learning algorithms process real-time events and provide physicians with insights to help them make life-saving decisions. Process automation sends alerts to care providers and informs them instantly about changes in the condition of a patient. (3M)

Big Data in Medicine

Big Data driven approaches help research in medicine, for example by building the health profiles of individual patients and building predictive models to diagnose better and offer better treatment.

1. Aggregating a large volume and variety of information from multiple sources, from DNA, proteins and metabolites to cells, tissues, organs, organisms and ecosystems, can enhance the understanding of the biology of diseases. Big Data creates patterns and models by data mining and helps in better understanding and research.
2. Wearable device data, recorded during active as well as inactive periods, provides a better understanding of patient health and better risk profiling of the user for certain diseases. (2M)

Faculty In-charge                                                    HOD
