
BDAM

ASSIGNMENT - 1

MANAGEMENT DEVELOPMENT INSTITUTE
GURGAON

SUBMITTED BY:
GROUP 2
Dikshika Arya (19PT1-07)
Jigyasa Monga (19PT1-12)
Pankhuri Bhatnagar (19PT1-18)
1. How does in-memory computing work?

 In-memory computing means using middleware software that allows one to store data in RAM, which is faster than a traditional spinning disk, across a cluster of computers, and to process it in parallel.

 RAM storage and parallel distributed processing are two fundamental pillars of in-
memory computing.

 A single modern computer can rarely hold enough RAM to store many of today's operational datasets, which easily measure in terabytes.

 To overcome this problem, in-memory computing software is designed from the ground up to store data in a distributed fashion, where the entire dataset is divided across the memory of individual computers, each storing only a portion of the overall dataset. Once data is partitioned, parallel distributed processing becomes a technical necessity simply because data is stored this way.

 Developing technology that enables in-memory computing and parallel processing is highly challenging.

 By storing data in RAM and processing it in parallel, in-memory computing supplies real-time insights that enable businesses to deliver immediate actions and responses. That is what makes it ideal for transactional and analytical applications sharing the same data infrastructure.
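To make the two pillars concrete, here is a minimal Python sketch (purely illustrative, not any particular product's implementation) of hash-partitioning a dataset across several simulated nodes and processing the partitions in parallel:

    from multiprocessing import Pool

    NUM_NODES = 4  # hypothetical cluster size

    def partition(records):
        """Hash-partition records across nodes (the RAM storage pillar)."""
        nodes = [[] for _ in range(NUM_NODES)]
        for key, value in records:
            # A real system would use a stable hash and replicate partitions.
            nodes[hash(key) % NUM_NODES].append((key, value))
        return nodes

    def local_sum(node_data):
        """Each node computes on only its own slice of the data."""
        return sum(value for _, value in node_data)

    if __name__ == "__main__":
        records = [(f"user-{i}", i) for i in range(100_000)]
        nodes = partition(records)
        with Pool(NUM_NODES) as pool:  # the parallel processing pillar
            partials = pool.map(local_sum, nodes)
        print(sum(partials))  # combine the partial results

Each worker process stands in for a cluster node holding its partition in RAM; the final answer is assembled from per-node partial results.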

2. How Google works:

● Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing.
● Parallel processing is a method of computation in which many calculations can be
performed simultaneously, significantly speeding up data processing.
● Google has three distinct parts:
o Googlebot, a web crawler that finds and fetches web pages.
o The indexer that sorts every word on every page and stores the resulting
index of words in a huge database.
o The query processor, which compares your search query to the index and recommends the documents that it considers most relevant (the indexer and query processor are sketched below).
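The following toy Python sketch (not Google's actual implementation) illustrates the last two parts: the indexer builds an inverted index mapping each word to the pages that contain it, and the query processor intersects the page sets for the query terms:

    pages = {
        "page1.html": "big data analytics in memory",
        "page2.html": "google indexes the web",
        "page3.html": "in memory data grids",
    }

    # Indexer: map every word to the set of pages containing it.
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)

    # Query processor: return pages containing every query term.
    def search(query):
        postings = [index.get(word, set()) for word in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("in memory data"))  # {'page1.html', 'page3.html'}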
● Pros: -
o Highly scalable data warehouse
o Easily integrated into analytics tools like Data Studio
o Easy to use with SQL support
o Can be used for all batch jobs or aggregations.

● Cons: -
o High price.
o It does not handle external dependencies.

3. Cloud Bigtable

 A fully managed, scalable NoSQL database service for large analytical and
operational workloads.
 It is a compressed, high-performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable, and a few other Google technologies.

● Pros: -
o Consistent sub-10 ms latency; handles millions of requests per second.
o Ideal for use cases such as personalization, ad tech, fintech, digital media, and IoT.
o Seamlessly scales to match your storage needs, with no downtime during reconfiguration.
o Designed with a storage engine for machine learning applications, leading to better predictions.
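As a brief illustration, here is a hedged sketch of writing and reading a single row with the google-cloud-bigtable Python client. The project, instance, and table IDs are placeholders, and an existing table with a column family named "stats" is assumed:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")  # placeholder project ID
    instance = client.instance("my-instance")       # placeholder instance ID
    table = instance.table("user-events")           # placeholder table ID

    # Write one cell: Bigtable rows and values are byte strings keyed by row key.
    row = table.direct_row(b"user-42")
    row.set_cell("stats", b"clicks", b"17")
    row.commit()

    # Read the row back by its key.
    fetched = table.read_row(b"user-42")
    print(fetched.cells["stats"][b"clicks"][0].value)  # b'17'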

4. What is data and storage virtualization? What are the functions of a VM manager?

Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at the source or where it is physically located, and it can provide a single customer view (or single view of any other entity) of the overall data. A minimal sketch of this idea follows the list of benefits below.

Benefits of data virtualization:

 Reduce the risk of data errors
 Reduce systems workload by not moving data around
 Increase the speed of access to data on a real-time basis
 Significantly reduce development and support time
 Increase governance and reduce risk through the use of policies
 Reduce the data storage required
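As promised above, here is a minimal Python sketch of the data virtualization idea: callers query one logical view and never see where or how each record is stored. Both "sources" below are stand-ins for real systems (a CSV export and a remote API), and the file name is a placeholder:

    import csv

    def from_csv(path):
        # Source 1: a flat file; format details stay hidden from callers.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def from_api():
        # Source 2: a stand-in for rows fetched from a remote service.
        yield {"customer_id": "C2", "city": "Gurgaon"}

    def customer_view():
        """Single customer view: merge all sources behind one interface."""
        yield from from_csv("customers.csv")  # placeholder path
        yield from from_api()

    for record in customer_view():
        print(record["customer_id"], record["city"])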

Storage virtualization

 Storage virtualization is the process of grouping the physical storage from multiple network storage devices so that it looks like a single storage device.
 The process involves abstracting and hiding the internal functions of a storage device from the host application, host servers, or a general network in order to facilitate application- and network-independent management of storage.
 Storage virtualization is sometimes loosely referred to as cloud storage, although the two terms are not strictly interchangeable.
 Some of the benefits of storage virtualization include automated management, expansion of storage capacity, reduced time spent on manual supervision, easy updates, and reduced downtime. A toy model follows.
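The toy Python model below (purely illustrative) pools several physical "devices" behind one logical device using a simple striping policy, so callers never learn which disk actually holds a given block:

    class LogicalVolume:
        def __init__(self, devices):
            self.devices = devices  # underlying physical stores (dicts here)

        def write(self, block_id, data):
            # Stripe blocks across devices; callers see one address space.
            self.devices[block_id % len(self.devices)][block_id] = data

        def read(self, block_id):
            return self.devices[block_id % len(self.devices)][block_id]

    volume = LogicalVolume([{}, {}, {}])  # three simulated disks
    volume.write(7, b"payload")
    print(volume.read(7))  # b'payload'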
Functions of a VM manager:

 Create virtual machines from installation media or from a virtual machine template.
 Delete virtual machines.
 Power off virtual machines.
 Import virtual machines.
 Deploy and clone virtual machines.
 Perform live migration of virtual machines.
 Import and manage ISOs.
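Several of these functions can be exercised programmatically. Here is a hedged sketch using the libvirt Python bindings (the libvirt-python package); a local QEMU/KVM host and an existing VM named "demo-vm" are assumptions, not part of the original text:

    import libvirt

    conn = libvirt.open("qemu:///system")  # connect to the local hypervisor

    # Enumerate managed VMs and their power state.
    for dom in conn.listAllDomains():
        print(dom.name(), "running" if dom.isActive() else "stopped")

    vm = conn.lookupByName("demo-vm")  # hypothetical existing VM
    if not vm.isActive():
        vm.create()    # power on a defined VM
    vm.shutdown()      # request a graceful power off

    conn.close()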

5. Hyper-V technology and Intel VT-x

Hyper-V technology

 Hyper-V is a form of hypervisor-based virtualization technology, which is used for creating, running, and managing virtual machines (VMs). Hyper-V is a Type-1 hypervisor, which means that the hypervisor runs directly on the physical hardware (host machine) and hosts multiple VMs (guest machines) sharing the virtualized hardware resources from the physical server.
 Even though one physical server can host multiple VMs and those VMs share the
same set of physical resources, they do not affect one another’s performance. This is
due to the fact that each VM in a virtual environment runs in isolation from other
VMs.
Intel VT-x

 Intel VT (Virtualization Technology) is the company's hardware assistance for processors running virtualization platforms.
 Intel VT includes a series of extensions for hardware virtualization. The Intel VT-x extensions are probably the best-recognized extensions, adding migration, priority, and memory handling capabilities to a wide range of Intel processors. By comparison, the VT-d extensions add virtualization support to Intel chipsets that can assign specific I/O devices to specific virtual machines (VMs), while the VT-c extensions bring better virtualization support to I/O devices such as network switches.
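On a Linux host, VT-x support can be checked from userspace: the "vmx" CPU flag indicates Intel VT-x (AMD's equivalent, AMD-V, appears as "svm"). A small Python sketch:

    def has_vtx():
        # Intel VT-x is advertised as the "vmx" flag in /proc/cpuinfo.
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return "vmx" in line.split()
        return False

    print("VT-x available:", has_vtx())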

6. Discuss streaming data access and management.


Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity; it is the continuous flow of data generated by various sources. In streaming data access, instead of reading data as packets or chunks, data is read continuously at a constant bitrate. The application starts reading data from the start of a file and keeps reading it sequentially, without random seeks.

By using stream processing technology, data streams can be processed, stored, analyzed, and acted upon as they are generated, in real time.
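The Python sketch below (illustrative only; "events.log" is a placeholder path) reads a file sequentially in fixed-size chunks from the start, with no random seeks, handling each chunk as it arrives:

    def stream(path, chunk_size=4096):
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):  # constant-size sequential reads
                yield chunk

    for chunk in stream("events.log"):  # placeholder file
        print(len(chunk))  # stand-in for real per-chunk processing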

Streaming Data Architecture

A streaming data architecture can be considered a framework of software components built to ingest and process large volumes of streaming data from multiple sources. Such an architecture consumes data immediately as it is generated, persists it to storage, and can include various additional components, such as tools for real-time processing, data manipulation, and analytics.

Streaming stacks can be built from an assembly line of open-source and proprietary solutions to specific problems, including stream processing, storage, data integration, and real-time analytics.

 The Message Broker / Stream Processor

The message broker is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can listen in and consume the messages passed on by the broker.

Streaming brokers support very high performance and massive volumes of message traffic, and are highly focused on streaming, with little support for data transformations or task scheduling.
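Here is a hedged sketch of this producer/broker/consumer pattern using the kafka-python client. The broker address ("localhost:9092") and topic name ("events") are assumptions for illustration:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: translate source data into messages and stream them to the broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b'{"user": "C2", "action": "click"}')
    producer.flush()

    # Consumer: any other component can listen in on the broker's stream.
    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)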

 Batch and Real-time ETL tools

Data streams from one or more message brokers need to be aggregated, transformed, and structured before the data can be analyzed with SQL-based analytics tools. This action is performed by an ETL tool or platform, which receives queries from users, fetches events from message queues, applies the query, and generates a result. It may also perform joins, transformations, and aggregations on the data. The result may be an API call, an action, a visualization, an alert, or even a new data stream.
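A minimal Python sketch of the transform-and-aggregate step (the event list below is an in-memory stand-in for a message queue):

    import json
    from collections import Counter

    raw_events = [
        '{"user": "a", "action": "click"}',
        '{"user": "b", "action": "view"}',
        '{"user": "a", "action": "click"}',
    ]

    clicks = Counter()
    for raw in raw_events:
        event = json.loads(raw)           # parse / structure the raw message
        if event["action"] == "click":    # filter (a simple transformation)
            clicks[event["user"]] += 1    # aggregate per user
    print(clicks)                         # Counter({'a': 2})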

 Data Analytics / Serverless Query Engine

After the streaming data has been prepared by the stream processor, it is analyzed to provide value. There are various approaches and tools used for streaming data analytics.
 Streaming Data Storage

Various data storage options are used for storing streaming data, such as a database or data warehouse, the message broker itself, or a data lake. The data lake option is flexible and inexpensive for storing event data; however, it brings its own technical challenges.

Various modern streaming architectures are also being adopted that rely on a full-stack approach (in contrast to patching together open-source technologies), which further provides the benefits of performance, high availability, fault tolerance, flexibility, etc.
