
COMP 1702 BIGDATA

Vivek Mahanthesh M
001367027
MSc in Data Science

Table of Contents
TASK A: Data Warehouse Design
Task B: MapReduce Programming
Task C: Big Data Project Analysis
    C1
    C2
    C3
References

TASK A: DATA WAREHOUSE DESIGN:

Let us consider a store that sells products across different categories. The data warehouse for this store has three tables: customers, sale and product. The "customers" table holds information about each customer and has four columns: customer_id, customer_name, city and state. The "sale" table records every sale and has six columns: sale_id, product_id, quantity, sale_date, store_id and revenue. The "product" table has three columns: product_id, product_name and unit_price.

The query "SELECT COUNT(*) AS total_products FROM product" retrieves a single value. The COUNT(*) function counts the total number of rows in the product table, and "AS total_products" gives that count the alias total_products. "FROM product" specifies that the data is read from the product table. The output is the total number of rows in the product table.
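Assembled from the clauses quoted above (the original query screenshot is not reproduced here), the full statement reads:

SELECT COUNT(*) AS total_products
FROM product;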

Here "SELECT YEAR(sale_date) AS year, COUNT(sale_id) AS total_sale" selects two columns: YEAR(sale_date) extracts the year from the sale_date column and is aliased AS year, while COUNT(sale_id) counts the number of rows in each group, which represents the total number of sales in that year. "FROM sale" specifies that the data is retrieved from the sale table, and the GROUP BY YEAR(sale_date) clause groups the rows by the year extracted from sale_date. Together, the query calculates the total number of sales for each year.
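The full statement, assembled from the quoted clauses:

SELECT YEAR(sale_date) AS year, COUNT(sale_id) AS total_sale
FROM sale
GROUP BY YEAR(sale_date);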

The query "SELECT SUM(revenue) AS total_revenue" uses the SUM() function to add up all values in the revenue column of the sale table. "AS total_revenue" assigns the alias total_revenue to the result of the SUM() function, so the sum of revenues is labelled total_revenue in the output. "FROM sale" indicates that the data is retrieved from the sale table. The output is the total revenue recorded in the sale table.
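Assembled from the description:

SELECT SUM(revenue) AS total_revenue
FROM sale;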

The clause "SELECT p.product_name, s.revenue" selects two columns: product_name from the product table (p) and revenue from the sale table (s), so the result shows the product name and revenue for each product sale. "FROM product p" retrieves the data from the product table under the alias p. "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s) on the condition that product_id in the product table matches sale_id in the sale table, linking each product with its corresponding sale information. Finally, "WHERE s.revenue > 299.99" keeps only the rows where revenue in the sale table is greater than 299.99. The output lists the product names and revenue for sales where revenue exceeds 299.99.
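The full statement as described (the join condition is reproduced exactly as quoted):

SELECT p.product_name, s.revenue
FROM product p
JOIN sale s ON p.product_id = s.sale_id
WHERE s.revenue > 299.99;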

The clause "SELECT p.product_name, s.revenue" again selects product_name from the product table (p) and revenue from the sale table (s), showing the product name and revenue for each product sale. "FROM product p" specifies that the data is retrieved from the product table under the alias p, and "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s), linking each product with its corresponding sale information. The condition "WHERE s.revenue BETWEEN 399.99 AND 599.99" restricts the result to rows where the revenue in the sale table (s) falls between 399.99 and 599.99. The output displays the product names and revenues within that range.
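Assembled from the quoted clauses:

SELECT p.product_name, s.revenue
FROM product p
JOIN sale s ON p.product_id = s.sale_id
WHERE s.revenue BETWEEN 399.99 AND 599.99;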

"SELECT customer_name" selects the customer_name column from the Customers table, so the result contains the names of customers. "FROM Customers" specifies that the data is retrieved from the Customers table, and the "WHERE state = 'TN'" clause filters the result to rows where the state column equals 'TN'. The output therefore lists the customers located in the state TN.
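The full statement as described:

SELECT customer_name
FROM Customers
WHERE state = 'TN';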

The query "SELECT p.product_name, COUNT(s.sale_id) AS total_sold" selects product_name from the product table (p) and applies the COUNT() function to the sale_id column of the sale table (s), which calculates the number of sales for each product. "AS total_sold" assigns the alias total_sold to the result of the COUNT() function. "FROM product p" retrieves the data from the product table under the alias p, and "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s) on the condition that product_id in the product table matches sale_id in the sale table. The "GROUP BY p.product_name" clause groups the results by product_name, so COUNT() counts the rows belonging to the same product. "ORDER BY total_sold DESC" sorts the result by the total_sold column in descending order, and the "LIMIT 3" clause restricts the output to the top 3 rows.
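Assembled from the quoted clauses:

SELECT p.product_name, COUNT(s.sale_id) AS total_sold
FROM product p
JOIN sale s ON p.product_id = s.sale_id
GROUP BY p.product_name
ORDER BY total_sold DESC
LIMIT 3;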

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city column from the customers table (c) and applies the COUNT() function to the sale_id column of the sale table (s), giving the number of sales for each city. "AS products_sold" assigns the alias products_sold to the result of the COUNT() function, which represents the total number of products sold in each city. "FROM customers c" specifies that the data is retrieved from the customers table under the alias c. "JOIN sale s ON c.customer_id = s.sale_id" joins the customers table (c) with the sale table (s) on the condition that customer_id in the customers table matches sale_id in the sale table, and the "GROUP BY c.city" clause groups the results by the city column. The result is the number of products sold for each city.
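The full statement as described:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
GROUP BY c.city;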

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city from the customers table and counts the sale IDs (sale_id) from the sale table, with "products_sold" as the alias. "FROM customers c" specifies that the data is selected from the "customers" table, aliased as "c". "JOIN sale s ON c.customer_id = s.sale_id" joins the "customers" table with the "sale" table on the condition that the customer_id from the "customers" table matches the sale_id from the "sale" table. "WHERE s.sale_date BETWEEN '2023-01-01' AND '2023-02-01'" filters the joined rows so that only sales whose sale_date falls within that date range are considered. "GROUP BY c.city" groups the result set by the city column from the "customers" table. The output is the count of products sold in each city within the given date range.
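Assembled from the quoted clauses:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
WHERE s.sale_date BETWEEN '2023-01-01' AND '2023-02-01'
GROUP BY c.city;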

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city from the "customers" table (c) and counts the sale IDs (sale_id) from the "sale" table (s); the count is aliased as "products_sold". "FROM customers c" specifies that the data is selected from the "customers" table. "JOIN sale s ON c.customer_id = s.sale_id" joins the "customers" table with the "sale" table on the condition that the customer_id from the "customers" table matches the sale_id from the "sale" table. "WHERE s.quantity > 5" filters the joined data so that only sales transactions where the quantity of products sold is greater than 5 are kept. "GROUP BY c.city" groups the filtered rows by the city column from the "customers" table. The output shows, for each city, the number of products sold where the quantity per sale is greater than 5.
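The full statement as described:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
WHERE s.quantity > 5
GROUP BY c.city;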

Task B: MapReduce Programming


class Mapper:
    def map(self, line):
        # Split the line into its four '|'-separated fields
        authors, title, conference, year = line.split('|')
        # Count the number of authors in the comma-separated author list
        num_authors = len(authors.split(','))
        # Emit (year, (num_authors, 1)); the 1 counts this paper
        yield (year, (num_authors, 1))

class Reducer:
    def reduce(self, key, values):
        total_authors = 0
        total_papers = 0
        # Sum up the number of authors and the number of papers
        for authors, count in values:
            total_authors += authors
            total_papers += count
        # Calculate the average number of authors per paper
        avg_authors = total_authors / total_papers
        # Emit (year, average number of authors per paper)
        yield (key, avg_authors)

The above is a MapReduce algorithm to find the average number of authors per paper for each year.

Map Stage-

In the map stage, a key-value pair is emitted for each paper, where the key is the year of the paper and the value is a tuple containing the number of authors for that paper and a 1 used to count the number of papers for that year.

Each line is split on the '|' character to obtain the authors, title, conference and year fields; the number of authors is then counted by splitting the authors field on ',' and taking the length of the result. For example, a hypothetical input line such as "A. Smith,B. Jones|Some Paper Title|ICML|2020" would emit the pair (2020, (2, 1)).

Reduce Stage-

In the reduce stage, the values are aggregated for each year. The average number of authors per paper for a year is computed by summing the number of authors and the count of papers across all values for that year.

The summing is done by iterating over the values and accumulating the number of authors and the count of papers. The average is then calculated by dividing the total number of authors by the total number of papers, and the result is emitted as a key-value pair where the key is the year and the value is the average number of authors.

The algorithm is efficient because:

1. The input data is distributed across the machines in the Hadoop cluster, and each mapper processes a subset of the data in parallel.
2. The shuffle and sort phase groups all values for a given key and sends them to the same reducer.
3. Reducers run in parallel, each processing a subset of the keys.
4. The algorithm performs a single pass over the input data, which keeps it efficient.

A combiner can be used to perform partial aggregation on the mapper outputs before sending them to the reducers. It reduces the amount of data transferred across the network, improving performance.

class Combiner:
    def combine(self, key, values):
        total_authors = 0
        total_papers = 0
        # Sum up the number of authors and papers seen so far
        for authors, count in values:
            total_authors += authors
            total_papers += count
        # Emit the partially aggregated (total_authors, total_papers) pair
        yield (key, (total_authors, total_papers))

This is how a combiner can be implemented.

The combiner receives the same form of input as the reducer. It sums the num_authors values and the paper counts, then emits a new key-value pair with the same year key and a value that is a tuple containing the summed total_authors and total_papers.
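As a quick sanity check, the classes above can be exercised locally without a Hadoop cluster. The following driver is only a minimal sketch: the dictionary grouping stands in for Hadoop's shuffle and sort, and the sample records are hypothetical.

from collections import defaultdict

lines = [
    "A. Smith,B. Jones|Paper One|ICML|2020",
    "C. Wu|Paper Two|KDD|2020",
    "D. Patel,E. Khan,F. Li|Paper Three|ICML|2021",
]

mapper, combiner, reducer = Mapper(), Combiner(), Reducer()

# Map phase: collect (year, (num_authors, 1)) pairs
grouped = defaultdict(list)
for line in lines:
    for year, value in mapper.map(line):
        grouped[year].append(value)

# Combine phase: partial aggregation before the shuffle
combined = defaultdict(list)
for year, values in grouped.items():
    for key, value in combiner.combine(year, values):
        combined[key].append(value)

# Reduce phase: average authors per paper for each year
for year, values in combined.items():
    for key, avg in reducer.reduce(year, values):
        print(key, avg)   # e.g. 2020 1.5 and 2021 3.0 for the sample lines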

Combiners help to increase performance, especially for large datasets or clusters where network bandwidth is a bottleneck.

However, a combiner can introduce overhead and complexity, so its benefits should be weighed against these drawbacks for each use case.

Task C: Big Data Project Analysis


C1:
Building a data lake would be more suitable than a traditional data warehouse.

The reasons are as follows:

1. Flexibility and agility- a data lake allows storing raw, unstructured and semi-structured data without an upfront schema definition or data transformation. It can hold a variety of data types, including online news, social media and internal enterprise data, without restrictions, and therefore supports processing large amounts of data efficiently (Lakshmanan, 2021).
2. Scalability- data lakes are built to scale horizontally to handle large volumes of data. They can seamlessly grow to store evolving data without disruption.
3. Cost-effectiveness- traditional data warehouses require structured data modelling and storage, whereas data lakes can use cost-effective storage such as cloud object storage (Amazon S3, Azure) or a distributed file system (Hadoop). This reduces storage costs for large-scale data.
4. Analytics flexibility- data lakes support a variety of analytics and data processing tools, helping analysts and data scientists explore and work on new or raw data directly (Inmon, 2016).
5. Future-proofing- by storing data in raw form, the bank can adapt its data lake architecture when new data sources emerge or analytical techniques evolve, without requiring data migration.

Approach for implementing the data lake:

1. Data ingestion- data ingestion pipelines are developed to ingest data from sources such as social media APIs, online news feeds and internal enterprise systems. Both stream and batch processing are used to ingest data efficiently (Narkhede, 2017).
2. Data storage- a suitable storage service is chosen, such as Amazon S3, HDFS or Azure. The data lake is designed for the expected volume of data as well as for scalability, fault tolerance and data durability (Shvachko, 2010).
3. Data governance and metadata management- data governance policies and metadata management practices are implemented to guarantee data quality, security and compliance. Tagging and annotation support data discovery, lineage tracking and access control (Hai, 2018).
4. Data processing and analytics- data processing tools such as Apache Spark and Hadoop MapReduce are used on top of the data lake infrastructure to perform batch and stream processing. This enables data scientists to run complex queries directly on the data lake, as sketched after this list (Karau, 2015).
5. Security and access control- security measures such as encryption and access control policies are implemented to protect the data, ensure regulatory compliance and prevent unauthorised use (Hai, 2018).
6. Monitoring and performance optimisation- monitoring and logging solutions are deployed to track data lake performance, data ingestion rates and resource utilisation. The data lake is continuously optimised for performance, reliability and cost efficiency based on observed patterns and metrics (Shvachko, 2010).
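To illustrate point 4, the following is a minimal sketch of how a data scientist might query raw data held in the lake with Apache Spark; the bucket path and field names are hypothetical placeholders, not part of the scenario.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session (cluster configuration omitted for brevity)
spark = SparkSession.builder.appName("social-media-exploration").getOrCreate()

# Read raw JSON records directly from object storage (hypothetical path)
posts = spark.read.json("s3a://bank-data-lake/raw/social_media/2024/*.json")

# Example exploration: count posts per day that mention a product keyword
daily_mentions = (
    posts.filter(F.col("text").contains("mortgage"))
         .groupBy(F.to_date("created_at").alias("day"))
         .count()
)

daily_mentions.show()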

C2:
Strengths of MapReduce:

1. Scalability- it is highly scalable, processing large quantities of data by spreading the work across a cluster of nodes (Lammel, 2008).
2. Fault tolerance- Hadoop MapReduce provides fault tolerance by handling node failures and re-running failed tasks on other nodes.
3. Parallel processing- data is processed in parallel by splitting work into smaller tasks that run on different nodes.
4. Batch processing- data is processed in discrete batches or chunks.

Limitations of MapReduce:

1. Latency- MapReduce is designed for high-throughput batch processing, so low-latency processing is unsatisfactory. The overhead of job scheduling, data shuffling and task execution introduces latency and does not meet real-time performance requirements.
2. Batch processing model- because processing is done in batches, data is collected over a period and processed together. This is unsuitable for scenarios that require an immediate response, such as detecting discussions about financial products on social media in real time (Fadika, 2010).
3. Complexity- developing and managing MapReduce jobs requires expertise in distributed systems and programming, which slows the rapid development and deployment of real-time analytics.

Best approach:

1. Stream processing framework- frameworks such as Apache Kafka can be used to ingest social media data in real time and analyse it as it arrives; a minimal consumer sketch follows this list (Narkhede, 2017).
2. Event-driven architecture- an event-driven architecture is devised in which social media events trigger real-time processing and analysis workflows (Luckham, 2001).
3. Microservices- processing tasks are broken into smaller, independent microservices that scale independently and handle different varieties of data processing (Wolff, 2016).
4. Data pipeline- a real-time data pipeline integrates social media data with other data sources and analytics tools to generate value from the data in near real time (Samimi, 2021).
5. Continuous deployment- continuous delivery practices help to iterate on analytical models and deploy updates to production systems in real time.
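The sketch below shows what real-time ingestion could look like using the kafka-python client; the topic name, broker address and message fields are hypothetical assumptions rather than part of the coursework scenario.

import json
from kafka import KafkaConsumer

# Subscribe to a (hypothetical) topic carrying social media posts
consumer = KafkaConsumer(
    "social-media-posts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Process each post as soon as it arrives instead of waiting for a batch
for message in consumer:
    post = message.value
    if "mortgage" in post.get("text", "").lower():
        # In a real deployment this would feed an analytics or alerting service
        print("Financial product mentioned:", post.get("id"))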

C3:
Designing a hosting strategy requires consideration of scalability, availability and global accessibility.

1. Cloud infrastructure- multiple cloud providers such as AWS, Azure and GCP are considered. Instances are deployed across multiple global regions (for example the US and Europe) to give users low latency and high availability. VPCs are created to isolate the infrastructure and secure the network.
2. Data storage- cloud-based distributed storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage is used to store large volumes of data. Data is replicated across multiple regions for redundancy and disaster recovery, and encryption is applied to secure data in the cloud.
3. Resources- auto-scaling groups are used to scale compute resources based on workload demand. Technologies such as Kubernetes and Docker are used for efficient resource utilisation and deployment.
4. Data processing framework- managed cloud services are used to run distributed data processing frameworks such as Apache Spark.
5. Availability- global load balancers distribute incoming traffic across regions, and failover is configured between regions to redirect traffic seamlessly in case of disruption.
6. Networking- CDN services are used to cache and deliver static assets to users globally with low latency, and private networks are established between regions to secure data transfer.
7. Monitoring and management- centralised monitoring and logging solutions are implemented to track system performance, resource utilisation and security incidents. Automated alerts and notifications are raised in response to performance degradation or security breaches.

References:
1. Lakshmanan, G. T., Malik, P., & Mordohai, P. (2021). Data lakes and data lakes. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. https://www.wiley.com/enus/Wiley+Interdisciplinary+Reviews%3A+Data+Mining+and+Knowledge+Discovery-p-9780JRNL72827
2. Inmon, W. H., & Krishnan, K. (2016). Building the Unstructured Data Warehouse. Technics Publications. https://www.abebooks.co.uk/9781935504047/Building-Unstructured-Data-Warehouse-Architecture-1935504045/plp
3. Narkhede, N., Shapira, G., & Palino, T. (2017). Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. O'Reilly Media, Inc. https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/
4. Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1-10). IEEE. https://ieeexplore.ieee.org/document/5496972
5. Hai, R., Quix, C., & Geisler, S. (2018). Data lake architecture: A blueprint for data lake management. Journal of Data and Information Quality (JDIQ), 10(3), 1-18. https://www.researchgate.net/publication/350656318_The_Data_Lake_Architecture_Framework_A_Foundation_for_Building_a_Comprehensive_Data_Lake_Architecture
6. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Data Analysis. O'Reilly Media, Inc. https://www.oreilly.com/library/view/learning-spark/9781449359034/
7. Lämmel, R. (2008). Google's MapReduce programming model—Revisited. Science of Computer Programming, 70(1), 1-30. https://www.sciencedirect.com/science/article/pii/S0167642307001281
8. Fadika, Z., & Govindaraju, M. (2010). LBVIZ: A batch Apache Hadoop job analyzer. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW) (pp. 1-8). IEEE. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=63646dbd7dc4a323b1186f2772497d393d1de042
9. Etzion, O., & Niblett, P. (2011). Event Processing in Action. Manning Publications Co. https://www.manning.com/books/event-processing-in-action
10. Wolff, E. (2016). Microservices: Flexible Software Architecture. Addison-Wesley Professional. https://www.oreilly.com/library/view/microservices-flexible-software/9780134650449/
11. Chen, L. (2015). Continuous delivery: Huge benefits, but challenges too. IEEE Software, 32(2), 50-54. https://www.researchgate.net/publication/271635510_Continuous_Delivery_Huge_Benefits_but_Challenges_Too

