
COMP 1702 BIGDATA

Vivek Mahanthesh M
001367027
MSc in Data Science

Table of Contents
TASK A: Data Warehouse Design
Task B: MapReduce Programming
Task C: Big Data Project Analysis
    C1
    C2
    C3
References

TASK A: DATA WAREHOUSE DESIGN:

Let us consider a store that sells products across different categories. The data warehouse for this store has three tables: customers, sale and product. The "customers" table holds information about each customer and has four columns: customer_id, customer_name, city and state. The "sale" table records every sale and has six columns: sale_id, product_id, quantity, sale_date, store_id and revenue. The "product" table has three columns: product_id, product_name and unit_price.

The query "SELECT COUNT(*) AS total_products FROM product" retrieves a single value. The COUNT(*) function counts the total number of rows in the product table, and "AS total_products" gives that count the alias total_products. "FROM product" specifies that the data is read from the product table. The output is the total number of rows in the product table.
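Assembled from the clauses quoted above (the original query screenshot is not reproduced here), the full statement reads:

SELECT COUNT(*) AS total_products
FROM product;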

Here "SELECT YEAR(sale_date) AS year, COUNT(sale_id) AS total_sale" selects two columns: YEAR(sale_date) extracts the year from the sale_date column and is aliased AS year, while COUNT(sale_id) counts the number of rows in each group, which represents the total number of sales in that year. "FROM sale" specifies that the data is retrieved from the sale table, and the GROUP BY YEAR(sale_date) clause groups the rows by the year extracted from sale_date. Together, the query calculates the total number of sales for each year.
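The full statement, assembled from the quoted clauses:

SELECT YEAR(sale_date) AS year, COUNT(sale_id) AS total_sale
FROM sale
GROUP BY YEAR(sale_date);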

The query "SELECT SUM(revenue) AS total_revenue" uses the SUM() function to add up all values in the revenue column of the sale table. "AS total_revenue" assigns the alias total_revenue to the result of the SUM() function, so the sum of revenues is labelled total_revenue in the output. "FROM sale" indicates that the data is retrieved from the sale table. The output is the total revenue recorded in the sale table.
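Assembled from the description:

SELECT SUM(revenue) AS total_revenue
FROM sale;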

The clause "SELECT p.product_name, s.revenue" selects two columns: product_name from the product table (p) and revenue from the sale table (s), so the result shows the product name and revenue for each product sale. "FROM product p" retrieves the data from the product table under the alias p. "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s) on the condition that product_id in the product table matches sale_id in the sale table, linking each product with its corresponding sale information. Finally, "WHERE s.revenue > 299.99" keeps only the rows where revenue in the sale table is greater than 299.99. The output lists the product names and revenue for sales where revenue exceeds 299.99.
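The full statement as described (the join condition is reproduced exactly as quoted):

SELECT p.product_name, s.revenue
FROM product p
JOIN sale s ON p.product_id = s.sale_id
WHERE s.revenue > 299.99;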

The clause "SELECT p.product_name, s.revenue" again selects product_name from the product table (p) and revenue from the sale table (s), showing the product name and revenue for each product sale. "FROM product p" specifies that the data is retrieved from the product table under the alias p, and "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s), linking each product with its corresponding sale information. The condition "WHERE s.revenue BETWEEN 399.99 AND 599.99" restricts the result to rows where the revenue in the sale table (s) falls between 399.99 and 599.99. The output displays the product names and revenues within that range.
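Assembled from the quoted clauses:

SELECT p.product_name, s.revenue
FROM product p
JOIN sale s ON p.product_id = s.sale_id
WHERE s.revenue BETWEEN 399.99 AND 599.99;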

"SELECT customer_name" selects the customer_name column from the Customers table, so the result contains the names of customers. "FROM Customers" specifies that the data is retrieved from the Customers table, and the "WHERE state = 'TN'" clause filters the result to rows where the state column equals 'TN'. The output therefore lists the customers located in the state TN.
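The full statement as described:

SELECT customer_name
FROM Customers
WHERE state = 'TN';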

The query "SELECT p.product_name, COUNT(s.sale_id) AS total_sold" selects product_name from the product table (p) and applies the COUNT() function to the sale_id column of the sale table (s), which calculates the number of sales for each product. "AS total_sold" assigns the alias total_sold to the result of the COUNT() function. "FROM product p" retrieves the data from the product table under the alias p, and "JOIN sale s ON p.product_id = s.sale_id" joins the product table (p) with the sale table (s) on the condition that product_id in the product table matches sale_id in the sale table. The "GROUP BY p.product_name" clause groups the results by product_name, so COUNT() counts the rows belonging to the same product. "ORDER BY total_sold DESC" sorts the result by the total_sold column in descending order, and the "LIMIT 3" clause restricts the output to the top 3 rows.
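Assembled from the quoted clauses:

SELECT p.product_name, COUNT(s.sale_id) AS total_sold
FROM product p
JOIN sale s ON p.product_id = s.sale_id
GROUP BY p.product_name
ORDER BY total_sold DESC
LIMIT 3;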

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city column from the customers table (c) and applies the COUNT() function to the sale_id column of the sale table (s), giving the number of sales for each city. "AS products_sold" assigns the alias products_sold to the result of the COUNT() function, which represents the total number of products sold in each city. "FROM customers c" specifies that the data is retrieved from the customers table under the alias c. "JOIN sale s ON c.customer_id = s.sale_id" joins the customers table (c) with the sale table (s) on the condition that customer_id in the customers table matches sale_id in the sale table, and the "GROUP BY c.city" clause groups the results by the city column. The result is the number of products sold for each city.
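The full statement as described:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
GROUP BY c.city;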

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city from the customers table and counts the sale IDs (sale_id) from the sale table, with "products_sold" as the alias. "FROM customers c" specifies that the data is selected from the "customers" table, aliased as "c". "JOIN sale s ON c.customer_id = s.sale_id" joins the "customers" table with the "sale" table on the condition that the customer_id from the "customers" table matches the sale_id from the "sale" table. "WHERE s.sale_date BETWEEN '2023-01-01' AND '2023-02-01'" filters the joined rows so that only sales whose sale_date falls within that date range are considered. "GROUP BY c.city" groups the result set by the city column from the "customers" table. The output is the count of products sold in each city within the given date range.
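Assembled from the quoted clauses:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
WHERE s.sale_date BETWEEN '2023-01-01' AND '2023-02-01'
GROUP BY c.city;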

"SELECT c.city, COUNT(s.sale_id) AS products_sold" selects the city from the "customers" table (c) and counts the sale IDs (sale_id) from the "sale" table (s); the count is aliased as "products_sold". "FROM customers c" specifies that the data is selected from the "customers" table. "JOIN sale s ON c.customer_id = s.sale_id" joins the "customers" table with the "sale" table on the condition that the customer_id from the "customers" table matches the sale_id from the "sale" table. "WHERE s.quantity > 5" filters the joined data so that only sales transactions where the quantity of products sold is greater than 5 are kept. "GROUP BY c.city" groups the filtered rows by the city column from the "customers" table. The output shows, for each city, the number of products sold where the quantity per sale is greater than 5.
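The full statement as described:

SELECT c.city, COUNT(s.sale_id) AS products_sold
FROM customers c
JOIN sale s ON c.customer_id = s.sale_id
WHERE s.quantity > 5
GROUP BY c.city;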

Task B: MapReduce Programming


class Mapper:
    def map(self, line):
        # Split the line into its four '|'-separated fields
        authors, title, conference, year = line.split('|')
        # Count the number of authors in the comma-separated author list
        num_authors = len(authors.split(','))
        # Emit (year, (num_authors, 1)); the 1 counts this paper
        yield (year, (num_authors, 1))

class Reducer:
    def reduce(self, key, values):
        total_authors = 0
        total_papers = 0
        # Sum up the number of authors and the number of papers
        for authors, count in values:
            total_authors += authors
            total_papers += count
        # Calculate the average number of authors per paper
        avg_authors = total_authors / total_papers
        # Emit (year, average number of authors per paper)
        yield (key, avg_authors)

The above is a MapReduce algorithm to find the average number of authors per paper for each year.

Map Stage-

In the map stage, a key-value pair is emitted for each paper, where the key is the year of the paper and the value is a tuple containing the number of authors for that paper and a 1 used to count the number of papers for that year.

Each line is split on the '|' character to obtain the authors, title, conference and year fields; the number of authors is then counted by splitting the authors field on ',' and taking the length of the result. For example, a hypothetical input line such as "A. Smith,B. Jones|Some Paper Title|ICML|2020" would emit the pair (2020, (2, 1)).

Reduce Stage-

In the reduce stage, the values are aggregated for each year. The average number of authors per paper for a year is computed by summing the number of authors and the count of papers across all values for that year.

The summing is done by iterating over the values and accumulating the number of authors and the count of papers. The average is then calculated by dividing the total number of authors by the total number of papers, and the result is emitted as a key-value pair where the key is the year and the value is the average number of authors.

The algorithm is efficient because:

1. The input data is distributed across the machines in the Hadoop cluster, and each mapper processes a subset of the data in parallel.
2. The shuffle and sort phase groups all values for a given key and sends them to the same reducer.
3. Reducers run in parallel, each processing a subset of the keys.
4. The algorithm performs a single pass over the input data, which keeps it efficient.

A combiner can be used to perform partial aggregation on the mapper outputs before sending them to the reducers. It reduces the amount of data transferred across the network, improving performance.

class Combiner:
    def combine(self, key, values):
        total_authors = 0
        total_papers = 0
        # Sum up the number of authors and papers seen so far
        for authors, count in values:
            total_authors += authors
            total_papers += count
        # Emit the partially aggregated (total_authors, total_papers) pair
        yield (key, (total_authors, total_papers))

This is how a combiner can be implemented.

The combiner receives the same form of input as the reducer. It sums the num_authors values and the paper counts, then emits a new key-value pair with the same year key and a value that is a tuple containing the summed total_authors and total_papers.
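As a quick sanity check, the classes above can be exercised locally without a Hadoop cluster. The following driver is only a minimal sketch: the dictionary grouping stands in for Hadoop's shuffle and sort, and the sample records are hypothetical.

from collections import defaultdict

lines = [
    "A. Smith,B. Jones|Paper One|ICML|2020",
    "C. Wu|Paper Two|KDD|2020",
    "D. Patel,E. Khan,F. Li|Paper Three|ICML|2021",
]

mapper, combiner, reducer = Mapper(), Combiner(), Reducer()

# Map phase: collect (year, (num_authors, 1)) pairs
grouped = defaultdict(list)
for line in lines:
    for year, value in mapper.map(line):
        grouped[year].append(value)

# Combine phase: partial aggregation before the shuffle
combined = defaultdict(list)
for year, values in grouped.items():
    for key, value in combiner.combine(year, values):
        combined[key].append(value)

# Reduce phase: average authors per paper for each year
for year, values in combined.items():
    for key, avg in reducer.reduce(year, values):
        print(key, avg)   # e.g. 2020 1.5 and 2021 3.0 for the sample lines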

Combiners help to increase performance, especially for large datasets or clusters where network bandwidth is a bottleneck.

However, a combiner can introduce overhead and complexity, so its benefits should be weighed against these drawbacks for each use case.

Task C: Big Data Project Analysis


C1:
Building a data lake would be more suitable than a traditional data warehouse.

The reasons are as follows:

1. Flexibility and agility- a data lake allows storing raw, unstructured and semi-structured data without an upfront schema definition or data transformation. It can hold a variety of data types, including online news, social media and internal enterprise data, without restrictions, and therefore supports processing large amounts of data efficiently (Lakshmanan, 2021).
2. Scalability- data lakes are built to scale horizontally to handle large volumes of data. They can seamlessly grow to store evolving data without disruption.
3. Cost-effectiveness- traditional data warehouses require structured data modelling and storage, whereas data lakes can use cost-effective storage such as cloud object storage (Amazon S3, Azure) or a distributed file system (Hadoop). This reduces storage costs for large-scale data.
4. Analytics flexibility- data lakes support a variety of analytics and data processing tools, helping analysts and data scientists explore and work on new or raw data directly (Inmon, 2016).
5. Future-proofing- by storing data in raw form, the bank can adapt its data lake architecture when new data sources emerge or analytical techniques evolve, without requiring data migration.

Approach for implementing the data lake:

1. Data ingestion- data ingestion pipelines are developed to ingest data from sources such as social media APIs, online news feeds and internal enterprise systems. Both stream and batch processing are used to ingest data efficiently (Narkhede, 2017).
2. Data storage- a suitable storage service is chosen, such as Amazon S3, HDFS or Azure. The data lake is designed for the expected volume of data as well as for scalability, fault tolerance and data durability (Shvachko, 2010).
3. Data governance and metadata management- data governance policies and metadata management practices are implemented to guarantee data quality, security and compliance. Tagging and annotation support data discovery, lineage tracking and access control (Hai, 2018).
4. Data processing and analytics- data processing tools such as Apache Spark and Hadoop MapReduce are used on top of the data lake infrastructure to perform batch and stream processing. This enables data scientists to run complex queries directly on the data lake, as sketched after this list (Karau, 2015).
5. Security and access control- security measures such as encryption and access control policies are implemented to protect the data, ensure regulatory compliance and prevent unauthorised use (Hai, 2018).
6. Monitoring and performance optimisation- monitoring and logging solutions are deployed to track data lake performance, data ingestion rates and resource utilisation. The data lake is continuously optimised for performance, reliability and cost efficiency based on observed patterns and metrics (Shvachko, 2010).
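To illustrate point 4, the following is a minimal sketch of how a data scientist might query raw data held in the lake with Apache Spark; the bucket path and field names are hypothetical placeholders, not part of the scenario.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session (cluster configuration omitted for brevity)
spark = SparkSession.builder.appName("social-media-exploration").getOrCreate()

# Read raw JSON records directly from object storage (hypothetical path)
posts = spark.read.json("s3a://bank-data-lake/raw/social_media/2024/*.json")

# Example exploration: count posts per day that mention a product keyword
daily_mentions = (
    posts.filter(F.col("text").contains("mortgage"))
         .groupBy(F.to_date("created_at").alias("day"))
         .count()
)

daily_mentions.show()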

C2:
Strengths of MapReduce:

1. Scalability- it is highly scalable, processing large quantities of data by spreading the work across a cluster of nodes (Lammel, 2008).
2. Fault tolerance- Hadoop MapReduce provides fault tolerance by handling node failures and re-running failed tasks on other nodes.
3. Parallel processing- data is processed in parallel by splitting work into smaller tasks that run on different nodes.
4. Batch processing- data is processed in discrete batches or chunks.

Limitations of MapReduce:

1. Latency- MapReduce is designed for high-throughput batch processing, so low-latency processing is unsatisfactory. The overhead of job scheduling, data shuffling and task execution introduces latency and does not meet real-time performance requirements.
2. Batch processing model- because processing is done in batches, data is collected over a period and processed together. This is unsuitable for scenarios that require an immediate response, such as detecting discussions about financial products on social media in real time (Fadika, 2010).
3. Complexity- developing and managing MapReduce jobs requires expertise in distributed systems and programming, which slows the rapid development and deployment of real-time analytics.

Best approach:

1. Stream processing framework- frameworks such as Apache Kafka can be used to ingest social media data in real time and analyse it as it arrives; a minimal consumer sketch follows this list (Narkhede, 2017).
2. Event-driven architecture- an event-driven architecture is devised in which social media events trigger real-time processing and analysis workflows (Luckham, 2001).
3. Microservices- processing tasks are broken into smaller, independent microservices that scale independently and handle different varieties of data processing (Wolff, 2016).
4. Data pipeline- a real-time data pipeline integrates social media data with other data sources and analytics tools to generate value from the data in near real time (Samimi, 2021).
5. Continuous deployment- continuous delivery practices help to iterate on analytical models and deploy updates to production systems in real time.
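The sketch below shows what real-time ingestion could look like using the kafka-python client; the topic name, broker address and message fields are hypothetical assumptions rather than part of the coursework scenario.

import json
from kafka import KafkaConsumer

# Subscribe to a (hypothetical) topic carrying social media posts
consumer = KafkaConsumer(
    "social-media-posts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Process each post as soon as it arrives instead of waiting for a batch
for message in consumer:
    post = message.value
    if "mortgage" in post.get("text", "").lower():
        # In a real deployment this would feed an analytics or alerting service
        print("Financial product mentioned:", post.get("id"))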

C3:
Designing a hosting strategy requires consideration of scalability, availability and global accessibility.

1. Cloud infrastructure- multiple cloud providers such as AWS, Azure and GCP are considered. Instances are deployed across multiple global regions (for example the US and Europe) to give users low latency and high availability. VPCs are created to isolate the infrastructure and secure the network.
2. Data storage- cloud-based distributed storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage is used to store large volumes of data. Data is replicated across multiple regions for redundancy and disaster recovery, and encryption is applied to secure data in the cloud.
3. Resources- auto-scaling groups are used to scale compute resources based on workload demand. Technologies such as Kubernetes and Docker are used for efficient resource utilisation and deployment.
4. Data processing framework- managed cloud services are used to run distributed data processing frameworks such as Apache Spark.
5. Availability- global load balancers distribute incoming traffic across regions, and failover is configured between regions to redirect traffic seamlessly in case of disruption.
6. Networking- CDN services are used to cache and deliver static assets to users globally with low latency, and private networks are established between regions to secure data transfer.
7. Monitoring and management- centralised monitoring and logging solutions are implemented to track system performance, resource utilisation and security incidents. Automated alerts and notifications are raised in response to performance degradation or security breaches.

References:
1. Lakshmanan, G. T., Malik, P., & Mordohai, P. (2021). Data lakes and data lakes. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. https://www.wiley.com/enus/Wiley+Interdisciplinary+Reviews%3A+Data+Mining+and+Knowledge+Discovery-p-9780JRNL72827
2. Inmon, W. H., & Krishnan, K. (2016). Building the Unstructured Data Warehouse. Technics Publications. https://www.abebooks.co.uk/9781935504047/Building-Unstructured-Data-Warehouse-Architecture-1935504045/plp
3. Narkhede, N., Shapira, G., & Palino, T. (2017). Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. O'Reilly Media, Inc. https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/
4. Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1-10). IEEE. https://ieeexplore.ieee.org/document/5496972
5. Hai, R., Quix, C., & Geisler, S. (2018). Data lake architecture: A blueprint for data lake management. Journal of Data and Information Quality (JDIQ), 10(3), 1-18. https://www.researchgate.net/publication/350656318_The_Data_Lake_Architecture_Framework_A_Foundation_for_Building_a_Comprehensive_Data_Lake_Architecture
6. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Data Analysis. O'Reilly Media, Inc. https://www.oreilly.com/library/view/learning-spark/9781449359034/
7. Lämmel, R. (2008). Google's MapReduce programming model—Revisited. Science of Computer Programming, 70(1), 1-30. https://www.sciencedirect.com/science/article/pii/S0167642307001281
8. Fadika, Z., & Govindaraju, M. (2010). LBVIZ: A batch Apache Hadoop job analyzer. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW) (pp. 1-8). IEEE. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=63646dbd7dc4a323b1186f2772497d393d1de042
9. Etzion, O., & Niblett, P. (2011). Event Processing in Action. Manning Publications Co. https://www.manning.com/books/event-processing-in-action
10. Wolff, E. (2016). Microservices: Flexible Software Architecture. Addison-Wesley Professional. https://www.oreilly.com/library/view/microservices-flexible-software/9780134650449/
11. Chen, L. (2015). Continuous delivery: Huge benefits, but challenges too. IEEE Software, 32(2), 50-54. https://www.researchgate.net/publication/271635510_Continuous_Delivery_Huge_Benefits_but_Challenges_Too

