2.2. Components of Hadoop - Analysing

What is the Hadoop Ecosystem?

● The Hadoop ecosystem is a framework that helps solve Big Data problems.
● It comprises different components and services (for ingesting, storing, analyzing, and
maintaining data) built around Hadoop.
● Most of the services in the Hadoop ecosystem supplement the four core
components of Hadoop: HDFS, YARN, MapReduce and Hadoop Common.

Components of Hadoop Ecosystem


The Hadoop Ecosystem consists of many components that make Hadoop so powerful. Some of the
Hadoop ecosystem components are:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. YARN
4. Hive
5. Pig
6. HBase
7. HCatalog
8. Avro
9. Thrift
10. Apache Drill
11. Apache Mahout
12. Apache Sqoop
13. Apache Flume
14. Ambari
15. Zookeeper
16. Oozie
17. Spark

1. HDFS (Hadoop Distributed File System) - Distributed storage in replicated blocks
● Hadoop Distributed File System runs on top of the existing file systems on
each node in a Hadoop cluster.
● Hadoop Distributed File System is a block-structured file system where each
file is divided into blocks of a predetermined size.
● Data in a Hadoop cluster is broken down into smaller units (called blocks)
and distributed throughout the cluster. Each block is duplicated twice (for a
total of three copies), with the two replicas stored on two nodes in a rack
somewhere else in the cluster.
● Since the data has a default replication factor of three, it is highly available
and fault-tolerant.
● If a copy is lost, HDFS will automatically re-replicate it elsewhere in the
cluster, ensuring that the threefold replication factor is maintained (a small
client-side sketch follows this list).
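
To make this concrete, here is a minimal Java sketch, assuming the standard Hadoop client libraries and configuration files (core-site.xml/hdfs-site.xml) are available; it copies a local file into HDFS through the FileSystem API and prints the replication factor and block size that HDFS applied. The file paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS, replication factor, block size, etc.
    // from the cluster configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; it is split into blocks and replicated.
    Path src = new Path("/tmp/NYSE_daily_prices_Q.csv");   // hypothetical local path
    Path dst = new Path("/user/cloudera/nyse/input.csv");  // hypothetical HDFS path
    fs.copyFromLocalFile(src, dst);

    // Inspect the replication factor and block size HDFS applied to the file.
    FileStatus status = fs.getFileStatus(dst);
    System.out.println("Replication: " + status.getReplication());
    System.out.println("Block size : " + status.getBlockSize());

    fs.close();
  }
}

The same operations are available from the command line (for example hdfs dfs -put), but the API form shows that HDFS presents an ordinary file-system interface while handling block placement and replication transparently.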

2. MapReduce - Data processing using programming languages


MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal
is to split a dataset into chunks and use an algorithm to process those chunks at the same
time. The parallel processing on multiple machines greatly increases the speed of
handling even petabytes of data.
➢ Hadoop MapReduce: This is for parallel processing of large data sets.
● The MapReduce framework consists of a single master node (JobTracker) and n slave
nodes (TaskTrackers), where n can run into the thousands. The master manages,
maintains and monitors the slaves, while the slaves are the actual worker nodes.
● A client submits a job to Hadoop. The job consists of a mapper, a reducer and a list of
inputs. The job is sent to the JobTracker process on the master node, and each slave
node runs a TaskTracker process.
● The master is responsible for resource management, tracking resource
consumption/availability, scheduling the job's component tasks on the slaves,
monitoring them and re-executing failed tasks.
● The slave TaskTrackers execute the tasks as directed by the master and periodically
report task status back to the master.
● The master stores the metadata (data about data), while the slaves are the nodes that
store the data. The client connects to the master node to perform any task (a minimal
word-count example of this programming model follows this list).
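
To make the map and reduce roles concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class names and the input/output paths passed on the command line are illustrative assumptions, not part of the stock case study.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper and reducer run on the worker nodes, while the job submission in main() goes through the master; this split is what lets the same code scale from a single machine to thousands of nodes.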
➢ Hadoop YARN:
● YARN (Yet Another Resource Negotiator) is the framework responsible for
assigning computational resources for application execution and cluster
management. YARN consists of three core components:
○ ResourceManager (one per cluster)
○ ApplicationMaster (one per application)
○ NodeManagers (one per node)

3. Spark - Retain data in memory for faster processing


Spark is a Hadoop enhancement to MapReduce. The primary difference between Spark
and MapReduce is that Spark processes and retains data in memory for subsequent steps,
whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's
data processing speeds are up to 100x faster than MapReduce.
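
As a brief illustration, here is a minimal sketch using the Spark Java API with a hypothetical HDFS path; the RDD is cached after the first read, so the second action reuses the in-memory partitions instead of re-reading the file from disk as a MapReduce-style job would.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
  public static void main(String[] args) {
    // The master URL (e.g. local[*] or yarn) is normally supplied by spark-submit.
    SparkConf conf = new SparkConf().setAppName("cache-example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load a (hypothetical) HDFS file and keep it in memory after the first pass.
    JavaRDD<String> lines = sc.textFile("hdfs:///user/cloudera/nyse/input.csv");
    lines.cache();

    // Both actions below reuse the cached partitions rather than re-reading from disk.
    long total = lines.count();
    long rows2008 = lines.filter(line -> line.contains("2008")).count();

    System.out.println("total=" + total + ", rows mentioning 2008=" + rows2008);
    sc.stop();
  }
}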
4. YARN - Data processing in HDFS
YARN allows the data stored in HDFS (Hadoop Distributed File System) to be processed
by various data processing engines for batch processing, stream processing, interactive
processing, graph processing and more. This increases the overall efficiency of the
system.
5. Hive - Work on petabytes of data
Hive allows users to read, write, and manage petabytes of data using SQL. Hive is
built on top of Apache Hadoop, which is an open-source framework used to efficiently
store and process large datasets. Hive is designed to work quickly on petabytes of data.
6. Apache Pig - Analyze large data sets in parallel
Apache Pig is a platform that is used to analyze large data sets. It consists of a
high-level language to express data analysis programs, along with the infrastructure to
evaluate these programs. One of the most significant features of Pig is that the structure of
its programs is amenable to substantial parallelization.
7. Apache HBase - Random Access to Big Data
Apache HBase is used to have random, real-time read/write access to Big Data. It hosts
very large tables on top of clusters of commodity hardware. Apache HBase is a
non-relational database model. It works on top of Hadoop and HDFS.
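
For illustration, here is a minimal sketch of random read/write access using the HBase Java client API; the 'stocks' table, its 'd' column family and the row-key layout are hypothetical and would need to be created beforehand (for example from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for the ZooKeeper quorum, etc.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         // 'stocks' table with column family 'd' is a hypothetical, pre-created table.
         Table table = connection.getTable(TableName.valueOf("stocks"))) {

      // Random write: one row keyed by symbol + date.
      Put put = new Put(Bytes.toBytes("QTM-2008-01-02"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("high"), Bytes.toBytes("2.37"));
      table.put(put);

      // Random read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("QTM-2008-01-02")));
      String high = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("high")));
      System.out.println("high = " + high);
    }
  }
}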
8. HCatalog - Read and write any format using a SerDe
HCatalog is the table and storage management layer for Hadoop. It supports reading and
writing files in any format for which a SerDe (serializer-deserializer) can be written.
9. Avro - Data serialization and remote procedure call framework
Apache Avro is a data serialization and remote procedure call framework which is
developed within the Apache Hadoop project where it provides both a serialization
format to get persistent data and a wire format for providing communication between
Hadoop nodes.
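
As a small illustration of the serialization side, the sketch below uses the Avro Java API with a hypothetical 'Quote' record schema: one record is written to an Avro container file and read back, and the schema itself travels with the file.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  // A hypothetical schema for a single stock quote.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Quote\",\"fields\":["
      + "{\"name\":\"symbol\",\"type\":\"string\"},"
      + "{\"name\":\"high\",\"type\":\"double\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    // Serialize one record into an Avro container file.
    GenericRecord quote = new GenericData.Record(schema);
    quote.put("symbol", "QTM");
    quote.put("high", 2.37);

    File file = new File("quotes.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(quote);
    }

    // Deserialize it back; the schema is embedded in the file.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("symbol") + " high=" + rec.get("high"));
      }
    }
  }
}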
10. Thrift - Interfaces across several programming languages
Thrift uses a special Interface Description Language (IDL) to define data types and
service interfaces which are stored as “.thrift” files and used later as input by the compiler
for generating the source code of client and server software that communicate over
different programming languages.
11. Apache Drill - Act on HDFS data without metadata definitions, using:
● Low-latency SQL queries.
● Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and
HBase tables, without requiring metadata definitions in the Hive metastore.
● ANSI SQL.
● Nested data support.
12. Apache Mahout - Apply machine learning algorithms on HDFS
Apache Mahout is an open source project for creating scalable machine learning
algorithms. Mahout operates on top of Hadoop, which allows you to apply machine
learning, via a selection of Mahout algorithms, to data distributed across a Hadoop
cluster.
13. Sqoop - Transfer data from RDBMS to HDFS
Sqoop is used to transfer data from an RDBMS (relational database management system)
such as MySQL or Oracle into HDFS. The data can then be transformed in Hadoop
MapReduce and exported back into an RDBMS.
14. Apache Flume - Transfer Data from web services to HDFS
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating and
transporting large amounts of streaming data, such as log files and events, from various
web services to a centralized data store.
15. Ambari - Monitor and track clusters
Ambari provides a dashboard for monitoring health and status of the Hadoop
cluster. Ambari leverages Ambari Metrics System for metrics collection. Ambari
leverages Ambari Alert Framework for system alerting and will notify you when your
attention is needed (e.g., a node goes down, remaining disk space is low, etc). It is
responsible for keeping track of the running applications.
16. ZooKeeper - Configuration and coordination services for clusters
ZooKeeper is an open source Apache project that provides a centralized service for
providing configuration information, naming, synchronization and group services over
large clusters in distributed systems. The goal is to make these systems easier to manage
with improved, more reliable propagation of changes.
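
Here is a minimal Java sketch, assuming a hypothetical three-node ZooKeeper ensemble, in which one process publishes a configuration value as a znode that any other process in the cluster can then read.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to a (hypothetical) ZooKeeper ensemble with a 3-second session timeout.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Publish a shared configuration value as a znode (hypothetical path at the root).
    String path = "/demo-replication-factor";
    if (zk.exists(path, false) == null) {
      zk.create(path, "3".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any other process in the cluster can read the same value.
    System.out.println("replication = " + new String(zk.getData(path, false, null)));
    zk.close();
  }
}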
17. Apache Oozie - Combine jobs into one sequential logical unit of work
Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated
with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs
for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
Fig: Hadoop Ecosystem and Its Components

Analyzing the Data with Hadoop

What is Big Data Analysis?

Big data is mostly generated from social media websites, sensors, devices, video/audio,
networks, log files and web, and much of it is generated in real time and on a very large scale.
Big data analytics is the process of examining this large amount of different data types, or big
data, in an effort to uncover hidden patterns, unknown correlations and other useful information.
Advantages of Big Data Analysis

Big data analysis allows market analysts, researchers and business users to develop deep insights
from the available data, resulting in numerous business advantages. Business users are able to
make a precise analysis of the data, and the key early indicators from this analysis can mean
fortunes for the business. Some exemplary use cases are as follows:

● Whenever users browse travel portals or shopping sites, search for flights or hotels,
or add a particular item to their cart, ad-targeting companies can analyze this wide
variety of data and activity and provide better recommendations to the user
regarding offers, discounts and deals based on the user's browsing and product
history.
● In the telecommunications space, if customers are moving from one service provider
to another, then by analyzing huge volumes of call data records the various issues
faced by the customers can be unearthed. Issues could be as wide-ranging as a
significant increase in call drops or network congestion problems. By analyzing
these issues, it can be determined whether a telecom company needs to place a new
tower in a particular urban area, or whether it needs to revise its marketing strategy
for a particular region because a new player has come up there. In that way customer
churn can be proactively minimized.

Case Study – Stock market data

Now let's look at a case study for analyzing stock market data. We will evaluate various big data
technologies for analyzing stock market data from a sample 'New York Stock Exchange'
dataset, calculate the covariance for this stock data, and aim to solve both the storage and
processing problems related to a huge volume of data.
Covariance is a financial term that represents the degree or amount that two stocks or financial
instruments move together or apart from each other. With covariance, investors have the
opportunity to seek out different investment options based upon their respective risk profile. It is
a statistical measure of how one investment moves in relation to the other.
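
Expressed as a formula (a standard statistical definition, not specific to this dataset), the population covariance of the daily high prices x_i and y_i of two stocks over n trading days is

\[
\operatorname{Cov}(X, Y) \;=\; \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \;=\; \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y]
\]

where \(\bar{x}\) and \(\bar{y}\) are the average prices over the period. The second form, the mean of the products minus the product of the means, is exactly the AVG-based expression used in the Hive query later in this section.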

A positive covariance means that asset returns moved together. If investment instruments or
stocks tend to be up or down during the same time periods, they have positive covariance.

A negative covariance means returns move inversely. If one investment instrument tends to be up
while the other is down, they have negative covariance.

This will help a stock broker in recommending the stocks to his customers.

Dataset: The sample dataset provided is a comma-separated file (CSV) named
'NYSE_daily_prices_Q.csv' that contains stock information from the New York Stock Exchange,
such as daily quotes, stock opening price, stock highest price, etc.

The dataset provided is just a small sample of around 3500 records, but in a real
production environment there could be huge stock data running into GBs or TBs, so our solution
must also scale to a real production environment.

Hadoop Data Analysis Technologies

Let’s have a look at the existing open source Hadoop data analysis technologies to analyze the
huge stock data being generated very frequently.
Which Data Analysis Technologies should be used?

Based on the available sample dataset, it has the following properties:

● The data is in a structured format
● It would require joins to calculate the stock covariance
● It can be organized into a schema
● In a real environment, the data size would be very large

Based on these criteria and comparing with the above analysis of features of these technologies,
we can conclude:

● If we use MapReduce, then complex business logic needs to be written to handle the
joins. We would have to think from a map and reduce perspective: which particular
code snippet goes into the map side and which goes into the reduce side. A lot of
development effort goes into deciding how the map- and reduce-side joins will take
place. We would not be able to map the data into a schema format, and everything
would have to be handled programmatically.
● If we use Pig, then we would not be able to partition the data, which could otherwise
be used to process a sample subset of the data by a particular stock symbol, date or
month. In addition, Pig is more like a scripting language, better suited for
prototyping and rapidly developing MapReduce-based jobs. It also doesn't provide
the facility to map our data into an explicit schema format, which seems more
suitable for this case study.
● Hive not only provides a familiar programming model for people who know SQL, it
also eliminates lots of boilerplate and sometimes tricky coding that we would have to
do in MapReduce programming. If we apply Hive to analyze the stock data, then we
would be able to leverage the SQL capabilities of Hive-QL, and the data can be
managed in a particular schema. It will also reduce development time, and joins
between stock data can be managed using Hive-QL, which is of course pretty
difficult in MapReduce. Hive also has its Thrift server, through which we can submit
Hive queries from anywhere to the Hive server, which in turn executes them. Hive
SQL queries are converted into MapReduce jobs by the Hive compiler, freeing
programmers from complex programming and letting them focus on the business
problem.

So based on the above discussion, Hive seems the perfect choice for the aforementioned case
study.
Problem Solution with Hive

Apache Hive is a data warehousing package built on top of Hadoop for providing data
summarization, query and analysis. The query language being used by Hive is called Hive-QL
and is very similar to SQL.

Since we are now done zeroing in on the data analysis technology, it's time to get your
feet wet deriving a solution for the mentioned case study.

● Hive Configuration on Cloudera

Follow the steps mentioned in my previous blog How to Configure Hive On Cloudera:

● Create Hive Table

Use the 'create table' Hive command to create the Hive table for the provided CSV dataset:

hive> create table NYSE (`exchange` String, stock_symbol String, stock_date String,
stock_price_open double, stock_price_high double, stock_price_low double,
stock_price_close double, stock_volume double, stock_price_adj_close double)
row format delimited fields terminated by ',';

This will create a Hive table named 'NYSE' in which rows are delimited and row fields are
terminated by commas. The schema will be created in the embedded Derby database, as
configured in the Hive setup. By default, Hive stores metadata in an embedded Apache Derby
database, but it can be configured to use other databases such as MySQL, SQL Server or Oracle.
● Load CSV Data into Hive Table

Use the following Hive command to load the CSV data file into the Hive table:

hive> load data local inpath '/home/cloudera/NYSE_daily_prices_Q.csv' into table NYSE;

This will load the dataset from the mentioned location into the Hive table 'NYSE' created above.
The dataset will be stored in the Hive-controlled file system namespace on HDFS, so that it can
be further batch-processed by MapReduce jobs or Hive queries.

● Calculate the Covariance

We can calculate the covariance for the provided stock dataset for a given year using the
following Hive select query:

select a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE),
(AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) -
(AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH)))
from NYSE a join NYSE b on a.STOCK_DATE=b.STOCK_DATE
where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008
group by a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE);


This Hive select query will trigger a MapReduce job on the cluster.

In the resulting output, the covariance is calculated between each pair of different stocks, for
each month of the available year.

From the covariance results, stock brokers or fund managers can provide the following
recommendations:

● Stocks QRR and QTM show more positive than negative covariance, so there is a
high probability that these stocks will move together in the same direction.
● Stocks QRR and QXM mostly show negative covariance, so there is a greater
probability of their prices moving in inverse directions.
● Stocks QTM and QXM show positive covariance for most months, so they tend to
move in the same direction most of the time.
Similarly, we can analyze more big data use cases, explore all possible solutions for each use
case, and then narrow down the best solution using a comparison chart.
