0% found this document useful (0 votes)
151 views5 pages

Aggregated Function in HIVE

The document describes various aggregate functions in Hive such as SUM, COUNT, AVG, MIN, MAX, VARIANCE, STDDEV, COVARIANCE, CORRELATION, PERCENTILE, HISTOGRAM, COLLECTIONS, and NTILE. Examples are provided showing how to use each function to aggregate and analyze data in the Tri100 table.

Uploaded by

siv535
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
151 views5 pages

Aggregated Function in HIVE

The document describes various aggregate functions in Hive such as SUM, COUNT, AVG, MIN, MAX, VARIANCE, STDDEV, COVARIANCE, CORRELATION, PERCENTILE, HISTOGRAM, COLLECTIONS, and NTILE. Examples are provided showing how to use each function to aggregate and analyze data in the Tri100 table.

Uploaded by

siv535
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 5

Creating Table in HIVE :

hive> create external table Tri100(id int,name string,location varchar(30),sal int,Hike int)
>
>
>
>

row format delimited


fields terminated by ','
lines terminated by '\n'
stored as textfile location '/Jayanth/Jay/';

OK
Time taken: 0.688 seconds

Aggregated Functions and Normal Queries:

hive> select * from Tri100;


OK
1
rahul Hyderabad
30000
40000
2
Mohit Banglore
22000
25000
3
Rohan Banglore
33000
40000
4
Ajay Bangladesh
40000
45000
5
srujay Srilanka
25000
30000
Time taken: 0.184 seconds, Fetched: 5 row(s)

SUM

Returns the sum of the elements in the group or sum of the distinct values of the column in the group.

hive> select sum(sal) from Tri100;


OK
150000
Time taken: 17.909 seconds, Fetched: 1 row(s)

hive> select Sum(sal) from Tri100 where loccation='Banglore';


OK
55000
Time taken: 18.324 seconds, Fetched: 1 row(s)

Count

count(*) - Returns the total number of retrieved rows, including rows containing NULL values;
count(expr) Returns the number of rows for which the supplied expression is non-NULL;
count(DISTINCT expr[, expr]) Returns the number of rows for which the supplied expression(s) are
unique and non- NULL;

hive> select count(*) from Tri100;


OK
5
Time taken: 16.307 seconds, Fetched: 1 row(s)

hive> select count(distinct location) from Tri100;

OK
4
Time taken: 17.37 seconds, Fetched: 1 row(s)

hive> select count(*) from Tri100 where sal>30000;


OK
2
Time taken: 18.36 seconds, Fetched: 1 row(s)

hive> select count(location) from Tri100;


OK
5
Time taken: 17.338 seconds, Fetched: 1 row(s)

Average

Returns the average of the elements in the group or the average of the distinct values of the column in
the group.

hive> select avg(sal) from Tri100 where location='Banglore';


OK
27500.0
Time taken: 17.276 seconds, Fetched: 1 row(s)

hive> select avg(distinct sal) from Tri100;


OK
30000.0
Time taken: 17.276 seconds, Fetched: 1 row(s)

Minimum

Returns the minimum of the column in the group.

hive> select min(sal) from Tri100;


OK
22000
Time taken: 17.368 seconds, Fetched: 1 row(s)

Maximum

Returns the maximum of the column in the group.

hive> select max(sal) from Tri100;


OK
40000
Time taken: 17.267 seconds, Fetched: 1 row(s)

Variance

Returns the variance of a numeric column in the group.

hive> select variance(sal) from Tri100;


OK
3.96E7
Time taken: 17.223 seconds, Fetched: 1 row(s)

hive> select var_pop(sal) from Tri100;


OK
3.96E7
Time taken: 17.195 seconds, Fetched: 1 row(s)
Returns the unbiased sample variance of a numeric column in the group.

hive> select var_samp(sal) from Tri100;


OK
4.95E7
Time taken: 17.245 seconds, Fetched: 1 row(s)

Standard Deviation

Returns the Standard Deviation of a numeric column in the group.

hive> select stddev_pop(sal) from Tri100;


OK
6292.8530890209095
Time taken: 18.63 seconds, Fetched: 1 row(s)
Returns the unbiased sample Standard Deviation of a numeric column in the group.

hive> select stddev_samp(sal) from Tri100;


OK
7035.623639735144
Time taken: 17.299 seconds, Fetched: 1 row(s)

Covariance

Returns the population covariance of a pair of numeric columns in the group.

hive> select covar_pop(sal,Hike) from Tri100;


OK
4.4E7
Time taken: 18.888 seconds, Fetched: 1 row(s)
Returns the sample covariance of a pair of numeric columns in the group.

hive> select covar_samp(sal,Hike) from Tri100;


OK
5.5E7
Time taken: 18.302 seconds, Fetched: 1 row(s)

Correlation

Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.

hive> select corr(sal,Hike) from Tri100;


OK
0.9514987095307504
Time taken: 17.514 seconds, Fetched: 1 row(s)

Percentile

Returns the exact pth percentile of a column in the group(does not work with floating point types).
P must be between 0 and 1. NOTE: A true percentile Percentile(BIGINT col,P)can only be computed for
INTEGER values. Use PERCENTILE_APPROX if you are input is non-integral.

hive> select percentile(sal,0) from Tri100;------------------------Output Gives Lower Value of table as P is


0.It takes lower value as 0%.
OK
22000.0
Time taken: 17.321 seconds, Fetched: 1 row(s)

hive> select percentile(sal,1) from Tri100; -----------------------Output Gives Higher Value of table as P is
1.It takes Higher value as 100%.
OK
40000.0
Time taken: 17.966 seconds, Fetched: 1 row(s)

hive> select percentile(sal,0.5) from Tri100;


OK
30000.0
Time taken: 17.368 seconds, Fetched: 1 row(s)

Histogram

Computes a histogram of a numeric column in the group using

b non-uniformly spaced bins.

The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and
heights. histogram_numeric(col, b)

hive> select histogram_numeric(sal,5) from Tri100;


OK
[{"x":22000.0,"y":1.0},{"x":25000.0,"y":1.0},{"x":30000.0,"y":1.0},{"x":33000.0,"y":1.0},
{"x":40000.0,"y":1.0}]
Time taken: 17.534 seconds, Fetched: 1 row(s)

Collections

Returns a set of objects with duplicate elements eliminated.

hive> select collect_set(Hike) from Tri100;

OK
[45000,40000,25000,30000]
Time taken: 18.29 seconds, Fetched: 1 row(s)
Returns a set of objects with duplicates(as of Hive 0.13.0)

hive> select collect_list(Hike) from Tri100;


OK
[40000,25000,40000,45000,30000]
Time taken: 17.217 seconds, Fetched: 1 row(s)

NTILE

This function divides an ordered partition into x groups called buckets and assigns a bucket number to
each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and
other common summary statistics. (As of Hive 0.11.0.).

hive> select name,Hike,NTILE(3) over (order by sal DESC) from Tri100;


OK
Ajay
45000 1
Rohan 40000 1
Rahul 40000 2
Srujay 30000 2
Mohit 25000 3
Time taken: 17.217 seconds, Fetched: 1 row(s)

You might also like