Impala Query Tuning
If you work with Impala but have no idea how to interpret an Impala query
PROFILE, it is very hard to understand what is going on and how to make
your query run at its full potential. I think this is the case for a lot of Impala users,
so I would like to write a simple blog series to share my experience and hopefully
help anyone who would like to learn more.
This is Part 1 of the series, so I will start with the basics and cover the main
things to look out for when examining a PROFILE.
So first things first: how do you collect an Impala query PROFILE? There are a
couple of ways. The simplest is to run "PROFILE" right after your query in
impala-shell, like below:
1. [impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
2. Query: SELECT COUNT(*) FROM sample_07
3. Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-
host.com:25000)
4. Query progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?
query_id=36433472787e1cab:29c30e7800000000
5. +----------+
6. | count(*) |
7. +----------+
8. | 823 |
9. +----------+
10. Fetched 1 row(s) in 6.68s
11.
12. [impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
13. Query Runtime Profile:
14. Query (id=36433472787e1cab:29c30e7800000000):
15. Summary:
16. Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
17. Session Type: BEESWAX
18. .....
You can also collect the PROFILE from the Cloudera Manager Web UI by navigating to CM >
Impala > Queries, locating the query you just ran and clicking on "Query Details".
Alternatively, it is available directly from the Impala Daemon debug Web UI at:
https://{impala-daemon-url}:25000/queries
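If you want to capture the PROFILE to a file for later analysis, impala-shell can also do this non-interactively. Below is a minimal sketch (the host name is the example one from above; -p / --show_profiles prints the runtime profile after each query):
# Run the query, print its runtime profile (-p), and save everything to a file
impala-shell -i impala-daemon-host.com:21000 -p \
    -q "SELECT COUNT(*) FROM sample_07" > sample_07_profile.txt 2>&1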
All right, now that we have the PROFILE, let's dive into the details. Below is the
snippet of the query PROFILE we will go through today, which is the
Summary section at the top of the PROFILE:
1. Query (id=36433472787e1cab:29c30e7800000000):
2. Summary:
3. Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
4. Session Type: BEESWAX
5. Start Time: 2018-09-14 15:57:35.883111000
6. End Time: 2018-09-14 15:57:42.565042000
7. Query Type: QUERY
8. Query State: FINISHED
9. Query Status: OK
10. Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build
50eddf4550faa6200f51e98413de785bf1bf0de1)
11. User: [email protected]
12. Connected User: [email protected]
13. Delegated User:
14. Network Address: ::ffff:xxx.xx.xxx.xx:58834
15. Default Db: default
16. Sql Statement: SELECT COUNT(*) FROM sample_07
17. Coordinator: impala-daemon-url.com:22000
18. Query Options (set by configuration):
19. Query Options (set by configuration and planner): MT_DOP=0
20. Plan:
21. ----------------
Let's break it into sections and walk through them one by one. There are a few
pieces of information here that are used more often than others:
a. Query ID:
1. Query (id=36433472787e1cab:29c30e7800000000):
This is useful for identifying relevant query-related information in the Impala Daemon
logs. Simply search for this query ID and you can find out what the query was doing
behind the scenes; this is especially useful for finding related error messages.
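For example, to pull everything the Impala Daemon logged for this query, you can grep its logs for the ID (a sketch; the log location varies by installation, /var/log/impalad is just a common default on CDH):
# Search the coordinator daemon's logs for this query ID
grep '36433472787e1cab:29c30e7800000000' /var/log/impalad/impalad.INFO*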
b. Session Type:
1. Session Type: BEESWAX
This tells us where the connection came from. BEESWAX means that the query was run
from the impala-shell client. If you run it from Hue, the type will be "HIVESERVER2",
since Hue connects via the HiveServer2 Thrift interface.
c. Query Status:
1. Query Status: OK
This tells us whether the query finished successfully or not. OK means good. If there
were errors, they will normally show here, for example cancelled by user, session
timeout, exceptions, etc.
d. Impala Version:
1. Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build
50eddf4550faa6200f51e98413de785bf1bf0de1)
This confirms the version that was used to run the query. If it does not match your
installation, then something is not set up properly.
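A quick sanity check, assuming shell access to one of the Impala hosts and the sample_07_profile.txt file saved earlier, is to compare the version recorded in the PROFILE with what is actually installed:
# Version recorded in the profile vs. versions of the local binaries
grep 'Impala Version:' sample_07_profile.txt
impala-shell --version
impalad --version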
e. User information:
1. User: [email protected]
2. Connected User: [email protected]
3. Delegated User:
You can find out who ran the query from this session, so you know who to
blame :).
This concludes Part 1 of the series, which explained the Summary section of the
query PROFILE and the basic information it contains. In the next part of the series, I will
explain the Query Plan as well as the Execution Summary of the PROFILE in detail.
Part 2:
The next line is very important, as Impala tells us here whether it has detected that
the tables involved in the query have up-to-date statistics or not. This is crucial
because Impala uses table/column statistics to do resource estimation as well as to
build the query plan and determine the best strategy to run the query. If the stats
are not up to date, Impala will end up with a bad query plan, which will affect the
overall query performance.
In my example, we can see that the table default.sample_07's stats are missing.
Impala produced a warning so that users are informed about this, and
COMPUTE STATS should be performed on the table to fix it.
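For reference, this is how it could be fixed for the example table, using standard Impala statements run through impala-shell (the host name is the example one from Part 1; you can equally run the statements one at a time inside the shell):
# Compute table and column statistics, then verify that row counts are now populated
impala-shell -i impala-daemon-host.com:21000 \
    -q "COMPUTE STATS default.sample_07; SHOW TABLE STATS default.sample_07; SHOW COLUMN STATS default.sample_07"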
This can get very complicated if your query is complex, but let's start with this
simple query to understand the basics. One thing to remember is that you need to
read the Query Plan backwards, which allows you to follow what Impala
planned to do.
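By the way, you do not have to run the query to see this plan: EXPLAIN returns it up front, and raising EXPLAIN_LEVEL to 2 (extended) includes the memory estimates and cardinality details discussed below. A minimal sketch, again using the example host name (the two statements can also be run interactively):
# Show the extended query plan without executing the query
impala-shell -i impala-daemon-host.com:21000 \
    -q "SET EXPLAIN_LEVEL=2; EXPLAIN SELECT COUNT(*) FROM sample_07"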
a. HDFS Scan: from the 00:SCAN HDFS operator at the bottom of the plan, we can tell that:
- there was only one partition in the table, and Impala also read one partition. This
does not necessarily mean that the table is partitioned; if the table is not
partitioned, it will simply be shown as 1/1
- there was only one file under the table/partition (files=1)
- the total size read by Impala was 44.98KB
- there were no stats available for this table (stats-rows=unavailable, table
stats: rows=unavailable and cardinality=unavailable)
- Impala estimated 32MB of memory to run the query, and no memory was reserved
b. After the HDFS scan was complete, Impala needed to do an aggregation, since we ran
COUNT(*):
1. 01:AGGREGATE
2. | output: count(*)
3. | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
4. | tuple-ids=1 row-size=8B cardinality=1
There isn't much to explain here; just know that this operator performs the
aggregation step.
c. Fragment information:
1. F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
2. Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
This bit of information just above the 00:SCAN HDFS and 01:AGGREGATE
operators tells us that both the Scan and the Aggregation operator belong to Fragment
F00, which ran on 1 host with 1 instance. The Fragment ID F00 can be used to
find the actual fragment statistics later in the PROFILE, which tell us in more
detail how this fragment ran at runtime. I will also
cover this in a later part of the series.
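If you have the PROFILE saved to a file, jumping to those fragment statistics is a one-liner; for example, against the sample_07_profile.txt file saved in Part 1 (-n prints line numbers so you know where to open the file):
# Locate the "Averaged Fragment F00" and the per-instance "Fragment F00" sections
grep -n 'Fragment F00' sample_07_profile.txt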
d. Exchange Operation:
1. 02:EXCHANGE [UNPARTITIONED]
2. | mem-estimate=0B mem-reservation=0B
3. | tuple-ids=1 row-size=8B cardinality=1
So after the aggregation was done on each worker node, the results needed to be
exchanged from each worker node to the coordinator; that is what happened
here. After that, the coordinator needed to do the final aggregation/merge on
those results:
1. 03:AGGREGATE [FINALIZE]
2. | output: count:merge(*)
3. | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
4. | tuple-ids=1 row-size=8B cardinality=1
Both of the above operations belonged to the same Fragment F01, which
again can be used to reference the rest of the PROFILE data to find more detailed
stats about the query:
1. F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Now, let’s have a look at the Summary Section of the Profile:
1. Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
2. -----------------------------------------------------------------------------------------------------------
3. 03:AGGREGATE 1 999.992us 999.992us 1 1 20.00 KB 10.00 MB FINALIZE
4. 02:EXCHANGE 1 831.992ms 831.992ms 1 1 0 0 UNPARTITIONED
5. 01:AGGREGATE 1 0.000ns 0.000ns 1 1 16.00 KB 10.00 MB
6. 00:SCAN HDFS 1 709.995ms 709.995ms 823 -1 80.00 KB 32.00 MB default.sample_07
Here you can find the following information, which can be useful:
- It shows the average time and maximum time each operation took. If there is a
big difference between the two, you know there was imbalance/skew when running
the job on the worker nodes; in theory they should be processing similar amounts
of data, so we would expect all of the nodes to finish in a similar time range.
- If the values for "#Rows" and "Est. #Rows" are way off, as in my case, -1 for
"Est. #Rows" for the SCAN HDFS operation versus 823 for "#Rows" (the actual number
of rows returned by the query), we know that Impala has out-of-date
information about the table statistics. In my case we did not have table stats,
so Impala reported an estimated value of -1. If the estimated value is positive
but still differs from the actual number of rows returned, then we know we need to
run "COMPUTE STATS" against the table to update the statistics.
- The "#Hosts" column tells us how many worker nodes participated in that
particular operation. In my case the data was small, so only 1 host ran the query.
- "Peak Mem" and "Est. Peak Mem" are self-explanatory: they are the
actual memory used versus the estimated memory that Impala calculated based
on table stats.
If there are joins in the query, this section will also show which join strategy
was used for each join operation, either a Broadcast or a Shuffle join. I will try to
cover this as well in a later part of the series.
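When you only care about this table, a handy trick is to pull just this block out of a saved PROFILE; it appears under the "ExecSummary:" label in the raw profile text. A small sketch (adjust the number of context lines to the size of your plan):
# Print the operator summary table from a saved profile
grep -A 20 'ExecSummary:' sample_07_profile.txt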
That's all for Part 2 of the series; I hope it is useful. Next time I will try to
get more complicated query PROFILEs to share and work through, so we can
understand more.
Part 3:
In this third part of the blog series, I will still be focusing on the Query Plan as
well as the Execution Summary, but using a more complicated query against
real-life data downloaded from Kaggle's Flight Delays dataset.
This dataset has 3 tables:
- flights.csv
- airlines.csv
- airports.csv
Now, let’s jump further down the PROFILE and have a look at the Planner and
Query Timeline:
1. Planner Timeline
2. Analysis finished: 3ms (3389346)
3. Equivalence classes computed: 3ms (3600838)
4. Single node plan created: 4ms (4625920)
5. Runtime filters computed: 4ms (4734686)
6. Distributed plan created: 5ms (5120630)
7. Lineage info computed: 13ms (13666462)
8. Planning finished: 15ms (15712999)
9. Query Timeline
10. Query submitted: 0ns (0)
11. Planning finished: 16ms (16999947)
12. Submit for admission: 17ms (17999944)
13. Completed admission: 17ms (17999944)
14. Ready to start on 4 backends: 18ms (18999941)
15. All 4 execution backends (10 fragment instances) started: 28ms (28999909)
16. Rows available: 4.28s (4280986646)
17. First row fetched: 4.31s (4308986559)
Each line is pretty much self-explanatory. We can see that query planning
took 15ms, the query was submitted for admission at 17ms, was ready to
execute on the worker nodes at 28ms, rows were finally available
at 4.28 seconds, and the first row was fetched by the client at 4.31 seconds. This
gives you a very clear overview of how long each stage took. If any of the stages
is slow, it will be very obvious, and then we can start to drill down further to see
what might have happened.
Since my query was fast, it is not very interesting to look at here. Let's have a look
at another PROFILE from a real production Impala query:
1. Query Compilation: 16.268ms
2. - Metadata of all 1 tables cached: 1.786ms (1.786ms)
3. - Analysis finished: 6.162ms (4.376ms)
4. - Value transfer graph computed: 6.537ms (374.918us)
5. - Single node plan created: 7.955ms (1.417ms)
6. - Runtime filters computed: 8.274ms (318.815us)
7. - Distributed plan created: 8.430ms (156.307us)
8. - Lineage info computed: 9.664ms (1.234ms)
9. - Planning finished: 16.268ms (6.603ms)
10. Query Timeline: 35m46s
11. - Query submitted: 0.000ns (0.000ns)
12. - Planning finished: 22.001ms (22.001ms)
13. - Submit for admission: 23.001ms (1.000ms)
14. - Completed admission: 23.001ms (0.000ns)
15. - Ready to start on 2 backends: 24.001ms (1.000ms)
16. - All 2 execution backends (2 fragment instances) started: 36.001ms (12.000ms)
17. - Rows available: 5m51s (5m51s)
18. - First row fetched: 5m52s (950.045ms)
19. - Last row fetched: 35m46s (29m53s)
20. - Released admission control resources: 35m46s (1.000ms)
21. - Unregister query: 35m46s (30.001ms)
22. - ComputeScanRangeAssignmentTimer: 0.000ns
This was taken from a case where an Impala query took a long time to run and the
customer wanted to find out why. From the Query Timeline, we can clearly see
that it took almost 6 minutes (5m51s) from starting execution ("All 2 execution
backends") until data was available ("Rows available"). This 6-minute execution
could be normal: if there are lots of joins over a large dataset, it is common for a
query to run for several minutes.
However, we can also see that it took Impala 30 minutes to pass the data back to
the client: "First row fetched" happened at the 6-minute mark, but "Last row fetched"
only at the 36-minute mark. So from here, we could suspect
a networking issue between the Impala coordinator and the client (data is fetched
by the client, such as impala-shell or Hue, from the Impala coordinator host).
Another possibility is that the client is capturing the results and performing
other actions, like printing them to the screen; if the returned data is big, that
operation can be time consuming.
So this section of the PROFILE can point us in the right direction when looking
for the bottleneck.
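One simple way to separate the two possibilities is to rerun the query with client-side rendering taken out of the picture, for example by writing the result in delimited mode straight to /dev/null. A sketch using impala-shell's -B (delimited output) and -o (output file) options; replace the host and query with your own:
# If this finishes quickly, the bottleneck is the client handling/printing rows,
# not Impala producing them
impala-shell -i impala-coordinator-host.com -B -q "SELECT ..." -o /dev/null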
This concludes Part 3 of the Impala PROFILE series. Next, I will explain how to
relate the operator numbers shown in the Query Plan section to the detailed
metrics for each operator near the bottom of the PROFILE, both averaged and
per host.
Part 4:
OK, let's get started. Since the PROFILE itself is quite large, as the query involved
several Impala Daemons, it would be ugly to include the full content on this page.
Instead, I will go through it section by section and explain what information I was
looking for when troubleshooting the issue.
The problem with the query was that, for whatever reason, it used to finish in
under a few minutes but now took more than 1 hour. This PROFILE is just one
example; in fact, ALL queries running through this cluster had exactly the same
issue at the time. So please spend some time going through the PROFILE yourself
and see whether you can capture any useful information and understand the
situation.
Now, let me go through in more detail the steps I used to troubleshoot this
particular issue.
1. Since the user complained that the query took longer than normal, the first thing
I wanted to check was: how long exactly? So, the obvious first step was to look for
the Start and End Time at the beginning of the PROFILE:
1. Start Time: 2020-01-03 07:33:42.928171000
2. End Time: 2020-01-03 08:47:55.745537000
I noticed that it took 1 hour and 14 minutes to finish the query, which matched
what the user reported.
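As a side note, if you check durations like this often, the arithmetic can be scripted. A rough sketch using GNU date, assuming the profile is saved as profile-example.txt (fractional seconds are simply dropped):
# Extract Start/End Time from the profile and print the elapsed time in seconds
start=$(grep 'Start Time:' profile-example.txt | sed 's/.*Start Time: //' | cut -d. -f1)
end=$(grep 'End Time:' profile-example.txt | sed 's/.*End Time: //' | cut -d. -f1)
echo "$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) )) seconds"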
2. I noticed that the query failed with an EXCEPTION, due to user cancellation:
1. Query State: EXCEPTION
2. Query Status: Cancelled
So it was likely that the user ran out of patience and cancelled the query because
it took too long. Nothing to worry about here.
3. Next, I checked the operator Summary section, following the points covered in
Part 2:
- Read it in reverse order, from bottom to top, as that is the order in which Impala
performs the operations
- Compare the "Avg Time" and "Max Time" columns
- Compare the "#Rows" and "Est. #Rows" columns
- Check the "Detail" column to see what type of JOIN was used for each operation
Immediately, I noticed a big difference between "Avg Time" and "Max Time" for
the SCAN HDFS operator: the average time was 3 minutes and 7 seconds, but the
maximum time, on one host out of the 29, was 56 minutes and 13 seconds.
Reading on, I noticed exactly the same issue for the second SCAN HDFS
operation: 44 seconds versus 13 minutes and 18 seconds.
So my next thought was to identify which host (or hosts) performed much more
slowly than the others, and whether it was always the same host. To do so, I
searched the PROFILE for the string "id=0". The "0" is the operator number shown
at the beginning of each line in the Summary section ("00:SCAN HDFS"), and a
matching "id=[\d]+" string is attached to each operator in the detailed breakdown
further down in the PROFILE. Remember to remove any leading 0s.
I searched for the first instance of "id=0" from the beginning of the file and
reached the section below:
1. HDFS_SCAN_NODE (id=0)
2. ....
3. - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995)
4. - MaterializeTupleTime(*): 226ms (226568242)
5. - ScannerThreadsSysTime: 322ms (322168172)
6. - ScannerThreadsUserTime: 6.76s (6758158482)
7. - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907)
8. - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560)
9. - TotalRawHdfsReadTime(*): 3.4m (201957505069)
10. - TotalReadThroughput: 749.9 KiB/s (767874)
11. - TotalTime: 3.1m (187289950304)
I noticed that TotalTime was 3.1 minutes, which matched the 3 minutes and 7
seconds I saw in the Summary section, so this was the Averaged Fragment.
To confirm, I scrolled back up and saw the following:
1. Averaged Fragment F00
Continuing to search the file, I came to the section below (the second instance of "id=0"):
1. HDFS_SCAN_NODE (id=0)
2. ....
3. - ScannerThreadsTotalWallClockTime: 10.4m (626435081910)
4. - MaterializeTupleTime(*): 278ms (278689886)
5. - ScannerThreadsSysTime: 266ms (266094000)
6. - ScannerThreadsUserTime: 5.75s (5748833000)
7. - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285)
8. - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839)
9. - TotalRawHdfsReadTime(*): 1.9m (114598713147)
10. - TotalReadThroughput: 731.0 KiB/s (748535)
11. - TotalTime: 2.1m (125005670562)
This one told me it took 2.1 minutes, which was faster than the average of 3.1
minutes. Scrolling back confirmed which host it was:
1. Fragment F00
2. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
Now I could see the three pieces of information I was looking for:
1. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
2. HDFS_SCAN_NODE (id=0)
3. - TotalTime: 2.1m (125005670562)
I thought it would be easier if I could use a simple "grep" to filter all of this out.
Since the PROFILE was nicely indented, I used the egrep command below to get
what I was after:
1. egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=0\)|^ - TotalTime: ' profile-example.txt
It yielded the result below:
1. ...
2. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
3. ...
4. HDFS_SCAN_NODE (id=0)
5. - TotalTime: 2.1m (125005670562)
6. Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-
cdn003.xxx.XXXXXX.com:22000)
7. ...
8. HDFS_SCAN_NODE (id=0)
9. - TotalTime: 1.9m (114395426955)
10. Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-
cdn020.xxx.XXXXXX.com:22000)
11. ...
12. HDFS_SCAN_NODE (id=0)
13. - TotalTime: 1.5m (92671503850)
14. Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-
cdn012.xxx.XXXXXX.com:22000)
15. ...
16. HDFS_SCAN_NODE (id=0)
17. - TotalTime: 1.4m (86459970122)
18. Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-
cdn014.xxx.XXXXXX.com:22000)
19. ...
20. HDFS_SCAN_NODE (id=0)
21. - TotalTime: 1.4m (82187347776)
22. Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-
cdn006.xxx.XXXXXX.com:22000)
23. ...
24. HDFS_SCAN_NODE (id=0)
25. - TotalTime: 1.4m (82146306944)
26. Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-
cdn024.xxx.XXXXXX.com:22000)
27. ...
28. HDFS_SCAN_NODE (id=0)
29. - TotalTime: 1.3m (80468400288)
30. Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-
cdn022.xxx.XXXXXX.com:22000)
31. ...
32. HDFS_SCAN_NODE (id=0)
33. - TotalTime: 1.3m (79714897965)
34. Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-
cdn017.xxx.XXXXXX.com:22000)
35. ...
36. HDFS_SCAN_NODE (id=0)
37. - TotalTime: 1.3m (78877950983)
38. Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-
cdn001.xxx.XXXXXX.com:22000)
39. ...
40. HDFS_SCAN_NODE (id=0)
41. - TotalTime: 1.3m (77593734314)
42. Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-
cdn019.xxx.XXXXXX.com:22000)
43. ...
44. HDFS_SCAN_NODE (id=0)
45. - TotalTime: 1.3m (76164245478)
46. Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-
cdn007.xxx.XXXXXX.com:22000)
47. ...
48. HDFS_SCAN_NODE (id=0)
49. - TotalTime: 1.3m (75588331159)
50. Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-
cdn010.xxx.XXXXXX.com:22000)
51. ...
52. HDFS_SCAN_NODE (id=0)
53. - TotalTime: 1.2m (73596530464)
54. Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-
cdn018.xxx.XXXXXX.com:22000)
55. ...
56. HDFS_SCAN_NODE (id=0)
57. - TotalTime: 1.2m (72946574082)
58. Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-
cdn026.xxx.XXXXXX.com:22000)
59. ...
60. HDFS_SCAN_NODE (id=0)
61. - TotalTime: 1.2m (69918383242)
62. Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-
cdn011.xxx.XXXXXX.com:22000)
63. ...
64. HDFS_SCAN_NODE (id=0)
65. - TotalTime: 1.2m (69355611992)
66. Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-
cdn009.xxx.XXXXXX.com:22000)
67. ...
68. HDFS_SCAN_NODE (id=0)
69. - TotalTime: 1.1m (68527129814)
70. Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-
cdn016.xxx.XXXXXX.com:22000)
71. ...
72. HDFS_SCAN_NODE (id=0)
73. - TotalTime: 1.1m (67249633571)
74. Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-
cdn013.xxx.XXXXXX.com:22000)
75. ...
76. HDFS_SCAN_NODE (id=0)
77. - TotalTime: 1.1m (63989781076)
78. Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-
cdn028.xxx.XXXXXX.com:22000)
79. ...
80. HDFS_SCAN_NODE (id=0)
81. - TotalTime: 1.0m (62739870946)
82. Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-
cdn025.xxx.XXXXXX.com:22000)
83. ...
84. HDFS_SCAN_NODE (id=0)
85. - TotalTime: 1.0m (62136511127)
86. Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-
cdn005.xxx.XXXXXX.com:22000)
87. ...
88. HDFS_SCAN_NODE (id=0)
89. - TotalTime: 1.0m (61943905274)
90. Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-
cdn027.xxx.XXXXXX.com:22000)
91. ...
92. HDFS_SCAN_NODE (id=0)
93. - TotalTime: 1.0m (61955797776)
94. Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-
cdn021.xxx.XXXXXX.com:22000)
95. ...
96. HDFS_SCAN_NODE (id=0)
97. - TotalTime: 1.0m (60045780252)
98. Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-
cdn029.xxx.XXXXXX.com:22000)
99. ...
100. HDFS_SCAN_NODE (id=0)
101. - TotalTime: 58.05s (58048904552)
102. Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-
cdn023.xxx.XXXXXX.com:22000)
103. ...
104. HDFS_SCAN_NODE (id=0)
105. - TotalTime: 57.34s (57338024825)
106. Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-
cdn008.xxx.XXXXXX.com:22000)
107. ...
108. HDFS_SCAN_NODE (id=0)
109. - TotalTime: 53.13s (53130104765)
110. Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-
cdn004.xxx.XXXXXX.com:22000)
111. ...
112. HDFS_SCAN_NODE (id=0)
113. - TotalTime: 43.24s (43238668974)
114. Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-
cdn015.xxx.XXXXXX.com:22000)
115. ...
116. HDFS_SCAN_NODE (id=0)
117. - TotalTime: 56.2m (3373973559713)
I have omitted the irrelevant information and kept only the parts I was
interested in. Now I could clearly see which host was the bottleneck: it was
host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, which took 56.2
minutes, while ALL of the other hosts took around 40 seconds to 2 minutes.
Then I remembered that another HDFS SCAN had the same symptom, operator 1
(01:SCAN HDFS), so I ran the same "egrep" command (remember that the
indentation can differ between operators, so I needed to find them in the PROFILE
first and copy exactly the amount of whitespace in front of them to get the result
I wanted):
1. egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=1\)|^ - TotalTime: ' profile-example.txt
And again, the result confirmed the same thing:
1. ....
2. Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-
cdn015.xxx.XXXXXX.com:22000)
3. ...
4. HDFS_SCAN_NODE (id=1)
5. - TotalTime: 13.3m (798747290751)
6. ...
7. Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-
cdn001.xxx.XXXXXX.com:22000)
8. ...
9. HDFS_SCAN_NODE (id=1)
10. - TotalTime: 28.16s (28163113611)
11. Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-
cdn009.xxx.XXXXXX.com:22000)
12. ...
13. HDFS_SCAN_NODE (id=1)
14. - TotalTime: 23.29s (23285966387)
15. ...
It was clear that the same host, xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, had
exactly the same problem: it ran much more slowly than the other hosts, 13.3
minutes versus 28.16 seconds.
I then came to the conclusion that something had happened to that host and
needed to be fixed.
To confirm the theory resulting from the above investigation, I asked the user to
shut down the Impala Daemon on that host and test the query again, and they
confirmed that the issue was resolved. Later on, they updated me that they had
identified hardware issues on that host and it had been decommissioned for
maintenance.
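As a side note, if you do not want to fiddle with the exact indentation for every operator, the per-host TotalTime extraction above can be approximated with a small awk sketch. It assumes that an operator's own TotalTime is the first TotalTime counter printed after its header, which is the case in the profiles shown here:
# Print each fragment instance (host) followed by the TotalTime of operator id=1
awk '/ Instance .*host=/       { inst = $0 }
     /HDFS_SCAN_NODE \(id=1\)/ { print inst; print $0; want = 1 }
     want && /- TotalTime: /   { print $0; want = 0 }' profile-example.txt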
Summary
Admission result:
Here we can see if the query was admitted immediately, queued or rejected. If
the query was queued or rejected, we can see the reason.
Start Time:
The time when the query was submitted.
End Time:
The time when the query was unregistered.
ExecSummary:
The summary of each operator (exec node), including time, number of rows and
memory usage. It is not visible until the query is closed.
Query Timeline:
Timeline of important events in the execution process (BE/C++). The timeline
starts from the time when the query was registered.
ImpalaServer
ClientFetchWaitTimer:
Time spent by the coordinator while idle waiting for a client to fetch rows.
MetastoreUpdateTimer:
Time spent to gather and publish all required updates to the metastore.
RowMaterializationTimer:
Time spent by the coordinator to fetch rows.
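These ImpalaServer counters are exactly what to check for the slow-fetch scenario described in Part 3, for example (assuming a saved profile file):
# A large ClientFetchWaitTimer relative to RowMaterializationTimer points at a slow
# client rather than a slow query
grep -E 'ClientFetchWaitTimer|RowMaterializationTimer' profile-example.txt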
Execution Profile
Metrics in Coordinator
CatalogOpExecTimer
Time spent by the coordinator to send catalog operation execution request and
wait for the response from the catalogd.
ComputeScanRangeAssignmentTimer:
Time spent computing the assignment of scan ranges to hosts for each scan node.
FiltersReceived:
The total number of filter updates received (always 0 if filter mode is not
GLOBAL). Excludes repeated broadcast filter updates.
FinalizationTimer:
Total time spent in finalization (typically 0 except for INSERT into HDFS tables).
ExecTreeExecTime:
Time spent by the execution node and its descendants in this fragment to retrieve rows
and return them via row_batch.
ExecTreeOpenTime:
Time spent by the execution node and its descendants in this fragment to perform any
preparatory work prior to retrieving rows.
Filter X arrival:
The amount of time waited since registration for the filter to arrive. 0 means that
filter has not yet arrived.
OpenTime:
Time spent in fragment Open() logic. It includes ExecTreeOpenTime, the time to
generate LLVM code and the time to open sink in this fragment.
PerHostPeakMemUsage:
A counter for the per query, per host peak mem usage. Note that this is not the
max of the peak memory of all fragments running on a host since it needs to take
into account when they are running concurrently. All fragments for a single query
on a single host will have the same value for this counter.
PrepareTime:
Time to prepare for fragment execution.
RowsProduced:
The number of rows returned by this fragment instance.
TotalNetworkReceiveTime:
Total time spent receiving over the network (across all threads).
TotalNetworkSendTime:
Total time spent waiting for RPCs that send row batches to complete. This is a
combination of the network time to send the RPC payload to the destination and the
time the request spends queued and being processed on the receiving side.
TotalStorageWaitTime:
Total time waiting in storage (across all threads).
TotalThreads:
Total CPU utilization for all threads in this plan fragment.
Common metrics:
InactiveTotalTime:
Total time spent waiting (on non-children) that should not be counted when
computing local_time_percent_. This is updated for example in the exchange
node when waiting on the sender from another fragment.
LocalTime:
Time spent in this node (not including the children). Computed in
ComputeTimeInProfile().
TotalTime:
The total elapsed time.
CumulativeAllocationBytes:
Bytes of buffers allocated via BufferAllocator::AllocateBuffer().
CumulativeAllocations:
The number of buffers allocated via BufferAllocator::AllocateBuffer().
PeakReservation:
The tracker’s peak reservation in bytes.
PeakUnpinnedBytes:
The peak total size of unpinned pages.
PeakUsedReservation:
The tracker’s peak usage in bytes.
ReadIoBytes:
Total bytes read from disk.
ReadIoOps:
The total number of read I/O operations issued.
ReadIoWaitTime:
Amount of time spent waiting for reads from disk to complete.
ReservationLimit:
The hard limit on the tracker’s reservations.
WriteIoBytes:
Total bytes written to disk.
WriteIoOps:
The total number of write I/O operations issued.
WriteIoWaitTime:
Amount of time spent waiting for writes to disk to complete.
EosSent:
Total number of EOS sent.
NetworkThroughput:
Summary of network throughput for sending row batches. Network time also
includes queuing time in KRPC transfer queue for transmitting the RPC requests
and receiving the responses.
OverallThroughput:
Throughput per total time spent in the sender.
RowsReturned:
The number of row batches enqueued into the row batch queue.
RowsSent:
The total number of rows sent.
RpcFailure:
The total number of times RPC fails or the remote responds with a non-retryable
error.
RpcRetry:
Number of TransmitData() RPC retries due to remote service being busy.
SerializeBatchTime:
Time for serializing row batches.
TotalBytesSent:
The total number of bytes sent. Updated on RPC completion.
TransmitDataRPCTime:
The concurrent wall time spent sending data over the network.
UncompressedRowBatchSize:
The total number of bytes of row batches before compression.
BytesReceived:
The total number of bytes of serialized row batches received.
BytesSkipped:
The number of bytes skipped when advancing to next sync on error.
DataWaitTime:
Total wall-clock time spent waiting for data to be available in queues.
DeferredQueueSize:
Time series of the number of deferred row batches, samples
‘num_deferred_rpcs_’.
DeserializeRowBatchTimer:
Total wall-clock time spent deserializing row batches.
DispatchTime:
Summary stats of time which RPCs spent in KRPC service queue before being
dispatched to the RPC handlers.
FirstBatchArrivalWaitTime:
Time spent waiting until the first batch arrives across all queues.
FirstBatchWaitTime:
Wall-clock time spent waiting for the first batch arrival across all queues.
SendersBlockedTimer:
Wall time senders spend waiting for the recv buffer to have the capacity.
SendersBlockedTotalTimer:
Total time (summed across all threads) spent waiting for the recv buffer to be
drained so that new batches can be added. Remote plan fragments are blocked
for the same amount of time.
TotalBatchesEnqueued:
The total number of deserialized row batches enqueued into the row batch
queues.
TotalBatchesReceived:
The total number of serialized row batches received.
TotalBytesDequeued:
The number of bytes of deserialized row batches dequeued.
TotalBytesReceived:
The total number of bytes of serialized row batches received.
TotalEarlySenders:
The total number of senders which arrive before the receiver is ready.
TotalEosReceived:
Total number of EOS received.
TotalGetBatchTime:
Total wall-clock time spent in SenderQueue::GetBatch().
TotalHasDeferredRPCsTime:
Total wall-clock time in which the ‘deferred_rpcs_’ queues are not empty.
TotalRPCsDeferred:
Total number of RPCs whose responses are deferred because of early senders or
full row batch queue.
CompressTimer:
Time spent compressing data before writing into files.
EncodeTimer:
Time spent converting tuple to the on-disk format.
FilesCreated:
The number of created files.
HdfsWriteTimer:
Time spent writing to HDFS.
KuduApplyTimer:
Time spent applying Kudu operations. In normal circumstances, the Kudu
operation should be negligible because it is asynchronous with
AUTO_FLUSH_BACKGROUND enabled. Significant KuduApplyTimer may indicate
that Kudu cannot buffer and send rows as fast as the sink can write them.
NumRowErrors:
The number of (Kudu) rows with errors.
PartitionsCreated:
The total number of partitions created.
RowsInserted:
The number of inserted rows.
RowsProcessedRate:
The rate at which the sink consumes and processes rows, i.e. writing rows to Kudu
or skipping rows that are known to violate nullability constraints.
TotalNumRows:
The total number of rows processed, i.e. rows written to Kudu and also rows with
errors.
AverageScannerThreadConcurrency:
The average number of scanner threads executing between Open() and the time
when the scan completes. Present only for multithreaded scan nodes.
BytesRead:
Total bytes read from disk by this scan node. Provided as a counter as well as a
time series that samples the counter. Only implemented for scan node subclasses
that expose the bytes read, e.g. HDFS and HBase.
BytesReadDataNodeCache:
The total number of bytes read from the data node cache.
BytesReadLocal:
The total number of bytes read locally.
BytesReadRemoteUnexpected:
The total number of bytes read remotely that were expected to be local.
BytesReadShortCircuit:
The total number of bytes read via short circuit read.
CachedFileHandlesHitCount:
The total number of file handle opens where the file handle was present in the
cache.
CachedFileHandlesMissCount:
The total number of file handle opens where the file handle was not in the cache.
CollectionItemsRead:
The total number of nested collection items read by the scan. Only created for
scans (e.g. Parquet) that support nested types.
DecompressionTime:
Time spent decompressing bytes.
DelimiterParseTime:
Time spent parsing the bytes for delimiters in text files.
FooterProcessingTime:
Average and min/max time spent processing the (parquet) footer by each split.
MaterializeTupleTime:
Wall clock time spent materializing tuples and evaluating predicates. Usually, it’s
affected by the load on the CPU and the complexity of expressions.
MaxCompressedTextFileLength:
The size of the largest compressed text file to be scanned. This is used to
estimate scanner thread memory usage.
NumColumns:
The number of (parquet) columns that need to be read.
NumDictFilteredRowGroups:
The number of (parquet) row groups skipped due to dictionary filter.
NumDisksAccessed:
The number of distinct disks accessed by HDFS scan. Each local disk is counted as
a disk and each remote disk queue (e.g. HDFS remote reads, S3) is counted as a
distinct disk.
NumRowGroups:
The number of (parquet) row groups that need to be read.
NumScannerThreadsStarted:
The number of scanner threads started for the duration of the scan node. This is
at most the number of scan ranges but should be much less since a single
scanner thread will likely process multiple scan ranges. This is *not* the same as
peak scanner thread concurrency because the number of scanner threads can
fluctuate during the execution of the scan.
NumScannersWithNoReads:
The number of scanners that end up doing no reads because their splits don’t
overlap with the midpoint of any row-group in the (parquet) file.
NumStatsFilteredRowGroups:
The number of row groups that are skipped because of Parquet row group
statistics.
PeakScannerThreadConcurrency:
The peak number of scanner threads executing at any one time. Present only for
multithreaded scan nodes.
PerReadThreadRawHdfsThroughput:
The read throughput in bytes/sec for each HDFS read thread while it is executing
I/O operations on behalf of this scan.
RemoteScanRanges:
The total number of remote scan ranges.
RowBatchBytesEnqueued:
The number of row batches and bytes enqueued in the scan node’s output queue.
RowBatchesEnqueued:
The number of row batches enqueued into the row batch queue.
RowBatchQueueCapacity:
The capacity in batches of the scan node’s output queue.
RowBatchQueueGetWaitTime:
Wall clock time that the fragment execution thread spent blocked waiting for row
batches to be added to the scan node’s output queue.
RowBatchQueuePeakMemoryUsage:
Peak memory consumption of row batches enqueued in the scan node’s output
queue.
RowBatchQueuePutWaitTime:
Wall clock time that the scanner threads spent blocked waiting for space in the
scan node’s output queue when it is full.
RowsRead:
The number of top-level rows/tuples read from the storage layer, including those
discarded by predicate evaluation. Used for all types of scans.
ScannerIoWaitTime:
The total amount of time scanner threads spent waiting for I/O. This is
comparable to ScannerThreadsTotalWallClockTime in the traditional HDFS scan
nodes and the scan node total time for the MT_DOP > 1 scan nodes. Low values
show that each I/O completed before or around the time that the scanner thread
was ready to process the data. High values show that scanner threads are
spending significant time waiting for I/O instead of processing data. Note that if
CPU load is high, this can include the time that the thread is runnable but not
scheduled.
ScannerThreadsSysTime /
ScannerThreadsUserTime /
ScannerThreadsVoluntaryContextSwitches /
ScannerThreadsInvoluntaryContextSwitches:
These are aggregated counters across all scanner threads of this scan node. They
are taken from getrusage. See RuntimeProfile::ThreadCounters for details.
ScannerThreadsTotalWallClockTime:
Total wall clock time spent in all scanner threads.
ScanRangesComplete:
The number of scan ranges completed. Initialized for scans that have a concept of
“scan range”.
TotalRawHbaseReadTime:
The total wall clock time spent in HBase read calls. For example, if we have 3
threads and each spent 1 sec, this counter will report 3 sec.
TotalRawHdfsOpenFileTime:
The total wall clock time spent by Disk I/O threads in HDFS open operations. For
example, if we have 3 threads and each spent 1 sec, this counter will report 3 sec.
TotalRawHdfsReadTime:
The total wall clock time spent by Disk I/O threads in HDFS read operations. For
example, if we have 3 threads and each spent 1 sec, this counter will report 3 sec.
TotalReadThroughput:
BytesRead divided by the total wall clock time that this scan was executing (from
Open() to Close()). This gives the aggregate rate that data is read from disks. If
this is the only scan executing, ideally this will approach the maximum bandwidth
supported by the disks.
Metrics in Runtime Filter:
BloomFilterBytes:
The total amount of memory allocated to Bloom Filters.
Rows processed:
Total number of rows to which each filter was applied
Rows rejected:
Total number of rows that each filter rejected.
Rows total:
The total number of rows that each filter could have been applied to (if it were
available from row 0).
BuildRowsPartitioned:
The number of build rows that have been partitioned.
BuildRowsPartitionTime:
Time spent partitioning build rows.
BuildTime:
Time to prepare build side.
HashBuckets:
The total number of hash buckets across all partitions.
HashCollisions:
The number of cases where we had to compare buckets with the same hash
value, but the row equality failed.
HashTablesBuildTime:
Time spent building hash tables.
LargestPartitionPercent:
The largest fraction after repartitioning. This is expected to be 1 /
PARTITION_FANOUT. A value much larger indicates skew.
MaxPartitionLevel:
Level of the max partition (i.e. number of repartitioning steps).
NullAwareAntiJoinEvalTime:
Time spent evaluating other_join_conjuncts for NAAJ.
NumHashTableBuildsSkipped:
The number of partitions which had zero probe rows and we therefore didn’t
build the hash table.
NumRepartitions:
The number of partitions that have been repartitioned.
PartitionsCreated:
The total number of partitions created.
ProbeRows:
The number of probe (left child) rows.
ProbeRowsPartitioned:
The number of probe rows that have been partitioned.
ProbeTime:
Time to process the probe (left child) batch.
RepartitionTime:
Time spent repartitioning and building hash tables of any resulting partitions that
were not spilled.
SpilledPartitions:
The number of partitions that have been spilled.
InMemorySortTime:
Time spent sorting initial runs in memory.
MergeGetNext:
Time to get and return the next batch of sorted rows from this merger.
MergeGetNextBatch:
Time spent in calls to get the next batch of rows from the input run.
NumRowsPerRun:
Min, max, and avg size of runs in number of tuples (Sorter).
TotalMergesPerformed:
Number of merges of sorted runs.
SortDataSize:
The total size of the initial runs in bytes.
SpilledRuns:
The number of runs that were unpinned and may have spilled to disk, including
initial and intermediate runs.
Metrics in Aggregation Node:
BuildTime:
Time to prepare build side.
GetResultsTime:
Time spent returning the aggregated rows.
HashBuckets:
The total number of hash buckets across all partitions.
HTResizeTime:
Total time spent resizing hash tables.
LargestPartitionPercent:
The largest fraction after repartitioning. This is expected to be 1 /
PARTITION_FANOUT. A value much larger indicates skew.
MaxPartitionLevel:
Level of the max partition (i.e. number of repartitioning steps).
NumRepartitions:
The number of partitions that have been repartitioned.
PartitionsCreated:
The total number of partitions created.
ReductionFactorEstimate:
The estimated reduction of the pre-aggregation.
ReductionFactorThresholdToExpand:
Expose the minimum reduction factor to continue growing the hash tables.
RowsPassedThrough:
The number of rows passed through without aggregation.
RowsRepartitioned:
The number of rows that have been repartitioned.
SpilledPartitions:
The number of partitions that have been spilled.
StreamingTime:
Time spent in streaming pre-aggregation algorithm.
Metrics in CodeGen:
CodegenTime:
Time spent doing codegen (adding IR to the module).
CompileTime:
Time spent compiling the module.
ExecTreePrepareTime:
Time to prepare the sink, set up internal structures in the execution node and its
descendants, start the profile-reporting thread and wait until it is active.
LoadTime:
Time spent reading the .ir file from the file system.
ModuleBitcodeSize:
The total size of bitcode modules loaded in bytes.
NumFunctions:
The number of functions that are optimized and compiled after pruning unused
functions from the module.
NumInstructions:
The number of instructions that are optimized and compiled after
pruning unused functions from the module.
OptimizationTime:
Time spent optimizing the module.
PrepareTime:
Time spent constructing the in-memory module from the ir.
Metrics in TmpFileMgr:
TmpFileMgr provides an abstraction for management of temporary (a.k.a. scratch)
files on the filesystem and I/O to and from them.
ScratchBytesRead:
The number of bytes read from disk (includes reads started but not yet
completed).
ScratchBytesWritten:
The number of bytes written to disk (includes writes started but not yet
completed).
ScratchFileUsedBytes:
Amount of scratch space allocated in bytes.
ScratchReads:
The number of READ operations (includes reads started but not yet completed).
ScratchWrites:
The number of write operations (includes writes started but not yet completed).
TotalEncryptionTime:
Time spent in disk spill encryption, decryption, and integrity checking.
TotalReadBlockTime:
Time spent waiting for disk reads.