Impala Query Tuning
If you work with Impala but have no idea how to interpret an Impala query
PROFILE, it is very hard to understand what is going on and how to make
your query run at its full potential. I think this is the case for a lot of Impala users,
so I would like to write a simple blog series to share my experience and hopefully
help anyone who would like to learn more.
This is Part 1 of the series, so I will start with the basics and cover the main
things to look out for when examining a PROFILE.
So first things first: how do you collect an Impala query PROFILE? There are a
couple of ways. The simplest is to run "PROFILE" right after your query in
impala-shell, like below:
1. [impala-daemon-host.com:21000] > SELECT COUNT(*) FROM sample_07;
2. Query: SELECT COUNT(*) FROM sample_07
3. Query submitted at: 2018-09-14 15:57:35 (Coordinator: https://impala-daemon-
host.com:25000)
4. Query progress can be monitored at: https://impala-daemon-host.com:25000/query_plan?
query_id=36433472787e1cab:29c30e7800000000
5. +----------+
6. | count(*) |
7. +----------+
8. | 823 |
9. +----------+
10. Fetched 1 row(s) in 6.68s
11.
12. [impala-daemon-host.com:21000] > PROFILE; <-- Simply run "PROFILE" as a query
13. Query Runtime Profile:
14. Query (id=36433472787e1cab:29c30e7800000000):
15. Summary:
16. Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
17. Session Type: BEESWAX
18. .....
You can also collect the PROFILE from the Cloudera Manager Web UI by navigating to CM >
Impala > Queries, locating the query you just ran and clicking on "Query Details".
Alternatively, it is available directly from the Impala Daemon debug Web UI at:
https://{impala-daemon-url}:25000/queries
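If you want to capture the PROFILE to a file for later analysis, impala-shell can also do this non-interactively. Below is a minimal sketch (the host name is the example one from above; -p / --show_profiles prints the runtime profile after each query):
# Run the query, print its runtime profile (-p), and save everything to a file
impala-shell -i impala-daemon-host.com:21000 -p \
    -q "SELECT COUNT(*) FROM sample_07" > sample_07_profile.txt 2>&1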
All right, now that we have the PROFILE, let's dive into the details. Below is the
snippet of the query PROFILE we will go through today, which is the
Summary section at the top of the PROFILE:
1. Query (id=36433472787e1cab:29c30e7800000000):
2. Summary:
3. Session ID: 443110cc7292c92:6e3ff4d76f0c5aaf
4. Session Type: BEESWAX
5. Start Time: 2018-09-14 15:57:35.883111000
6. End Time: 2018-09-14 15:57:42.565042000
7. Query Type: QUERY
8. Query State: FINISHED
9. Query Status: OK
10. Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build
50eddf4550faa6200f51e98413de785bf1bf0de1)
11. User: [email protected]
12. Connected User: [email protected]
13. Delegated User:
14. Network Address: ::ffff:xxx.xx.xxx.xx:58834
15. Default Db: default
16. Sql Statement: SELECT COUNT(*) FROM sample_07
17. Coordinator: impala-daemon-url.com:22000
18. Query Options (set by configuration):
19. Query Options (set by configuration and planner): MT_DOP=0
20. Plan:
21. ----------------
Let's break it into sections and walk through them one by one. There are a few
pieces of information here that are used more often than others:
a. Query ID:
1. Query (id=36433472787e1cab:29c30e7800000000):
This is useful for identifying relevant query-related information in the Impala Daemon
logs. Simply search for this query ID and you can find out what the query was doing
behind the scenes; this is especially useful for finding related error messages.
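For example, to pull everything the Impala Daemon logged for this query, you can grep its logs for the ID (a sketch; the log location varies by installation, /var/log/impalad is just a common default on CDH):
# Search the coordinator daemon's logs for this query ID
grep '36433472787e1cab:29c30e7800000000' /var/log/impalad/impalad.INFO*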
b. Session Type:
1. Session Type: BEESWAX
This tells us where the connection came from. BEESWAX means that the query was run
from the impala-shell client. If you run it from Hue, the type will be "HIVESERVER2",
since Hue connects via the HiveServer2 Thrift interface.
c. Query Status:
1. Query Status: OK
This tells us whether the query finished successfully or not. OK means good. If there
were errors, they will normally show here, for example cancelled by user, session
timeout, exceptions, etc.
d. Impala Version:
1. Impala Version: impalad version 2.11.0-cdh5.14.x RELEASE (build
50eddf4550faa6200f51e98413de785bf1bf0de1)
This confirms the version that was used to run the query. If it does not match your
installation, then something is not set up properly.
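A quick sanity check, assuming shell access to one of the Impala hosts and the sample_07_profile.txt file saved earlier, is to compare the version recorded in the PROFILE with what is actually installed:
# Version recorded in the profile vs. versions of the local binaries
grep 'Impala Version:' sample_07_profile.txt
impala-shell --version
impalad --version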
e. User information:
1. User: [email protected]
2. Connected User: [email protected]
3. Delegated User:
You can find out who ran the query from this session, so you know who to
blame :).
This concludes Part 1 of the series, which explained the Summary section of the
query PROFILE and the basic information it contains. In the next part of the series, I will
explain the Query Plan as well as the Execution Summary of the PROFILE in detail.
Part 2:
The next line is very important, as Impala tells us here whether it has detected that
the tables involved in the query have up-to-date statistics or not. This is crucial
because Impala uses table/column statistics to do resource estimation as well as to
build the query plan and determine the best strategy to run the query. If the stats
are not up to date, Impala will end up with a bad query plan, which will affect the
overall query performance.
In my example, we can see that the table default.sample_07's stats are missing.
Impala produced a warning so that users are informed about this, and
COMPUTE STATS should be performed on the table to fix it.
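For reference, this is how it could be fixed for the example table, using standard Impala statements run through impala-shell (the host name is the example one from Part 1; you can equally run the statements one at a time inside the shell):
# Compute table and column statistics, then verify that row counts are now populated
impala-shell -i impala-daemon-host.com:21000 \
    -q "COMPUTE STATS default.sample_07; SHOW TABLE STATS default.sample_07; SHOW COLUMN STATS default.sample_07"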
This can get very complicated if your query is complex, but let's start with this
simple query to understand the basics. One thing to remember is that you need to
read the Query Plan backwards, which allows you to follow what Impala
planned to do.
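By the way, you do not have to run the query to see this plan: EXPLAIN returns it up front, and raising EXPLAIN_LEVEL to 2 (extended) includes the memory estimates and cardinality details discussed below. A minimal sketch, again using the example host name (the two statements can also be run interactively):
# Show the extended query plan without executing the query
impala-shell -i impala-daemon-host.com:21000 \
    -q "SET EXPLAIN_LEVEL=2; EXPLAIN SELECT COUNT(*) FROM sample_07"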
a. HDFS Scan: from the 00:SCAN HDFS operator at the bottom of the plan, we can tell that:
- there was only one partition in the table, and Impala also read one partition. This
does not necessarily mean that the table is partitioned; if the table is not
partitioned, it will simply be shown as 1/1
- there was only one file under the table/partition (files=1)
- the total size read by Impala was 44.98KB
- there were no stats available for this table (stats-rows=unavailable, table
stats: rows=unavailable and cardinality=unavailable)
- Impala estimated 32MB of memory to run the query, and no memory was reserved
b. After the HDFS scan was complete, Impala needed to do an aggregation, since we ran
COUNT(*):
1. 01:AGGREGATE
2. | output: count(*)
3. | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
4. | tuple-ids=1 row-size=8B cardinality=1
There isn't much to explain here; just know that this operator performs the
aggregation step.
c. Fragment information:
1. F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
2. Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
This bit of information just above the 00:SCAN HDFS and 01:AGGREGATE
operators tells us that both the Scan and the Aggregation operator belong to Fragment
F00, which ran on 1 host with 1 instance. The Fragment ID F00 can be used to
find the actual fragment statistics later in the PROFILE, which tell us in more
detail how this fragment ran at runtime. I will also
cover this in a later part of the series.
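If you have the PROFILE saved to a file, jumping to those fragment statistics is a one-liner; for example, against the sample_07_profile.txt file saved in Part 1 (-n prints line numbers so you know where to open the file):
# Locate the "Averaged Fragment F00" and the per-instance "Fragment F00" sections
grep -n 'Fragment F00' sample_07_profile.txt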
d. Exchange Operation:
1. 02:EXCHANGE [UNPARTITIONED]
2. | mem-estimate=0B mem-reservation=0B
3. | tuple-ids=1 row-size=8B cardinality=1
So after the aggregation was done on each worker node, the results needed to be
exchanged from each worker node to the coordinator; that is what happened
here. After that, the coordinator needed to do the final aggregation/merge on
those results:
1. 03:AGGREGATE [FINALIZE]
2. | output: count:merge(*)
3. | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
4. | tuple-ids=1 row-size=8B cardinality=1
Both of the above operations belonged to the same Fragment F01, which
again can be used to reference the rest of the PROFILE data to find more detailed
stats about the query:
1. F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Now, let’s have a look at the Summary Section of the Profile:
1. Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
2. -----------------------------------------------------------------------------------------------------------
3. 03:AGGREGATE 1 999.992us 999.992us 1 1 20.00 KB 10.00 MB FINALIZE
4. 02:EXCHANGE 1 831.992ms 831.992ms 1 1 0 0 UNPARTITIONED
5. 01:AGGREGATE 1 0.000ns 0.000ns 1 1 16.00 KB 10.00 MB
6. 00:SCAN HDFS 1 709.995ms 709.995ms 823 -1 80.00 KB 32.00 MB default.sample_07
Here you can find the following information, which can be useful:
- It shows the average time and maximum time each operation took. If there is a
big difference between the two, you know there was imbalance/skew when running
the job on the worker nodes; in theory they should be processing similar amounts
of data, so we would expect all of the nodes to finish in a similar time range.
- If the values for "#Rows" and "Est. #Rows" are way off, as in my case, -1 for
"Est. #Rows" for the SCAN HDFS operation versus 823 for "#Rows" (the actual number
of rows returned by the query), we know that Impala has out-of-date
information about the table statistics. In my case we did not have table stats,
so Impala reported an estimated value of -1. If the estimated value is positive
but still differs from the actual number of rows returned, then we know we need to
run "COMPUTE STATS" against the table to update the statistics.
- The "#Hosts" column tells us how many worker nodes participated in that
particular operation. In my case the data was small, so only 1 host ran the query.
- "Peak Mem" and "Est. Peak Mem" are self-explanatory: they are the
actual memory used versus the estimated memory that Impala calculated based
on table stats.
If there are joins in the query, this section will also show which join strategy
was used for each join operation, either a Broadcast or a Shuffle join. I will try to
cover this as well in a later part of the series.
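When you only care about this table, a handy trick is to pull just this block out of a saved PROFILE; it appears under the "ExecSummary:" label in the raw profile text. A small sketch (adjust the number of context lines to the size of your plan):
# Print the operator summary table from a saved profile
grep -A 20 'ExecSummary:' sample_07_profile.txt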
That's all for Part 2 of the series; I hope it is useful. Next time I will try to
get more complicated query PROFILEs to share and work through, so we can
understand more.
Part 3:
In this third part of the blog series, I will still be focusing on the Query Plan as
well as the Execution Summary, but using a more complicated query against
real-life data downloaded from Kaggle's Flight Delays dataset.
This dataset has 3 tables:
- flights.csv
- airlines.csv
- airports.csv
Now, let’s jump further down the PROFILE and have a look at the Planner and
Query Timeline:
1. Planner Timeline
2. Analysis finished: 3ms (3389346)
3. Equivalence classes computed: 3ms (3600838)
4. Single node plan created: 4ms (4625920)
5. Runtime filters computed: 4ms (4734686)
6. Distributed plan created: 5ms (5120630)
7. Lineage info computed: 13ms (13666462)
8. Planning finished: 15ms (15712999)
9. Query Timeline
10. Query submitted: 0ns (0)
11. Planning finished: 16ms (16999947)
12. Submit for admission: 17ms (17999944)
13. Completed admission: 17ms (17999944)
14. Ready to start on 4 backends: 18ms (18999941)
15. All 4 execution backends (10 fragment instances) started: 28ms (28999909)
16. Rows available: 4.28s (4280986646)
17. First row fetched: 4.31s (4308986559)
Each line is pretty much self-explanatory. We can see that query planning
took 15ms, the query was submitted for admission at 17ms, was ready to
execute on the worker nodes at 28ms, rows were finally available
at 4.28 seconds, and the first row was fetched by the client at 4.31 seconds. This
gives you a very clear overview of how long each stage took. If any of the stages
is slow, it will be very obvious, and then we can start to drill down further to see
what might have happened.
Since my query was fast, it is not very interesting to look at here. Let's have a look
at another PROFILE from a real production Impala query:
1. Query Compilation: 16.268ms
2. - Metadata of all 1 tables cached: 1.786ms (1.786ms)
3. - Analysis finished: 6.162ms (4.376ms)
4. - Value transfer graph computed: 6.537ms (374.918us)
5. - Single node plan created: 7.955ms (1.417ms)
6. - Runtime filters computed: 8.274ms (318.815us)
7. - Distributed plan created: 8.430ms (156.307us)
8. - Lineage info computed: 9.664ms (1.234ms)
9. - Planning finished: 16.268ms (6.603ms)
10. Query Timeline: 35m46s
11. - Query submitted: 0.000ns (0.000ns)
12. - Planning finished: 22.001ms (22.001ms)
13. - Submit for admission: 23.001ms (1.000ms)
14. - Completed admission: 23.001ms (0.000ns)
15. - Ready to start on 2 backends: 24.001ms (1.000ms)
16. - All 2 execution backends (2 fragment instances) started: 36.001ms (12.000ms)
17. - Rows available: 5m51s (5m51s)
18. - First row fetched: 5m52s (950.045ms)
19. - Last row fetched: 35m46s (29m53s)
20. - Released admission control resources: 35m46s (1.000ms)
21. - Unregister query: 35m46s (30.001ms)
22. - ComputeScanRangeAssignmentTimer: 0.000ns
This was taken from a case where an Impala query took a long time to run and the
customer wanted to find out why. From the Query Timeline, we can clearly see
that it took almost 6 minutes (5m51s) from starting execution ("All 2 execution
backends") until data was available ("Rows available"). This 6-minute execution
could be normal: if there are lots of joins over a large dataset, it is common for a
query to run for several minutes.
However, we can also see that it took Impala 30 minutes to pass the data back to
the client: "First row fetched" happened at the 6-minute mark, but "Last row fetched"
only at the 36-minute mark. So from here, we could suspect
a networking issue between the Impala coordinator and the client (data is fetched
by the client, such as impala-shell or Hue, from the Impala coordinator host).
Another possibility is that the client is capturing the results and performing
other actions, like printing them to the screen; if the returned data is big, that
operation can be time consuming.
So this section of the PROFILE can point us in the right direction when looking
for the bottleneck.
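One simple way to separate the two possibilities is to rerun the query with client-side rendering taken out of the picture, for example by writing the result in delimited mode straight to /dev/null. A sketch using impala-shell's -B (delimited output) and -o (output file) options; replace the host and query with your own:
# If this finishes quickly, the bottleneck is the client handling/printing rows,
# not Impala producing them
impala-shell -i impala-coordinator-host.com -B -q "SELECT ..." -o /dev/null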
This concludes Part 3 of the Impala PROFILE series. Next, I will explain how to
relate the operator numbers shown in the Query Plan section to the detailed
metrics for each operator near the bottom of the PROFILE, both averaged and
per host.
Part 4:
OK, let's get started. Since the PROFILE itself is quite large, as the query involved
several Impala Daemons, it would be ugly to include the full content on this page.
Instead, I will go through it section by section and explain what information I was
looking for when troubleshooting the issue.
The problem with the query was that, for whatever reason, it used to finish in
under a few minutes but now took more than 1 hour. This PROFILE is just one
example; in fact, ALL queries running through this cluster had exactly the same
issue at the time. So please spend some time going through the PROFILE yourself
and see whether you can capture any useful information and understand the
situation.
Now, let me go through in more detail the steps I used to troubleshoot this
particular issue.
1. Since the user complained that the query took longer than normal, the first thing
I wanted to check was: how long exactly? So, the obvious first step was to look for
the Start and End Time at the beginning of the PROFILE:
1. Start Time: 2020-01-03 07:33:42.928171000
2. End Time: 2020-01-03 08:47:55.745537000
I noticed that it took 1 hour and 14 minutes to finish the query, which matched
what the user reported.
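As a side note, if you check durations like this often, the arithmetic can be scripted. A rough sketch using GNU date, assuming the profile is saved as profile-example.txt (fractional seconds are simply dropped):
# Extract Start/End Time from the profile and print the elapsed time in seconds
start=$(grep 'Start Time:' profile-example.txt | sed 's/.*Start Time: //' | cut -d. -f1)
end=$(grep 'End Time:' profile-example.txt | sed 's/.*End Time: //' | cut -d. -f1)
echo "$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) )) seconds"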
2. I noticed that the query failed with an EXCEPTION, due to user cancellation:
1. Query State: EXCEPTION
2. Query Status: Cancelled
So it was likely that the user ran out of patience and cancelled the query because
it took too long. Nothing to worry about here.
3. Next, I checked the operator Summary section, following the points covered in
Part 2:
- Read it in reverse order, from bottom to top, as that is the order in which Impala
performs the operations
- Compare the "Avg Time" and "Max Time" columns
- Compare the "#Rows" and "Est. #Rows" columns
- Check the "Detail" column to see what type of JOIN was used for each operation
Immediately, I noticed a big difference between "Avg Time" and "Max Time" for
the SCAN HDFS operator: the average time was 3 minutes and 7 seconds, but the
maximum time, on one host out of the 29, was 56 minutes and 13 seconds.
Reading on, I noticed exactly the same issue for the second SCAN HDFS
operation: 44 seconds versus 13 minutes and 18 seconds.
So my next thought was to identify which host (or hosts) performed much more
slowly than the others, and whether it was always the same host. To do so, I
searched the PROFILE for the string "id=0". The "0" is the operator number shown
at the beginning of each line in the Summary section ("00:SCAN HDFS"), and a
matching "id=[\d]+" string is attached to each operator in the detailed breakdown
further down in the PROFILE. Remember to remove any leading 0s.
I searched for the first instance of "id=0" from the beginning of the file and
reached the section below:
1. HDFS_SCAN_NODE (id=0)
2. ....
3. - ScannerThreadsTotalWallClockTime: 20.0m (1200982115995)
4. - MaterializeTupleTime(*): 226ms (226568242)
5. - ScannerThreadsSysTime: 322ms (322168172)
6. - ScannerThreadsUserTime: 6.76s (6758158482)
7. - ScannerThreadsVoluntaryContextSwitches: 10,907 (10907)
8. - TotalRawHdfsOpenFileTime(*): 8.6m (517759170560)
9. - TotalRawHdfsReadTime(*): 3.4m (201957505069)
10. - TotalReadThroughput: 749.9 KiB/s (767874)
11. - TotalTime: 3.1m (187289950304)
I noticed that TotalTime was 3.1 minutes, which matched the 3 minutes and 7
seconds I saw in the Summary section, so this was the Averaged Fragment.
To confirm, I scrolled back up and saw the following:
1. Averaged Fragment F00
Continuing to search the file, I came to the section below (the second instance of "id=0"):
1. HDFS_SCAN_NODE (id=0)
2. ....
3. - ScannerThreadsTotalWallClockTime: 10.4m (626435081910)
4. - MaterializeTupleTime(*): 278ms (278689886)
5. - ScannerThreadsSysTime: 266ms (266094000)
6. - ScannerThreadsUserTime: 5.75s (5748833000)
7. - ScannerThreadsVoluntaryContextSwitches: 11,285 (11285)
8. - TotalRawHdfsOpenFileTime(*): 7.8m (468388283839)
9. - TotalRawHdfsReadTime(*): 1.9m (114598713147)
10. - TotalReadThroughput: 731.0 KiB/s (748535)
11. - TotalTime: 2.1m (125005670562)
This one told me it took 2.1 minutes, which was faster than the average of 3.1
minutes. Scrolling back confirmed which host it was:
1. Fragment F00
2. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
Now I could see the three pieces of information I was looking for:
1. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
2. HDFS_SCAN_NODE (id=0)
3. - TotalTime: 2.1m (125005670562)
I thought it would be easier if I could use a simple "grep" to filter all of this out.
Since the PROFILE was nicely indented, I used the egrep command below to get
what I was after:
1. egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=0\)|^ - TotalTime: ' profile-example.txt
It yielded the result below:
1. ...
2. Instance 94481a81355e51e4:51fd9f9500000053 (host=xxxxx-xxx-cdh-
cdn002.xxx.XXXXXX.com:22000)
3. ...
4. HDFS_SCAN_NODE (id=0)
5. - TotalTime: 2.1m (125005670562)
6. Instance 94481a81355e51e4:51fd9f9500000057 (host=xxxxx-xxx-cdh-
cdn003.xxx.XXXXXX.com:22000)
7. ...
8. HDFS_SCAN_NODE (id=0)
9. - TotalTime: 1.9m (114395426955)
10. Instance 94481a81355e51e4:51fd9f9500000058 (host=xxxxx-xxx-cdh-
cdn020.xxx.XXXXXX.com:22000)
11. ...
12. HDFS_SCAN_NODE (id=0)
13. - TotalTime: 1.5m (92671503850)
14. Instance 94481a81355e51e4:51fd9f950000003d (host=xxxxx-xxx-cdh-
cdn012.xxx.XXXXXX.com:22000)
15. ...
16. HDFS_SCAN_NODE (id=0)
17. - TotalTime: 1.4m (86459970122)
18. Instance 94481a81355e51e4:51fd9f950000004b (host=xxxxx-xxx-cdh-
cdn014.xxx.XXXXXX.com:22000)
19. ...
20. HDFS_SCAN_NODE (id=0)
21. - TotalTime: 1.4m (82187347776)
22. Instance 94481a81355e51e4:51fd9f9500000050 (host=xxxxx-xxx-cdh-
cdn006.xxx.XXXXXX.com:22000)
23. ...
24. HDFS_SCAN_NODE (id=0)
25. - TotalTime: 1.4m (82146306944)
26. Instance 94481a81355e51e4:51fd9f950000004f (host=xxxxx-xxx-cdh-
cdn024.xxx.XXXXXX.com:22000)
27. ...
28. HDFS_SCAN_NODE (id=0)
29. - TotalTime: 1.3m (80468400288)
30. Instance 94481a81355e51e4:51fd9f950000004d (host=xxxxx-xxx-cdh-
cdn022.xxx.XXXXXX.com:22000)
31. ...
32. HDFS_SCAN_NODE (id=0)
33. - TotalTime: 1.3m (79714897965)
34. Instance 94481a81355e51e4:51fd9f9500000043 (host=xxxxx-xxx-cdh-
cdn017.xxx.XXXXXX.com:22000)
35. ...
36. HDFS_SCAN_NODE (id=0)
37. - TotalTime: 1.3m (78877950983)
38. Instance 94481a81355e51e4:51fd9f9500000052 (host=xxxxx-xxx-cdh-
cdn001.xxx.XXXXXX.com:22000)
39. ...
40. HDFS_SCAN_NODE (id=0)
41. - TotalTime: 1.3m (77593734314)
42. Instance 94481a81355e51e4:51fd9f950000003c (host=xxxxx-xxx-cdh-
cdn019.xxx.XXXXXX.com:22000)
43. ...
44. HDFS_SCAN_NODE (id=0)
45. - TotalTime: 1.3m (76164245478)
46. Instance 94481a81355e51e4:51fd9f9500000045 (host=xxxxx-xxx-cdh-
cdn007.xxx.XXXXXX.com:22000)
47. ...
48. HDFS_SCAN_NODE (id=0)
49. - TotalTime: 1.3m (75588331159)
50. Instance 94481a81355e51e4:51fd9f9500000044 (host=xxxxx-xxx-cdh-
cdn010.xxx.XXXXXX.com:22000)
51. ...
52. HDFS_SCAN_NODE (id=0)
53. - TotalTime: 1.2m (73596530464)
54. Instance 94481a81355e51e4:51fd9f9500000042 (host=xxxxx-xxx-cdh-
cdn018.xxx.XXXXXX.com:22000)
55. ...
56. HDFS_SCAN_NODE (id=0)
57. - TotalTime: 1.2m (72946574082)
58. Instance 94481a81355e51e4:51fd9f9500000055 (host=xxxxx-xxx-cdh-
cdn026.xxx.XXXXXX.com:22000)
59. ...
60. HDFS_SCAN_NODE (id=0)
61. - TotalTime: 1.2m (69918383242)
62. Instance 94481a81355e51e4:51fd9f9500000054 (host=xxxxx-xxx-cdh-
cdn011.xxx.XXXXXX.com:22000)
63. ...
64. HDFS_SCAN_NODE (id=0)
65. - TotalTime: 1.2m (69355611992)
66. Instance 94481a81355e51e4:51fd9f9500000051 (host=xxxxx-xxx-cdh-
cdn009.xxx.XXXXXX.com:22000)
67. ...
68. HDFS_SCAN_NODE (id=0)
69. - TotalTime: 1.1m (68527129814)
70. Instance 94481a81355e51e4:51fd9f9500000048 (host=xxxxx-xxx-cdh-
cdn016.xxx.XXXXXX.com:22000)
71. ...
72. HDFS_SCAN_NODE (id=0)
73. - TotalTime: 1.1m (67249633571)
74. Instance 94481a81355e51e4:51fd9f9500000047 (host=xxxxx-xxx-cdh-
cdn013.xxx.XXXXXX.com:22000)
75. ...
76. HDFS_SCAN_NODE (id=0)
77. - TotalTime: 1.1m (63989781076)
78. Instance 94481a81355e51e4:51fd9f9500000041 (host=xxxxx-xxx-cdh-
cdn028.xxx.XXXXXX.com:22000)
79. ...
80. HDFS_SCAN_NODE (id=0)
81. - TotalTime: 1.0m (62739870946)
82. Instance 94481a81355e51e4:51fd9f950000003f (host=xxxxx-xxx-cdh-
cdn025.xxx.XXXXXX.com:22000)
83. ...
84. HDFS_SCAN_NODE (id=0)
85. - TotalTime: 1.0m (62136511127)
86. Instance 94481a81355e51e4:51fd9f950000004c (host=xxxxx-xxx-cdh-
cdn005.xxx.XXXXXX.com:22000)
87. ...
88. HDFS_SCAN_NODE (id=0)
89. - TotalTime: 1.0m (61943905274)
90. Instance 94481a81355e51e4:51fd9f9500000046 (host=xxxxx-xxx-cdh-
cdn027.xxx.XXXXXX.com:22000)
91. ...
92. HDFS_SCAN_NODE (id=0)
93. - TotalTime: 1.0m (61955797776)
94. Instance 94481a81355e51e4:51fd9f950000004e (host=xxxxx-xxx-cdh-
cdn021.xxx.XXXXXX.com:22000)
95. ...
96. HDFS_SCAN_NODE (id=0)
97. - TotalTime: 1.0m (60045780252)
98. Instance 94481a81355e51e4:51fd9f9500000040 (host=xxxxx-xxx-cdh-
cdn029.xxx.XXXXXX.com:22000)
99. ...
100. HDFS_SCAN_NODE (id=0)
101. - TotalTime: 58.05s (58048904552)
102. Instance 94481a81355e51e4:51fd9f950000004a (host=xxxxx-xxx-cdh-
cdn023.xxx.XXXXXX.com:22000)
103. ...
104. HDFS_SCAN_NODE (id=0)
105. - TotalTime: 57.34s (57338024825)
106. Instance 94481a81355e51e4:51fd9f9500000049 (host=xxxxx-xxx-cdh-
cdn008.xxx.XXXXXX.com:22000)
107. ...
108. HDFS_SCAN_NODE (id=0)
109. - TotalTime: 53.13s (53130104765)
110. Instance 94481a81355e51e4:51fd9f9500000056 (host=xxxxx-xxx-cdh-
cdn004.xxx.XXXXXX.com:22000)
111. ...
112. HDFS_SCAN_NODE (id=0)
113. - TotalTime: 43.24s (43238668974)
114. Instance 94481a81355e51e4:51fd9f950000003e (host=xxxxx-xxx-cdh-
cdn015.xxx.XXXXXX.com:22000)
115. ...
116. HDFS_SCAN_NODE (id=0)
117. - TotalTime: 56.2m (3373973559713)
I have omitted the irrelevant information and kept only the parts I was
interested in. Now I could clearly see which host was the bottleneck: it was
host xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, which took 56.2
minutes, while ALL of the other hosts took around 40 seconds to 2 minutes.
Then I remembered that another HDFS SCAN had the same symptom, operator 1
(01:SCAN HDFS), so I ran the same "egrep" command (remember that the
indentation can differ between operators, so I needed to find them in the PROFILE
first and copy exactly the amount of whitespace in front of them to get the result
I wanted):
1. egrep ' Instance .*\)|^ HDFS_SCAN_NODE \(id=1\)|^ - TotalTime: ' profile-example.txt
And again, the result confirmed the same thing:
1. ....
2. Instance 94481a81355e51e4:51fd9f950000000c (host=xxxxx-xxx-cdh-
cdn015.xxx.XXXXXX.com:22000)
3. ...
4. HDFS_SCAN_NODE (id=1)
5. - TotalTime: 13.3m (798747290751)
6. ...
7. Instance 94481a81355e51e4:51fd9f9500000007 (host=xxxxx-xxx-cdh-
cdn001.xxx.XXXXXX.com:22000)
8. ...
9. HDFS_SCAN_NODE (id=1)
10. - TotalTime: 28.16s (28163113611)
11. Instance 94481a81355e51e4:51fd9f9500000018 (host=xxxxx-xxx-cdh-
cdn009.xxx.XXXXXX.com:22000)
12. ...
13. HDFS_SCAN_NODE (id=1)
14. - TotalTime: 23.29s (23285966387)
15. ...
It was clear that the same host, xxxxx-xxx-cdh-cdn015.xxx.XXXXXX.com, had
exactly the same problem: it ran much more slowly than the other hosts, 13.3
minutes versus 28.16 seconds.
I then came to the conclusion that something had happened to that host and
needed to be fixed.
To confirm the theory resulting from the above investigation, I asked the user to
shut down the Impala Daemon on that host and test the query again, and they
confirmed that the issue was resolved. Later on, they updated me that they had
identified hardware issues on that host and it had been decommissioned for
maintenance.
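As a side note, if you do not want to fiddle with the exact indentation for every operator, the per-host TotalTime extraction above can be approximated with a small awk sketch. It assumes that an operator's own TotalTime is the first TotalTime counter printed after its header, which is the case in the profiles shown here:
# Print each fragment instance (host) followed by the TotalTime of operator id=1
awk '/ Instance .*host=/       { inst = $0 }
     /HDFS_SCAN_NODE \(id=1\)/ { print inst; print $0; want = 1 }
     want && /- TotalTime: /   { print $0; want = 0 }' profile-example.txt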
Summary
Admission result:
Here we can see if the query was admitted immediately, queued or rejected. If
the query was queued or rejected, we can see the reason.
Start Time:
The time when the query was submitted.
End Time:
The time when the query was unregistered.
ExecSummary:
The summary of each operator (exec node), including time, number of rows and
memory usage. It is not visible until the query is closed.
Query Timeline:
Timeline of important events in the execution process (BE/C++). The timeline
starts from the time when the query was registered.
ImpalaServer
ClientFetchWaitTimer:
Time spent by the coordinator while idle waiting for a client to fetch rows.
MetastoreUpdateTimer:
Time spent to gather and publish all required updates to the metastore.
RowMaterializationTimer:
Time spent by the coordinator to fetch rows.
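These ImpalaServer counters are exactly what to check for the slow-fetch scenario described in Part 3, for example (assuming a saved profile file):
# A large ClientFetchWaitTimer relative to RowMaterializationTimer points at a slow
# client rather than a slow query
grep -E 'ClientFetchWaitTimer|RowMaterializationTimer' profile-example.txt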
Execution Profile
Metrics in Coordinator
CatalogOpExecTimer
Time spent by the coordinator to send catalog operation execution request and
wait for the response from the catalogd.
ComputeScanRangeAssignmentTimer:
Time spent computing the assignment of scan ranges to hosts for each scan node.
FiltersReceived:
The total number of filter updates received (always 0 if filter mode is not
GLOBAL). Excludes repeated broadcast filter updates.
FinalizationTimer:
Total time spent in finalization (typically 0 except for INSERT into HDFS tables).
ExecTreeExecTime:
Time spent by the execution node and its descendants in this fragment to retrieve rows
and return them via row_batch.
ExecTreeOpenTime:
Time spent by the execution node and its descendants in this fragment to perform any
preparatory work prior to retrieving rows.
Filter X arrival:
The amount of time waited since registration for the filter to arrive. 0 means that
filter has not yet arrived.
OpenTime:
Time spent in fragment Open() logic. It includes ExecTreeOpenTime, the time to
generate LLVM code and the time to open sink in this fragment.
PerHostPeakMemUsage:
A counter for the per query, per host peak mem usage. Note that this is not the
max of the peak memory of all fragments running on a host since it needs to take
into account when they are running concurrently. All fragments for a single query
on a single host will have the same value for this counter.
PrepareTime:
Time to prepare for fragment execution.
RowsProduced:
The number of rows returned by this fragment instance.
TotalNetworkReceiveTime:
Total time spent receiving over the network (across all threads).
TotalNetworkSendTime:
Total time spent waiting for RPCs that send row batches to complete. This is a
combination of the network time to send the RPC payload to the destination and the
time the request spends queued and being processed on the receiving side.
TotalStorageWaitTime:
Total time waiting in storage (across all threads).
TotalThreads:
Total CPU utilization for all threads in this plan fragment.
Common metrics:
InactiveTotalTime:
Total time spent waiting (on non-children) that should not be counted when
computing local_time_percent_. This is updated for example in the exchange
node when waiting on the sender from another fragment.
LocalTime:
Time spent in this node (not including the children). Computed in
ComputeTimeInProfile().
TotalTime:
The total elapsed time.
CumulativeAllocationBytes:
Bytes of buffers allocated via BufferAllocator::AllocateBuffer().
CumulativeAllocations:
The number of buffers allocated via BufferAllocator::AllocateBuffer().
PeakReservation:
The tracker’s peak reservation in bytes.
PeakUnpinnedBytes:
The peak total size of unpinned pages.
PeakUsedReservation:
The tracker’s peak usage in bytes.
ReadIoBytes:
Total bytes read from disk.
ReadIoOps:
The total number of read I/O operations issued.
ReadIoWaitTime:
Amount of time spent waiting for reads from disk to complete.
ReservationLimit:
The hard limit on the tracker’s reservations.
WriteIoBytes:
Total bytes written to disk.
WriteIoOps:
The total number of write I/O operations issued.
WriteIoWaitTime:
Amount of time spent waiting for writes to disk to complete.
EosSent:
Total number of EOS sent.
NetworkThroughput:
Summary of network throughput for sending row batches. Network time also
includes queuing time in KRPC transfer queue for transmitting the RPC requests
and receiving the responses.
OverallThroughput:
Throughput per total time spent in the sender.
RowsReturned:
The number of row batches enqueued into the row batch queue.
RowsSent:
The total number of rows sent.
RpcFailure:
The total number of times RPC fails or the remote responds with a non-retryable
error.
RpcRetry:
Number of TransmitData() RPC retries due to remote service being busy.
SerializeBatchTime:
Time for serializing row batches.
TotalBytesSent:
The total number of bytes sent. Updated on RPC completion.
TransmitDataRPCTime:
The concurrent wall time spent sending data over the network.
UncompressedRowBatchSize:
The total number of bytes of row batches before compression.
BytesReceived:
The total number of bytes of serialized row batches received.
BytesSkipped:
The number of bytes skipped when advancing to next sync on error.
DataWaitTime:
Total wall-clock time spent waiting for data to be available in queues.
DeferredQueueSize:
Time series of the number of deferred row batches, samples
‘num_deferred_rpcs_’.
DeserializeRowBatchTimer:
Total wall-clock time spent deserializing row batches.
DispatchTime:
Summary stats of time which RPCs spent in KRPC service queue before being
dispatched to the RPC handlers.
FirstBatchArrivalWaitTime:
Time spent waiting until the first batch arrives across all queues.
FirstBatchWaitTime:
Wall-clock time spent waiting for the first batch arrival across all queues.
SendersBlockedTimer:
Wall time senders spend waiting for the recv buffer to have the capacity.
SendersBlockedTotalTimer:
Total time (summed across all threads) spent waiting for the recv buffer to be
drained so that new batches can be added. Remote plan fragments are blocked
for the same amount of time.
TotalBatchesEnqueued:
The total number of deserialized row batches enqueued into the row batch
queues.
TotalBatchesReceived:
The total number of serialized row batches received.
TotalBytesDequeued:
The number of bytes of deserialized row batches dequeued.
TotalBytesReceived:
The total number of bytes of serialized row batches received.
TotalEarlySenders:
The total number of senders which arrive before the receiver is ready.
TotalEosReceived:
Total number of EOS received.
TotalGetBatchTime:
Total wall-clock time spent in SenderQueue::GetBatch().
TotalHasDeferredRPCsTime:
Total wall-clock time in which the ‘deferred_rpcs_’ queues are not empty.
TotalRPCsDeferred:
Total number of RPCs whose responses are deferred because of early senders or
full row batch queue.
CompressTimer:
Time spent compressing data before writing into files.
EncodeTimer:
Time spent converting tuple to the on-disk format.
FilesCreated:
The number of created files.
HdfsWriteTimer:
Time spent writing to HDFS.
KuduApplyTimer:
Time spent applying Kudu operations. In normal circumstances, the Kudu
operation should be negligible because it is asynchronous with
AUTO_FLUSH_BACKGROUND enabled. Significant KuduApplyTimer may indicate
that Kudu cannot buffer and send rows as fast as the sink can write them.
NumRowErrors:
The number of (Kudu) rows with errors.
PartitionsCreated:
The total number of partitions created.
RowsInserted:
The number of inserted rows.
RowsProcessedRate:
The rate at which the sink consumes and processes rows, i.e. writing rows to Kudu
or skipping rows that are known to violate nullability constraints.
TotalNumRows:
The total number of rows processed, i.e. rows written to Kudu and also rows with
errors.
AverageScannerThreadConcurrency:
The average number of scanner threads executing between Open() and the time
when the scan completes. Present only for multithreaded scan nodes.
BytesRead:
Total bytes read from disk by this scan node. Provided as a counter as well as a
time series that samples the counter. Only implemented for scan node subclasses
that expose the bytes read, e.g. HDFS and HBase.
BytesReadDataNodeCache:
The total number of bytes read from the data node cache.
BytesReadLocal:
The total number of bytes read locally.
BytesReadRemoteUnexpected:
The total number of bytes read remotely that were expected to be local.
BytesReadShortCircuit:
The total number of bytes read via short circuit read.
CachedFileHandlesHitCount:
The total number of file handle opens where the file handle was present in the
cache.
CachedFileHandlesMissCount:
The total number of file handle opens where the file handle was not in the cache.
CollectionItemsRead:
The total number of nested collection items read by the scan. Only created for
scans (e.g. Parquet) that support nested types.
DecompressionTime:
Time spent decompressing bytes.
DelimiterParseTime:
Time spent parsing the bytes for delimiters in text files.
FooterProcessingTime:
Average and min/max time spent processing the (parquet) footer by each split.
MaterializeTupleTime:
Wall clock time spent materializing tuples and evaluating predicates. Usually, it’s
affected by the load on the CPU and the complexity of expressions.
MaxCompressedTextFileLength:
The size of the largest compressed text file to be scanned. This is used to
estimate scanner thread memory usage.
NumColumns:
The number of (parquet) columns that need to be read.
NumDictFilteredRowGroups:
The number of (parquet) row groups skipped due to dictionary filter.
NumDisksAccessed:
The number of distinct disks accessed by HDFS scan. Each local disk is counted as
a disk and each remote disk queue (e.g. HDFS remote reads, S3) is counted as a
distinct disk.
NumRowGroups:
The number of (parquet) row groups that need to be read.
NumScannerThreadsStarted:
The number of scanner threads started for the duration of the scan node. This is
at most the number of scan ranges but should be much less since a single
scanner thread will likely process multiple scan ranges. This is *not* the same as
peak scanner thread concurrency because the number of scanner threads can
fluctuate during the execution of the scan.
NumScannersWithNoReads:
The number of scanners that end up doing no reads because their splits don’t
overlap with the midpoint of any row-group in the (parquet) file.
NumStatsFilteredRowGroups:
The number of row groups that are skipped because of Parquet row group
statistics.
PeakScannerThreadConcurrency:
The peak number of scanner threads executing at any one time. Present only for
multithreaded scan nodes.
PerReadThreadRawHdfsThroughput:
The read throughput in bytes/sec for each HDFS read thread while it is executing
I/O operations on behalf of this scan.
RemoteScanRanges:
The total number of remote scan ranges.
RowBatchBytesEnqueued:
The number of row batches and bytes enqueued in the scan node’s output queue.
RowBatchesEnqueued:
The number of row batches enqueued into the row batch queue.
RowBatchQueueCapacity:
The capacity in batches of the scan node’s output queue.
RowBatchQueueGetWaitTime:
Wall clock time that the fragment execution thread spent blocked waiting for row
batches to be added to the scan node’s output queue.
RowBatchQueuePeakMemoryUsage:
Peak memory consumption of row batches enqueued in the scan node’s output
queue.
RowBatchQueuePutWaitTime:
Wall clock time that the scanner threads spent blocked waiting for space in the
scan node’s output queue when it is full.
RowsRead:
The number of top-level rows/tuples read from the storage layer, including those
discarded by predicate evaluation. Used for all types of scans.
ScannerIoWaitTime:
The total amount of time scanner threads spent waiting for I/O. This is
comparable to ScannerThreadsTotalWallClockTime in the traditional HDFS scan
nodes and the scan node total time for the MT_DOP > 1 scan nodes. Low values
show that each I/O completed before or around the time that the scanner thread
was ready to process the data. High values show that scanner threads are
spending significant time waiting for I/O instead of processing data. Note that if
CPU load is high, this can include the time that the thread is runnable but not
scheduled.
ScannerThreadsSysTime /
ScannerThreadsUserTime /
ScannerThreadsVoluntaryContextSwitches /
ScannerThreadsInvoluntaryContextSwitches:
These are aggregated counters across all scanner threads of this scan node. They
are taken from getrusage. See RuntimeProfile::ThreadCounters for details.
ScannerThreadsTotalWallClockTime:
Total wall clock time spent in all scanner threads.
ScanRangesComplete:
The number of scan ranges completed. Initialized for scans that have a concept of
“scan range”.
TotalRawHbaseReadTime:
The total wall clock time spent in HBase read calls. For example, if we have 3
threads and each spent 1 sec, this counter will report 3 sec.
TotalRawHdfsOpenFileTime:
The total wall clock time spent by Disk I/O threads in HDFS open operations. For
example, if we have 3 threads and each spent 1 sec, this counter will report 3 sec.
TotalRawHdfsReadTime:
The total wall clock time spent by Disk I/O threads in HDFS read operations. For
example, if we have 3 threads and each spent 1 sec, this counter will report 3 sec.
TotalReadThroughput:
BytesRead divided by the total wall clock time that this scan was executing (from
Open() to Close()). This gives the aggregate rate that data is read from disks. If
this is the only scan executing, ideally this will approach the maximum bandwidth
supported by the disks.
Metrics in Runtime Filter:
BloomFilterBytes:
The total amount of memory allocated to Bloom Filters.
Rows processed:
Total number of rows to which each filter was applied
Rows rejected:
Total number of rows that each filter rejected.
Rows total:
The total number of rows that each filter could have been applied to (if it were
available from row 0).
BuildRowsPartitioned:
The number of build rows that have been partitioned.
BuildRowsPartitionTime:
Time spent partitioning build rows.
BuildTime:
Time to prepare build side.
HashBuckets:
The total number of hash buckets across all partitions.
HashCollisions:
The number of cases where we had to compare buckets with the same hash
value, but the row equality failed.
HashTablesBuildTime:
Time spent building hash tables.
LargestPartitionPercent:
The largest fraction after repartitioning. This is expected to be 1 /
PARTITION_FANOUT. A value much larger indicates skew.
MaxPartitionLevel:
Level of the max partition (i.e. number of repartitioning steps).
NullAwareAntiJoinEvalTime:
Time spent evaluating other_join_conjuncts for NAAJ.
NumHashTableBuildsSkipped:
The number of partitions which had zero probe rows and we therefore didn’t
build the hash table.
NumRepartitions:
The number of partitions that have been repartitioned.
PartitionsCreated:
The total number of partitions created.
ProbeRows:
The number of probe (left child) rows.
ProbeRowsPartitioned:
The number of probe rows that have been partitioned.
ProbeTime:
Time to process the probe (left child) batch.
RepartitionTime:
Time spent repartitioning and building hash tables of any resulting partitions that
were not spilled.
SpilledPartitions:
The number of partitions that have been spilled.
InMemorySortTime:
Time spent sorting initial runs in memory.
MergeGetNext:
Time to get and return the next batch of sorted rows from this merger.
MergeGetNextBatch:
Time spent in calls to get the next batch of rows from the input run.
NumRowsPerRun:
Min, max, and avg size of runs in number of tuples (Sorter).
TotalMergesPerformed:
Number of merges of sorted runs.
SortDataSize:
The total size of the initial runs in bytes.
SpilledRuns:
The number of runs that were unpinned and may have spilled to disk, including
initial and intermediate runs.
Metrics in Aggregation Node:
BuildTime:
Time to prepare build side.
GetResultsTime:
Time spent returning the aggregated rows.
HashBuckets:
The total number of hash buckets across all partitions.
HTResizeTime:
Total time spent resizing hash tables.
LargestPartitionPercent:
The largest fraction after repartitioning. This is expected to be 1 /
PARTITION_FANOUT. A value much larger indicates skew.
MaxPartitionLevel:
Level of the max partition (i.e. number of repartitioning steps).
NumRepartitions:
The number of partitions that have been repartitioned.
PartitionsCreated:
The total number of partitions created.
ReductionFactorEstimate:
The estimated reduction of the pre-aggregation.
ReductionFactorThresholdToExpand:
Expose the minimum reduction factor to continue growing the hash tables.
RowsPassedThrough:
The number of rows passed through without aggregation.
RowsRepartitioned:
The number of rows that have been repartitioned.
SpilledPartitions:
The number of partitions that have been spilled.
StreamingTime:
Time spent in streaming pre-aggregation algorithm.
Metrics in CodeGen:
CodegenTime:
Time spent doing codegen (adding IR to the module).
CompileTime:
Time spent compiling the module.
ExecTreePrepareTime:
Time to prepare the sink, set up internal structures in the execution node and its
descendants, start the profile-reporting thread and wait until it is active.
LoadTime:
Time spent reading the .ir file from the file system.
ModuleBitcodeSize:
The total size of bitcode modules loaded in bytes.
NumFunctions:
The number of functions that are optimized and compiled after pruning unused
functions from the module.
NumInstructions:
The number of instructions that are optimized and compiled after
pruning unused functions from the module.
OptimizationTime:
Time spent optimizing the module.
PrepareTime:
Time spent constructing the in-memory module from the ir.
Metrics in TmpFileMgr:
TmpFileMgr provides an abstraction for management of temporary (a.k.a. scratch)
files on the filesystem and I/O to and from them.
ScratchBytesRead:
The number of bytes read from disk (includes reads started but not yet
completed).
ScratchBytesWritten:
The number of bytes written to disk (includes writes started but not yet
completed).
ScratchFileUsedBytes:
Amount of scratch space allocated in bytes.
ScratchReads:
The number of READ operations (includes reads started but not yet completed).
ScratchWrites:
The number of write operations (includes writes started but not yet completed).
TotalEncryptionTime:
Time spent in disk spill encryption, decryption, and integrity checking.
TotalReadBlockTime:
Time spent waiting for disk reads.