Openstack Troubleshooting Field Survival Guide

This document summarizes OpenStack troubleshooting steps for beginners. It outlines generic troubleshooting principles, including understanding component interactions and available tools. It then walks through debugging a specific failure scenario - an instance creation failure due to authentication issues. Troubleshooting steps are provided for both the client and operator, such as checking logs and services.

04 / 30 / 2019

OpenStack troubleshooting: a field survival guide

MARS TOKTONALIEV, Nokia, @marstokt
MARK KORONDI, Acronis / Freelancer, @kmarc

bit.ly/openstack-troubleshoot
What is this talk about?
● Beginner’s session
● Generic troubleshooting steps for the majority of OpenStack components
● Principles of finding what causes OpenStack components’ erroneous
behavior
● Where to search and how to ask for help
● Exercises covering a few failure scenarios

DevStack virtual machine
bit.ly / upstream-institute

● Pre-installed virtual machine


○ Runs with VirtualBox / VMware / KVM, on Windows / Linux / Mac
○ Requires minimum 5GB free RAM (at least 8GB on the host)
○ Has a basic desktop environment and tools to set up devstack

● Interested in contributing?
○ https://docs.openstack.org/upstream-training

Why troubleshoot? And how?!

Why troubleshoot
● Because google://software+is+broken
● Complexity increases room for errors
● OpenStack - the software
○ Easy concept: “Just a bunch of python scripts with a nice WebGUI”
○ Yet complex: >20M LOC (including docs), ~65K commits in a year across ~60 projects

● OpenStack - the platform


○ Deployed on hundreds / thousands of servers in a DC (horizontal complexity)
○ Components layered on top of each other (vertical complexity)
○ Services communicate across clusters (mesh complexity)
○ Redundancy for high availability (temporal complexity)
Basic troubleshooting recipe
● Read the operations guide
○ https://docs.openstack.org/operations-guide/ops-maintenance.html

● Apply knowledge

● Problems fixed!

● Jokes aside:
○ Know your system to locate failure (what components, how they work together)
○ Understand the layers (minimal understanding from the kernel up to client UI)
○ Learn the tools that can help in troubleshooting (searching logs, checking statuses)
○ Reach out for help (community is amazing!)
Best approach to troubleshooting
● Avoid troubles!
○ Monitoring, logging
○ Alerting
○ Blue-Green deployments
○ Dev / staging environments
○ Infrastructure-as-code
○ Log analytics, etc.

● This talk does not address that perfect world scenario

What can go wrong during a VM instance creation?

Nova instance creation flow (diagram)

Source: Pradeep Kumar, https://www.linuxtechi.com/step-by-step-instance-creation-flow-in-openstack/
Nova instance creation flow #1
1. The Horizon Dashboard or OpenStack CLI authenticates against the
Identity service (Keystone) via its REST API
○ Keystone authenticates the user and replies with a token, which is used for authenticating
requests to other components

$ openstack server create


Missing value auth-url required for auth plugin password
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test1
Failed to discover available identity versions when contacting http://192.168.10.15/identity. Attempting to parse
version from URL.
Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is
correct. Unable to establish connection to http://192.168.10.15/identity: HTTPConnectionPool(host='192.168.10.15',
port=80): Max retries exceeded with url: /identity (Caused by NewConnectionError('<urllib3.connection.HTTPConnection
object at 0x7fd0293c99d0>: Failed to establish a new connection: [Errno 111] Connection refused',))

Nova instance creation flow #1 - debugging
Debugging steps on the user side

Broken environment:

$ echo $OS_AUTH_URL
# no output
$ nslookup myopenstack.com # dig myopenstack.com
...
** server can't find myopenstack.com: NXDOMAIN
...
$ telnet 192.168.10.15 80
Trying 192.168.10.15... # timeout

Working environment:

$ echo $OS_AUTH_URL
http://controller.myopenstack.com/identity
$ nslookup myopenstack.com # dig myopenstack.com
...
Non-authoritative answer:
Name: myopenstack.com
Address: 192.168.10.15
...
$ telnet 192.168.10.15 80
Trying 192.168.10.15...
Connected to 192.168.10.15.
Escape character is '^]'.
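
The user-side checks all start from the value of OS_AUTH_URL. As a minimal sketch, the helper below splits such a URL into a host and a port that can then be passed to nslookup and telnet. The function name auth_url_hostport is hypothetical, and the sketch assumes a plain http or https URL without credentials or an IPv6 literal.

```shell
# Print "host port" for an http(s) URL such as $OS_AUTH_URL.
# Hypothetical helper; assumes no user:password@ part and no IPv6 literal.
auth_url_hostport() {
    _url=$1
    case $_url in
        https://*) _port=443 ;;   # default port when none is given
        http://*)  _port=80 ;;
        *) echo "unsupported or empty URL: '$_url'" >&2; return 1 ;;
    esac
    _rest=${_url#*://}        # drop the scheme
    _hostport=${_rest%%/*}    # drop the path
    case $_hostport in
        *:*) _port=${_hostport##*:}; _host=${_hostport%%:*} ;;
        *)   _host=$_hostport ;;
    esac
    echo "$_host $_port"
}

# Example (not run here):
#   hp=$(auth_url_hostport "$OS_AUTH_URL") && nslookup "${hp% *}" && telnet $hp
```

An empty OS_AUTH_URL makes the helper fail early, which mirrors the "no output" case: the fix is to source the openrc file before looking at the network.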

Nova instance creation flow #1 - debugging
Debugging steps on the operator's side

Before the fix:

$ systemctl status apache2.service
● apache2.service - The Apache HTTP Server
...
Active: inactive (dead) since ...
...
$ a2query -s keystone-wsgi-public
No site matches keystone-wsgi-public (disabled by site administrator)

After the fix:

$ sudo systemctl restart apache2.service
$ systemctl status apache2.service
● apache2.service - The Apache HTTP Server
...
Active: active (running) since ...
...
$ sudo a2ensite keystone-wsgi-public
$ a2query -s keystone-wsgi-public
keystone-wsgi-public (enabled by site administrator)

Nova instance creation flow #1 - debugging
● On the client side, use --debug to retrieve Request ID

$ openstack token issue --debug 2>&1 | grep Request-ID


...
The request you have made requires authentication. (HTTP 401) (Request-ID: req-56d543f9-079d-42c0-9eb8-a3dfbc2f90c5)
...

● On the server side, check logs
○ https://docs.openstack.org/keystone/latest/configuration/samples/keystone-conf.html
○ [DEFAULT]/log_file or systemd

$ journalctl -u devstack@keystone.service | grep req-56d543f9-079d-42c0-9eb8-a3dfbc2f90c5
Apr 27 03:14:32 upstream-training devstack@keystone.service[18195]: WARNING keystone.server.flask.application [None
req-56d543f9-079d-42c0-9eb8-a3dfbc2f90c5 None None] Authorization failed. The request you have made requires
authentication. from 192.168.10.15: Unauthorized: The request you have made requires authentication.
$ journalctl -u devstack@keystone.service | grep -E 'WARNING|ERROR' # -f to watch
$ journalctl -u devstack@keystone.service
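
Copying the request ID by hand from the --debug output is error-prone. The sketch below pulls every req-<uuid> token out of whatever text it reads on stdin, so the IDs can be fed straight into a log search. The function name extract_request_ids is hypothetical, grep -o is assumed to be available (GNU and BSD grep have it), and the DevStack unit name devstack@keystone.service in the usage comment is an assumption about the environment.

```shell
# Print the unique request IDs (req-<uuid>) found on stdin, one per line.
# Hypothetical helper; relies on grep -o, which GNU and BSD grep provide.
extract_request_ids() {
    grep -oE 'req-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

# Example (needs a real cloud; the unit name is an assumption):
#   openstack token issue --debug 2>&1 | extract_request_ids |
#   while read -r id; do
#       journalctl -u devstack@keystone.service | grep "$id"
#   done
```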

Nova instance creation flow #2
2. An authenticated request to Nova is issued by connecting to nova-api
○ https://httpstatuses.com/503 - not quite helpful

$ source openrc admin
$ openstack endpoint list --service compute --column URL
+-----------------------------------+
| URL                               |
+-----------------------------------+
| http://192.168.10.15/compute/v2.1 |
+-----------------------------------+
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test2
Unknown Error (HTTP 503)
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test2 --debug
REQ: curl -g -i -X GET http://192.168.10.15/compute/v2.1/flavors/m1.nano -H "Accept: application/json" -H "User-Agent:
python-novaclient" -H "X-Auth-Token: {SHA256}6fa0136025917154a4e984b72b6c5ebb09e5688c7f4a14c67fe62f88d1c1a3bc" -H
"X-OpenStack-Nova-API-Version: 2.1"
Resetting dropped connection: 192.168.10.15

Nova instance creation flow #2 - debugging
Debugging steps on the user side

Host unreachable:

$ ping 192.168.10.15
PING 192.168.10.15 (192.168.10.15) 56(84) bytes of data.
# timeout

Host reachable:

$ ping 192.168.10.15
PING 192.168.10.15 (192.168.10.15) 56(84) bytes of data.
64 bytes from 192.168.10.15: icmp_seq=1 ttl=64 time=0.1
...

Debugging steps on the operator's side

Before the fix:

$ curl http://192.168.10.15/compute/v2.1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
...
<p>The requested URL /compute was not found on this server.</p>
<address>Apache/2.4.29 Server at 192.168.10.15 Port 80</address>
...

After the fix:

$ a2ensite nova-api-wsgi.conf
$ curl http://192.168.10.15/compute/v2.1
{"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}

Nova instance creation flow #3
3. nova-api queries Keystone for authentication and authorization of the
incoming request
○ Keystone validates the token and replies with updated authentication headers with
authorization (roles / permissions) data attached

$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test3
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'keystoneauth1.exceptions.discovery.DiscoveryFailure'> (HTTP 500) (Request-ID:
req-35499014-c704-4eb3-bcf0-866f59651482)

Nova instance creation flow #3 - debugging
Debugging steps on the operator's side
● Get a request ID from the client side (--debug)

$ journalctl -u devstack@n-api | grep 7764b3d2-1f14-453a-8a0c-dd696695f194 | grep ERROR
Apr 28 20:24:27 upstream-training devstack@n-api.service[21131]: ERROR nova.api.openstack.wsgi [None
req-7764b3d2-1f14-453a-8a0c-dd696695f194 demo demo] Unexpected exception in API method: DiscoveryFailure: Could not
find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable
to establish connection to http://192.168.10.16/identity: HTTPConnectionPool(host='192.168.10.16', port=80): Max
retries exceeded with url: /identity (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at
0x7f46650f39d0>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))

● “DiscoveryFailure: Could not find versioned identity endpoints”
● “Please check that your auth_url is correct”
● Check configuration file
○ https://docs.openstack.org/nova/latest/configuration/config.html
Nova instance creation flow #4
4. nova-api checks for conflicts within the Database and creates an initial
database entry for the new VM instance

$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test4
# very long wait time
Unknown Error (HTTP 500)

Nova instance creation flow #4 - debugging
Debugging steps on the operator's side:
● Sometimes it’s worth looking up WARNING / ERROR messages in logs

$ journalctl -u devstack@n-api | grep -E "ERROR|WARNING"
Apr 28 21:26:49 upstream-training devstack@n-api.service[25453]: ERROR nova DBConnectionError:
(pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 101] ENETUNREACH)") ...
… OR …
Apr 28 21:46:48 upstream-training devstack@n-api.service[27737]: ERROR nova OperationalError:
(pymysql.err.OperationalError) (1045, u"Access denied for user 'root'@'localhost' (using password: YES)") ...

● “DBConnectionError: Can't connect to MySQL server on '127.0.0.1'”
● “OperationalError: Access denied for user 'root'@'localhost' (using password: YES)”
● Check configuration file
○ https://docs.openstack.org/nova/latest/configuration/config.html
○ [database] / [api_database]
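
Both database errors come down to what the [database] and [api_database] sections actually contain. The sketch below reads one option from one section of an INI-style file so the configured value can be compared with the host and user in the error message. The function name ini_get is hypothetical, the path /etc/nova/nova.conf and the option name connection in the usage comment are assumptions about the deployment, and the parser is deliberately simple (no line continuations or interpolation).

```shell
# Print the value of <option> in [<section>] of an INI-style file.
# Usage: ini_get <file> <section> <option>
# Hypothetical helper; handles plain "key = value" lines only.
ini_get() {
    awk -v section="$2" -v option="$3" '
        /^[ \t]*[#;]/ { next }              # skip comment lines
        /^[ \t]*\[/ {                       # section header such as [database]
            name = $0
            sub(/^[ \t]*\[/, "", name)
            sub(/\].*$/, "", name)
            in_section = (name == section)
            next
        }
        in_section {
            eq = index($0, "=")             # split on the first "=" only
            if (eq == 0) next
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == option) { print val; exit }
        }
    ' "$1"
}

# Example (path and option name are assumptions):
#   ini_get /etc/nova/nova.conf database connection
```

The printed connection string usually embeds the database password, so it is best kept out of tickets and chat logs.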
Nova instance creation flow #5
5. nova-api sends an RPC request through the Message Queue to
nova-scheduler in order to find a hypervisor to launch the VM on

$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test5
# very, very, very long wait time
Unknown Error (HTTP 500)

Nova instance creation flow #5 - debugging
Debugging steps on the operator's side - got the gist, right?

$ journalctl -u devstack@n-api | grep -E "ERROR|WARNING"
Apr 28 21:59:43 upstream-training devstack@n-api.service[28186]: WARNING oslo.messaging._drivers.impl_rabbit [-]
Unexpected error during heartbeart thread processing, retrying...: error: [Errno 111] ECONNREFUSED
Apr 28 21:59:53 upstream-training devstack@n-api.service[28186]: ERROR oslo.messaging._drivers.impl_rabbit [None
req-d287384b-1b24-497a-acdc-199801f98a23 demo demo] [07c3a79e-0532-4861-afd7-4b4e3737e7fb] AMQP server on
192.168.10.15:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds.: error: [Errno 111]
ECONNREFUSED
$ systemctl status rabbitmq-server # omitted
$ journalctl -u rabbitmq-server # omitted

● “AMQP server on 192.168.10.15:5672 is unreachable”
● Check MQ health, logs and configuration file
○ https://docs.openstack.org/operations-guide/ops-maintenance-rabbitmq.html
○ https://www.rabbitmq.com/troubleshooting.html
Nova instance creation flow #6
6. nova-scheduler picks the request from the MQ
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test6
+-----------------------------+-----------------------------------------------------------------+
| Field                       | Value                                                           |
+-----------------------------+-----------------------------------------------------------------+
| OS-EXT-STS:task_state       | scheduling                                                      |
| OS-EXT-STS:vm_state         | building                                                        |
...
| name                        | test6                                                           |
| status                      | BUILD                                                           |
+-----------------------------+-----------------------------------------------------------------+
$ openstack server show test6 -c status -f value
BUILD
# After a while
$ openstack server show test6 -c status -f value
ERROR
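
Re-running the show command by hand until the status changes gets tedious. A minimal sketch of a polling loop follows: it re-runs any status command until the output is no longer BUILD or a retry budget is used up, then prints the last status seen. The function name wait_while_building and its argument order are hypothetical; the status command is passed in, so nothing here depends on a particular client version.

```shell
# Re-run a status command until it prints something other than BUILD,
# or until <attempts> tries are used up; print the last status seen.
# Usage: wait_while_building <attempts> <sleep_seconds> <command...>
# Hypothetical helper; the command must print just the status on stdout.
wait_while_building() {
    _attempts=$1
    _delay=$2
    shift 2
    _status=""
    _i=0
    while [ "$_i" -lt "$_attempts" ]; do
        _status=$("$@")
        [ "$_status" = "BUILD" ] || break   # ACTIVE, ERROR, ... end the wait
        _i=$((_i + 1))
        sleep "$_delay"
    done
    echo "$_status"
}

# Example (needs a real cloud and a sourced openrc):
#   wait_while_building 30 10 openstack server show test6 -c status -f value
```

If the loop ends with BUILD still printed, the instance never left the build phase within the budget, which is itself a useful data point for the operator.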

Nova instance creation flow #6 - debugging
Debugging steps on the operator's side
● Looks like nova-api is working
● Check the client's command line again to get an error message, then query the other Nova services

$ journalctl -u devstack@n-api | grep -E "ERROR"


# no error here...
$ openstack server show test6 -c fault -f value
{u'message': u'Timed out waiting for a reply to message ID 80a779c36dab4ba68f48600bd961e36e', u'code': 500, u'created':
u'2019-04-28T23:18:36Z'}
$ journalctl -u devstack@n-* | grep 80a779c36dab4ba68f48600bd961e36e
Apr 28 23:18:35 upstream-training nova-conductor[26967]: ERROR nova.conductor.manager [None
req-c0ddf15e-8072-4289-b7f3-9fe2173051ba demo demo] Failed to schedule instances: MessagingTimeout: Timed out waiting
for a reply to message ID 80a779c36dab4ba68f48600bd961e36e

● “Failed to schedule instances: MessagingTimeout”


● Check the diagram of Nova instance creation flow.
○ Looks like nova-scheduler is the culprit
Nova instance creation flow #7
7. nova-scheduler checks the Database
○ nova-scheduler returns the updated instance entry with the appropriate host ID after
filtering and weighing
○ nova-scheduler sends an RPC request to nova-compute for launching an instance on
the appropriate host

If anything goes wrong, debugging steps on the operator's side are similar to the previous ones
● Check the nova-scheduler and nova-compute health, logs and configuration
● Check Database and MQ health, logs, and configuration

$ systemctl status <service_name> # omitted


$ journalctl -u <service_name> | grep -E "ERROR|WARNING" # omitted

Nova instance creation flow #8
8. The responsible nova-compute instance picks the request from the MQ
and queries nova-conductor to get VM details
○ Such as image ID, flavor (RAM, CPU, and disk), etc.
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test8 # omitted
$ openstack server show test8 -c status -f value
BUILD
# After a long-long while
$ openstack server show test8 -c status -f value
BUILD

… Or… you know, it doesn’t.

● A VM stuck forever in the BUILD state means the scheduler cannot find a suitable compute node
● Check the nova-scheduler and nova-compute services’ health, logs, and configuration
Nova instance creation flow #9
9. nova-conductor picks the request from the MQ and queries
nova-database
○ then nova-compute picks the instance information from the MQ
$ source openrc
# Trying to allocate an m1.large instance
$ openstack server create --flavor m1.large --image cirros-0.4.0-x86_64-disk --network private test9 # omitted
$ openstack server show test9 -c status -f value
ERROR
$ openstack server show test9 -c fault -f value
{u'message': u'No valid host was found. ', u'code': 500, u'details': u' File
"/opt/stack/nova/nova/conductor/manager.py", line 1346, in schedule_and_build_instances\n instance_uuids,
...

Nova instance creation flow #9 - debugging
Debugging steps on the operator's side
$ journalctl -u devstack@n-* -u devstack@placement-api | grep -E "DEBUG|WARNING|ERROR"
2019-04-29 05:07:00.046 DEBUG nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filter ComputeFilter returned 1 host(s) from
(pid=17442) get_filtered_objects /opt/stack/nova/nova/filters.py:104
...
2019-04-29 05:07:00.049 DEBUG nova.scheduler.filters.disk_filter [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] (centos70, centos70) ram:
799488MB disk: 0MB io_ops: 0 instances: 0 does not have 1024 MB usable disk, it only has 0.0 MB usable disk. from (pid=17442) host_passes
/opt/stack/nova/nova/scheduler/filters/disk_filter.py:70
2019-04-29 05:07:00.050 INFO nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filter DiskFilter returned 0 hosts
2019-04-29 05:07:00.051 INFO nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filtering removed all hosts for the request with
instance ID '05976d37-8e61-488e-aaf4-9ee770bc5ba0'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)',
'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'CoreFilter:
(start: 1, end: 1)', 'RamFilter: (start: 1, end: 1)', 'DiskFilter: (start: 1, end: 0)']
2019-04-29 05:07:00.052 DEBUG nova.scheduler.filter_scheduler [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] There are 0 hosts available but
1 instances requested to build. from (pid=17442) select_destinations

● “filter DiskFilter returned 0 hosts”


● “there are 0 hosts available but 1 instances requested to build.”
○ Filter scheduler docs: https://docs.openstack.org/nova/latest/user/filter-scheduler.html
○ Placement API (from Stein) docs: https://docs.openstack.org/placement/latest/
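
The "Filter results" summary lists every filter with the host count before and after it ran, so the culprit is the first filter whose end count is 0. The sketch below extracts that name from log text on stdin. The function name first_rejecting_filter is hypothetical, and the sketch assumes the summary sits on a single log line in the format shown on this slide (the slide wraps it only for display) and that grep -o is available.

```shell
# Print the name of the first scheduler filter whose "end" host count is 0.
# Reads log text on stdin. Hypothetical helper; expects entries shaped like
#   'DiskFilter: (start: 1, end: 0)'
# with each entry unbroken on one line.
first_rejecting_filter() {
    grep -oE "[A-Za-z]+: \(start: [0-9]+, end: 0\)" | head -n 1 | cut -d: -f1
}

# Example (the unit glob is an assumption about a DevStack host):
#   journalctl -u 'devstack@n-*' | grep 'Filter results' | first_rejecting_filter
```

For the log excerpt on this slide the helper would name DiskFilter, which matches the "does not have 1024 MB usable disk" message a few lines earlier.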
Nova instance creation flow #10 - #15
For the sake of completeness:
10. nova-compute connects to Glance Image service to retrieve the boot image URI
11. Glance validates auth[nz] with Keystone and returns image metadata to nova-compute
12. nova-compute connects to Neutron network service to allocate and configure (sub)networks, IP
addresses, etc.
13. Neutron validates auth[nz] with Keystone, configures networking and returns information to
nova-compute
14. nova-compute connects to Cinder Volume service to configure and attach volumes to the VM
15. Cinder validates auth[nz] with Keystone, configures block storage and returns information to
nova-compute

● Troubleshooting steps are similar to those for Nova


○ Diagnostics are done on the nova-compute nodes

Nova instance creation flow #16
16. nova-compute configures the hypervisor to create the VM
○ At this point, Horizon is able to show remote VNC console, and SSH should work

$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test16 # omitted
$ openstack server show test16 -c addresses -f value
private=fd4f:38ff:47e7:0:f816:3eff:fe22:3d7d, 10.0.0.32
$ ip address list | grep -E '10\.0\.'
# No IP address in the 10.0.* space. How to SSH?

● Cannot connect to your VM? Check these:


○ Was the VM built successfully?
○ Did it get an IP address?
○ Do the security groups let ICMP / SSH through?

Nova instance creation flow #16 - debugging
Debugging steps on the user side

$ ip netns ls | grep qrouter


qrouter-c7b74975-3bd4-40fe-98ee-bd03fb0d7b7a (id: 1)
$ sudo ip netns exec qrouter-c7b74975-3bd4-40fe-98ee-bd03fb0d7b7a ip address list | grep '10\.0'
inet 10.0.0.1/26 brd 10.0.0.63 scope global qr-2f57a265-b7
$ sudo ip netns exec qrouter-c7b74975-3bd4-40fe-98ee-bd03fb0d7b7a ssh 10.0.0.32 -l cirros
# long wait, timeout
$ openstack security group rule list default
+--------------------------------------+-------------+----------+------------+--------------------------------------+
| ID                                   | IP Protocol | IP Range | Port Range | Remote Security Group                |
+--------------------------------------+-------------+----------+------------+--------------------------------------+
| 242e5b36-4541-49ba-bde0-14bccf9c5df2 | None        | None     |            | 478cead4-7770-4703-b1db-e30a3542601b |
| 58ca602e-38de-43de-bc29-3415d9db0ebb | None        | None     |            | None                                 |
| a70c278e-ab6e-43a0-ba46-53bfb79b5163 | None        | None     |            | 478cead4-7770-4703-b1db-e30a3542601b |
| ab2f3c50-9f99-4360-8bd5-10efae17a546 | None        | None     |            | None                                 |
+--------------------------------------+-------------+----------+------------+--------------------------------------+
$ openstack security group rule create --protocol tcp --dst-port 22:22 --ingress default # omitted
$ sudo ip netns exec qrouter-c7b74975-3bd4-40fe-98ee-bd03fb0d7b7a ssh 10.0.0.32 -l cirros
cirros@10.0.0.32's password: # yaaay happiness and frustration of not remembering the password. It’s `gocubsgo`

Recovering keystone admin access
● What to do if you forgot the credentials or lost the openrc file?
● With admin access to the control host, enable token-based auth
○ https://docs.openstack.org/keystone/latest/configuration/samples/keystone-conf.html
● Set the environment variables:
○ OS_TOKEN as in [DEFAULT] / admin_token in keystone.conf
○ OS_URL as in [DEFAULT] / admin_endpoint in keystone.conf

$ export OS_TOKEN=<admin_token>
$ export OS_URL=<admin_endpoint>
$ openstack user set --password <newpassword> admin

● Admin token-based authentication is insecure, and should be disabled
as soon as other means of authentication are recovered!

General troubleshooting tips & tricks

Troubleshooting checklist
● Identify & reproduce the problem
○ Which user / admin interaction triggered it?
● Collect information
○ Client tools being used, versions, debug output
○ Services being involved, configuration, logs, debug output
○ Check environment: networking, OS, dependent services, storage disk space, etc.
● Fix trivial issues
○ Fix it on the spot, experiment with dev/test environment, home lab
● Ask for help
○ Use web search, reach out to docs, support, developers
● Mitigate carefully
○ Plan and test the steps of the mitigation procedure (aka “do not break prod”)
● Document everything for future reference
Collecting information
● Networking
○ Neutron troubleshooting is hard
○ Connectivity checks using standard Linux tools and the Open vSwitch CLI
$ ping <address>
$ telnet <address> <port>
$ ip address list
$ ip netns list
$ ip netns exec
$ sudo ovs-vsctl show
$ sudo ovs-tcpdump -i br-int
$ sudo tcpdump -i <tap-dev>

● Operating system environment and metrics
○ Usually from nova-compute or cinder-volume hosts
$ lsb_release -a
$ uname -a
$ df -h
$ free -m
$ top # or htop
$ iostat # or iotop
$ dmesg

○ More tools: http://www.brendangregg.com/Perf/linux_perf_tools_full.png


Watch out for non-OpenStack related issues
● Resource exhaustion on controller / compute / storage nodes
○ Disk usage
○ Memory usage
○ Swap usage / Swappiness
○ OOM-killer
○ CPU usage / Load
○ File descriptor limits
○ Physical node failure
● Connectivity
○ IP address collision
○ Network switch misconfiguration / failure
○ Cable / SFP failure
● Other
○ Time synchronization
○ External network misconfiguration (DNS / Firewall)
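
Disk usage is the easiest of these to check mechanically. As a minimal sketch, the helper below reads the portable output of df -P and prints every mount point at or above a given usage percentage. The function name disks_over and the 90 percent threshold in the usage comment are hypothetical, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print "<mount point> <use>%" for every
# filesystem at or above <threshold> percent use.
# Usage: df -P | disks_over 90
# Hypothetical helper; assumes mount points without spaces.
disks_over() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5                  # Capacity column, e.g. "95%"
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }
    '
}

# Example (threshold is an arbitrary choice):
#   df -P | disks_over 90
```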

Working with openstack cli tools
● Common options to all subcommands
○ To gather more information about a problem, check --version, read --help, use --debug
○ OpenStack client releases: https://releases.openstack.org/teams/openstackclient.html
$ openstack --version
openstack 3.18.0
$ openstack --help # omitted
$ openstack server create --help # omitted
$ openstack server list --debug # omitted

● The old way: use the dedicated tools


○ Today all functionality should be implemented in the openstack command
○ The individual tools are installable with pip: python-(nova|cinder|neutron|etc)client
○ Example: nova releases found on https://releases.openstack.org/teams/nova.html
$ nova --version # --help, --debug also work

Example of collecting debug logs - client side
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test --debug
...
# Set environment
auth_config_hook(): {'auth_type': 'password', 'beta_command': False, u'image_status_code_retries': '5
defaults: {u'auth_type': 'password', u'status': u'active', u'image_status_code_retries': 5, 'api_time
cloud cfg: {'auth_type': 'password', 'beta_command': False, u'image_status_code_retries': '5', u'inte
...
# Parse arguments
command: server create -> openstackclient.compute.v2.server.CreateServer (auth=True)
...
# Request auth(n|z)
Using parameters {'username': 'demo', 'project_name': 'demo', 'user_domain_id': 'default', 'auth_url'
Get auth_ref
REQ: curl -g -i -X GET http://192.168.10.15/identity -H "Accept: application/json" -H "User-Agent: op
Starting new HTTP connection (1): 192.168.10.15:80
http://192.168.10.15:80 "GET /identity HTTP/1.1" 300 272
RESP: [300] Connection: close Content-Length: 272 Content-Type: application/json Date: Mon, 29 Apr 20
RESP BODY: {"versions": {"values": [{"status": "stable", "updated": "2019-01-22T00:00:00Z", "media-ty
...
http://192.168.10.15:80 "POST /identity/v3/auth/tokens HTTP/1.1" 201 3253
{"token": {"is_domain": false, "methods": ["password"], "roles": [{"id": "9ae9e8b27dcb419598a8952f4d8
# Request image
Instantiating image api: <class 'openstackclient.api.image_v2.APIv2'>
curl -g -i -X GET -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'User-Agent: python-glancec
...
# Request flavor
REQ: curl -g -i -X GET http://192.168.10.15/compute/v2.1/flavors/m1.nano -H "Accept: application/json
Resetting dropped connection: 192.168.10.15
http://192.168.10.15:80 "GET /compute/v2.1/flavors/m1.nano HTTP/1.1" 404 80
RESP: [404] Connection: close Content-Length: 80 Content-Type: application/json; charset=UTF-8 Date:
RESP BODY: {"itemNotFound": {"message": "Flavor m1.nano could not be found.", "code": 404}}
Example of collecting debug logs - server side
Configure debug logging
$ grep -i ^debug /etc/nova/*
/etc/nova/nova-cpu.conf:debug = True
/etc/nova/nova-dhcpbridge.conf:debug = True
/etc/nova/nova.conf:debug = True
/etc/nova/nova_cell1.conf:debug = True

Query logs from systemd

$ journalctl --unit [email protected]
$ journalctl -u [email protected] -u [email protected]
$ journalctl -u devstack@n-*
$ journalctl -u devstack@n-* | grep <id>
$ journalctl -o short-precise # microsecond timestamps
$ journalctl -a # colors
$ journalctl --since -1hour # limit history

Query logs from /var/log

$ less /var/log/nova/nova-compute.log
$ less /var/log/nova/nova-{compute,conductor}.log
$ less /var/log/nova/*
$ grep <id> /var/log/nova/*

# Learn your tools!


$ man systemctl
$ man systemd.time

Where to search for help
● Knowledge base
○ Documentation: https://docs.openstack.org/
○ Wiki: https://wiki.openstack.org/
○ Project specifications: http://specs.openstack.org/
● Collaboration
○ OpenDev: https://opendev.org/openstack/
○ Bugs, blueprints (old): https://launchpad.net/openstack
○ Bugs, features (new): https://storyboard.openstack.org/
● Support
○ Community Q&A: https://ask.openstack.org/
○ IRC: freenode.net / #openstack
○ Mailing lists: http://lists.openstack.org / openstack-discuss

Administrator & troubleshooting guides
● Troubleshooting guides
○ Maintenance guide: https://docs.openstack.org/operations-guide/ops-maintenance.html
○ Compute: https://docs.openstack.org/nova/latest/admin/support-compute.html
○ Volume: https://docs.openstack.org/cinder/latest/admin/blockstorage-troubleshoot.html
● Project specific administrator guides
○ Image: https://docs.openstack.org/glance/latest/admin/
○ Networking: https://docs.openstack.org/neutron/latest/admin/
○ Identity: https://docs.openstack.org/keystone/latest/admin/
○ Orchestration: https://docs.openstack.org/heat/latest/admin/
○ Dashboard: https://docs.openstack.org/horizon/latest/admin/

Thank you!

Questions?
