OpenStack Troubleshooting Field Survival Guide
@marstokt @kmarc
bit.ly / openstack-troubleshoot
What is this talk about?
● Beginner’s session
● Generic troubleshooting steps for the majority of OpenStack components
● Principles of finding what causes OpenStack components’ erroneous
behavior
● Where to search and how to ask for help
● Exercises covering a few failure scenarios
DevStack virtual machine
bit.ly / upstream-institute
● Interested in contributing?
○ https://docs.openstack.org/upstream-training
Why troubleshoot? And how?!
Why troubleshoot
● Because google://software+is+broken
● Complexity increases room for errors
● OpenStack - the software
○ Easy concept: “Just a bunch of python scripts with a nice WebGUI”
○ Yet complex: >20M LOC (including docs), ~65K commits in a year across ~60 projects
● Apply knowledge
…
● Problems fixed!
● Jokes aside:
○ Know your system to locate failure (what components, how they work together)
○ Understand the layers (minimal understanding from the kernel up to client UI)
○ Learn the tools that can help in troubleshooting (searching logs, checking statuses)
○ Reach out for help (community is amazing!)
Best approach to troubleshooting
● Avoid troubles!
○ Monitoring, logging
○ Alerting
○ Blue-Green deployments
○ Dev / staging environments
○ Infrastructure-as-code
○ Log analytics, etc.
What can go wrong during a VM instance creation?
Nova instance creation flow
Nova instance creation flow #1 - debugging
Debugging steps on the user side
Nova instance creation flow #1 - debugging
Debugging steps on the operator's side
Nova instance creation flow #1 - debugging
● On the client side, use --debug to retrieve Request ID
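A minimal sketch of pulling the Request ID out of the client's --debug output so it can be searched for in the service logs. The function name extract_request_ids is hypothetical, and the pattern assumes IDs of the form req-<UUID>, as in the error messages shown in this deck.

```shell
# Hypothetical helper: print each unique request ID (req-<UUID>) found on stdin.
extract_request_ids() {
  grep -oE 'req-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

# Example usage (assumes a configured openstack client; 2>&1 is used because
# debug output typically goes to stderr):
#   openstack server list --debug 2>&1 | extract_request_ids
```

Each ID printed this way can then be used as a search string in the logs of the services that handled the request.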
Nova instance creation flow #2
2. An authenticated request to Nova is issued by connecting to nova-api
○ https://httpstatuses.com/503 - not quite helpful
Nova instance creation flow #2 - debugging
Debugging steps on the user side
$ curl http://192.168.10.15/compute/v2.1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
...
<p>The requested URL /compute was not found on this server.</p>
<address>Apache/2.4.29 Server at 192.168.10.15 Port 80</address>
...
$ a2ensite nova-api-wsgi.conf
$ curl http://192.168.10.15/compute/v2.1
{"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}
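The two curl responses above (a 404 from Apache versus a 401 from the API) show how much the bare status code already tells you. A minimal sketch of that reading follows; the function name explain_status and the wording of the hints are assumptions, not part of any OpenStack tool.

```shell
# Hypothetical helper: turn an HTTP status code from the compute endpoint
# into a first guess about where to look next.
explain_status() {
  case "$1" in
    401) echo "API reachable; authentication required" ;;
    404) echo "web server up, but the API path is not served" ;;
    503) echo "service unavailable; check the API service and its web server" ;;
    000) echo "no HTTP response; check network and whether the web server runs" ;;
    *)   echo "unclassified status $1" ;;
  esac
}

# Example usage (the URL is the lab address used in this deck):
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://192.168.10.15/compute/v2.1)
#   explain_status "$code"
```

An unauthenticated 401 is the healthy answer here: it means the request reached nova-api and only the token is missing.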
Nova instance creation flow #3
3. nova-api queries Keystone for authentication and authorization of the
incoming request
○ Keystone validates the token and replies with updated authentication headers that carry
authorization (roles / permissions) data
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test3
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'keystoneauth1.exceptions.discovery.DiscoveryFailure'> (HTTP 500) (Request-ID:
req-35499014-c704-4eb3-bcf0-866f59651482)
Nova instance creation flow #3 - debugging
Debugging steps on the operator's side
● Get a request ID from the client side (--debug)
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test4
# very long wait time
Unknown Error (HTTP 500)
Nova instance creation flow #4 - debugging
Debugging steps on the operator's side:
● Sometimes it’s worth looking up WARNING / ERROR messages in logs
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test5
# very, very, very long wait time
Unknown Error (HTTP 500)
Nova instance creation flow #5 - debugging
Debugging steps on the operator's side - got the gist, right?
Nova instance creation flow #6 - debugging
Debugging steps on the operator's side
● Looks like nova-api is working
● Check the client's command-line output again for an error message, and query the other Nova services
If anything goes wrong, debugging steps on the operator's side are similar to the previous ones
● Check the nova-scheduler and nova-compute health, logs and configuration
● Check Database and MQ health, logs, and configuration
Nova instance creation flow #8
8. The responsible nova-compute instance picks the request from the MQ
and queries nova-conductor to get VM details
○ Such as image ID, flavor (RAM, CPU, and disk), etc.
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test8 # omitted
$ openstack server show test8 -c OS-EXT-STS:task_state -f value
BUILD
# After a long-long while
$ openstack server show test8 -c OS-EXT-STS:task_state -f value
BUILD
Nova instance creation flow #9 - debugging
Debugging steps on the operator's side
$ journalctl -u devstack@n-* -u devstack@placement-api | grep -E "DEBUG|WARNING|ERROR"
2019-04-29 05:07:00.046 DEBUG nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filter ComputeFilter returned 1 host(s) from
(pid=17442) get_filtered_objects /opt/stack/nova/nova/filters.py:104
...
2019-04-29 05:07:00.049 DEBUG nova.scheduler.filters.disk_filter [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] (centos70, centos70) ram:
799488MB disk: 0MB io_ops: 0 instances: 0 does not have 1024 MB usable disk, it only has 0.0 MB usable disk. from (pid=17442) host_passes
/opt/stack/nova/nova/scheduler/filters/disk_filter.py:70
2019-04-29 05:07:00.050 INFO nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filter DiskFilter returned 0 hosts
2019-04-29 05:07:00.051 INFO nova.filters [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] Filtering removed all hosts for the request with
instance ID '05976d37-8e61-488e-aaf4-9ee770bc5ba0'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)',
'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'CoreFilter:
(start: 1, end: 1)', 'RamFilter: (start: 1, end: 1)', 'DiskFilter: (start: 1, end: 0)']
2019-04-29 05:07:00.052 DEBUG nova.scheduler.filter_scheduler [req-b2a4445d-9e5b-4c7e-81d5-b1ee854a3735 admin admin] There are 0 hosts available but
1 instances requested to build. from (pid=17442) select_destinations
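In the log excerpt above, the decisive line is the one where a filter returns 0 hosts. A minimal sketch of extracting just the filter names from a log stream is shown below; the function name filters_returning_zero_hosts is hypothetical, and the pattern assumes the "Filter <Name> returned 0 hosts" wording seen in this excerpt.

```shell
# Hypothetical helper: print the name of every scheduler filter that
# reported "returned 0 hosts" in the log lines read from stdin.
filters_returning_zero_hosts() {
  sed -nE 's/.*Filter ([A-Za-z]+) returned 0 hosts.*/\1/p' | sort -u
}

# Example usage, reusing the journalctl invocation from this slide:
#   journalctl -u 'devstack@n-*' -u devstack@placement-api | filters_returning_zero_hosts
```

For the excerpt above this points at DiskFilter, which matches the "only has 0.0 MB usable disk" message a few lines earlier.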
Nova instance creation flow #16
16. nova-compute configures the hypervisor to create the VM
○ At this point, Horizon is able to show remote VNC console, and SSH should work
$ source openrc
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test16 # omitted
$ openstack server show test16 -c addresses -f value
private=fd4f:38ff:47e7:0:f816:3eff:fe22:3d7d, 10.0.0.32
$ ip address list | grep -E '10\.0\.'
# No IP address in the 10.0.* space. How to SSH?
Nova instance creation flow #16 - debugging
Debugging steps on the user side
Recovering keystone admin access
● What to do if you forgot the credentials or lost the openrc file?
● With admin access to the control host, enable token-based auth
○ https://docs.openstack.org/keystone/latest/configuration/samples/keystone-conf.html
● Set the environment variables:
○ OS_TOKEN as in [DEFAULT] / admin_token in keystone.conf
○ OS_URL as in [DEFAULT] / admin_endpoint in keystone.conf
$ export OS_TOKEN=<admin_token>
$ export OS_URL=<admin_endpoint>
$ openstack user set --password <newpassword> admin
General troubleshooting tips & tricks
Troubleshooting checklist
● Identify & reproduce the problem
○ What user / admin interaction triggered it?
● Collect information
○ Client tools being used, versions, debug output
○ Services being involved, configuration, logs, debug output
○ Check environment: networking, OS, dependent services, storage disk space, etc.
● Fix trivial issues
○ Fix it on the spot, experiment with dev/test environment, home lab
● Ask for help
○ Use web search, reach out to docs, support, developers
● Mitigate carefully
○ Plan and test the steps of the mitigation procedure (aka “do not break prod”)
● Document everything for future reference
Collecting information
● Networking
○ Neutron troubleshooting is hard
○ Connectivity checks using standard Linux tools and the Open vSwitch CLI
$ ping <address>
$ telnet <address> <port>
$ ip address list
$ ip netns list
$ ip netns exec
$ sudo ovs-vsctl show
$ sudo ovs-tcpdump -i br-int
$ sudo tcpdump -i <tap-dev>
○ OOM-killer
○ CPU usage / Load
○ File descriptor limits
○ Physical node failure
● Other
○ Time synchronization
○ External network misconfiguration (DNS / Firewall)
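A few of the host-level items listed above can be checked with standard Linux tools. The sketch below is only an illustration: the function name oom_events is hypothetical, and the pattern assumes the usual kernel wording ("invoked oom-killer", "Out of memory").

```shell
# Hypothetical helper: keep only the kernel log lines that point at the OOM killer.
oom_events() {
  grep -iE 'out of memory|oom-killer'
}

# Example usage on the affected host:
#   sudo dmesg | oom_events   # processes killed for lack of memory
#   uptime                    # load averages
#   ulimit -n                 # file descriptor limit of the current shell
#   timedatectl               # clock / NTP sync status (systemd hosts)
```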
Working with OpenStack CLI tools
● Options common to all subcommands
○ To gather more information about a problem, check --version, read --help, use --debug
○ OpenStack client releases: https://releases.openstack.org/teams/openstackclient.html
$ openstack --version
openstack 3.18.0
$ openstack --help # omitted
$ openstack server create --help # omitted
$ openstack server list --debug # omitted
Example of collecting debug logs - client side
$ openstack server create --flavor m1.nano --image cirros-0.4.0-x86_64-disk --network private test --debug
...
# Set environment
auth_config_hook(): {'auth_type': 'password', 'beta_command': False, u'image_status_code_retries': '5
defaults: {u'auth_type': 'password', u'status': u'active', u'image_status_code_retries': 5, 'api_time
cloud cfg: {'auth_type': 'password', 'beta_command': False, u'image_status_code_retries': '5', u'inte
...
# Parse arguments
command: server create -> openstackclient.compute.v2.server.CreateServer (auth=True)
...
# Request auth(n|z)
Using parameters {'username': 'demo', 'project_name': 'demo', 'user_domain_id': 'default', 'auth_url'
Get auth_ref
REQ: curl -g -i -X GET http://192.168.10.15/identity -H "Accept: application/json" -H "User-Agent: op
Starting new HTTP connection (1): 192.168.10.15:80
http://192.168.10.15:80 "GET /identity HTTP/1.1" 300 272
RESP: [300] Connection: close Content-Length: 272 Content-Type: application/json Date: Mon, 29 Apr 20
RESP BODY: {"versions": {"values": [{"status": "stable", "updated": "2019-01-22T00:00:00Z", "media-ty
...
http://192.168.10.15:80 "POST /identity/v3/auth/tokens HTTP/1.1" 201 3253
{"token": {"is_domain": false, "methods": ["password"], "roles": [{"id": "9ae9e8b27dcb419598a8952f4d8
# Request image
Instantiating image api: <class 'openstackclient.api.image_v2.APIv2'>
curl -g -i -X GET -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'User-Agent: python-glancec
...
# Request flavor
REQ: curl -g -i -X GET http://192.168.10.15/compute/v2.1/flavors/m1.nano -H "Accept: application/json
Resetting dropped connection: 192.168.10.15
http://192.168.10.15:80 "GET /compute/v2.1/flavors/m1.nano HTTP/1.1" 404 80
RESP: [404] Connection: close Content-Length: 80 Content-Type: application/json; charset=UTF-8 Date:
RESP BODY: {"itemNotFound": {"message": "Flavor m1.nano could not be found.", "code": 404}}
Example of collecting debug logs - server side
Configure debug logging
$ grep -i ^debug /etc/nova/*
/etc/nova/nova-cpu.conf:debug = True
/etc/nova/nova-dhcpbridge.conf:debug = True
/etc/nova/nova.conf:debug = True
/etc/nova/nova_cell1.conf:debug = True
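The grep above lists the files where debug logging is already on. The inverse question, which configuration files still lack it, can be sketched as follows; the function name configs_without_debug is hypothetical, and the check assumes the setting is written as "debug = True" at the start of a line, as in the output above.

```shell
# Hypothetical helper: list the *.conf files in a directory that do not
# contain a line starting with "debug = True" (case-insensitive).
configs_without_debug() {
  for f in "$1"/*.conf; do
    [ -e "$f" ] || continue
    grep -qiE '^debug[[:space:]]*=[[:space:]]*true' "$f" || echo "$f"
  done
}

# Example usage:
#   configs_without_debug /etc/nova
```

Services generally need a restart to pick up a changed debug setting, so treat the file list as a to-do list rather than a live status.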
Where to search for help
● Knowledge base
○ Documentation
https://docs.openstack.org/
○ Wiki
https://wiki.openstack.org/
○ Project specifications
http://specs.openstack.org/
● Collaboration
○ OpenDev
https://opendev.org/openstack/
○ Bugs, blueprints (old)
https://launchpad.net/openstack
○ Bugs, features (new)
https://storyboard.openstack.org/
● Support
○ Community Q&A
https://ask.openstack.org/
○ IRC
freenode.net / #openstack
○ Mailing lists
http://lists.openstack.org / openstack-discuss
Administrator & troubleshooting guides
● Troubleshooting guides
○ Maintenance guide:
https://docs.openstack.org/operations-guide/ops-maintenance.html
○ Compute:
https://docs.openstack.org/nova/latest/admin/support-compute.html
○ Volume:
https://docs.openstack.org/cinder/latest/admin/blockstorage-troubleshoot.html
● Project-specific administrator guides
○ Image: https://docs.openstack.org/glance/latest/admin/
○ Networking: https://docs.openstack.org/neutron/latest/admin/
○ Identity: https://docs.openstack.org/keystone/latest/admin/
○ Orchestration: https://docs.openstack.org/heat/latest/admin/
○ Dashboard: https://docs.openstack.org/horizon/latest/admin/
Thank you!
Questions?