Crash when trying to open corrupted database #105

Open
tmm1 opened this issue Jun 27, 2018 · 12 comments

tmm1 commented Jun 27, 2018

I have an app that takes regular backups of boltdb databases. Sometimes, for unknown reasons, the backups are corrupted.

I also have a restore UI that lets me browse and read from backups. Trying to open and read from these corrupted databases crashes my process. I'm using 4f5275f

unexpected fault address 0x8a6b008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8a6b008 pc=0x42e0e2f]

goroutine 12 [running]:
runtime.throw(0x4a487e4, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4206eee00 sp=0xc4206eede0 pc=0x402d5b1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4206eee50 sp=0xc4206eee00 pc=0x4042de1
github.com/coreos/bbolt.(*Cursor).search(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x63)
	.go/src/github.com/coreos/bbolt/cursor.go:255 +0x5f fp=0xc4206eef08 sp=0xc4206eee50 pc=0x42e0e2f
github.com/coreos/bbolt.(*Cursor).seek(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x0, 0x0, 0x4063d84, 0x614e000, 0x0, 0x48d8300, ...)
	.go/src/github.com/coreos/bbolt/cursor.go:159 +0xa5 fp=0xc4206eef58 sp=0xc4206eef08 pc=0x42e0725
github.com/coreos/bbolt.(*Bucket).Bucket(0xc4204976d8, 0xc4206ef118, 0x6, 0x20, 0xc4206ef118)
	.go/src/github.com/coreos/bbolt/bucket.go:105 +0xde fp=0xc4206ef010 sp=0xc4206eef58 pc=0x42dc66e
github.com/coreos/bbolt.(*Tx).Bucket(0xc4204976c0, 0xc4206ef118, 0x6, 0x20, 0x6)
	.go/src/github.com/coreos/bbolt/tx.go:101 +0x4f fp=0xc4206ef048 sp=0xc4206ef010 pc=0x42ebbef

test.db.gz

tmm1 commented Jun 27, 2018

I tried to use tx.Check() but it also blows up. Perhaps because I'm using ReadOnly: true?

unexpected fault address 0xaf41008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xaf41008 pc=0x42e6aa7]

goroutine 90 [running]:
runtime.throw(0x4a48764, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4205e0be0 sp=0xc4205e0bc0 pc=0x402d2e1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4205e0c30 sp=0xc4205e0be0 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4200bf500, 0xaf41000)
	.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc4205e0ce0 sp=0xc4205e0c30 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
	.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc4205e0d30 sp=0xc4205e0ce0 pc=0x42ef22b
sync.(*Once).Do(0xc42032f050, 0xc420055d78)
	/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc4205e0d68 sp=0xc4205e0d30 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42032ef00)
	.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc4205e0d98 sp=0xc4205e0d68 pc=0x42e201e
github.com/coreos/bbolt.(*Tx).check(0xc420384380, 0xc42039a600)
	.go/src/github.com/coreos/bbolt/tx.go:399 +0x47 fp=0xc4205e0fd0 sp=0xc4205e0d98 pc=0x42ed2c7
runtime.goexit()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4205e0fd8 sp=0xc4205e0fd0 pc=0x405b871
created by github.com/coreos/bbolt.(*Tx).Check
	.go/src/github.com/coreos/bbolt/tx.go:393 +0x67

tmm1 commented Jun 27, 2018

Without ReadOnly, Open() crashes right away on a different backup:

unexpected fault address 0x8bf2008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8bf2008 pc=0x42e6aa7]

goroutine 79 [running]:
runtime.throw(0x4a48764, 0x5)
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc42047f0d8 sp=0xc42047f0b8 pc=0x402d2e1
runtime.sigpanic()
	/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc42047f128 sp=0xc42047f0d8 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4205cf320, 0x8bf2000)
	.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc42047f1d8 sp=0xc42047f128 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
	.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc42047f228 sp=0xc42047f1d8 pc=0x42ef22b
sync.(*Once).Do(0xc42038d050, 0xc42047f270)
	/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc42047f260 sp=0xc42047f228 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42038cf00)
	.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc42047f290 sp=0xc42047f260 pc=0x42e201e
github.com/coreos/bbolt.Open(0xc4200edc20, 0x41, 0x180, 0xc42047f388, 0xc4206446b8, 0x0, 0x0)
	.go/src/github.com/coreos/bbolt/db.go:260 +0x38e fp=0xc42047f330 sp=0xc42047f290 pc=0x42e1c4e

test2.db.gz

tmm1 commented Jun 27, 2018

Similar issue: boltdb/bolt#698

tmm1 commented Jun 27, 2018

Here's my repro code:

func readBackup(file string) error {
	db, err := bolt.Open(file, 0600, &bolt.Options{Timeout: 1 * time.Second, ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()

	return db.View(func(tx *bolt.Tx) error {
		if groups := tx.Bucket([]byte("groups")); groups != nil {
			num := groups.Stats().KeyN
			log.Printf("num: %v", num)
		}
		return nil
	})
}

It would be really nice if there were some way to check whether a backup is consistent before trying to read it. Ideally bbolt would be able to deal with truncated/corrupted files itself and not crash the entire process.
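One rough way to do such a pre-flight check, sketched below, is to read the two meta pages directly and verify that the root, freelist and high-water-mark (HWM) page IDs actually fit inside the file before handing it to bbolt; a truncated file, or a meta page pointing past the end of the file, is caught immediately. This is only an illustrative sketch: the sanityCheck helper is not a bbolt API, and the hard-coded offsets and magic value assume bbolt's current meta layout (version 2, 64-bit little-endian), which is not a stable public interface.

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Magic value stored in every bbolt meta page.
const boltMagic = 0xED0CDAED

// sanityCheck reads both meta pages and verifies that the page IDs they
// reference fit inside the file. Offsets are assumptions about the current
// on-disk layout, not a public API.
func sanityCheck(file string) error {
	data, err := os.ReadFile(file)
	if err != nil {
		return err
	}
	if len(data) < 28 {
		return fmt.Errorf("file too small to contain a meta page")
	}
	// Page size is stored in meta page 0: 16-byte page header + 8.
	pageSize := uint64(binary.LittleEndian.Uint32(data[24:]))
	if pageSize < 512 || pageSize > 64*1024 {
		return fmt.Errorf("implausible page size %d", pageSize)
	}
	if uint64(len(data)) < 2*pageSize || uint64(len(data))%pageSize != 0 {
		return fmt.Errorf("file size %d is truncated or not a multiple of page size %d", len(data), pageSize)
	}
	numPages := uint64(len(data)) / pageSize

	for _, id := range []uint64{0, 1} {
		m := data[id*pageSize+16:] // skip the 16-byte page header
		magic := binary.LittleEndian.Uint32(m[0:])
		version := binary.LittleEndian.Uint32(m[4:])
		root := binary.LittleEndian.Uint64(m[16:])
		freelist := binary.LittleEndian.Uint64(m[32:])
		hwm := binary.LittleEndian.Uint64(m[40:])
		if magic != boltMagic || version != 2 {
			return fmt.Errorf("meta %d: bad magic or version", id)
		}
		// Note: databases written with NoFreelistSync use a sentinel
		// freelist ID and would need special-casing here.
		if root >= numPages || freelist >= numPages || hwm > numPages {
			return fmt.Errorf("meta %d: root=%d freelist=%d hwm=%d exceed the %d pages in the file",
				id, root, freelist, hwm, numPages)
		}
	}
	return nil
}

A check like this cannot detect deeper structural corruption (tx.Check() is still needed for that), but it is cheap and never touches bbolt's mmap, so it cannot fault.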

subbu05 commented Dec 12, 2018

defer func() {
	if err := recover(); err != nil {
		fmt.Println("Corrupted or invalid boltDB file")
	}
}()

Add code like this to recover from the panic.

benma commented Nov 9, 2022

I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.

panics-on-check.db.zip

cc @serathius - I saw you recently committed to the repo - who to ping? Is this repo still maintained?

Edit: the address fault is a segmentation fault, not a panic, so this can't even be recovered with recover(). This seems to require a bugfix in this library, as it cannot really be worked around.

@serathius
Member

@benma The etcd project still has maintainers; however, we are very stretched with work on etcd. We can review PRs and fix bugs, but there is no active development on bbolt.

cenkalti commented Apr 1, 2023

With https://pkg.go.dev/runtime/debug#SetPanicOnFault, segmentation faults can be turned into panics.
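For example, a read of a possibly corrupt backup could be wrapped roughly as sketched below (the safeRead helper is illustrative, not a bbolt API, and the import path assumes the current go.etcd.io/bbolt module). Note that SetPanicOnFault applies only to the current goroutine, so it covers faults raised while Open maps and reads the file and while the View callback runs, but not faults inside goroutines the library starts itself, such as the one behind Tx.Check().

package main

import (
	"fmt"
	"log"
	"runtime/debug"
	"time"

	bolt "go.etcd.io/bbolt"
)

// safeRead opens a possibly corrupt database read-only and runs fn inside a
// View transaction. SetPanicOnFault turns a fault on a bad mmap'ed page into
// a panic on this goroutine, and the deferred recover converts that panic
// into an ordinary error instead of killing the process.
func safeRead(file string, fn func(tx *bolt.Tx) error) (err error) {
	prev := debug.SetPanicOnFault(true)
	defer debug.SetPanicOnFault(prev)
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("database appears to be corrupted: %v", r)
		}
	}()

	db, err := bolt.Open(file, 0600, &bolt.Options{Timeout: time.Second, ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()

	return db.View(fn)
}

func main() {
	err := safeRead("test.db", func(tx *bolt.Tx) error {
		if groups := tx.Bucket([]byte("groups")); groups != nil {
			log.Printf("num: %v", groups.Stats().KeyN)
		}
		return nil
	})
	if err != nil {
		log.Println(err)
	}
}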

ahrtr commented Apr 1, 2023

Check() should definitely return an error instead of panicking.

Agreed.

Fixing corrupted db files has been my top priority recently. The most important thing is to figure out how to reproduce the issue. It would be great if anyone could provide clues on this. Please do not hesitate to ping me if you have any thoughts. Thanks.

FYI, we recently added a bbolt surgery clear-page-elements command as a workaround to fix corrupted db files; see #417.

ahrtr commented May 19, 2023

I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.

panics-on-check.db.zip

The DB (panics-on-check.db) was somehow corrupted during the last transaction. The corrupted db can be easily fixed by reverting the meta page (it effectively rolls back the last transaction).

$ ./bbolt surgery revert-meta-page /tmp/panics-on-check.db --output ./new.db
The meta page is reverted.
$ ./bbolt check ./new.db 
OK

I am almost sure that the corruption wasn't caused by bbolt. The db file has 6 pages in total, but the bucket's root page ID is somehow a huge value, 7631988 (0x747474). Most likely it was caused by something else, e.g. a hardware or OS issue.

@benma Do you still remember how the corrupt file was generated? Was there anything unusual (e.g. a power-off or OS crash) when it was generated? BTW, what bbolt version were you using?

$ ./bbolt  page /tmp/panics-on-check.db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=4>
Freelist:   <pgid=5>
HWM:        <pgid=6>
Txn ID:     2
Checksum:   eef96d7a2c1b336e

$ ./bbolt  page /tmp/panics-on-check.db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=3>
Freelist:   <pgid=2>
HWM:        <pgid=4>
Txn ID:     1
Checksum:   264c351a5179480f

$ ./bbolt  page /tmp/panics-on-check.db 4
Page ID:    4
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 1

"bucket": <pgid=7631988,seq=0>

ahrtr commented May 19, 2023

test.db.gz

The corruption in the file provided by @tmm1 seems like a potential bbolt bug. What's your bbolt version?

The freelist page (108) was somehow reset (all fields have zero value).

What's confusing is that the two meta pages have exactly the same Root (99), Freelist (108) and HWM (482). Meta 0 has TXN 64921, while meta 1 has TXN 64920; this indicates that the last RW transaction did not change anything. But the freelist should have changed anyway (it's a potential improvement point: we shouldn't sync the freelist if the RW TXN changes nothing).

$ ./bbolt page /tmp/test.db  0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64921
Checksum:   aab8d660770b88f7

$ ./bbolt page /tmp/test.db  1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64920
Checksum:   929bdcc802b6f642

ahrtr commented May 26, 2023

test.db.gz

There isn't even a way to fix this corrupted db file. The file is only 204800 bytes, so it's 50 pages (204800/4096). Obviously the root page ID (99), freelist (108) and HWM (482) exceed the number of pages in the file. I can't even find the root page among the available 50 pages. It seems that the file was somehow truncated, and the root was in the truncated part.

$ ls -lrt test.db
-rw-r--r-- 1 wachao wheel 204800 May 26 15:15 test.db

github-actions bot added the stale label May 11, 2024
github-actions bot closed this as not planned Jun 1, 2024
ahrtr reopened this Jun 1, 2024
ahrtr removed the stale label Jun 1, 2024
github-actions bot added the stale label Aug 31, 2024