Project #2 - Extendible Hash Index
CMU 15-445/645
Overview
The second programming project is to implement a disk-backed hash table for the BusTub DBMS. Your
hash table is responsible for fast data retrieval without having to search through every record in a database
table.
You will need to implement a hash table using the extendible hashing scheme.
This index comprises a directory page that contains pointers to bucket pages. The table will access pages
through your buffer pool from Project #1. The table contains a directory page that stores all the metadata
for the table and buckets. Your hash table needs to support bucket splitting/merging for full/empty buckets,
and directory expansion/contraction for when the global depth must change.
You will need to complete the following tasks in your hash table implementation:
Page Layouts
Extendible Hashing Implementation
Concurrency Control
Project Specification
Like the first project, we are providing you with stub classes that contain the API that you need to implement. You should not modify the signatures of the pre-defined functions in these classes. If you do, it will break the test code that we will use to grade your assignment, and you will end up getting no credit for the project. If a class already contains certain member variables, you should not remove them. You may, however, add private helper functions and member variables to these classes in order to correctly realize the functionality.
The correctness of the extendible hash table index depends on the correctness of your buffer pool
implementation. We will not provide solutions for the previous programming projects.
To support reading/writing hash table buckets on top of pages, you will implement two Page classes to
store the data of your hash table. This is meant to teach you how to allocate memory from the
BufferPoolManager as pages.
The bucket_page_ids_ array maps bucket ids to page_id_t ids. The ith element in bucket_page_ids_ is the
page_id for the ith bucket.
You must implement your Hash Table Directory Page in the designated files. You are only allowed to modify
the directory file ( src/include/storage/page/hash_table_directory_page.h) and its corresponding source
file ( src/storage/page/hash_table_directory_page.cpp). The Directory and Bucket pages are fully separate
from the LinearProbeHashTable's Header and Block pages, so please make sure that you are editing the
correct files.
`occupied_` : The ith bit of `occupied_` is 1 if the ith index of `array_` has ever been occupied.
`readable_` : The ith bit of `readable_` is 1 if the ith index of `array_` holds a readable value.
`array_` : The array that holds the key-value pairs.
The number of slots available in a Hash Table Bucket Page depends on the types of the keys and values
being stored. You only need to support fixed-length keys and values. The size of keys/values will be the
same within a single hash table instance, but you cannot assume that they will be the same for all
instances (e.g., hash table #1 can have 32-bit keys and hash table #2 can have 64-bit keys).
You must implement your Hash Table Bucket Page in the designated files. You are only allowed to modify
the header file ( src/include/storage/page/hash_table_bucket_page.h ) and its corresponding source file
( src/storage/page/hash_table_bucket_page.cpp ). The Directory and Bucket pages are fully separate from
the LinearProbeHashTable's Header and Block pages, so please make sure that you are editing the correct
files.
Each Hash Table Directory/Bucket page corresponds to the content (i.e., the byte array data_) of a memory page fetched by the buffer pool. Every time you read or write a page, you must first fetch the page from the buffer pool using its unique page_id, then reinterpret_cast it to either a directory or a bucket page, and unpin the page after any reading or writing operations.
We have provided various helpers, or documentation suggesting helper functions. The only functions you
must implement are as follows.
You are free to design and implement additional new functions as you see fit. However, you must be
careful to watch for name collisions. These should be rare, but would arise in Gradescope as a compiler
error.
Your hash table must support both unique and non-unique keys. Duplicate values for the same key are not
allowed. This means that (key_0, value_0) and (key_0, value_1) can exist in the same hash table, but not
(key_0, value_0) and (key_0, value_0) . The Insert method only returns false if it tries to insert an
existing key-value pair.
Your hash table implementation must hide the details of the key/value type and associated comparator, like
this:
template <typename KeyType, typename ValueType, typename KeyComparator>
class ExtendibleHashTable {
// ---
};
`KeyType`: The type of each key in the hash table. This will only be GenericKey; the actual size of GenericKey is specified and instantiated with a template argument and depends on the data type of the indexed attribute.
`ValueType`: The type of each value in the hash table. This will only be 64-bit RID.
`KeyComparator`: The class used to compare whether two KeyType instances are less/greater-than
each other. These will be included in the KeyType implementation files. Variables with the
`KeyComparator` type are essentially functions; for instance, given two keys `KeyType key1` and
`KeyType key2`, and a key comparator `KeyComparator cmp`, you can compare the keys via
`cmp(key1, key2)`.
Note that you can equality-test ValueType instances simply using the == operator.
Directory Indexing
When inserting into your hash index, you will want to use the least-significant bits for indexing into the
directory. Of course, it is possible to use the most-significant bits correctly, but using the least-significant
bits makes the directory expansion operation much simpler.
Splitting Buckets
You must split a bucket if there is no room for insertion. You may split as soon as the bucket becomes full, if you find that easier. However, the reference solution splits only when an insertion would overflow a page, so you may find that the provided API is more amenable to this approach. As always, you are welcome to factor out your own internal API.
Merging Buckets
Merging must be attempted when a bucket becomes empty. There are ways to merge more aggressively by
checking the occupancy of buckets and their split images, but these expensive checks and extra merges
can increase thrashing.
To keep things relatively simple, we provide the following rules for merging:
1. Only empty buckets can be merged.
2. Buckets can only be merged with their split image if their split image has the same local depth.
3. Buckets can only be merged if their local depth is greater than 0.
If you are confused about a "split image," please review the algorithm and code documentation. The concept falls out quite naturally.
Directory Growing
There are no fancy rules for this. You either have to grow the directory, or you don't.
Directory Shrinking
Only shrink the directory if the local depth of every bucket is strictly less than the global depth of the directory. There are other conditions under which one could shrink the directory, but this check is trivial to implement, since we keep the local depths in the directory page.
Performance
An important performance detail is to take write locks and latches only when they are needed. Always taking write locks will inevitably time out on Gradescope.
In addition, one potential optimization is to factor out your own scan types over the bucket pages, which can avoid repeated scans in certain cases. You will find that checking many things about a bucket page often involves a full scan, so you can potentially collect all this information in one pass.
You will need latches on each bucket so that when one thread is writing to a bucket, other threads are not reading or modifying that bucket as well. You should also allow multiple readers to read the same bucket at the same time.
You will need to latch the whole hash table when you need to split or merge buckets, and when global
depth changes.
There are two latches to be aware of in this project. The first is table_latch_ in extendible_hash_table.h ,
which takes latches on the extendible hash table. This comes from the RWLatch class in
src/include/common/rwlatch.h . As you can see in the code, it is backed by std::mutex . The second is the
built-in page latching functionality in src/include/storage/page.h. This is what you must use to protect
your bucket pages. Note that to take a read-lock on the table_latch_ you call RLock from RWLatch.h , but to
take a read-lock on a bucket page you must reinterpret_cast<Page *> to a page pointer, and call the
RLatch method from page.h.
We suggest you look at the extendible hash table class, look at its members, and analyze exactly which
latches will allow which behavior. We also suggest that you do the same for the bucket pages.
Project 4 will explore concurrency control through locking with the LockManager in LockManager.h . You do
not need the LockManager at all for this project.
Transaction Pointer
You can simply pass nullptr as the Transaction pointer argument when it is required. This Transaction object comes from src/include/concurrency/transaction.h. It provides methods to store the pages on which you have acquired latches while traversing the hash table. You do not need to do this to pass the tests.
You must pull the latest changes on our BusTub repo for test files and other supplementary files we
have provided for you. Run `git pull public master`.
Compilation Issues
If you have trouble compiling, try deleting your build directory and making a fresh one. Running make clean
does not completely reset the compilation process, so starting from scratch with cmake .. can help.
We encourage you to use gdb to debug your project if you are having problems. Here is a nice reference for useful commands in gdb. Make sure you've run cmake with debug flags on so that gdb has debugging symbols to use.
Important: These tests are only a subset of all the tests that we will use to evaluate and grade your project. You should write additional test cases on your own to check the complete functionality of your implementation. In addition, if Valgrind times out, then it's possible that your implementation is not efficient enough. If your buffer pool manager implementation was slow, then you might need to debug the BPM as well.
Formatting
Your code must follow the Google C++ Style Guide. We use Clang to automatically check the quality of
your source code. Your project grade will be zero if your submission fails any of these checks.
Execute the following commands to check your syntax. The format target will automatically correct your code. The check-lint and check-clang-tidy targets will print errors and instruct you on how to fix them to conform to our style guide.
$ make -j format
$ make -j check-lint
$ make -j check-clang-tidy
If you have issues with any of these commands, it may relate to either your python installation, or file
permissions on your remote files. Please try to read the error, understand the problem, and resolve it by
doing things like inspecting and editing your $PATH , managing permissions with chmod , and managing
installed packages with brew or apt .
Development Hints
Instead of using printf or cout statements for debugging, use the LOG_* macros for logging information like
this:
LOG_INFO("# Pages: %d", num_pages);
LOG_DEBUG("Fetching page %d", page_id);
To enable logging in your project, you will need to reconfigure it like this:
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=DEBUG ..
$ make -j
The different logging levels are defined in src/include/common/logger.h. After enabling logging, the logging
level defaults to LOG_LEVEL_INFO . Any logging method with a level that is equal to or higher than
LOG_LEVEL_INFO (e.g., LOG_INFO , LOG_WARN , LOG_ERROR ) will emit logging information. Note that you will need
to add #include "common/logger.h" to any file that you want to use the logging infrastructure.
Method Signatures
Please do not change any of the method signatures for any of the provided methods. You are free to add
your own helper methods.
For example, consider the Remove function in extendible_hash_table.h:
bool Remove(Transaction *transaction, const KeyType &key, const ValueType &value);
Also notice the HASH_TABLE_TYPE template on the Remove function, which is defined in extendible_hash_table.h. To add your own helper method:
1. Add function signature to the header file. Use private functions whenever possible. If you need to add
a public function, be wary of name conflicts with the grading harness. You can preempt this by
making your function names unique in some way, but there should typically not be conflicts.
2. Add the function to the .cpp file with templates. You will probably need a template before the
function name like TEMPLATE::FunctionName. If you are using KeyType, ValueType, and/or
KeyComparator, you will need the aforementioned template above the function signature.
Post all of your questions about this project on Piazza. Do not email the TAs directly with questions.
Grading Rubric
Each project submission will be graded based on the following criteria:
1. Does the submission successfully execute all of the test cases and produce the correct answer?
2. Does the submission execute without any memory leaks?
Note that we will use additional test cases that are more complex and go beyond the sample test cases
that we provide you.
Late Policy
See the late policy in the syllabus.
Submission
After completing the assignment, you can submit your implementation to Gradescope. Since we have two checkpoints for this project, you will need to submit them separately through the following link:
https://www.gradescope.com/courses/286490/
src/include/buffer/lru_replacer.h
src/buffer/lru_replacer.cpp
src/include/buffer/buffer_pool_manager_instance.h
src/buffer/buffer_pool_manager_instance.cpp
src/include/buffer/parallel_buffer_pool_manager.h
src/buffer/parallel_buffer_pool_manager.cpp
src/include/storage/page/hash_table_directory_page.h
src/storage/page/hash_table_directory_page.cpp
src/include/storage/page/hash_table_bucket_page.h
src/storage/page/hash_table_bucket_page.cpp
src/include/container/hash/extendible_hash_table.h
src/container/hash/extendible_hash_table.cpp
Alternatively, running this zip command from your working directory (aka bustub, bustub-private, etc.) will
create a zip archive called project2-submission.zip that you can submit to Gradescope. You can also put
this command in a bash file and run the bash file to make things easier for you.
$ zip project2-submission.zip src/include/buffer/lru_replacer.h \
src/buffer/lru_replacer.cpp \
src/include/buffer/buffer_pool_manager_instance.h \
src/buffer/buffer_pool_manager_instance.cpp \
src/include/buffer/parallel_buffer_pool_manager.h \
src/buffer/parallel_buffer_pool_manager.cpp \
src/include/storage/page/hash_table_directory_page.h \
src/storage/page/hash_table_directory_page.cpp \
src/include/storage/page/hash_table_bucket_page.h \
src/storage/page/hash_table_bucket_page.cpp \
src/include/container/hash/extendible_hash_table.h \
src/container/hash/extendible_hash_table.cpp
You can submit your answers as many times as you like and get immediate feedback.
Collaboration Policy
Every student has to work individually on this assignment.
Students are allowed to discuss high-level details about the project with others.
Students are not allowed to copy the contents of a white-board after a group meeting with other
students.
Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from
other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's
Policy on Academic Integrity for additional information.