Articles A New Implementation for a Fast Hash Table

emailx45

Местный
Регистрация
5 Май 2008
Сообщения
3,571
Реакции
2,438
Credits
573
A New Implementation for a Fast Hash Table
HaniAmmar - 29/Jul/2020
[SHOWTOGROUPS=4,20,22]

A merge of an array and a fast hash table
A data structure that looks like an ordered array and behaves like a hash. Any value can be accessed by its index or by its name. It preserves order of the elements and uses one heap. Therefore, is has an initial. Collisions do not cause new allocations. Instead, every item points to the next one. When expanded, it performs collection and resets its hash base to the new size.

Introduction
Hash tables are data structures that enable locating values using string keys. Some hash tables are fast in insertion, and some in finding values. A few implementations suffer from slow deletion, memory fragmentation, or their performance degrade with more items. Most hash tables do not store items by order, or retrieve a value using an index, and have to consume more memory for fast searching or to keep track of the order if they support that. However, this implementation may offer more speed with less trading.
Implementation
For a data structure to retrieve values using numbers, it needs to be a linear array, and when its memory block reaches its limit, it expands to a new heap and moves all of its items from the old memory to the new one. Therefore, it has to be a one-dimensional array. It is not an array or a hash table, but both:

template <typename Value_>
struct HArray {
size_t size_{0};
size_t capacity_{0};
Value_ *storage_{nullptr};
};
The variable size_ holds the number of items, capacity_ is for the size of the heap/memory block, and storage_ points to the location of the first value. Still, items have to store more than just values; they need the variable key_ and its hash for HArray to function as a hash table.
Hide Copy Code
template <typename Key_, typename Value_>
struct HAItem {
size_t hash_;
Key_ key_;
Value_ value_;
};

template <typename Key_, typename Value_>
struct HArray {
size_t size_{0};
size_t capacity_{0};
HAItem<Key_, Value_> *storage_{nullptr};
};
A linear array insertion looks like the following:
Hide Copy Code
private:
void insert_(Key_ key, Value_ value, size_t hash) {
if (size_ == capacity_) {
// expand
}

HAItem<Key_, Value_> &item = storage_[size_];

item.hash_ = hash;
item.key_ = key;
item.value_ = value;

++size_;
}
But, for the hash table to work, it needs a searching technique and deals with collisions. One way to solve that is to add a linked list by adding an extra variable next_ to the structure, to point to the item that comes next. Therefore, the new structure looks like:
Hide Copy Code
template <typename Key_, typename Value_>
struct HAItem {
size_t hash_;
HAItem *next_;
Key_ key_;
Value_ value_;
};
The searching technique compares a given key to a stored one. It locates it by dividing its hash value over (capacity – 1) and the remainder – of that division point to an index, a location in the allocated heap. If the compare function sought that the two keys are different, then it checks the pointer next_ to see if its item holds the same key or not, and keep on looking until it finds it or hit a null pointer. However, an infinite loop could occur.
Assume capacity_ is 21, and there are two items to be inserted: item1 has a hash value equal to 41, and item2 has its hash set to 60. The location of item1 is 41 % (21 - 1) = 41 % 20 = 1. The algorithm will lookup index 1 and set the last/null next_ pointer to point to item1. For item2, 60 % 20 = 0, thus, storage_[0].next_ is set to point to item2; assuming next_ is null. Now, if a new item comes in with a hash value of 20, 40, 21, 41 … it will point to either item1 or item2, and when it does, it will see item1 pointing to item2 and item2.next_ pointing back to item1, and when it tries to find the end of the linked list, it will end up in an infinite loop.
That is fixable by storing the start address, and checking every next_ variable, then break the loop if they match, then insert the new item between the last next_ and the one after it (it could be null). However, there is a faster way, and that way is by adding another pointer named anchor_. Therefore, the final structure of HAItem looks like:
Hide Copy Code
template <typename Key_, typename Value_>
struct HAItem {
HAItem *anchor_{nullptr};
size_t hash_{0};
HAItem *next_{nullptr};
Key_ key_{};
Value_ value_{};
};
Now next_ is for collision only, and anchor_ points to the location of the item. With that, next_ will point to forward elements (1, 2, 3, 4, 5), and CPU prefetching may bring the next item, while the search function is comparing the given key with the current item’s key_. However, anchor_ could point to anyone at that heap, and if it is null, then the item does not exist.
To understand the concept, assume that we have an array that has the capacity of 6 elements, and we need to insert the following items:
key​
Value​
Hash​
item1​
A​
10​
item2​
B​
15​
item3​
C​
21​
item4​
D​
24​
The base for the hash table is (capacity – 1), which is 5. item1 is inserted in location 0 - because 10 % 5 is 0, hence, storage_[0].anchor_ = the pointer of index 0. item2 is inserted at index 1, but its hash result points to 0 (15 % 5 = 0), the search function will lookup index 0 and finds item1, then it checks its next_ pointer to find the final pointer that has a null value and set it to point to item2. item3 is placed in index 3, and its hash points to index 1. Even though there is an item in index 1, its anchor_ is null, and the function will set storage_[1].anchor_ to point to item3. item4 will get the index 3, and its hash result points to index 4, therefore, storage_[0].anchor_ = &item3. And the data looks like:

[/SHOWTOGROUPS]
 

emailx45

Местный
Регистрация
5 Май 2008
Сообщения
3,571
Реакции
2,438
Credits
573
[SHOWTOGROUPS=4,20,22]

Image 1


There is no need for rehashing, binary tree, separated hash table, or additional pointers. Nevertheless, having an anchor_ variable means the search function can always return a reference to a pointer, and it can reference anchor_ or next_, and that pointer could point to an item or null. The search function can be defined as:

public:
Value_ *Find(const char *key) const {
HAItem_ **item =
find_(key, std::strlen(key), std::hash<const char *>{}(key));

if ((*item) != nullptr) {
return &((*item)->value_);
}

return nullptr;
}

private:
HAItem_ **find_(const char *key, size_t length, size_t hash) const {
HAItem_ **item = &(storage_[(hash % getBase_())].anchor_);

while ((*item) != nullptr) {
if (((*item)->hash_ == hash) &&
(length == (*item)->key_.length()) &&
((*item)->key_.compare(0, length, key) == 0)) {
break;
}

item = &((*item)->next_);
}

return item;
}
For getBase_():

Hide Copy Code
private:
inline size_t getBase_() const {
return ((capacity_ > 1U) ? (capacity_ - 1U) : 1U);
}
Hash tables check every inserted item to see if it exists or not, and with find_(), it returns a reference to that item. When it existed, it replaces its value with the given one, and if it does not, the reference pointer to where the new item should be placed; two birds, one stone:

public:
Value_ &operator[](const char *key) {
return Add(key, std::strlen(key));
}

Value_ &Add(const char *key, size_t length) {
if (key != nullptr) {
if (capacity_ == 0) {
Resize(1);
}

const size_t hash = std::hash<const char *>{}(key);
HAItem_ **item = find_(key, length, hash);

if ((*item) != nullptr) {
return (*item)->value_;
}

return insert_(Key_(key, length), hash, item)->value_;
}

throw std::runtime_error("key should not be null!");
}

private:
HAItem_ *insert_(Key_ &&key, size_t hash, HAItem_ **position) {
if (size_ == capacity_) {
Resize(capacity_ * 2);
position = find_(key.c_str(), key.length(), hash);
}

HAItem_ &item = storage_[size_];
item.hash_ = hash;
item.key_ = static_cast<Key_ &&>(key);
(*position) = &item;

++size_;

return &item;
}


[/SHOWTOGROUPS]
 

emailx45

Местный
Регистрация
5 Май 2008
Сообщения
3,571
Реакции
2,438
Credits
573
[SHOWTOGROUPS=4,20,22]

The only time find_() is called twice, is when the size of the memory block changes; because the base depends on the value of capacity_.

public:
void Resize(size_t new_size) {
capacity_ = ((new_size != 0) ? new_size : 2);

HAItem_ *src = storage_;
storage_ = new HAItem_[capacity_];

const size_t base = getBase_();
size_t n = 0;
size_t m = 0;

if (size_ > capacity_) {
size_ = capacity_; // Shrink
}

while (n != size_) {
HAItem_ &src_item = src[n];
++n;

if (src_item.hash_ != 0) {
HAItem_ *des_item = &(storage_[m]);
des_item->hash_ = src_item.hash_;
des_item->key_ = static_cast<Key_ &&>(src_item.key_);
des_item->value_ = static_cast<Value_ &&>(src_item.value_);

HAItem_ **position =
&(storage_[(des_item->hash_ % base)].anchor_);

while (*position != nullptr) {
position = &((*position)->next_);
}

*position = des_item;

++m;
}
}

size_ = m;
delete[] src;
}
Since Resize() does not need to check if the item exists or not, the rehashing mechanize is very simple (4 lines of code):

Hide Copy Code
HAItem_ **position =
&(storage_[(des_item->hash_ % base)].anchor_);

while (*position != nullptr) {
position = &((*position)->next_);
}

*position = des_item;

The line if (src_item.hash_ != 0) drops any deleted items, and deletion is the same as insertion, but it keeps track of the item that comes before the one that is going to remove; to set its next_ to the next_ of the next item. Also, it clears the value and the key, but keeps next_ and anchor_ intact; because they hold the hashing info.

public:
void Delete(const char *key) {
delete_(key, std::strlen(key), std::hash<const char *>{}(key));
}

private:
void delete_(const char *key, size_t length, size_t hash) {
if (key != nullptr) {
HAItem_ **item = &(storage_[(hash % getBase_())].anchor_);
HAItem_ **before = item;

while ((*item) != nullptr) {
if (((*item)->hash_ == hash) &&
(length == (*item)->key_.length()) &&
((*item)->key_.compare(0, length, key) == 0)) {
break;
}

before = item;
item = &((*item)->next_);
}

if ((*item) != nullptr) {
(*item)->hash_ = 0; // Clear hash
(*item)->key_ = Key_{}; // Clear key
(*item)->value_ = Value_{}; // Clear value
// anchor_ and next_ should not be cleared!

(*before)->next_ = (*item)->next_;
}
}
}
Since HArray is an array at its heart, it is possible to add a constructor that takes an initial size:

explicit HArray(size_t size) : capacity_(size) {
if (size != 0) {
storage_ = new HAItem_[capacity_];
}
}
Nevertheless, it is also possible to reach a key or value by an index:

Hide Copy Code
Value_ &operator[](size_t index) const {
if (index < size_) {
return (storage_[index].value_);
}

throw std::runtime_error("Index out of range!");
}

const Key_ *GetKey(size_t index) const {
if (index < size_) {
const HAItem_ &item = storage_[index];

if (item.hash_ != 0) {
return &(item.key_);
}
}

return nullptr;
}
Benchmark
Performance-wise, it is over three times faster in lookups than std::map, and over four times faster in deletion. Insertion is different; When initialized with size, then it is five and half times faster. Otherwise, it is between 25% to 80% faster; depending on the data:

Key-(0-10000000): Key-0, Key-1, Key-N

Insert​
Search​
Delete​
std::map​
2.23s​
1.39s​
2.02s​
HArray​
1.79s​
0.43s​
0.46s​
HA (Initial size)​
0.43s​
0.43s​
0.46s​
(0-10000000)-Key: 0-Key, 1-Key, N-Key

Insert​
Search​
Delete​
std::map​
2.37s​
1.5s​
2.16s​
HArray​
1.7s​
0.43s​
0.46s​
HA (Initial size)​
0.42s​
0.43s​
0.46s​
English Words:

Insert​
Search​
Delete​
std::map​
0.11s​
0.06s​
0.09s​
HArray​
0.06s​
0.02s​
0.03s​
HA (Initial size)​
0.02s​
0.02s​
0.03s​
Note: The file that contains the words is from Для просмотра ссылки Войди или Зарегистрируйся.


[/SHOWTOGROUPS]
 

emailx45

Местный
Регистрация
5 Май 2008
Сообщения
3,571
Реакции
2,438
Credits
573
[SHOWTOGROUPS=4,20,22]
1596129211082.png

1596129217457.png

1596129223577.png


To run the benchmark:

mkdir build
c++ -O3 -std=c++11 -Wall -Wextra -Werror -I ./include ./src/benchmark.cpp -o ./build/o
./build/o
Conclusion
While there are many implementations of hash tables, very few of them provide accessing elements using an index, and some of them suffer from memory fragmentation. Also, Storing items in order could cause slow lookups. This implementation takes an ordered array and adds a hashing functionality to it, to make a new hash table that is fast and more usable.

[/SHOWTOGROUPS]
 
Последнее редактирование: