Friday, April 25, 2008

C++ STL Hash Containers and Performance

C++ STL Hash Containers and Performance

If you understand the details of your application and data, hash containers are powerful tools to add to your performance toolbox.

By Thomas Johnson, Dr. Dobb's Journal
Apr 05, 2007
URL:http://www.ddj.com/cpp/198800559

Thomas works as a Project Engineer in Advertising.com's Research and Development department, where he leads software projects that implement newly developed algorithms. Thomas can be contacted at thomas.j.johnson@gmail.com.


Whether you're writing business applications, website backend code, games, or nearly any other modern application, it's likely that you need to quickly access large amounts of data and perform calculations on this data. For instance, at Advertising.com we have to rapidly perform complex mathematical calculations on huge amounts of in-memory data so that we can choose the most relevant ads to show website visitors. Of course, there are many different profiling and optimization techniques you can use once your application is completed. But even if you believe that "premature optimization is the root of all evil," you should begin thinking about application performance as early as possible. Why? Because the proper choice of data structures is a make-or-break architectural decision for performance-critical applications.

Fortunately for C++ programmers, the C++ Standard Template Library provides a wealth of commonly used data structures that can be easily incorporated into most programs. Many of the simpler data structures are familiar: The performance trade-offs of vectors, linked lists, and queues are taught in introductory computer-science courses. It's also straightforward to understand and use sorted associative containers like maps and sets without knowing what's going on under the hood, particularly if it's been a while since you've looked at the details of Red/Black Tree algorithms that typically underlie them.

But understanding implementation details and design trade-offs is critical for using some of the most efficient and powerful data structures available in STL distributions—hash-based structures that include hash_map, hash_set, hash_multimap, and hash_multiset. Although these structures are not yet part of the C++ Standard, they are included in many STL packages, including the SGI version that ships with most Linux distributions. They will also be included in TR1 under the slightly different names of unordered_map, unordered_multimap, unordered_set, and unordered_multiset. The examples in this article use the SGI STL implementation. But even if you use a different STL, the implementations are likely to be similar.

Each of these data structures provides functionality similar to their nonhash-based counterparts:

  • hash_maps store key/value pairs that have unique keys.
  • hash_sets store unique values.
  • hash_multimaps store key/value pairs with nonunique keys.
  • hash_multisets store nonunique values.

For many applications, the speedup that can be gained by using, for instance, hash_maps instead of regular maps is significant. In theory, hash-based containers can provide constant-time insertion, deletion, and access to any values in the containers. In contrast, the Red/Black Trees that power STL's maps, sets, multimaps, and multisets require O(log n) time for the equivalent operations.

Factors That Impact Performance

But hash-based containers are not a silver bullet, and wall-clock time varies dramatically based on the hashing function used, the number of objects stored in the container, the time it takes to check values for equality, and other factors. Knowing how these factors interact is crucial to harnessing the performance improvements offered by hash containers.

To understand how each of these factors impacts performance, you have to understand how hash-based containers are implemented. Consider hash sets.

Hash sets can be thought of as arrays of linked lists. Each array element, denoted in Figure 1 as B0...B4, is a "bucket." When a new item is inserted into the hash set, the item's hash value is computed. Next, the bucket number is computed by taking the hash value modulo the number of buckets. In Figure 1, item A, whose hash value is 7, is stored in bucket 2 (Hash value of 7 modulus 5 buckets=bucket 2). When you call find(A) to check if item A is in the hash set, the program computes A's hash value of 7, and looks in bucket 2 to see if A is there. Since there is only one element in bucket 2, this is relatively fast. But even if there is only one element in a bucket, find() still needs to check if this is the particular element that you are searching for, which involves a call to operator==. If equality comparison is slow, the performance of your hash container degrades.

Figure 1: Array elements.

Now suppose you insert another item, item Z, into your hash set. If item Z has a hash value of 12, and you do not increase the number of buckets in the hash set, then item Z also ends up in bucket 2 (Figure 2). When more than one item is in a hash bucket, find() may become slower. If you call find(Z), the program again looks in bucket 2, but it first has to examine item A before moving on to the next element in the linked list and telling you that Z is in fact in your hash set. Thankfully, as the number of items in your hash set grows, the number of buckets automatically increases. This reduces the chance of two items with different hash values ending up in the same bucket.

Figure 2: Inserting another item into a hash set.

But what if you had many items that had the same hash value, as in Listing One? When two different items hash to the same value, there is a "hash collision." These items always end up in the same bucket, no matter how many buckets you have. This significantly degrades the performance of the collection. If the hash collisions happen often enough, finding items in the container effectively becomes O(n) instead of O(1) because you have to traverse the linked lists that form in the hash buckets.

struct terribleHasher {
size_t operator()(const myClass& myObj) const {
return 1;
}
};
Listing One

In addition to being relatively unique, hash values should also be very fast to compute. The SGI STL provides some basic hash functions. Most primitives hash to themselves: integers, characters, and longs all simply cast to size_t, which is the return type for STL-compatible hash functions. The SGI STL also includes functions to hash char*s (strings, for instance) that iterate over each character and apply a simple recursive formula.

Handling other simple types is nearly as easy. Just as with primitives, it makes sense to hash anything that can be quickly and uniquely cast to a size_t to its own value. This cast is trivial and therefore extremely fast. Since it is obviously unique, it should be preferred over other hash functions whenever possible. But rather than reimplementing this hashing concept for each type that can be cast to size_t, use the template in Listing Two.

template<>
struct SizeTCastHasher {
size_t operator()( const T_TypeToHash& i_TypeToHash ) const {
return size_t( i_TypeToHash );
}
};
Listing Two

One use case for the SizeTCastHasher is hashing a pointer. For instance, suppose you are writing a program for an Internet advertising business. You have a class adFactory that creates objects of type myAdvertisement. The factory class also places the ads on websites by calling myWebsite.placeAd(); see Listing Three. The website keeps track of the ads that have been placed on it by putting each ad into a hash_set. Since each myAdvertisement is created at a new location in memory, you can use the address of the myAdvertisement object (the value of the pointer) as the key into the map, and cast the pointer to size_t when you need to calculate the hash value of the key. Of course, this restricts memory management somewhat since two identical objects that are at different memory locations won't hash to the same value. Consequently, you must ensure that you have only one copy of a given myAdvertisement in memory. However, in exchange for satisfying this restriction, calling lookup functions like website.doWeShowAd() is now very fast because the hash function is both computationally trivial and hashes each object to a unique value.

class adFactory {
...
myAdvertisement* createAd(TAdType adType) {
myAdvertisement* newAd=new myAdvertisement(adType);
myWebsite* website=websiteFactory.getBestWebsite(newAd);
website.placeAd(newAd);
return newAd;


}
...
};
class website {
public:
...
void placeAd(const myAdvertisement* const ad) {
myAds.insert(ad);
}
bool doWeShowAd(myAdvertisement* ad) const {
return myAds.find(ad)!=myAds.end();
}
...
private:
__gnu_cxx::hash_set<> > myAds;
};

Listing Three

When designing hash functions for more complex types, it is important to know as much as possible about your data. The range and distribution of the data is particularly important. For example, suppose that in the advertising application, you want to know how many people saw a given ad on a given website. One way to implement this would be a hash map that hashes an ad and a website to a size_t. Further suppose that each ad and website has a unique identifier associated with it. For instance, this could be a row number that our application assigned to the ad or website when it was loaded from a database. In this scenario, you would like the hash map to hash a pair of unsigned integers (the row number of the ad and website) to another unsigned integer. If you make the assumption that the IDs of the website and ad are both smaller than 216, then you can pack both identifiers into a single size_t without any hash collisions. ShiftedPairHasher (in Listing Four) does exactly that.

struct ShiftedPairHasher {
size_t operator()(std::pair key) {
return (key.first << ( sizeof(unsigned int) * 8/2)) ^ key.second; } }
Listing Four

How the Code Works

Here's how this code works. Conceptually, you are placing key.first into the upper 16 bits of the size_t and key.second into the lower 16 bits. To do this, you use the bitwise <<> to shift key.first the appropriate number of bits to the left. The number of bits to shift is half the number of bits in our key type, which you calculate with the formula sizeof(unsigned int)*8/2. Next, you use ^, the bitwise XOR operator, to put key.second into the lower half of the hash value. I have chosen to use XOR instead of OR because if key.second grows beyond 16 bits, XOR yields hash values that are more unique than OR, since OR sets the 17th bit to 1 for all values of key.second larger than 65536. ShiftedPairHasher is quite fast: It does just one left shift operation and one XOR operation. The compiler optimizes the sizeof(unsigned int)*8/2 to a constant at compile time. This same concept can be extended if you want to hash more than two values simply by shifting each value the appropriate number of bits and then XORing it with the other shifted values.

With proper hashers, hash containers can give you O(1) access time. Nonetheless, even if you are able to create excellent hash functions, there are certain cases when nonhashed containers actually give better runtime performance. Consider the case where you create a huge number of containers that hold only a few values, and you will be doing only a few lookups on each container (Listing Five). If you typedef TMap to a regular map in doMapBenchmark(), this program takes about 5.9 seconds to execute (all timings were generated on a Pentium 4 2.8 GHz). But if you change TMap to be a hash_map, the same program takes nearly 16.5 seconds. The reason for this 280 percent increase in runtime lies in how memory is allocated when an application instantiates hash maps and maps.

#include 
#include
#include
int doMapBenchmark() {
typedef __gnu_cxx::hash_map TMap;
//typedef std::map TMap;
std::vector myMaps;
int mySum=0;
for(int mapCnt=0; mapCnt<1000000; i="0;" mysum="myMap.find(0)-">second+myMap.find(1)->second;
myMaps.push_back(myMap);
}
return mySum;
}
Listing Five

Because maps are implemented as binary trees, the map constructor only needs to create a few pointers when the application instantiates a new map. But hash maps are implemented as arrays of linked lists. When you create a new hash map, by default, the STL creates 193 buckets and inserts a NULL pointer into each one. If you are creating a large number of hash maps, this can take a long time. The problem can be somewhat mitigated by asking for a smaller number of buckets. In the benchmark program (Listing Five), you can change "TMap myMap;" to "TMap myMap(0);". Although you won't get zero buckets, you'll get the smallest number of buckets that your STL's implementation offers; in the case of the SGI STL it's 53 buckets. Making this change reduces the benchmark's wall-clock time to about 8.8 seconds—nearly a 50 percent decrease in runtime.

It is also important to consider how long it takes to compute the hash function relative to the function's uniqueness. For instance, if you have two hash functions—one of which produces many hash collisions but is much faster to calculate—and another—which is slower but produces very few collisions—it may make sense to choose the faster one even though this leads to scaling that is closer to O(n) than O(1). For instance, this kind of strategy may make sense when dealing with long strings. Because the default string hasher iterates over each character in the string when calculating the hash, the lookups become O(l) where l is the length of the string. If the characters of the string are well distributed, wall-clock time may be improved by hashing fewer characters, as in Listing Six (available electronically; see "Resource Center," page 5). In this example we are inserting 10,000 randomly generated strings of length 10,000 into a hash map and then looking up each one 10,000 times. If we use the default string hasher, this takes 17.1 seconds. However, if we calculate the hash based only on the first three characters of the string, the runtime is reduced to 8.4 seconds. Clearly the calculation time of the hash function dominates the additional linked list traversals required to resolve hash collisions.

STL Options

If the STL doesn't have a hash function that suits your needs, take a look at the Boost library's hash functions before crafting your own. The boost::hash library (www.boost .org/doc/html/hash.html) provides hashing functions for many primitives, but even more useful is its hash_combine function. This function takes two hash values and calculates a new, unique hash from them. The Boost hashing library includes templates that apply hash_combine recursively to calculate hash values for multiobject data structures such as arrays, pairs, and the STL containers. Starting from a user-specified seed value, the hash value is calculated by iterating over each individual object and combining the object's hash value with the previously calculated hash value (or the seed value if the particular object is the first one in the container). This is a useful concept, but again, the performance of hash containers and their associated hash functions are very heavily dependent on the details of the data. Whenever you have a choice between multiple hash functions, you should always benchmark each option before committing to a final decision.

Conclusion

Hash-based containers can provide a significant speed boost to your application under certain conditions. Still, they aren't appropriate in every situation, and the performance gain is greatly influenced by your choice of hash function, which in turn depends on the properties of the data you are hashing.

If you understand the details of your application and data, hash containers are a powerful tool to add to your performance toolbox.

No comments: