|
Useful STL Sample - Dynamic Array of Generic Strings Now it is time to move from abstract exercises to something really useful. Imagine that you have to keep and maintain very big RAM-based list of text strings - several thousands of items or so. It could be anything, for example, list of project files for your next great software product (yes, today's midsize projects usually consist of few thousands files; big ones, like MAYA, over 30,000). First and simple idea that comes into mind is to use vector of strings (vector<string> myList). Is this so smart? If you have to handle 50 - 100 items, obviously yes. But what if list is over 30,000 items? Look again at the internal representation of STL containers. STL string is actually a vector of char with some additional member functions. All this means that memory manager have to handle the same number of reallocate-able blocks as total count of strings plus one. Moreover, each string will not occupy amount of memory equal to number of characters stored in string. A STL container has minimal size of storage buffer which can be obtained with capacity() member function applied to empty container. Thus, vector<string> is just not efficient (especially if its content is a large amount of empty or small strings). What we will do is to keep all strings in single storage buffer and maintain two tables: size table, which stores size of each item; and offset table, which specifies offset of an item within storage buffer. You can download all source files from SourceForge.net. These sources are Copyright (c) by Andrei Verovski and are subject of the following license agreement. Objective:
So, we can define basic types to be used in CDynGenStrArray. 1) Storage buffer and associated iterator types: typedef vector<char> 2) Size/Offset table and associated iterator types: typedef vector<size_t> Class definition/Data members And now, the CDynGenStrArray class itself: class CDynGenStrArray {
--------------------- to be continued -------------------- Note: in fact, size table is not so necessary since size of items can be calculated using offset values. However, size table simplifies design. Most data members are self-explanatory except probably one - mValid. mValid will be used to flag corruption of internal data structures by dedicated checking function. Also, it is a good idea to allow to restrict maximum length of strings. mMaxStringLen set to 255 by default, but can be increased to any value. Too long strings will be truncated by insert or push_back. CDynGenStrArray has one default constructor and virtual destructor. Since instances of CDynGenStrArray most likely will not exist, copy constructor is not required. Get info function members Next we need to write get info function members. What each function does is clear from its comments. inline size_t Size(void) const { return (mSize); }; /* Returns number of strings in array. */ inline size_t StringLenSum(void) const { return (mStrings.size()); }; /* Returns summary lenght of all strings in array. */ inline bool Empty(void) const { return (mSize == 0); }; /* Returns true if array is empty. */ inline bool Valid(void) const { return (mValid); }; /* Returns true if array internal structure is valid. */ inline size_t GetMaxStringLen(void) const { return (mMaxStringLen); }; /* Returns max lenght of strings in array. */ virtual size_t StringLen(size_t p) const { return (mSizeTable[p]); }; /* Returns lenght of specified string in array. */ Auxiliary function members virtual size_t SetMaxStringLen(size_t max_len, bool trunc_str = true); SetMaxStringLen is used to change maximum length of stored strings. If new length is less than old one, strings will truncated by default (you can change default behavior of SetMaxStringLen by passing second argument equal to false). Debugging/range checking macros Additionally, macros CDYNGENSTRARRAY_THROW_IFBADINDEX and CDYNGENSTRARRAY_THROW_IFNEGINDEX will perform range checking. #define CDYNGENSTRARRAY_THROW_IFBADINDEX(p) \ { if (p < 0 || p >= CDynGenStrArray::Size()) throw((char *) HERE); }; Pay attention to macro HERE. Its part of Utility Macros package written by Radoslav Getov. HERE expands into source file name plus line number, what especially useful for debugging and troubleshooting. Insert function members CDynGenStrArray will have only two main insert functions, other will be derived from these two. The advantage of this method is obvious - less code, less debugging. // Main insert function members. virtual void Insert(VectCIterType &it, size_t p, size_t nc); virtual void Insert(const CCharWrap &ch, size_t p); In fact, even those two can be combined into one main and one derived, but due to different nature of input I have decided to keep both. Importnat notes:
What is arcane CCharWrap class? As I have explained before, CDynGenStrArray should support different format of input: C&Pascal style strings, STL strings, vector<char>. C&Pascal style strings may be translated as char* plus char count specified by different manner. So, we need three insert functions with input of type char*:
In this case we cannot overload Insert. Solution - use intermediate wrapper classes for these three types of input. Each wrapper class will carry pointer to string, default constructor which initializes it and operator () for data access. struct CCharWrap {
}; struct CCStrWrap {
}; struct CPStrWrap {
}; Then, calls of insert with wrapper classes will look like this: a.Insert(CCharWrap("Blah Blah Blah", 4), 0); // Insert piece of text, first four characters. a.Insert(CCStrWrap("C String Sample"), 4); // Insert C string. a.Insert(CPStrWrap("\pPascal String Sample"), 2); // Insert Pascal string. Since none of wrapper classes involves memory copying operations and all function members are inline, they do not cause slowdown. Pay attention that all wrapper have input of generic pointer type void*. This allow to avoid explicit pointer conversion, since different compiler may treat sequences like "abcdefgh" as char or as unsigned char. For example, typical Macintosh C++ compiler will recognize "abcdefgh" (C NULL terminated string) as char, but "\pabcdefgh" (Pascal string with first item indicating length, where prefix "\p" instructs compiler to generate such item) as unsigned char, what is not the same. Below is list of other insert function members which derived from the first two:
// Derived insert function members.
virtual void Insert(VectCConstIterType &it, size_t p, size_t nc) {
};
virtual void Insert(VectCIterType &it1, VectCIterType &it2, size_t p) {
}; virtual void Insert(VectCConstIterType &it1, VectCConstIterType &it2, size_t p) {
}; virtual void Insert(VectCType &v, size_t p) {
}; virtual void Insert(const VectCType &v, size_t p) {
}; // Insert STL string. virtual void Insert(const string &str, size_t p) { CDynGenStrArray::Insert(CCharWrap(str.data(), str.length()), p); }; virtual void Insert(const CPStrWrap &pstr, size_t p) {
}; virtual void Insert(const CCStrWrap &cstr, size_t p) {
}; Push back function members Now, on the basis of insert methods, we can construct push back function members. They will use the same arguments except index. Just one example: virtual void PushBack(const CPStrWrap &pstr) { CDynGenStrArray::Insert(pstr, mSize); }; All the rest of push back function members will be written the same way. Erase function members There will be four erase function members: first, which erases entire content; second, which erases item at specified position; third, which erases items within specified range, and fourth, which erases last item. virtual void Erase(void); virtual void Erase(size_t p); virtual void Erase(size_t sp, size_t ep);
virtual size_t PopBack(void);
Items retrieval function members virtual size_t Get(
virtual size_t Get(string &out_str, size_t p, long nc = CDynGenStrArray::kEntireString) const; virtual size_t Get(char *out_str, size_t p, long nc = CDynGenStrArray::kEntireString) const; First copies item at specified position into vector. It may insert or overwrite destination content depending of the value of last argument. Exact location within output vector specified with iterator. This function checks size of output vector in order to prevent overrun, but is do not verifies if iterator is valid (i.e. if iterator points past container end). Second copies item at specified position into STL string. It resizes string if necessary. Third copies item into buffer specified with pointer (char*). There is no way for Get() to verify size of destination buffer, so this function should be used with caution. All items retrieval functions copy entire content (of an item) by default. Memory management function members As mentioned before, in many cases STL may eliminate need for use of low-level memory management routines at all. The only one memory management function Reserve(size_t str_len_sum, size_t total_items) makes possible to avoid frequent reallocations of storage buffer and tables by means of allocating extra space estimated by total number of items and their summary size. By the way, CDynGenStrArray::Reserve(...) uses only STL reserve for its job. Validation function members virtual bool Validate(void); Checks integrity of internal data structures and sets mValid flag to false if array's data is invalid. However, Validate() assumes that array object itself as well as its data members are not corrupted. Output function members virtual void Cout(size_t p) const; virtual void Cout(size_t sp, size_t ep, bool endl_req = true) const; Outputs either one either range of items to standard console. Design considerations Now think about these issues:
Is it possible to overcome all or at least part of these shortcomings? Definitely, yes. How? Read next chapter... |