Useful STL Sample - Dynamic Array of Generic Strings

Now it is time to move from abstract exercises to something really useful. Imagine that you have to keep and maintain very big RAM-based list of text strings - several thousands of items or so. It could be anything, for example, list of project files for your next great software product (yes, today's midsize projects usually consist of few thousands files; big ones, like MAYA, over 30,000).

First and simple idea that comes into mind is to use vector of strings (vector<string> myList). Is this so smart? If you have to handle 50 - 100 items, obviously yes. But what if list is over 30,000 items? Look again at the internal representation of STL containers. STL string is actually a vector of char with some additional member functions. All this means that memory manager have to handle the same number of reallocate-able blocks as total count of strings plus one. Moreover, each string will not occupy amount of memory equal to number of characters stored in string. A STL container has minimal size of storage buffer which can be obtained with capacity() member function applied to empty container. Thus, vector<string> is just not efficient (especially if its content is a large amount of empty or small strings).

What we will do is to keep all strings in single storage buffer and maintain two tables: size table, which stores size of each item; and offset table, which specifies offset of an item within storage buffer.

You can download all source files from SourceForge.net. These sources are Copyright (c) by Andrei Verovski and are subject of the following license agreement.

Objective:

  • Construct class CDynGenStrArray (stands as Dynamic Generic String Array) whose design outlined above.
  • Class CDynGenStrArray should be platform-neutral.
  • Only standard C++ features should be utilized.
  • Third-party libraries (if any) must be free and available in source code format.
  • Class CDynGenStrArray must support different string formats: C&Pascal style strings, STL strings, vector<char>.
  • Class CDynGenStrArray should provide only basic functionality (get info, insert, erase). Advanced features like sort and other algorithms will be available through derived class of CDynGenStrArray (later you will understand, why).
  • Foolproof implementation (range checking of input data is essential).

So, we can define basic types to be used in CDynGenStrArray.

1) Storage buffer and associated iterator types:

typedef vector<char> VectCType;
typedef VectCType::iterator VectCIterType;
typedef VectCType::const_iterator VectCConstIterType;

2) Size/Offset table and associated iterator types:

typedef vector<size_t> VectSizeTType;
typedef VectSizeTType::iterator VectSizeTIterType;
typedef VectSizeTType::const_iterator VectSizeTConstIterType;

Class definition/Data members

And now, the CDynGenStrArray class itself:

class CDynGenStrArray {

protected:

 

/* Data members */

VectCType mStrings;

VectSizeTType mOffsetTable, mSizeTable;

size_t mMaxStringLen;

size_t mSize;

bool mValid;

--------------------- to be continued --------------------

Note: in fact, size table is not so necessary since size of items can be calculated using offset values. However, size table simplifies design. Most data members are self-explanatory except probably one - mValid. mValid will be used to flag corruption of internal data structures by dedicated checking function. Also, it is a good idea to allow to restrict maximum length of strings. mMaxStringLen set to 255 by default, but can be increased to any value. Too long strings will be truncated by insert or push_back.

CDynGenStrArray has one default constructor and virtual destructor. Since instances of CDynGenStrArray most likely will not exist, copy constructor is not required.

Get info function members

Next we need to write get info function members. What each function does is clear from its comments.

inline size_t Size(void) const { return (mSize); }; /* Returns number of strings in array. */

inline size_t StringLenSum(void) const { return (mStrings.size()); }; /* Returns summary lenght of all strings in array. */

inline bool Empty(void) const { return (mSize == 0); }; /* Returns true if array is empty. */

inline bool Valid(void) const { return (mValid); }; /* Returns true if array internal structure is valid. */

inline size_t GetMaxStringLen(void) const { return (mMaxStringLen); }; /* Returns max lenght of strings in array. */

virtual size_t StringLen(size_t p) const { return (mSizeTable[p]); }; /* Returns lenght of specified string in array. */

Auxiliary function members

virtual size_t SetMaxStringLen(size_t max_len, bool trunc_str = true);

SetMaxStringLen is used to change maximum length of stored strings. If new length is less than old one, strings will truncated by default (you can change default behavior of SetMaxStringLen by passing second argument equal to false).

Debugging/range checking macros

Additionally, macros CDYNGENSTRARRAY_THROW_IFBADINDEX and CDYNGENSTRARRAY_THROW_IFNEGINDEX will perform range checking.

#define CDYNGENSTRARRAY_THROW_IFBADINDEX(p) \

{ if (p < 0 || p >= CDynGenStrArray::Size()) throw((char *) HERE); };

Pay attention to macro HERE. Its part of Utility Macros package written by Radoslav Getov. HERE expands into source file name plus line number, what especially useful for debugging and troubleshooting.

Insert function members

CDynGenStrArray will have only two main insert functions, other will be derived from these two. The advantage of this method is obvious - less code, less debugging.

// Main insert function members.

virtual void Insert(VectCIterType &it, size_t p, size_t nc);

virtual void Insert(const CCharWrap &ch, size_t p);

In fact, even those two can be combined into one main and one derived, but due to different nature of input I have decided to keep both.

Importnat notes:

  1. Both insert functions will throw an exception if specified index is less than zero.
  2. If specified index is past the end of array, insert will append item.
  3. Both insert functions will make prior reservation of memory for data and tables. Thus, in case of memory shortage, exception will be thrown before array's data is modified.

What is arcane CCharWrap class? As I have explained before, CDynGenStrArray should support different format of input: C&Pascal style strings, STL strings, vector<char>. C&Pascal style strings may be translated as char* plus char count specified by different manner. So, we need three insert functions with input of type char*:

  1. char* plus char count - generic insert function for char* type.
  2. char* with last NULL character - C string.
  3. char* or unsigned char* (on MacOS) with the first item specifying string length - Pascal style string.

In this case we cannot overload Insert. Solution - use intermediate wrapper classes for these three types of input. Each wrapper class will carry pointer to string, default constructor which initializes it and operator () for data access.

struct CCharWrap {

char *mC;

size_t mNc;

CCharWrap(const void *c, size_t nc)

{ mC = const_cast<char *>((char *) c), mNc = nc; };

char* operator()(void) const { return (mC); };

size_t Count(void) const { return (mNc); };

};


struct CCStrWrap {

char *mCStr;

CCStrWrap(const void *cstr)

{ mCStr = const_cast<char *>((char *) cstr); };

char* operator()(void) const { return (mCStr); };

};


struct CPStrWrap {

char *mPStr;

CPStrWrap(const void *pstr)

{ mPStr = const_cast<char *>((char *) pstr); };

char* operator()(void) const { return (mPStr); };

size_t Len(void) const { return (mPStr[0]); };

};


Then, calls of insert with wrapper classes will look like this:

a.Insert(CCharWrap("Blah Blah Blah", 4), 0); // Insert piece of text, first four characters.

a.Insert(CCStrWrap("C String Sample"), 4); // Insert C string.

a.Insert(CPStrWrap("\pPascal String Sample"), 2); // Insert Pascal string.


Since none of wrapper classes involves memory copying operations and all function members are inline, they do not cause slowdown. Pay attention that all wrapper have input of generic pointer type void*. This allow to avoid explicit pointer conversion, since different compiler may treat sequences like "abcdefgh" as char or as unsigned char. For example, typical Macintosh C++ compiler will recognize "abcdefgh" (C NULL terminated string) as char, but "\pabcdefgh" (Pascal string with first item indicating length, where prefix "\p" instructs compiler to generate such item) as unsigned char, what is not the same.

Below is list of other insert function members which derived from the first two:

// Derived insert function members.

virtual void Insert(VectCConstIterType &it, size_t p, size_t nc)

{

VectCIterType i = const_cast<VectCIterType>(it);

CDynGenStrArray::Insert(i, p, nc);

};

virtual void Insert(VectCIterType &it1, VectCIterType &it2, size_t p)

{

long nc = distance(it1, it2);

CDYNGENSTRARRAY_THROW_IFNEGINDEX(nc);

CDynGenStrArray::Insert(it1, p, (size_t) nc);

};

virtual void Insert(VectCConstIterType &it1, VectCConstIterType &it2, size_t p)

{

long nc = distance(it1, it2);

CDYNGENSTRARRAY_THROW_IFNEGINDEX(nc);

CDynGenStrArray::Insert(it1, p, (size_t) nc);

};

virtual void Insert(VectCType &v, size_t p)

{

VectCIterType i1 = v.begin(), i2 = v.end();

CDynGenStrArray::Insert(i1, i2, p);

};

virtual void Insert(const VectCType &v, size_t p)

{

VectCConstIterType i1 = v.begin(), i2 = v.end();

CDynGenStrArray::Insert(i1, i2, p);

};


// Insert STL string.

virtual void Insert(const string &str, size_t p)

{ CDynGenStrArray::Insert(CCharWrap(str.data(), str.length()), p); };


virtual void Insert(const CPStrWrap &pstr, size_t p)

{

CDynGenStrArray::Insert(CCharWrap(pstr() + 1, pstr.Len()), p);

};

virtual void Insert(const CCStrWrap &cstr, size_t p)

{

size_t len = strlen(cstr());

CDynGenStrArray::Insert(CCharWrap(cstr(), len), p);

};


Push back function members

Now, on the basis of insert methods, we can construct push back function members. They will use the same arguments except index. Just one example:

virtual void PushBack(const CPStrWrap &pstr)

{ CDynGenStrArray::Insert(pstr, mSize); };

All the rest of push back function members will be written the same way.

Erase function members

There will be four erase function members: first, which erases entire content; second, which erases item at specified position; third, which erases items within specified range, and fourth, which erases last item.

virtual void Erase(void);

virtual void Erase(size_t p);

virtual void Erase(size_t sp, size_t ep);

virtual size_t PopBack(void);

Items retrieval function members

virtual size_t Get(

VectCType &out_v,

VectCIterType &out_it,

size_t p,

long nc = CDynGenStrArray::kEntireString,

bool insert_copy = false) const;

virtual size_t Get(string &out_str, size_t p, long nc = CDynGenStrArray::kEntireString) const;

virtual size_t Get(char *out_str, size_t p, long nc = CDynGenStrArray::kEntireString) const;

First copies item at specified position into vector. It may insert or overwrite destination content depending of the value of last argument. Exact location within output vector specified with iterator. This function checks size of output vector in order to prevent overrun, but is do not verifies if iterator is valid (i.e. if iterator points past container end).

Second copies item at specified position into STL string. It resizes string if necessary.

Third copies item into buffer specified with pointer (char*). There is no way for Get() to verify size of destination buffer, so this function should be used with caution.

All items retrieval functions copy entire content (of an item) by default.

Memory management function members

As mentioned before, in many cases STL may eliminate need for use of low-level memory management routines at all. The only one memory management function Reserve(size_t str_len_sum, size_t total_items) makes possible to avoid frequent reallocations of storage buffer and tables by means of allocating extra space estimated by total number of items and their summary size. By the way, CDynGenStrArray::Reserve(...) uses only STL reserve for its job.

Validation function members

virtual bool Validate(void);

Checks integrity of internal data structures and sets mValid flag to false if array's data is invalid. However, Validate() assumes that array object itself as well as its data members are not corrupted.

Output function members

virtual void Cout(size_t p) const;

virtual void Cout(size_t sp, size_t ep, bool endl_req = true) const;

Outputs either one either range of items to standard console.

Design considerations

Now think about these issues:

  • Insert, move and swap take more time at the beginning of the array.
  • Sort will be extremely time consuming if array stores large number of items.
  • STL algorithm library cannot be used with CDynGenStrArray.

Is it possible to overcome all or at least part of these shortcomings? Definitely, yes. How? Read next chapter...


MacGuruHQ and MacGuruHQ Logo

Copyright© 1996 - 2002 by Andrei Verovski. All right reserved. Content may not be copied in whole or in part, through electronic or other means, without the author's permission. MacGuru Logo (blue penguin) by Sergey Orehov.