
Talk:C++ string handling

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Spitzak (talk | contribs) at 23:08, 5 November 2012 (bytes, char16_t and char32_t). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
WikiProject Computing: Software (Stub-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Stub-class on Wikipedia's content assessment scale.
This article has been rated as Low-importance on the project's importance scale.
This article is supported by WikiProject Software.
WikiProject C/C++ (Unassessed, Mid-importance)
This article is within the scope of WikiProject C/C++, a collaborative effort to improve the coverage of C and C++ topics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
This article has been rated as Mid-importance on the importance scale.

State of auto strings

If a string is declared as an auto variable, i.e. a local variable in a function, and is not initialised, does it have a defined state (e.g. a zero-length string), or is it garbage, as simpler auto variables such as int and char * are? The article should say. -- Ralph Corderoy (talk) 11:52, 12 September 2009 (UTC)

It's the former: the default constructor runs and leaves the string empty. Yes, the article could use some improvement. Regards, decltype (talk) 12:22, 12 September 2009 (UTC)
The article is incomplete, not factually wrong, so it needs the {{stub}} template rather than {{disputed}}.
Thanks for reinstating the template. I promise I will fix the issues known to me as soon as possible. decltype (talk) 20:06, 3 October 2009 (UTC)
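
The answer above can be checked with a minimal sketch. The function name make_default_local is hypothetical and only serves the illustration; the behaviour shown (a default-constructed std::string is empty) is what the standard specifies for the default constructor.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns a local ("auto") std::string that was
// declared without an explicit initialiser.
std::string make_default_local() {
    std::string s;        // the default constructor runs here
    assert(s.empty());    // the string is empty, not garbage
    return s;
}
```

Unlike an uninitialised int or char *, reading s here is well defined, because a class type with a default constructor is always constructed.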

incorrect

"when two c-strings are compared, it is implementation defined as to whether the contents or addresses are compared."

Huh? No, it's not implementation defined, it's definitely an address compare. The only freedom is that in:

 const char *p1 = "hello";
 const char *p2 = "hello";

... the compiler is allowed to share storage for the two literals, so p1 == p2 may or may not be true.
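
To make the distinction concrete, here is a small sketch contrasting an address comparison with a content comparison. The helper names same_address, same_contents and demo_distinct_arrays are hypothetical; std::strcmp is the standard library function from <cstring>.

```cpp
#include <cassert>
#include <cstring>

// Address comparison: true only when both pointers refer to the same storage.
bool same_address(const char* a, const char* b) { return a == b; }

// Content comparison: true when both C strings hold the same characters.
bool same_contents(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Two distinct arrays with identical contents. They are separate objects,
// so their addresses differ even though the characters match.
bool demo_distinct_arrays() {
    const char a[] = "hello";
    const char b[] = "hello";
    assert(same_contents(a, b));
    return same_address(a, b);
}
```

With arrays the addresses are always different; only with two string literals is the result of the pointer comparison left open.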

mem usage?

I just had a young programmer tell me that an uninitialized std::string uses less memory than an initialized one. Is that true? (I guess that would depend on the implementation, but consider gcc, for example.) The code I found suspect was:

 #include <string>

 class blah {
   private:
      std::string name;
   public:
      blah (std::string in) {
         if (!in.empty()) name = in;  // claimed savings of memory
      }
 };

linas (talk) 03:31, 27 January 2008 (UTC)

That sounds bogus to me. Even if you don't touch name, it gets default-initialized before the constructor body runs. You can always look at the source, though. -- Ben FrantzDale (talk) 06:10, 27 January 2008 (UTC)
Looking at the libstdc++ source is easier said than done. But I did run an experiment with sbrk(0), and the result was no effect. Wonder why he thought that ... linas (talk) 21:07, 27 January 2008 (UTC)
Conceptually, a std::string looks like this:
struct string { size_t length; char* contents; };
With an uninitialized instance the char* is just 0. Otherwise it points to a memory block (allocated with new char[] or malloc). A string of length 0 can be represented with no memory block, or with a memory block containing just the terminating zero. Since heap management operates with some granularity (e.g. in units of 16 bytes), you will waste 16 bytes in the latter case.
Note that there is no direct mapping between malloc and sbrk. The run-time library typically acquires memory in huge chunks from the OS.
--Alba7 (talk) 19:48, 30 August 2008 (UTC)[reply]
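
As a rough sketch of the point made in this thread, the two hypothetical classes below (Blah and BlahPlain, reconstructed from the question, not taken from any real code base) differ only in whether the emptiness check is present. The member is constructed in both cases, so the object size is the same. Whether copying an empty string touches the heap is implementation-specific and is not measured here.

```cpp
#include <cassert>
#include <string>

// Hypothetical reconstruction of the class from the question.
class Blah {
public:
    explicit Blah(const std::string& in) {
        // name has already been default-constructed (empty) at this point.
        if (!in.empty()) name = in;
    }
    const std::string& get() const { return name; }
private:
    std::string name;
};

// The same class without the emptiness check.
class BlahPlain {
public:
    explicit BlahPlain(const std::string& in) : name(in) {}
    const std::string& get() const { return name; }
private:
    std::string name;
};

// The check does not change the size of the object itself.
static_assert(sizeof(Blah) == sizeof(BlahPlain),
              "object size does not depend on the check");

// Hypothetical check: both variants end up holding the same contents.
bool same_result(const std::string& in) {
    Blah a(in);
    BlahPlain b(in);
    assert(a.get().size() == b.get().size());
    return a.get() == b.get();
}
```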

null characters

Just curious whether the string class accepts null characters. I would assume it does. -- Preceding unsigned comment added by 66.102.196.17 (talk) 00:56, 28 February 2008 (UTC)

I dug around in the gcc header files and found the following comment in basic_string.h: "1. String really contains _M_length + 1 characters: due to 21.3.4 must be kept null-terminated." I am still not sure what that fully means, so I guess I will have to test it. That is a lot to go through for a curiosity. I am starting to think it must be possible, though; how else would someone do binary file I/O? -- Preceding unsigned comment added by 66.102.196.44 (talk) 03:03, 7 March 2008 (UTC)
It appears to. It's not easy to add them, though: string foo = "asdf\0asdf"; just sets foo to "asdf", because the constructor treats the first null as the terminator and never sees the second half of the literal. But you can call str.push_back('\0'); the length will increase, and you can then append non-null characters after the null. -- Ben FrantzDale (talk) 03:19, 8 March 2008 (UTC)
No need to check anything or experiment. std::string does support \0. It's in the standard. Of course, C strings still do not. 194.237.142.20 (talk) 15:05, 19 March 2010 (UTC)
Also, string.assign("asdf\0asdf", 9) will make the string contain the null byte. Spitzak (talk) 19:49, 19 March 2010 (UTC)
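
The three approaches mentioned in this thread can be put side by side in a short sketch. The function names are hypothetical; the constructors and push_back are the standard std::string members.

```cpp
#include <cassert>
#include <string>

// Built from a plain C string: the constructor stops at the first '\0'.
std::string from_c_string() { return std::string("asdf\0asdf"); }

// Built with an explicit length: all 9 bytes, including the '\0', are kept.
std::string from_counted() { return std::string("asdf\0asdf", 9); }

// Built by appending: push_back accepts '\0' like any other char.
std::string from_push_back() {
    std::string s = "asdf";
    s.push_back('\0');
    s += "asdf";
    assert(s.size() == 9);
    return s;
}
```

Note that passing the result to a C API through c_str() would again make the consumer stop at the embedded null.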

character sets

Does the C++ standard define which character sets the string class stores? I would assume it only handles ASCII (or perhaps you can store UTF-8, but it won't guarantee correct operation for some kinds of manipulation), but I can't recall ever seeing this mentioned in the docs. I was just looking at GLib and wondering why they bothered reimplementing a lot of the STL; then I figured proper UTF-8 support might be the reason. If it is a major difference, perhaps the article should be expanded to compare and contrast std::string with other libraries' string classes. Yanroy (talk) 20:17, 18 July 2008 (UTC)

Class std::string is actually just an instantiation of a template:
typedef basic_string<char> string;
You can also use wchar_t instead of char (std::wstring) to hold UTF-16 or UTF-32 code units, depending on how wide wchar_t is on the platform.
--Alba7 (talk) 16:54, 23 October 2008 (UTC)[reply]
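
The typedef relationship can be verified at compile time. This is a minimal sketch assuming a C++11 compiler; wide_length is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are instantiations of the same template.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// size() counts elements of the character type, whatever its width.
std::size_t wide_length() {
    std::wstring w = L"abc";
    assert(sizeof(w[0]) == sizeof(wchar_t));
    return w.size();
}
```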

It can handle UTF-8 or any other byte-based encoding. You have copied the typical fallacy of defining "correct operation" as "something other than treating the string as bytes". In fact you cannot handle UTF-8 correctly unless you treat the string as bytes: for instance, it is impossible to reserve enough space to store a string unless you know how many bytes are in it, and it is impossible to quickly and reliably locate a position in the string unless that position is defined in bytes. There is a lot of obsolete documentation that says "character" where it means "byte". That erroneous documentation is what needs to be fixed, not some perceived need to turn string manipulation into an impossibly complex attempt to measure strings with some other metric (often called "characters" but usually meaning "UTF-16 code units"). Actually examining characters is only ever done in iterative processes that start from the beginning of a string, and because of Unicode's combining rules it is quite impossible even in UTF-32. Spitzak (talk) 19:55, 19 March 2010 (UTC)
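
A short sketch of the byte-oriented view described above: std::string stores the UTF-8 bytes unchanged, size() reports bytes, and counting code points requires an iterative pass from the start. The names count_code_points and sample are hypothetical, and the counter assumes well-formed UTF-8 (it does no validation and does not account for combining sequences).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;   // count lead bytes and ASCII only
    }
    return n;
}

// The word "naive" with U+00EF in place of the plain i, written as
// explicit UTF-8 bytes (0xC3 0xAF) so the source encoding does not matter.
std::string sample() {
    std::string s = "na\xC3\xAF" "ve";
    assert(s.size() == 6);             // six bytes, five code points
    return s;
}
```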

Nonetheless, the question is a good one. C was first developed to work with the legacy ISO 646 character sets (the era of digraphs and trigraphs). With an encoding such as UTF-8 you need to view the string as bytes for memory allocation, and as characters for other features such as Unicode equivalence. The standard C/C++ library provides only some of those features (byte handling); the others are available only in third-party Unicode libraries such as International Components for Unicode.
C and C++ have always been blind (some say agnostic) to these issues, which leads to them being handled late in development, through the legacy locale mechanism, or left as crude technical interoperability limitations.
"Agnostic" is a misnomer for this language, since C provides ISO 646 features (as in <iso646.h>).
C and C++ are so blind that they offer no encoding conversion mechanism. Bad old language! 86.75.160.141 (talk) 20:34, 5 November 2012 (UTC)
Additionally, C++ is not so agnostic, because it includes some specific UTF-8, UTF-16 and UTF-32 features in codecvt and the conversion facilities [1]. 86.75.160.141 (talk) 21:23, 5 November 2012 (UTC)

Renaming this article to follow a consistent convention

Hi, I am currently considering renaming this article to conform to a common convention for C++ Standard Library components. The full discussion can be found here. decltype 09:47, 6 March 2009 (UTC)

bytes, char16_t and char32_t

Most operations are described as handling bytes.

Nonetheless, as I understand it, strings can also be made from char16_t and char32_t. So we might write:

   * string::at – Accesses the specified code unit with bounds checking.
   * string::operator[] – Accesses the specified code unit
   * string::front – Accesses the first code unit
   * string::back – Accesses the last code unit
   * string::data – Accesses the underlying array

-- Preceding unsigned comment added by 86.75.160.141 (talk) 20:19, 5 November 2012 (UTC)

The object is then a "std::basic_string<T>" and not what the C++ headers call "string". I think it is enormously clearer to describe byte strings first and then point out that the template can be reused for other element types. Spitzak (talk) 23:08, 5 November 2012 (UTC)
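
To illustrate the "code unit" wording discussed in this section, here is a minimal sketch assuming C++11, where std::u16string and std::u32string are the standard typedefs for basic_string<char16_t> and basic_string<char32_t>. The helper names emoji16, emoji32 and first_unit16 are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "std::u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "std::u32string is basic_string<char32_t>");

// U+1F600 needs two UTF-16 code units (a surrogate pair)
// but only one UTF-32 code unit.
std::u16string emoji16() { return u"\U0001F600"; }
std::u32string emoji32() { return U"\U0001F600"; }

// front() returns the first code unit, here the high surrogate,
// not the whole character.
char16_t first_unit16() {
    std::u16string s = emoji16();
    assert(s.size() == 2);
    return s.front();
}
```

The same member functions (at, operator[], front, back, data) exist on every instantiation; what changes is the element type they index.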