Talk:Null-terminated string

This is an old revision of this page, as edited by Kpratter (talk | contribs) at 13:40, 2 October 2023 (Notification: listing of CString at WP:Redirects for discussion.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Null-terminated string

Shouldn't this article be located at null-terminated string (or NUL-terminated string)? And primarily focus on null-terminated strings instead of C's string library? —Ruud 00:21, 19 October 2011 (UTC)[reply]

There's actually not much to say about null terminated string itself apart from the definition. Everything comes down to the operations that are defined on these strings, and the properties of these operations. C string library is the most widely used interface to these operations, so the attention to it seems reasonable to me. 1exec1 (talk) 01:41, 19 October 2011 (UTC)[reply]
You could also say a few other things. I haven’t really read the article :P but perhaps a comparison with other ways of storing strings and its relative strengths and weaknesses; languages and other applications where it is used? Vadmium (talk, contribs) 07:55, 24 October 2011 (UTC).[reply]
There is a comparison with a leading length at the start of the article!
Okay, so there is in the history section. And there’s more at String (computer science)#Representations. Vadmium (talk, contribs) 10:55, 24 October 2011 (UTC).[reply]
I would have to disagree with that. One can easily discuss the asymptotic complexity for various operations on null-terminated strings in terms of abstract functions. In my opinion this article should either be split into an article on null-terminated strings and an article on "Strings in the C programming language", or the latter should be more clearly made into a sub-section of an article whose primary topic is null-terminated strings. —Ruud 13:43, 24 October 2011 (UTC)[reply]
I agree with the suggestion to split the article. 1exec1 (talk) 14:22, 24 October 2011 (UTC)[reply]
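To make the complexity point above concrete, here is a minimal sketch of the two representations discussed on this page, written independently of C's string library. The function names are hypothetical and chosen only for illustration; the length-prefixed form assumes the 1-byte prefix usually meant by "Pascal string".

```python
def nul_terminated(payload: bytes) -> bytes:
    """Store payload followed by a single zero byte (the terminator)."""
    if b"\x00" in payload:
        raise ValueError("a NUL byte cannot appear inside a null-terminated string")
    return payload + b"\x00"


def length_prefixed(payload: bytes) -> bytes:
    """Store payload preceded by a 1-byte length (assumed prefix width)."""
    if len(payload) > 255:
        raise ValueError("a 1-byte prefix limits the payload to 255 bytes")
    return bytes([len(payload)]) + payload


def nul_length(buf: bytes) -> int:
    """Find the length by scanning for the terminator: linear in the length."""
    n = 0
    while buf[n] != 0:
        n += 1
    return n


def prefixed_length(buf: bytes) -> int:
    """Read the length directly from the prefix: constant time."""
    return buf[0]


nul_form = nul_terminated(b"hello")
prefixed_form = length_prefixed(b"hello")
```

Both forms here cost one extra byte; the difference is that the terminator excludes one byte value from the payload while the 1-byte prefix caps the length.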

Requested move

The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was: page moved per consensus in the discussion. Vegaswikian (talk) 22:56, 5 November 2011 (UTC)[reply]



C string → Null-terminated string: Relisted. Discussion is ongoing and may lead to something other than a rename. Vegaswikian (talk) 05:13, 31 October 2011 (UTC) Common and language-neutral name. —Ruud 13:55, 24 October 2011 (UTC)[reply]

How is the term C string not neutral? --199.91.207.3 (talk) 17:35, 24 October 2011 (UTC)[reply]
I would conjecture that the terms "C string" and "Pascal string" are mostly used by C programmers interfacing with libraries developed for different ABI's, while computer scientists and programmers from other languages would prefer to use the more descriptive terminology "null-terminated" and "length-prefixed" strings. The former already requires you know that C uses strings which are terminated by a null character and Pascal uses strings which are prefixed by their length, while this is self-evident with the latter. —Ruud 20:59, 24 October 2011 (UTC)[reply]
I think the term "Pascal string" means a 1-byte prefix length, not just the fact that a length is stored.Spitzak (talk) 23:50, 24 October 2011 (UTC)[reply]
True. So a Pascal string would be a particular kind of length-prefixed string. If we would have an article on that topic (which I don't believe we have at the moment), it would likely discuss all kinds of length-prefixed strings, not just 1-byte-length prefixed ones. —Ruud 00:58, 25 October 2011 (UTC)[reply]
I would support either a move to Null-terminated string, and/or integration with String (computer science)#Null-terminated, especially if the C stuff is to be a separate article. Vadmium (talk, contribs) 05:14, 25 October 2011 (UTC).[reply]
Maybe we can move the article containing the remaining C stuff to C standard string functions or similar title? 1exec1 (talk) 18:15, 29 October 2011 (UTC)[reply]
I'd prefer something like "String handling in the C programming language" or (more ambiguously, but more concise) "String handling in C". —Ruud 09:50, 31 October 2011 (UTC)[reply]
I see one problem with a title like this: it is not consistent with other pages about C standard library, like C mathematical functions and so on. In my opinion we should have either all articles in one format or the other. If we change all titles to Mathematical functions in C and similar, they become much more ambiguous, because then they refer to all functions (i.e. not necessarily standard ones) in the particular domain of C. Current solution mostly works, because when saying C mathematical functions, C standard mathematical functions is naturally implied (I must agree that this assertion might be far fetched as I'm not native speaker of English). Alternative solution might be something like Standard mathematical functions in C, but this also doesn't sound well (and might be grammatically incorrect; again, I'm not native speaker). Thus I think that certainly being not ideal, C standard string functions or C string functions might be the best option. However, if we decided to ignore the consistency issue, I would agree that String handling in C is an appropriate title. 1exec1 (talk) 23:29, 31 October 2011 (UTC)[reply]
I think I've already indicated that I find titles such as "C dynamic memory management" to be pretty awkward and that titles such as "Dynamic memory management in C" more clearly indicate the article is actually a sub-article of both Dynamic memory management and C (programming language). Perhaps the title and scope should even be Memory management in C and clearly linked with a {{main|Memory management in C}} from C (programming language)#Memory management. —Ruud 11:53, 1 November 2011 (UTC)[reply]
Ok, you finally convinced me. My previous argument is incorrect in that the scope of the articles is actually broader than the standard functions, as is evident in, for example, the current C string page. So now I think that the in C titles not only sound well, they represent the current and potential scope of the articles much better. Is a change from C *** to *** in C a non-controversial move? Can I implement it without a discussion? 1exec1 (talk) 17:36, 1 November 2011 (UTC)[reply]
I've created a centralized discussion at Talk:C_standard_library#Move_articles_about_C_standard_library_from_C_.2A.2A.2A_to_.2A.2A.2A_in_C. 1exec1 (talk) 12:51, 2 November 2011 (UTC)[reply]
The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

NULL char in UTF-8

In response to this edit: A NULL char is not a part of any valid UTF8 sequence; all characters in all multibyte sequences start with a 1 bit. However, you can encode a NULL char into a UTF8 stream with a 0xC0, 0x80 sequence, which then becomes a 0x0000 when converted to UTF16. - Richfife (talk) 16:09, 13 September 2013 (UTC)[reply]

Zero is a valid code point, and the UTF-8 encoding of it is a NUL (\0) byte. An 0xC0, 0x80 sequence in a UTF-8 string is an invalid overlong encoding. In fact, the overlong encoding of NUL is used as an example of a security issue from an incorrect UTF-8 implementation in the RFC. strcat (talk) 17:40, 13 September 2013 (UTC)[reply]
See the encoding table provided by the Unicode consortium as a resource. strcat (talk) 17:46, 13 September 2013 (UTC)[reply]
That is incorrect, you are describing "modified UTF-8" as used by tcl and some other systems. These also incorrectly encode non-BMP as 6 bytes. Officially the encoding of U+0000 in UTF-8 is a single 0 byte.
However I have reverted this, as the 0 character also exists in ASCII and yet it claims that ASCII works. Thus "works" is defined as "works for all characters other than the zero one". The edit implies that UTF-8 is not supported as well as ASCII, which is false, it is supported equally well with exactly one character not represented.Spitzak (talk) 23:57, 13 September 2013 (UTC)[reply]
I am not describing modified UTF-8, but I think you may just be responding to the parent comment. strcat (talk) 03:54, 19 September 2013 (UTC)[reply]
Can't we say that NUL (\0) is supported by ASCII, extended ASCII and UTF-8, but NUL is not supported by null-terminated strings, since NUL has an internal meaning to null-terminated strings? But as long as NUL is not part of encoded text (only TAB, CR and LF are among control chars) both ASCII and UTF-8 can be stored in null-terminated strings. Both ASCII and UTF-8 might sometimes demand storage of NUL, but if you want to store binary data, why not store it as binary data, not as ASCII, UTF-8 or UTF-16 or null-terminated strings? --BIL (talk) 09:09, 14 September 2013 (UTC)[reply]
UTF-8 considers NUL to be totally valid text data. It's the encoding of a valid code point - unlike the explicitly forbidden range of surrogates. I added sources for this, and it was reverted to the previous inaccurate claim of \x00 not being valid UTF-8. strcat (talk) 03:56, 19 September 2013 (UTC)[reply]
There is no difference between UTF-8 and ASCII with respect to null terminated strings. Actually UTF-8 preserves backwards compatibility with traditional null terminated strings, enabling more reliable information processing and the chaining of multilingual string data with Unix pipes between multiple processes. Using a single UTF-8 encoding with characters for all cultures and regions eliminates the need for switching between code sets. See Lunde, Ken (1999). CJKV information processing. O'Reilly Media. p. 466. ISBN 978-1-56592-224-2. Retrieved 2011-12-23. So current wording should be fixed. AgadaUrbanit (talk) 07:02, 19 September 2013 (UTC)[reply]
Regardless of whether there's a difference between ASCII and UTF-8, not all UTF-8 can be stored in a null-terminated string per the Unicode and UTF-8 standards (given as a source). They are the only authoritative sources here because they define the encoding. 99.231.135.5 (talk) 01:39, 20 September 2013 (UTC)[reply]
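The points argued in this thread can be checked with a short sketch. It relies only on Python's built-in strict UTF-8 codec; the helper names `is_valid_utf8` and `c_string_view` are hypothetical and introduced here for illustration. It shows that U+0000 encodes to a single zero byte, that the two-byte form 0xC0 0x80 is rejected as invalid, and that a terminator scan cuts off any text containing U+0000.

```python
# U+0000 is a valid code point; standard UTF-8 encodes it as one zero byte.
nul_encoded = "\x00".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The overlong two-byte form of NUL (as used by "modified UTF-8").
overlong_ok = is_valid_utf8(b"\xc0\x80")


def c_string_view(buf: bytes) -> bytes:
    """Return the bytes a terminator scan would see: everything before the first zero byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


text = "abc\x00def".encode("utf-8")
visible = c_string_view(text)
```

The same truncation happens with plain ASCII input, which is the sense in which ASCII and UTF-8 are supported equally well apart from that one character.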

Do many small strings imply duplicates?

My concern is with the statement, "On modern systems memory usage is less of a concern, so a multi-byte length is acceptable (if you have so many small strings that the space used by this length is a concern, you will have enough duplicates that a hash table will use even less memory)." I can write a program that generates many small strings that are not duplicates. Therefore I do not believe that if I have many small strings then there will be duplicates. There might be many programs where many small strings are duplicates and a hash table will use less memory (e.g. the symbol table in a compiler), but I cannot see that this is always true of all programs. — Preceding unsigned comment added by 80.195.2.190 (talk) 12:56, 6 September 2016 (UTC)[reply]

If "many" is considered to mean tending towards infinity then there will be duplicate strings after all permutations of small strings are generated. So in that sense many strings implies duplicates. However, consider 8-bit byte characters where there are 2^8=256 different characters, of which the zero character '\0' NUL can be used as a terminator, leaving 255 other characters from which to form strings. Now consider the number of bytes required to store all permutations of strings of short length for a NUL terminated string representation and a string representation having a 4-byte length:

length L | permutations P | size NUL terminated = P * (L+1) | size 4-byte length = P * (L+4)
0 | 255^0 = 1 | 1 | 4
1 | 255^1 = 255 | 510 | 1,275
2 | 255^2 = 65,025 | 195,075 | 390,150
3 | 255^3 = 16,581,375 | 66,325,500 | 116,069,625
4 | 255^4 = 4,228,250,625 | 21,141,253,125 | 33,826,005,000

To store all permutations of strings up to length 4 requires roughly 20 gigabytes in a NUL terminated string representation and roughly 32 gigabytes in a string representation with a 4 byte length. So a hash table storing such short strings will exhaust current memory sizes before all permutations of short strings can be generated. Therefore for hash tables, stored in current memory sizes, many short strings do not necessarily imply there will be duplicate strings. — Preceding unsigned comment added by 80.195.2.190 (talk) 10:10, 12 January 2017 (UTC)[reply]
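The figures in the table can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation described above: it assumes 255 usable byte values per position and a fixed per-string overhead of 1 byte (terminator) or 4 bytes (length field). The names `bytes_for_length` and `total_bytes` are hypothetical.

```python
ALPHABET_SIZE = 255  # 256 byte values minus the NUL terminator


def bytes_for_length(length: int, overhead: int) -> int:
    """Bytes needed to store every distinct string of exactly this length."""
    return ALPHABET_SIZE ** length * (length + overhead)


def total_bytes(max_length: int, overhead: int) -> int:
    """Bytes needed to store every distinct string of length 0..max_length."""
    return sum(bytes_for_length(n, overhead) for n in range(max_length + 1))


nul_total = total_bytes(4, overhead=1)       # NUL-terminated representation
prefixed_total = total_bytes(4, overhead=4)  # 4-byte length representation

for label, total in (("NUL terminated", nul_total), ("4-byte length", prefixed_total)):
    print(f"{label}: {total:,} bytes = {total / 2**30:.2f} GiB")
```

The totals come to 21,207,774,211 and 33,942,466,054 bytes, which is where the "20 gigabytes" and "32 gigabytes" estimates come from (about 19.75 GiB and 31.61 GiB).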

Second column seems off, I get 4, 1275, 390150, 116069625, 33826005000.
However storing these as independent strings would overflow memory just as much as the hash table. The assumption is that the set of strings fits in available memory, and that there are collisions because some strings are used much more often than others ("index" is probably used much more than "X&*v@" in a programming language). Though the length adds 3 bytes (vs nul-terminated) and the hash table adds H more bytes (where H ~= 16), each collision saves length+4+H bytes. So if there are N collisions out of M total strings then the hash table costs M*(3+H)-N*(length+4+H) extra bytes, which could be negative. However I have no idea how to test this on real data or prove whether it is positive or negative.Spitzak (talk) 18:18, 12 January 2017 (UTC)[reply]
Thank you for calculating the correct values. I've now fixed the numbers in that column. I basically agree with your analysis, but to be precise it is duplicate strings rather than collisions that save space in the hash table, because an imperfect hash function can generate the same index, and hence a collision, for different strings, and each different string needs to be stored, so space is saved only for duplicates. I share your intuition that there are many real-world data sets where a hash table (or trie, or other data structure) will save memory by exploiting the frequency of duplication. For a particular application I guess we could get a sufficiently large & representative set of data, and use some kind of statistics-keeping memory management to compare different data structures.
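As a starting point for testing this on real data, the accounting above could be sketched as follows. This is a simplified model, not a measurement: it assumes a per-entry table overhead H of 16 bytes (the rough figure quoted above), a 4-byte length per stored string, and it ignores the pointers or indices a program needs to refer to an interned string. All function names are hypothetical.

```python
def nul_terminated_bytes(strings: list[bytes]) -> int:
    """Every occurrence is stored separately with a 1-byte terminator."""
    return sum(len(s) + 1 for s in strings)


def interned_bytes(strings: list[bytes], table_overhead: int = 16) -> int:
    """Each distinct string is stored once with a 4-byte length plus table overhead."""
    return sum(len(s) + 4 + table_overhead for s in set(strings))


def extra_bytes(strings: list[bytes], table_overhead: int = 16) -> int:
    """Positive: the hash table costs more. Negative: it saves memory."""
    return interned_bytes(strings, table_overhead) - nul_terminated_bytes(strings)


few_duplicates = [b"index"] * 8 + [b"count", b"total"]
many_duplicates = [b"index"] * 98 + [b"count", b"total"]

few_cost = extra_bytes(few_duplicates)
many_cost = extra_bytes(many_duplicates)
```

For strings of equal length this reduces to M*(3+H) - N*(length+4+H), with N counting duplicates rather than hash collisions: with M=10 and N=7 the table costs 15 extra bytes, while with M=100 and N=97 it saves 525 bytes.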

India Education Program course assignment

This article was the subject of an educational assignment supported by Wikipedia Ambassadors through the India Education Program.

The above message was substituted from {{IEP assignment}} by PrimeBOT (talk) on 20:00, 1 February 2023 (UTC)[reply]

The redirect CString has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2023 October 2 § CString until a consensus is reached. Kpratter (talk) 13:40, 2 October 2023 (UTC)[reply]