PChars: no strings attached
Rudy Velthuis
[SHOWTOGROUPS=4,20]
I often see that there is still great confusion about the PChar type on the one hand and the string type on the other. In this article I would like to discuss the similarities and the differences between both types, as well as some things you should or shouldn’t do with them.
“The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.” (Alan Perlis)
The general principles laid out in this article apply to all Win32, Win64 and OS X versions of Delphi, including Delphi 2009 and up. There is, however, a special “chapter” at the end of this article especially for those who use Delphi 2009 and up.
PChar
PChars were inspired by strings as used in the C language. Most Windows API functions have a C interface and accept C-style strings. To be able to use these APIs, Borland had to introduce a type that mimicked them in Turbo Pascal, the ancestor of Delphi.
“Trying to outsmart a compiler defeats much of the purpose of using one.” (Kernighan and Plauger, The Elements of Programming Style)
In C, there is no real string type, not like there is in Delphi. Strings are just arrays of characters, and the end of the text is marked by a character with ASCII code zero. This allows them to be very long (unlike Turbo Pascal’s string type, which was limited to 255 characters and a length byte – this is Delphi’s ShortString type now), but a bit awkward to use. The beginning of the array is simply marked by a char *, which is a pointer to a char. The exact Delphi equivalent is ^Char. This has become the type PChar in Turbo Pascal and Delphi.
To traverse a string in C, you can increment or decrement the pointer using code like p++ or --s, or use the pointer as if it were an array — this is true for all pointers in C — and use s[20] to indicate the 21st character — counting starts at 0. But C pointer arithmetic not only allows incrementing and decrementing the pointer, it also allows calculating the sum of a pointer and a number, or the difference between two pointers. In C, *(s + 20) is equivalent to s[20] (* is the C pointer operator, much like Delphi’s ^). Borland introduced almost the same syntax for the PChar type in Turbo Pascal, if the {$X+} (extended syntax) directive was set.
In Delphi version 2009 and up, pointer arithmetic (or pointer math, as the Delphi developers called it) is supported for all pointer types, if the directive {$POINTERMATH ON} is used.
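As a rough illustration of the syntax described above, the following console program walks a PChar up to the terminating #0 and then uses indexing and pointer addition. CountChars and Msg are names made up for this sketch. Note that PChar arithmetic like this works without {$POINTERMATH ON}; the directive only matters for other pointer types.

```pascal
program PCharWalk;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts the characters before the terminating #0
// by advancing a pointer, much like a C loop using p++ would.
function CountChars(S: PChar): Integer;
var
  P: PChar;
begin
  P := S;
  while P^ <> #0 do
    Inc(P);          // moves forward by one Char, not necessarily one byte
  Result := P - S;   // difference of two PChars, measured in characters
end;

var
  Msg: PChar;
begin
  Msg := 'Delphi';
  Writeln(CountChars(Msg));   // 6
  Writeln(Msg[2]);            // l (third character, counting starts at 0)
  Writeln((Msg + 2)^);        // l (same character via pointer addition)
end.
```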
Just a pointer
Despite the slightly extended syntax and handling described further on, never forget that a PChar is just a pointer, like in C. And also like in C, you can use it as if it were an array (i.e. the pointer points to the first character in the array). But it isn’t! Unlike the convenient Delphi type string, a PChar has no automatic storage. If you copy text to a PChar-“string”, you must always make sure that the PChar actually points to a valid array, and that the array is large enough to hold the text.
Like in C, a PChar variable merely points to a Char. Usually, as in C, this Char is part of an array of Char that ends in a Char with ordinal value 0 and such an array is often used to pass text around between functions, but there is no guarantee that the character is part of a larger array, and there is no guarantee that there is a 0 at the end. This is only a convention.
And like with any other pointers, you can make mistakes with them.
    var
      S: PChar;
    begin
      S[0] := 'D';
      S[1] := '6';
The code above did not allocate storage for the string, so it tries to store the characters starting at some undefined location in memory (the bit pattern that S happens to hold before it is assigned the address of a definite memory location is undefined; see my article on pointers). This can cause problems like memory corruption, and can even lead to a program crash or, worse, wrong results. It is your responsibility to ensure that the array exists. The easiest way is to use a local array:
    var
      S: PChar;
      A: array[0..100] of Char;
    begin
      S := A;
      S[0] := 'D'; // this is equivalent to A[0] := 'D';
      S[1] := '6'; // you could also write: (S + 1)^ := '6';
The above code stores the characters in the array. But if you try to display the string at S, it will probably display lots of nonsense. That is because the string didn’t end in a #0 character. OK, you could simply add another line:
      S[2] := #0; // or: (S + 2)^ := #0;
and you would get a display of the text "D6". But storing characters one by one is really inconvenient. To display a text via a PChar is much simpler: you simply set the PChar to an already existing array with a text in it. Luckily, string constants like 'Delphi' are also such arrays, and can be used with PChars:
    var
      S: PChar;
    begin
      S := 'Delphi';
You should, however, be aware that this only changes the value of the pointer S. No text is moved or copied around. The text is simply stored somewhere in the program (and has a #0 delimiter), and S is pointed to its start address. If you do:
    // WARNING: BAD EXAMPLE
    var
      S: PChar;
      A: array[0..100] of Char;
    begin
      S := A;
      S := 'Delphi';
this does not copy the text 'Delphi' to the array A. The statement S := A points S to the array A, but immediately after that, the next statement only changes S (a pointer!) to the address of the literal string. If you want to copy text to the array, you must do that using, for instance, StrCopy or StrLCopy:
    var
      S: PChar;
      A: array[0..100] of Char;
    begin
      S := A;
      StrCopy(S, 'Delphi');
or
      StrLCopy(S, 'Delphi', Length(A) - 1);
In this simple case it is obvious that 'Delphi' will easily fit in the array, so the use of StrLCopy seems a bit overdone, but on other occasions, where you don’t know the size of the string, you should use StrLCopy to avoid overrunning the array bounds.
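To make the overrun case concrete, here is a small sketch in which the source text is deliberately longer than the buffer. Buffer and Source are illustrative names, and the unit is written as SysUtils (newer projects may spell it System.SysUtils).

```pascal
program SafeCopy;
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  Buffer: array[0..7] of Char;   // room for 7 characters plus the #0
  Source: PChar;
begin
  Source := 'A text that is far too long for the buffer';
  // StrLCopy copies at most MaxLen characters and then appends a #0,
  // so MaxLen must leave one array element free for the terminator.
  StrLCopy(Buffer, Source, Length(Buffer) - 1);
  Writeln(StrLen(Buffer));       // 7: the copy was cut off at the limit
end.
```

With StrCopy instead of StrLCopy, the same call would write well past the end of Buffer.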
A static array like A is useful as a text buffer for small strings of a known maximum size, but often you’ll have strings of a size which is unknown when the program is compiled. In that case you’ll have to use dynamic allocation of a text buffer. You can for instance use StrAlloc or StrNew to create a buffer, or GetMem, but then you’ll have to remember to free the memory again, using StrDispose or FreeMem. If you wanted to avoid low level routines, you could use a dynamic array of Char (or TArray<Char>), but that is not quite as convenient as using a Delphi string as a text buffer. But before I describe how to do that, I want to discuss that type first.
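A minimal sketch of the dynamic allocation mentioned above, assuming the classic SysUtils routines StrNew, StrAlloc and StrDispose. The variable names and the buffer size of 32 are arbitrary choices for the example.

```pascal
program DynBuffer;
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  Original, Buf: PChar;
begin
  // StrNew allocates a buffer and copies the given text into it.
  Original := StrNew('Delphi');
  // StrAlloc only allocates; the size is in characters, including the #0.
  Buf := StrAlloc(32);
  try
    StrLCopy(Buf, Original, 31);   // leave room for the terminator
    Writeln(Buf);                  // Delphi
  finally
    // Buffers obtained from StrNew or StrAlloc are released with StrDispose.
    StrDispose(Buf);
    StrDispose(Original);
  end;
end.
```

Forgetting either StrDispose call leaks the buffer, which is exactly the bookkeeping a Delphi string does automatically.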
String
Allow me to confuse you: a string or, more precisely, an AnsiString (in Delphi 2009 and higher: UnicodeString) is in fact a PChar. Just like a PChar, it is a pointer to an array of characters, terminated by a #0 character. But there is one big difference: you normally don’t have to think about how strings work. They can be used almost like any other variable. The compiler makes sure that the appropriate code to allocate, copy and free the text is called, so instead of calling routines like StrCopy yourself, you let the compiler take care of such chores.
“A world without string is chaos.” (Randolf Smuntz, Mouse Hunt)
But there is more. Although the text is sure to be always terminated by a #0, just to make AnsiStrings compatible with C-style strings, the compiler doesn’t need it. In front of the text in memory, at a negative offset, the length of the string is stored, as an Integer. So to know the length of the string, the compiler only has to read that Integer, and not count characters until it finds a #0. That means that you can store #0 characters in the middle of the string without confusing the compiler. But some output routines, which rely on the #0 and not on the length, might be confused.
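The difference between the stored length and the #0 convention can be seen in a short sketch like the following, where S is just an example variable. Length reads the stored length, while StrLen counts characters up to the first #0.

```pascal
program EmbeddedNul;
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  S: string;
begin
  S := 'abc' + #0 + 'def';
  Writeln(Length(S));          // 7: the stored length counts every character
  Writeln(StrLen(PChar(S)));   // 3: StrLen stops at the first #0
end.
```

Any routine that works like StrLen will see only 'abc' here, which is the kind of confusion mentioned above.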
Normally, each time you’d assign one string to another variable, the compiler would have to allocate memory and copy the entire string to it. Because Delphi strings can be quite long (theoretically, up to 2GB), this could be slow. To avoid all the copying, Delphi uses a concept that is called “copy on write” (COW), meaning that, on assignment, only a copy of the pointer is made. A copy of the text is only made if the string data is about to be changed. Each string has a few fields of information stored in front of it. One is the reference count: this is the count of string variables that actually reference that particular string in memory. Only when it becomes 0 is the string text no longer referenced, and the memory can be freed.
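Copy on write can be observed by comparing the addresses two string variables hold, as in this sketch. Casting a string to Pointer is used here only for inspection; the variable names are illustrative.

```pascal
program CopyOnWrite;
{$APPTYPE CONSOLE}
uses
  SysUtils;

var
  S1, S2: string;
begin
  S1 := IntToStr(123456);   // built at run time, so the text lives on the heap
  S2 := S1;                 // only the pointer is copied
  Writeln(Pointer(S1) = Pointer(S2));   // TRUE: one shared text buffer

  S2[1] := '9';             // writing forces S2 to get its own copy first
  Writeln(Pointer(S1) = Pointer(S2));   // FALSE: two separate buffers
  Writeln(S1);              // 123456
  Writeln(S2);              // 923456
end.
```

S1 is left untouched by the write to S2, even though both variables pointed to the same text a moment earlier.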
The compiler takes care that the reference count is always correct (but you can confuse the compiler by casting – more on that later). If a string variable is declared in a var section of a function or procedure, or as a field of a class or record, it starts its life as nil, the internal representation of the empty string (''). As soon as string text is created and assigned to one of these variables, the reference count of the string is updated to 1. Each additional assignment of that particular string to a new variable increments its reference count. If a string variable leaves its scope (when the function or class in which it was declared ends), or is pointed to a new string, the reference count of the text is decremented.
A simple example:
    function PlayWithStrings: string;
    var
      S1, S2: string;
    begin
      S1 := IntToStr(123456);
Now S1 points to the text '123456', which has a reference count of 1.
      S2 := S1;
No text is copied yet, S2 is simply set to the same address as S1, but the reference count of the text '123456' is 2 now.
      S2 := 'The number is ' + S2;
Now a new, larger buffer is allocated, the text 'The number is ' is copied to it, and the text from '123456' concatenated. But, since S2 doesn’t point to the original text '123456' anymore, the reference count of that text is decremented to 1 again.
      Result := S2;
Result is set to point to the same address as S2, and the reference count of the text 'The number is 123456' is incremented to 2.
    end;
Now S1 and S2 leave their scope. The reference count for '123456' is decremented to 0, and the text buffer is freed. The reference count for 'The number is 123456' is decremented too, but only to 1, since the function result still points to it. So although the function has ended, the string is still around.
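For readers who want to watch the counts change, the walk-through can be instrumented roughly as follows. This sketch assumes Delphi 2009 or later, where the System unit provides StringRefCount. The walk-through predicts the values 1, 2, 1 and 2, but the numbers actually printed may differ if the compiler introduces hidden temporary strings, so no expected output is given in the code.

```pascal
program RefCounts;
{$APPTYPE CONSOLE}
uses
  SysUtils;

function PlayWithStrings: string;
var
  S1, S2: string;
begin
  S1 := IntToStr(123456);
  Writeln('after S1 := IntToStr: ', StringRefCount(S1));
  S2 := S1;
  Writeln('after S2 := S1:       ', StringRefCount(S1));
  S2 := 'The number is ' + S2;
  Writeln('after concatenation:  ', StringRefCount(S1));
  Result := S2;
  Writeln('after Result := S2:   ', StringRefCount(S2));
end;

var
  R: string;
begin
  R := PlayWithStrings;
  Writeln(R);   // The number is 123456
end.
```

IntToStr is used rather than a literal because string literals are stored in the program itself and are not reference counted in the same way.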
What is important to notice here is that strings are more or less independent of the variables or fields that reference them. Only the number of references is important. If that is not 0, the string is still referenced somewhere and it must remain in memory. If it becomes 0, the memory for the string and its associated data (length, reference count, codepage) can be freed.
Complicated? Yes, it is complicated, and it can get even more complicated with var, const and out parameters. But fortunately, you normally don’t have to worry about this. It only becomes important to know if you access strings in assembler or through a typecast to PChar. But using strings with a typecast to PChar is not uncommon.
The most important things to remember about strings are:
- that text is only copied to a new string buffer if it is modified;
- that the reference count and the length are not connected to a string variable, but to a specific text buffer (also known as payload), to which more than one string variable can point;
- that the reference count is always correct unless you fool the compiler by casting to a different type;
- that assignments to a variable decrement the reference count of the text buffer it previously pointed to;
- that if the reference count becomes 0, the string buffer is freed.
[/SHOWTOGROUPS]