Monday, December 14, 2009

You're being lied to "Objects are copied by reference by default in PHP5"

You're being lied to.

If you're among the crowd who have migrated an OOP-based application from PHP4 to PHP5, then I'm sure you've heard the expression "Objects are copied by reference by default in PHP5". Whoever told you that was lying.

Now, to be fair, it's an innocent lie, since objects do behave in a reference-like manner, but references are NOT what they are. Let's start with a simple illustration proving that they aren't references:
$a = new stdClass;
$b = $a;

$a->foo = 'bar';
/* Notice at this point that $a and $b are
 * indeed sharing the same object instance.
 * This is their reference-like behavior at work.
 */

$a = 'baz';
/* Notice now that $b is still that original object.
 * Had it been an actual reference with $a,
 * it would have changed to a simple string as well.
 */

What's going on here? Well, the answer is easiest to explain by looking at the underlying structure of objects. In PHP5, a variable containing an object identifies the instance by storing a simple numeric value. When an action is going to be performed on an object, that numeric value is used with a lookup table to retrieve the actual instance. In PHP4, by contrast, a variable containing an object identifies that object by carrying around the actual properties table itself. What this means in practice is that when you assign (not by reference) a PHP5 object to a new variable, that integer handle is copied into the new variable, but it still points at the same instance, because it's still the same number. Assigning a PHP4 object, however, means copying all the properties, effectively generating a new instance, since changes to one will not affect the other.
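If it helps to see the three cases side by side, here's a minimal sketch for PHP5 or later. The variable names ($handleCopy, $realCopy, $ref) are mine, purely for illustration: a plain assignment copies the handle, clone creates a genuinely separate instance, and =& creates an actual reference.

```php
<?php
// Three ways to "copy" an object variable, and what each one shares.
$a = new stdClass;
$a->foo = 'bar';

$handleCopy = $a;        // normal assignment: copies the handle, same instance
$realCopy   = clone $a;  // clone: a new instance with copied properties
$ref        = &$a;       // true reference: $ref and $a are the same variable

$a->foo = 'changed';
var_dump($handleCopy->foo); // string(7) "changed"
var_dump($realCopy->foo);   // string(3) "bar"

$a = 'baz';                 // rebind the variable itself
var_dump($ref);             // string(3) "baz"
var_dump($handleCopy instanceof stdClass); // bool(true)
```

Only $ref follows $a when it becomes a string; $handleCopy keeps pointing at the original instance.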

To put this another way, PHP4 objects are basically Arrays with functions associated with them, while PHP5 objects are basically Resources (a la MySQL result handles, or file pointers), again with functions loosely associated to them. Consider the following code in PHP4 (or any version):
$fp = fopen('foo.txt', 'w');
$otherVar = $fp;
fwrite($fp, "One\n");
fwrite($otherVar, "Two\n");
fclose($fp);
fwrite($otherVar, "Three\n"); /* This fails, because the file is closed */
You'd fully expect data to be written to the same file, as though you'd used $fp everywhere, rather than interchanging the variables, right? Well, PHP5 objects are the same. The instance itself isn't duplicated when you assign to a new variable, just the unique identifier.

I'm lying to you also

"Copying" a variable doesn't exactly mean copying. Take the following code block:
$a = 'foo';
$b = $a;
$a = 'bar';
Now, you know PHP well enough to know that by the end of this code block, the value of $b will still be 'foo'. What you may not know is that the original copy of 'foo' that was in $a was never actually duplicated.

To understand what PHP is doing, you need to understand the internal structure of the variable and how it relates to userspace visible variable names ('a' and 'b' in this case). First off, the actual contents of a variable (known as a zval) consist of four parts: type (e.g. NULL, Boolean, Integer, Float, String, Array, Resource, Object), a specific value (e.g. 123, 3.1415926535, etc...), is_ref, a flag indicating if the value is a reference or not, and refcount, which tells how many labels are sharing this value.

What you think of as a variable (e.g. $x) is actually just a label. That label ('x' in this case) is used as a lookup to find the zval which contains the actual value. These are just like keys in an associative array; in fact, the mechanisms are identical.

With me so far? Good. Now, when you first create a variable (e.g. $x = 123;), PHP allocates a new zval for it, stores the specific value, and associates the label with the value:
'x' => zval ( type => IS_LONG, value.lval = 123, is_ref = 0, refcount = 1 )
So far, refcount is 1 since the zval value is only being referenced by one label. If we now put this value into a full-reference set using $y =& $x;, the same zval is reused. It's simply associated with a new label and its reference counters are adjusted properly.
'x' => zval ( type => IS_LONG, value.lval = 123, is_ref = 1, refcount = 2 )
'y' /
This way, when you later change the value of $x, $y appears to change as well because it's looking at the same internal value. But what if we hadn't done a reference assignment? What if we'd done a normal assignment: $y = $x;? Surprisingly, the result would be almost the same.
'x' => zval ( type => IS_LONG, value.lval = 123, is_ref = 0, refcount = 2 )
'y' /
Again, the original zval associated with $x is reused. The only difference this time is that is_ref is not set to 1. This is known as a copy-on-write reference set (as opposed to the full-reference set described above). This 0 flag tells the engine that if anyone tries to change this value (regardless of which label they use to reach it), any other references to it should be left alone. Here's what happens if we take that current state and do $x = 456;:
'y' => zval ( type => IS_LONG, value.lval = 123, is_ref = 0, refcount = 1 )
'x' => zval ( type => IS_LONG, value.lval = 456, is_ref = 0, refcount = 1 )
$x has been disassociated from the original zval (thus dropping that zval's refcount back to 1), and a new zval has been created for it.
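You can't see refcounts directly from plain userspace, but you can watch the effect of copy-on-write by measuring memory. What follows is only a rough sketch: the 1 MB size is arbitrary, and the exact byte counts depend on your PHP version and build, so treat the numbers as indicative rather than exact.

```php
<?php
// Observe copy-on-write from userspace by watching memory usage.
// Exact byte counts vary by PHP version and build.
$size = 1024 * 1024;
$a = str_repeat('x', $size);   // one string of roughly 1 MB

$before = memory_get_usage();
$b = $a;                       // plain assignment: the value is shared
$afterAssign = memory_get_usage();

$b[0] = 'y';                   // first write through $b forces the separation
$afterWrite = memory_get_usage();

printf("assignment cost: %d bytes\n", $afterAssign - $before);
printf("write cost:      %d bytes\n", $afterWrite - $afterAssign);
var_dump($a[0]); // string(1) "x"
```

The assignment should cost next to nothing, while the first write should cost roughly the size of the string, since that's the moment the duplicate actually gets made.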

Why referencing when you don't have to is a bad idea.

Let's consider one more situation. Take a look at this code block:
$a = 'foo';
$b = $a;
$c = &$a;
At the first instruction, a single zval is created, associated to a single label:
'a' => zval ( type => IS_STRING, value.str.val = 'foo', is_ref = 0, refcount = 1 )
At the second instruction, that zval is associated to a second label, so far so good:
'a' => zval ( type => IS_STRING, value.str.val = 'foo', is_ref = 0, refcount = 2 )
'b' /
At the third instruction, however, we run into problems. Since this zval is already tied up in a copy-on-write reference set which includes $b, that zval can't be simply promoted to is_ref==1. Doing so would drag $b into $a and $c's full-reference set, and that would be wrong. In order to resolve this, the engine is forced to duplicate that zval into two identical copies, from which it can begin to shuffle around reference flags and counts:
'b' => zval ( type => IS_STRING, value.str.val = 'foo', is_ref = 0, refcount = 1 )
'a' => zval ( type => IS_STRING, value.str.val = 'foo', is_ref = 1, refcount = 2 )
'c' /
Now you've got two copies of the same literal value, so you're wasting memory on the storage, plus the processing time required to actually make the duplication. Since a LOT of events lead to copy-on-write uses (including simply passing an argument to a function), this sort of forced duplication actually happens very commonly when you start involving actual references.
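The duplication itself happens inside the engine, so this sketch only shows the userspace rule that forces it: $b has to keep its own value no matter what happens through $a or $c.

```php
<?php
$a = 'foo';
$b = $a;    // $b joins $a in a copy-on-write set
$c = &$a;   // $a and $c form a full-reference set; $b must be split off

$c = 'changed';
var_dump($a); // string(7) "changed"
var_dump($b); // string(3) "foo"
```

If the engine had simply flipped is_ref on the shared zval, $b would have changed here too, which is exactly the wrong behavior described above.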

The moral of the story

Assigning values by references when you don't need to (in order to later modify the original value through a different label) is NOT a case of you outsmarting the silly engine and gaining speed and performance. It's the opposite, it's you TRYING to outsmart the engine and failing, because the engine is already doing a better job than you think.

How does this reflect on objects? They're not special. They're not different from other variables. They are not pretty snowflakes. In this code block:
$a = new stdClass;
$b = $a;
The labels are still placed into copy-on-write reference sets. What's important, is that even when a duplication does occur, (A) only that unique integer is copied (which is cheap), and (B) the duplicated integer still points to the same place. Hence you get reference-like behavior, but not an actual reference by default.
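Function arguments are where this difference tends to bite, so here's a small sketch for PHP5 or later. The three helper functions are hypothetical, made up just for this illustration.

```php
<?php
// Hypothetical helpers, for illustration only.
function setFoo($obj) {
    $obj->foo = 'set inside';   // acts on the shared instance
}
function replaceByValue($obj) {
    $obj = new stdClass;        // rebinds only the local copy of the handle
    $obj->foo = 'replaced';
}
function replaceByReference(&$obj) {
    $obj = new stdClass;        // rebinds the caller's variable too
    $obj->foo = 'replaced';
}

$a = new stdClass;
setFoo($a);
var_dump($a->foo); // string(10) "set inside"
replaceByValue($a);
var_dump($a->foo); // string(10) "set inside"
replaceByReference($a);
var_dump($a->foo); // string(8) "replaced"
```

Changing a property through the copied handle is visible to the caller, but replacing the object only reaches the caller when the parameter is an actual reference.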

Hungry for more? Check out my coverage of the zval.
