php passing pointer parameters performance

by Tim Stamp, 2015-11-19

A little adds up to a lot, and in the world of code, a tiny change in code performance can have a big impact on application performance overall.

So, pointers aren’t a “thing” in PHP. This article is about the use of PHP References, and I used the word ‘pointer’ in the title because it is alliterative. So there 🙂

how to test performance of small blocks

So there are plenty of theories on how to test code performance, but the simple script below can compare two blocks of code. Both blocks must be executed an equal number of times, with as little risk of interference from the rest of the system as possible. This means you don’t run one block of code 1000 times and then run the other 1000 times to compare them: an external influence could affect one of those runs far more heavily than the other, tainting your results. Instead you run one, then the other, then repeat, capturing the time for each execution as it happens.

function funcA() {
    //do some code
};
function funcB() {
    //do some other code
};
//setup (only run once):
function changeDataA() {}
function changeDataB() {}

$loops = 5000;
$timeA = 0.0;
$timeB = 0.0;

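ob_start(); //swallow any output the blocks produce while being timed
for($i=0; $i<$loops; ++$i) {
    $start = microtime(true); //true = return the time as a float, in seconds
    funcA();
    $timeA += microtime(true) - $start;

    $start = microtime(true);
    funcB();
    $timeB += microtime(true) - $start;
}
ob_end_clean();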

$timeA = round(1000000 * ($timeA / $loops), 3);
$timeB = round(1000000 * ($timeB / $loops), 3);

echo "
TimeA averaged $timeA microseconds
TimeB averaged $timeB microseconds
";

So I’ll be using this code to show how performance differs when you write your code in slightly different ways. Sometimes the difference is very small, and you are welcome to reproduce these tests and come to your own conclusions; this blog contains only my professional opinions.

Sometimes the code set up cost should also be taken into account – if the code in question is called a small number of times in a single execution of your code then the set up cost of a block of code will have a bigger impact on performance. I will include the code I used to perform these tests below so you can see where I have included set up cost in the calculation.
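As a rough sketch, the same alternating measurement can be wrapped in a helper so that set-up code stays visibly outside the timed region. The name `compareBlocks()` and the two example callables are made up for illustration; they are not part of the original test script.

```php
<?php
/**
 * Hypothetical helper: run two callables in alternation and return
 * the average run time of each, in microseconds.
 */
function compareBlocks(callable $blockA, callable $blockB, int $loops = 5000): array
{
    $timeA = 0.0;
    $timeB = 0.0;

    ob_start(); // swallow any output the blocks produce
    for ($i = 0; $i < $loops; ++$i) {
        $start = microtime(true);
        $blockA();
        $timeA += microtime(true) - $start;

        $start = microtime(true);
        $blockB();
        $timeB += microtime(true) - $start;
    }
    ob_end_clean();

    return [
        'A' => round(1000000 * ($timeA / $loops), 3),
        'B' => round(1000000 * ($timeB / $loops), 3),
    ];
}

// Set-up that should NOT be timed happens once, out here.
// Move it inside the callables if its cost should count.
$data = range(1, 1000);

$result = compareBlocks(
    function () use ($data) {
        return array_sum($data);
    },
    function () use ($data) {
        $sum = 0;
        foreach ($data as $n) {
            $sum += $n;
        }
        return $sum;
    },
    1000
);

echo "TimeA averaged {$result['A']} microseconds\n";
echo "TimeB averaged {$result['B']} microseconds\n";
```

The numbers will vary from machine to machine and from run to run, so treat any single result as a hint rather than a verdict.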

passing variables by reference

Passing variables by reference is when you prefix a parameter with `&` in the function signature, so any changes made to that variable inside the function also affect the variable that was passed into the function.
These parameters are called references. Passing ‘By Reference’ is the opposite of passing ‘By Value’, which is what happens when you declare a parameter without the ‘&’ prefix.

Example:

function funcA() {
    $str = str_shuffle("0123456789");
    $str = changeData1($str);
    strlen($str);
};
function funcB() {
    $str = str_shuffle("0123456789");
    changeData2($str);
    strlen($str);
};
//setup (only run once):
function changeData1($data) {
    return $data . " World";
}
function changeData2(&$data) {
    $data .= " World";
}

results in the following execution times:

TimeA averaged 2.774 microseconds
TimeB averaged 2.748 microseconds

So not a lot of difference there: the `funcB()` function comes out very slightly faster, but a gap of about 1% is small enough to be within measurement noise.

But… Consider what happens when you are working with larger data blobs. All of a sudden you are calling a function `changeData1()` with large amounts of data, and PHP has to allocate memory for a copy, copy the data, alter it, then release the original (and later garbage-collect it). This can happen when modifying the contents of a file, for example:

function funcA() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 100000);
    $str = changeData1($str);
    strlen($str);
};
function funcB() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 100000);
    changeData2($str);
    strlen($str);
};
//setup (only run once):
function changeData1($data) {
    return $data . " World";
}
function changeData2(&$data) {
    $data .= " World";
}

Outputs:

TimeA averaged 542.497 microseconds
TimeB averaged 294.44 microseconds

It was ~45% faster to use a reference here.
So when working with large strings, it is a lot more performant to alter the existing string, than to copy the passed string and return it.

On a large array:

function funcA() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 10000);
    $str = explode('0', $str);
    $str = changeData1($str);
    count($str);
};
function funcB() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 10000);
    $str = explode('0', $str);
    changeData2($str);
    count($str);
};
//setup (only run once):
function changeData1($data) {
    $data[] = " World";
    return $data;
}
function changeData2(&$data) {
    $data[] = " World";
}

outputs:

TimeA averaged 2980.23 microseconds
TimeB averaged 2028.161 microseconds

So we took ~32% off the processing time here by passing our variable by reference.

We can also use the reference operator `&` in a regular `foreach` loop.

Here’s an example:

function funcA() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 1000);
    $str = explode('0', $str);
    foreach($str as $k => $v) {
        $str[$k] .= 'a'; //<-- we have to look up the key every iteration!
    }
    count($str);
};
function funcB() {
    $str = str_shuffle("0123456789");
    $str = str_repeat($str, 1000);
    $str = explode('0', $str);
    foreach($str as &$v) { //<-- & used here, and no key required any more
        $v .= 'a'; //<-- modify the value in the array using the reference 
    }
    count($str);
};

And our survey says:

TimeA averaged 296.358 microseconds
TimeB averaged 160.289 microseconds

Sooooo… We cut the run time of this code by ~46% (it now takes ~54% of the original time), just by using a reference instead of modifying the array values by key.

using a reference to reduce calls to nested variables

Why write this code:

$myBigArray['firstVar']['secondVar']['thirdVar'][] = "FOO";
$myBigArray['firstVar']['secondVar']['thirdVar'][] = "BAR";
$myBigArray['firstVar']['secondVar']['thirdVar'][] = "Hello";
$myBigArray['firstVar']['secondVar']['thirdVar'][] = "World";

when you can do this instead:

$thirdVar =& $myBigArray['firstVar']['secondVar']['thirdVar'];
$thirdVar[] = "FOO";
$thirdVar[] = "BAR";
$thirdVar[] = "Hello";
$thirdVar[] = "World";

Much better 🙂
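Here is a small self-contained sketch of that shortcut, assuming the nested keys already exist. The `unset()` at the end is an addition for safety: it breaks the binding so that later, unrelated use of `$thirdVar` cannot reach back into the array.

```php
<?php
$myBigArray = ['firstVar' => ['secondVar' => ['thirdVar' => []]]];

// Bind a short local name to the deeply nested element.
$thirdVar =& $myBigArray['firstVar']['secondVar']['thirdVar'];
$thirdVar[] = "FOO";
$thirdVar[] = "BAR";

// Break the binding once finished with it.
unset($thirdVar);
$thirdVar = "unrelated"; // no longer touches $myBigArray

print_r($myBigArray['firstVar']['secondVar']['thirdVar']);
```

One thing to watch: taking a reference to a key that does not exist yet quietly creates that key (holding null) instead of raising a notice, so a typo in the path adds an element rather than failing.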

use `unset` when you want to redefine the reference

When working with references, it can be pretty easy to accidentally modify data you didn’t intend to. Consider this:

$bar = "bar";
$foo =& $bar;
$foo = null;

This code sets `$bar` to null. The reference `$foo` is still pointed at `$bar`.
If you want to stop `$foo` pointing at `$bar` you have to either point it at something else, e.g. `$foo =& $somethingElse`, or call `unset($foo)`.
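A short sketch of the `unset` route, plus the place where forgetting it most often bites: the loop variable of a by-reference `foreach` stays bound to the last element after the loop ends. The variable names here are illustrative only.

```php
<?php
$bar = "bar";
$foo =& $bar;   // $foo and $bar now name the same value
unset($foo);    // removes the name $foo only; $bar is untouched
$foo = null;    // a brand-new $foo, no longer tied to $bar

// The same rule matters after a by-reference foreach loop.
$items = ['a', 'b', 'c'];
foreach ($items as &$v) {
    $v .= '!';
}
unset($v);      // without this, $v would still point at $items[2]

foreach ($items as $v) {
    // A read-only loop reusing the name $v. Had $v still been a
    // reference, each iteration would have overwritten $items[2].
}

var_dump($bar);
print_r($items);
```

Without the `unset($v)`, the second loop would leave `$items` as `['a!', 'b!', 'b!']`, which is a classic and rather confusing bug.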

Common methods in PHP already use references for performance reasons, such as:

for sorting the elements in the given array
function sort (array &$array, $sort_flags = null) {} (and other array sorting functions)

for randomly re-ordering elements in the given array
function shuffle (array &$array) {}

for iterating over the given array, modifying values as it loops over items in the given array
function array_walk (array &$array, $funcname, $userdata = null) {}

for getting the current element in the given array
function current (array &$array) {}

for adding a variable to the end of the given array
function array_push (array &$array, $var, $_ = null) {}

So most of these functions modify the given array in-situ, rather than returning a modified copy of it (`current()` is the exception, as it only reads from the array).
If you think about it, this makes a lot of sense, because if you want to retain the original array before you modify it, then it’s just one line of code to copy the array; but if these functions always returned a copy of the given array, you would have to write your own function to modify the original array in-situ if you wanted the performance gain.

Other core functions in PHP use references for passing data back to the calling code, for example:

$count = null;
str_replace("_", "-", "ab_cd_12_34", $count);
echo "count: $count"; //says "count: 3"

The signature for this function is:

function str_replace ($search, $replace, $subject, &$count = null) {}

This can be very handy when you want to change an existing function to return more than one variable, but don’t want to change all the existing usages of the function in your code. Just tack on an optional reference parameter to the function signature and use that to return your extra data. And maybe design your system better next time 😀
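As a sketch of that retrofit, imagine a function that originally returned only a cleaned list and later gained an optional by-reference parameter to report how many items it dropped. The function `removeBlanks()` and its `$removed` parameter are hypothetical names made up for this example.

```php
<?php
// Hypothetical function: it used to take one argument and return the
// cleaned list. The optional $removed parameter was bolted on later.
function removeBlanks(array $values, &$removed = null): array
{
    $kept = [];
    $removed = 0;
    foreach ($values as $value) {
        if (trim($value) === '') {
            ++$removed;
            continue;
        }
        $kept[] = $value;
    }
    return $kept;
}

// Existing callers keep working, unchanged.
$old = removeBlanks(['a', '', 'b']);

// New callers can opt in to the extra data.
$new = removeBlanks(['a', ' ', 'b', ''], $removed);

echo "kept " . count($new) . ", removed $removed\n";
```

The trade-off is that the extra return value is easy to miss when reading the call site, which is part of why this works better as a stop-gap than as a design.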

you don’t need to use references to objects

When passing objects around, copies of them are NOT created by default. So you don’t need to use references in this case; the original object will still be modified by your function. See example:

class Foo {
    public $foo;
}
$a = new Foo;
function updateMe($a) {
    $a->foo = "bar";
}
updateMe($a);
echo "Foo: ". $a->foo; // says "Foo: bar"

If you add `&` to the parameter here, PHP doesn’t give you any warning, because the same parameter may accept primitive types as well as objects, and those would otherwise be passed by value instead of by reference.
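There is one case where `&` still changes behaviour for objects: replacing the object itself rather than changing its properties. A minimal sketch, using a made-up `Box` class and made-up function names:

```php
<?php
class Box
{
    public $value = 'original';
}

function mutate(Box $box)
{
    $box->value = 'mutated';   // changes the caller's object
}

function replace(Box $box)
{
    $box = new Box();          // rebinds the local variable only
    $box->value = 'replaced';
}

function replaceByRef(Box &$box)
{
    $box = new Box();          // rebinds the caller's variable too
    $box->value = 'replaced';
}

$a = new Box();

mutate($a);
$afterMutate = $a->value;        // 'mutated'

replace($a);
$afterReplace = $a->value;       // still 'mutated'

replaceByRef($a);
$afterReplaceByRef = $a->value;  // 'replaced'

echo "$afterMutate, $afterReplace, $afterReplaceByRef\n";
```

So property changes always reach the caller, but swapping in a whole new object only reaches the caller when the parameter is declared with `&`.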

ye olden days of references in php

For those who are interested, in PHP’s history, parameters to be passed by reference used to have to be specified in the calling code, instead of the function signature, like so:

$a = "hello";
function modifyMe($a) { //<-- reference not used here
    $a .= " world!";
}
modifyMe(&$a); //<-- reference used here

But if you try to do this today (call-time pass-by-reference was removed in PHP 5.4), you get a Fatal error…

PHP Fatal error:  Call-time pass-by-reference has been removed;
If you would like to pass argument by reference, modify the declaration of modifyMe().

primary path code style

by Tim Stamp, 2015-10-19

This is a quick breakdown of a simple technique I use when laying out code within methods.

Basically it refers back to a common rule used when drawing logic flow diagrams, that the primary path in a logic flow diagram will follow the central vertical line from top to bottom.

This indicates that any non-standard route will follow a logic path that deviates from the central vertical line, and perhaps rejoins it further on.

Working on this idea, code methods and logic flow diagrams should be written in a similar fashion – if there are fewer lines in a logic flow diagram, then it is a simpler diagram. A line in a flow diagram basically translates to a code block in code, so the fewer nested blocks there are, the fewer lines in the diagram there would be.

If two flows in a diagram share roughly equal likelihood, then this rule has less significance, as mutually exclusive nested blocks must exist within the code to represent this. Consider whether the method could be split out into separate methods if this is the case.

As a personal preference, I try not to nest code more than 3 levels deeper than the initial depth of the method. (This will start at 2 levels deep for a class method, one deep for a function.)

Here’s an example of some code:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null !== $a) {
        funcC();
        funcD();
        //more code goes here
    } else {
        throw new Exception("a should not be null");
    }
}

This code demonstrates a problem I see regularly, and it follows a pattern that breaks the primary path code style. In this example the primary path only executes `funcA()` and `funcB()` and then deviates into two parallel paths, instead of suggesting which path is most likely. Normally having this separation would be fine, because it is entirely possible that the two paths (the if/else blocks) are equally likely – but in this case the `else` case only exists to throw an `Exception` if the `if` statement is not met.

This breaks our rule, as an exception should only be thrown in cases of exceptional circumstances, hence the name. The only caveat to this is if the `doSomething()` method serves the primary purpose of throwing an exception in the first place, which in this case it does not.

This code can be simplified following the primary path code style as follows:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null === $a) {
        throw new Exception("a should not be null");
    }
    funcC();
    funcD();
    //more code goes here
}

Here we can see that `funcC()` and `funcD()` are now part of the primary path, and have a greater likelihood of being executed than the throwing of the Exception does.
You should also note that this version of the code takes up less space on disk, as there is reduced indenting, and the `} else {` line is no longer required.

To a further extreme, I have seen code that looks like this, but with very many more lines:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null !== $a) {
        funcC();
        $d = funcD();
        if(null !== $d) {
            $e = funcE();
            $f = funcF();
            if(null !== $f) {
                //further horrible indented code
            } else {
                throw new FIsNullException("f is null, this should never happen");
            }
        } else {
            throw new DIsNullException("d should not be null");
        }
        //more code goes here
    } else {
        throw new AIsNullException("a should not be null");
    }
}

You can see how this pattern-ignoring coding has led to a lot of unnecessary indenting, and code that is generally more difficult to read. Here it is simplified:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null === $a) {
        throw new AIsNullException("a should not be null");
    }
    funcC();
    $d = funcD();
    if(null === $d) {
        throw new DIsNullException("d should not be null");
    }
    $e = funcE();
    $f = funcF();
    if(null === $f) {
        throw new FIsNullException("f is null, this should never happen");
    }
    //further primary path code
    //more code goes here
}

This code has **exactly** the same functionality, but now follows the pattern, and you can easily see the code that follows the primary path, and you’ve saved some disk space too.

If you think about how `if` statements are actually processed and executed by the CPU, `else` statements don’t really exist in assembly in the same way they do in abstracted languages like PHP. In assembly, an `if` statement is akin to a conditional jump to a different instruction block depending on the outcome of a comparison (==0 `JZ`, >=n `JGE`, <n `JL`, etc.)[1]. If the comparison doesn’t match, the jump doesn’t happen, and execution moves on to the next instruction. It is therefore (albeit slightly) more processor-efficient when fewer conditions match along the primary path, as this means fewer jumps and fewer instruction blocks.
This gain is unlikely to be noticeable, or even detectable, in a highly abstracted language such as PHP, but it’s always good to consider the instructions you are sending to the CPU once in a while, and ask yourself: is there a more logical way to walk through this code?

the importance of being logged

by Tim Stamp, 2015-10-16

This is a helpful reminder to everyone (me included), to always add logging to the system you’re working on.

If a code block is exiting abnormally, or results are not as expected, this obviously needs to be handled in the code, but it is often the result of an external problem, and so should be recorded to a log file. In most circumstances this only involves adding one line of code. Most logging facilities use different levels of logging output to differentiate the severity of the message, so that the more verbose levels can be disabled in production systems to save processing, disk space and I/O bandwidth.
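To make the idea of levels concrete, here is a deliberately tiny, hypothetical logger that drops anything below a configured threshold. The `MiniLogger` class is made up for this sketch and keeps its lines in memory; it is not the `logger` service used in the examples in this post, and a real system would use an established logging library instead.

```php
<?php
// Hypothetical minimal logger, for illustration only.
class MiniLogger
{
    // Higher number = more severe.
    const LEVELS = ['debug' => 0, 'info' => 1, 'warning' => 2, 'error' => 3];

    private $minLevel;
    private $lines = [];

    public function __construct(string $minLevel)
    {
        $this->minLevel = self::LEVELS[$minLevel];
    }

    public function log(string $level, string $message)
    {
        if (self::LEVELS[$level] < $this->minLevel) {
            return; // below the configured threshold: discard
        }
        $this->lines[] = strtoupper($level) . ': ' . $message;
    }

    public function lines(): array
    {
        return $this->lines;
    }
}

// A production-style setting: only warnings and errors are kept.
$logger = new MiniLogger('warning');
$logger->log('debug', 'cache miss for key user:42');
$logger->log('error', 'User does not have access to mid 7');

print_r($logger->lines());
```

With the threshold at `warning`, the debug line costs one comparison and is then thrown away, which is where the saving in disk space and I/O comes from.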

Throwing a generic Access Denied Exception can be backed up with a more useful logging message, for analysis:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        "User does not have access to mid $mid"
    );
}

Exceptions should only be thrown under exceptional circumstances, hence the name. This situation is worthy of being logged.
Instead of this:

} catch (Exception $e) {}

We should be doing this:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        'Exception caught when fetching access list: ' . $e->getMessage()
    );
}

Logging is a very useful development aid too, but consider not removing all your debugging output when your code is all working and polished.

I’ve used logging numerous times in the past to diagnose systems experiencing Fatal errors, be it from Syntax errors or SegFaults. Logging on a line-by-line basis, outputting simply the `__CLASS__` name and `__LINE__` number, and flooding files with these statements is sometimes the only way to track down the fault.

So if the message you’re logging contains no static parts, remember that it will be harder to find in the code later on! So include something like the current class name, or use the special built-in magic constants that PHP has to offer:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        __METHOD__ . '#' . __LINE__ . ' : ' . $e->getMessage()
    );
}
