line-level code deduplication

by stampycodeNovember 5, 2015

When you’re learning about object-oriented code, people use “code deduplication” as one of the reasons for modularising your code, to increase code re-use, and the cut the number of times the same blocks of code are present inside any given system.

The following phrases are some of those that all developers should be thinking:

“If I want to re-use this block of code, maybe I should put it in a method…”
“If I want to re-use these methods, maybe I should put them in a separate class…”
“If I want to re-use this set of classes, maybe I should put them into a bundle…”

These thoughts all focus on code re-use for the purposes of maintainability, and de-duplication of effort. They focus on code-deduplication at Method, Class, and Module levels, respectively.

What these statements have missed, is that there is one more important reason for deduplication, that of code performance.

Code performance is overlooked often in so much of the code that I see, that it is often the number one reason that I end up spending time refactoring.

Consider the following code. It suffers from code duplication at the line-level that reduces readability, performance, and maintainability.

switch ($filters->getDateFilter()->getDateType()) {
    case 'started':
        $searchQueryBuilder->applyFilter(
            new StartDateRangeFilter($filters->getDateFilter()->getFrom(), $filters->getDateFilter()->getTo())
        );
        break;
    case 'ended':
        $searchQueryBuilder->applyFilter(
            new EndDateRangeFilter($filters->getDateFilter()->getFrom(), $filters->getDateFilter()->getTo())
        );
        break;
    case 'created':
        $searchQueryBuilder->applyFilter(
            new CreatedDateRangeFilter($filters->getDateFilter()->getFrom(), $filters->getDateFilter()->getTo())
        );
        break;
    case 'active':
        $searchQueryBuilder->applyFilter(
            new ActiveDateRangeFilter($filters->getDateFilter()->getFrom(), $filters->getDateFilter()->getTo())
        );
        break;
    default:
        throw new Exception("Invalid date_type, valid types are 'started', 'ended', 'created' and 'active'");
}

Repeated calls in this statement:

`$filters->getDateFilter()` – occurs 9 times
`$searchQueryBuilder->applyFilter()` – occurs 4 times
`$filters->getDateFilter()->getFrom()` – occurs 4 times
`$filters->getDateFilter()->getTo()` – occurs 4 times

There are several methods in the above code that are repeatedly called, even though we know that the result will always be the same.
(If we know that the result may change between calls, then this should be illustrated in a comment, but only badly written code would prefix a mutator method with ‘get’)

$dateFilter = $filters->getDateFilter();
$from = $dateFilter->getFrom();
$to = $dateFilter->getTo();

switch ($dateFilter->getDateType()) {
    case 'started':
        $filter = new StartDateRangeFilter($from, $to);
        break;
    case 'ended':
        $filter = new EndDateRangeFilter($from, $to);
        break;
    case 'created':
        $filter = new CreatedDateRangeFilter($from, $to);
        break;
    case 'active':
        $filter = new ActiveDateRangeFilter($from, $to);
        break;
    default:
        throw new Exception("Invalid dateType, valid types are 'started', 'ended', 'created' and 'active'");
}
$searchQueryBuilder->applyFilter($filter);

In the code above, these repeated calls have each been replaced with a single method call, using a local variable to store the result when appropriate, so it can be referenced without repeating the call. Code that is not necessary to be inside of the `switch` statement has been relocated after the switch statement.
This code could be shrunk further with the use of stringified method names, but in this instance is unnecessary, makes code tracing and maintenance harder, and reduces code readability and possibly also affects performance to a small extent.

Here is an example of how you could shrink the code further:

$dateFilter = $filters->getDateFilter();
$dateType = $dateFilter->getDateType();

$types = [
    'started' => 'StartDateRangeFilter',
    'ended' => 'EndDateRangeFilter',
    'created' => 'CreatedDateRangeFilter',
    'active' => 'ActiveDateRangeFilter',
];
if(!array_key_exists($dateType, $types)) {
    throw new Exception("Invalid date_type, valid types are '" . implode("', '", array_keys($types))."'");
}

$searchQueryBuilder->applyFilter(new $types[$dateType]($dateFilter->getFrom(), $dateFilter->getTo()));

This shrunk code does have the advantage that it is the most flexible, in that even the Exception message is dynamic, so adding a new type now only involves adding one line of code.
From the perspective of code tracing and maintainability, you could add a PHPDoc `@var` tag to tell any watching IDE that the resultant class could be an instance of any one of the four listed classes from above:

/** @var StartDateRangeFilter|EndDateRangeFilter|CreatedDateRangeFilter|ActiveDateRangeFilter $f */
$f = new $types[$dateType]($dateFilter->getFrom(), $dateFilter->getTo());
$searchQueryBuilder->applyFilter($f);

But this just starts to get silly, and we are again repeating the class names from the list for the purposes of IDE integration and code documentation.

Some developers would prefer to use an inline formatted layout for the switch statement as below:

$dateFilter = $filters->getDateFilter();
$from = $dateFilter->getFrom();
$to = $dateFilter->getTo();

switch ($dateFilter->getDateType()) {
    case 'started': $filter = new StartDateRangeFilter($from, $to);   break;
    case 'ended':   $filter = new EndDateRangeFilter($from, $to);     break;
    case 'created': $filter = new CreatedDateRangeFilter($from, $to); break;
    case 'active':  $filter = new ActiveDateRangeFilter($from, $to);  break;
    default:
        throw new Exception("Invalid dateType, valid types are 'started', 'ended', 'created' and 'active'");
}
$searchQueryBuilder->applyFilter($filter);

Recently I have been avoiding this code pattern, as it is against the PSR-2 Code Style Guide layed out by PHP-FIG – This is one of the standards I follow, and my IDE is configured to auto-format code which prevents me using this pattern. This is one of the (very few) things I dislike about the PSR-2 guide.

primary path code style

by stampycodeOctober 19, 2015

This is a quick breakdown of a simple technique I use when laying out code within methods.

Basically it refers back to a common rule used when drawing logic flow diagrams, that the primary path in a logic flow diagram will follow the central vertical line from top to bottom.

This indicates that any non-standard route will follow a logic path that deviates from the central vertical line, and perhaps rejoins it further on.

Working on this idea, code methods and logic flow diagrams should be written in a similar fashion – if there are fewer lines in a logic flow diagram, then it is a simpler diagram. A line in a flow diagram basically translates to a code block in code, so the fewer nested blocks there are, the fewer lines in the diagram there would be.

If two flows in a diagram share roughly equal likelihood, then this rule has less significance, as mutually exclusive nested blocks must exist within the code to represent this. Consider whether the method could be split out into separate methods if this is the case.

As personal preference, I try not to nest code more than 3 levels deeper than the initial depth of the method. (This will start at 2 levels deep for a class method, one deep for a function.)

Here’s an example of some code:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null !== $a) {
        funcC();
        funcD();
        //more code goes here
    } else {
        throw new Exception("a should not be null");
    }
}

This code demonstrates a problem I see regularly, and it follows a pattern that breaks the primary path code style. In this example the primary path only executes `funcA()` and `funcB()` and then deviates into two parallel paths, instead of suggesting which path is most likely. Normally having this separation would be fine, because it is entirely possible that the two paths (the if/else blocks) are equally likely – but in this case the `else` case only exists to throw an `Exception` if the `if` statement is not met.

This breaks our rule, as an exception should only be thrown in cases of exceptional circumstances, hence the name. The only caveat to this is if the `doSomething()` method serves the primary purpose of throwing an exception in the first place, which in this case it does not.

This code can be simplified following the primary path code style as follows:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null === $a) {
        throw new Exception("a should not be null");
    }
    funcC();
    funcD();
    //more code goes here
}

Here we can see the primary path now says that `funcC()` and `funcD()` are part of the primary path, and has a greater likelihood if being executed than throwing the Exception does.
You should also note that this version of the code takes up less space on disk, as there is reduced indenting, and the `} else {` line is no longer required.

To a further extreme, I have seen code that looks like this, but with very many more lines:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null !== $a) {
        funcC();
        $d = funcD();
        if(null !== $d) {
            $e = funcE();
            $f = funcF();
            if(null !== $f) {
                //further horrible indented code
            } else {
                throw new FIsNullException("f is null, this should never happpen");
            }
        } else {
            throw new DIsNullException("d should not be null");
        }
        //more code goes here
    } else {
        throw new AIsNullException("a should not be null");
    }
}

You can see how this pattern-ignoring coding has lead to much unnecessary indenting, and generally more difficult to read code. Here it is simplified:

public function doSomething()
{
    $a = funcA();
    $b = funcB();
    if(null === $a) {
        throw new AIsNullException("a should not be null");
    }
    funcC();
    $d = funcD();
    if(null === $d) {
        throw new DIsNullException("d should not be null");
    }
    $e = funcE();
    $f = funcF();
    if(null === $f) {
        throw new FIsNullException("f is null, this should never happpen");
    }
    //further primary path code
    //more code goes here
}

This code has **exactly** the same functionality, but now follows the pattern, and you can easily see the code that follows the primary path, and you’ve saved some disk space too.

If you think about how `if` statements are actually processed and executed by the CPU, `else` statements don’t really exist in assembly in the same way they do in abstracted languages like PHP – in assembly, an `if` statement is akin to a `JMP` to jump to a different instruction block depending on the outcome of a comparison statement, (==0 `JZ`, >=n `JGE`, <n `JL`, etc.)[1]. If the comparison doesn’t match, the jump doesn’t happen, and the execution point moves to the next instruction. It is therefore more processor efficient (albeit slightly) to have fewer ‘if’ statements that match the statement, as this means fewer jumps, and fewer instruction blocks.
Whether this performance gain translates upwards into highly abstract languages such as PHP is not likely to be noticed, or even detectable, but it’s always good to try to consider the instructions you are sending to the CPU once in a while, and ask yourself, is there a more logical way to walk through this code?

the importance of being logged

by stampycodeOctober 16, 2015

This is a helpful reminder to everyone (me included), to always add logging to the system you’re working on.

If a code block is exiting abnormally, or results are not as expected, this obviously needs to be handled in the code, but is often the result of an external problem, and so should be recorded to a log file. In most circumstances this only involves adding one line of code. All logging facilities use different levels of logging output to differentiate the severity of the message, and to disable the more verbose levels in production systems, to save processing, disk space and I/O bandwidth.

Throwing a generic Access Denied Exception can be backed up with a more useful logging message, for analysis:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        "User does not have access to mid $mid"
    );
}

Exceptions should only be thrown under exceptional circumstances, hence the name. This situation is worthy of being logged.
Instead of this:

} catch (Exception $e) {}

We should be doing this:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        'Exception caught when fetching access list: " . $e->getMessage()
    );
}

Logging is a very useful development aid too, but consider not removing all your debugging output when your code is all working and polished.

I’ve used logging numerous times in the past to diagnose when a system is experiencing Fatal errors, beit from Syntax to SegFaults – using logging on a line-by-line basis, outputting simply the __CLASS__ name and __LINE__ number, and flooding files with these statements is sometimes the only way to track down the fault.

So if the message you’re logging contains no static parts, remember that it will be harder to find in the code later on! So include something like the current class name, or use the special built-in magic constants that PHP has to offer:

} catch (Exception $e) {
    $this->container->get('logger')->error(
        __METHOD__ . '#' . __LINE__ . ' : ' . $e->getMessage()
    );
}

stampycode

The Security Architect

Author: stampycode

line-level code deduplication

primary path code style

the importance of being logged