Challenge (part one): Simple ORM

This post is the first post of a two-part challenge in which I am showing how to build a simple web scraper in PHP, and store the results in a database.

In order to translate objects into database table row data, and vice versa, we should use an Object Relational Mapper.
This allows us to basically ignore the translation mechanism between objects and the tables in our project. One of the more popular libraries for providing this functionality is Doctrine, but I fancy writing one myself, in order to demonstrate how to write a simple ORM facility for a small project, using only core PHP functionality.

Entity Class Design

Let’s create 3 simple entity classes, `TPage`, `TType` and `TCount`. These classes will extend the abstract class `AbstractEntity` and will be read from and written to the database using a class named `EntityHandler` – this will take care of the heavy lifting and database interactions.

The `AbstractEntity` class will give us an interface that we can rely on, to ensure that objects being handled by the `EntityHandler` class are in fact instances of this base class, and so are at least intended to be compatible with storing in the database.

In order to make this a simple setup, we need to be careful and strict about the naming conventions used when designing the database and entities and their properties. In this challenge, I have decided that we won’t be using a mapping of field names to object property names, we will be assuming the tables in the database will match the class names of the objects, and that the properties of these entities will also have names that match the fields for their respective tables.
In the case that we have an ID field in the database that refers to a foreign object, the field will have the name of the foreign table, with the letters ‘Id’ suffixed. This means that we can detect when an entity is storing a link (via it’s ID) to a foreign table, and we can automatically deduce the foreign table name from the property name – which means we can automatically handle the relationship in PHP without any further metadata or setup config for each entity class.
If we stick to a naming convention, we can also use an Array to show when there are multiple entities referencing the given entity – we will say that if the property type is an Array, then we can determine the name of the source entity from property name, and suffix ‘List’ to the name, for example.

If we put the above ideas into practice, we end up with something like this:

/**
 * All entities in the system that are storable in the database must extend the AbstractEntity class.
 * Entities must follow a strict naming convention:
 * 
    *
  • property names for normal types (int, string, DateTime) must match the database field * name exactly.
  • *
  • * @author StampyCode */ abstract class AbstractEntity { /** @var int The ID of the object instance, generated by the DB */ public $id; }
/**
 * Class TPage
 *
 * Stores information about the page we've scraped
 */
class TPage extends AbstractEntity
{
    /** @var string The page Title for the URL scraped */
    public $title;

    /** @var string The URL scraped */
    public $url;

    /** @var DateTime The date/time that the scrape was performed, or attempted */
    public $when;

    /** @var bool Whether the page scrape was successful */
    public $success;

    /** @var TCount[] Collection of  */
    public $TCountList = [];
}
/**
 * Class TType
 *
 * Stores the name of the type of entity we are storing a count for
 */
class TType extends AbstractEntity
{
    /** @var string The name of the Type */
    public $name;
}
/**
 * Class TCount
 *
 * Stores details about a specific element we've seen on the page TPage
 */
class TCount extends AbstractEntity
{
    /** @var TPage The scraped page that this count belongs to */
    public $TPage;

    /** @var TType The type of element this count refers to */
    public $TType;

    /** @var string The value of the element this count belongs to */
    public $value;

    /** @var int The number of elements of the given type that were found */
    public $count;
}

A little note about Accessors and Mutators: It is common for objects in the object-oriented world to use setters and getters (formally known as accessors and mutators) to protect access to what would normally be considered private properties of objects. There are many reasons why it is good to use them, but there is never one design pattern that suits every situation. In the above code I have chosen to not use them, simply because these objects are intended to be simple, straightforward data structures that have no validation, no error checking, nothing to prevent silly things being done to them, just like in the good ol’ days of programming in C. Structs in C represent the simple idea of collating several items of data together into a single unit so it can be passed from one routine to the next, or stored for use later on.
Adding setters and getters to these objects specifically, in my opinion, would be an example of premature optimisation – trying to cater for a scenario that doesn’t yet exist, and may never exist, and in doing so, you’ve just created superfluous code.
If, later on, you decide that you want to have a method that gives you the result of two of your entity fields joined together, fine. You can create that method within that entity if you like, but you have to ask yourself, is that good design? Are you breaking the design that the rest of the entities have stuck to rigidly? OR you can use a separate class that specifically contains methods for manipulating your entities. Alternatively if you want to add some validation, you could use the `__set` and `__get` magic methods to ensure that the way the entities are being used is as originally intended.

Database Design

And for the above class structure, we must use the following SQL schema, with some referential integrity thrown in for good measure:

CREATE TABLE `TPage` (
    id SERIAL,
    title VARCHAR(255) NOT NULL COMMENT 'The title of the page we scraped',
    url VARCHAR(4096) NOT NULL COMMENT 'The URL we scraped',
    `when` DATETIME NOT NULL COMMENT 'When we scraped the page',
    success TINYINT(1) NOT NULL COMMENT 'Whether the attempt to scrape this page worked'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TType` (
    id SERIAL,
    name VARCHAR(64) NOT NULL UNIQUE COMMENT 'The type of value saw on the page, eg. Tag, Content, etc.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TCount` (
    id SERIAL,
    TPageId BIGINT UNSIGNED NOT NULL COMMENT 'The page we scraped when we saw this value',
    TTypeId BIGINT UNSIGNED NOT NULL COMMENT 'The type of value we saw',
    value VARCHAR(64) NOT NULL COMMENT 'The value we saw on the page',
    `count` BIGINT UNSIGNED NOT NULL COMMENT 'The number of times we saw the value',
    CONSTRAINT `c_TCount__page_type_value`
        UNIQUE (TPageId, TTypeId, value),
    CONSTRAINT `c_TCount__TPageId`
        FOREIGN KEY (`TPageId`)
        REFERENCES `TPage` (`id`)
        ON DELETE CASCADE,
    CONSTRAINT `c_TCount__TTypeId`
        FOREIGN KEY (`TTypeId`)
        REFERENCES `TType` (`id`)
        ON DELETE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Don’t forget we need to escape with backticks any table or field names that are reserved words in MySQL, like ‘count’ as shown above.

So we have our entities. We’ve made sure that the implementation is simple, but not too simple that it needs massive amounts of work to add small features. It is dynamic to the point that our ORM doesn’t have any understanding of the database, nor the entities themselves.
This is why our naming convention is important.

Database Connectivity

But what about the database connection itself, I hear you ask? We can create a class to handle that too – if our application is to accept multiple databases of generic types, then we should have an interface that defines the types of interactions we need from our database adapter, so we can create the adapter to fit the interface and act as a mediator between a raw database connection object and the system accessing it. This way we can swap out adapters as necessary, and write new adapters for different database types. (Although we may have to do some further abstraction to allow us to use non-SQL databases at a later date.)
We shall have an interface named `DatabaseInterface` and an adapter named `MysqliDbConnection` that will implement this interface, and extend the built-in `Mysqli` object.

Here is our Database Interface, to allow us to abstract away the fact that we are using Mysqli, and instead provide our system with a generic database adapter that it can throw generic SQL queries at.

/**
 * Interface DatabaseInterface - exists to promise specific methods for accessing a generic database, so we can
 * abstract away the type of database connetion we are using at any particular time. We can then change the
 * database adapter at any time, as long as it provides the methods detailed in this interface, everything
 * will continue working.
 *
 * @package StampyCodeSimpleORM
 * @author StampyCode
 */
interface DatabaseInterface
{
    /**
     * Connect to the database using the given parameters
     * @return $this
     */
    public function connect();

    /**
     * Disconnect from the database
     * @return $this
     */
    public function disconnect();

    /**
     * Set the parameters to connect to the database
     * @param mixed[] $params
     * @return $this
     */
    public function setParameters(array $params);

    /**
     * Execute a SQL query on the connected database
     * Also connects to the database if it is not already connected
     * @param string $sql
     * @return bool on success, false on failure
     * @throws Exception if the query fails
     */
    public function query($sql);

    /**
     * Returns the result of the query passed to the query method
     * @return mixed
     */
    public function getResult();

    /**
     * Returns the mysql-safe escaped string wrapped with quotes
     * @param string $str
     * @return string
     */
    public function escapeString($str);

    /**
     * Returns the last ID that was auto-generated, for using in our objects
     * @return int
     */
    public function getLastAutogeneratedId();
}

And here is our implementation of this interface, wrapping a standard `Mysqli` object:

/**
 * Class DatabaseConnection - allows our system to connect to a database using Mysqli, but enabling our system to
 * not care what type of database we are using.
 *
 * @package StampyCodeSimpleORM
 * @author StampyCode
 */
class MysqliDbConnection implements DatabaseInterface
{
    /** @var Mysqli */
    private $mysqli;

    /** @var string */
    private
        $user = null,
        $pass = null,
        $host = null,
        $dbname = null;

    /** @var int */
    private $port = null;

    /**
     *
     */
    public function __construct()
    {
        $this->mysqli = new Mysqli();
    }

    /**
     * @inheritdoc
     */
    public function setParameters(array $params)
    {
        foreach($params as $k => $v) {
            if(property_exists($this, $k)) {
                $this->$k = $v;
            }
        }
        return $this;
    }

    /**
     * @inheritdoc
     */
    public function connect()
    {
        $this->mysqli->connect($this->host, $this->user, $this->pass, $this->dbname, $this->port);
        return $this;
    }

    /**
     * @inheritdoc
     */
    public function disconnect()
    {
        $this->mysqli->close();
    }

    /**
     * @inheritdoc
     */
    public function query($sql)
    {
        if(!$this->mysqli->ping()) {
            $this->connect();
        }
        $ret = $this->mysqli->real_query($sql);
        if(false === $ret) {
            throw $this->dbErrorFactory();
        }
        return $ret;
    }

    /**
     * @inheritdoc
     */
    public function getResult()
    {
        $res = $this->mysqli->use_result();
        if(false === $res) {
            throw $this->dbErrorFactory();
        }
        return $res->fetch_all(MYSQLI_ASSOC);
    }

    /**
     * @inheritdoc
     */
    public function escapeString($str)
    {
        if(is_numeric($str)) {
            return $str;
        }
        if(is_null($str)) {
            return 'NULL';
        }
        $str = $this->mysqli->real_escape_string($str);
        return '"'.$str.'"';
    }

    /**
     * @inheritdoc
     */
    public function getLastAutogeneratedId()
    {
        return $this->mysqli->insert_id;
    }

    /**
     * Creates an exception detailing the last database error code and message
     * @return Exception
     */
    private function dbErrorFactory()
    {
        return new Exception("Database error {$this->mysqli->errno}: {$this->mysqli->error}");
    }
}

So now we have a database connection object that we can pass to our ORM class, but we haven’t yet built our ORM class…. So here it is.

Our ORM Class

/**
 * Class EntityHandler
 *
 * @package StampyCodeCrawler
 */
class EntityHandler
{
    /** @var DatabaseInterface Our connection to the database for persisting and retrieving objects */
    private $dbConn = null;

    /** @var AbstractEntity[][] Storage of the objects the handler is managing */
    private $objCache = [];

    /**
     * @param DatabaseInterface $dbConn
     */
    public function __construct(DatabaseInterface $dbConn)
    {
        $this->dbConn = $dbConn;
    }

    /**
     * Fetch an object from the database with the given ID and type
     *
     * @param string    $class One of TPage, TType, TCount
     * @param int|mixed $id    The ID or parameters to find the object by
     * @return AbstractEntity
     * @throws InvalidArgumentException if the given class name is not an instance of AbstractEntity
     */
    public function get($class, $id)
    {
        if(!class_exists($class)) {
            throw new InvalidArgumentException("Unknown entity '$class'");
        }
        if(!is_subclass_of($class, 'AbstractEntity')) {
            throw new InvalidArgumentException("Class '$class' is not an instance of AbstractEntity");
        }
        if(!is_numeric($id)) {
            $id = $this->getIdByParams($class, $id);
            if(!$id) {
                return null;
            }
        }

        $this->stripNamespaceFromClass($class);

        /* we're using a local object cache so we dont end up with multiple instances of the same
         *  database object floating around our system
         */
        if(isset($this->objCache[$class]) && isset($this->objCache[$class][$id])) {
            return $this->objCache[$class][$id];
        }

        //Prepare and execute the select statement to fetch the object's properties
        $id = (int)$id;
        $sql = "SELECT * FROM $class WHERE id = $id";
        $this->dbConn->query($sql);
        $result = $this->dbConn->getResult();
        if(!$result) {
            return null;
        }

        //create our new object, populate it with the found values
        /** @var AbstractEntity $obj */
        $obj = new $class();
        foreach(current($result) as $col => $row) {
            if(!property_exists($obj, $col)) {
                trigger_error("Property '$col' of class '$class' does not exist", E_USER_WARNING);
                continue;
            }
            $obj->$col = $row;
        }

        $this->objCache[$class][$obj->id] = $obj;
        return $obj;
    }

    /**
     * Save a given entity to the database
     *
     * @param AbstractEntity $obj
     */
    public function set(AbstractEntity $obj)
    {
        /* we're creating a stack here so this function can be called recursively for child objects of the given
         * object - as our objects will want their children to be committed to the database as well.
         */
        static $stack = [];
        if(in_array($obj, $stack)) {
            return;
        }
        array_push($stack, $obj);

        $this->persistFields($obj);

        $this->persistObjects($obj);

        //pop the stack, so we can now allow re-running this method with this object
        $o = array_pop($stack);
        if($o !== $obj) {
            trigger_error("Stack Corrupted", E_USER_ERROR);
        }
    }

    /**
     * Fetch an object from the database matching the given parameters
     *
     * @param string  $class
     * @param mixed[] $params
     * @return int
     */
    private function getIdByParams($class, $params)
    {
        if(!class_exists($class)) {
            throw new InvalidArgumentException("Unknown entity '$class'");
        }
        if(!is_subclass_of($class, 'AbstractEntity')) {
            throw new InvalidArgumentException("Class '$class' is not an instance of AbstractEntity");
        }

        $this->stripNamespaceFromClass($class);

        $params = $this->convertArrayToGetStatement($params);

        $sql = "SELECT id FROM $class WHERE $params LIMIT 1";

        $this->dbConn->query($sql);
        $result = $this->dbConn->getResult();

        if(!$result) {
            return null;
        }
        $result = current($result);
        $id = $result['id'];

        return $id;
    }

    /**
     * Return an associative array that describes the properties of the given object so it can be saved to the
     * database. The list of output fields does not include other objects
     *
     * @param AbstractEntity $obj
     * @return mixed[]
     */
    private function getFieldsAsArray(AbstractEntity $obj)
    {
        $props = [];
        $reflect = new ReflectionClass($obj);
        foreach($reflect->getProperties() as $prop) {
            $name = $prop->getName();
            if($name[0] === '_' || $name === 'id') {
                continue;
            }
            $value = $obj->$name;
            if($value instanceof AbstractEntity) {
                if($value->id) {
                    $name = $name . 'Id';
                } else {
                    $this->set($value);
                }
                $value = $value->id;
            }
            if(is_array($value)) {
                continue;
            }
            if($value instanceof DateTime) {
                $value = $value->format('Y-m-d H-i-s');
            }
            $props[$name] = $value;
        }
        return $props;
    }

    /**
     * Returns an array of objects that are referenced directly by the given entity, so for example, the TPage
     * class stores an array of TCount objects - this method will add any instances of AbstractEntity within
     * that list into an array to be returned.
     *
     * @param AbstractEntity $obj
     * @return array
     */
    private function getObjectsFromEntity(AbstractEntity $obj)
    {
        $props = [];
        $reflect = new ReflectionClass($obj);
        foreach($reflect->getProperties() as $prop) {
            $name = $prop->getName();
            if($obj->$name instanceof AbstractEntity) {
                $props[] = $obj->$name;
            } elseif(is_array($obj->$name)) {
                foreach($obj->$name as $elem) {
                    if($elem instanceof AbstractEntity) {
                        $props[] = $elem;
                    }
                }
            }
        }
        return $props;
    }

    /**
     * Persists the properties in the given object to the database, does not interact with any instances of
     * AbstractEntity that are referred to by properties within the given entity
     *
     * @param AbstractEntity $obj
     */
    private function persistFields(AbstractEntity $obj)
    {
        //setup the variables to be sent to the database
        $class = get_class($obj);
        $this->stripNamespaceFromClass($class);
        $props = $this->getFieldsAsArray($obj);

        if(isset($obj->id)) {
            //UPDATE an existing entity
            $update = $this->convertArrayToSetStatement($props);
            if(!$update) {
                return;
            }
            $sql = "UPDATE $class SET $update WHERE id = {$obj->id}";
        } else {
            //CREATE a new entity
            if(!count($props)) {
                return;
            }
            $rows = '`' . implode('`, `', array_keys($props)) . '`';
            $props = array_map([$this->dbConn, 'escapeString'], $props);
            $values = implode(', ', $props);
            $sql = "INSERT INTO $class ($rows) VALUES ($values)";
        }

        //execute the query
        $this->dbConn->query($sql);

        //if CREATE statement called, set the new ID created for our object
        if(!isset($obj->id)) {
            $obj->id = $id = $this->dbConn->getLastAutogeneratedId();
        }
    }

    /**
     * Persists objects referred to by the given class into the database
     *
     * does not persist normal field values
     *
     * @param AbstractEntity $obj
     */
    private function persistObjects(AbstractEntity $obj)
    {
        $objects = $this->getObjectsFromEntity($obj);
        foreach($objects as $obj) {
            $this->set($obj);
        }
    }

    /**
     * Strip the namespace from the front of the class name
     *
     * @param string $class
     */
    private function stripNamespaceFromClass(&$class)
    {
        $ex = explode('\', $class);
        $class = array_pop($ex);
    }

    /**
     * Convert given assoc array to a list of SQL update field parameters, with values escaped, ready to be
     * used in a SQL SET statement
     *
     * @param mixed[] $arr
     * @return string
     */
    private function convertArrayToSetStatement(array $arr)
    {
        $out = [];
        foreach($arr as $k => $val) {
            $out[] = $k . ' = ' . $this->dbConn->escapeString($val);
        }
        return implode(', ', $out);
    }

    /**
     * Convert given associative array to a list of SQL query parameters, with values escaped, ready to be
     * used in a SQL GET statement
     *
     * @param mixed[] $arr
     * @return string
     * @throws InvalidArgumentException if passed an array that has numeric keys
     */
    private function convertArrayToGetStatement(array $arr)
    {
        $out = [];
        foreach($arr as $k => $val) {
            if(is_numeric($k)) {
                throw new InvalidArgumentException("array using numeric indexes instead of assoc array");
            }
            $out[] = $k . ' = ' . $this->dbConn->escapeString($val);
        }
        return implode(' AND ', $out);
    }
}

Essentially, there’s only three parts you need to understand from the outside: the constructor, which accepts our wrapped database adapter, the `get` method, which accepts an entity class name and an ID, (or array of parameters to search for), and the `set` method – which accepts an entity that will be saved to the database.

The `get` method must allow for finding objects in two different ways – either you know the ID of the object you want, or you know a field property value that it has that you wish to search for. So if you pass in an associative array into the `id` field like `[‘name’=>’hello’]` then our ORM class will go to the database and search for an entity (of the type provided) that matches the field parameter described. You can specify multiple parameters in the passed array to narrow the search if desired, and the ORM class will return the first object that matches the searched pattern.
Currently we don’t have need for a ‘find all matches’ functionality in our ORM, so we haven’t implemented one.
The first parameter of the `get` method accepts the string class name of the entity type to be searched for, in this instance, it can one of ‘TPage’, ‘TType’ and ‘TCount’ – any string passed other than one of these will cause an exception to be thrown.
Whenever an object is fetched from the database, it will be added to the local class cache, which maintains a list of all the objects it is currently managing. The reason for this is simple – if we try to fetch the same row twice, we don’t want to have two instances of the same “object” floating around in the same running instance of our application. Aside from being poor management memory-wise, it can also lead to write conflicts and much confusion.

The `set` method allows for two different object statuses – ones that do not exist in the database already, and ones that do. It’s dead easy to figure out if an object is already in there – as it will have its ID property set. Of course there is nothing stopping us manually setting the ID and trying to update a row that doesnt exist, or removing the ID from one that does exist and getting us throwing database constraint exceptions…. But we could be here all night trying to prevent programmer errors, so let’s just keep things simple for now.
If the id is set, then we use the UPDATE method, if it is not set, then we use the INSERT method, and set the ID.
Like with the `get` method, whenever we write a new object to the database, we add it to our local class cache, just in case we refer to it again later, perhaps from a different context within our application.

Trying it out

So there it is, we’ve put together a simple ORM system that we can adapt to our needs, which is lightweight and needs no up-front configuration, just common sense design of our database and class setup enables us to quickly get our stuff saved to our database.
Of course there are limitations in some of the approaches we’ve taken here, but the simplicity of our design is the focus at this point.

Here’s some code to test the above setup.

header('content-type: text/plain');
ini_set('show_errors', 1);
ini_set('max_execution_time', 0);
ini_set('memory_limit', '1G');
error_reporting(E_ALL);
date_default_timezone_set('UTC');

try {    
    $dbConn = new MysqliDbConnection();
    $dbConn->setParameters(
        [
            'user' => 'scraperdbuser',
            'pass' => '',
            'host' => 'localhost',
            'dbname' => 'scraperdb',
            'port' => null
        ]
    );
    $dbConn->connect();
    $entityHandler = new EntityHandler($dbConn);

    $testPage = new TPage();
    $testPage->title = 'Foo';
    $testPage->url = 'http://example.com';
    $testPage->when = new DateTime();
    $testPage->success = true;

    $tagType = $entityHandler->get('TType', ['name'=>'Tag']);
    if(!$tagType) {
        $tagType = new TType();
        $tagType->name = 'Tag';
    }

    $wordType = $entityHandler->get('TType', ['name'=>'Word']);
    if(!$wordType) {
        $wordType = new TType();
        $wordType->name = 'Word';
    }

    $tag1 = new TCount();
    $tag1->count = 2;
    $tag1->TPage = $testPage;
    $tag1->TType = $tagType;
    $tag1->value = 'button';

    $tag2 = new TCount();
    $tag2->count = 4;
    $tag2->TPage = $testPage;
    $tag2->TType = $tagType;
    $tag2->value = 'form';

    $tag3 = new TCount();
    $tag3->count = 6;
    $tag3->TPage = $testPage;
    $tag3->TType = $wordType;
    $tag3->value = 'hello';

    $tag4 = new TCount();
    $tag4->count = 8;
    $tag4->TPage = $testPage;
    $tag4->TType = $wordType;
    $tag4->value = 'world';

    $testPage->TCountList[] = $tag1;
    $testPage->TCountList[] = $tag2;
    $testPage->TCountList[] = $tag3;
    $testPage->TCountList[] = $tag4;

    $entityHandler->set($testPage);

    print_r($testPage);

} catch (Exception $e) {
    echo $e->getMessage()."nn".$e->getTraceAsString();
}

Enjoy 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s