Challenge (part two): Web Scraper

by Tim Stamp2015-04-11

The Task:

Write a php application that accepts a URL. Download the page the URL references. The page contents should then be broken into two parts. The first part determines all the different kinds of HTML tags on the page and the frequency counts for each. The second part determines all the different words that aren’t part of the HTML on the page and the frequency counts for each. The results from the two parts should be stored in a database.

There are a number of reasons why someone would want to do this. Part of this challenge is to create re-usable code, but the main aim is to use best practice code and style to achieve the task, in an efficient, understandable and coherent manner.

In part one of this challenge, we built our simple ORM classes for storing well structured entities into a database, and here we are going to build on that, to store the results of our page scraping into a set of tables.

Database Design

Firstly let’s do some database design. We need to store URLs, counts of tags on the page, and counts of words on the page that aren’t part of the HTML markup.

Let’s assume that we are writing this for a small application, and will only be scraping upto a few hundred sites. This allows us to make some assumptions about database capacity and performance considerations, like column widths, choice of database type, column sizes, etc. We will also begin with the assumption that this scraper will only scrape basic HTML pages – any largely dynamic pages (through Javascript or Flash) will not be processed very well, as they tend to offer less fixed HTML up front, with the focus on the browser enriching the page by making subsequent page requests and modifying the page after the initial load.

USE scraperdb;

CREATE TABLE `TPage` (
    id SERIAL,
    title VARCHAR(255) NOT NULL COMMENT 'The title of the page we scraped',
    url VARCHAR(4096) NOT NULL COMMENT 'The URL we scraped',
    `when` DATETIME NOT NULL COMMENT 'When we scraped the page',
    success TINYINT(1) NOT NULL COMMENT 'Whether the attempt to scrape this page worked'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TType` (
    id SERIAL,
    name VARCHAR(64) NOT NULL UNIQUE COMMENT 'The type of value saw on the page, eg. Tag, Content, etc.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TCount` (
    id SERIAL,
    TPageId BIGINT UNSIGNED NOT NULL COMMENT 'The page we scraped when we saw this value',
    TTypeId BIGINT UNSIGNED NOT NULL COMMENT 'The type of value we saw',
    value VARCHAR(64) NOT NULL COMMENT 'The value we saw on the page',
    `count` BIGINT UNSIGNED NOT NULL COMMENT 'The number of times we saw the value',
    CONSTRAINT `c_TCount__page_type_value`
        UNIQUE (TPageId, TTypeId, value),
    CONSTRAINT `c_TCount__TPageId`
        FOREIGN KEY (`TPageId`)
        REFERENCES `TPage` (`id`)
        ON DELETE CASCADE,
    CONSTRAINT `c_TCount__TTypeId`
        FOREIGN KEY (`TTypeId`)
        REFERENCES `TType` (`id`)
        ON DELETE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SQL;

So we’ve created a database that will allow us to store URLs, Tags and Content values, and the number of times we have seen each. We’ve put in some referential integrity constraints to prevent us from doing silly things by accident, such as trying to enter two different counts for one Tag/Scrape instance.
We could go one step further and create a separate table for the values we scrape, thus achieving third-normal form
within our data structure – but at this point it would be overkill and premature optimisation of our system.
Also, I have not added any extra column indexes to the tables, as we don’t yet have an idea of how the data will be used – we can add these once we have an established working prototype and want to optimise how we are using the results.

System Design

So lets now design our system. Here is some pseudo-code to establish what we’re going to do.

enter URL to connect to
verify that we want to allow the given URL to be connected to
connect to the URL to check if robots are allowed, abort if not
connect to the URL and download the content in full
run an XML parser to analyse and pull apart our downloaded HTML
save the scraper page details to the database
analyse the tags, save the counts to the database
analyse the tag content, save the counts to the database

Verification

The URL that has been passed into our system may not be be in a form that we wish to accept – we may wish to prevent users from using IP addresses, or using a URL with embedded username and password to connect to.

Robots.txt

You’ll have noticed that I’ve included a check for ‘robots’ – This is an internet standard that has been around for many years – you can read up more about it on the robotstxt.org site. I’ve chosen to not scrape sites that have objected to being automatically scraped, using this method. You will see the code below contains checks for this.

XML Parser

We’re going to use the built-in DomDocument parser to parse our HTML. This library seems to be the most appropriate library to use for parsing HTML, as there is a lot of bad HTML in the wild, and this library is fairly fault-tolerant, and easy to use. HTML is not always XML compliant, some XML parsers will fail to parse HTML because of trivial shortcuts that programmers make when writing HTML, like not closing tags properly, or embedding attributes within tags that don’t have an argument, eg. `<script src=”…” async defer>`. This behaviour is not XML compliant.

ORM Entities

Our system has 3 entity classes, `TPage`, `TType` and `TCount`. These classes will extend the abstract class `AbstractEntity` and will be read from and written to the database using a class named `EntityHandler` – this will take care of the heavy lifting and database interactions.
Lets see what they look like…

/**
 * All entities in the system that are storable in the database must extend the AbstractEntity class.
 * @package StampyCodeScraper
 */
abstract class AbstractEntity
{
    /** @var int The ID of the object instance, generated by the DB */
    public $id;
}

class TPage extends AbstractEntity
{
    /** @var string The page Title for the URL scraped */
    public $title;

    /** @var string The URL scraped */
    public $url;

    /** @var DateTime The date/time that the scrape was performed, or attempted */
    public $when;

    /** @var bool Whether the page scrape was successful */
    public $success;

    /** @var TCount[] Collection of  */
    public $tCounts;
}

class TType extends AbstractEntity
{
    /** @var string The name of the Type */
    public $name;
}

class TCount extends AbstractEntity
{
    /** @var TPage The scraped page that this count belongs to */
    public $TPage;

    /** @var TType The type of element this count refers to */
    public $TType;

    /** @var string The value of the element this count belongs to */
    public $value;

    /** @var int The number of elements of the given type that were found */
    public $count;
}

The Code

Here’s the class definition for our scraper. It contains all the features we’ve described above, commented and ready to be used.

/**
 * Class Scraper
 *
 * @package Scraper
 */
class Scraper
{
    /** @var string */
    private $url;

    /** @var string */
    private $rawContent;

    /** @var int[][] */
    private $results = [];

    /**
     * @param string $url
     */
    public function __construct($url)
    {
        $this->url = $url;
    }

    /**
     */
    public function scrape()
    {
        $this->checkUrlSanity($this->url);
        $this->checkRobotPermission($this->url);
        $this->rawContent = $this->getHttpContent($this->url);
        $this->getTagSummary($this->rawContent);
    }

    /**
     * @return int[][]
     */
    public function getResults()
    {
        return $this->results;
    }

    /**
     * @param string $url
     * @throws Exception
     */
    private function checkUrlSanity($url)
    {
        $urlParts = parse_url($url);
        if($urlParts['scheme'] !== 'http') {
            throw new Exception("Scraper only accepts HTTP URLs");
        }
        if($urlParts['host'] === 'localhost') {
            throw new Exception("Scraper will not scrape the local machine");
        }
        if(filter_var($urlParts['host'], FILTER_VALIDATE_IP)) {
            throw new Exception("URLs must be host-based, not IP based");
        }
        if(!strpos($urlParts['host'], '.')) {
            // try to prevent the hostname from being a locally resolvable one
            throw new Exception("Host names must contain at least a TLD and a gTLD");
        }
        if(isset($urlParts['pass'])) {
            throw new Exception("Scraper will not accept URLs with username/password parameters");
        }
    }

    /**
     * @see http://www.robotstxt.org/
     * @param string $url
     * @throws Exception if the given URL is not scrapable by robots.
     */
    private function checkRobotPermission($url)
    {
        $urlParts = parse_url($url);
        $url = $urlParts['scheme'] . '://' . $urlParts['host'] . ':' . $urlParts['port'] . '/robots.txt';

        $robotsTxt = $this->getHttpContent($url);
        if(!$robotsTxt) {
            //robots.txt file not found, so we can proceed :)
            return;
        }
        if($robotsTxt[0] === '= $valueLength) {
                continue;
            }
            $ruleLength = $valueLength;
            $allowed = ($field === 'allow');
            $lastMatchingRule = $value;
        }
        if(!$allowed) {
            throw new Exception("Robots are not allowed to access path '$lastMatchingRule'");
        }
    }

    /**
     * Retrieve the web content of the URL provided
     *
     * @param string $url
     * @return string
     */
    private function getHttpContent($url)
    {
        $context = stream_context_create(['http' => ['header'=>"Connection: closern"]]);
        return file_get_contents($url, false, $context);
    }

    /**
     * Summarises the given HTML content and stores a count of all seen tag names and non-HTML words used
     *
     * @param string $content
     * @throws Exception if the parser fails
     */
    private function getTagSummary($content)
    {
        $tags = [];
        $words = [];
        $excludedTags = [];
        $excludedWords = [null,''];

        $charHandler = function($data) use (&$words) {
            $data = preg_replace('|[^a-zA-Z0-9_-]|', ' ', $data);
            $data = explode(' ', $data);
            $words = array_merge($words, $data);
        };
        //todo: exclude text content within Script and Style tags

        $doc = new DOMDocument();
        $doc->loadHTML($content);
        $this->processDomNodeList($doc->documentElement->childNodes, $tags);

        $charHandler($doc->textContent);

        $tags = array_diff($tags, $excludedTags);
        $words = array_diff($words, $excludedWords);
        $tags = array_count_values($tags);
        $words = array_count_values($words);
        arsort($tags);
        arsort($words);
        $this->results = [
            'Tag' => $tags,
            'Word' => $words
        ];
    }

    /**
     * Iterates over the given DomNodeList object, identifies the tag name and stores it to the given array
     *
     * @param DomNodeList $list
     * @param string[] $tagList
     */
    private function processDomNodeList(DomNodeList $list, &$tagList)
    {
        for($i=0; $ilength; ++$i) {
            $item = $list->item($i);
            $tagList[] = $item->nodeName;
            if($item->childNodes instanceof DomNodeList) {
                $this->processDomNodeList($item->childNodes, $tagList);
            }
        }
    }
}

Saving the Results

This is a basic working prototype for a web-page scraper. It accepts a URL, processes the given page, and counts the number of Tag types and Words used in the page. In order to store these results, we create instances of our `AbstractEntity` classes, and save them to the database using our `EntityHandler` class.

try {
    //establish our database connection, to pass into the EntityHandler class
    $dbConn = new MysqliDbConnection();
    $dbConn->setParameters(
        [
            'user' => 'scraperdbuser',
            'pass' => '',
            'host' => 'localhost',
            'dbname' => 'scraperdb',
            'port' => null
        ]
    );
    $dbConn->connect();
    $entityHandler = new EntityHandler($dbConn);

    // Create a TPage object to associate and collate our results with
    $testPage = new TPage();
    $testPage->title = 'Foo';
    $testPage->url = 'http://stampy.me';
    $testPage->when = new DateTime();
    $testPage->success = true;

    $scraper = new Scraper($testPage->url);

    try {
        $scraper->scrape();
    } catch (Exception $e) {
        echo "Scrape Failed - ".$e->getMessage() .' - '. $e->getTraceAsString();
        $testPage->success = false;
    }

    foreach($scraper->getResults() as $tagType => $result) {
        //get the tag type class, if it exists
        $typeObj = $entityHandler->get('TType', ['name' => $tagType]);
        if(!$typeObj) {
            // otherwise, create it!
            $typeObj = new TType();
            $typeObj->name = $tagType;
        }
        foreach($result as $value => $count) {
            //create a new TCount object for each count result
            $tag = new TCount();
            $tag->count = $count;
            $tag->TPage = $testPage;
            $tag->TType = $typeObj;
            $tag->value = $value;
            $testPage->TCountList[] = $tag;
        }
    }

    $entityHandler->set($testPage);

    print_r($scraper->getResults());

} catch(Exception $e) {
    echo $e->getMessage() . "nn" . $e->getTraceAsString();
}

Alternatives

There’s a world of options out there for page scraping – it is after all, how search engines operate. They connect to a website, pull useful information from it, which includes links to other pages or websites, and then connect to those as well. This tutorial was written in a couple of days by a single developer – and the simplicity of the classes reflect that.

Enjoy 🙂

Challenge (part one): Simple ORM

by Tim Stamp2015-04-11

This post is the first post of a two-part challenge in which I am showing how to build a simple web scraper in PHP, and store the results in a database.

In order to translate objects into database table row data, and vice versa, we should use an Object Relational Mapper.
This allows us to basically ignore the translation mechanism between objects and the tables in our project. One of the more popular libraries for providing this functionality is Doctrine, but I fancy writing one myself, in order to demonstrate how to write a simple ORM facility for a small project, using only core PHP functionality.

Entity Class Design

Let’s create 3 simple entity classes, `TPage`, `TType` and `TCount`. These classes will extend the abstract class `AbstractEntity` and will be read from and written to the database using a class named `EntityHandler` – this will take care of the heavy lifting and database interactions.

The `AbstractEntity` class will give us an interface that we can rely on, to ensure that objects being handled by the `EntityHandler` class are in fact instances of this base class, and so are at least intended to be compatible with storing in the database.

In order to make this a simple setup, we need to be careful and strict about the naming conventions used when designing the database and entities and their properties. In this challenge, I have decided that we won’t be using a mapping of field names to object property names, we will be assuming the tables in the database will match the class names of the objects, and that the properties of these entities will also have names that match the fields for their respective tables.
In the case that we have an ID field in the database that refers to a foreign object, the field will have the name of the foreign table, with the letters ‘Id’ suffixed. This means that we can detect when an entity is storing a link (via it’s ID) to a foreign table, and we can automatically deduce the foreign table name from the property name – which means we can automatically handle the relationship in PHP without any further metadata or setup config for each entity class.
If we stick to a naming convention, we can also use an Array to show when there are multiple entities referencing the given entity – we will say that if the property type is an Array, then we can determine the name of the source entity from property name, and suffix ‘List’ to the name, for example.

If we put the above ideas into practice, we end up with something like this:

/**
 * All entities in the system that are storable in the database must extend the AbstractEntity class.
 * Entities must follow a strict naming convention:
 * 
 * property names for normal types (int, string, DateTime) must match the database field
 * name exactly.
 * 
 * @author StampyCode
 */
abstract class AbstractEntity
{
    /** @var int The ID of the object instance, generated by the DB */
    public $id;
}

/**
 * Class TPage
 *
 * Stores information about the page we've scraped
 */
class TPage extends AbstractEntity
{
    /** @var string The page Title for the URL scraped */
    public $title;

    /** @var string The URL scraped */
    public $url;

    /** @var DateTime The date/time that the scrape was performed, or attempted */
    public $when;

    /** @var bool Whether the page scrape was successful */
    public $success;

    /** @var TCount[] Collection of  */
    public $TCountList = [];
}

/**
 * Class TType
 *
 * Stores the name of the type of entity we are storing a count for
 */
class TType extends AbstractEntity
{
    /** @var string The name of the Type */
    public $name;
}

/**
 * Class TCount
 *
 * Stores details about a specific element we've seen on the page TPage
 */
class TCount extends AbstractEntity
{
    /** @var TPage The scraped page that this count belongs to */
    public $TPage;

    /** @var TType The type of element this count refers to */
    public $TType;

    /** @var string The value of the element this count belongs to */
    public $value;

    /** @var int The number of elements of the given type that were found */
    public $count;
}

A little note about Accessors and Mutators: It is common for objects in the object-oriented world to use setters and getters (formally known as accessors and mutators) to protect access to what would normally be considered private properties of objects. There are many reasons why it is good to use them, but there is never one design pattern that suits every situation. In the above code I have chosen to not use them, simply because these objects are intended to be simple, straightforward data structures that have no validation, no error checking, nothing to prevent silly things being done to them, just like in the good ol’ days of programming in C. Structs in C represent the simple idea of collating several items of data together into a single unit so it can be passed from one routine to the next, or stored for use later on.
Adding setters and getters to these objects specifically, in my opinion, would be an example of premature optimisation – trying to cater for a scenario that doesn’t yet exist, and may never exist, and in doing so, you’ve just created superfluous code.
If, later on, you decide that you want to have a method that gives you the result of two of your entity fields joined together, fine. You can create that method within that entity if you like, but you have to ask yourself, is that good design? Are you breaking the design that the rest of the entities have stuck to rigidly? OR you can use a separate class that specifically contains methods for manipulating your entities. Alternatively if you want to add some validation, you could use the `__set` and `__get` magic methods to ensure that the way the entities are being used is as originally intended.

Database Design

And for the above class structure, we must use the following SQL schema, with some referential integrity thrown in for good measure:

CREATE TABLE `TPage` (
    id SERIAL,
    title VARCHAR(255) NOT NULL COMMENT 'The title of the page we scraped',
    url VARCHAR(4096) NOT NULL COMMENT 'The URL we scraped',
    `when` DATETIME NOT NULL COMMENT 'When we scraped the page',
    success TINYINT(1) NOT NULL COMMENT 'Whether the attempt to scrape this page worked'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TType` (
    id SERIAL,
    name VARCHAR(64) NOT NULL UNIQUE COMMENT 'The type of value saw on the page, eg. Tag, Content, etc.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TCount` (
    id SERIAL,
    TPageId BIGINT UNSIGNED NOT NULL COMMENT 'The page we scraped when we saw this value',
    TTypeId BIGINT UNSIGNED NOT NULL COMMENT 'The type of value we saw',
    value VARCHAR(64) NOT NULL COMMENT 'The value we saw on the page',
    `count` BIGINT UNSIGNED NOT NULL COMMENT 'The number of times we saw the value',
    CONSTRAINT `c_TCount__page_type_value`
        UNIQUE (TPageId, TTypeId, value),
    CONSTRAINT `c_TCount__TPageId`
        FOREIGN KEY (`TPageId`)
        REFERENCES `TPage` (`id`)
        ON DELETE CASCADE,
    CONSTRAINT `c_TCount__TTypeId`
        FOREIGN KEY (`TTypeId`)
        REFERENCES `TType` (`id`)
        ON DELETE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Don’t forget we need to escape with backticks any table or field names that are reserved words in MySQL, like ‘count’ as shown above.

So we have our entities. We’ve made sure that the implementation is simple, but not too simple that it needs massive amounts of work to add small features. It is dynamic to the point that our ORM doesn’t have any understanding of the database, nor the entities themselves.
This is why our naming convention is important.

Database Connectivity

But what about the database connection itself, I hear you ask? We can create a class to handle that too – if our application is to accept multiple databases of generic types, then we should have an interface that defines the types of interactions we need from our database adapter, so we can create the adapter to fit the interface and act as a mediator between a raw database connection object and the system accessing it. This way we can swap out adapters as necessary, and write new adapters for different database types. (Although we may have to do some further abstraction to allow us to use non-SQL databases at a later date.)
We shall have an interface named `DatabaseInterface` and an adapter named `MysqliDbConnection` that will implement this interface, and extend the built-in `Mysqli` object.

Here is our Database Interface, to allow us to abstract away the fact that we are using Mysqli, and instead provide our system with a generic database adapter that it can throw generic SQL queries at.

/**
 * Interface DatabaseInterface - exists to promise specific methods for accessing a generic database, so we can
 * abstract away the type of database connetion we are using at any particular time. We can then change the
 * database adapter at any time, as long as it provides the methods detailed in this interface, everything
 * will continue working.
 *
 * @package StampyCodeSimpleORM
 * @author StampyCode
 */
interface DatabaseInterface
{
    /**
     * Connect to the database using the given parameters
     * @return $this
     */
    public function connect();

    /**
     * Disconnect from the database
     * @return $this
     */
    public function disconnect();

    /**
     * Set the parameters to connect to the database
     * @param mixed[] $params
     * @return $this
     */
    public function setParameters(array $params);

    /**
     * Execute a SQL query on the connected database
     * Also connects to the database if it is not already connected
     * @param string $sql
     * @return bool on success, false on failure
     * @throws Exception if the query fails
     */
    public function query($sql);

    /**
     * Returns the result of the query passed to the query method
     * @return mixed
     */
    public function getResult();

    /**
     * Returns the mysql-safe escaped string wrapped with quotes
     * @param string $str
     * @return string
     */
    public function escapeString($str);

    /**
     * Returns the last ID that was auto-generated, for using in our objects
     * @return int
     */
    public function getLastAutogeneratedId();
}

And here is our implementation of this interface, wrapping a standard `Mysqli` object:


/**
 * Class DatabaseConnection - allows our system to connect to a database using Mysqli, but enabling our system to
 * not care what type of database we are using.
 *
 * @package StampyCodeSimpleORM
 * @author StampyCode
 */
class MysqliDbConnection implements DatabaseInterface
{
    /** @var Mysqli */
    private $mysqli;

    /** @var string */
    private
        $user = null,
        $pass = null,
        $host = null,
        $dbname = null;

    /** @var int */
    private $port = null;

    /**
     *
     */
    public function __construct()
    {
        $this->mysqli = new Mysqli();
    }

    /**
     * @inheritdoc
     */
    public function setParameters(array $params)
    {
        foreach($params as $k => $v) {
            if(property_exists($this, $k)) {
                $this->$k = $v;
            }
        }
        return $this;
    }

    /**
     * @inheritdoc
     */
    public function connect()
    {
        $this->mysqli->connect($this->host, $this->user, $this->pass, $this->dbname, $this->port);
        return $this;
    }

    /**
     * @inheritdoc
     */
    public function disconnect()
    {
        $this->mysqli->close();
    }

    /**
     * @inheritdoc
     */
    public function query($sql)
    {
        if(!$this->mysqli->ping()) {
            $this->connect();
        }
        $ret = $this->mysqli->real_query($sql);
        if(false === $ret) {
            throw $this->dbErrorFactory();
        }
        return $ret;
    }

    /**
     * @inheritdoc
     */
    public function getResult()
    {
        $res = $this->mysqli->use_result();
        if(false === $res) {
            throw $this->dbErrorFactory();
        }
        return $res->fetch_all(MYSQLI_ASSOC);
    }

    /**
     * @inheritdoc
     */
    public function escapeString($str)
    {
        if(is_numeric($str)) {
            return $str;
        }
        if(is_null($str)) {
            return 'NULL';
        }
        $str = $this->mysqli->real_escape_string($str);
        return '"'.$str.'"';
    }

    /**
     * @inheritdoc
     */
    public function getLastAutogeneratedId()
    {
        return $this->mysqli->insert_id;
    }

    /**
     * Creates an exception detailing the last database error code and message
     * @return Exception
     */
    private function dbErrorFactory()
    {
        return new Exception("Database error {$this->mysqli->errno}: {$this->mysqli->error}");
    }
}

So now we have a database connection object that we can pass to our ORM class, but we haven’t yet built our ORM class…. So here it is.

Our ORM Class

/**
 * Class EntityHandler
 *
 * @package StampyCodeCrawler
 */
class EntityHandler
{
    /** @var DatabaseInterface Our connection to the database for persisting and retrieving objects */
    private $dbConn = null;

    /** @var AbstractEntity[][] Storage of the objects the handler is managing */
    private $objCache = [];

    /**
     * @param DatabaseInterface $dbConn
     */
    public function __construct(DatabaseInterface $dbConn)
    {
        $this->dbConn = $dbConn;
    }

    /**
     * Fetch an object from the database with the given ID and type
     *
     * @param string    $class One of TPage, TType, TCount
     * @param int|mixed $id    The ID or parameters to find the object by
     * @return AbstractEntity
     * @throws InvalidArgumentException if the given class name is not an instance of AbstractEntity
     */
    public function get($class, $id)
    {
        if(!class_exists($class)) {
            throw new InvalidArgumentException("Unknown entity '$class'");
        }
        if(!is_subclass_of($class, 'AbstractEntity')) {
            throw new InvalidArgumentException("Class '$class' is not an instance of AbstractEntity");
        }
        if(!is_numeric($id)) {
            $id = $this->getIdByParams($class, $id);
            if(!$id) {
                return null;
            }
        }

        $this->stripNamespaceFromClass($class);

        /* we're using a local object cache so we dont end up with multiple instances of the same
         *  database object floating around our system
         */
        if(isset($this->objCache[$class]) && isset($this->objCache[$class][$id])) {
            return $this->objCache[$class][$id];
        }

        //Prepare and execute the select statement to fetch the object's properties
        $id = (int)$id;
        $sql = "SELECT * FROM $class WHERE id = $id";
        $this->dbConn->query($sql);
        $result = $this->dbConn->getResult();
        if(!$result) {
            return null;
        }

        //create our new object, populate it with the found values
        /** @var AbstractEntity $obj */
        $obj = new $class();
        foreach(current($result) as $col => $row) {
            if(!property_exists($obj, $col)) {
                trigger_error("Property '$col' of class '$class' does not exist", E_USER_WARNING);
                continue;
            }
            $obj->$col = $row;
        }

        $this->objCache[$class][$obj->id] = $obj;
        return $obj;
    }

    /**
     * Save a given entity to the database
     *
     * @param AbstractEntity $obj
     */
    public function set(AbstractEntity $obj)
    {
        /* we're creating a stack here so this function can be called recursively for child objects of the given
         * object - as our objects will want their children to be committed to the database as well.
         */
        static $stack = [];
        if(in_array($obj, $stack)) {
            return;
        }
        array_push($stack, $obj);

        $this->persistFields($obj);

        $this->persistObjects($obj);

        //pop the stack, so we can now allow re-running this method with this object
        $o = array_pop($stack);
        if($o !== $obj) {
            trigger_error("Stack Corrupted", E_USER_ERROR);
        }
    }

    /**
     * Fetch an object from the database matching the given parameters
     *
     * @param string  $class
     * @param mixed[] $params
     * @return int
     */
    private function getIdByParams($class, $params)
    {
        if(!class_exists($class)) {
            throw new InvalidArgumentException("Unknown entity '$class'");
        }
        if(!is_subclass_of($class, 'AbstractEntity')) {
            throw new InvalidArgumentException("Class '$class' is not an instance of AbstractEntity");
        }

        $this->stripNamespaceFromClass($class);

        $params = $this->convertArrayToGetStatement($params);

        $sql = "SELECT id FROM $class WHERE $params LIMIT 1";

        $this->dbConn->query($sql);
        $result = $this->dbConn->getResult();

        if(!$result) {
            return null;
        }
        $result = current($result);
        $id = $result['id'];

        return $id;
    }

    /**
     * Return an associative array that describes the properties of the given object so it can be saved to the
     * database. The list of output fields does not include other objects
     *
     * @param AbstractEntity $obj
     * @return mixed[]
     */
    private function getFieldsAsArray(AbstractEntity $obj)
    {
        $props = [];
        $reflect = new ReflectionClass($obj);
        foreach($reflect->getProperties() as $prop) {
            $name = $prop->getName();
            if($name[0] === '_' || $name === 'id') {
                continue;
            }
            $value = $obj->$name;
            if($value instanceof AbstractEntity) {
                if($value->id) {
                    $name = $name . 'Id';
                } else {
                    $this->set($value);
                }
                $value = $value->id;
            }
            if(is_array($value)) {
                continue;
            }
            if($value instanceof DateTime) {
                $value = $value->format('Y-m-d H-i-s');
            }
            $props[$name] = $value;
        }
        return $props;
    }

    /**
     * Returns an array of objects that are referenced directly by the given entity, so for example, the TPage
     * class stores an array of TCount objects - this method will add any instances of AbstractEntity within
     * that list into an array to be returned.
     *
     * @param AbstractEntity $obj
     * @return array
     */
    private function getObjectsFromEntity(AbstractEntity $obj)
    {
        $props = [];
        $reflect = new ReflectionClass($obj);
        foreach($reflect->getProperties() as $prop) {
            $name = $prop->getName();
            if($obj->$name instanceof AbstractEntity) {
                $props[] = $obj->$name;
            } elseif(is_array($obj->$name)) {
                foreach($obj->$name as $elem) {
                    if($elem instanceof AbstractEntity) {
                        $props[] = $elem;
                    }
                }
            }
        }
        return $props;
    }

    /**
     * Persists the properties in the given object to the database, does not interact with any instances of
     * AbstractEntity that are referred to by properties within the given entity
     *
     * @param AbstractEntity $obj
     */
    private function persistFields(AbstractEntity $obj)
    {
        //setup the variables to be sent to the database
        $class = get_class($obj);
        $this->stripNamespaceFromClass($class);
        $props = $this->getFieldsAsArray($obj);

        if(isset($obj->id)) {
            //UPDATE an existing entity
            $update = $this->convertArrayToSetStatement($props);
            if(!$update) {
                return;
            }
            $sql = "UPDATE $class SET $update WHERE id = {$obj->id}";
        } else {
            //CREATE a new entity
            if(!count($props)) {
                return;
            }
            $rows = '`' . implode('`, `', array_keys($props)) . '`';
            $props = array_map([$this->dbConn, 'escapeString'], $props);
            $values = implode(', ', $props);
            $sql = "INSERT INTO $class ($rows) VALUES ($values)";
        }

        //execute the query
        $this->dbConn->query($sql);

        //if CREATE statement called, set the new ID created for our object
        if(!isset($obj->id)) {
            $obj->id = $id = $this->dbConn->getLastAutogeneratedId();
        }
    }

    /**
     * Persists objects referred to by the given class into the database
     *
     * does not persist normal field values
     *
     * @param AbstractEntity $obj
     */
    private function persistObjects(AbstractEntity $obj)
    {
        $objects = $this->getObjectsFromEntity($obj);
        foreach($objects as $obj) {
            $this->set($obj);
        }
    }

    /**
     * Strip the namespace from the front of the class name
     *
     * @param string $class
     */
    private function stripNamespaceFromClass(&$class)
    {
        $ex = explode('\', $class);
        $class = array_pop($ex);
    }

    /**
     * Convert given assoc array to a list of SQL update field parameters, with values escaped, ready to be
     * used in a SQL SET statement
     *
     * @param mixed[] $arr
     * @return string
     */
    private function convertArrayToSetStatement(array $arr)
    {
        $out = [];
        foreach($arr as $k => $val) {
            $out[] = $k . ' = ' . $this->dbConn->escapeString($val);
        }
        return implode(', ', $out);
    }

    /**
     * Convert given associative array to a list of SQL query parameters, with values escaped, ready to be
     * used in a SQL GET statement
     *
     * @param mixed[] $arr
     * @return string
     * @throws InvalidArgumentException if passed an array that has numeric keys
     */
    private function convertArrayToGetStatement(array $arr)
    {
        $out = [];
        foreach($arr as $k => $val) {
            if(is_numeric($k)) {
                throw new InvalidArgumentException("array using numeric indexes instead of assoc array");
            }
            $out[] = $k . ' = ' . $this->dbConn->escapeString($val);
        }
        return implode(' AND ', $out);
    }
}

Essentially, there’s only three parts you need to understand from the outside: the constructor, which accepts our wrapped database adapter, the `get` method, which accepts an entity class name and an ID, (or array of parameters to search for), and the `set` method – which accepts an entity that will be saved to the database.

The `get` method must allow for finding objects in two different ways – either you know the ID of the object you want, or you know a field property value that it has that you wish to search for. So if you pass in an associative array into the `id` field like `[‘name’=>’hello’]` then our ORM class will go to the database and search for an entity (of the type provided) that matches the field parameter described. You can specify multiple parameters in the passed array to narrow the search if desired, and the ORM class will return the first object that matches the searched pattern.
Currently we don’t have need for a ‘find all matches’ functionality in our ORM, so we haven’t implemented one.
The first parameter of the `get` method accepts the string class name of the entity type to be searched for, in this instance, it can one of ‘TPage’, ‘TType’ and ‘TCount’ – any string passed other than one of these will cause an exception to be thrown.
Whenever an object is fetched from the database, it will be added to the local class cache, which maintains a list of all the objects it is currently managing. The reason for this is simple – if we try to fetch the same row twice, we don’t want to have two instances of the same “object” floating around in the same running instance of our application. Aside from being poor management memory-wise, it can also lead to write conflicts and much confusion.

The `set` method allows for two different object statuses – ones that do not exist in the database already, and ones that do. It’s dead easy to figure out if an object is already in there – as it will have its ID property set. Of course there is nothing stopping us manually setting the ID and trying to update a row that doesnt exist, or removing the ID from one that does exist and getting us throwing database constraint exceptions…. But we could be here all night trying to prevent programmer errors, so let’s just keep things simple for now.
If the id is set, then we use the UPDATE method, if it is not set, then we use the INSERT method, and set the ID.
Like with the `get` method, whenever we write a new object to the database, we add it to our local class cache, just in case we refer to it again later, perhaps from a different context within our application.

Trying it out

So there it is, we’ve put together a simple ORM system that we can adapt to our needs, which is lightweight and needs no up-front configuration, just common sense design of our database and class setup enables us to quickly get our stuff saved to our database.
Of course there are limitations in some of the approaches we’ve taken here, but the simplicity of our design is the focus at this point.

Here’s some code to test the above setup.


header('content-type: text/plain');
ini_set('show_errors', 1);
ini_set('max_execution_time', 0);
ini_set('memory_limit', '1G');
error_reporting(E_ALL);
date_default_timezone_set('UTC');

try {    
    $dbConn = new MysqliDbConnection();
    $dbConn->setParameters(
        [
            'user' => 'scraperdbuser',
            'pass' => '',
            'host' => 'localhost',
            'dbname' => 'scraperdb',
            'port' => null
        ]
    );
    $dbConn->connect();
    $entityHandler = new EntityHandler($dbConn);

    $testPage = new TPage();
    $testPage->title = 'Foo';
    $testPage->url = 'http://example.com';
    $testPage->when = new DateTime();
    $testPage->success = true;

    $tagType = $entityHandler->get('TType', ['name'=>'Tag']);
    if(!$tagType) {
        $tagType = new TType();
        $tagType->name = 'Tag';
    }

    $wordType = $entityHandler->get('TType', ['name'=>'Word']);
    if(!$wordType) {
        $wordType = new TType();
        $wordType->name = 'Word';
    }

    $tag1 = new TCount();
    $tag1->count = 2;
    $tag1->TPage = $testPage;
    $tag1->TType = $tagType;
    $tag1->value = 'button';

    $tag2 = new TCount();
    $tag2->count = 4;
    $tag2->TPage = $testPage;
    $tag2->TType = $tagType;
    $tag2->value = 'form';

    $tag3 = new TCount();
    $tag3->count = 6;
    $tag3->TPage = $testPage;
    $tag3->TType = $wordType;
    $tag3->value = 'hello';

    $tag4 = new TCount();
    $tag4->count = 8;
    $tag4->TPage = $testPage;
    $tag4->TType = $wordType;
    $tag4->value = 'world';

    $testPage->TCountList[] = $tag1;
    $testPage->TCountList[] = $tag2;
    $testPage->TCountList[] = $tag3;
    $testPage->TCountList[] = $tag4;

    $entityHandler->set($testPage);

    print_r($testPage);

} catch (Exception $e) {
    echo $e->getMessage()."nn".$e->getTraceAsString();
}

Enjoy 🙂

stampycode

The Security Architect

scraper

Challenge (part two): Web Scraper

Database Design

System Design

Verification

Robots.txt

XML Parser

ORM Entities

The Code

Saving the Results

Alternatives

Challenge (part one): Simple ORM

Entity Class Design

Database Design

Database Connectivity

Our ORM Class

Trying it out