linux tips: screen

When you’re working in a remote terminal environment, being able to resume a session can be an invaluable tool – especially when the connection isn’t stable.

When connecting to a UNIX-based environment (like the many varieties of Linux, or OSX) there is a handy utility called `screen` that effectively allows you to run tabbed terminal consoles within a single terminal console. This has many benefits, not just the ability to resume the connection if the connection drops.

adding a status bar

The first thing to do when starting any new screen session is to add a config file. This file changes the default look and feel of screen, which isn’t very intuitive if you’re just learning how to use it.
I’m going to use `vim` to write my config file, and I’m going to use a configuration that I pilfered from somewhere on the internet a while ago:

# ~/.screenrc
termcapinfo xterm* ti@:te@
startup_message off
vbell off
autodetach on
altscreen on
shelltitle "$ |bash"
defscrollback 100000
defutf8 on
nonblock on
msgwait 0
hardstatus alwayslastline "%{b kw}%H %{r}%1` %{w}| %{y}%Y-%m-%d %c %{w}| %{g}%l %{w}| %{-b kw}%u %-Lw%{= rW}%50> %n%f %t %{-}%+Lw%<"
# (This fixes the "Aborted because of window size change" konsole symptoms found
#  in bug #134198)
termcapinfo xterm* 'is=\E[r\E[m\E[2J\E[H\E[?7h\E[?1;4;6l'
# (you may have to change the 'xterm' value to match your $TERM value)

The most useful line in this config file is the `hardstatus` line – it adds the status bar to the screen window, which contains the list of open terminal tabs. This is very useful, and I don’t know why it isn’t the default setup.
In the configuration shown above, the status bar contains the hostname on the left, the system load on the near right, and the server date and time on the far right, and your list of open terminal tabs in the middle.
Save this file into your home directory (usually `/home/yourusername/.screenrc`) and start screen by running the command `screen`. Your window should look similar to this:

[screenshot: screen01]

tab navigation

Press `ctrl-a c` to open a new tab.
Press `ctrl-a shift-a` to rename the current tab: change the name, and press `return` to save.
Press `ctrl-a ctrl-a` to switch between your two most recent tabs, or use `ctrl-a [num]` to switch to the tab numbered `[num]`, e.g. `ctrl-a 1` to switch to tab 1, `ctrl-a 2` for tab 2, etc.

detach and re-attach

I’ve pointed out that I believe the most useful feature of screen is the ability to detach and re-attach to screen sessions in the event of being disconnected from the server – so how do you go about actually doing that?
When you have connected to the server, to create a new screen session you type `screen` – but to re-attach to an existing disconnected screen session, type `screen -R`.
Sometimes, if you have disconnected very recently, the old screen session might still be attached to your old connection! In order to tell screen you want to resume an existing screen session, and forcefully disconnect it from any connected session, use `screen -dR`. This does mean that if you have superuser privileges on the system to which you are connecting, anyone else who can assume control of your account can also take control of your screen session!

To detach your current screen session, press `ctrl-a d`.

locking your screen session

While in a screen session, press `ctrl-a x` to lock your session. This protects your open terminal sessions from being taken over by someone who might have access to your account. This won’t protect you from much, but it does add an extra layer of security that can help to delay or prevent security breaches.

This becomes a more useful feature when you realise that as a superuser, you can have multiple terminals open in screen, each one connected to a different server, each one potentially logged in as a more privileged user than the original screen session itself – so if a hacker manages to acquire the user’s username and password, they would be able to log in and resume all of these already logged in sessions with little more than a single command.

… so Lock Your Terminal!

scrolling history

When in cursor mode, you can search for patterns, and highlight and copy text too.

  • Press `ctrl-a Esc` to enter interactive cursor mode
  • Use the cursor keys (`up`, `down`, `left`, `right`, `PgUp`, `PgDn`) to navigate back in the history of the current screen terminal
  • Press `Esc` at any time to exit cursor mode and return to normal interactive mode
  • Press `Return` to start highlighting text at the position of the cursor
  • Use the cursor keys to select desired text
  • Press `Return` again to copy the selected text into the screen paste buffer – This will also exit cursor mode and return to interactive mode
  • To paste the text you’ve just copied, press `ctrl-a ]` when you’re in a suitable location. You can use this technique to copy and paste chunks of text or commands between console windows in the same screen session.

When in cursor mode, you can also search forwards and backwards using `/` and `?` respectively, just like in `vim` – to search “up” the screen from the cursor location, enter `?`, type your search string, and press `return`.
To find the next or the previous piece of text that matches your entered search, press `/` or `?` again and just press `return`.

further help

To access the help menu in screen, press `ctrl-a ?` and you will be presented with a list of further commands you can try out, which aren’t described quite as concisely as they are on this page, but they are a useful cheat-sheet, once you learn how to read the syntax.

[screenshot: screen03]

bonus content

You might have noticed that my terminal prompt has also been customised – the code for this is added to the end of the file found in `/home/[yourusername]/.bashrc` – add the following code to the end of your `.bashrc` file:

# ~/.bashrc
PS1="$(if [[ ${EUID} == 0 ]]; then echo '\[\033[01;31m\]\h'; else echo '\[\033[01;32m\]\u@\h'; fi)\[\033[01;34m\] \w \$([[ \$? != 0 ]] && echo \"\[\033[01;31m\]:(\[\033[01;34m\] \")\\$\[\033[00m\] "

why passwords are secret

    Passwords are obviously used everywhere these days, but why is it so important that nobody else knows your passwords?

    The simple reason is obvious – you don’t want other people to be able to access your stuff when you don’t want them to.

    The more complex reason is the legal one. Businesses, websites, Internet Service Providers, Internet Cafés, nearly everyone in fact, has a document or set of documents called an ‘Acceptable Use Policy’. In this document, they specify (or at least they should specify) that any password or key they provide you with is not to be shared with anyone else, under any circumstances.

    The reason for this is not just to protect you and your stuff – this is to protect whoever is providing you with access to it. Because if they didn’t explicitly state that sharing these credentials with someone else is against the rules, then if your account becomes linked with some kind of activity that IS against the rules and they want to come after you, they might then have no way of proving you did it – because anyone you have given your password to could have access to it.

    Let me put this another way.

    1. Alice is given a password to access example.com
    2. Alice gives her password to Bob, to upload some files.
    3. Bob uploads some illegally downloaded MP3s to example.com using Alice’s password
    4. example.com finds the illegal music collection, and wants to prosecute Alice
    5. Alice then tells example.com that she didn’t put the files there
    6. Bob never agreed to the usage agreement on example.com, because he just logged in using Alice’s password
    7. example.com is then stuck because the usage agreement hasn’t been broken – they forgot to put in a clause about sharing passwords

    Here’s another example.

    1. Charlie owns example.com
    2. Derek works for example.com
    3. Charlie creates an email account for Derek, and hands him his password
    4. Derek is not allowed to change his password, because Charlie wants to keep a copy of it, for “security reasons”
    5. Derek sends an email to Edwin using his example.com email account
    6. Edwin takes offence at Derek’s email, and decides to sue example.com
    7. Charlie tries to fire Derek for sending the offensive email to Edwin
    8. Charlie cannot prove that the email was sent by Derek – as both Charlie and Derek have access to Derek’s password
    9. Charlie therefore has no reasonable grounds to fire or take other action against Derek due to the email.
    10. If Charlie fires Derek anyway, Derek may try to sue Charlie for unfair dismissal.

    If you work for a company that keeps a copy of your password written down somewhere, or stores your password unencrypted in a database, or stores it in any way that could enable anyone to read it – then it’s a reasonable assumption that someone else could have your password, without you having given it to them.

    Passwords Are Secrets – Nobody should ever know any of your passwords. They are secret and should not be shared. No exceptions.

    Providing access to another individual, either deliberately or through failure to secure its access, is prohibited.

    SANS.org AUP

    User ID’s and passwords are not to be shared. Those who use another person’s user credentials and those who share such credentials with others will be in breach of this policy.
    Initial default passwords issued to any user must be changed immediately following notification of account set up.

    University of Bath AUP

    Each user is issued with a valid username and password that must be kept confidential and must not be shared with anyone else.

    University of Salford AUP

    You are responsible for properly using any user IDs, personal identification numbers (PINs) and passwords needed for the service, if any, and must take all necessary steps to make sure that you keep these confidential and secure, use them properly and do not make these available to unauthorised people.

    BT Terms and Conditions

    To protect your Google Account, keep your password confidential. You are responsible for the activity that happens on or through your Google Account.

    Google Terms of Service

    I did come across several policies that do not specifically mention passwords or access credentials – not all of them need to, as they can protect themselves with other related clauses, but a password clause like those above is a simple addition to any policy that adds a lot of protection with very little effort.



Challenge (part two): Web Scraper

The Task:

Write a PHP application that accepts a URL. Download the page the URL references. The page contents should then be broken into two parts. The first part determines all the different kinds of HTML tags on the page and the frequency counts for each. The second part determines all the different words that aren’t part of the HTML on the page and the frequency counts for each. The results from the two parts should be stored in a database.

There are a number of reasons why someone would want to do this. Part of this challenge is to create re-usable code, but the main aim is to use best practice code and style to achieve the task, in an efficient, understandable and coherent manner.

In part one of this challenge, we built our simple ORM classes for storing well structured entities into a database, and here we are going to build on that, to store the results of our page scraping into a set of tables.

Database Design

Firstly let’s do some database design. We need to store URLs, counts of tags on the page, and counts of words on the page that aren’t part of the HTML markup.

Let’s assume that we are writing this for a small application, and will only be scraping up to a few hundred sites. This allows us to make some assumptions about database capacity and performance considerations, like choice of database type and column sizes. We will also begin with the assumption that this scraper will only scrape basic HTML pages – any largely dynamic pages (through JavaScript or Flash) will not be processed very well, as they tend to offer less fixed HTML up front, with the focus on the browser enriching the page by making subsequent page requests and modifying the page after the initial load.

USE scraperdb;

CREATE TABLE `TPage` (
    id SERIAL,
    title VARCHAR(255) NOT NULL COMMENT 'The title of the page we scraped',
    url VARCHAR(4096) NOT NULL COMMENT 'The URL we scraped',
    `when` DATETIME NOT NULL COMMENT 'When we scraped the page',
    success TINYINT(1) NOT NULL COMMENT 'Whether the attempt to scrape this page worked'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TType` (
    id SERIAL,
    name VARCHAR(64) NOT NULL UNIQUE COMMENT 'The type of value we saw on the page, eg. Tag, Content, etc.'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `TCount` (
    id SERIAL,
    TPageId BIGINT UNSIGNED NOT NULL COMMENT 'The page we scraped when we saw this value',
    TTypeId BIGINT UNSIGNED NOT NULL COMMENT 'The type of value we saw',
    value VARCHAR(64) NOT NULL COMMENT 'The value we saw on the page',
    `count` BIGINT UNSIGNED NOT NULL COMMENT 'The number of times we saw the value',
    CONSTRAINT `c_TCount__page_type_value`
        UNIQUE (TPageId, TTypeId, value),
    CONSTRAINT `c_TCount__TPageId`
        FOREIGN KEY (`TPageId`)
        REFERENCES `TPage` (`id`)
        ON DELETE CASCADE,
    CONSTRAINT `c_TCount__TTypeId`
        FOREIGN KEY (`TTypeId`)
        REFERENCES `TType` (`id`)
        ON DELETE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

So we’ve created a database that will allow us to store URLs, Tags and Content values, and the number of times we have seen each. We’ve put in some referential integrity constraints to prevent us from doing silly things by accident, such as trying to enter two different counts for one Tag/Scrape instance.
We could go one step further and create a separate table for the values we scrape, thus achieving third normal form within our data structure – but at this point it would be overkill and premature optimisation of our system.
Also, I have not added any extra column indexes to the tables, as we don’t yet have an idea of how the data will be used – we can add these once we have an established working prototype and want to optimise how we are using the results.

System Design

So let’s now design our system. Here is some pseudo-code to establish what we’re going to do.

  1. enter URL to connect to
  2. verify that we want to allow the given URL to be connected to
  3. connect to the URL to check if robots are allowed, abort if not
  4. connect to the URL and download the content in full
  5. run an XML parser to analyse and pull apart our downloaded HTML
  6. save the scraper page details to the database
  7. analyse the tags, save the counts to the database
  8. analyse the tag content, save the counts to the database

Verification

The URL that has been passed into our system may not be in a form that we wish to accept – we may wish to prevent users from using IP addresses, or from connecting with a URL that embeds a username and password.
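
To make those checks a little more concrete, here’s a minimal sketch of how `parse_url` and `filter_var` can be used to spot the two cases mentioned above. The function name `urlProblems` and its messages are made up for this illustration; the checks used by the scraper class itself may differ.

```php
<?php
/**
 * Returns a list of reasons why a URL would be rejected (empty if acceptable).
 * Hypothetical helper, for illustration only.
 */
function urlProblems(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return ['not a complete absolute URL'];
    }

    $problems = [];
    // filter_var returns the value itself when it is a valid IP, false otherwise
    if (filter_var($parts['host'], FILTER_VALIDATE_IP) !== false) {
        $problems[] = 'host is an IP address';
    }
    // parse_url exposes embedded credentials as 'user' and 'pass'
    if (isset($parts['user']) || isset($parts['pass'])) {
        $problems[] = 'URL embeds credentials';
    }
    return $problems;
}

var_dump(urlProblems('http://example.com/page'));
var_dump(urlProblems('http://192.0.2.10/'));
var_dump(urlProblems('http://bob:secret@example.com/'));
```

Returning a list of problems rather than throwing on the first one is just a choice for the demo; it makes it easy to see every reason a URL was refused.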

Robots.txt

You’ll have noticed that I’ve included a check for ‘robots’ – This is an internet standard that has been around for many years – you can read up more about it on the robotstxt.org site. I’ve chosen to not scrape sites that have objected to being automatically scraped, using this method. You will see the code below contains checks for this.

XML Parser

We’re going to use the built-in `DOMDocument` parser to parse our HTML. This library seems to be the most appropriate one for parsing HTML, as there is a lot of bad HTML in the wild, and this library is fairly fault-tolerant and easy to use. HTML is not always XML compliant: some XML parsers will fail to parse HTML because of trivial shortcuts that programmers take when writing HTML, like not closing tags properly, or using attributes that have no value, e.g. `<script src="…" async defer>`. Neither shortcut is XML compliant.
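
As a rough illustration of that fault tolerance, the sketch below feeds `DOMDocument` some deliberately sloppy HTML (an unclosed `<p>` and a valueless `async` attribute) and counts the element names it finds. The sample markup and the `app.js` file name are invented for the demo, and `libxml_use_internal_errors(true)` is my own addition to keep parse warnings out of the output.

```php
<?php
// Deliberately sloppy HTML: an unclosed <p>, and a valueless "async" attribute
$html = '<html><body><p>Hello <b>world</b><script src="app.js" async></script></body></html>';

// Collect libxml's parse warnings internally instead of emitting PHP warnings
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// '*' matches every element in the document, whatever its tag name
$tagCounts = [];
foreach ($doc->getElementsByTagName('*') as $element) {
    $name = $element->nodeName;
    $tagCounts[$name] = ($tagCounts[$name] ?? 0) + 1;
}

print_r($tagCounts);
```

Using `getElementsByTagName('*')` avoids walking the tree by hand, which is fine for a quick experiment; the scraper class uses its own recursive walk.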

ORM Entities

Our system has 3 entity classes, `TPage`, `TType` and `TCount`. These classes will extend the abstract class `AbstractEntity` and will be read from and written to the database using a class named `EntityHandler` – this will take care of the heavy lifting and database interactions.
Let’s see what they look like…

/**
 * All entities in the system that are storable in the database must extend the AbstractEntity class.
 * @package StampyCodeScraper
 */
abstract class AbstractEntity
{
    /** @var int The ID of the object instance, generated by the DB */
    public $id;
}
class TPage extends AbstractEntity
{
    /** @var string The page Title for the URL scraped */
    public $title;

    /** @var string The URL scraped */
    public $url;

    /** @var DateTime The date/time that the scrape was performed, or attempted */
    public $when;

    /** @var bool Whether the page scrape was successful */
    public $success;

    /** @var TCount[] Collection of the counts recorded for this page */
    public $TCountList;
}
class TType extends AbstractEntity
{
    /** @var string The name of the Type */
    public $name;
}
class TCount extends AbstractEntity
{
    /** @var TPage The scraped page that this count belongs to */
    public $TPage;

    /** @var TType The type of element this count refers to */
    public $TType;

    /** @var string The value of the element this count belongs to */
    public $value;

    /** @var int The number of elements of the given type that were found */
    public $count;
}

The Code

Here’s the class definition for our scraper. It contains all the features we’ve described above, commented and ready to be used.

/**
 * Class Scraper
 *
 * @package Scraper
 */
class Scraper
{
    /** @var string */
    private $url;

    /** @var string */
    private $rawContent;

    /** @var int[][] */
    private $results = [];

    /**
     * @param string $url
     */
    public function __construct($url)
    {
        $this->url = $url;
    }

    /**
     */
    public function scrape()
    {
        $this->checkUrlSanity($this->url);
        $this->checkRobotPermission($this->url);
        $this->rawContent = $this->getHttpContent($this->url);
        $this->getTagSummary($this->rawContent);
    }

    /**
     * @return int[][]
     */
    public function getResults()
    {
        return $this->results;
    }

    /**
     * @param string $url
     * @throws Exception
     */
    private function checkUrlSanity($url)
    {
        $urlParts = parse_url($url);
        if($urlParts['scheme'] !== 'http') {
            throw new Exception("Scraper only accepts HTTP URLs");
        }
        if($urlParts['host'] === 'localhost') {
            throw new Exception("Scraper will not scrape the local machine");
        }
        if(filter_var($urlParts['host'], FILTER_VALIDATE_IP)) {
            throw new Exception("URLs must be host-based, not IP based");
        }
        if(!strpos($urlParts['host'], '.')) {
            // try to prevent the hostname from being a locally resolvable one
            throw new Exception("Host names must contain at least a TLD and a gTLD");
        }
        if(isset($urlParts['pass'])) {
            throw new Exception("Scraper will not accept URLs with username/password parameters");
        }
    }

    /**
     * @see http://www.robotstxt.org/
     * @param string $url
     * @throws Exception if the given URL is not scrapable by robots.
     */
    private function checkRobotPermission($url)
    {
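        $urlParts = parse_url($url);
        // only add the port to the robots.txt URL when one was given
        $port = isset($urlParts['port']) ? ':' . $urlParts['port'] : '';
        $path = isset($urlParts['path']) ? $urlParts['path'] : '/';
        $url = $urlParts['scheme'] . '://' . $urlParts['host'] . $port . '/robots.txt';

        $robotsTxt = $this->getHttpContent($url);
        if(!$robotsTxt) {
            //robots.txt file not found, so we can proceed :)
            return;
        }
        if($robotsTxt[0] === '<') {
            // looks like an HTML page rather than a robots.txt file, so ignore it
            return;
        }

        // NOTE: the published listing was truncated between the check above and the
        // rule-length comparison below. This loop is a reconstruction, assuming the
        // longest matching Allow/Disallow rule in the "User-agent: *" group wins.
        $allowed = true;
        $ruleLength = 0;
        $lastMatchingRule = '';
        $applies = false;
        foreach(preg_split('/\R/', $robotsTxt) as $line) {
            $line = trim(preg_replace('/#.*$/', '', $line));
            if(strpos($line, ':') === false) {
                continue;
            }
            list($field, $value) = array_map('trim', explode(':', $line, 2));
            $field = strtolower($field);
            if($field === 'user-agent') {
                $applies = ($value === '*');
                continue;
            }
            $valueLength = strlen($value);
            if(!$applies || !in_array($field, ['allow', 'disallow'], true) || $valueLength === 0) {
                continue;
            }
            if(strpos($path, $value) !== 0) {
                // this rule does not cover the path we want to scrape
                continue;
            }
            if($ruleLength >= $valueLength) {
                continue;
            }
            $ruleLength = $valueLength;
            $allowed = ($field === 'allow');
            $lastMatchingRule = $value;
        }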
        if(!$allowed) {
            throw new Exception("Robots are not allowed to access path '$lastMatchingRule'");
        }
    }

    /**
     * Retrieve the web content of the URL provided
     *
     * @param string $url
     * @return string
     */
    private function getHttpContent($url)
    {
        $context = stream_context_create(['http' => ['header'=>"Connection: close\r\n"]]);
        return file_get_contents($url, false, $context);
    }

    /**
     * Summarises the given HTML content and stores a count of all seen tag names and non-HTML words used
     *
     * @param string $content
     * @throws Exception if the parser fails
     */
    private function getTagSummary($content)
    {
        $tags = [];
        $words = [];
        $excludedTags = [];
        $excludedWords = [null,''];

        $charHandler = function($data) use (&$words) {
            $data = preg_replace('|[^a-zA-Z0-9_-]|', ' ', $data);
            $data = explode(' ', $data);
            $words = array_merge($words, $data);
        };
        //todo: exclude text content within Script and Style tags

        $doc = new DOMDocument();
        $doc->loadHTML($content);
        $this->processDomNodeList($doc->documentElement->childNodes, $tags);

        $charHandler($doc->textContent);

        $tags = array_diff($tags, $excludedTags);
        $words = array_diff($words, $excludedWords);
        $tags = array_count_values($tags);
        $words = array_count_values($words);
        arsort($tags);
        arsort($words);
        $this->results = [
            'Tag' => $tags,
            'Word' => $words
        ];
    }

    /**
     * Iterates over the given DomNodeList object, identifies the tag name and stores it to the given array
     *
     * @param DomNodeList $list
     * @param string[] $tagList
     */
    private function processDomNodeList(DomNodeList $list, &$tagList)
    {
        for($i=0; $i < $list->length; ++$i) {
            $item = $list->item($i);
            if($item instanceof DOMElement) {
                // only element nodes are tags; skip #text, #comment, etc.
                $tagList[] = $item->nodeName;
            }
            if($item->childNodes instanceof DomNodeList) {
                $this->processDomNodeList($item->childNodes, $tagList);
            }
        }
    }
}
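
The word-counting step inside `getTagSummary` can be tried on its own. Here’s a minimal sketch, assuming the plain text has already been extracted from the page. `countWords` is a hypothetical helper name, not part of the class above, and it uses `preg_split` with `PREG_SPLIT_NO_EMPTY` instead of `explode` plus `array_diff` to drop the empty strings.

```php
<?php
/**
 * Counts how often each word appears in a piece of plain text.
 * Hypothetical helper, for illustration only.
 *
 * @param string $text
 * @return int[] word => count, most frequent first
 */
function countWords(string $text): array
{
    // Replace anything that is not a letter, digit, underscore or hyphen with a space
    $clean = preg_replace('/[^a-zA-Z0-9_-]/', ' ', $text);

    // Split on runs of spaces; PREG_SPLIT_NO_EMPTY drops empty strings
    $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // Tally the words, then sort by count (descending) while keeping the keys
    $counts = array_count_values($words);
    arsort($counts);
    return $counts;
}

$counts = countWords("the cat sat on the mat, the end");
print_r($counts);
```

Note that the counting is case-sensitive, just like the class above, so `The` and `the` would be stored as two separate values.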

Saving the Results

This is a basic working prototype for a web-page scraper. It accepts a URL, processes the given page, and counts the number of Tag types and Words used in the page. In order to store these results, we create instances of our `AbstractEntity` classes, and save them to the database using our `EntityHandler` class.

try {
    //establish our database connection, to pass into the EntityHandler class
    $dbConn = new MysqliDbConnection();
    $dbConn->setParameters(
        [
            'user' => 'scraperdbuser',
            'pass' => '',
            'host' => 'localhost',
            'dbname' => 'scraperdb',
            'port' => null
        ]
    );
    $dbConn->connect();
    $entityHandler = new EntityHandler($dbConn);

    // Create a TPage object to associate and collate our results with
    $testPage = new TPage();
    $testPage->title = 'Foo';
    $testPage->url = 'http://stampy.me';
    $testPage->when = new DateTime();
    $testPage->success = true;

    $scraper = new Scraper($testPage->url);

    try {
        $scraper->scrape();
    } catch (Exception $e) {
        echo "Scrape Failed - ".$e->getMessage() .' - '. $e->getTraceAsString();
        $testPage->success = false;
    }

    foreach($scraper->getResults() as $tagType => $result) {
        //get the tag type class, if it exists
        $typeObj = $entityHandler->get('TType', ['name' => $tagType]);
        if(!$typeObj) {
            // otherwise, create it!
            $typeObj = new TType();
            $typeObj->name = $tagType;
        }
        foreach($result as $value => $count) {
            //create a new TCount object for each count result
            $tag = new TCount();
            $tag->count = $count;
            $tag->TPage = $testPage;
            $tag->TType = $typeObj;
            $tag->value = $value;
            $testPage->TCountList[] = $tag;
        }
    }

    $entityHandler->set($testPage);

    print_r($scraper->getResults());

} catch(Exception $e) {
    echo $e->getMessage() . "\n\n" . $e->getTraceAsString();
}

Alternatives

There’s a world of options out there for page scraping – it is, after all, how search engines operate. They connect to a website, pull useful information from it (including links to other pages or websites), and then connect to those as well. This tutorial was written in a couple of days by a single developer, and the simplicity of the classes reflects that.

Enjoy 🙂