When to not require

Supply chain attacks are all the rage.

  1. Compromise a popular upstream repository
  2. New version of the repo is published containing your malware
  3. Downstream users update their code & dependencies
  4. Compromise spreads through normal auto-update routines
  5. Profit?

There are plenty of ways to reduce the risk of this kind of attack, but there’s only one way to completely remove it.

One simple solution is that you don’t use any unnecessary third party packages, and you do what you can with what you already have, usually with a little extra effort (or code).

This is one of the foundational principles I’ve observed throughout my software engineering career: when you only need a small piece of functionality from a large library, don’t install the library, just write the functionality yourself. You learn more, you reduce your risk exposure, and you reduce your reliance on third parties to do things properly.

This process might be a bit harder in the short run, because you’re having to actually write code to do something, instead of installing a library that someone else already wrote, but you really should see this as a golden learning opportunity.

You’re not re-inventing the wheel; you’re learning how to make a wheel that perfectly suits your needs, and you then own that wheel, it’s yours, forever.

Here’s an example. Recently I was trying to implement in Python the ability to call docker compose up -d in my code, so that the code itself can make changes to a docker-compose.yml file and implement the changes procedurally.

I’ve been using the docker-py library for some time and find it really useful. Its main advantage is that it only needs access to the docker socket, so I don’t need any of the docker binaries locally accessible to my code. That’s very handy because my code runs inside a Python-only docker container: all I have to do is mount the docker socket into the container, and my code has all the access it needs.

Here’s some code which is the equivalent of calling ‘docker image prune’, which deletes dangling images (untagged images that no container references):

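from typing import Any, Dict

import docker

# Connect using the environment's settings (the local docker socket by default)
client = docker.from_env()
# The result holds an 'ImagesDeleted' list and a 'SpaceReclaimed' byte count
output: Dict[str, Any] = client.images.prune(filters={'dangling': True})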

So the functionality I wanted to implement is the equivalent of docker compose up -d, but since that functionality is part of ‘docker compose’ and not native to docker itself, I can’t easily use the docker-py library without re-writing loads of code just to do what docker-compose itself already does with docker.

There’s an easy option: I just mount or install the docker and docker-compose binaries from the underlying host system into the docker container. I’ve done this before and it works well, and it’s a fairly common practice for controlling docker from within a container running inside docker. But it has security issues, and I would rather just use the docker socket, because it ‘smells a lot better’ for a machine to use a socket intended for machine use than to call a binary executable via shell, which is intended for human use.

(I’m acutely aware there are security issues around having a docker container managing the host docker eco-system, but that topic is outside the scope of this discussion.)

The alternative package suggested by Gemini is python-on-whales, which allows programmatic access to docker compose commands. It has 700+ stars on GitHub, so it seems like a moderately popular library.

First of all, the containers I write are often deployed into resource-constrained environments, so I have to be mindful about how much space, bandwidth, CPU and memory these containers consume, and I always pay attention to those factors. When I included this library in my code, the docker layer grew by around 12MB. That’s a pretty big uplift compared to typical packages, and there are only a few possible reasons for it:

  1. The python-on-whales library has a huge amount of functionality, 99% of which I will not be using, and which I can’t get rid of without forking the library and doing nasty things that eliminate most of the reasons to use a third party library in the first place.
  2. The library is bloated, and pulls in a bunch of third party libraries which aren’t already in my system, because I build very lean and compact apps.
  3. A mixture of both, which is most likely.
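
One rough way to tell these reasons apart is to ask Python how much disk space each installed distribution actually occupies. The sketch below uses only the standard library; the function name installed_size_bytes is my own invention for illustration, and the number it returns is only an approximation of layer growth (it ignores pip caches, generated bytecode and anything the installer doesn’t record).

```python
from importlib import metadata
from pathlib import Path


def installed_size_bytes(dist_name: str) -> int:
    """Sum the on-disk size of the files an installed distribution records.

    Raises importlib.metadata.PackageNotFoundError if dist_name isn't installed.
    """
    # files() returns None when the distribution has no file listing at all
    recorded = metadata.files(dist_name) or []
    total = 0
    for entry in recorded:
        path = Path(entry.locate())
        # Some recorded files (e.g. bytecode caches) may not exist on disk
        if path.is_file():
            total += path.stat().st_size
    return total
```

Calling installed_size_bytes("python-on-whales") before and after the install, and again for each name reported by metadata.requires("python-on-whales"), should give a fair idea of whether the weight comes from the library itself or from what it drags in.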

12MB wouldn’t be a problem for most people, but when you consider that all I’m trying to do is run docker compose up -d in my host docker environment, and I can achieve that with 2 lines of code in Python and a couple of extra mount points into my docker-compose.yml file, the 12MB starts to feel a bit unjustified.
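
For illustration, the “couple of lines of Python” approach could look something like the sketch below. It assumes the docker binary and the compose plugin are reachable on the PATH inside the container (hence the extra mount points), and the function names and the compose file path are hypothetical, not taken from my actual code.

```python
import subprocess
from typing import List


def compose_up_command(compose_file: str) -> List[str]:
    """Build the argument list for `docker compose -f <file> up -d`."""
    # An argument list (rather than a shell string) avoids shell interpolation
    return ["docker", "compose", "-f", compose_file, "up", "-d"]


def compose_up(compose_file: str) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(compose_up_command(compose_file), check=True)
```

After rewriting the compose file, a single call such as compose_up("/app/docker-compose.yml") applies the changes. Splitting out the command builder is just a convenience so the argument list can be checked without docker being present.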

It was only after I discovered the 12MB layer addition that I also found out that the python-on-whales library doesn’t work unless it has access to the docker and docker-compose binary executables, so it’s not even reducing complexity on that side either. It doesn’t even use the docker socket!

Not only do I have an additional 12MB in my stack, it doesn’t even work without me making further changes to my container, or the compose mounting config. In the end, python-on-whales seems to be little more than a binary wrapper, calling instructions the same way an end user would, and parsing the output into something object-oriented.
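
To make the “binary wrapper” idea concrete, here is a minimal sketch of the general pattern: shell out to the CLI, then parse what it prints into objects. This is not python-on-whales’ actual code. The ServiceStatus class and both function names are hypothetical, and the "Name" and "State" keys are an assumption about what docker compose ps --format json prints, which can differ between Compose versions (some emit a JSON array, others one JSON object per line).

```python
import json
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceStatus:
    name: str
    state: str


def parse_ps_output(raw: str) -> List[ServiceStatus]:
    """Turn JSON text (an array, or one object per line) into objects."""
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        records = json.loads(raw)
    else:
        records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    # Assumed keys: "Name" and "State"; missing keys fall back to ""
    return [ServiceStatus(r.get("Name", ""), r.get("State", "")) for r in records]


def compose_ps(compose_file: str) -> List[ServiceStatus]:
    """Call the docker binary exactly as a user would, then parse stdout."""
    result = subprocess.run(
        ["docker", "compose", "-f", compose_file, "ps", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_output(result.stdout)
```

The point is that nothing here talks to the docker socket directly: without the binaries inside the container, compose_ps simply fails.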

Now back to security.

By including the python-on-whales library in my code (version v0.81.0 at the time of writing), we have to spend some of our time doing due diligence on that package, to check that its maintainers are following good practices with regard to supply-chain security, and that it’s being kept up to date.

This is a process that 99% of developers have no interest in, or time to spend on. We already have huge amounts of stuff we “should” be doing in this area; we should be aiming to shrink that pile, not add to it and keep ignoring it, or chuck the quality/security assurance problem over the fence to the next chump.

If I have the choice of

a) investigating the third party library for vulnerabilities, researching its CI/CD strategies and contributors, bloating my deployment by 12MB, and repeating that process at regular intervals to comply with patch-release strategies,
or
b) writing a few extra lines of code, and being a bit more mindful about my dependencies,

I know which one I’ll go for, every single time.
