A deep dive into the official Docker image for Python (pythonspeed.com)
189 points by itamarst on Aug 20, 2020 | 61 comments


> Debian names all their releases after characters from Toy Story.

I can’t believe I’m just now learning this, but that’s good to know next time someone asks why the names seem random.


And sid (unstable) is the neighbor who breaks the toys, which is why the name never changes :)


The story I heard is that it started with “sid” being an acronym for “still in development”, but once that caught on, it blazed the trail for the rest of the Toy Story characters.


I heard it was because when Bruce Perens was Debian Project Leader, he was working for Pixar around the time of the Toy Story release, and chose the naming scheme. The descriptive name for unstable has always been unstable, not "still in development".


Looking back, Sid was an artist being smothered by mundane suburbia, while Andy was a milquetoast who couldn't look after his stuff.




It's cute but really frustrating. I wish they'd just use the version number or force everyone to use

    "<version #>-<stupidname>".
but I'm a FreeBSD user so what do I know?


I wish their /etc/os-release said:

  BUILD_ID=rolling
but I'm an Arch user so what do I know? :)


To be fair, you can just point to the `stable` repositories in the `/etc/apt/sources.list` configuration file and simply refer to your install as "Debian Stable" ('s/stable/testing/g').

However, every few years there might be a not-so-"stable" update around a new release.
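e.g. a sources.list that tracks the moving "stable" pointer rather than a codename (default mirror shown; swap in "testing" if you prefer):

  deb http://deb.debian.org/debian stable main
  deb http://deb.debian.org/debian stable-updates main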


It's not for me, it's for when I get bug reports and I need to decipher what flavour of Linux, what release of the kernel, what libc, etc. "I'm running decrepit duck" doesn't really narrow it down much, especially when multiple distributions use these jovial monikers.


I think Ubuntu strikes a good balance: goofy, lovable names, but a clear alphabetical sequence. Robot Operating System (ROS) inherited this approach with its naming as well: https://wiki.ros.org/Distributions#List_of_Distributions


Why do we need the names at all though?

Every now and again I have to translate between a code name and a version string, or between a version string and a code name, and I think 'why on earth am I having to do this lookup work as a human?'


I feel like there are scenarios where it's convenient to have a plain text string that is certain not to gum up parts of the system that would choke on punctuation or a leading digit (see for example this package-mapping yaml: https://github.com/ros/rosdistro/blob/master/rosdep/base.yam...). Of course, Fedora 19 is codenamed `schrödinger` including the unicode character, so that's been an interesting experiment in ensuring that all the systems which ingest and process the codenames can handle that.

Anyway, the obvious thing is that a codename is more fun and memorable than a number; it's something to hang marketing and such off of. Presumably this is why macOS versions have had public codenames since 2002. But I think the practical reasons are valid as well.


I can understand that frustration if you are used to the BSDs.

Windows 10 and macOS keep it pretty simple too.


Ubuntu is the simplest: 20.04 means it was released in April 2020. Windows 2004 means it was finalized in April 2020 but released some time later (May? June? Who knows). Plus, they both try to release every 6 months, and in Ubuntu's case you know that even-year April releases are LTS, so it's always predictable what the latest version is.

macOS 10.15 means what? And the next release is going to be 11.0. Someone who doesn't follow or use Macs has no way to figure out whether the latest version is 10.15, 10.14, or 11.0...


Also, for some reason many apps refer to the macOS 11 beta as "macOS 10.16", which furthers the confusion.


That “some reason” is compatibility. macOS has been on 10.x for 20 years now, and there’s a LOT of software that breaks if the reported version number is 11.0. IIRC early betas inconsistently reported either 10.16 or 11.0 depending on where you got the version number from, but now the version number exposed to an app depends on the SDK version the app was compiled against: old apps see 10.16 and new apps see 11.0. See https://eclecticlight.co/2020/07/21/big-sur-is-both-10-16-an...


I think even the Deb + Ian naming is still a fairly little-known piece of trivia.


Holy cow...! I guess I never noticed because I started using Debian around version 7, and the majority of the iconic Toy Story names (Woody, Buzz, Slink, Potato, etc.) are all pre-7.


Same exact thing for me. I rue the day they run out of names :(.


Plenty more characters in the sequels.


Yeah - wow.


> The packages—gcc and so on—needed to compile Python are removed once they are no longer needed.

Is there a reason to prefer this method, where installation, usage, and removal all happen in one RUN, vs. using a multi-stage build? I tend to prefer the latter but am not aware of tradeoffs beyond the readability of the Dockerfile.


The reason is to minimize the number of layers, and in particular to avoid layers whose contents aren't used in the final image (if installation and removal were separate steps, one layer would include gcc and a later one would remove it, but you would still need to download the layer with gcc).


I thought non-final stages in a multi-stage build are left out of the final artifact; is that incorrect?

I'm thinking the "build step" would be done in an earlier stage; it could be the exact same RUN statement, or it could be split into multiple for readability, and wouldn't bother removing any installed packages, since they won't carry over to the final stage. Then the big RUN in the final stage would be replaced with something like `COPY --from=builder ...`.


EDIT: Provide example below

If you are doing multi-stage builds, it only matters to combine as many statements as possible in the final stage.

I agree that for clarity it is nice not to optimize layers in the build stage - those will be thrown away anyway.

I vastly prefer multi-stage builds over having to chain install and cleanup statements.

Example: I usually want to use the python:3-slim image, but this doesn't have the tools to compile certain python libraries with C extensions. Generally I will use the python:3 image for my build stage to do my "pip install -r requirements.txt" and then copy the libraries over to my final stage based on the python:3-slim image

Of course I could install and uninstall GCC and other tools in a single stage, but that actually takes longer and is messier in my opinion.
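Something like this, roughly (a sketch, untested; it uses pip install --user so the whole package tree under /root/.local can be copied across in one go):

  FROM python:3 AS builder
  COPY requirements.txt .
  RUN pip install --user -r requirements.txt
  # Final image: no compilers or headers, just the installed packages
  FROM python:3-slim
  COPY --from=builder /root/.local /root/.local
  ENV PATH=/root/.local/bin:$PATH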


> Example: I usually want to use the python:3-slim image, but this doesn't have the tools to compile certain python libraries with C extensions. Generally I will use the python:3 image for my build stage to do my "pip install -r requirements.txt" and then copy the libraries over to my final stage based on the python:3-slim image

Example on how to do that, please.


For image size, it doesn't matter. It does matter for build time.

Specifically, multi-stage builds let you get better caching and therefore faster rebuilds, since you can cache the pre-installed gcc etc. layer, while still getting the small image.

So if you have a human being waiting on frequent Docker build results, yes, multi-stage is better.

In this case, the builds are automated, no one waits for them, so it doesn't really matter (except for burning some extra CPU cycles).


That makes sense, thank you.

Is there no downside to multi-stage? Even aside from caching behavior I prefer multi-stage builds, as I'd much rather read & maintain a bunch of RUN lines which do one specific thing, rather than dozens joined with &&.


There isn't too much of a downside, except that you need to be a little more careful in your CI, otherwise you can end up rebuilding from scratch each time, thus losing the benefit of faster builds.

See here for why and how to fix it: https://pythonspeed.com/articles/faster-multi-stage-builds/
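The gist, roughly (a sketch; image and stage names here are made up, and it assumes the Dockerfile names a stage with "FROM ... AS build-stage"):

  docker pull myrepo/app:build-stage || true
  docker pull myrepo/app:latest || true
  docker build --target build-stage --cache-from=myrepo/app:build-stage -t myrepo/app:build-stage .
  docker build --cache-from=myrepo/app:build-stage --cache-from=myrepo/app:latest -t myrepo/app:latest .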


An interesting aspect of this that isn't touched on in the article is the much-maligned [1] scheme where Debian's Python is patched to look for packages in `dist-packages` instead of `site-packages`. The scenario in the Docker image is, I believe, exactly the one this is intended to provide sanity for, since you have two Pythons in play: the Debian-packaged Python and the built-from-source Python. Things get hairy once you start using the pip from either of these Pythons to install PyPI packages, especially if those packages or their dependencies have compiled parts which may not work for the other Python. Anyone who has used a Mac with brewed Python has likely experienced this pain too, especially in the days before wheels, when you basically had to use brewed Python for scientific work unless you wanted to compile big packages like numpy from source.

So the deal is that the system-supplied Python can get deps from apt packages, or from pip (into ~/.local, or as root into /usr/local), but either way will install them into a dist-packages directory. This keeps them separate from the packages which the built-from-source Python installs into site-packages.

Upstream Python has put a bunch of pieces in place to address this natively, for example with the "magic" tags that keep compiled assets separate in a directory of packages being potentially used by multiple different interpreters (see https://www.python.org/dev/peps/pep-3147/#proposal). And obviously the ideal solution where possible is to simply use a virtualenv and be totally isolated from the system python. But there are situations where that isn't possible or desirable, such as in this docker container, and so here we are.

1: eg https://github.com/pypa/setuptools/issues/2232
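To make the split visible, here's a hypothetical session inside the image (assuming some apt package has pulled in Debian's python3 alongside the image's /usr/local one):

  # the image's built-from-source Python
  /usr/local/bin/python3 -c "import site; print(site.getsitepackages())"
  # -> paths ending in .../site-packages
  # Debian's packaged Python
  /usr/bin/python3 -c "import site; print(site.getsitepackages())"
  # -> paths ending in .../dist-packages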


> […] simply use a virtualenv and be totally isolated from the system python. But there are situations where that isn't possible or desirable, such as in this docker container, and so here we are.

Why would it not be possible or desirable to use a venv in the Docker container? The tools are available and work fine (`venv` is built into modern Python 3, and unlike Debian they don’t break it). A venv will give you separation from anything unusual on the system, and let you safely install your specifically pinned requirements.txt, with zero downsides.
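e.g. a minimal sketch (paths are arbitrary; putting the venv's bin/ first on PATH "activates" it for every later layer):

  FROM python:3.8-slim
  RUN python -m venv /opt/venv
  ENV PATH="/opt/venv/bin:$PATH"
  COPY requirements.txt .
  RUN pip install -r requirements.txt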


One thing I've learned is that rather than using a docker-entrypoint.sh, most Linux software can be run just with `--user 1000:1000` or whatever UID/GID you want to use, as long as you map a volume that can use those permissions. It is a lot cleaner this way.
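e.g. (image name and paths are placeholders):

  docker run --rm --user 1000:1000 -v "$PWD/data:/data" myimage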


> Why Tini?

> Using Tini has several benefits:

> - It protects you from software that accidentally creates zombie processes, which can (over time!) starve your entire system for PIDs (and make it unusable).

> - It ensures that the default signal handlers work for the software you run in your Docker image. For example, with Tini, SIGTERM properly terminates your process even if you didn't explicitly install a signal handler for it.

> - It does so completely transparently! Docker images that work without Tini will work with Tini without any changes.

[...]

> NOTE: If you are using Docker 1.13 or greater, Tini is included in Docker itself. This includes all versions of Docker CE. To enable Tini, just pass the `--init` flag to docker run.

https://github.com/krallin/tini#why-tini
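i.e. on any recent Docker you can skip baking tini into the image and just pass the flag at run time (image name below is a placeholder):

  docker run --init --rm myimage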


I didn't know it was included by default. I'll check it out, thanks!


After a few years' break from professional s/w dev, I'm really trying to get into the docker-based dev mindset.

You can use docker (or, less efficiently, a full-blown VM) for any purpose, but the killer app that has emerged appears to be devops. I guess the same goes for k8s, as well as Chef/Puppet/Salt/Ansible/whathaveyou.

However, I'm noticing there is a new way of doing s/w dev emerging, let's call it modern software development, which utilizes some of these tools to maximize s/w dev productivity. I'm just not clear what is the best way to approach this.

The core issue is obviously dependency hell, and install-reinstall-reconfigure hell.

I guess a useful way to think about it is: what if I want to do web dev, Android dev, and iOS dev, but I don't want these dev environments to interfere with each other, and all of them should be accessible on a single workstation or powerful laptop?

I guess I could have docker for web-dev, docker for android-dev, and so on. I came across docker compose, and then I heard it's known to be cumbersome for dev-environments, and someone created binci to address those problems (though it's not a well-known tool).


The solution for this is to do your development directly in the cloud. Projects like Skaffold, Tilt, and VS Code in the browser are a step in this direction. Developer machines would become just thin clients and GUIs for these remote machines.


So you mean fire up an instance at AWS/Azure/GCP/etc that acts as an android-dev vm?

So dev environment is on the cloud? and all the data (working dir, data files, pdfs, whatever is involved during the dev phase) is also on the cloud storage?

That would be expensive, especially if you're an indie dev, and a tremendous waste of local compute/storage capabilities.

Again, I'm really not sure what near-future looks like in this space.


Indie devs are a small minority; the majority of folks work in companies. You are already paying $2,000-4,000 for your laptops, plus you also have to factor in the lost productivity from not being able to replicate your stack locally. Eventbrite is already doing this for all their employees: https://kelda.io/blog/eventbrite-interview/ So the economics of this does check out.


Larger companies can do this with virtual machines in their own datacenters, which cuts down a lot on the cost. Especially for developer machines, which have relatively high requirements on memory.


is it really a deep dive if it doesn't go into the base image at all?

edit: ok reading the article further, there are handwavy explanations. Don't call it a deep dive if you're gonna say

"There’s a lot in there, but the basic outcome is:..."

and then not explain anything beyond that.


Why doesn't it just use the Debian python package?


Debian Stable currently ships Python 3.7. If that's what you want, great.

But if you want Python 3.8, the soon-to-be-released Python 3.9, or other versions, you don't have that.


> Debian Stable currently ships Python 3.7.

That's surprisingly new, even. I remember Debian being very far behind when it came to Python 3.


But isn't that only because it was released just last year? Won't it stay stuck on Python 3.7 until bullseye is released in approx 2 years time? It'll likely be very dated by then.


The next Debian stable should be released in less than a year; it's been more than a year since Buster (the current stable) was released. But yes, the point stands that 3.7 will be all you get built in for the next several months regardless of how many Python releases occur.


Thanks for correcting, you are totally right. No idea where I got the idea their releases were roughly every 3 years; it's clearly every 2 years.


I can think of a couple of possibilities: 1. Better control of the specific Python version. There are Debian builds for Python 3.5, 3.6, etc., plus other more specific minor versions. 2. Better isolation from the system Python.


I would love to read this article on a phone but the top banner is not only highly confusing but also blocking a lot of the content ... please fix this


> A common mistake for people using this base image is to install Python again, by using Debian’s version of Python

Is that really a common mistake?


Whoa boy, is it ever, but maybe not for the reason you're thinking, i.e. it isn't caused by people typing `apt-get install python`.

There are many packages that have Python as a dependency these days. For example, on my Ubuntu system:

  ~$ apt-cache rdepends python|wc -l
  4649

I think the best illustration of how this can happen is installing postgres libraries needed to build the psycopg2 PG client. If you know to install `libpq-dev` then you're great. But if you do something that on the surface feels totally reasonable, like installing the `postgresql-client` package... guess what? You just installed another Python interpreter.
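In Dockerfile terms the trap looks something like this (a sketch; the dependency pull-in is as described above):

  # Fine: just the headers/libs needed to build psycopg2
  RUN apt-get update && apt-get install -y --no-install-recommends libpq-dev
  # Feels reasonable, but drags in Debian's Python as a dependency
  RUN apt-get update && apt-get install -y postgresql-client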

edit: formatting


16,328 results just on GitHub, though it fluctuates on refresh, and some are commented out: https://github.com/search?q=%22from+python%3A%22+apt-get+ins...

(And the search query could likely be cleaned up further to filter out more irrelevant results.)

Seems like "yes" is the answer, though.


Beyond the other two answers, I added this bit because I keep seeing people make this mistake in their StackOverflow questions about why their Dockerfile isn't working.


There are Alpine [1] and Debian [2] miniconda images (within which you can `conda install python==3.8` and 2.7 and 3.4 in different conda envs)

[1] https://github.com/ContinuumIO/docker-images/blob/master/min...

[2] https://github.com/ContinuumIO/docker-images/blob/master/min...
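For instance (a rough sketch; env names are arbitrary):

  conda create -y -n py38 python=3.8
  conda create -y -n py27 python=2.7
  conda run -n py38 python --version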

If you build manylinux wheels with auditwheel [3], they should install without needing compilation for {CentOS, Debian, Ubuntu, and Alpine}; though standard Alpine images have MUSL instead of glibc by default, this [4] may work:

  echo "manylinux1_compatible = True" > $PYTHON_PATH/_manylinux.py

[3] https://github.com/pypa/auditwheel

[4] https://github.com/docker-library/docs/issues/904#issuecomme...

The miniforge docker images aren't yet [5][6] multi-arch, which means it's not as easy to take advantage of all of the ARM64 / aarch64 packages that conda-forge builds now.

[5] https://github.com/conda-forge/docker-images/issues/102#issu...

[6] https://github.com/conda-forge/miniforge/issues/20

There are i686 and x86-64 docker containers for building manylinux wheels that work with many distros: https://github.com/pypa/manylinux/tree/master/docker

A multi-stage Dockerfile build can produce a wheel in the first stage and install that wheel (with `COPY --from=0`) in a later stage; leaving build dependencies out of the production environment for security and performance: https://docs.docker.com/develop/develop-images/multistage-bu...
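A rough sketch of that pattern (package name and paths are hypothetical):

  FROM python:3.8 AS build
  COPY . /src
  RUN pip wheel --wheel-dir=/wheels /src
  # install the wheel (and its wheel-ified deps) without any build tools
  FROM python:3.8-slim
  COPY --from=0 /wheels /wheels
  RUN pip install --no-index --find-links=/wheels mypkg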


Interesting! I use miniconda extensively for local development to manage virtual environments for different python versions and love it. I hardly ever actually use the conda packages though.

I assume the main benefit of using these images would be if you are installing from conda repos instead of pip? Otherwise just using the official python images would be as good if not better

Edit: I guess if you needed multiple python versions in a single container this would be a good solution for that as well


Use cases for conda or conda+pip:

- Already-compiled packages (where there may not be binary wheels) instead of requiring reinstallation and subsequent removal of e.g. build-essential for every install

- Support for R, Julia, NodeJS, Qt, ROS, CUDA, MKL, etc.

- Here's what the Kaggle docker-python Dockerfile installs with conda and with pip: https://github.com/Kaggle/docker-python/blob/master/Dockerfi...

- Build matrix in one container with conda envs

Disadvantages of the official python images as compared with conda+pip:

- Necessary to (re)install build dependencies and a compiler for every build (if there's not a bdist or a wheel for the given architecture) and then uninstall all unnecessary transitive dependencies. This is where a [multi-stage] build of a manylinux wheel may be the best approach.

- No LSM (AppArmor, SELinux, etc.) for one or more processes in the container (which may have read access to /etc or environment variables and/or --privileged)

- Necessary to build basically everything on non x86[-64] architectures for every container build

Disadvantages of conda / conda+pip:

- Different package repo infrastructure to mirror

- Users complaining that they don't need conda who then proceed to re-download and re-build wheels locally multiple times a day

Additional attributes for comparison:

- The new pip solver (which is slower than the traditional iterative non-solver), conda, and mamba

- repo2docker (and thus BinderHub) can build an up-to-date container from requirements.txt, environment.yml, install.R, postBuild, and any of the other dependency specification formats supported by REES (Reproducible Environment Execution Standard); this may be helpful since Docker Hub images will soon be deleted if they're not retrieved at least once every 6 months (retrieval that could be automated with a GitHub Actions cron task)


Quite a few conda packages have patches added by the conda team to help fix problems in packages relying on native code or binaries, particularly on Windows. If something is available on the primary conda repos it will almost assuredly work with few if any problems cross-platform, whereas pip is hit or miss.

If you’re always on Linux you may never appreciate it but some pip packages are a nightmare to get working properly on Windows.

If you look through the source of the conda repos, you’ll see all kinds of small patches to fix weird and breaking edge cases, particularly in libs with significant C back ends.


Here's the meta.yaml for the conda-forge/python-feedstock: https://github.com/conda-forge/python-feedstock/blob/master/...

It includes patches just like distro packages often do.


Why would the image remove the *.pyc files?


It keeps the image size down and they’ll be recreated on first load for what you actually use. If you use the compileall module you can generate them for your app in the final layer.
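e.g. (a sketch; /app stands in for wherever your code ends up):

  # trade a slightly larger image for a faster first start
  RUN python -m compileall /app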



