Volumes vs. Host Mounts in Docker

Whether you’re training machine learning models, analyzing large datasets, or running reproducible pipelines, containers (via Docker or Podman) offer a powerful way to package your tools and environments. But when it comes to handling data – datasets, results, model checkpoints – you need persistent storage, because anything written to a container’s own filesystem is lost when the container is removed.

You have two main options:

  1. Volumes – container-native storage created and managed by the container engine.
  2. Host Directory Mounts (bind mounts) – connect a directory on your local file system to the container.

Here’s how each one works and when to use it, with examples tailored to data science workflows.
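
At the command line, the difference comes down to what goes on the left-hand side of the -v flag: a volume name or a host path. A minimal sketch (my_volume and the paths are just placeholders):

# A name on the left: Docker creates/uses a named volume that it manages itself
docker run --rm -v my_volume:/data python:3.10 ls /data

# A path on the left: Docker bind-mounts that host directory into the container
docker run --rm -v ~/projects/data:/data python:3.10 ls /data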


🧪 Option 1: Volumes — Clean and Container-Native

Volumes are created and managed by the container engine itself. They live outside any individual container, keep your data separate from your code, and can be reused across container runs.

🔍 Example: Storing model outputs using a Docker volume

# Create a named volume
docker volume create model_output

# Run a container with the volume mounted
docker run -it --rm \
  -v model_output:/app/output \
  python:3.10 \
  bash -c "echo 'Training complete!' > /app/output/log.txt"

You can inspect the volume’s contents later:

# Access volume contents in a temporary container
docker run --rm -v model_output:/data alpine cat /data/log.txt
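
Docker also provides standard housekeeping subcommands for named volumes:

# List all named volumes on this machine
docker volume ls

# Show metadata for the volume, including where Docker stores it on the host
docker volume inspect model_output

# Remove the volume once you no longer need the outputs
docker volume rm model_output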

Use volumes when:

  • You want clean separation between code and data.
  • You’re deploying or sharing a containerized analysis pipeline (a sketch for exporting results follows this list).
  • You care about reproducibility and minimizing host pollution.
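
To hand results from a volume to a collaborator, a common pattern is to archive its contents through a throwaway container. A minimal sketch, assuming the model_output volume from the example above:

# Pack the volume's contents into a tarball in the current directory
docker run --rm \
  -v model_output:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/model_output.tar.gz -C /data .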

💻 Option 2: Host Directory Mounts — Fast and Accessible

Mounting a host directory (a bind mount) lets the container work directly with files on your machine, which makes it perfect for development and quick iteration.

🔍 Example: Using your local dataset in a containerized notebook

Assume you have a dataset in ~/projects/data.

docker run -it --rm \
  -p 8888:8888 \
  -v ~/projects/data:/home/jovyan/data \
  jupyter/scipy-notebook

You can now access the dataset inside Jupyter at /home/jovyan/data, and any change made on your local machine shows up in the container instantly (and vice versa).
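
If the notebook should only read the dataset and never modify it, you can append :ro to make the bind mount read-only (standard Docker mount syntax):

# Same setup, but the dataset is mounted read-only; writes from inside the container will fail
docker run -it --rm \
  -p 8888:8888 \
  -v ~/projects/data:/home/jovyan/data:ro \
  jupyter/scipy-notebook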

Use host mounts when:

  • You’re prototyping or working with live data.
  • You want real-time access to scripts, notebooks, or results.
  • You prefer editing with local tools (e.g., VS Code, pandas profiling).

🏁 Final Thoughts

For academic and data science projects, a hybrid approach often works best:

  • Use host mounts for flexibility during development.
  • Switch to volumes for long-running jobs and reproducibility; the sketch below shows one way to move data from a host directory into a volume.
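
Moving from one to the other is straightforward: you can seed a named volume from a host directory by mounting both into a short-lived container and copying across. A minimal sketch, where project_data is a placeholder volume name and the dataset path comes from the earlier example:

# Create the destination volume
docker volume create project_data

# Copy the host dataset into the volume via a temporary Alpine container
docker run --rm \
  -v ~/projects/data:/src:ro \
  -v project_data:/dest \
  alpine cp -a /src/. /dest/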

By mastering both methods, you can create robust, portable, and efficient data workflows that adapt to your research and computational needs.