Volumes vs. Host Mounts in Docker

Whether you’re training machine learning models, analyzing large datasets, or running reproducible pipelines, containers (via Docker or Podman) offer a powerful way to package your tools and environments. But when it comes to handling data – datasets, results, model checkpoints – you need persistent storage, because anything written to a container’s own filesystem is lost when the container is removed.

You have two main options:

  1. Volumes – container-native storage created and managed by the container engine.
  2. Host Directory Mounts (bind mounts) – connect a directory on your local file system to the container.

Here’s how each one works and when to use it, with examples tailored to data science workflows.
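
At the command line, the difference comes down to what goes on the left-hand side of the -v flag: a volume name or a host path. A minimal sketch (my_volume and the paths are just placeholders):

# A name on the left: Docker creates/uses a named volume that it manages itself
docker run --rm -v my_volume:/data python:3.10 ls /data

# A path on the left: Docker bind-mounts that host directory into the container
docker run --rm -v ~/projects/data:/data python:3.10 ls /data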


🧪 Option 1: Volumes — Clean and Container-Native

Volumes are created and managed by the container engine itself. They live outside any individual container, keep your data separate from your code, and can be reused across container runs.

🔍 Example: Storing model outputs using a Docker volume

# Create a named volume
docker volume create model_output

# Run a container with the volume mounted
docker run -it --rm \
  -v model_output:/app/output \
  python:3.10 \
  bash -c "echo 'Training complete!' > /app/output/log.txt"

You can inspect the volume’s contents later:

# Access volume contents in a temporary container
docker run --rm -v model_output:/data alpine cat /data/log.txt
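
Docker also provides standard housekeeping subcommands for named volumes:

# List all named volumes on this machine
docker volume ls

# Show metadata for the volume, including where Docker stores it on the host
docker volume inspect model_output

# Remove the volume once you no longer need the outputs
docker volume rm model_output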

Use volumes when:

  • You want clean separation between code and data.
  • You’re deploying or sharing a containerized analysis pipeline (a sketch for exporting results follows this list).
  • You care about reproducibility and minimizing host pollution.
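
To hand results from a volume to a collaborator, a common pattern is to archive its contents through a throwaway container. A minimal sketch, assuming the model_output volume from the example above:

# Pack the volume's contents into a tarball in the current directory
docker run --rm \
  -v model_output:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/model_output.tar.gz -C /data .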

💻 Option 2: Host Directory Mounts — Fast and Accessible

Mounting a host directory (a bind mount) lets the container work directly with files on your machine, which makes it perfect for development and quick iteration.

🔍 Example: Using your local dataset in a containerized notebook

Assume you have a dataset in ~/projects/data.

docker run -it --rm \
  -p 8888:8888 \
  -v ~/projects/data:/home/jovyan/data \
  jupyter/scipy-notebook

You can now access the dataset inside Jupyter at /home/jovyan/data, and any change made on your local machine shows up in the container instantly (and vice versa).
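
If the notebook should only read the dataset and never modify it, you can append :ro to make the bind mount read-only (standard Docker mount syntax):

# Same setup, but the dataset is mounted read-only; writes from inside the container will fail
docker run -it --rm \
  -p 8888:8888 \
  -v ~/projects/data:/home/jovyan/data:ro \
  jupyter/scipy-notebook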

Use host mounts when:

  • You’re prototyping or working with live data.
  • You want real-time access to scripts, notebooks, or results.
  • You prefer editing with local tools (e.g., VS Code, pandas profiling).

🏁 Final Thoughts

For academic and data science projects, a hybrid approach often works best:

  • Use host mounts for flexibility during development.
  • Switch to volumes for long-running jobs and reproducibility; the sketch below shows one way to move data from a host directory into a volume.
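
Moving from one to the other is straightforward: you can seed a named volume from a host directory by mounting both into a short-lived container and copying across. A minimal sketch, where project_data is a placeholder volume name and the dataset path comes from the earlier example:

# Create the destination volume
docker volume create project_data

# Copy the host dataset into the volume via a temporary Alpine container
docker run --rm \
  -v ~/projects/data:/src:ro \
  -v project_data:/dest \
  alpine cp -a /src/. /dest/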

By mastering both methods, you can create robust, portable, and efficient data workflows that adapt to your research and computational needs.