Volumes vs. Host Mounts in Docker
Whether you’re training machine learning models, analyzing large datasets, or running reproducible pipelines, containers (via Docker or Podman) offer a powerful way to package your tools and environments. But when it comes to handling data – datasets, results, model checkpoints – you need persistent storage, because anything written to a container’s own filesystem is lost when the container is removed.
You have two main options:
- Volumes – storage that Docker creates and manages for you.
- Host directory mounts (bind mounts) – a directory on your local file system mapped into the container.
Here’s how each works, and when you might use them, with examples tailored to data science workflows.
🧪 Option 1: Volumes — Clean and Container-Native
Volumes are designed specifically for containers. They keep your data separate from your code and can be reused across container runs.
🔍 Example: Storing model outputs using a Docker volume
# Create a named volume
docker volume create model_output
# Run a container with the volume mounted
docker run --rm \
  -v model_output:/app/output \
  python:3.10 \
  bash -c "echo 'Training complete!' > /app/output/log.txt"
You can inspect the volume’s contents later:
# Access volume contents in a temporary container
docker run --rm -v model_output:/data alpine cat /data/log.txt
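Beyond reading a file back, the Docker CLI can list, inspect, and remove volumes. The following is a minimal sketch built around the model_output volume from the example above; the helper names volume_mountpoint and remove_volume are hypothetical wrappers added here for illustration.

```shell
# Show the host path where Docker keeps a volume's data.
# On Docker Desktop this path is inside the Linux VM, not on the host itself.
# $1 = volume name
volume_mountpoint() {
  docker volume inspect --format '{{ .Mountpoint }}' "$1"
}

# Delete a volume and everything stored in it.
# Docker refuses if a container still references the volume.
# $1 = volume name
remove_volume() {
  docker volume rm "$1"
}

# Usage, with the volume created earlier:
#   docker volume ls                 # list all volumes
#   volume_mountpoint model_output   # print where its data lives
#   remove_volume model_output       # clean up when the outputs are no longer needed
```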
Use volumes when:
- You want clean separation between code and data.
- You’re deploying or sharing a containerized analysis pipeline.
- You care about reproducibility and minimizing host pollution.
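When a volume needs to leave the machine, for example to hand results to a collaborator, one common pattern is to archive its contents through a throwaway container. This is a sketch under assumptions: the function name backup_volume and the archive name in the usage line are hypothetical, and the archive is written to the current directory.

```shell
# Pack the contents of a named volume into a gzipped tarball.
# $1 = volume name
# $2 = archive file name, created in the current directory
backup_volume() {
  docker run --rm \
    -v "$1":/data:ro \
    -v "$(pwd)":/backup \
    alpine tar czf "/backup/$2" -C /data .
}

# Usage:
#   backup_volume model_output model_output.tar.gz
```

The volume is mounted read-only (:ro) so the archive step cannot alter the results it is copying.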
💻 Option 2: Host Directory Mounts — Fast and Accessible
Mounting a host directory lets you work directly with files on your machine, which is ideal for development or quick iteration.
🔍 Example: Using your local dataset in a containerized notebook
Assume you have a dataset in ~/projects/data.
docker run -it --rm \
  -p 8888:8888 \
  -v ~/projects/data:/home/jovyan/data \
  jupyter/scipy-notebook
You can now access your dataset inside Jupyter at /home/jovyan/data, and any changes you make on your local machine appear in the container immediately.
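If the dataset should not be modified from inside the container, the mount can be made read-only. Here is a minimal sketch using Docker's longer --mount syntax with the same container path as above; the function name run_notebook_readonly is hypothetical.

```shell
# Start the notebook with the dataset mounted read-only.
# $1 = host directory holding the dataset; it must already exist,
#      because --mount does not create a missing source directory
run_notebook_readonly() {
  docker run -it --rm \
    -p 8888:8888 \
    --mount "type=bind,source=$1,target=/home/jovyan/data,readonly" \
    jupyter/scipy-notebook
}

# Usage:
#   run_notebook_readonly "$HOME/projects/data"
```

Writes to /home/jovyan/data from the notebook then fail, while edits made on the host remain visible in the container.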
Use host mounts when:
- You’re prototyping or working with live data.
- You want real-time access to scripts, notebooks, or results.
- You prefer editing and exploring with local tools (e.g., VS Code, pandas profiling).
🏁 Final Thoughts
For academic and data science projects, a hybrid approach often works best:
- Use host mounts for flexibility during development.
- Switch to volumes for long-running jobs and reproducibility.
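The two mount types can also be combined in a single run. This sketch is one possible arrangement rather than a prescribed setup: inputs come from a read-only host directory and results go to a named volume. The function name run_job, the container paths, and the placeholder command are hypothetical.

```shell
# Run a job that reads from a host directory and writes to a volume.
# $1 = host directory with input data (mounted read-only)
# $2 = named volume that receives the results
run_job() {
  docker run --rm \
    -v "$1":/app/input:ro \
    -v "$2":/app/output \
    python:3.10 \
    bash -c "ls /app/input > /app/output/inputs.txt"
}

# Usage:
#   run_job "$HOME/projects/data" model_output
```

The placeholder command only records the input file names; a real job would replace it with the training or analysis script.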
By mastering both methods, you can create robust, portable, and efficient data workflows that adapt to your research and computational needs.