Why Mixture Models?

In cognitive modeling, assuming that all observed data originate from a single generative process is often an oversimplification – and in many cases, implausible. A common example is the presence of contaminant trials, where a participant responds using a cognitive process different from the one the model aims to capture. Such trials can distort inferences about the cognitive process of interest.

A powerful way to address this challenge is through latent-mixture models¹. The key idea is to extend the basic generative model to incorporate an additional component, for example, one that accounts for the contaminant process. In this framework, each trial is assumed to be generated by one of two (or more) underlying processes: the primary cognitive process of interest or the contaminant process. Since the specific origin of each trial is unknown, the model adopts a mixture structure, probabilistically assigning trials to different latent sources.

By explicitly modeling multiple data-generating processes, latent-mixture models improve robustness and provide more accurate inferences.


Implementing Mixture Models: JAGS vs. Stan

Mixture models are powerful, but their implementation can be tricky—and the approach varies significantly depending on the probabilistic programming language you choose.

Two of the most popular options are JAGS (Just Another Gibbs Sampler) and Stan. While both can handle mixture models, they take fundamentally different approaches:

  • JAGS relies on Gibbs sampling with data augmentation, introducing a discrete indicator variable $z$ to assign each observation to a mixture component.
  • Stan uses Hamiltonian Monte Carlo (HMC), which does not support discrete parameters in sampling. Instead, it requires marginalizing out the indicator variable $z$, leading to a different implementation strategy.

In this post, we’ll compare how mixture models are implemented in JAGS and Stan using a simple Gaussian Mixture Model (GMM).

Buckle up! 🚀


Toy Example: Gaussian Mixture Model (GMM)

A two-component Gaussian mixture model assumes that each data point ($y_i$) is drawn from one of two normal distributions, indicated by a latent variable $z_i$ that assigns it to a component:

$$ y_i \sim \mathcal{N}(\mu_1, \sigma_1), \quad \text{if } z_i = 1 $$

$$ y_i \sim \mathcal{N}(\mu_2, \sigma_2), \quad \text{if } z_i = 0 $$

The probability of a data point belonging to the first component is given by $\lambda$, so

$$ z_i \sim \text{Bernoulli}(\lambda) $$

Our goal is to estimate $\lambda$, $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ from observed data.
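
To make the comparison concrete, it helps to simulate data with known parameters. Below is a minimal sketch in Python using NumPy; the sample size and the "true" parameter values are arbitrary choices for illustration, not estimates from any real data set.

import numpy as np

rng = np.random.default_rng(42)

# Arbitrary "true" parameters, chosen only for illustration
N = 500
lam, mu1, mu2, sigma1, sigma2 = 0.6, -2.0, 2.0, 1.0, 1.0

# z_i = 1 -> component 1 (probability lam), z_i = 0 -> component 2
z = rng.binomial(1, lam, size=N)
y = np.where(z == 1,
             rng.normal(mu1, sigma1, size=N),
             rng.normal(mu2, sigma2, size=N))

Recovering these values with both samplers is a useful sanity check for the implementations below.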


JAGS Implementation: Explicit Latent Variables

In JAGS, we explicitly introduce the latent variable $z$, which assigns each observation to a component. The model is as follows:

model {
  for (i in 1:N) {
    z[i] ~ dbern(lambda)  # Latent class assignment
    z1[i] <- 2 - z[i]  # Map z = 1 to component 1, z = 0 to component 2
    y[i] ~ dnorm(mu[z1[i]], tau[z1[i]])  # dnorm takes mean and precision
  }

  # Priors
  lambda ~ dbeta(2, 2)  # Prior for mixing proportion
  for (j in 1:2) {
    mu[j] ~ dnorm(0, 0.01)  # Weakly informative prior
    sigma[j] ~ dunif(0, 5)
    tau[j] <- 1 / pow(sigma[j], 2)  # Convert standard deviation to precision
  }
}

✅ Advantages: Conceptually simple; the indicator variable is modeled directly, so the mixture structure is explicit and the per-trial assignments $z_i$ can be monitored.
⚠️ Challenges: Discrete sampling can be inefficient, and mixing may be slow.
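
For reference, here is one way to run the JAGS model above from Python with the pyjags package. Treat it as a sketch: it assumes JAGS and pyjags are installed, reuses the simulated y from earlier, and uses a placeholder file name (gmm_model.txt) and placeholder sampler settings. R front ends such as rjags or runjags work just as well.

import pyjags  # assumes JAGS and the pyjags package are installed

# gmm_model.txt is a placeholder file containing the JAGS model above
with open("gmm_model.txt") as f:
    jags_code = f.read()

model = pyjags.Model(code=jags_code,
                     data={"y": y, "N": len(y)},  # y from the simulation sketch
                     chains=4)
samples = model.sample(2000, vars=["lambda", "mu", "sigma", "z"])

# Each entry of `samples` is a NumPy array of posterior draws
print(samples["lambda"].mean())  # posterior mean of the mixing proportion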


Stan Implementation: Marginalizing Out $z$

Unlike JAGS, Stan cannot sample discrete latent variables: HMC explores a continuous parameter space using gradients, so the indicator $z$ must be marginalized out. Instead of sampling $z$, we sum the likelihood of each observation over both possible assignments:

$$ \log p(y_i) = \log \left( \lambda \cdot \mathcal{N}(y_i \mid \mu_1, \sigma_1) + (1 - \lambda) \cdot \mathcal{N}(y_i \mid \mu_2, \sigma_2) \right) $$
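
Summing the two terms on the probability scale can underflow when both component densities are tiny, so the sum is evaluated on the log scale using the identity

$$ \log\left(e^{a} + e^{b}\right) = m + \log\left(e^{a - m} + e^{b - m}\right), \quad m = \max(a, b), $$

with $a = \log \lambda + \log \mathcal{N}(y_i \mid \mu_1, \sigma_1)$ and $b = \log(1 - \lambda) + \log \mathcal{N}(y_i \mid \mu_2, \sigma_2)$.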

For numerical stability, Stan provides this operation as the built-in log_sum_exp function, which we use directly in the model block:

data {
  int<lower=0> N;       // number of observations
  array[N] real y;      // observed data
}

parameters {
  real mu1;               
  real mu2;               
  real<lower=0> sigma1;    
  real<lower=0> sigma2;    
  real<lower=0, upper=1> lambda;
}

model {

  // Priors
  mu1 ~ normal(0, 5);
  mu2 ~ normal(0, 5);
  sigma1 ~ normal(0, 2);
  sigma2 ~ normal(0, 2);
  lambda ~ beta(2, 2);
  
  // Marginal likelihood: sum over both latent assignments for each observation
  for (n in 1:N) {
    target += log_sum_exp(
      log(lambda) + normal_lpdf(y[n] | mu1, sigma1),
      log1m(lambda) + normal_lpdf(y[n] | mu2, sigma2)  // log1m(x) = log(1 - x)
    );
  }
}

✅ Advantages: More efficient sampling and typically better mixing than Gibbs.
⚠️ Challenges: Requires a more careful formulation, and because $z$ is never sampled, component assignments are less direct to interpret.
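
As with JAGS, here is a minimal sketch of fitting the Stan program from Python with cmdstanpy, assuming the model is saved in a placeholder file gmm.stan and reusing the simulated y from earlier; rstan or cmdstanr in R are equally valid front ends.

from cmdstanpy import CmdStanModel  # assumes CmdStan and cmdstanpy are installed

# gmm.stan is a placeholder file containing the Stan program above
model = CmdStanModel(stan_file="gmm.stan")

fit = model.sample(data={"N": len(y), "y": y.tolist()},  # y from the simulation sketch
                   chains=4, iter_sampling=1000)

# Posterior summaries (includes lambda, mu1, mu2, sigma1, sigma2)
print(fit.summary())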


Final Takeaway

🚀 Stan’s marginalization approach is generally more efficient, but JAGS provides a more intuitive representation when the discrete assignments themselves are of interest.


  1. Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. Cambridge University Press. ↩︎