Intuitive Bayesian Modeling with PyMC3

at: Data Umbrella

by: Oriol Abril Pla

Onion-like talk: hopefully it will be (partially) useful not only to people interested in PyMC3, but also to people interested in probabilistic programming or Bayesian stats in general. The talk has two main parts: the first half is explanatory, whereas in the second we'll go over a couple of examples. I will start with a very general intro on key concepts of Bayesian statistics focused on PPLs, then focus on PPLs and why they are becoming so relevant in so many fields, and finish the first part with PyMC3.

Bayesian paradigm

Data is considered fixed once observed

Model parameters are treated as random

  • Data are considered fixed. They used to be random, but once they were written into your lab notebook/spreadsheet/IPython notebook they do not change.
  • Model parameters themselves may not be random, but Bayesians use probability distributions to describe their uncertainty in parameter values, so the parameters are treated as random. In many cases, it is useful to consider parameters as having been sampled from probability distributions.

Data is considered fixed once observed

Model parameters are treated as random

\[p(\theta \mid y)\]

All conclusions from Bayesian statistical procedures are stated in terms of probability statements

\[p(\theta \mid y)\]

This formulation is generally referred to as inverse probability, because it infers from observations to parameters, or from effects to causes. More often than not, these are the questions we care about. We want the probability of the parameters in the model conditioned on the observed data, not the probability of the observations given some model or parameter.
\[ p(\theta | y) = \frac{p(y|\theta)p(\theta)}{p(y)} \]
The way to perform this "backwards" movement is by using Bayes' theorem: we combine the a priori information about the parameters with the probability of the observations given some parameters.
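
As a toy sketch of what the formula does in practice (the numbers are made up: say 6 heads in 9 coin flips, with a flat prior on the heads probability), the posterior can be approximated on a grid:

import numpy as np

theta = np.linspace(0, 1, 201)              # grid of parameter values
prior = np.ones_like(theta)                 # p(theta): flat prior
likelihood = theta**6 * (1 - theta)**3      # p(y | theta) for 6 heads and 3 tails
posterior = prior * likelihood              # numerator of Bayes' theorem
posterior /= np.trapz(posterior, theta)     # divide by p(y) so it integrates to 1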

Prior information

  • Physical constraints
  • Model constraints
  • Prior studies
  • ...
The prior distribution encodes any and all information that is known about the model parameters before even gathering our data, which may not be much but is hardly ever *none*, so why not take advantage of it? Things like:

  • Physical constraints: if one parameter is the density of a rock, it can't be larger than the density of iron.
  • Model constraints: variables representing rates or variances can't be negative.
  • We can also use the results from previous studies as priors.
  • There is also work being done on encoding fairness constraints in priors.

And in case of doubt, one can always use weakly informative priors and perform sensitivity analysis.
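
As a hedged sketch of how such constraints can translate into PyMC3 priors (all parameter names and numbers below are hypothetical):

import pymc3 as pm

with pm.Model():
    # physical constraint: a rock's density can't exceed that of iron (~7.9 g/cm3)
    density = pm.Uniform("density", lower=0.0, upper=7.9)

    # model constraint: rates and variances can't be negative
    rate = pm.HalfNormal("rate", sigma=1.0)

    # prior study: center the prior on a previously published estimate
    effect = pm.Normal("effect", mu=0.3, sigma=0.1)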

Uncertainty

As we were saying, statements are made in terms of probability distributions, which is much more informative than using point estimates or summary statistics. A very quick example: these 3 distributions all have the same median, 0, so the probability of the parameter being larger than 0 is 50% in each case. Yet, if we only looked at their means and standard deviations, we'd probably say that the parameter with the right-skewed distribution has the largest probability of being positive.
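
A quick sketch of how such a probability statement is computed from posterior draws (the draws below are made up, right-skewed with median 0):

import numpy as np

rng = np.random.default_rng(0)
draws = np.exp(rng.normal(size=4000)) - 1.0   # made-up right-skewed draws with median 0

prob_positive = (draws > 0).mean()            # estimate of P(theta > 0), about 0.5 here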

Probabilistic Programming

Building blocks

Probabilistic programming languages define base probability distributions, transformations, autoregressive models, Gaussian processes... giving users building blocks with which they can write their own models, virtually any model, much like one combines the commands and functions a programming language provides to write custom programs.

Generative modeling

By defining the models with these probability distribution building blocks, we can directly calculate probabilities of the parameters and of the data given some parameters, but we can do more than that: we can also sample from these distributions. We start from the parameters and from there generate/simulate synthetic data. This is called forward sampling, and it is a very powerful technique because it allows you to compare the synthetic data with your observations very easily, both in terms of compute and developer time.
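
In PyMC3 this forward sampling is available out of the box through pm.sample_prior_predictive; a minimal sketch with a hypothetical model:

import pymc3 as pm

with pm.Model() as model:
    rate = pm.Exponential("rate", 1.0)
    counts = pm.Poisson("counts", rate, shape=100)   # no observed=: purely generative

    # forward sampling: draw parameters from the priors, then simulate data from them
    prior_pred = pm.sample_prior_predictive(samples=500)

prior_pred["counts"].shape   # (500, 100): simulated datasets to compare with observations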

Automagical solvers

Inference algorithms

MCMC: HMC+NUTS, Metropolis, Gibbs...

Variational Inference: ADVI (+mini batch ADVI), OPVI

And last but not least, probabilistic programming libraries provide multiple inference algorithms to approximate the posterior automatically; you only need to define the model. This means that you can use the latest cutting-edge inference algorithms without having to write them yourself from scratch. You should probably consider using the inference capabilities of a PPL even if you are not building the model with it.
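
A minimal sketch showing both families on a toy model with made-up data (pm.sample for NUTS, pm.fit for ADVI):

import numpy as np
import pymc3 as pm

y = np.random.default_rng(0).normal(loc=1.0, size=50)   # made-up observations

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y", mu, sigma, observed=y)

    # MCMC: NUTS is assigned automatically to the continuous parameters
    trace = pm.sample(1000, tune=1000)

    # Variational inference: fit ADVI, then draw from the fitted approximation
    approx = pm.fit(n=10000, method="advi")
    vi_trace = approx.sample(1000)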

PyMC3

So what characterizes PyMC3? What makes it an amazing probabilistic programming library?

Intuitive

PyMC3 syntax is concise and clear. It generally follows the mathematical notation quite closely.

Python-driven

PyMC3 is a Python library built on top of Aesara (the daughter of Theano), a Python-driven tensor library. This means that you don't need to learn an extra language to do probabilistic programming and can integrate PyMC3 directly within your current Python code.

Extensible

Thanks to being Python-driven, PyMC3 can also be easily extended from within Python itself. You can subclass PyMC3 distributions to create new ones, or use numba to accelerate expensive computations within the model...
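
As one hedged example of such an extension (a custom log-density plugged in from plain Python via pm.DensityDist; subclassing pm.Continuous is the route for fully fledged distributions):

import numpy as np
import pymc3 as pm

y = np.random.default_rng(0).normal(size=100)   # made-up observations

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)

    def logp(value):
        # custom unit-variance Gaussian log-density, written with the backend's math ops
        return -0.5 * pm.math.sqr(value - mu) - 0.5 * np.log(2 * np.pi)

    pm.DensityDist("y", logp, observed=y)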

Community driven

PyMC3 is a community-driven project that integrates with the rest of the PyData stack. There are several libraries in multiple domains built on top of PyMC3, and the PyMC team contributes to other related libraries too. It integrates with ArviZ for postprocessing, enabling things like label-based indexing and computation with xarray, seamless visualizations with either matplotlib or bokeh, and model summaries as pandas DataFrames.
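
A sketch of that ArviZ workflow, reusing the model and trace from the inference sketch above:

import arviz as az

with model:                                  # the pm.Model the trace was sampled from
    idata = az.from_pymc3(trace)             # convert to InferenceData (xarray under the hood)

az.summary(idata)                            # summary statistics as a pandas DataFrame
az.plot_trace(idata)                         # diagnostic plots with matplotlib
az.plot_posterior(idata, backend="bokeh")    # or interactive bokeh figures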

PyMC3 in practice

Coal mining disasters

Example from Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science

Video: talk by @fonnesbeck at Montreal Python
Code: 2nd case study
This first example will showcase how intuitive and concise it is to write and fit models in PyMC3, even in cases like this one where we have discrete parameters and a switchpoint.
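
The snippets on the following slides assume that data, years, t_0 and t_e already exist; as a stand-in (the real yearly disaster counts are available with the PyMC3 paper and its repository), a synthetic version could look like:

import numpy as np

t_0, t_e = 1851, 1962                        # first and last year in the record
years = np.arange(t_0, t_e + 1)
rng = np.random.default_rng(0)
# synthetic stand-in: higher disaster rate before an (unknown) switchpoint, lower after
data = np.where(years <= 1890, rng.poisson(3.0, years.size), rng.poisson(1.0, years.size))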

Model building

\[\begin{aligned} y & \sim \text{Poisson}(r) \\ \end{aligned} \]

with pm.Model():
    disasters = pm.Poisson(
        "disasters", rate, observed=data
    )
                

Model building

\[\begin{aligned} y & \sim \text{Poisson}(r) \\ r & = \begin{cases} e, \ \text{if}\ t \leq s\\ l, \ \text{if}\ t \gt s \end{cases} \end{aligned} \]

with pm.Model():
    rate = pm.math.switch(s >= years, e, l)

    disasters = pm.Poisson(
        "disasters", rate, observed=data
    )
                

Model building

\[\begin{aligned} y & \sim \text{Poisson}(r) \\ r & = \begin{cases} e, \ \text{if}\ t \leq s\\ l, \ \text{if}\ t \gt s \end{cases} \\ s & \sim \text{DiscreteUniform}(t_0, t_e) \\ \end{aligned} \]

with pm.Model():
    s = pm.DiscreteUniform(
        "switchpoint", lower=t_0, upper=t_e
    )

    rate = pm.math.switch(s >= years, e, l)

    disasters = pm.Poisson(
        "disasters", rate, observed=data
    )
                

Model building

\[\begin{aligned} y & \sim \text{Poisson}(r) \\ r & = \begin{cases} e, \ \text{if}\ t \leq s\\ l, \ \text{if}\ t \gt s \end{cases}\\ s & \sim \text{DiscreteUniform}(t_0, t_e) \\ e & \sim \text{Exponential}(1) \\ l & \sim \text{Exponential}(1) \end{aligned} \]

with pm.Model() as model:
    s = pm.DiscreteUniform(
        "switchpoint", lower=t_0, upper=t_e
    )

    e = pm.Exponential("early_rate", 1.0)
    l = pm.Exponential("late_rate", 1.0)

    rate = pm.math.switch(s >= years, e, l)

    disasters = pm.Poisson(
        "disasters", rate, observed=data
    )
                

Inference


with model:
    trace = pm.sample(3000, tune=2000)
            

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [switchpoint]
>NUTS: [late_rate, early_rate]
            

Housing prices in Berlin

Example by @corrieaar at PyConDE & PyData Berlin 2019

Video Code Slides
This second example showcases a very popular feature of Bayesian modeling: hierarchical models.

Let's start with a linear regression

Looking at the data it seems like a reasonable choice. However, we may not want to consider all of Berlin at once. Why shouldn't each district have its own slope and intercept?
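
A minimal sketch of such a pooled regression in PyMC3, using synthetic stand-in data (the real data, variable names and priors are in the talk's repository):

import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
area = rng.normal(size=200)                                   # standardized living area (synthetic)
price = 1.0 + 0.8 * area + rng.normal(scale=0.5, size=200)    # synthetic standardized prices

with pm.Model() as pooled_model:
    intercept = pm.Normal("intercept", 0.0, 1.0)
    slope = pm.Normal("slope", 0.0, 1.0)
    sigma = pm.Exponential("sigma", 1.0)

    mu = intercept + slope * area
    pm.Normal("price", mu=mu, sigma=sigma, observed=price)

    trace = pm.sample(2000, tune=1000)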

Let's start with a linear regression

So far so good, but these two districts both have many observations, which is not always the case.

Let's stop using a linear regression?

If we fit the districts independently, we get very bad results for those with little data, which is not surprising. So what can we do? We can use a hierarchical model. These models add a higher level to the model and assume that the slopes are all generated from the same distribution: they will be different, but not too different. This is called partial pooling, as opposed to no pooling (each district has its own independent slope) and complete pooling (all districts share the same slope).

Let's go hierarchical!

Thanks to partial pooling, Schoneberg can "pool" information from the other districts, which gives a much better looking slope.
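
A hedged sketch of the partially pooled model, again with synthetic stand-in data and hypothetical names: each district gets its own slope and intercept, but all of them are drawn from shared hyperpriors:

import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
n_districts = 12
district_idx = rng.integers(0, n_districts, size=400)         # which district each flat belongs to
area = rng.normal(size=400)                                   # standardized living area (synthetic)
true_slopes = 0.8 + 0.2 * rng.normal(size=n_districts)
price = 1.0 + true_slopes[district_idx] * area + rng.normal(scale=0.5, size=400)

with pm.Model() as hierarchical_model:
    # hyperpriors: every district's intercept and slope come from shared distributions
    mu_intercept = pm.Normal("mu_intercept", 0.0, 1.0)
    sigma_intercept = pm.Exponential("sigma_intercept", 1.0)
    mu_slope = pm.Normal("mu_slope", 0.0, 1.0)
    sigma_slope = pm.Exponential("sigma_slope", 1.0)

    # per-district parameters: different, but "not too different" (partial pooling)
    intercept = pm.Normal("intercept", mu_intercept, sigma_intercept, shape=n_districts)
    slope = pm.Normal("slope", mu_slope, sigma_slope, shape=n_districts)

    sigma = pm.Exponential("sigma", 1.0)
    mu = intercept[district_idx] + slope[district_idx] * area
    pm.Normal("price", mu=mu, sigma=sigma, observed=price)

    trace = pm.sample(2000, tune=2000, target_accept=0.9)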

Acknowledgements

This talk is heavily inspired by many other talks about Bayesian statistics and PyMC3 as well as by interactions with the whole PyMC3 and greater Bayesian OSS community on GitHub and Discourse

Special thanks go to PyMC3 and ArviZ core members, to Thomas Wiecki, Chris Fonnesbeck, Corrie Bartelheimer and Allen Downey; check out their profiles and websites for more amazing content!

Where to go next?

PyMC3 documentation: docs.pymc.io

PyMCon 2020 talks: publicly available on Discourse

Learn Bayesian Statistics Podcast: learnbayesstats.com