QBoard » Statistical modeling » Stats - Conceptual » Representing continuous probability distributions

Representing continuous probability distributions

  • I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X, and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).

    An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is monte-carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.

    I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.

    Does anyone have any good solutions to this problem?

    Edit: The goal is to create a mini-language (aka Domain Specific Language) for manipulating PDFs. But first I need to sort out the underlying representation and algorithms.

    Edit 2: dmckee suggests a histogram implementation. That is what I was getting at with my list of uniform distributions. But I don't see how to combine them to create new distributions. Ultimately I need to find things like P(x < y) in cases where this may be quite small.

    Edit 3: I have a bunch of histograms. They are not evenly distributed because I'm generating them from occurance data, so basically if I have 100 samples and I want ten points in the histogram then I allocate 10 samples to each bar, and make the bars variable width but constant area.

    I've figured out that to add PDFs you convolve them, and I've boned up on the maths for that. When you convolve two uniform distributions you get a new distribution with three sections: the wider uniform distribution is still there, but with a triangle stuck on each side the width of the narrower one. So if I convolve each element of X and Y I'll get a bunch of these, all overlapping. Now I'm trying to figure out how to sum them all and then get a histogram that is the best approximation to it.

    I'm beginning to wonder if Monte-Carlo wasn't such a bad idea after all.

    Edit 4: This paper discusses convolutions of uniform distributions in some detail. In general you get a "trapezoid" distribution. Since each "column" in the histograms is a uniform distribution, I had hoped that the problem could be solved by convolving these columns and summing the results.

    However the result is considerably more complex than the inputs, and also includes triangles. Edit 5:[Wrong stuff removed]. But if these trapezoids are approximated to rectangles with the same area then you get the Right Answer, and reducing the number of rectangles in the result looks pretty straightforward too. This might be the solution I've been trying to find.

    Edit 6: Solved! Here is the final Haskell code for this problem:

    <span class="com">-- | Continuous distributions of scalars are represented as a</span> <span class="com">-- | histogram where each bar has approximately constant area but</span> <span class="com">-- | variable width and height. A histogram with N bars is stored as</span> <span class="com">-- | a list of N+1 values.</span> <span class="kwd">data</span><span class="pln"> Continuous </span><span class="pun">=</span><span class="pln"> C </span><span class="pun">{</span><span class="pln"> cN </span><span class="pun">::</span><span class="pln"> Int</span><span class="pun">,</span> <span class="com">-- ^ Number of bars in the histogram.</span><span class="pln"> cAreas </span><span class="pun">::</span> <span class="pun">[</span><span class="pln">Double</span><span class="pun">],</span> <span class="com">-- ^ Areas of the bars. @length cAreas == cN@</span><span class="pln"> cBars </span><span class="pun">::</span> <span class="pun">[</span><span class="pln">Double</span><span class="pun">]</span> <span class="com">-- ^ Boundaries of the bars. @length cBars == cN + 1@</span> <span class="pun">}</span> <span class="kwd">deriving</span> <span class="pun">(</span><span class="pln">Show</span><span class="pun">,</span><span class="pln"> Read</span><span class="pun">)</span> <span class="com">{- | Add distributions. If two random variables @vX@ and @vY@ are taken from distributions @x@ and @y@ respectively then the distribution of @(vX + vY)@ will be @(x .+. y). This is implemented as the convolution of distributions x and y. Each is a histogram, which is to say the sum of a collection of uniform distributions (the "bars"). Therefore the convolution can be computed as the sum of the convolutions of the cross product of the components of x and y. When you convolve two uniform distributions of unequal size you get a trapezoidal distribution. Let p = p2-p1, q - q2-q1. Then we get: > | | > | ______ | > | | | with | _____________ > | | | | | | > +-----+----+------- +--+-----------+- > p1 p2 q1 q2 > > gives h|....... _______________ > | /: :\ > | / : : \ 1 > | / : : \ where h = - > | / : : \ q > | / : : \ > +--+-----+-------------+-----+----- > p1+q1 p2+q1 p1+q2 p2+q2 However we cannot keep the trapezoid in the final result because our representation is restricted to uniform distributions. So instead we store a uniform approximation to the trapezoid with the same area: > h|......___________________ > | | / \ | > | |/ \| > | | | > | /| |\ > | / | | \ > +-----+-------------------+-------- > p1+q1+p/2 p2+q2-p/2 -}</span> <span class="pun">(.+.)</span> <span class="pun">::</span><span class="pln"> Continuous </span><span class="pun">-></span><span class="pln"> Continuous </span><span class="pun">-></span><span class="pln"> Continuous c </span><span class="pun">.+.</span><span class="pln"> d </span><span class="pun">=</span><span class="pln"> C </span><span class="pun">{</span><span class="pln">cN </span><span class="pun">=</span><span class="pln"> length bars </span><span class="pun">-</span> <span class="lit">1</span><span class="pun">,</span><span class="pln"> cBars </span><span class="pun">=</span><span class="pln"> map fst bars</span><span class="pun">,</span><span class="pln"> cAreas </span><span class="pun">=</span><span class="pln"> zipWith barArea bars </span><span class="pun">(</span><span class="pln">tail bars</span><span class="pun">)}</span> <span class="kwd">where</span> <span class="com">-- The convolve function returns a list of two (x, deltaY) pairs.</span> <span class="com">-- These can be sorted by x and then sequentially summed to get</span> <span class="com">-- the new histogram. The "b" parameter is the product of the</span> <span class="com">-- height of the input bars, which was omitted from the diagrams</span> <span class="com">-- above.</span><span class="pln"> convolve b c1 c2 d1 d2 </span><span class="pun">=</span> <span class="kwd">if</span> <span class="pun">(</span><span class="pln">c2</span><span class="pun">-</span><span class="pln">c1</span><span class="pun">)</span> <span class="pun"><</span> <span class="pun">(</span><span class="pln">d2</span><span class="pun">-</span><span class="pln">d1</span><span class="pun">)</span> <span class="kwd">then</span><span class="pln"> convolve1 b c1 c2 d1 d2 </span><span class="kwd">else</span><span class="pln"> convolve1 b d1 d2 c1 c2 convolve1 b p1 p2 q1 q2 </span><span class="pun">=</span> <span class="pun">[(</span><span class="pln">p1</span><span class="pun">+</span><span class="pln">q1</span><span class="pun">+</span><span class="pln">halfP</span><span class="pun">,</span><span class="pln"> h</span><span class="pun">),</span> <span class="pun">(</span><span class="pln">p2</span><span class="pun">+</span><span class="pln">q2</span><span class="pun">-</span><span class="pln">halfP</span><span class="pun">,</span> <span class="pun">(-</span><span class="pln">h</span><span class="pun">))]</span> <span class="kwd">where</span><span class="pln"> halfP </span><span class="pun">=</span> <span class="pun">(</span><span class="pln">p2</span><span class="pun">-</span><span class="pln">p1</span><span class="pun">)/</span><span class="lit">2</span><span class="pln"> h </span><span class="pun">=</span><span class="pln"> b </span><span class="pun">/</span> <span class="pun">(</span><span class="pln">q2</span><span class="pun">-</span><span class="pln">q1</span><span class="pun">)</span><span class="pln"> outline </span><span class="pun">=</span><span class="pln"> map sumGroup </span><span class="pun">$</span><span class="pln"> groupBy </span><span class="pun">((==)</span> <span class="pun">`</span><span class="pln">on</span><span class="pun">`</span><span class="pln"> fst</span><span class="pun">)</span> <span class="pun">$</span><span class="pln"> sortBy </span><span class="pun">(</span><span class="pln">comparing fst</span><span class="pun">)</span> <span class="pun">$</span><span class="pln"> concat </span><span class="pun">[</span><span class="pln">convolve </span><span class="pun">(</span><span class="pln">areaC</span><span class="pun">*</span><span class="pln">areaD</span><span class="pun">)</span><span class="pln"> c1 c2 d1 d2 </span><span class="pun">|</span> <span class="pun">(</span><span class="pln">c1</span><span class="pun">,</span><span class="pln"> c2</span><span class="pun">,</span><span class="pln"> areaC</span><span class="pun">)</span> <span class="pun"><-</span><span class="pln"> zip3 </span><span class="pun">(</span><span class="pln">cBars c</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">tail </span><span class="pun">$</span><span class="pln"> cBars c</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">cAreas c</span><span class="pun">),</span> <span class="pun">(</span><span class="pln">d1</span><span class="pun">,</span><span class="pln"> d2</span><span class="pun">,</span><span class="pln"> areaD</span><span class="pun">)</span> <span class="pun"><-</span><span class="pln"> zip3 </span><span class="pun">(</span><span class="pln">cBars d</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">tail </span><span class="pun">$</span><span class="pln"> cBars d</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">cAreas d</span><span class="pun">)</span> <span class="pun">]</span><span class="pln"> sumGroup pairs </span><span class="pun">=</span> <span class="pun">(</span><span class="pln">fst </span><span class="pun">$</span><span class="pln"> head pairs</span><span class="pun">,</span><span class="pln"> sum </span><span class="pun">$</span><span class="pln"> map snd pairs</span><span class="pun">)</span><span class="pln"> bars </span><span class="pun">=</span><span class="pln"> tail </span><span class="pun">$</span><span class="pln"> scanl </span><span class="pun">(\(_,</span><span class="pln">y</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">x2</span><span class="pun">,</span><span class="pln">dy</span><span class="pun">)</span> <span class="pun">-></span> <span class="pun">(</span><span class="pln">x2</span><span class="pun">,</span><span class="pln"> y</span><span class="pun">+</span><span class="pln">dy</span><span class="pun">))</span> <span class="pun">(</span><span class="lit">0</span><span class="pun">,</span> <span class="lit">0</span><span class="pun">)</span><span class="pln"> outline barArea </span><span class="pun">(</span><span class="pln">x1</span><span class="pun">,</span><span class="pln"> h</span><span class="pun">)</span> <span class="pun">(</span><span class="pln">x2</span><span class="pun">,</span> <span class="kwd">_</span><span class="pun">)</span> <span class="pun">=</span> <span class="pun">(</span><span class="pln">x2 </span><span class="pun">-</span><span class="pln"> x1</span><span class="pun">)</span> <span class="pun">*</span><span class="pln"> h</span>

    Other operators are left as an exercise for the reader.

     
      June 11, 2019 4:48 PM IST
    0
  • Continuous probability distribution: A probability distribution in which the random variable X can take on any value (is continuous). Because there are infinite values that X could assume, the probability of X taking on any one specific value is zero. Therefore we often speak in ranges of values (p(X>0) = .50). The normal distribution is one example of a continuous distribution. The probability that X falls between two values (a and b) equals the integral (area under the curve) from a to b:

    https://sites.nicholas.duke.edu/statsreview/files/2013/06/pdf-300x104.jpg 300w" alt="" width="570" height="198">

     

    https://sites.nicholas.duke.edu/statsreview/files/2013/06/family-150x150.jpeg 150w, https://sites.nicholas.duke.edu/statsreview/files/2013/06/family.jpeg 672w" alt="" width="300" height="300">

    The Normal Probability Distribution

     

     

    A probability distribution is formed from all possible outcomes of a random process (for a random variable X) and the probability associated with each outcome.  Probability distributions may either be discrete (distinct/separate outcomes, such as number of children) or continuous (a continuum of outcomes, such as height). A probability density function is defined such that the likelihood of a value of X between a and b equals the integral (area under the curve) between a and b.  This probability is always positive.  Further, we know that the area under the curve from negative infinity to positive infinity is one.

     

    The normal probability distribution, one of the fundamental continuous distributions of statistics, is actually a family of distributions (an infinite number of distributions with differing means (μ) and standard deviations (σ).  Because the normal distribution is a continuous distribution, we can not calculate exact probability for an outcome, but instead we calculate a probability for a range of outcomes (for example the probability that a random variable X is greater than 10).

      November 25, 2021 12:04 PM IST
    0
  • Probability distributions form a monad; see eg the work of Claire Jones and also the LICS 1989 paper, but the ideas go back to a 1982 paper by Giry (DOI 10.1007/BFb0092872) and to a 1962 note by Lawvere that I cannot track down (http://permalink.gmane.org/gmane.science.mathematics.categories/6541).

    But I don't see the comonad: there's no way to get an "a" out of an "(a->Double)->Double". Perhaps if you make it polymorphic - (a->r)->r for all r? (That's the continuation monad.)
    shareimprove this answer
    edited Feb 13 '14 at 3:03
     
     
    community wiki

    2 revs, 2 users 80%
    Jeremy Gibbons
    add a comment
    2

    Is there anything that stops you from employing a mini-language for this?

    By that I mean, define a language that lets you write f = x + y and evaluates f for you just as written. And similarly for g = x * z, h = y(x), etc. ad nauseum. (The semantics I'm suggesting call for the evaluator to select a random number on each innermost PDF appearing on the RHS at evaluation time, and not to try to understand the composted form of the resulting PDFs. This may not be fast enough...)


    Assuming that you understand the precision limits you need, you can represent a PDF fairly simply with a histogram or spline (the former being a degenerate case of the later). If you need to mix analytically defined PDFs with experimentally determined ones, you'll have to add a type mechanism.


    A histogram is just an array, the contents of which represent the incidence in a particular region of the input range. You haven't said if you have a language preference, so I'll assume something c-like. You need to know the bin-structure (uniorm sizes are easy, but not always best) including the high and low limits and possibly the normalizatation:

    struct histogram_struct {
    int bins; /* Assumed to be uniform */
    double low;
    double high;
    /* double normalization; */
    /* double *errors; */ /* if using, intialize with enough space,
    * and store _squared_ errors
    */
    double contents[];
    };
    This kind of thing is very common in scientific analysis software, and you might want to use an existing implementation.
      June 14, 2019 12:07 PM IST
    0
  • Autonomous mobile robotics deals with similar issue in localization and navigation, in particular the Markov localization and Kalman filter (sensor fusion). See An experimental comparison of localization methods continued for example.

    Another approach you could borrow from mobile robots is path planning using potential fields.

      June 14, 2019 12:21 PM IST
    0
  •  

    I worked on similar problems for my dissertation.

    One way to compute approximate convolutions is to take the Fourier transform of the density functions (histograms in this case), multiply them, then take the inverse Fourier transform to get the convolution.

    Look at Appendix C of my dissertation for formulas for various special cases of operations on probability distributions. You can find the dissertation at: http://riso.sourceforge.net

    I wrote Java code to carry out those operations. You can find the code at: https://sourceforge.net/projects/riso

      June 11, 2019 4:50 PM IST
    0