How to plot empirical cdf in matplotlib in Python?

  • How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.

    One thing I can think of is:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import cumfreq

    a = np.array([...])  # my array of numbers
    num_bins = 20
    b = cumfreq(a, num_bins)
    plt.plot(b)

    Is that correct though? Is there an easier/better way?

    thanks.

      December 19, 2020 11:47 AM IST
  • That looks to be (almost) exactly what you want. Two things:

    First, the result is a tuple of four items. The first is the cumulative count: the number of points in or below each bin. The second is the lower limit of the smallest bin. The third is the bin size. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)

    Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.

    Here's what it does under the hood:

    def cumfreq(a, numbins=10, defaultreallimits=None):
        # docstring omitted
        h,l,b,e = histogram(a,numbins,defaultreallimits)
        cumhist = np.cumsum(h*1, axis=0)
        return cumhist,l,b,e

    It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the maximum of the ith bin, and the final value is just the size of the initial array.

    Finally, to plot it, you'll need to use the lower limit of the first bin and the bin size to work out the x-axis values.
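
As a rough sketch of the steps above, the x values and the rescaling could be derived from the tuple that cumfreq returns. The helper name cumfreq_cdf is made up for illustration, and the field names cumcount, lowerlimit and binsize assume a SciPy release in which cumfreq returns a named tuple.

```python
import numpy as np
from scipy.stats import cumfreq


def cumfreq_cdf(values, num_bins=20):
    """Return (upper bin edges, cumulative fraction) built from cumfreq.

    Hypothetical helper for illustration; not part of SciPy.
    """
    values = np.asarray(values, dtype=float)
    res = cumfreq(values, numbins=num_bins)
    # Upper edge of bin i is lowerlimit + (i + 1) * binsize.
    upper_edges = res.lowerlimit + res.binsize * np.arange(1, res.cumcount.size + 1)
    # Rescale so the final value is 1, following the usual CDF convention.
    cdf = res.cumcount / res.cumcount[-1]
    return upper_edges, cdf
```

The two returned arrays can then be passed straight to plt.plot(upper_edges, cdf).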

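    Another option is to use numpy.histogram, which returns the bin edges and can normalize the counts to a density. (The normed argument used for this in older NumPy releases has since been removed; density=True replaces it.) You'll need to do the cumulative sum yourself, and because a density integrates to 1 rather than summing to 1, each value has to be multiplied by its bin width first.

    import numpy as np
    import matplotlib.pyplot as plt

    a = np.array([...])  # your array of numbers
    num_bins = 20
    counts, bin_edges = np.histogram(a, bins=num_bins, density=True)
    # density * bin width = fraction of points in each bin
    cdf = np.cumsum(counts * np.diff(bin_edges))
    plt.plot(bin_edges[1:], cdf)

    (bin_edges[1:] is the upper edge of each bin.)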

      December 28, 2020 11:49 AM IST
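  • I have a trivial addition to AFoglia's method, to normalize the CDF:

    import numpy as np

    # density=True replaces the normed=True argument of older NumPy releases
    n_counts, bin_edges = np.histogram(myarray, bins=11, density=True)
    cdf = np.cumsum(n_counts)  # cdf not normalized, despite above
    scale = 1.0 / cdf[-1]
    ncdf = scale * cdf

    Normalizing the histogram makes its integral unity, which means the plain cumulative sum of the bin values will not end at 1. You've got to scale it yourself, either as above or by multiplying each value by its bin width before summing.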
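
To make the point concrete, here is a small sketch that compares the raw cumulative sum of a density histogram with two ways of normalizing it. The function name compare_normalizations is hypothetical, and the data in the usage note below is synthetic.

```python
import numpy as np


def compare_normalizations(values, bins=11):
    """Return bin edges plus three cumulative sums of a density histogram.

    Hypothetical helper for illustration only.
    """
    density, edges = np.histogram(values, bins=bins, density=True)
    raw = np.cumsum(density)            # ends near 1 / bin_width, not at 1
    rescaled = raw / raw[-1]            # rescale by the final value
    by_width = np.cumsum(density * np.diff(edges))  # integrate bin by bin
    return edges, raw, rescaled, by_width
```

With equal-width bins, as produced when bins is an integer, rescaled and by_width should agree up to rounding, while raw[-1] is roughly one over the bin width. For example, calling it on np.random.default_rng(0).normal(size=1000) gives two curves that both end at 1.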

      August 31, 2021 12:41 PM IST
  • If you want to display the actual true ECDF (which, as David B noted, is a step function that increases by 1/n at each of the n datapoints), my suggestion is to write code that generates two "plot" points for each datapoint:

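    import numpy as np
    import matplotlib.pyplot as plt

    a = np.array([...])  # your array of numbers
    sorted_a = np.sort(a)  # avoid shadowing the built-in sorted()
    x2 = []
    y2 = []
    y = 0.0
    for x in sorted_a:
        x2.extend([x, x])  # two plot points per datapoint: bottom and top of the step
        y2.append(y)
        y += 1.0 / len(a)
        y2.append(y)
    plt.plot(x2, y2)
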

    This way you will get a plot with the n steps that are characteristic of an ECDF, which is nice especially for data sets that are small enough for the steps to be visible. Also, there is no need to do any binning with histograms (which risks introducing bias into the drawn ECDF).
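
The same step heights can also be computed without an explicit loop. The following is a minimal sketch under that assumption; ecdf_points is a hypothetical helper name, not a NumPy or matplotlib function.

```python
import numpy as np


def ecdf_points(values):
    """Return the sorted values and the ECDF height i/n reached at each one.

    Hypothetical helper for illustration only.
    """
    x = np.sort(np.asarray(values, dtype=float))
    # After the i-th smallest value (1-based), a fraction i/n of the data is <= x.
    y = np.arange(1, x.size + 1) / x.size
    return x, y
```

Passing the result to plt.step(x, y, where="post") should draw the flat segments between datapoints. Note that this does not draw the initial rise from 0 at the smallest value, which the loop version above does include.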

     
      September 18, 2021 12:58 PM IST