Kernel Density Estimates vs. Histograms

One of the first steps of statistical analysis is understanding the distribution of your data. Histograms work by counting the frequency of data points whose values fall within a particular range. For example, if you were to assess happiness levels on a continuous scale of 1-10 in a chosen sample, you could count how many people scored between 1-3, 3-6, 6-9, and so on. These groups are called bins, and the size of a bin can differ. So you could have a bin with a range (otherwise known as its width) equal to one (i.e. 1-2, 2-3, etc.), or a bin with a width equal to four (i.e. 1-5, 5-9, etc.). The larger the bin, the more data points will fall into it, because more scores will have a value within its range, increasing its frequency count.

Using psychometric scales, you can measure the salience of particular personality traits. In the example below, participants could score between 1 and 5 on a measure of extraversion. Below is a histogram representing the frequency of extraversion scores within bins of width 0.25, and you can see that more people score somewhere in the middle of this extraversion scale than at either end.

Here is some R code if you would like to make your own histogram.

 
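As a minimal sketch: the original extraversion data aren't available here, so the code below simulates scores on a 1-5 scale (an assumption) and plots them with bins of width 0.25 using ggplot2.

library(ggplot2)

# Simulated extraversion scores on a 1-5 scale (assumption: the real data aren't available here)
set.seed(42)
extraversion <- data.frame(score = pmin(pmax(rnorm(200, mean = 3, sd = 0.8), 1), 5))

# Histogram with bins of width 0.25, as in the plot above
ggplot(extraversion, aes(x = score)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", colour = "white") +
  labs(x = "Extraversion score (1-5)", y = "Frequency")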

We can use histograms to estimate the probability of a person scoring within a particular range. Because more people score in the middle of the scale, there is a greater likelihood that the next person who takes the personality test will also have a score near the middle, compared with scoring at either extreme. You can even calculate the probability of a score falling into a particular bin. However, depending on how you define the bin width, the shape of the distribution can vary widely.
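As a quick sketch (reusing the simulated extraversion scores from the block above), that bin probability is just the proportion of observed scores that fall inside the bin; the bin limits here are arbitrary choices.

# Proportion of scores falling in the bin [3.00, 3.25) - a rough probability estimate
bin_lower <- 3.00
bin_upper <- 3.25
mean(extraversion$score >= bin_lower & extraversion$score < bin_upper)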

The smaller the bin width, the greater the detail and precision, allowing for better probability estimates. Ideally, because extraversion is a continuous scale, it would be advantageous to be able to take any score on the scale and get a precise estimate of the likelihood that a person will obtain that particular score in future. However, bins are not a continuous measure, and once you reach a certain level of precision, gaps start to appear in the distribution.

Kernel density plots are a way of smoothing the distribution into a line, rather than bars, allowing for continuity. A kernel is a line shape that can be described using a mathematical function. Kernel functions "fill in the gaps" by placing a kernel over every data point. Look at the graph below: + indicates a data point, and directly above each point is the peak of a Gaussian bell curve, which is one example of a kernel.

There are several kernel functions you can choose from. See the table below, which I pinched from Wikipedia.
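For example, here are two commonly used kernels written out as R functions (a sketch; the function names are my own):

# Gaussian kernel: a bell curve centred on 0
gaussian_kernel <- function(u) exp(-u^2 / 2) / sqrt(2 * pi)

# Epanechnikov kernel: a parabola that is zero outside [-1, 1]
epanechnikov_kernel <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)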

In R, when using the density() function, the Gaussian kernel is the default, and it is what is used in the picture below (which I stole from the instructional video linked at the end of this post).
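In code, calling density() with its defaults looks something like this (again using the simulated scores from earlier; the plot title is my own):

# density() uses the Gaussian kernel by default
d <- density(extraversion$score)
plot(d, main = "Kernel density estimate of extraversion scores")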

How kernel density plots work: for every possible value on the continuous scale, you can examine whether it falls under any of the kernels (in this case, Gaussian curves). In the example below, depicted by a blue line, the chosen score is captured under four curves, and the height of each curve at that point is recorded. These heights are summed across all four curves to create a kernel density estimate for that score (or region, if you do this calculation across a bin). When you do this across all possible scores, you can draw a smooth line depicting a kernel density estimate for the overall distribution of your variable (X).
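Written out by hand, that summing step looks roughly like this (a sketch, not how density() is implemented internally): for a chosen score, you evaluate a Gaussian curve centred on every data point, add up the heights, and scale by the sample size and bandwidth. The bandwidth value of 0.3 here is an arbitrary choice for illustration.

# Manual kernel density estimate at a single point x, using a Gaussian kernel with bandwidth h
kde_at <- function(x, data, h = 0.3) {
  sum(dnorm((x - data) / h)) / (length(data) * h)
}

# Estimate the density at a score of 3.1, then across a grid of scores to draw the smooth line
kde_at(3.1, extraversion$score)
grid <- seq(1, 5, by = 0.01)
smooth_line <- sapply(grid, kde_at, data = extraversion$score)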

Kernel density estimates are intuitive. If there are lots of neighbouring data points, the estimated density will be higher in that area, because any particular point there falls under several curves. Kernels such as the Gaussian curve also weight these contributions, so the closer a score is to the peak of a neighbouring curve, the larger the value it adds. Kernel density estimation is also non-parametric: you use the data that exist in your sample to estimate a distribution of any shape; it doesn't have to fit a preconceived distribution.

Histograms can often be a poor method for determining the shape of a distribution because they are strongly affected by the number of bins used. While density estimates get around this issue, the bandwidth of the kernel (the width of the kernel) can influence the estimated distribution. Large bandwidths oversmooth the data, which leads to a loss of detail when examining the structure of the underlying distribution. Undersmoothing (small bandwidths) produces too many spurious artefacts, so the estimate no longer represents your underlying distribution. See below:
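You can reproduce this effect yourself by varying the bw argument of density(); the bandwidth values below are arbitrary choices for illustration.

# Compare an oversmoothed, a default, and an undersmoothed estimate
plot(density(extraversion$score, bw = 1),    main = "Large bandwidth: oversmoothed")
plot(density(extraversion$score),            main = "Default bandwidth (bw.nrd0)")
plot(density(extraversion$score, bw = 0.02), main = "Small bandwidth: undersmoothed")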

Histograms can be drawn by hand and are arguably easier to interpret than kernel density plots, which require statistical software such as R to compute. However, this is very easily done in something like R using packages such as ggplot2:

 
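A minimal ggplot2 version (using the simulated scores again; the colours are arbitrary) might look like:

# Kernel density plot of the simulated extraversion scores
ggplot(extraversion, aes(x = score)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  labs(x = "Extraversion score (1-5)", y = "Density")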

By calculating densities it is also possible to plot a normal curve over the top, so you can compare the distribution of your data to the shape of a normal distribution.

If you want to make the two plots above, here is some code for you to adjust:

 
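Here is a sketch of the kind of code you could adjust (still using the simulated scores; the mean and standard deviation for the normal curve are taken from that data):

# Histogram on the density scale with a kernel density estimate laid over it
ggplot(extraversion, aes(x = score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                 fill = "grey80", colour = "white") +
  geom_density(colour = "steelblue", linewidth = 1)

# The same histogram with a normal curve for comparison
ggplot(extraversion, aes(x = score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                 fill = "grey80", colour = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(extraversion$score),
                            sd = sd(extraversion$score)),
                colour = "firebrick", linewidth = 1)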

So overall, histograms are easier to interpret because they use counts or frequencies. However, density scales are much better if you want to compare your distribution to other density models, such as the density of a normal distribution. Both require some care: the bin width in histograms and the bandwidth in kernel density estimates.

https://www.youtube.com/watch?v=QSNN0no4dSI <- This video helped me understand this topic. Watch up to the one-hour mark.
