Tuesday, 11 December 2018


Introduction

In this post I will use a Bayesian approach to analyse a dataset containing the responses of students to a test. The full code and the dataset can be found on Github link. Bayesian statistics is built on a philosophy different from that of classical/frequentist statistics. As a consequence, instead of observing just one outcome, we look at multiple possibilities and identify the probability of each of them occurring. The strength of Bayesian analysis lies in the basic questions we can ask and the simplicity with which the answers can be communicated. We will see such an example in what follows.
The underlying assumption in the classical approach is that "the phenomenon under test is fixed and the uncertainty observed arises from experimental and measurement errors". In this approach,
  • We start with defining a Null hypothesis and an Alternate hypothesis
  • Then we choose an appropriate test to find the p-value
  • Based on the decided significance level, α, we can either reject null or fail to reject null
The last two outcomes are usually read as "the null hypothesis is wrong and the alternate hypothesis is correct" and "the null hypothesis is correct and so the alternate hypothesis is wrong" respectively. Real-world instances may not be so black and white because of the huge number of factors affecting each case (refer to chaotic systems).
In the Bayesian approach, the underlying phenomenon is believed to be uncertain, and this uncertainty reduces as we observe more information. The steps taken to perform a Bayesian analysis are
  • Define a prior belief for the parameters under study
  • Find the posterior distribution of the parameters based on the observed evidence
The posterior is estimated following Bayes' rule, defined as
P(Parameters|Evidence) ∝ P(Evidence|Parameters) * P(Parameters)
P(Parameters) is the prior distribution
P(Parameters|Evidence) is the posterior distribution of the parameters conditioned on the observed evidence
P(Evidence|Parameters) is the likelihood of observing the data given the parameters
The prior distribution can be informative, taken from a previous analysis or from a good understanding of the system under study. Where there is no clear understanding of the system, the prior can be non-informative, such as a uniform distribution.

Description of the dataset

In this example we will consider a dataset containing the responses of students to a test with multiple choice questions (each with a single answer). A subset of the dataset is shown here. Each row contains the responses given by one candidate to the questions in the columns. The last column, "PointsReceived", is the candidate's final score as calculated by the online judge.
The correct answer to each question and the points allotted to it are shown below.
First, let's quickly check whether the scores calculated by the judge are correct.
The calculated scores look correct, so we can move on to our analysis.
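The score check can be sketched as follows. The column names, answer key and points here are made up for illustration; the real dataset's columns and key will differ.

```python
# A minimal sketch of the score check, using hypothetical column names
# ("Q1".."Q3"), an invented answer key and invented points per question.
import pandas as pd

answer_key = {"Q1": "A", "Q2": "C", "Q3": "B"}   # hypothetical correct answers
points     = {"Q1": 1,   "Q2": 2,   "Q3": 3}     # hypothetical points allotted

responses = pd.DataFrame({
    "Q1": ["A", "B"],
    "Q2": ["C", "C"],
    "Q3": ["B", "A"],
    "PointsReceived": [6, 2],                    # scores reported by the judge
})

# Recompute each candidate's score from the key and compare with the judge.
recomputed = sum(
    (responses[q] == ans) * points[q] for q, ans in answer_key.items()
)
print((recomputed == responses["PointsReceived"]).all())  # True if judge is correct
```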

All the questions in this test have a single correct answer. Hence a candidate's response to a question is either correct or incorrect, and so it is a Bernoulli random variable. Assuming that all candidates are equally capable of answering all questions, the responses to any given question are samples from a binomial experiment.
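Concretely, the responses to a question reduce to a binomial count: n candidates (trials) and k correct answers (successes). A small sketch with an invented 0/1 correctness matrix:

```python
import numpy as np

# Hypothetical 0/1 correctness matrix: 5 candidates (rows) x 3 questions
# (columns). Each entry is a Bernoulli outcome; under the equal-capability
# assumption, each column is a sample of size n from one binomial experiment.
correct = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
])

n = correct.shape[0]      # number of candidates (trials per question)
k = correct.sum(axis=0)   # number of correct answers per question
print(n, k)               # 5 [4 3 2]
```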

Based on the allotted points, the questions can be categorized as C1 (1 pt), C2 (2 pts) and C3 (3 pts). This categorization reflects the questioner's choice of unequal weightage across questions.

Defining Priors

Let's say a 1pt question can be solved correctly by 80% of the candidates. Since all candidates are assumed to be equally capable, this means that when posed with such a question, a candidate answers it correctly 80% of the time. Similarly, a 2pt question can be expected to be solved correctly by 60% of the applicants and a 3pt question by 40%. Mathematically,
  • E(solving correctly | 1pt question) = 0.8
  • E(solving correctly | 2pt question) = 0.6
  • E(solving correctly | 3pt question) = 0.4
A point to note from the above is that E(solving correctly | 1pt question) = 0.8, not P(solving correctly | 1pt question) = 0.8. This means there are possibilities (in the multiverse, if you may) where a 1pt question is solved correctly with a probability higher or lower than 0.8. This is a distinctive feature of the Bayesian way compared to classical methods.
The random variable P(solving a 1pt question correctly), and likewise the other two, can be taken to follow a beta distribution. A beta-distributed variable has the range [0,1], which matches our case. Also, the beta distribution is a conjugate prior for a binomial likelihood (ref wikipedia).

There are 2 hyperparameters needed to define the shape of a beta distribution, namely α and β:
  • mean = α/(α + β)
  • concentration = α + β
The mean is the expected value, while the concentration controls the uncertainty in the parameter: the larger the concentration, the tighter the distribution around the mean.
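Inverting the two relations above gives the shape parameters directly: α = mean × concentration and β = (1 − mean) × concentration. A quick sketch:

```python
# Convert a (mean, concentration) pair into beta shape parameters:
#   alpha = mean * concentration
#   beta  = (1 - mean) * concentration
def beta_params(mean, concentration):
    alpha = mean * concentration
    beta = (1.0 - mean) * concentration
    return alpha, beta

# e.g. mean 0.8 with concentration 50 gives Beta(40, 10)
print(beta_params(0.8, 50))  # (40.0, 10.0)
```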

Analysis and Inferences

Fixing the concentration to 50 and using the mean values defined before, we can calculate the shape parameters. From the observed evidence, the posterior hyperparameter values can then be found. The prior and posterior parameter values are summarised below (in the format α, β).
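Because the beta distribution is conjugate to the binomial likelihood, the posterior update is a simple closed form: observing k correct answers out of n attempts turns a Beta(α, β) prior into a Beta(α + k, β + n − k) posterior. A sketch with illustrative numbers (the counts below are invented, not the dataset's):

```python
# Beta-binomial conjugate update: after observing k successes in n trials,
# the Beta(alpha, beta) prior becomes Beta(alpha + k, beta + n - k).
def posterior(alpha, beta, k, n):
    return alpha + k, beta + n - k

# Illustrative only: a Beta(40, 10) prior for a 1pt question, plus a
# hypothetical 70 correct answers out of 100 candidate attempts.
print(posterior(40, 10, 70, 100))  # (110, 40)
```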
Based on these parameters, samples can be drawn from the beta distributions. The probability density functions of the probability of solving a question correctly, under the prior and the posterior, are shown below.
An interesting observation from the above plots is that the spread, or dispersion, of the posterior is much smaller than that of the initial belief. This supports the starting assumption that all applicants are equally capable, and so the uncertainty in the probability of answering correctly is small.

Now suppose we say that any question solved by 80% or more of the candidates is easy, one solved by 40-80% is of medium difficulty, and one solved by fewer than 40% is hard. Then, by taking samples from the distribution, we can easily visualise the categorization of each question through a bar graph, as shown below.
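The categorization step can be sketched as follows: draw posterior samples and count the fraction falling in each difficulty band. The hyperparameters below are illustrative only, not the dataset's actual posterior values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative posterior for one question (made-up hyperparameters).
samples = rng.beta(110, 40, size=10_000)

# Fraction of posterior samples in each difficulty band; these fractions
# are the bar heights for the categorization plot.
easy   = np.mean(samples >= 0.8)
medium = np.mean((samples >= 0.4) & (samples < 0.8))
hard   = np.mean(samples < 0.4)
print(easy, medium, hard)
```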

Conclusion

In this post I have taken a Bayesian approach to analysing a dataset containing the responses of applicants to a test. The advantage of Bayesian analysis lies in getting an idea of alternate scenarios and their associated chances of occurring. The posterior estimates obtained can be used as a prior for the next dataset of similarly capable candidates tested on the same questions.

References

While my interest in running the analysis is personal, I had to learn the mathematical concepts somewhere, and the following references have motivated, inspired and taught me.