Sunday, 2 August 2020

Design of objective functions for model calibration

Introduction

Model calibration, or parameter estimation, is the task of tailoring the free parameters of a model to best explain the known observations. This step is vital because new predictions can only be made from the model after it: for the predictions to be accurate, the free parameters must first be identified from the known data. Passing through this step effectively requires fighting on multiple fronts. The first is the design of the model itself. This sounds trivial to say, but it is often not easy to test all the behaviours of the model, so model design is in general an iterative process. If the starting point is a first-order polynomial relation and it does not fit the observations accurately, it may be better to try higher-order or non-linear relations.

The second front in the calibration task is the choice of a search strategy. For continuous and differentiable functions, calculus gives closed-form solutions through derivatives, but in reality we often encounter functions too complicated to differentiate. In such cases we rely on search/heuristic methods, ranging from a simple grid search to advanced meta-heuristics (a Genetic Algorithm, for example).

Because an effective search relies on evaluating fitness values at different positions in the search space, choosing an appropriate objective function is the third battle front. An objective function maps a position in the space to a value (a scalar or a vector). If the goal of the search is to reach a known destination point, the objective function can be thought of as a distance measure. In other cases we do not know the destination exactly but understand its characteristics, for example that the final point maximizes or minimizes some measure. In either case, designing an objective function that resonates with the particular problem is crucial for reaching the ideal solution. For example, consider a cube whose edges are the only paths connecting the vertices. If a particle is to move from one vertex to another, picking a Euclidean distance measure is ineffective compared with a Manhattan distance measure: since the particle cannot move along a diagonal, the Euclidean distance does not give a true measure of travel distance between vertices on opposite ends of a diagonal, but the Manhattan distance does.

The path taken by a particle moving on the cube along its edges from A to G is better expressed by the Manhattan distance (dark edges) than by the Euclidean distance (dotted red).
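To make the contrast concrete, here is a minimal sketch (assuming a unit cube with opposite vertices A = (0,0,0) and G = (1,1,1), which are my own choice of coordinates) computing both distance measures in R.

```r
# Opposite vertices of an assumed unit cube
A <- c(0, 0, 0)
G <- c(1, 1, 1)

euclidean <- function(p, q) sqrt(sum((p - q)^2))
manhattan <- function(p, q) sum(abs(p - q))

euclidean(A, G)  # ~1.73, the diagonal the particle cannot actually travel
manhattan(A, G)  # 3, the length of the shortest path along the edges
```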

In this post I will share two instances where designing a functional form by incorporating the nuances of the case helped me implement an effective optimization.

Current value vs Cumulative value:

In a path-finding problem the goal is to find a path between two points on a given map; for example, we want to find a route from home to a restaurant. To be efficient, one would like to follow the shortest, or at least a thrifty, path. The problem gets more realistic if the map contains roadblocks, circular paths, sections with terrain that is difficult to drive, and so on. To gauge the efficiency of a chosen path it is necessary to know the distance between points. On a map like a well-planned city grid, the Manhattan distance is a good measure, for the same reason as with the cube. But in such a setting one can easily run into the problem of circular paths, i.e. returning to visited points and getting stuck in a loop. So an objective function better than just the distance to the goal is needed. In this instance that is a cumulative distance: instead of only evaluating the distance from the current location to the destination, it is better to also keep track of the distance already travelled.

In a restricted map, a simple measure of distance from the current location to the destination can lead to an impasse. A cumulative measure plays the long game and considers more options.

An accumulated measure does not avoid a path that seems costly in the short term but rather plays the long game, as the sketch below illustrates. Such a fitness measure is needed because there may be no path made of locally shortest segments linking the origin and the destination, and ignoring this leads to never reaching the destination.
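The following is a hypothetical sketch (not my actual path-finding code) contrasting the two objective functions for a candidate path on a grid: one looks only at where the path currently ends, the other accounts for the cost already accumulated along the path.

```r
goal <- c(5, 5)
manhattan <- function(p, q) sum(abs(p - q))

# objective 1: only looks at the current end point of the path
fitness_current <- function(path, goal) {
  manhattan(path[[length(path)]], goal)
}

# objective 2: steps taken so far plus remaining distance, so a detour that
# moves "away" from the goal to avoid a roadblock is judged on its total effort
fitness_cumulative <- function(path, goal) {
  steps <- length(path) - 1
  steps + manhattan(path[[length(path)]], goal)
}

path_a <- list(c(1, 1), c(2, 1), c(3, 1))            # heads straight for the goal
path_b <- list(c(1, 1), c(1, 2), c(1, 3), c(2, 3))   # detours around a blocked column

fitness_current(path_a, goal);    fitness_current(path_b, goal)
fitness_cumulative(path_a, goal); fitness_cumulative(path_b, goal)
```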

Choice of function based on the magnitude of values

In an experiment undertaken to understand the relationship between two variables, multiple observations are noted and model fitting is used to find the best parameters of the model. For this fitting job a distance measure, such as the squared difference between corresponding points, is used to quantify the fitness of a particular set of parameters. A slightly different but analogous situation is the multi-objective case, where there are several measurable/observable variables and the model has to be optimized on all measures simultaneously. This array of measures can be converted to the usual scalar output by finding appropriate weights. These weights deserve special attention when combining measures with different ranges, so that no single measure dominates the search. A plain least-squares distance between the observed and simulated data may therefore not work well. A better way to handle the situation is to bring all objectives into a common range (say [0,1]), for instance by dividing each error by the magnitude of its maximum possible error (or simply by its order of magnitude: 10, 10^2, 10^3, ...), as in the sketch below.
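A minimal sketch of this normalization idea, assuming the maximum possible errors are known (the function name and the numbers are placeholders of my own):

```r
# Combine two objectives with very different ranges into one scalar by
# scaling each error with its (assumed known) maximum possible error.
combine_objectives <- function(err1, err2, err1_max, err2_max) {
  # each term is pushed into [0, 1] before summing, so neither dominates
  err1 / err1_max + err2 / err2_max
}

# hypothetical example: err1 lives in the thousands, err2 close to 1
combine_objectives(err1 = 1500, err2 = 0.4, err1_max = 5000, err2_max = 1)
```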

Consider two error/distance measures: ERR1 (squared distance) = ∑(y_est - y_obs)^2 and ERR2 (fractional change) = ∑ abs((y_est - y_obs)/y_obs). The numerator in ERR2 measures the same deviation as the terms of ERR1, so the difference between them lies in the denominator. The inverse 1/y_obs can be thought of as a weight: if all y_obs are in a similar range, all error terms are weighted similarly, but if there are outliers with large y_obs, the function penalizes them with a lower weight. To demonstrate this I will work out a hypothetical example with these two objective functions.

In the figure below, the red dots are observations from a hypothetical experiment capturing the relation between two variables X and Y. Theoretically the relation between them is Y = X. The experimental observations are vulnerable to external noise, so the relation is not easily seen, but a careful inspection highlights the outliers at x=3 and x=9. In the code you can see that I added random normal noise at all points and extra noise to create outliers at x=3, 6, 9.

A sum of squared errors is sensitive to outliers in the data; designing an objective function that incorporates the peculiarities of the case is more effective.

Now, looking at the average error between the family of lines Y = X + intercept and the observations, the minimum error with ERR2 occurs at intercept = 0, whereas the minimum error with ERR1 occurs at intercept = 2. The outlier at x=3 is pulling up the overall fit, showing that ERR1 is more sensitive to this outlier than ERR2. A sketch of this comparison follows.
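Below is a sketch approximating the hypothetical experiment described above; the exact noise magnitudes are my own assumptions, and the author's actual R code for this example is on Github.

```r
set.seed(1)
x    <- 1:10
yobs <- x + rnorm(10, sd = 0.5)                     # Y = X plus random normal noise
yobs[c(3, 6, 9)] <- yobs[c(3, 6, 9)] + c(6, 1, 4)   # extra noise to create outliers

err1 <- function(b) sum((x + b - yobs)^2)             # squared distance
err2 <- function(b) sum(abs((x + b - yobs) / yobs))   # fractional change

intercepts <- seq(-2, 4, by = 0.5)
data.frame(intercept = intercepts,
           ERR1 = sapply(intercepts, err1),
           ERR2 = sapply(intercepts, err2))
# With outliers sitting above the trend, ERR1's minimum tends to be pulled
# toward a positive intercept, while ERR2's minimum stays near intercept = 0.
```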

In the above example the dominant outliers have a higher magnitude than the other observations, so the weighting 1/y_obs reduces the importance attached to these points. The same weighting scheme would not work if an outlier had a lower magnitude than the general trend: there, 1/y_obs would give the outlier more importance during fitting. In such a case a different approach, such as multiplying by y_obs instead of dividing by it, would prove useful.

Conclusion :

In this post I discussed the model calibration step in mathematical model development. Being effective in this phase requires the modeller to assess several items: the mathematical description of the relations in the model, a good search strategy, and the description of the error or fitness measure. I presented my case for the importance of the design of objective functions, with examples to illustrate the point.

In reality it is often wise to attack the problem on multiple fronts. In my own task I encountered the case of optimizing a model for multiple objectives simultaneously. I chose to reduce the multiple objectives to a scalar, where using a fractional error measure spared me from worrying about the weights to use. I then implemented a global optimization strategy, the Genetic Algorithm. For the implementation to be practical I vectorized the operations in the program. Further, trying out different choices of ODE solvers gave me a more economical option, bringing the optimization task down from an estimated run time of over 20 hours to slightly over 1 hour.

The R code for the last example is hosted on Github

Saturday, 7 March 2020

BIOMD0000000449 - Structure of Network

Introduction

Mathematical modelling is a representation of a system under study, with a rich history of being used for complex phenomena such as biological and financial systems. Building a model that replicates or approximates the actual system has the advantage that, once confidence in the model has been established, it can be perturbed in a safe in-silico environment and the responses studied. These models also reveal underlying mathematical properties that are hard to notice when working with the actual system, which can give deep insights to an explorer.

Complex systems are characterized by a large number of interacting elements, and the systemic response is best studied as a whole, in contrast to a reductionist approach. A simple articulation of the response of a complex system is that it is greater than the sum of its parts. Owing to their complex nature, models of these systems are a great playground for mining underlying mathematical structures that are ubiquitous in natural phenomena. In this post I will use a published model of a biological system to gain an understanding of its architecture.

For this exercise I have chosen the BIOMD0000000449 model from the EMBL database, because it is publicly available and simple enough to start with. The EMBL portal has links to the publication and other information. It is a non-linear dynamical model of the insulin signalling pathways that are crucial in the Type-2 diabetic condition. There are 27 species in the model, whose dynamics are written out as ODEs. Through its parameters, the model identifies the differences between the system in healthy and diseased states.

Analysis on the structure of interactions

A group of interacting species can be depicted as a graph/network. The nodes of the network are the species, and an edge/link between two species represents an interaction between them. An easy-to-visualize example is a road network: road junctions can be represented as the nodes of the graph, and the edges are the roads connecting the junctions. Depending on the direction of flow, graphs can be undirected or directed. A road that allows movement in both directions corresponds to an undirected edge, whereas a one-way road corresponds to a directed edge.

The following diagram shows the structure of the current system. This is not a simple graph, since it is directed and has loops. Species connected by an arrow interact: the species at the tail of the arrow affects the species at the head.


The loop around each node represents the fact that the state of a species is affected by its own current concentration. Since this is a common trait of all species in this system, I will ignore these loops in the rest of this exercise.

The degree of a node is the number of edges connected to it. The size of each node is drawn proportional to its total degree. The red-coloured nodes have high degree and can be seen as high-value species. The in-degree, out-degree and total degree of all the species are shown below.

##        species deg_in deg_out deg_tot
## 1        IRS1p      4       6      10
## 2  PKB308p473p      4       5       9
## 3      mTORC1a      3       5       8
## 4         IRip      2       5       7
## 5      PKB308p      3       4       7
## 6     IRS1p307      2       4       6
## 7      PKB473p      2       4       6
## 8       AS160p      3       3       6
## 9         IRS1      3       2       5
## 10          Xp      2       3       5
## 11     mTORC2a      2       3       5
## 12        S6Kp      2       3       5
## 13          IR      2       2       4
## 14         IRp      2       2       4
## 15         PKB      3       1       4
## 16      mTORC1      3       1       4
## 17       AS160      3       1       4
## 18         IRi      2       1       3
## 19     IRS1307      2       1       3
## 20           X      2       1       3
## 21      mTORC2      2       1       3
## 22      GLUT4m      2       1       3
## 23       GLUT4      2       1       3
## 24         S6K      2       1       3
## 25          S6      2       1       3
## 26         S6p      2       1       3
## 27       IRins      1       1       2

The total degree of nodes in the system ranges from 2 to 10 with a mean degree of 4.74 and a standard deviation of 2.05.

Properties of this directed graph or digraph

This digraph has 27 nodes and 64 edges. The average degree of a node = sum of degrees / 27 = 4.74. In a digraph with N nodes, each node can be connected to N-1 other nodes, so in a completely connected digraph the degree of a node is 2*(N-1) (the factor 2 accounts for the two directions of connection), which is 2*(27-1) = 52 in this case. The average degree in this network is far less than in the completely connected graph, so the species in the system are not highly interacting. Further, a complete digraph would have 27P2 = 702 edges, far more than the 64 edges of the current network. The system is therefore about 9% connected (= 100*64/702); this fraction is called the "density" of the network. Since the density is low, this is a sparse network. The sketch below shows how these quantities can be computed.
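A minimal sketch of how the degree table and density above can be obtained with igraph; the edge list here is a small invented placeholder (a few species names from the model), not the full 27-species network, which is built in the code on Github.

```r
library(igraph)

# placeholder edges between a handful of the model's species
edges <- data.frame(from = c("IR", "IRp", "IRS1", "IRS1p"),
                    to   = c("IRp", "IRS1", "IRS1p", "PKB"))
g <- graph_from_data_frame(edges, directed = TRUE)

data.frame(species = V(g)$name,
           deg_in  = degree(g, mode = "in"),
           deg_out = degree(g, mode = "out"),
           deg_tot = degree(g, mode = "all"))

mean(degree(g, mode = "all")); sd(degree(g, mode = "all"))
edge_density(g)   # for the real model: 64 / (27 * 26) ~ 0.09
```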

Comparison with a random network

Consider a random digraph G(N,p), where N is the number of nodes (27, as in our system) and p is the probability of a link between any 2 nodes. Since a total of NP2 = N(N-1) directed edges are possible in the network, the probability of having exactly L edges is

P(L edges) = C(N(N-1), L) * p^L * (1-p)^(N(N-1)-L)

and the p that maximizes the occurrence of L edges is p* = L/(N(N-1)), i.e. p* = 64/702 = 0.09 = the density of the system. So, if the probability of a directed link between 2 nodes equals the density of the network, then the observed realization of the random network is the most probable one. If p < p* it is more likely to have fewer than L edges, and if p > p* it is more likely to have more than L edges. The short numerical check below illustrates this.
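A quick numerical sketch checking that p = L/(N*(N-1)) maximizes the probability of seeing exactly L = 64 directed edges among the 702 possible ones:

```r
N <- 27; L <- 64
possible <- N * (N - 1)                     # 702 possible directed edges

p_grid <- seq(0.01, 0.30, by = 0.001)
prob_L <- dbinom(L, size = possible, prob = p_grid)

p_grid[which.max(prob_L)]                   # ~0.091 = 64 / 702, the network density
```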

An exciting difference between the structure of real-life networks and a random network is the presence of "hubs" in real networks. Hubs are high-degree nodes. When a few hubs coexist with a large number of low-degree nodes, this gives rise to the scale-free nature of the network. As the name suggests, scale-free networks maintain a similar structure while growing, which makes them tolerant to failure. A popular theory for the formation of scale-free networks is preferential attachment: a new species preferably connects to an existing, highly connected species, simply understood as "the rich get richer".

Just like the current digraph, with 27 nodes and link probability 0.09, I can imagine a randomly connected network G(27, 0.09). In such a randomly generated digraph, the probability of a node having degree d is given by the binomial term

P(degree = d) = C(2(N-1), d) * p^d * (1-p)^(2(N-1)-d)

Here 2(N-1) is the maximum number of edges a node can have; for N = 27, P(degree = d) = C(52, d) * p^d * (1-p)^(52-d). For a sparse random network the binomial degree distribution can be approximated by a Poisson distribution

P(degree = d) = e^(-λ) * λ^d / d!

where λ is both the expectation and the variance, λ = 52*0.09 = 4.74. The comparison is sketched below.
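A small sketch comparing the exact binomial degree probabilities of G(27, 0.09) with the Poisson approximation using λ = 52*0.09:

```r
p <- 64 / (27 * 26)
d <- 0:12
round(data.frame(degree   = d,
                 binomial = dbinom(d, size = 2 * (27 - 1), prob = p),
                 poisson  = dpois(d, lambda = 2 * (27 - 1) * p)), 4)
```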

A side note: is it odd that the mean degree of nodes in the system equals the mean degree of the random network? As seen above, p = L/NP2 = L/(N*(N-1)). For a binomial random variable with distribution C(n, r) * p^r * (1-p)^(n-r), the expectation is n*p = 52*0.09 ≈ 4.74. The mean degree in our system = sum of degrees / number of nodes = 2*L/N (each edge adds 1 to the degree of the nodes at either end). And n*p = 2*(N-1)*p = 2*(N-1)*L/(N*(N-1)) = 2*L/N, so the two are equal by construction.

The variance of the degree distribution of this sparse random network is 4.74, comparable to the variance observed in our network, 4.2. This close match is not what I anticipated: because of preferential attachment, I expected a scale-free network to have a larger variance than a randomly generated one (in fact the variance of the randomly generated network is the higher of the two).

The scale-free nature of real networks is reflected by the "power law" distribution of their degrees. The expression for a power-law distribution is

P(degree = d) = c * d^a

so the logarithmic transformation, log P = log c + a*log d, looks like a straight line. Below is the log-transformed degree distribution of the system.

Other than at degree 2, there seems to be a linear trend. The parameters c and a of the power-law model can be taken as the estimates from the best line fitting Fig 2.

Based on the system's degree distribution, the closest power-law distribution has coefficients c=0.24, a=-0.66. Overlaying the degree distribution plots of the system, the sparse random network and a scale-free network, I get

It looks like the current system's structure is closer to a random network. In Fig 2, degree 2 is a peculiar point: it is an outlier with respect to the known scale-free pattern of many low-degree nodes and few high-degree nodes. If I ignore this outlier for a moment, the new power-law coefficients are c=3.11, a=-1.98 (a sketch of this fitting step is shown below). The updated distributions:
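One way (a sketch, not necessarily the exact procedure used) to estimate c and a is a straight-line fit on the log-log degree distribution; the degree vector below is read off the table above.

```r
deg  <- c(10, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4,
          3, 3, 3, 3, 3, 3, 3, 3, 3, 2)
freq <- table(deg) / length(deg)              # empirical P(degree = d)
d    <- as.numeric(names(freq))

fit_all <- lm(log(as.numeric(freq)) ~ log(d))                   # with the degree-2 point
fit_no2 <- lm(log(as.numeric(freq[d != 2])) ~ log(d[d != 2]))   # ignoring the outlier

# back-transform: P(d) = c * d^a  with  log c = intercept,  a = slope
exp(coef(fit_all)[1]); coef(fit_all)[2]
exp(coef(fit_no2)[1]); coef(fit_no2)[2]
```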

Now the match to the scale-free distribution has improved considerably. I can quantify the resemblance of the system to a random network or to a scale-free network through a likelihood estimate.

Likelihood of the system being a random network versus a scale-free network

Ok so I have 2 models,

  • The scale-free model - a characteristic of several observed real networks, whose degree distribution follows a power law, and
  • a random network model - the degree distribution of a sparse random network, approximated by a Poisson distribution.

Let's call these models M1 and M2 respectively. Now I want to understand which of the models in the set {M1, M2} explains my observed system better. It is possible that the observed system follows a model different from these two. And even if this model set were comprehensive, there could be other uncertainties in how the system was built, for example some species may be under-represented and some over-represented.

Mathematically speaking, I am looking for P(M1 | obs) and P(M2 | obs): from the observations I am trying to ascertain which model best represents their intricacies. This is an inverse problem. Inverse problems are not always perfectly solvable, for much the same reason that I only aim to find the better model and not the perfect one. So I need a way to convert this inverse problem into a forward problem. Bayes' theorem comes to the rescue here, which states that

P(A | B) = P(B | A) * P(A) / P(B)

So in this case I can convert my inverse problem of finding P(Mi | obs) into

P(Mi | obs) ∝ P(obs | Mi) * P(Mi)

P(obs | Mi) answers the question: what is the chance of observing the known data when the system follows model Mi? Finding it is straightforward: since I have the probability density function of each model, I can calculate the probability of the observed degrees under that distribution.

P(Mi) is the probability of model Mi in general. This could come from prior knowledge or be identified through expert judgement; it is called the "prior distribution". Since I do not have such knowledge, I will assume both models are equally likely, so P(M1) = P(M2).

The likelihood of an event can be naively understood as the probability of its occurrence, though there is a subtle difference (this is nicely explained here). Here, the log-likelihood that the network's degrees follow a known distribution is

LL = ∑ log P(degree = d | model), summed over the observed degrees d

The probabilities are in [0,1], where the logarithm is negative and monotonically increasing. This means the model with the lower magnitude of log-likelihood is the better match (think log(0) -> -Inf and log(1) = 0).

Distribution                                      Log likelihood (LL), approx
Poisson distribution (sparse random network)      -56
Power law distribution (scale-free network)       -64
Poisson distribution (ignoring outlier)           -54
Power law distribution (ignoring outlier)         -48
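A sketch (not the exact script) of how such log-likelihoods can be computed: sum the log-probability of every observed degree under each model. The power-law coefficients are the outlier-ignoring fit quoted above.

```r
deg    <- c(10, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4,
            3, 3, 3, 3, 3, 3, 3, 3, 3, 2)
lambda <- mean(deg)                    # Poisson model (sparse random network)
c_hat  <- 3.11; a_hat <- -1.98         # fitted power law (outlier ignored)

ll_poisson  <- sum(dpois(deg, lambda, log = TRUE))
ll_powerlaw <- sum(log(c_hat * deg^a_hat))

c(Poisson = ll_poisson, PowerLaw = ll_powerlaw)
```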

Conclusion

The LL values reflect my visual classification of the system's structure. Ignoring the outlier shifted the system's affinity from a random network to a scale-free network (a known characteristic of real networks). Further, ignoring the outlier improved the likelihood under the random network only a little (abs(δLL) ≈ 2) compared with the improvement under the scale-free model (abs(δLL) > 15). This outlying degree belongs to the IRins species, and the oddity hints at a possible under-representation of this species in the model.

Libraries/Tools used

R programming language, with the igraph and network libraries in R

The complete code is shared in a Github repo.

P.S

This post grew out of my interest in understanding observable phenomena through mathematical concepts, but the knowledge and guidance came from the following sources and, in general, the internet.

Tuesday, 11 December 2018


Introduction

In this post I will use a Bayesian approach to analyse a dataset containing the responses of students to a test. The full code and the dataset can be found on Github. Bayesian statistics is built on a philosophy different from classical/frequentist statistics. As a consequence, instead of observing just one outcome, we look at multiple possibilities and identify the probability of occurrence of each of them. The strength of Bayesian analysis lies in the basic questions we can ask and the simplicity with which the answers can be communicated. We will see such an example in what follows.
The underlying assumption in the classical way is that "the phenomenon under test is fixed and the uncertainty observed is due to experimental and measurement errors". In this approach,
  • We start with defining a Null hypothesis and an Alternate hypothesis
  • Then we choose an appropriate test to find the p-value
  • Based on the decided significance level, α, we can either reject null or fail to reject null
The last two outcomes are usually read as "the null hypothesis is wrong and the alternate hypothesis is correct" and "the null hypothesis is correct and so the alternate hypothesis is wrong", respectively. Real-world instances may not be so black and white because of the huge number of factors affecting the case (refer to chaotic systems).
In the Bayesian approach, the underlying phenomenon is treated as uncertain, and this uncertainty reduces as we observe more information. The steps taken to perform a Bayesian analysis are
  • Define a prior belief for the parameters under study
  • Find the posterior distribution of the parameters based on the observed evidence
The posterior is estimated following Bayes' rule:
P(Parameters | Evidence) ∝ P(Evidence | Parameters) * P(Parameters)
P(Parameters) is the prior distribution
P(Parameters | Evidence) is the posterior distribution of the parameters conditioned on observing the evidence
P(Evidence | Parameters) is the likelihood of observing the data given the parameters
The prior distribution can be informative, taken from a previous analysis or from a good understanding of the system under study. In cases where there is no clear understanding of the system, the prior can be non-informative, like a uniform distribution.

Description of the dataset

In this example we will consider a dataset containing the responses of students to a test with multiple-choice questions (each with a single correct answer). A subset of the dataset is shown here. Each row holds the responses given by one candidate to the question in each column. The last column, "PointsReceived", is the candidate's final score as calculated by the online judge.
The correct answer to each question and the points allotted to it are shown below.
First, let's quickly check whether the scores calculated by the judge are correct.
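A sketch of this check; the column names, answer key and points vector below are placeholders, the real ones are in the dataset on Github.

```r
responses <- data.frame(Q1 = c("A", "B"), Q2 = c("C", "C"), Q3 = c("D", "A"),
                        PointsReceived = c(6, 2))
answer_key <- c(Q1 = "A", Q2 = "C", Q3 = "D")
points     <- c(Q1 = 1,   Q2 = 2,   Q3 = 3)

# recompute each candidate's score from the answer key and compare with the judge
computed <- apply(responses[names(answer_key)], 1,
                  function(r) sum(points[r == answer_key]))
all(computed == responses$PointsReceived)   # TRUE if the judge's scores are consistent
```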
The scores calculated look correct, so we can move on to the analysis.

All the questions in this test have a single deterministic answer, so a candidate's response to a question is either correct or incorrect, i.e. a Bernoulli random variable. Assuming that all candidates are equally capable of answering all questions, the number of correct responses in each row is a sample from a binomial experiment.

Based on the allotted points, each question can be categorized as C1 (1 pt), C2 (2 pts) or C3 (3 pts). This categorization was made by the question setter by assigning unequal weights to the questions.

Defining Priors

Let's say a 1-pt question can be solved correctly by 80% of the candidates. Since all candidates are assumed equally capable, this means that when posed this question, a candidate answers it correctly 80% of the time. Similarly, a 2-pt question can be expected to be solved correctly by 60% and a 3-pt question by 40% of the applicants. Mathematically,
  • E(solving correctly | 1pt question) = 0.8
  • E(solving correctly | 2pt question) = 0.6
  • E(solving correctly | 3pt question) = 0.4
A point to note is that E(solving correctly | 1pt question) = 0.8, not P(solving correctly | 1pt question) = 0.8. This means there are possibilities (in the multiverse, if you like) where a 1-pt question is solved correctly with a probability higher or lower than 0.8. This is a distinguishing feature of the Bayesian way compared with classical methods.
The distribution of the random variable P(solving a 1-pt question correctly), and of the other two, can be taken as a beta distribution. A beta-distributed variable has range [0,1], which matches our case, and the beta distribution is the conjugate prior for a binomial variable (ref. Wikipedia).

Two hyperparameters define the shape of a beta distribution, namely α and β:
  • mean = α/(α + β)
  • concentration = α + β
The mean sets the expected value, and the concentration controls the uncertainty in the parameter: the larger the concentration, the more peaked the distribution.

Analysis and Inferences

Fixing the concentration to 50 and using the mean values defined above, we can calculate the shape parameters. From the observed evidence the posterior hyperparameter values can then be found (a sketch of this update follows). The prior and posterior parameter values are summarised below (in the format α, β).
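A minimal sketch of turning (mean, concentration) into Beta shape parameters and applying the Beta-Binomial conjugate update; the counts of correct answers used here are placeholders, not the actual evidence from the dataset.

```r
prior_shapes <- function(mean, concentration) {
  c(alpha = mean * concentration, beta = (1 - mean) * concentration)
}

prior_1pt <- prior_shapes(0.8, 50)   # alpha = 40, beta = 10
prior_2pt <- prior_shapes(0.6, 50)   # alpha = 30, beta = 20
prior_3pt <- prior_shapes(0.4, 50)   # alpha = 20, beta = 30

# Beta-Binomial conjugacy: add observed correct / incorrect attempts to the shapes
update_beta <- function(prior, correct, attempts) {
  c(alpha = prior[["alpha"]] + correct,
    beta  = prior[["beta"]]  + attempts - correct)
}

update_beta(prior_1pt, correct = 130, attempts = 150)   # hypothetical evidence
```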
Based on these parameters, samples can be taken from the beta-distribution. The probability density function for probability of solving a question correctly from prior and posterior are shown below.
An interesting observation from the plots is that the spread (dispersion) of the posterior is much smaller than that of the initial belief. This is consistent with the starting assumption that all applicants are equally capable, so the uncertainty in the probability of answering correctly is small.

If we now say that any question solved by 80% or more of the candidates is easy, by 40-80% is of medium difficulty, and by less than 40% is hard, then by taking samples from the posterior distribution we can easily visualise the categorization of each question in a bar graph, as shown below (a sketch of this step follows).
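A sketch of the categorization step: draw samples from a question's posterior Beta and count how often it falls in each difficulty band (the posterior shape parameters below are placeholders).

```r
set.seed(42)
samples <- rbeta(10000, shape1 = 170, shape2 = 30)   # hypothetical posterior

category <- cut(samples, breaks = c(0, 0.4, 0.8, 1),
                labels = c("hard", "medium", "easy"))
prop.table(table(category))    # shares that can be shown as the bar graph
```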

Conclusion

In this post I took a Bayesian approach to analysing a dataset containing the responses of applicants to a test. The advantage of Bayesian analysis is that it gives an idea of the alternative scenarios and their associated chances of occurring. The posterior estimates obtained can be used as the prior for the next dataset of similarly capable candidates tested on the same questions.

References

While my interest in running the analysis is personal, I had to learn the mathematical concepts somewhere, and the following references have motivated, inspired and taught me.