Saturday, 7 March 2020

BIOMD0000000449 - Structure of Network

Introduction

Mathematical modelling is a representaion of a system under study and has a rich history of being used for several complex phenomenon like Biological systems, Financial systems. This approach of building a model replicating/approximating the actual system has the advantage that, once the confidence in the model has been established these can be perturbed in a safe in-silico environment and responses studied. These models also reveal several underlying Mathematical properties, that are hard to notice by working with the actual system, which can give deep insights to an explorer.

Complex systems are characterized with large number of interacting elements and the systemic response is best studied as a whole, different from a reductionist approach. A simple articulation of the response of a complex system is that it is greater than sum of its parts. Owing to their complex nature, models of these systems are a great playground to mine underlying Mathematical structures that are ubiquitous in natural phenomenon. In this post I will use a published model of a biological system to gain an understanding of its architecture.

For this exercise I have chosen BIOMD0000000449 model from EMBL database. The reason for choice is that it is publicly available and simple enough to start with. The EMBL portal has the links to the publication and other information. This is a non-linear dynamical model of the insulin signalling pathways that are crucial in the Type-2 Diabetic condition. There are 27 species in the model whose dynamics are written out as ODEs. Through its parameters, the model identifies the differences between the system in healthy and diseased states.

Analysis on the structure of interactions

A group of interacting species can be depicted as a graph/network. The nodes of the network can be the species and the edges/links between 2 species represents an interaction between them. An easy to visualize example is the road network. In one way, road junctions can be represented as nodes of the graph and the edges are the paths connecting the junctions. Depending on the direction of flow, graphs can be un-directed or directed. A road junction that allows movement in and out is un-directed whereas a junction that allows only one of inflow or outflow is directed.

Following diagram shows the structure of the current system. This is not a simple graph since this is a directed graph and has loops. The species connected via arrows are interacting (species at the tail of arrow affects the species at the head).


The loops around each node represents that the state of any species is affected by its current concentration. Since this is a common trait for all the species in this system, I will ignore these in the rest of this excercise.

Degree of a node is the number of edges connected to it. The size of node is shown proportional to the total degree of the node. The red colored nodes are of high degree and can be seen as high valued species. The indegree, outdegree and the total degree of all the species is shown below.

##        species deg_in deg_out deg_tot
## 1        IRS1p      4       6      10
## 2  PKB308p473p      4       5       9
## 3      mTORC1a      3       5       8
## 4         IRip      2       5       7
## 5      PKB308p      3       4       7
## 6     IRS1p307      2       4       6
## 7      PKB473p      2       4       6
## 8       AS160p      3       3       6
## 9         IRS1      3       2       5
## 10          Xp      2       3       5
## 11     mTORC2a      2       3       5
## 12        S6Kp      2       3       5
## 13          IR      2       2       4
## 14         IRp      2       2       4
## 15         PKB      3       1       4
## 16      mTORC1      3       1       4
## 17       AS160      3       1       4
## 18         IRi      2       1       3
## 19     IRS1307      2       1       3
## 20           X      2       1       3
## 21      mTORC2      2       1       3
## 22      GLUT4m      2       1       3
## 23       GLUT4      2       1       3
## 24         S6K      2       1       3
## 25          S6      2       1       3
## 26         S6p      2       1       3
## 27       IRins      1       1       2

The total degree of nodes in the system ranges from 2 to 10 with a mean degree of 4.74 and a standard deviation of 2.05.

Properties of this directed graph or digraph

In this digraph, there are 27 nodes and 64 edges. The average degree of node = sum of degree/27 = 4.74. In a digraph with N nodes, each node has a possibility of being connected to N-1 other nodes. So, in a completely connected digraph the degree of node = 2*(N-1), multiplied with 2 for the direction of connection, = 2*(27-1)=52 for this case. The average degree of a node in this network is far less than the completely connected graph. So I can say the species in the system are not highly interacting. Further, In a complete digraph, there would be 27P2=702 edges, very high compared to the current network of 64 edges. This means the system is about 9% connected (=100* 64/702), called as the "Density" of the network. Since the system has low density it is a sparse network.

Comparison with a random network

Consider a random digraph G(N,p), where N is the number of nodes(same as our system of 27 nodes) and p is the probability of link between any 2 nodes. Since a total of Np2 = N(N-1) edges are possible in the network, the probability of having L edges is

and the p that maximizes the occurrence of L edges is p* = L/Np2 i.e p = 64/27p2 = 0.09 = Density of the system. So, if the probability of having a directed link between 2 nodes = density of the network, then the observed realization of random network has the highest possibility. If p < p* then it is more likely to have fewer than L edges and if p > p* then it is more likely to have more than L edges.

An exciting difference in the structure of real life observed networks and a random network is the presence of "hubs" in real networks. Hubs are high degree nodes in the network. When hubs are in smaller number compared to the large number of low degree nodes, this gives rise to the Scale free nature of the network. As the name suggests, Scale free networks maintain similar structure while growing. This makes them tolerant to failure. A popular theory for the formation of scale free networks is the preferential engagement of new species to an existing highly connected species, simply understood as rich gets richer.

Just like the current digraph with 27 nodes and probability of link 0.09, G(27,0.09), I can imagine a randomly connected network. In this randomly generated digraph, the probability of a node to have a degree d is given by the binomial term

Here 2(N-1) is the maximum number of edges a node can have. For N=27, Prob(degree = d) = {52,d}*p^d*(1-p)^(52-d). For a sparse random network the Binomial degree distribution can be approximated as a Poisson distribution

where l = Expectation and variance is also l = 52*0.09 = 4.74

A side note: Is it odd that mean of degree of nodes in the system == mean degree in random network ?? As seen above, p = L/Np2 = L/(N*(N-1)). For a binomial random variable with distribution given by {n,r}*p^r (1-p)^(n-r), the expectation = n*p = 52*0.09 ~= 4.74. Mean degree in our system = sum of degree/Number of nodes = 2*L/N (Each edge would add 1 degree to nodes at either sides). n*p = 2*(N-1)*p = 2*(N-1)*L/(N*(N-1)) = 2 * L/N.

Variance of degree distribution of this random sparse network = 4.74, comparable to the variance observed in our network = 4.2. This close match is not what I anticipated. Because of the preferential attachment, I expected that a Scale free network would have a larger variance compared to a randomly generated network(In fact the variance in randomly generated network is higher).

The scale free nature of real network is reflected by the "power law" distribution of their degree. The Expression for power law distribution is

So the logarithmic transformation would look like a straight line. Below is the log transformed degree of distribution in the system.

Other than for degree 2, there seems to be a linear trend. c,a parameters of the power law model can be chosen as the estimates of the best line fitting the Fig 2.

Based on the system's degree distribution, the closest power law distribution is given by coefficients c=0.24, a=-0.66. When I overlay the degree distribution plots from the system, sparse random network and a Scale free network. I get

It looks like the current system's structure is closer to a random network. In the Fig 2, the degree 2 is a peculiar one. This one is an outlier to the known Scale free nature, which has a large number of low degree nodes and fewer high degree nodes. If I ignore this outlier for a moment, The new power law distribution coefficients are c=3.11, a=-1.98. The updated distributions,

Now the match to the Scale free distribution has improved much. I can quantify the resemblance of the system to either of random network or a scale free network through a likelihood estimate.

Likeliness of the system being a random network and a real network

Ok so I have 2 models,

  • The Scale free model - a characteristic of several observed real networks and follows power law for degree distribution and
  • a Random network model - degree distribution in a sparse random network is approximated to Poisson distribution

Let's call these models M1 and M2 respectively. Now I want to understand, in the set of models {M1,M2} which explains my observed system better. It is possible that the observed system is following a model different from these. Also, even if this model set is comprehensive there could be other uncertainties in the development of the system like some species might be under-represented and some over-represented.

Mathematically speaking I am looking for P(M1 | obs) and P(M2 | obs). From the observations I am trying to ascertain which model is best in representing its intricacies. This is an Inverse problem. Inverse problems are not always perfectly solvable, owing to the similar reasoning for why I only want to find the better model and not the perfect. So I need to find a way to convert this Inverse problem into a forward problem. Bayes theorem comes to the rescue here, which states that

So in this case I can convert my Inverse problem of finding P(Mi | obs) as

P(obs|Mi) is answering the question, what is the chance of observing the known data when the system follows the model Mi? and finding it is straight forward. Since I have the probability density function for the two models, I can calculate the probability of observed degree in this distribution.

P(Mi) is the probability of observing model Mi in general. This could come from a prior knowledge or identified through expert knowledge. This is called as "priori distribution". Since I don't know about this I will assume both model are equally likely. So P(M1)=P(M2).

Likelihood estimate of an event can be naively understood as a probability of occurrence of the event but there is a subtle difference. This is nicely explained here. Here, the likelihood that network's degree follow a known distribution is given by,

The probabilities are in [0,1], logarithm is negative and monotonically increasing in this range. This means the model with a lower magnitude of the log-likelihood is a better match(think log(0) -> -Inf and log(1)=0).

Distribution Log likelihood(LL), approx
Poisson Distribution (Sparse random network) -56
Power law distribution (Scale free network) -64
Poisson Distribution (ignoring outlier) -54
Power law distribution (ignoring outlier) -48

Conclusion

The LL values reflect my visual classification of the system's structure. Ignoring the outlier shifted my system's alliance from a random network to a Scale free network (a known characteristic of real networks). Further, by ignoring the outlier improved the likelihood to random network only little bit(abs(δ LL) < 2) compared to the improvement on likeliness to scale free network (abs(δ LL) > 15). This degree pertains to IRins species in the system and this oddity is hinting me at a possible under representation of this species.

Libraries/Tools used

R programming language, iGraph, network library in R

The complete code is shared on Github, Github repo

P.S

This post is spurred out of my interest in understanding observable phenomenon through Mathematical concepts, but the knowledge and guidance has been taken from following sources and in general the internet.

No comments:

Post a Comment