Triadic Closure (Clustering)

Introduction

Triadic Closure is a measure of the tendency of edges in a graph to form triangles. It's a measure of the degree to which nodes in a graph tend to cluster together (wikipedia on clustering coefficents).

# python standard library
from fractions import Fraction

# pypi
import networkx
import seaborn
% matplotlib inline
seaborn.set_style("whitegrid")
sample_graph = networkx.Graph()
sample_graph.add_edges_from([(1, 2), (1, 3), (2, 4), (3, 5)])
networkx.draw_spring(sample_graph, with_labels=True)
triadic_closure.png

In this case we might say that the likelihood that the next edge will be between 1 and 4 or 1 and 5 is greater than the likelihood that it will form between 4 and 5 or 2 and 5.

2 Local Clustering Coefficient

The Local Clustering Coefficient is a measure of clustering for a single node. It is the number of pairs of a node's friends that are themselves friends divided by the total number of pairs of a node's friends. This can be interpreted as measuring how close the node's neighbors are to being a complete graph (wikipedia).

\begin{equation*} LCC = \frac{\textit{number of pairs of a node's friends that are friends (PTAF)}}{\textit{number of pairs of the node's friends (POF)}} \end{equation*}
graph = networkx.Graph()
graph.add_edges_from([("A", "K"), ("A", "B"), ("A", "C"),
                      ("B", "K"), ("B", "C"),
                      ("C", "E"), ("C", "F"),
                      ("D", "E"),
                      ("E", "F"), ("E", "H"),
                      ("F", "G"),
                      ("I", "J")])


                      .. ggcode:: ipython

networkx.draw_spring(graph, with_labels=True)
triadic_example.png

The number of pairs of friends can be calculated from the degree of the node.

\begin{equation*} POF = \frac{d(d-1)}{2} \end{equation*}

Looking at node C, it has degree four so the number of pairs of friends it has is \(\frac{4(3)}{2} = 6\). Looking at the graph you can see that there are two edges between the nodes connected to it - (A,B) and (E, F), so the clustering coefficient for node C is \(\frac{PTAF}{POF}=\frac{2}{6}\) which reduces to 1/3. We can double check this with networkx.

print(networkx.clustering(graph, "C"))
0.3333333333333333

If you don't pass in the node label to networkx.clustering the function will return a dictionary with all the clustering coefficients, which might be useful if you need to make multiple queries and have a large graph.

2.1 One Friend

If you look at nodes I and J, they don't have any pairs of friends, just one friend each. This puts a zero in the denominator of the clustering coefficient, making it undefined, but to make it mathematically useful it is given a 0 instead.

print(networkx.clustering(graph, "I"))
0.0

3 The Whole Network

There's two ways to calculate a clustering coefficient for the entire network. One is to take the average of all the local clustering coefficients, the other is to calculate the percentage of open triads (three nodes connected by two edges) that are triangles.

3.1 Averaging

This is what wikipedia calls the network average clustering coefficient.

coefficients = networkx.clustering(graph)
average = sum(coefficients.values())/len(coefficients)
print(average)
assert average == networkx.average_clustering(graph)
0.28787878787878785

3.2 Transitivity

This is also called the global clustering coefficient.

A triangle is a set of three nodes with three edges connecting them. An open triad is a set of three nodes with only two edges connecting them. Each triangle has three open triads embedded in it. Transivity is a measure of the percentage of open triads that are triangles.

This triangle:

triangle = networkx.Graph()
triangle.add_edges_from([("A", "B"), ("A", "C"), ("B", "C")])

networkx.draw_spring(triangle, with_labels=True)
tc_one.png

Contains these open triads.

one = networkx.Graph()
one.add_edges_from([("A", "B"), ("A", "C")])
networkx.draw(one, with_labels=True)
tc_a.png
two = networkx.Graph()
two.add_edges_from([("A", "B"), ("B", "C")])
networkx.draw(two, with_labels=True)
tc_b.png
three = networkx.Graph()
three.add_edges_from([("B", "C"), ("A", "C")])
three.add_edges_from([("B", "C"), ("A", "C")])
networkx.draw(three, with_labels=True)
tc_c.png

So the transitivity is three times the count of triangles in the graph divided by all the open triads in the graph.

\begin{equation*} transitivity = \frac{3 \times \|\textit{triangles}\|}{\|\textit{open triads}\|} \end{equation*}

Looking at our earlier example you can see that there are three triangles and thirteen open triads (to be honest I only found 10).

networkx.draw_spring(graph, with_labels=True)
triadic_example.png
transitivity = (3 * 3)/(3 * 3 + 13)
print(transitivity)
assert transitivity == networkx.transitivity(graph)
0.4090909090909091

4 Comparing Averaging and Transitivity

4.1 One High Degree Node

high_lcc = networkx.Graph()
left = tuple("AABCCDEEFGGH")
right = tuple("BIIDIIFIIHII")
high_lcc.add_edges_from(list(zip(left, right)))
networkx.draw_spring(high_lcc, with_labels=True)
high_average.png

If we look at this graph, the outer nodes all have a clustering coefficient of 1 (each has 1 pair of friends that are friends) while the center node has a coefficient of 1/7, since half the pairs don't have edges between them.

degree_i = 8
pairs_of_friends = Fraction(8 * 7, 2)
pairs_that_are_friends = Fraction(4, 1)
lcc = pairs_that_are_friends/pairs_of_friends
print(lcc)
1/7

Since there are so many nodes with a coefficient of 1, the average is high.

print(networkx.average_clustering(high_lcc))
0.9047619047619047

But there are many open triads so the transitivity will be low (transitivity weights nodes with large degree higher, but there's only one node with degree greater than 2).

print(networkx.transitivity(high_lcc))
0.3333333333333333

4.2 Many Open Pairs

outer_left = "ABDEGHJKMN"
inner_left = "PPPPQQQRRS"
outer_right = "BCEFHIKLNO"
inner_right = "QRSTRSTSTT"
left = tuple(outer_left + inner_left)
right = tuple(outer_right + inner_right)
low_average = networkx.Graph()
low_average.add_edges_from(list(zip(left, right)))
networkx.draw(low_average, with_labels=True)
low_average.png

Here the nodes P, Q, R, S, and T are completely connected (it's hard to see) but all the other nodes are open triads so the average will be low, but the transitivity will be high, because each of the P, Q, R, S, and T form triangles. This should be easier to see if they are plotted separately.

left = tuple(inner_left)
right = tuple(inner_right)
inner = networkx.Graph()
inner.add_edges_from(list(zip(left, right)))
networkx.draw(inner, with_labels=True)
pqrst.png

Here's the average clustering coefficient (for the complete graph, not the sub-graph I just made).

print(networkx.average_clustering(low_average))
0.25

And here's the transitivity.

print(networkx.transitivity(low_average))
0.8571428571428571

So which one is the right metric? I guess it just depends.