Links:

https://github.com/gordicaleksa/pytorch-GAT/blob/main/The%20Annotated%20GAT%20(Cora).ipynb
https://gordicaleksa.medium.com/how-to-get-started-with-graph-machine-learning-afa53f6f963a
http://blog.shriphani.com/2015/04/06/the-smallest-eigenvalues-of-a-graph-laplacian/

——————————————————————————

Spectral Clustering:

Construct matrix representation of graph
Compute eigenbasis of graph laplacian matrix. Map each point to a lower dimensional representation based on one or more eigenvectors. Denote second smallest eigenvalue and associated eigenvector.
Assign points to two or more clusters (based on magnitude of second smallest eigenvector values). Maybe partition by positive and negative entries?

Alternatively, for each node, create a length 3 feature vector consisting of 2nd smallest, 3rd smallest and 4th smallest eigenvector for that node, then perform k-Means on this representation to cluster nodes.

Sum of squared values of eigenvector should equal 1.

https://www.youtube.com/watch?v=uxsDKhZHDcc

Properties of Graph Laplacian:

The zero-th eigenvalue tells us whether the graph is connected or not.
In particular, if a graph has k-connected components, then eigenvalue 0 has multiplicity k (i.e. k distinct non-trivial eigenvectors).
2nd smallest eigenvlaue:
If λ1=0, clearly eigenvalue 0 has multiplicity greater than 1. Thus the graph is not connected.

GraphSAGE: Sample and Aggregate. Made spatial methods applicable to very large graphs.

——————————————————————————

For CNNs, images have a precise notion of order. Graphs don’t enjoy that nice property and both the number of neighbors as well as the order of neighbors may vary.

How do you define a kernel for a graph? The kernel size can’t be 3x3 because sometimes a node will have 2 neighbors and sometimes 2000.

Researchers proposed solutions:

Spectral methods:

Compute the eigenbasis (generalization of Fourier basis) of the graph Laplacian (Degree matrix - the adjacency matrix with added self-connections) in order to understand the properties of the graph.

Spatial methods:

Spatial methods directly extend the idea of 2D convolution to the graph. Similar to convolving nearby pixels directly connected to each pixel on an image, spatial graph convolution learns features from a node’s neighbors.

GCNs

GCN is kind of a hybrid: It isn’t a spectral method, since no eigen-decomposition is explicitly computed, but it is motivated by spectral methods (via the Chebyshev polynomial expansion).

GCN learns the features from its neighboring nodes, but weights are the same for each neighbor.

GCN is transductive:

Instead of finding a general rule for classifying future examples, GCNs classify only the unlabeled objects, exploiting information from labeled instances. Full graph GCN is not inductive because it only is trained on the whole graph and makes predictions for nodes that can be seen during training.

GAT

It turns out that combining the idea of attention with GCNs was a good idea.

GAT makes GCN more expressive by allowing the network to learn the neighborhood aggregation coefficients (GCN uses constants instead of learnable weights).

GAT extends GCN by allowing for different neighbor weightings based on attention.

Basic idea of spatial message passing:

Have feature vectors associated with each of a node’s neighbors.
Transform features to a lower dimensional representation (linear projection layer)
Somehow aggregate the embeddings (perhaps weighted by attention coeffs)
Update the feature vector of the current node by combining its own embeddings with the aggregated neighborhood representation.

Diff between Transductive and Inductive

Transductive - you have a single graph (like Cora) you split some nodes (and not graphs) into train/val/test training sets. While you’re training you’ll be using only the labels from your training nodes. BUT. During the forward prop, by the nature of how spatial GNNs work, you’ll be aggregating the feature vectors from your neighbors and some of them may belong to val or even test sets! The main point is - you ARE NOT using their label information but you ARE using the structural information and their features.

Inductive - you’re probably much more familiar with this one if you come from the computer vision or NLP background. You have a set of training graphs, a separate set of val graphs and of course a separate set of test graphs.

Macro F1 score:

The macro-averaged F1 score (or macro F1 score) is computed by taking unweighted mean of all the per-class F1 scores. This method treats all classes equally regardless of their support values.

Micro F1 score:

counts the sums of the True Positives (TP), False Negatives (FN), and False Positives (FP). We first sum the respective TP, FP, and FN values across all classes and then plug them into the F1 equation to get our micro F1 score.