CNNs have had a lot of success in computer vision. Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton entered AlexNet in the 2012 ImageNet challenge, and its top-5 error was roughly 10 percentage points lower than the next best model's.

Researchers were looking for ways to generalize convolutions to the graph domain.

Images have a precise notion of order: pixels sit on a regular grid. Graphs don’t exhibit that nice property, and both the number of neighbors and the order of neighbors may vary.

How do you define a kernel for a graph? The kernel size can’t be 3x3 because sometimes a node will have 2 neighbors and sometimes 2000.

Researchers proposed solutions:

Spectral methods:

Everything starts with the graph Laplacian, L = D − A (the degree matrix minus the adjacency matrix).
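A minimal sketch of that definition, using a made-up 4-node toy graph (the adjacency matrix here is just for illustration):

```python
import numpy as np

# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3 (adjacency chosen for illustration).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix: row sums of A on the diagonal
L = D - A                   # combinatorial graph Laplacian
```

For an undirected graph, L is symmetric and each of its rows sums to zero.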

Compute the eigenbasis (a generalization of the Fourier basis) of the graph Laplacian in order to understand the properties of the graph. The eigenvectors are treated as signals over the nodes.
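Concretely, the eigenbasis can be computed with a symmetric eigensolver. A sketch on a 4-node path graph (toy example, not from any particular paper):

```python
import numpy as np

# Path graph 0-1-2-3 and its Laplacian.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# eigh is appropriate because L is symmetric; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# For a connected graph the smallest eigenvalue is 0 with a constant eigenvector;
# the eigenvectors play the role of Fourier modes ("frequencies") on the graph.
```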

Difficulty handling large graphs: spectral methods typically require computing the eigendecomposition of the graph Laplacian, which has a computational complexity of O(n^3) in the number of nodes.

Classical spectral methods also can’t generalize to new graphs: they require knowledge of the global structure of the entire graph.

Spatial methods:

Spatial methods directly extend the idea of 2D convolutions to the graph. Similar to convolving nearby pixels directly connected to each pixel on an image, spatial graph convolution learns features from a node’s neighbors.


GCN is kind of a hybrid: It isn’t a spectral method, since no eigen-decomposition is explicitly computed, but it is motivated by spectral methods (via the first order Chebyshev polynomial approximation).

GCN learns features from a node’s neighbors, but the aggregation weights are fixed, degree-based constants rather than being learned per neighbor.
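A sketch of a single GCN layer (the standard propagation rule from Kipf & Welling: add self-loops, symmetrically normalize by degree, then project). The graph, features, and weight matrix below are random, untrained stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy path graph 0-1-2; X and W are untrained, randomly initialized placeholders.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))   # 3 nodes, 4 input features
W = rng.normal(size=(4, 2))   # projects to 2 output features

A_hat = A + np.eye(3)                        # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)              # symmetric degree normalization
H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W  # one GCN layer, pre-nonlinearity
```

Note that the normalization coefficients depend only on node degrees, which is exactly the "same weight for each neighbor" limitation mentioned above.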

GCN is transductive:

GCN operates in a transductive setting, where the entire graph is available during both training and inference. This limits their ability to generalize to unseen nodes or graphs, since they are trained on a fixed graph structure and may struggle to adapt to new graph topologies.


It turns out that combining the idea of attention with GCNs was a good idea.

GAT makes GCN more expressive by allowing the network to learn the neighborhood aggregation coefficients (GCN uses fixed, degree-based constants instead of learnable weights). GAT extends GCN by allowing different neighbor weightings based on attention.
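A minimal single-head sketch of that idea, following the GAT recipe (LeakyReLU-scored attention over concatenated projected features, softmax-normalized per neighborhood). All parameters here are random, untrained placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

# Toy star graph (node 0 connected to 1 and 2); h, W, a are untrained placeholders.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
h = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=(4,))     # attention vector applied to [Wh_i || Wh_j]

Wh = h @ W
mask = A + np.eye(3)          # attend over neighbors plus self

# Unnormalized attention logits, -inf outside the neighborhood so softmax zeroes them out.
e = np.full((3, 3), -np.inf)
for i in range(3):
    for j in range(3):
        if mask[i, j] > 0:
            e[i, j] = leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))

alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # per-node softmax
h_out = alpha @ Wh            # attention-weighted neighborhood aggregation
```

The coefficients in `alpha` are what GAT learns, replacing GCN's fixed degree-based constants.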

Basic idea of spatial message passing:

What you would probably do if you were trying to solve this problem.

  1. Have feature vectors associated with each of a node’s neighbors.
  2. Transform features to a lower dimensional representation (linear projection layer)
  3. Somehow aggregate the embeddings (perhaps weighted by attention coeffs)
  4. Update the feature vector of the current node by combining its own embeddings with the aggregated neighborhood representation.
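The four steps above can be sketched as one function. Mean aggregation stands in for step 3 (an attention-weighted sum would slot in the same place); `W_neigh` and `W_self` are hypothetical parameter names:

```python
import numpy as np

def message_passing_step(X, A, W_neigh, W_self):
    """One generic spatial message-passing step (a sketch of steps 1-4 above)."""
    Z = X @ W_neigh                       # 2. linearly project neighbor features
    deg = A.sum(axis=1, keepdims=True)
    agg = (A @ Z) / np.maximum(deg, 1)    # 3. mean-aggregate neighbor embeddings
    return X @ W_self + agg               # 4. combine with the node's own embedding

# Usage on a random toy graph with untrained weights:
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
A = (rng.random((5, 5)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                    # make the graph undirected
W_neigh = rng.normal(size=(8, 16))
W_self = rng.normal(size=(8, 16))
H = message_passing_step(X, A, W_neigh, W_self)
```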

Difference between transductive and inductive

Transductive - you have a single graph (like Cora) and you split some nodes (not graphs) into train/val/test sets. While you’re training you’ll be using only the labels from your training nodes. BUT. During the forward prop, by the nature of how spatial GNNs work, you’ll be aggregating the feature vectors from your neighbors, and some of them may belong to the val or even test sets! The main point is - you ARE NOT using their label information, but you ARE using the structural information and their features.

Inductive - you’re probably much more familiar with this one if you come from a computer vision or NLP background. You have a set of training graphs, a separate set of val graphs and of course a separate set of test graphs.

Macro F1 score:

The macro-averaged F1 score (or macro F1 score) is computed by taking the unweighted mean of all the per-class F1 scores. This method treats all classes equally, regardless of their support.
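A sketch of that computation on made-up labels (the arrays are toy data, not from any dataset):

```python
import numpy as np

def f1_per_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class (returns 0 when there are no true positives)."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([0, 0, 0, 1, 1, 2])   # toy labels
y_pred = np.array([0, 0, 1, 1, 1, 1])
classes = [0, 1, 2]

# Macro F1: unweighted mean of per-class F1 scores, so rare class 2 counts as much as class 0.
macro_f1 = np.mean([f1_per_class(y_true, y_pred, c) for c in classes])
```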

Micro F1 score:

The micro-averaged F1 score counts the sums of the true positives (TP), false negatives (FN), and false positives (FP). We first sum the respective TP, FP, and FN values across all classes, and then plug those totals into the F1 equation to get our micro F1 score.
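The same toy labels as above make the contrast with macro F1 concrete:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 2])   # toy labels
y_pred = np.array([0, 0, 1, 1, 1, 1])
classes = [0, 1, 2]

# Sum TP, FP, FN across all classes first, then apply the F1 formula once.
tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)
fn = sum(np.sum((y_pred != c) & (y_true == c)) for c in classes)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)

# For single-label multiclass data, micro F1 reduces to plain accuracy,
# because every misclassification is counted once as an FP and once as an FN.
```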