Visualizing Clustering with Heatmaps and ggplot2

Heatmaps are a compact way to visualize large datasets and to spot patterns or clusters within them. In this article, we will explore how to create heatmaps using ggplot2 and gplots in R, and apply these techniques to a real-world dataset of per-gene copy number (CN) values across cfDNA samples.

Step 1: Processing the Data

To create a heatmap, we first need to reshape our data into a matrix, with one row per gene and one column per sample. Ordering the rows and columns with hclust, a hierarchical clustering function, comes afterwards in Step 2.

We begin by loading the necessary libraries: ggplot2 and data.table. We then read in the data from a file using fread from the data.table library.

library(ggplot2)
library(data.table)

# Read in the data
CN_DT <- fread("/home/ywliao/project/Gengyan/ONCOCNV_result/ONCOCNV_all_result.txt", sep = "\t")

# Select the data for cfDNA1
dt <- CN_DT[cfDNATime == "cfDNA1"]

# Cast the data into a matrix form
wdt <- dcast(dt, Gene ~ Sample, value.var = "CN", fun.aggregate = mean)

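# Convert the sample columns to a numeric matrix and keep the gene names
# as row names, so the labels survive reordering and melting
data <- as.matrix(wdt[, -1, with = FALSE])
rownames(data) <- wdt$Gene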
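
As a minimal sketch of this reshaping step, the toy table below is cast to wide form and converted to a matrix in the same way. The gene names, sample names, and copy numbers are hypothetical and are not taken from the dataset used in this article.

```r
library(data.table)

# Hypothetical long-format input: one row per gene/sample measurement
toy <- data.table(
  Gene   = c("TP53", "TP53", "EGFR", "EGFR", "EGFR"),
  Sample = c("S1", "S2", "S1", "S2", "S2"),
  CN     = c(2, 1, 3, 4, 2)
)

# One row per gene, one column per sample; repeated gene/sample pairs
# (EGFR in S2 here) are averaged by fun.aggregate
toy_wide <- dcast(toy, Gene ~ Sample, value.var = "CN", fun.aggregate = mean)

# Drop the Gene column from the matrix and keep it as row names instead
toy_mat <- as.matrix(toy_wide[, -1, with = FALSE])
rownames(toy_mat) <- toy_wide$Gene
print(toy_mat)
```

Keeping the gene names as row names means they travel with the rows when the matrix is reordered later.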

Step 2: Clustering the Data

ggplot2 does not cluster the data for us, so we order the rows and columns ourselves using hclust. This involves computing a distance matrix from the data with dist, applying hclust to it, and reading off the resulting leaf order, first for the rows and then for the columns.

# Cluster the rows: compute a distance matrix, then apply average linkage
hc <- hclust(dist(data), method = "average")

# Row order given by the dendrogram
rowInd <- hc$order

# Transpose the matrix and cluster the columns the same way
hc <- hclust(dist(t(data)), method = "average")

# Column order given by the dendrogram
colInd <- hc$order

# Reorder the data according to the clusters
data <- data[rowInd, colInd]
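
To see what the order component does, here is a small sketch on a hypothetical 4 x 2 matrix in which rows a and c are similar and rows b and d are similar. The matrix and its names are made up for illustration.

```r
# Hypothetical matrix: rows a and c are close, rows b and d are close
m <- matrix(c(1, 10, 1.2, 10.5,
              2, 20, 2.1, 19.5),
            nrow = 4,
            dimnames = list(c("a", "b", "c", "d"), c("x", "y")))

hc_demo <- hclust(dist(m), method = "average")

# hc_demo$order is a permutation of the row indices; indexing with it
# places rows that merge early in the dendrogram next to each other
m_ordered <- m[hc_demo$order, ]
print(rownames(m_ordered))
```

After reordering, similar rows sit next to each other, which is what makes blocks of similar values visible in the heatmap.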

Step 3: Melting the Data

To create a heatmap using ggplot2, we need to melt the clustered matrix into long format: a data frame with one row per gene/sample pair and columns for the gene, the sample, and the value.

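# Melt the matrix; data.table's melt() targets data.tables, so call the
# matrix method from reshape2 explicitly (requires the reshape2 package)
dp <- reshape2::melt(data)

# Add column names
colnames(dp) <- c("Gene", "Sample", "Value")

# Fix the level order so ggplot2 keeps the clustered order on both axes
dp$Gene <- factor(dp$Gene, levels = unique(dp$Gene))
dp$Sample <- factor(dp$Sample, levels = unique(dp$Sample))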
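
If reshape2 is not available, a similar long table can be built with base R alone. The sketch below is an alternative under that assumption, not the method used above, and the small matrix in it is hypothetical. The factor levels are set explicitly so the matrix order, rather than alphabetical order, is what ggplot2 would later draw.

```r
# Hypothetical matrix whose row and column order is deliberately unsorted
m <- matrix(c(2, 1, 3, 2, 4, 2), nrow = 2,
            dimnames = list(Gene = c("g2", "g1"),
                            Sample = c("s3", "s1", "s2")))

# as.table() followed by as.data.frame() yields one row per matrix cell;
# the column names come from the names of the dimnames
long <- as.data.frame(as.table(m), responseName = "Value")

# Make the matrix order explicit in the factor levels
long$Gene   <- factor(long$Gene,   levels = rownames(m))
long$Sample <- factor(long$Sample, levels = colnames(m))
print(long)
```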

Step 4: Creating the Heatmap

Finally, we can create the heatmap using ggplot2. This involves mapping the Sample and Gene variables to the x and y axes, respectively, and using geom_tile to create the heatmap.

# Create the heatmap
p <- ggplot(dp, aes(Sample, Gene)) +
  geom_tile(aes(fill = as.factor(Value))) +
  theme(axis.text.x = element_text(angle = 90)) +
  guides(fill = guide_legend(title = "Copy Number")) +
  scale_fill_brewer(palette = 3)

# Display the plot
print(p)
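
Because the matrix holds per-gene means, Value can take many distinct values, and as.factor then creates one legend entry per value, while the sequential Brewer palettes provide at most 9 colours. One possible alternative, which is not the approach used above, is to keep Value numeric and map it to a continuous scale. The sketch below assumes a midpoint of 2 copies and uses hypothetical data standing in for dp.

```r
library(ggplot2)

# Hypothetical long-format data standing in for dp
toy_dp <- data.frame(
  Gene   = factor(rep(c("g1", "g2"), times = 3)),
  Sample = factor(rep(c("s1", "s2", "s3"), each = 2)),
  Value  = c(1, 2, 2.5, 3, 2, 4)
)

# Continuous fill: values below the midpoint shade towards green,
# values above it towards red (midpoint = 2 is an assumption)
p_cont <- ggplot(toy_dp, aes(Sample, Gene)) +
  geom_tile(aes(fill = Value)) +
  scale_fill_gradient2(low = "green", mid = "black", high = "red",
                       midpoint = 2, name = "Copy Number") +
  theme(axis.text.x = element_text(angle = 90))

print(p_cont)
```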

Alternative Method using gplots

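We can also create a heatmap using gplots. Its heatmap.2 function clusters the rows and columns itself, so we pass it the unclustered numeric matrix rather than the melted data frame. This involves extracting the gene names to use as row labels, creating a color ramp palette, and then calling heatmap.2 with our chosen clustering function.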

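library(gplots)

# Gene names, in the original (unclustered) row order of wdt
labrow <- unlist(wdt[, 1, with = FALSE])

# Create a color ramp palette
colorsChoice <- colorRampPalette(c("green", "black", "red"))

# heatmap.2 needs a numeric matrix, not the melted data frame dp, and the
# rows must match labrow, so rebuild the unclustered matrix from wdt
mat <- as.matrix(wdt[, -1, with = FALSE])

# Create the heatmap; 6 breaks define 5 bins, one per color
heatmap.2(mat, labRow = labrow, col = colorsChoice(5),
          breaks = c(1, 1.5, 2, 2.5, 3, 4),
          density.info = "histogram",
          hclustfun = function(c) hclust(c, method = "average"),
          keysize = 1.5, cexRow = 0.5, trace = "none")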
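
The relationship between breaks and colors in the heatmap.2 call can be illustrated with base R. The sketch below uses cut to approximate how values fall into the bins defined by the breaks. The copy-number values are hypothetical, and the exact handling of values on a bin boundary inside heatmap.2 is an assumption here.

```r
# Same palette and breaks as in the heatmap.2 call
colorsChoice <- colorRampPalette(c("green", "black", "red"))
cols   <- colorsChoice(5)
breaks <- c(1, 1.5, 2, 2.5, 3, 4)

# Hypothetical copy-number values
cn <- c(1.2, 2, 2.2, 3.5)

# n + 1 breaks define n bins, so 6 breaks pair with 5 colors;
# cut() returns the index of the bin each value falls into
bin <- cut(cn, breaks = breaks, include.lowest = TRUE, labels = FALSE)
print(data.frame(cn = cn, bin = bin, colour = cols[bin]))
```

With this palette, low copy numbers land in the green bins, values around 2 in the dark middle bins, and gains in the red bins.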

By following these steps, we can create a heatmap that effectively visualizes the clustering structure of our data.