22 Multidimensional scaling

Multidimensional scaling (MDS) is another approach to ordination. The purpose is similar to the goals of PCA, but the methods are a bit different. With MDS, the starting point is a similarity, or distance matrix, providing measurements between all pairs of data. Many different functions can be used to compute this similarity matrix. Commonly distance functions are used. If quantitative data have very skewed distributions, or there is no good way to interpret the distance between points, the data may be converted to ranks (1, 2, 3, …) before the similarity measure is computed; then the method is called non-metric multidimensional scaling (NMDS) in recognition of the fact that the information being presented is not related to a distance.

22.0.1 Cities on a map

An easy to understand example of MDS starts with a matrix giving the distance between each pair of cities in a set. The MDS visualization then scatters these points across the plane, reconstructing the geographic separation of the points.

The dataset below is built from a database of world cities, selected to have only cities of more than 5 million plus some cities in Canada. I included a maximum of 4 cities per country so that one one region of the globe was too strongly concentrated in points. You can get the data from simplemaps and make your own subset.

cities <- read_csv("static/selected_cities.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   city = col_character(),
##   city_ascii = col_character(),
##   lat = col_double(),
##   lng = col_double(),
##   country = col_character(),
##   iso2 = col_character(),
##   iso3 = col_character(),
##   admin_name = col_character(),
##   capital = col_character(),
##   population = col_double(),
##   id = col_double()
## )
ggplot(cities, aes(x=lng, y=lat)) + 
  geom_point() +
  geom_text_repel(aes(label=city), size=2) 
mymap <- get_stamenmap(bbox = c(left = -150, bottom = -60, right = 175, top = 75), zoom=3, maptype = "terrain-background")
ggmap(mymap) + 
  geom_point(data = cities, aes(x=lng, y=lat)) +
  geom_text_repel(data = cities, aes(x =lng, label=city), size=2) 

Now we calculate the distance between all pairs of cities. I’ve shown two ways to do this – using great circle distance along the surface of the Earth, and using a distance that projects the surface of the Earth onto a plane first. If the goal is to reconstruct the map shown above, use this projected distance. If you would like instead to show great-circle distances between cities in the plane, use the geodesic distance.

city_distance <- dist(cities %>% dplyr::select(lng, lat) %>% as.matrix()) #  distance in equirectangular projection
  # https://en.wikipedia.org/wiki/Equirectangular_projection

Perform MDS (principal coordinates analysis). The direction of the two main axes could be reversed relative to the original map; I’ve reversed the x and y axes to match our customary view of the world.

mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) %>%
  ggplot(aes(x = V1, y = V2)) + 
  geom_point() + 
  geom_text_repel(aes(label = city), size=2) +
  scale_x_reverse() + scale_y_reverse() +
  labs(title = "Map reconstructed from equirectangular distance matrix",
       x = "", y = "")
city_distance <- rdist.earth(cities %>% dplyr::select(lng, lat) %>% as.matrix(), 
                             miles = FALSE) # geodesic distance
mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) %>%
  ggplot(aes(x = V1, y = V2)) + 
  geom_point() + 
  geom_text_repel(aes(label = city), size=2) +
  scale_x_reverse() + scale_y_reverse() +
  labs(title = "Map reconstructed from great-circle distance matrix",
       x = "", y = "")

22.1 NMDS example

Morse code is a way of sending text messages using just two symbols: dot and dash, which was designed to be transmitted by a person clicking a key to make sounds and the receiver listening to the sounds and translating the message as it arrives.

The dataset below was created as part of an experiment to measure the rate at which patterns of sounds for one symbol were confused with sounds for a different symbol. The matrix is symmetric dissimilarity measure. Rows and columns vary across the 36 symbols tested (26 letters and 10 numeric digits). All diagonals are 0s. The off diagonal values are large if the sounds are not likely to be confuesed. Smaller dissimilarities correspond to symbols that are more likely to be confused. A small excerpt of the table is shown.

morse.dist <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-dist.txt',
                          row.names = 1, head = T)
names(morse.dist) <- rownames(morse.dist)
morse.dist[1:5,1:5] %>% kable() %>% row_spec(0, monospace=TRUE) %>% 
  column_spec(1, monospace=TRUE) %>%
  kable_styling(full_width = FALSE)
.- -… -.-. -.. .
.- 0 167 169 159 180
-… 167 0 96 79 163
-.-. 169 96 0 141 166
-.. 159 79 141 0 172
. 180 163 166 172 0

Use the function metaMDS from the package vegan to perform the NMDS ordination. The ordiplot function shows the objects from the dissimilarity matrix on a two-dimensional “ordination plot”. Points that are closer together are more likely to be confused (they are less dissimilar).

NMDS <- metaMDS(morse.dist, trace=0)
NMDS
## 
## Call:
## metaMDS(comm = morse.dist, trace = 0) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     morse.dist 
## Distance: user supplied 
## 
## Dimensions: 2 
## Stress:     0.1800255 
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation 
## Species: scores missing
ordiplot(NMDS, cex = 1.5, type = 't')
# stressplot(NMDS)

This next plot compares the distances in the original data to the distances as represented by the ordination. If the ordination represents the original data well, this will be close to a straight line. There is a point for every pairwise comparison. In this case there are 36 * 35 / 2 = 630 distances to compare. I have labeled some of the points with the largest distortion imposed by the ordination.

symbols <- rownames(NMDS$points)
NMDS[c("dist", "dhat", "iidx", "jidx")] %>% as_tibble() %>% 
  mutate(comparison = paste0(symbols[iidx], "/", symbols[jidx]),
         comparison2 = case_when( abs(dist-dhat) < 40 ~ "", TRUE ~ comparison)) %>%
  ggplot(aes(dist, dhat)) + 
  geom_point(size=0.5, color="blue") + 
  geom_text_repel(aes(label = comparison2)) +
  theme_bw()

You can also access the points from the ordination and make the “ordiplot” using ggplot. The relative position of points on the plat is the only thing that matters – any rotation or translation of the plot contains the same information.

NMDS$points %>% as_tibble(rownames = "Symbol") %>%
  ggplot(aes(x = MDS1, y = MDS2 )) +
  geom_text(aes(label=Symbol)) + 
  theme_bw()
  # geom_point(size = 0.5) + 
  # geom_text_repel(aes(label = Symbol))

You might want to understand the NMDS analysis in terms of some properties of the Morse Code signals. This table has the length (1-5) and the ratio of short to long (dots and dashes) signals in each symbol. The envfit function then finds the direction each of these variables increases most rapidly across the ordination plane. The summary reports the direction and the correlation between these variables and the position on the ordination plot. The arrow for “length” follows the pattern in the ordination very well, while the arrow for the ratio of short to long only accounts for about half of the variation in the ordination plot.

morse.attr <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-attr.txt',
                         row.names = 1, head = T)
ef <- envfit(NMDS, morse.attr)
ef
## 
## ***VECTORS
## 
##                     NMDS1    NMDS2     r2 Pr(>r)    
## length            0.88381 -0.46784 0.9287  0.001 ***
## ratio.short.long -0.39383 -0.91918 0.5067  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

The arrows show the direction of most rapid average increase of each variable.

ordiplot(NMDS, cex = 1.5, type = 't')
## species scores not available
plot(ef)

Here is a method to reproduce this plot using ggplot. First use str(ef) to examine the structure of the result from envfit. Then plot the points and arrows using ggplot.

arrows1 <- ef$vectors$arrows %>% as_tibble(rownames = "code")
as_tibble(NMDS$points, rownames = "code") %>%
  ggplot(aes(x = MDS1, y = MDS2, label = code)) + 
  geom_text() +
  geom_text(data = arrows1, aes(x = 25*NMDS1, y = 25*NMDS2)) +
  geom_segment(data = arrows1, aes(x = 20*NMDS1, y = 20*NMDS2, xend = 0, yend = 0)) +
  theme_bw()