Chapter 16 Curse of dimensionality demo
As the dimensionality of your data increases, your data become increasingly sparse in the space that they occupy. As a result, the distances between objects in your dataset become less meaningful as dimensionality increases. This dynamic can make some data mining techniques less effective (e.g., clustering or outlier detection).
In this activity, we will demonstrate the effect of dimensionality on distances in data.
16.1 Dependencies
We’ll use the rdist
package to calculate distances between points.
Look over the documentation for rdist
here: https://github.com/blasern/rdist.
The code chunk below will check if you have the rdist package. If not, it will install it.
<- require(rdist) rdist_available
## Loading required package: rdist
if (!rdist_available) {
install.packages("rdist")
}library(rdist)
We’ll also use the following libraries (if you don’t have these, you’ll need to install them):
library(tidyverse)
16.2 Generate example data
<- 500
num_points <- 0
min_range <- 1
max_range <- 2:50
dimension_sizes <- c("euclidean", "angular") dist_metrics
Generate 500 points in an N-dimensional box (where each dimension of the box ranges from 0 and 1).
<- data.frame(num_dimensions=c())
data # For each dimensionality, randomly generate "num_points" points in our rectangle.
for (num_dimensions in dimension_sizes) {
# Generate random points in our N-dimensional box.
<- matrix(
points_matrix data = runif(num_points * num_dimensions),
nrow = num_points,
ncol = num_dimensions
)# Calculate the pairwise distances between points in our dataset.
# Run ?rdist to see how the rdist function works
for (cur_metric in dist_metrics) {
<- c(rdist(points_matrix, metric=cur_metric))
distances <- data.frame(
dim_data max_dist = c(max(distances)),
min_dist = c(min(distances)),
metric = c(cur_metric),
num_dimensions = num_dimensions
)<- rbind(data, dim_data)
data
} }
We can get a rough idea of how meaningful the distances in our space are by calculating
(max_dist-min_dist)/min_dist
.
ggplot(
data,aes(
x=num_dimensions,
y=((max_dist-min_dist)/min_dist),
color=metric,
fill=metric
)+
) geom_point() +
scale_y_continuous(
name="log10((max_dist-min_dist)/min_dist)",
trans="log10"
+
) labs(
x="Number of dimensions",
color="Metric",
fill="Metric"
)