@@ -1,12 +1,12 @@
 ---
 title: "Clustering with easystats"
-output: 
+output:
   rmarkdown::html_vignette:
 vignette: >
   %\VignetteIndexEntry{Clustering with easystats}
   \usepackage[utf8]{inputenc}
   %\VignetteEngine{knitr::rmarkdown}
-editor_options: 
+editor_options:
   chunk_output_type: console
 ---
 
@@ -64,7 +64,7 @@ Clustering traditionally refers to the identification of groups of observations
 
 There are many clustering algorithms (see [this overview](https://scikit-learn.org/stable/modules/clustering.html)), but they can be grouped into two categories: **supervised** and **unsupervised** techniques. In **supervised** techniques, you have to explicitly specify [**how many clusters**](https://easystats.github.io/parameters/reference/n_clusters.html) you want to extract. **Unsupervised** techniques, on the other hand, will estimate this number as part of their algorithm. Note that no clustering method is inherently superior or inferior; each comes with its own set of limitations and benefits.
 
-As an example in the tutorial below, we will use the **iris** dataset, for which we know that there are 3 "real" clusters (the 3 Species of flowers). Let's first start with visualizing the 3 "real" clusters on a 2D space of the variables created through PCA. 
+As an example in the tutorial below, we will use the **iris** dataset, for which we know that there are 3 "real" clusters (the 3 Species of flowers). Let's first start with visualizing the 3 "real" clusters on a 2D space of the variables created through PCA.
 
 
 ```{r}
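
The supervised techniques mentioned above need the number of clusters chosen up front; that number can itself be estimated beforehand with the `n_clusters()` function linked in the paragraph. A minimal sketch, assuming `data` holds the four numeric columns of `iris` (the `see` package, needed for the optional plot, is an extra assumption):

```r
library(parameters)

data <- iris[1:4] # numeric measurements only, dropping the Species column

# Let several methods "vote" on the optimal number of clusters
n <- n_clusters(data)
n
# plot(n)  # optional; assumes the 'see' package is installed
```
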
@@ -161,14 +161,11 @@ Hierarchical K-Means, as its name suggest, is essentially a combination of K-Mea
 rez_hkmeans <- cluster_analysis(data, n = 3, method = "hkmeans")
 
 rez_hkmeans # Show results
-
-# Visualize
-plot(rez_hkmeans) + theme_modern() # Visualize
 ```
 
 ### K-Medoids (PAM)
 
-Clustering around "medoids", instead of "centroid", is considered to be a more robust version of K-means. See `cluster::pam()` for more information. 
+Clustering around "medoids", instead of "centroid", is considered to be a more robust version of K-means. See `cluster::pam()` for more information.
 
 ```{r}
 rez_pam <- cluster_analysis(data, n = 3, method = "pam")
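
Since the text points readers to `cluster::pam()` for details, here is a rough sketch of what a direct call looks like; treat it as an illustration of the medoid idea under the same `data <- iris[1:4]` assumption, not as what `cluster_analysis()` does internally:

```r
library(cluster)

data <- iris[1:4]

# Partition into 3 clusters around medoids, i.e. actual observations
# rather than averaged centroids
rez_pam_raw <- pam(scale(data), k = 3)

rez_pam_raw$medoids                          # the representative observations
table(rez_pam_raw$clustering, iris$Species)  # compare with the known species
```
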
@@ -203,7 +200,7 @@ plot(rez_hclust2) + theme_modern() # Visualize
 
 ### DBSCAN
 
-Although the DBSCAN method is quite powerful to identify clusters, it is highly dependent on its parameters, namely, `eps` and the `min_size`. Regarding the latter, the minimum size of any cluster is set by default to `0.1` (i.e., 10\% of rows), which is appropriate to avoid having too small clusters. 
+Although the DBSCAN method is quite powerful to identify clusters, it is highly dependent on its parameters, namely, `eps` and the `min_size`. Regarding the latter, the minimum size of any cluster is set by default to `0.1` (i.e., 10\% of rows), which is appropriate to avoid having too small clusters.
 
 The "optimal" **eps** value can be estimated using the [`n_clusters_dbscan()`](https://easystats.github.io/parameters/reference/cluster_analysis.html) function:
 
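
To make the `eps` discussion concrete, a minimal sketch of the intended workflow follows, again assuming `data` holds the numeric `iris` columns; the `dbscan_eps` argument name and the example value `1.5` are assumptions to verify against `?cluster_analysis`:

```r
library(parameters)

data <- iris[1:4]

# Estimate candidate eps values from the k-nearest-neighbour distances
n_clusters_dbscan(data)

# Feed a chosen eps back into the DBSCAN method
# (argument name and value are assumptions; check ?cluster_analysis)
rez_dbscan <- cluster_analysis(data, method = "dbscan", dbscan_eps = 1.5)
rez_dbscan
```
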