Clustering With Many Variables

My sample data comes from a large data set in which each participant is scored under multiple conditions.


I am trying to use k-means clustering to split the participants into groups. Is there a good way to do this, given that the condition is not numeric?

thanks!

D Jay

3 Answers

@Anony has the right idea. You actually do have numeric data: there is (evidently) a c1-score and a c2-score for each participant. So you need to convert your data from 'long' format (all scores in a single column, Score, with a second column, Condition, differentiating them) to 'wide' format (scores under different conditions in separate columns). Then you can run k-means clustering on the scores to group the participants.

Here is how you would do that in R, using a slightly larger example to demonstrate the clusters.
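A minimal sketch of the reshape-then-cluster step, assuming made-up data: the 20 participants, the score distributions, and the choice of k = 2 are all hypothetical, and only the column names (Participant, Condition, Score) follow the question. It uses base R's `reshape()` and `kmeans()`.

```r
# Hypothetical long-format data: 20 participants, two conditions (c1, c2).
set.seed(1)
n <- 20
group <- rep(c("low", "high"), each = n / 2)  # hidden structure, illustration only
long <- data.frame(
  Participant = rep(sprintf("p%02d", 1:n), times = 2),
  Condition   = rep(c("c1", "c2"), each = n),
  Score       = c(rnorm(n, mean = ifelse(group == "low", 10, 30), sd = 2),
                  rnorm(n, mean = ifelse(group == "low", 15, 40), sd = 2))
)

# Long -> wide: one row per participant, one column per condition.
wide <- reshape(long, idvar = "Participant", timevar = "Condition",
                direction = "wide")
names(wide) <- sub("^Score\\.", "", names(wide))  # Score.c1 -> c1

# k-means on the numeric score columns; k = 2 is an assumption for this toy data.
km <- kmeans(wide[, c("c1", "c2")], centers = 2, nstart = 25)
wide$cluster <- factor(km$cluster)
head(wide)
```

With real data the number of clusters is not known in advance, so `centers` would need to be chosen by some other criterion.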

Now we can plot the scores, showing how the participants cluster.
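One way such a plot might be drawn with base graphics is sketched here. The wide-format scores are regenerated as made-up values so the snippet runs on its own; the axis labels and k = 2 are assumptions.

```r
# Hypothetical wide-format scores (one row per participant); values are made up.
set.seed(2)
wide <- data.frame(
  c1 = c(rnorm(10, 10, 2), rnorm(10, 30, 2)),
  c2 = c(rnorm(10, 15, 2), rnorm(10, 40, 2))
)
km <- kmeans(wide, centers = 2, nstart = 25)

# Scatter plot of the two condition scores, coloured by assigned cluster,
# with the cluster centres marked as crosses.
plot(wide$c1, wide$c2, col = km$cluster, pch = 19,
     xlab = "c1 score", ylab = "c2 score", main = "Participants by cluster")
points(km$centers, pch = 4, cex = 2, lwd = 2)
```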

EDIT: Response to OP's comment.

OP wants to know what to do if there is more than one score for a participant/condition. The answer depends on why there are multiple scores. If the replicates are random and have a central tendency, then taking the mean is probably justified, although in theory participants with more replicates should be weighted more heavily.

On the other hand, suppose these are test scores. Then generally (but not always), the scores go up with multiple sittings. So these scores would not be random: there is a trend. In that case it might be more meaningful to take the most recent score.

As a third example, if the scores are used to make a decision based on some policy (such as with the SAT, where most colleges use the highest score), then the most appropriate aggregating function might be max, not mean.

Finally, it might be the case that the number of replicates is in fact an important distinguishing characteristic. In that case you would include not just the scores but also the number of replicates for each participant/condition when clustering. This is relevant in certain kinds of standardized testing under NCLB, where students take the test over and over again until they pass.
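The four options above (mean, most recent, max, and replicate count) can be computed side by side before deciding which columns to cluster on. This is a sketch on invented data; the `Attempt` column, which records the order of the replicates, is a hypothetical addition that the question does not mention.

```r
# Hypothetical long-format data with repeated scores; Attempt gives the order.
long <- data.frame(
  Participant = c("p1", "p1", "p1", "p2", "p2", "p1", "p2"),
  Condition   = c("c1", "c1", "c1", "c1", "c1", "c2", "c2"),
  Attempt     = c(1, 2, 3, 1, 2, 1, 1),
  Score       = c(50, 60, 70, 80, 75, 40, 90)
)
# Sort so that the last row within each group is the most recent attempt.
long <- long[order(long$Participant, long$Condition, long$Attempt), ]

# One summary row per participant/condition, with each candidate aggregate.
agg <- aggregate(Score ~ Participant + Condition, data = long,
                 FUN = function(s) c(mean = mean(s), last = s[length(s)],
                                     max = max(s), n = length(s)))
agg <- do.call(data.frame, agg)  # flatten the matrix column into Score.mean, ...
agg
```

Whichever aggregate is chosen would then be pivoted to wide format and passed to `kmeans()`; including `Score.n` as an extra column corresponds to the last case, where the number of replicates is itself informative.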

BTW: This type of question (the one in your comment) definitely belongs on https://stats.stackexchange.com/.


jlhoward


You should pivot your data, so that:

each participant is a row
each condition is a column
the scores are your data

Try the reshape2 package.
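A small sketch of that pivot, assuming the column names Participant, Condition and Score and using invented values. It calls `reshape2::dcast()` when the package is installed and otherwise falls back to base R's `reshape()`, so it runs either way.

```r
# Hypothetical long-format data: three participants, two conditions.
long <- data.frame(
  Participant = c("p1", "p1", "p2", "p2", "p3", "p3"),
  Condition   = c("c1", "c2", "c1", "c2", "c1", "c2"),
  Score       = c(12, 18, 31, 42, 11, 16)
)

if (requireNamespace("reshape2", quietly = TRUE)) {
  # Rows = participants, columns = conditions, cells = scores.
  wide <- reshape2::dcast(long, Participant ~ Condition, value.var = "Score")
} else {
  # Base-R fallback when reshape2 is not installed.
  wide <- reshape(long, idvar = "Participant", timevar = "Condition",
                  direction = "wide")
  names(wide) <- sub("^Score\\.", "", names(wide))
}
wide
```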

Anony-Mousse


You have 3 variables which will be used to split your data into groups. Two of them are categorical, which might cause a problem. You can use k-means to split your data into groups, but you will need to create dummy variables for your categorical data (Condition and Participant) and scale your continuous variable, Score.

Using categorical data in k-means is not optimal because k-means cannot handle it well. The dummies will be highly correlated, which might cause the algorithm to put too much weight on them and produce suboptimal results.

For this reason, you can use other techniques, such as hierarchical clustering, or run a PCA on your data (to obtain continuous, uncorrelated variables) and then fit an ordinary k-means model on the PC scores.
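A rough sketch of the dummy-coding, PCA and k-means route described in this answer, together with a hierarchical alternative. The data are invented, and keeping two principal components and cutting at k = 2 are arbitrary assumptions made only for illustration.

```r
# Hypothetical long-format data: 6 participants scored under 2 conditions.
set.seed(3)
long <- data.frame(
  Participant = rep(paste0("p", 1:6), times = 2),
  Condition   = rep(c("c1", "c2"), each = 6),
  Score       = c(rnorm(6, 20, 5), rnorm(6, 35, 5))
)

# Dummy-code the categorical columns; Score stays numeric.
# "- 1" drops the intercept so no constant column enters the PCA.
X <- model.matrix(~ Participant + Condition + Score - 1, data = long)

# PCA on standardised columns, then k-means on the leading PC scores.
pca <- prcomp(X, scale. = TRUE)
pcs <- pca$x[, 1:2]              # keeping 2 components is an assumption
km  <- kmeans(pcs, centers = 2, nstart = 25)

# Alternative: hierarchical clustering on the scaled design matrix.
hc        <- hclust(dist(scale(X)))
hc_groups <- cutree(hc, k = 2)

table(kmeans = km$cluster, hclust = hc_groups)
```

Note that this clusters rows of the long table (one participant/condition pair per row) rather than participants, which is a different unit of analysis from the pivot-based answers.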

These links give good answers: link1, link2

Hope that helps!


LyzandeR
