
A great way to increase the relevance of your survey analysis and build a compelling story is to segment the respondents into categories that demonstrate maturity or adoption or even success within the subject of the survey. Traditionally the way to do this is to take individual questions and segment the responses within this category – so for example, segmenting the respondents into financial performance quartile groups or the category groups within a single or aggregated maturity question. Although useful, these only scratch the surface and don’t necessarily reflect the differences between respondents. Often you get quite similar answers across respondents, so the groupings from single question analysis are disproportionate with gigantic groups and tiny groups.
A way to look more holistically at all parameters that divide the respondents is clustering – in this case, k-means. Clustering is more dynamic and can use multiple variables to group across multiple questions giving more insight into what divides the respondents and provides a richer, more nuanced set of groupings.
In a recent survey about technology investments, I divided the respondents into clusters based on the responses to 6 questions focused on maturity, investment level, perceived business impact, and the perceived value of investments. It identified four groups, two that invested well, were mature, and seeing value from their investments. The difference between the two groups was that one group didn’t feel in as strong a position as its industry competitors. So the difference was really a lack of confidence or could be described as such. Another group seemed ambivalent towards technology – invested less, had low maturity and impact, which was no real surprise! What was interesting was the last group. The last group had acceptable levels of investment, were relatively mature but had little impact. As you can imagine, this group is worth understanding and worth analyzing their responses to other questions to determine the disconnect between investment and impact. They also might be an excellent target for consulting!
So how does it work?
I can’t do it too much justice in a blog – but there are plenty of good textbooks on the topic that can explain how it works in detail. But the simple explanation: k-means is an unsupervised machine learning technique (although I suspect you could argue that it is a data mining technique as well) that divides data into k-number of groups. It is a useful technique for discovering patterns in data between variables (or questions for survey data). You assign each data point to a single cluster (there is no overlap between groups) based on the distance to the center point (centroid) of the cluster. You then take the mean value of all the data in each cluster – this is then used to update the centroid. Then you reassign each datapoint to their nearest centroid and repeat until the centroids don’t change.
So how do you do it?
- Firstly, you need to decide what parameters you want to look at within the questions – so are you looking at levels of investment, the maturity of the current application, what is it in the demographics (or firmographics), or the questions that could distinguish the groups? You have to remember that variables for clustering should be numeric or have a way to be represented numerically. So true or false questions won’t work individually, and scale response questions (for example, 1 to 5) are best aggregated in my experience. So in the above example, a series of 1 to 5 questions across technology types were aggregated to form a single numeric variable.
- Secondly, prepare the variables. For example, aggregating scale response questions but could also mean normalizing or standardizing the variables. Normalizing is rescaling the ranges of the variable to a fixed range like 0 to 1, and standardizing converts the range to its standard deviation distribution. Although I have found it works effectively without doing this, it is worth trying both, especially where there are extreme ranges in the data. You also have to deal with missing data – so a respondent hasn’t selected a response for a particular category – this can impact the aggregation and penalize a respondent unfairly, particularly with a scale response like a 1 to 5. So for this example, I would use the mode of responses for that question. The mode is the most common selection and then aggregate. It is essential to do this in a separate data set, so the actual survey responses aren’t changed – it may be important to know that a respondent didn’t answer a question. Usually, the data used to run the cluster would be a column containing the respondent id and a column for each of the variables.
- Finally, run the data. There are lots of tools to run this; there are tools build into SPSS and Tableau – although I found them tricky to pull out and identify the individual respondents’ cluster group. BTW I am sure it is possible, but I couldn’t find a satisfactory method. But both tools can be good at checking the algorithm you use as they both show the clusters, although SPSS has the edge for me as it was simple and straightforward showing the numbers. To run the clusters themselves, there are lots of tools, excel plugins, and you can use python with the appropriate ML libraries. Personally I use Python just because it is the easiest way if you can code and you have a great deal of control.
It doesn’t always work with every set of data, and you need to select the correct k number for the type of data – it may not be convenient for your story to have seven types of the respondent, but it is worth the effort as it can discover some worthwhile groups and understand responses and decision making in a whole new way. Even if for the story, you use the macro groupings, the micro grouping can be fascinating.
If this all seems great but too complicated please get in touch and I can suggest ways to apply this to your existing data. Check out our survey analysis services here.