In this blog post and accompanying video I will explain how to perform some basic encoding and visualisation of survey data using R. The data and R scripts used in this example post are available to download via the Harkive GitHub repository.
I hope to show how survey data can be relatively easily visualised using R in order to help deliver potentially useful insights. In the example image below, generated using the scripts and data provided here, responses to questions about the importance of Cost when choosing regularly used formats for listening to music are plotted against the importance of Convenience.
What is perhaps unsurprising is that the majority of the people in the random sample consider both to be important. Perhaps more interesting, however, and certainly in terms of selecting subjects for further analysis, are those who appear to consider neither as important. This leads us to ask why that might be the case, and which factors are important to those people. The broader point is that these are observations and questions that a relatively quickly constructed visualisation can afford us – it would be extremely difficult to make them by simply looking at the original data set.
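A plot of this kind can be sketched in a few lines of ggplot2. This is a minimal illustration, not the actual Harkive script: the data frame and column names (`survey`, `cost_importance`, `convenience_importance`) are hypothetical stand-ins for the numeric variables created by the scripts. Because Likert responses take only a handful of numeric values, many points overlap exactly, so jittering is used to make the density of responses visible.

```r
# Sketch of a Cost vs Convenience importance plot (hypothetical column names).
library(ggplot2)

# Assume 'survey' is a data frame with numeric Likert responses (e.g. 1-7)
# in columns 'cost_importance' and 'convenience_importance'.
ggplot(survey, aes(x = cost_importance, y = convenience_importance)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5) +  # spread overlapping points
  labs(x = "Importance of Cost",
       y = "Importance of Convenience",
       title = "Cost vs Convenience when choosing listening formats")
```

Points clustered in the top-right corner would correspond to respondents who rate both factors as important; any points near the bottom-left are the "neither is important" group discussed above.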
The intention of publishing this data/code is thus twofold:
1) As I have benefitted hugely in my own learning from the culture of sharing data and code that surrounds R, by sharing this code and data I hope to provide some assistance to researchers seeking to analyse their own survey data but who are, like me, new to R. The scripts provided here should be relatively easy to adapt to a different data set, if that is your aim.
2) I also share the data I have gathered in the hope that more experienced researchers and/or those with an interest in popular music may develop their own analyses and share their results/code with us. There are responses to 90 different questions related to popular music listening within the data set – offering the possibility for a huge number of ways in which the data can be analysed. It would be very interesting to see what others come up with using this data. Please do feel free to adapt, create and share your thoughts.
Harkive has been collecting stories from people online about their music listening experiences on single days in July since 2013. Stories are collected from various social media channels (Twitter, Facebook, Instagram, Tumblr) and also via email and through a form on the project website. In order to assist with the analysis of these stories, a Music Listening Survey was devised in 2016 that aimed to gather additional information from participants. This survey was open to both participants and non-participants of the story gathering element of Harkive. Analysis of the data gathered by the survey is intended to:
- provide insight into the experiences of popular music listeners
- contextualise the individual text-based stories gathered by Harkive
- enable the sub-setting of the entire corpus of stories based on observations gleaned from the survey data
The survey is still live and responses are still being collected, so if you would like to participate you can do so by visiting http://www.harkive.org/h16-survey . It would also be hugely appreciated if you would share this link with other music lovers.
The Harkive Survey was created using the JotForm service and then hosted on the Harkive site. After providing their informed consent and some demographic information, participants were asked whether they had participated in the story gathering element of the Harkive Project (those who indicated that they had were then asked to provide further information about this). Participants were then asked to respond to 86 questions/statements regarding their music listening. Most of these required responses on Likert scales: for example, participants rated whether they Strongly Agreed or Strongly Disagreed with a statement along a 7-point scale.
Data was downloaded from JotForm in CSV format, and a sample of 100 anonymised responses is used here. The sample comprises 50 responses from people who indicated they had participated in the story gathering element, and a further 50 from respondents who indicated they had not. Beyond that, responses were selected at random.
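Drawing a stratified sample of this kind is straightforward in base R. The sketch below is an assumption about how it might be done, not the actual Harkive code: the filename and the `participated` column (with values "Yes"/"No") are hypothetical.

```r
# Sketch of drawing 50 participant + 50 non-participant responses at random.
# Filename and the 'participated' column are hypothetical assumptions.
set.seed(42)  # fix the seed so the sample is reproducible

all_responses <- read.csv("harkive_survey_full.csv", stringsAsFactors = FALSE)

yes_rows <- all_responses[all_responses$participated == "Yes", ]
no_rows  <- all_responses[all_responses$participated == "No", ]

# 50 random rows from each group, combined into one 100-row sample
sample_100 <- rbind(
  yes_rows[sample(nrow(yes_rows), 50), ],
  no_rows[sample(nrow(no_rows), 50), ]
)
```

Setting the seed matters here: without it, each run of the script would select a different 100 respondents, making the published sample impossible to reproduce.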
The R Scripts
There are two R scripts that accompany this post. The first takes the ‘raw’ data downloaded from JotForm and converts text-based responses into numeric values. Using these newly created numeric responses, additional variables are created that provide summaries of sections. The second script uses the numeric values created in the first to create some basic visualisations that enable some initial analysis of the data within the survey. By following through both scripts you will be able to replicate the image displayed at the top of this post, and by adapting the code provided in the script you will be able to visualise and explore the rest of the dataset.
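The core move in the first script, converting text-based Likert responses into numeric values, can be sketched as follows. The response labels and the column name `q_cost_importance` are hypothetical examples, not the actual Harkive variable names; the pattern, mapping an ordered set of labels onto the integers 1 to 7 via a factor, is the general technique.

```r
# Sketch of recoding text Likert responses to numbers (hypothetical names).

# The labels must be listed in their intended order, lowest to highest.
likert_levels <- c("Strongly Disagree", "Disagree", "Somewhat Disagree",
                   "Neither Agree nor Disagree",
                   "Somewhat Agree", "Agree", "Strongly Agree")

responses <- read.csv("harkive_sample.csv", stringsAsFactors = FALSE)

# factor() matches each response against the ordered levels;
# as.integer() then yields 1 for "Strongly Disagree" up to 7 for "Strongly Agree".
responses$q_cost_importance_num <- as.integer(
  factor(responses$q_cost_importance, levels = likert_levels)
)
```

Any response that does not exactly match one of the listed levels (a typo, or an unanticipated label in the export) becomes `NA`, which is worth checking for before building the summary variables.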
What You Will Need
In order to replicate the work in this post you will need:
- R and R Studio installed on your computer (Help videos for: Windows and Mac)
- The data from the Harkive GitHub repository (Intro video for GitHub)
For new users this may sound daunting, but the video links above are very helpful and should have you up and running in a few moments.
Here is a 30-minute screencast in which I walk through the two R scripts provided. Hopefully those of you who are new to R will find it useful in terms of adapting the scripts, and those of you interested in performing your own analysis of the data will get a feel for what it contains.
I am by no means an expert when it comes to creating R scripts, or in Statistical Analysis. More experienced R users may indeed find the way in which I have structured these scripts to be cumbersome and inefficient, and there are probably mistakes in my descriptions and scripts. I am still very much in the early stages of learning R, and as such these scripts are presented in much the same way that I am attempting to develop my skills: through a process of trial and error, one that is iterative and exploratory.
What is useful about that, from the point of view of my own research, is that it has necessarily forced me to break the analysis down into discrete component parts. I have learned one step at a time. This not only makes the research replicable for me (once I learn, for example, how to visualise one set of data, I can quickly apply the same approach to another) and potentially for others (fellow researchers using their own survey data may be able to build on this work) but, perhaps more importantly, it reveals more clearly the assumptions inherent in each step.
The act of assigning numeric values to Likert Items, such as I have here, is a case in point. This is heavy with a number of assumptions: that Person A meant the same thing as Person B when both said “Strongly Agree”; that the distance between “Often” and “Very Often” is exactly the same as that between “Rarely” and “Never”, and so on. Further to that, once data of this kind is visualised in a coherent form (as I have attempted to do here), then inferences and insights are ‘revealed’ more starkly. As we will see in later posts, where I will take the insights revealed from survey data and apply them to the corpus of Harkive stories, the research process itself thus becomes a creative act just as much as it is a logical, empirical one. One should always remember, then, to consider both the provenance of the ‘raw data’ (a questionable term, as Gitelman argues) and the process through which that data has ultimately led to insight. As numerous scholars working in this area have argued, reflexivity is as crucial an element as the technical skills required when undertaking work of this kind.
The beauty, then, of an R script and other computational analytical processes, is not only in the efficiency and logic they afford, but in the way they isolate and force us to confront the assumptions inherent in our work.