Topic Modelling tweets discussing “Screen Time”

The aim of this quick project is to understand people’s views towards screen time. Twitter is a platform where people can express their opinions or share news, making it a useful place to explore people’s viewpoints. By searching for tweets which contain the phrase “Screen Time”, we can analyse the language people use in those tweets. In particular, using topic modelling, we can map which topics come up when people discuss screen time.

To do this in R, we first need to make sure that the required packages are installed:
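The original package list is not reproduced here, so the following is only a sketch. The package names (tm, wordcloud, syuzhet, topicmodels, ldatuning) are an assumption about what this kind of analysis typically needs, and `missing_packages` is a hypothetical helper.

```r
# Assumed package set for this kind of analysis; swap in whatever you actually use.
pkgs <- c("tm", "wordcloud", "syuzhet", "topicmodels", "ldatuning")

# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install anything that is missing:
# install.packages(to_install)
```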

Then we need to collect the tweets. This can be done using R & RStudio:
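The collection code is not shown here. One possible sketch uses the twitteR package, which is an assumption about tooling rather than the method actually used. `collect_tweets` is a hypothetical wrapper, and the credentials are placeholders you would replace with your own. Twitter's API access rules have changed over time, so this call may no longer work as written.

```r
# Sketch only: needs the twitteR package and your own Twitter API credentials.
# collect_tweets() is a hypothetical wrapper, not part of any package.
collect_tweets <- function(query, n, consumer_key, consumer_secret,
                           access_token, access_secret) {
  # Authenticate with the Twitter API.
  twitteR::setup_twitter_oauth(consumer_key, consumer_secret,
                               access_token, access_secret)
  # Search for English-language tweets matching the query.
  results <- twitteR::searchTwitter(query, n = n, lang = "en")
  # Convert the list of status objects into a data frame.
  twitteR::twListToDF(results)
}

# Example call (placeholder credentials, not real ones):
# tweets_df <- collect_tweets("screen time", n = 2000,
#                             consumer_key = "YOUR_KEY",
#                             consumer_secret = "YOUR_SECRET",
#                             access_token = "YOUR_TOKEN",
#                             access_secret = "YOUR_TOKEN_SECRET")
```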

Even though I requested 2000 tweets, the resulting data contained 10,000 tweets. These tweets need some cleaning first: converting words to lower case, removing stop words and removing punctuation (apart from hashtags). Emojis will not be removed in the following analysis.
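The cleaning steps described above can be sketched in base R. The example tweets, the tiny stop word list and the `clean_tweets` helper are all made up for illustration. A real analysis would use a full stop word list, such as the one shipped with the tm package.

```r
# Toy input and a deliberately tiny stop word list (both illustrative).
tweets <- c("Too much Screen Time is BAD for kids! #parenting",
            "The new iOS update tracks your screen time... https://example.com")
stop_words <- c("the", "is", "for", "your", "too", "much", "a", "of", "and", "to")

# Hypothetical helper implementing the cleaning steps.
clean_tweets <- function(x, stop_words) {
  x <- tolower(x)                                      # lower case
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)      # drop URLs
  x <- gsub("(?!#)[[:punct:]]", " ", x, perl = TRUE)   # drop punctuation, keep '#'
  tokens <- strsplit(trimws(x), "[[:space:]]+")        # split into words
  # Remove stop words and glue each tweet back together.
  vapply(tokens,
         function(w) paste(w[!w %in% stop_words], collapse = " "),
         character(1))
}

cleaned <- clean_tweets(tweets, stop_words)
print(cleaned)
```

Because only ASCII punctuation is targeted, emojis pass through untouched, which matches the choice made above.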

Once the data is cleaned, you can start exploring the terms in the document, and their frequencies:
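As a minimal sketch of this step, term frequencies can be counted with base R's `table()`. The `cleaned` vector below is toy data standing in for the cleaned tweets.

```r
# Toy stand-in for the cleaned tweets.
cleaned <- c("screen time bad kids",
             "new ios update tracks screen time",
             "kids screen time limits")

# Split every tweet into words and count how often each word occurs.
tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))
term_freq <- sort(table(tokens), decreasing = TRUE)

# Show the ten most frequent terms (fewer if the vocabulary is smaller).
print(head(term_freq, 10))
```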

For example, here is a list of the top 10 words:

At this point, it is useful to visualise the data to understand its structure.

Let's see the frequencies of the top 10 words:
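One way to draw such a chart is base R's `barplot()`. This is a sketch on toy data. The original figure may have been produced with a different plotting package.

```r
# Toy stand-in for the cleaned tweets.
cleaned <- c("screen time bad kids",
             "new ios update tracks screen time",
             "kids screen time limits")
tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))

# Keep the ten most frequent words.
top_words <- head(sort(table(tokens), decreasing = TRUE), 10)

# Bar chart of word frequencies; las = 2 rotates the labels so they fit.
bar_mids <- barplot(top_words, las = 2,
                    ylab = "Frequency", main = "Most frequent words")
```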

You can also plot a histogram to explore the frequencies of particular word lengths:
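A sketch of such a histogram in base R, again on toy data. It counts every word occurrence rather than each distinct word once, which is an assumption about what the original plot showed.

```r
# Toy stand-in for the cleaned tweets.
cleaned <- c("screen time bad kids",
             "new ios update tracks screen time",
             "kids screen time limits")
tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))

# Number of characters in each word occurrence.
word_lengths <- nchar(tokens)

# One bin per word length (breaks sit halfway between integers).
h <- hist(word_lengths,
          breaks = seq(0.5, max(word_lengths) + 0.5, by = 1),
          xlab = "Word length (characters)",
          main = "Distribution of word lengths")
```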

Word clouds are a common way to visualise the frequencies of words in a document. The word cloud below visualises the top 100 most mentioned words:

You can reduce this down to the top 10 most mentioned words:

Word clouds can be adjusted to show off other features of the data. For example, longest words:
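The word clouds in this section could be produced with the wordcloud package, which is an assumption about the tooling. The sketch below uses toy data and only draws the clouds if the package is installed. For the longest-words cloud it sizes each word by its character count, which is one possible reading of how that figure was made.

```r
# Toy stand-in for the cleaned tweets.
cleaned <- c("screen time bad kids",
             "new ios update tracks screen time",
             "kids screen time limits")
tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))

word_freq <- sort(table(tokens), decreasing = TRUE)
unique_words <- names(word_freq)
word_len <- nchar(unique_words)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # word placement is random, so fix the seed
  # Cloud sized by frequency; max.words caps how many words are drawn.
  wordcloud::wordcloud(words = unique_words, freq = as.numeric(word_freq),
                       min.freq = 1, max.words = 100, random.order = FALSE)
  # Cloud sized by word length, so the longest words appear largest.
  wordcloud::wordcloud(words = unique_words, freq = word_len,
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```

Setting `max.words = 10` restricts a cloud to the ten highest-ranked words.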

Next, it is of interest to understand if people talk about screen time in a positive or negative way. This can be explored through sentiment analysis. Note: Sentiment analysis does not interpret sarcasm well.
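To make the idea concrete, here is a toy lexicon-based scorer in base R. The tweets and the word lists are invented for illustration. A real analysis would use a published lexicon, for example through `get_sentiment()` in the syuzhet package.

```r
# Invented example tweets and a toy sentiment lexicon.
tweets <- c("i love limiting screen time it feels great",
            "too much screen time is bad and harmful",
            "screen time makes me sad and tired",
            "screen time report arrived today")
positive_words <- c("love", "great", "good", "happy")
negative_words <- c("bad", "harmful", "sad", "tired")

# Score = number of positive words minus number of negative words.
score_tweet <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

scores <- vapply(tweets, score_tweet, numeric(1), USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))

# How many tweets fall into each class.
sentiment_counts <- table(labels)
print(sentiment_counts)
```

A word-counting approach like this has no notion of context, which is why sarcasm is scored at face value.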

You can see that there are more negative comments about screen time than positive:

Next you can analyse the emotional tone using the NRC lexicon. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). It is interesting to see how different dictionaries categorise sentiment differently:
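One way to apply the NRC lexicon in R is `get_nrc_sentiment()` from the syuzhet package, which is an assumption about the tooling. The tweets below are invented, and the code only runs if syuzhet is installed. The Bing scores are included as a second dictionary for comparison.

```r
# Invented example tweets.
tweets <- c("i love limiting screen time it feels great",
            "too much screen time is bad and harmful",
            "screen time makes me sad and tired")

if (requireNamespace("syuzhet", quietly = TRUE)) {
  # One row per tweet, one column per emotion or sentiment (10 columns).
  nrc_scores <- syuzhet::get_nrc_sentiment(tweets)

  # Total count for each emotion across all tweets.
  emotion_totals <- colSums(nrc_scores)
  barplot(sort(emotion_totals, decreasing = TRUE), las = 2,
          ylab = "Word count", main = "NRC emotions")

  # Polarity from a different dictionary, to compare against NRC.
  bing_scores <- syuzhet::get_sentiment(tweets, method = "bing")
  nrc_polarity <- nrc_scores$positive - nrc_scores$negative
  print(data.frame(bing = bing_scores, nrc = nrc_polarity))
}
```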

Finally, we can explore the topics discussed in tweets mentioning “Screen Time”. To decide how many topics are in the tweets, we can build several LDA models using Gibbs sampling and compare them using the “Griffiths2004” metric. Read this paper to learn how this metric determines the optimum number of topics in a document: http://www.pnas.org/content/101/suppl_1/5228
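The “Griffiths2004” metric name matches the one used by `FindTopicsNumber()` in the ldatuning package, so the sketch below assumes that package, together with tm for the document-term matrix. The documents are toy data and the candidate range `2:4` is only for illustration. A real run would scan a much wider range.

```r
# Toy documents standing in for the cleaned tweets.
docs <- c("kids screen time limits school",
          "children screen time health sleep",
          "parents limit kids screen time",
          "ios update screen time feature",
          "apple ios screen time report",
          "iphone update tracks screen time",
          "riverdale character screen time episode",
          "actor screen time movie character")

needed <- c("tm", "topicmodels", "ldatuning")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  # Document-term matrix: one row per document, one column per term.
  dtm <- tm::DocumentTermMatrix(tm::VCorpus(tm::VectorSource(docs)))

  # Fit one Gibbs-sampled LDA model per candidate number of topics.
  tuning <- ldatuning::FindTopicsNumber(dtm,
                                        topics = 2:4,
                                        metrics = "Griffiths2004",
                                        method = "Gibbs",
                                        control = list(seed = 123),
                                        mc.cores = 1L,
                                        verbose = FALSE)

  # Plot the metric against the number of topics.
  ldatuning::FindTopicsNumber_plot(tuning)

  # Griffiths2004 is a metric to maximise.
  best_k <- tuning$topics[which.max(tuning$Griffiths2004)]
  print(best_k)
}
```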

By viewing the graph, you can see the optimum number of topics is 16:

Now that we know how many topics to examine, we can conduct LDA analysis to explore the underlying topics in our data:
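A minimal sketch of fitting the model, assuming the topicmodels and tm packages. The documents are toy data, so `k` is set to 2 here. For the real tweets you would pass the number of topics chosen above.

```r
# Toy documents standing in for the cleaned tweets.
docs <- c("kids screen time limits school",
          "children screen time health sleep",
          "parents limit kids screen time",
          "ios update screen time feature",
          "apple ios screen time report",
          "iphone update tracks screen time",
          "riverdale character screen time episode",
          "actor screen time movie character")

needed <- c("tm", "topicmodels")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  library(topicmodels)

  dtm <- tm::DocumentTermMatrix(tm::VCorpus(tm::VectorSource(docs)))

  k <- 2  # toy value; use the chosen number of topics for real data
  lda_model <- LDA(dtm, k = k, method = "Gibbs",
                   control = list(seed = 123))

  # Five most likely terms for each topic (a 5 x k character matrix).
  top_terms <- terms(lda_model, 5)
  print(top_terms)

  # Most likely topic for each document.
  doc_topics <- topics(lda_model)
  print(doc_topics)
}
```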

See below several graphs showing the probabilities of words being part of each topic:
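Graphs like these can be drawn from the per-topic word probabilities. In topicmodels these are available as `posterior(model)$terms`, a topics-by-terms matrix. The sketch below uses a small made-up matrix, `beta`, in that shape, so the numbers are purely illustrative.

```r
# Made-up topic-by-term probability matrix (each row sums to 1).
beta <- matrix(c(0.40, 0.30, 0.20, 0.05, 0.05,
                 0.05, 0.05, 0.10, 0.40, 0.40),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Topic 1", "Topic 2"),
                               c("screen", "time", "kids", "ios", "update")))

top_n <- 3  # how many words to show per topic

# One panel per topic, each showing its most probable words.
op <- par(mfrow = c(1, nrow(beta)))
top_by_topic <- lapply(rownames(beta), function(topic) {
  probs <- sort(beta[topic, ], decreasing = TRUE)[seq_len(top_n)]
  # rev() puts the most probable word at the top of the horizontal chart.
  barplot(rev(probs), horiz = TRUE, las = 1,
          main = topic, xlab = "P(word | topic)")
  probs
})
par(op)
names(top_by_topic) <- rownames(beta)
```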

You will see that naming the topics still requires human interpretation. For example:

Topic 1: Screen time of people & characters
Topic 7: New iOS screen time measure
Topic 8: The show Riverdale
Topic 12: Children’s screen time and health