The aim of this quick project is to understand people’s views towards screen time. Twitter is a platform where people can express their opinions or share news, making it a useful place to explore people’s viewpoints. By searching for tweets that contain the phrase “Screen Time”, we can analyse the language people use in those tweets. In particular, using topic modelling, we can map which topics come up when people discuss screen time.
To do this in R, first we need to ensure that we have the needed packages installed:
install.packages(c("SnowballC", "wordcloud", "wordcloud2", "ldatuning", "topicmodels",
                   "RColorBrewer", "tm", "stringr", "rtweet", "tidytext", "tidyr",
                   "dplyr", "ggplot2"))
Then we need to collect the tweets. This can be done using R and RStudio:
#Getting data from Twitter

#Required packages
library(rtweet)

#Twitter authentication variables
appname <- "YOUR APP NAME"
key <- "YOUR KEY"
secret <- "YOUR SECRET"
access_token <- "YOUR ACCESS TOKEN"
access_secret <- "YOUR ACCESS SECRET"

#Create a token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret,
  access_token = access_token,
  access_secret = access_secret,
  set_renv = TRUE)

#Check authentication by posting a tweet
#post_tweet("tweeting from r console")

#Search for tweets containing the phrase "Screen Time"
collectedtweets <- search_tweets(q = '"Screen Time"',
                                 n = 2000,
                                 lang = "en",
                                 include_rts = FALSE,
                                 retryonratelimit = TRUE)
Even though I requested 2,000 tweets, the resulting data contained 10,000 tweets. However, these tweets need to undergo some cleaning: for example, making the words lower case, removing stop words and removing punctuation (apart from hashtags). Emojis will not be removed in the following analysis.
#Isolate the data that's needed
textdata <- collectedtweets$text

#Load the 'stringr' library so we can use the str_replace_all function
library('stringr')

#Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-zA-Z0-9]*", "")

#Change all the text to lower case
textdata <- tolower(textdata)

#Replace double spaces with a single space
textdata <- str_replace_all(textdata, "  ", " ")

#Load the 'tm' library - start tokenisation
library('tm')

#Remove stopwords. "SMART" refers to the English stopword list from the SMART information retrieval system
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))

#Remove ampersands
textdata <- gsub("&", "", textdata)

#Remove punctuation apart from hashtags
textdata <- str_replace_all(textdata, pattern = "[^#[:^punct:]]", " ")
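Emojis are kept in this analysis. If you did want to strip them, one option is to drop all non-ASCII characters with base R's iconv; a minimal sketch (note it would also remove accented characters):

#Optional: strip emojis (and any other non-ASCII characters) - not applied in this analysis
textdata_noemoji <- iconv(textdata, from = "UTF-8", to = "ASCII", sub = "")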
Once the data is cleaned, you can start exploring the terms in the document, and their frequencies:
#Convert our character vector into a corpus
corpus <- Corpus(VectorSource(textdata))

#Make a term-document matrix
tdm <- TermDocumentMatrix(corpus)

#List all terms
listofterms <- Terms(tdm)

#List terms mentioned at least 10 times
Over10 <- findFreqTerms(tdm, lowfreq = 10)

#List terms mentioned at least 5 times
Over5 <- findFreqTerms(tdm, lowfreq = 5)

#Make a data frame of words and their frequencies
rowsums <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordfreqdf <- data.frame(word = names(rowsums), freq = rowsums)
Top10words <- head(wordfreqdf, 10)
For example, here is a list of the top 10 words:
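If you are following along in the console, you can also print the table directly (the words appear as row names because they came from the row sums of the term-document matrix):

#Print the 10 most frequent words and their counts
print(Top10words)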
At this point, it is useful to visualise the data to understand its structure.
#Load ggplot2
library(ggplot2)

#Change the theme of ggplot
theme_set(theme_bw())

#Plot a bar chart of the top 10 words
p <- ggplot(data = Top10words, aes(x = word, y = freq)) +
  geom_bar(stat = "identity")
p

#Make sure the stringr package is loaded here
wordfreqdf$wordlength <- str_length(wordfreqdf$word)

#Word length histogram
histogram <- ggplot(wordfreqdf, aes(x = wordlength)) +
  geom_histogram(colour = "black", fill = "white", binwidth = 1) +
  labs(x = "Word Length", y = "Count") +
  ggtitle("Histogram of Word Length") +
  theme(plot.title = element_text(hjust = 0.5))
histogram

#Make some word clouds
#Load some useful packages for word clouds
library("RColorBrewer")
library("wordcloud")
library("wordcloud2")

wordcloud2(wordfreqdf[1:100, ], size = 5)
wordcloud2(wordfreqdf[1:12, ], size = 5)

#Make a word cloud based on word length rather than word frequency
library(dplyr)
wc <- wordfreqdf %>%
  ungroup() %>%
  select(word, wordlength) %>%
  distinct() %>%
  arrange(desc(wordlength))

wordcloud2(wc[1:300, ], size = .15, minSize = .0005, ellipticity = .3,
           rotateRatio = 1, fontWeight = "bold")
Let’s see the frequencies of the top 10 words:
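One small tweak worth knowing: ggplot orders a character x-axis alphabetically, so if you would rather sort the bars by frequency, reorder the word factor. A minimal variation on the plot above:

#Same bar chart, but with words ordered by frequency
p_ordered <- ggplot(data = Top10words, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "word")
p_ordered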
You can also plot a histogram to explore the frequencies of particular word lengths:
Word clouds are a common way to visualise the frequencies of words in a document. The word cloud below visualises the top 100 most mentioned words:
You can reduce this down to just the dozen most mentioned words:
Word clouds can also be adjusted to show off other features of the data, for example the longest words:
Next, it is of interest to understand whether people talk about screen time in a positive or negative way. This can be explored through sentiment analysis. Note: sentiment analysis does not handle sarcasm well.
#Sentiment analysis

#Make a document-term matrix
datatm <- DocumentTermMatrix(corpus)

#Fix the error "Each row of the input matrix needs to contain at least one non-zero entry"
#See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
rowTotals <- apply(datatm, 1, sum) #Find the sum of words in each document
dtm.new <- datatm[rowTotals > 0, ]

library('tidytext')

#Sentiment analysis using the Bing lexicon
tidydata <- tidy(dtm.new)

sp_sentiments <- tidydata %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
sp_sentiments

#Overall sentiment per document
library(tidyr)
documentsentiments <- sp_sentiments %>%
  count(document, sentiment, wt = count) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)

#NRC sentiments
sentiments_nrc <- tidydata %>%
  inner_join(get_sentiments("nrc"), by = c(term = "word"))

nrc_plot <- sentiments_nrc %>%
  group_by(sentiment) %>%
  summarise(word_count = n()) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, word_count)) %>%
  #Use `fill = -word_count` to make the larger bars darker
  ggplot(aes(sentiment, word_count, fill = -word_count)) +
  geom_col() +
  guides(fill = FALSE) + #Turn off the legend
  labs(x = NULL, y = "Word Count") +
  scale_y_continuous(limits = c(0, 10000)) + #Hard-code the axis limit
  ggtitle("NRC Sentiment") +
  coord_flip()
nrc_plot

normal_plot <- sp_sentiments %>%
  group_by(sentiment) %>%
  summarise(word_count = n()) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, word_count)) %>%
  #Use `fill = -word_count` to make the larger bars darker
  ggplot(aes(sentiment, word_count, fill = -word_count)) +
  geom_col() +
  guides(fill = FALSE) + #Turn off the legend
  labs(x = NULL, y = "Word Count") +
  scale_y_continuous(limits = c(0, 5000)) + #Hard-code the axis limit
  ggtitle("Positive vs Negative Sentiment") +
  coord_flip()
normal_plot
You can see that there are more negative comments about screen time than positive:
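To put a number on that, you can tally the matched words by sentiment directly from the sp_sentiments object created above (your counts will differ depending on the tweets you collect):

#Total occurrences of positive vs negative words across all tweets
sp_sentiments %>%
  count(sentiment, wt = count)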
Next, you can analyse the emotional tone using the NRC lexicon. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) and two sentiments (negative and positive). It is interesting to see how different dictionaries categorise sentiment differently:
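If you want to see exactly which categories the NRC lexicon provides before joining it to your data, you can count them directly (depending on your tidytext version, you may be prompted to download the lexicon the first time):

#Number of lexicon words associated with each NRC category
get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)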
Finally, we can explore the topics discussed in tweets mentioning “Screen Time”. To decide how many topics are present in the tweets, we can build several LDA models using Gibbs sampling and compare them on the “Griffiths2004” metric. Read this paper to learn how the metric determines the optimal number of topics in a corpus: http://www.pnas.org/content/101/suppl_1/5228
#LDA tuning: build several models to find the optimal number of topics
library("ldatuning")
library("topicmodels")

#Build models for 2 to 50 topics and compare them on the Griffiths2004 metric
result <- FindTopicsNumber(dtm = dtm.new,
                           topics = seq(from = 2, to = 50, by = 1),
                           metrics = "Griffiths2004",
                           method = "Gibbs",
                           control = list(seed = 77),
                           mc.cores = 4L,
                           verbose = TRUE)

#Print the result table (a data.frame)
result

#Plot the result
FindTopicsNumber_plot(result)
By viewing the graph, you can see that the optimal number of topics is 16:
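Rather than reading the optimum off the plot by eye, you can also pull it straight from the result table; FindTopicsNumber returns a data frame with a topics column and one column per metric, and Griffiths2004 is a metric to be maximised:

#Number of topics with the highest Griffiths2004 score
result$topics[which.max(result$Griffiths2004)]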
Now that we know how many topics to examine, we can run the LDA analysis to explore the underlying topics in our data:
#Fit an LDA model with 16 topics
k <- 16
ldaTopics <- LDA(dtm.new, method = "Gibbs", control = list(alpha = 0.1, seed = 77), k = k)

No.Documents <- length(ldaTopics@documents)
No.Terms <- length(ldaTopics@terms)
class(ldaTopics)

#Word probabilities per topic
wordsprobabilities <- tidy(ldaTopics, matrix = "beta")

#Probability of a specific word belonging to each topic
moneyprob <- tidy(ldaTopics, matrix = "beta") %>%
  filter(term == "money")

#Top documents per topic
#Create a tidy form showing topic, document and its gamma value
docpertop <- tidy(ldaTopics, matrix = "gamma")

#Top 10 terms per topic
ldaTerms <- terms(ldaTopics, 10)
ldaTerms <- apply(ldaTerms, MARGIN = 2, paste, collapse = ",")
ldaTerms

#Function returning the n most probable words for a given topic
topwordtopicprob <- function(topicnum, LDAwordprobs, ntopbetawords) {
  topicsub <- subset(LDAwordprobs, topic == topicnum)
  topicsub <- topicsub %>% arrange(topic, -beta)
  head(topicsub, ntopbetawords)
}

topic2 <- topwordtopicprob(2, wordsprobabilities, 10)

#Collect the top 10 words for every topic
alltopicstopprobs <- NULL
for (i in 1:k) {
  topwords <- topwordtopicprob(i, wordsprobabilities, 10)
  alltopicstopprobs <- rbind(alltopicstopprobs, topwords)
}

#Plot the word probabilities per topic
test <- alltopicstopprobs %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 2) +
  coord_flip()
test
Below are several graphs showing the probabilities of words belonging to each topic:
You will see that naming the topics still requires human interpretation; a quick way of inspecting the top terms per topic is sketched after the list below. For example:
Topic 1: Screen time of people & characters
Topic 7: New iOS screen time measure
Topic 8: The show Riverdale
Topic 12: Children’s screen time and health.
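To arrive at labels like these, it helps to read each topic’s top terms side by side. The ldaTerms object built earlier already holds the comma-separated top 10 terms per topic, so you can print the ones discussed above:

#Top 10 terms for the topics labelled above
ldaTerms[c(1, 7, 8, 12)]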