What do the contents of Trump’s tweets tell us about his priorities and communications style? Using natural language processing for linguistic analysis
Whilst there are many conversations that could be had about the current president of the USA, one curious first has been his decision to continue tweeting from a personal Twitter account after taking office. Obama made personal use of social media on the campaign trail, but his engagement as president was through official accounts.
This post takes a look at the changing linguistic tone of the tweets from the @realdonaldtrump account. Before we start, this is not the first such analysis. Others have shown that tweets posted from an Android device are likely direct from the source whilst tweets posted from an iPhone or web-based application are likely by an assistant to the source. This post does not separate them.
To conduct the analysis, I scraped the maximum number of tweets that Twitter allows from a timeline: 3,200. They spanned the date range 17th March 2016 to 12th February 2017 and were categorised, based on their date/time stamp, into one of four periods of activity:
- Candidate (primaries to elect the Republican candidate: 17 March 2016 to 3 May 2016)
- Nominee (campaigning as Republican nominee: 4 May 2016 to 8 November 2016)
- Elected (winner of the election: 9 November 2016 to 19 January 2017)
- President (inaugurated into office: 20 January 2017 to 12 February 2017)
Notes: Trump became the official Republican nominee at the party’s convention in July, but all other candidates had dropped out by the end of 3 May. The election was held on 8 November and the inauguration took place on 20 January.
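The period assignment described above could be sketched in Python roughly as follows. This is a minimal illustration rather than the code used for the post: the names `PERIODS` and `period_for` are made up here, and it assumes each tweet's timestamp has already been reduced to a calendar date (how time zones were handled is not stated).

```python
from datetime import date

# Period boundaries as listed above: (label, first day, last day), inclusive.
PERIODS = [
    ("Candidate", date(2016, 3, 17), date(2016, 5, 3)),
    ("Nominee", date(2016, 5, 4), date(2016, 11, 8)),
    ("Elected", date(2016, 11, 9), date(2017, 1, 19)),
    ("President", date(2017, 1, 20), date(2017, 2, 12)),
]


def period_for(day):
    """Return the period label for a tweet date, or None if it is out of range."""
    for label, start, end in PERIODS:
        if start <= day <= end:
            return label
    return None
```

For example, `period_for(date(2016, 11, 8))` falls in the Nominee period, whilst the day after falls in Elected.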
The tweets were broken into words. Some cleaning was applied for consistency, e.g. U.S., U.S.A. and United States were consolidated to become ‘USA’ whilst ‘makeamericagreatagain’ was separated into ‘make’, ‘america’, ‘great’, and ‘again’. Basic noise words such as ‘the’, ‘and’, ‘of’ were removed. However, modal verbs (would, should, might and could) and personal and possessive pronouns (I, you, we, them etc.) were kept because I wanted to get a feel for the human intent and direction of the tweets.
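As a rough illustration of this cleaning step, the sketch below consolidates the country-name variants, splits the hashtag and drops noise words whilst keeping pronouns and modal verbs. It is an assumption-laden stand-in, not the original code: `NOISE_WORDS` is a tiny hypothetical list (the real one is not given), `clean_and_split` is an invented name, and a simple regular expression does the word-breaking.

```python
import re

# Hypothetical noise-word list for illustration only; the post's actual list
# is not given. Pronouns and modal verbs are deliberately left out of it.
NOISE_WORDS = {"the", "and", "of", "a", "to", "in", "is"}


def clean_and_split(text):
    """Normalise a tweet and return its words with noise words removed."""
    text = text.lower()
    # Consolidate variants of the country name into a single token.
    text = re.sub(r"\bunited states\b|\bu\.s\.a\.|\bu\.s\.", "usa", text)
    # Separate the campaign hashtag into its component words.
    text = text.replace("makeamericagreatagain", "make america great again")
    # Keep runs of letters, plus an optional leading @ for account handles.
    words = re.findall(r"@?[a-z]+", text)
    return [w for w in words if w not in NOISE_WORDS]
```

Contractions such as ‘I’ll’ are not handled in this sketch, so they would need expanding before the words are split.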
The 20 most common words, based on their frequencies, were extracted from each period and grouped as follows:
|I, me, my, trump, @realdonaldtrump|
|you, we, our|
|they, them, their, who, people|
|Modal verbs||will, would, shall, should, may, might, must, can, could|
|Topics||1. make, america, great, usa, country|
|||2. ted, cruz, he, new, york, indiana, wisconsin|
|||3. hillary, clinton, she, her, crooked|
|||4. president, election, he|
|||5. fake, news, bad|
Notes: ‘he’, ‘she’ and ‘her’ were assigned to topics because their frequency coincided with gender-specific names (‘he’ with Ted Cruz, ‘she’ and ‘her’ with Hillary Clinton); ‘new’ was assumed to belong with ‘york’; ‘country’ was ambiguous because other countries (well… seven in particular) were also mentioned in tweets during the most recent period, but including or excluding it didn’t affect the results.
Sooo….. drum roll. What did the tweets reveal?
For each period, the top 5 groups are listed in order with the percentage of the top 20 word frequencies that the group represents. The groups are coloured: orange for self, green for other, purple for modal verbs and blue for topics. The percentage text colour is black when it first appears, then red if the percentage has gone down and green if the percentage has gone up in the next period. The arrows indicate whether the group has moved up or down the rankings.
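The percentage calculation can be sketched as below, on the reading that each group's share is its summed word counts divided by the total count of all top 20 words in that period. The group labels, the function name `group_shares` and the counts in the usage note are invented for illustration; they are not the real results.

```python
# Word groups taken from the grouping table; the labels are illustrative.
GROUPS = {
    "self": {"i", "me", "my", "trump", "@realdonaldtrump"},
    "you/we": {"you", "we", "our"},
    "modal verbs": {"will", "would", "shall", "should", "may",
                    "might", "must", "can", "could"},
    "make america great": {"make", "america", "great", "usa", "country"},
}


def group_shares(top_counts, groups):
    """Return (group, percentage) pairs, largest share first.

    top_counts maps each top-20 word to its frequency in one period.
    Words that belong to no group still count towards the total.
    """
    total = sum(top_counts.values())
    shares = {}
    for name, words in groups.items():
        count = sum(n for word, n in top_counts.items() if word in words)
        if count:
            shares[name] = round(100 * count / total, 1)
    return sorted(shares.items(), key=lambda pair: pair[1], reverse=True)
```

With made-up counts such as `{"will": 40, "i": 25, "you": 15, "great": 12, "america": 8}`, the modal verbs group ranks first with 40% and the ‘make america great’ group comes third with 20%. Taking the first five entries of the result gives the top 5 groups for a period.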
There is an interesting shift in focus throughout the timeline of periods. The one consistent topic, unsurprisingly, involves making America great. However, once elected, ‘USA’ in its various forms (e.g. United States) becomes more prominent than the word ‘America’. Each of the other topics is relevant only within a single period: Ted Cruz and the states where primaries were being held; Hillary Clinton, apparently crooked; and Fake News. The topic that just failed to make the top 5 groups during the Elected period was ‘president, election, he’ (came in 6th).
The more interesting shift is in the use of modal verbs and personal/possessive pronouns. The focus on self declines dramatically (and somewhat surprisingly given other public communications) once president whilst the use of 2nd person pronouns and modal verbs continues to rise across all four periods. The only modal verb in the top 20 words for the first three periods is ‘will’. Once president, it has been joined by ‘can’ and ‘should’. It will be interesting to see how these trends develop over the next 3,200 tweets…
For those interested in the analysis:
The data was retrieved using Twitter’s public APIs. The analysis was all done in Python. The tweets were pre-processed to clean up the text, using regular expressions. Punctuation and numeric characters were removed. Abbreviations were expanded and combined, e.g. ‘United States’, ‘U.S.’, and ‘U.S.A.’ all became ‘USA’, ‘makeamericagreat*’ became ‘make’, ‘america’, and ‘great’. Shortcuts for pronouns and modals were also expanded, e.g. ‘I’ll’ into ‘I’ and ‘will’. The Python package NLTK was used for word-breaking. However, a custom noise word list was used to keep the pronouns and modal verbs. Once that was all done, a simple frequency distribution was created for each group of tweets and the top 20 most common words were extracted with their frequency counts. Grouping of terms for topics was manual. I was going to use topic modelling, but the topics were fairly self-explanatory within the top 20 terms and, given this is not being published in Nature, I decided simple manual groups were good enough.
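For anyone wanting to try something similar, here is a minimal sketch of the contraction expansion and frequency count. It swaps NLTK's word-breaking and frequency distribution for a plain `str.split` and the standard library's `collections.Counter`, so it runs without extra downloads. The `CONTRACTIONS` table, the function names and the noise words passed in are assumptions for illustration, not the original code.

```python
import re
from collections import Counter

# Hypothetical contraction table; only "I'll" -> "I will" is mentioned above.
CONTRACTIONS = {"i'll": "i will", "we'll": "we will", "you'll": "you will"}


def tokenize(text):
    """Lower-case a tweet, expand contractions, strip punctuation and digits."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Replace everything except letters, @ and whitespace with a space.
    text = re.sub(r"[^a-z@\s]", " ", text)
    return text.split()


def top_words(tweets, noise_words, n=20):
    """Return the n most common words across tweets, skipping noise words."""
    counts = Counter()
    for tweet in tweets:
        counts.update(w for w in tokenize(tweet) if w not in noise_words)
    return counts.most_common(n)
```

Calling `top_words` once per period, with that period's tweets, gives the kind of top 20 list that the manual grouping then works from.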
The table below shows the full top 20 lists for the four periods. The terms emphasised in bold only occur in one period. Those with an eye for detail will notice the number of tweets is very different, particularly the ‘Nominee’ period compared to all others. Yes, this could affect some of the findings, particularly for the smallest period. I’ll probably run the analysis again in a couple of months to see if it changes much.
And a chuckle for those who made it to the end of the post. The frequencies were dropped into Excel for quick calculations to get the percentages. The visual was created manually in PowerPoint 🙂 Sometimes, it’s easier to just bash something together than write a script…
Thanks to the following sources for code snippets and examples that helped with the analytics behind this article:
- Natural Language Processing with Python by Steven Bird et al, published by O’Reilly Books
- Document clustering by Brandon Rose
Featured image: Snowdrops, February 2017 – Author’s own photo (my mum has a lot of snowdrops in her new garden)