Introducing 7 Words…

TL;DR:
I made an internet web page that displays live data regarding the usage of 7 popular swear words on twitter. You can see it here.

Longer version:

People who know me know I swear a bit. I can’t help it; evidently I was raised badly, and two decades of listening to that damn rap music has had all of the effects that C. Delores Tucker tried to warn people about. I also (on a seemingly unrelated note) like maps. I just like the way they look, with their land masses and oceans and stuff, and I like seeing things plotted on maps (I slightly obsessively tag EVERY photo that I upload to Flickr with the location). I do prefer hand-illustrated versions, but just a bog-standard Google Map is enough to please me in lieu of a mappa mundi. The third thing that I like, in this slightly tenuous and lengthy introduction to something that I’ve made, is twitter. I tweet. Who doesn’t? Sometimes, I swear on twitter. Sometimes, I allow twitter to get the location data for a tweet too. Occasionally those things happen at the same time. As it turns out, a lot of other people do those things too. Which leads me to…

7 Words You Can Say On Twitter


For a while, I’d been toying with various ideas of mini projects to get me playing with twitter data. (I won’t repeat them here because I can’t give away the gold – some may see the light of day eventually and they’ll make me millions. MILLIONS I tell you). I eventually settled on the totally new and original concept of tracking keywords and doing some sort of count on them. I dabbled with the Search API and fairly quickly had a page which would update with the results of a search term you’d provide. Woopty fucking doo. In an attempt to make it more interesting, I added a sort of competitive edge which placed whatever terms you entered onto a track and animated a race between them to show which was the ‘winner’. It was OK, but quite limited. What I really wanted to do was something with the Streaming API and much larger numbers of tweets.
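
If you’re wondering what that first version looked like under the hood: the old Search API was just an unauthenticated JSON endpoint you could poll. Something along these lines – a minimal sketch of the idea rather than my actual code:

    <?php
    // Minimal sketch of the search-based first version: poll the (long
    // since retired) unauthenticated Search API and count the results.
    // Not my actual code - just the general shape of it.
    $term = urlencode($_GET['q']); // whatever term the visitor typed in
    $json = file_get_contents("http://search.twitter.com/search.json?q={$term}&rpp=100");
    $data = json_decode($json, true);

    // The old Search API returned matching tweets in a 'results' array
    $count = isset($data['results']) ? count($data['results']) : 0;
    echo "{$count} recent tweets match your term";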

Enter the ninja


A couple of bungled attempts later, I had a live stream of tweets attached to certain keywords, but aside from basically displaying the raw data, I wasn’t doing much of use with it. Thankfully, I stumbled on 140dev by Adam Green. He’s a much more talented developer than I am, and has provided a wonderful tool which scrapes and stores tweets from the live stream into a MySQL db via Phirehose (which I’d failed to implement myself). It’s really great, and worked pretty much right off the bat. If I had a hat, I’d doff it to him, because without that as a starting point, I’d probably still be flailing around. Not only that, but because it’s written in PHP and uses MySQL, I could quickly get my head around it and start to tweak it. Which was just as well, because it turns out that just scraping and storing live tweets quickly results in a lot of data. Like, a fucking shitload. For a *generic popular keyword*, I found that I was easily ending up with a good 100,000 tweets per hour – and with the 140dev tool, I was storing the raw tweet as well as user info and any tags, urls or mentions within those tweets. Many, many GB of database space later, I realised that continuing in that manner wasn’t really an option for me.
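
For anyone curious about the Phirehose end of things, it boils down to extending one class and implementing one method. This is a stripped-down sketch rather than 140dev’s actual code – I’ve used the basic-auth constructor of the time for brevity, and the printf stands in for the database insert:

    <?php
    require_once 'Phirehose.php'; // path depends on where the library lives

    // Stripped-down sketch of a Phirehose consumer: extend the class and
    // implement enqueueStatus(), which gets called once per raw tweet.
    // 140dev's real version queues each tweet into MySQL for a separate
    // parsing script to deal with.
    class SwearStreamConsumer extends Phirehose
    {
        public function enqueueStatus($status)
        {
            $tweet = json_decode($status, true);
            if (!is_array($tweet) || empty($tweet['id_str'])) {
                return; // skip keep-alives and malformed payloads
            }
            // Stand-in for the database insert:
            printf("%s: %s\n", $tweet['user']['screen_name'], $tweet['text']);
        }
    }

    $consumer = new SwearStreamConsumer('username', 'password', Phirehose::METHOD_FILTER);
    $consumer->setTrack(array('shit', 'piss', 'fuck', 'cunt', 'cocksucker', 'motherfucker', 'tits'));
    $consumer->consume(); // blocks forever, reconnecting as needed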

Furthermore, I was having a few issues with keeping the scripts which collected and parsed the tweets running for extended periods of time. Manually monitoring them and restarting via SSH was OK in the first instance, but I obviously didn’t want to do that constantly, nor keep emptying the database tables by hand for fear of incurring the wrath of my hosting company. But more on this shortly…

Bring on the cursing

Whilst all of that initial tinkering was going on, I had been using woodland animals as my keywords. Squirrels are mentioned more than hedgehogs, which are mentioned more than weasels, but badgers trump them all, largely because of ‘honey badger’ mentions, which I think is something sexual because the type of people mentioning it are sometimes iffy-looking, but I didn’t want to explore that further. While animals are fun, they don’t really capture the public imagination quite as much as a good old fashioned ‘fuck’. Or ‘shit’. Or ‘tits’. These are things that I obviously soon tried collecting data for, because I’m basically a child and searching for swear words is funny. Like when you looked up those words in a dictionary at school, then claimed to be looking at ‘function’ when you got caught. Don’t pretend you never did that.


Anyway, those words turned out to be more fruitful than woodland animals in terms of their usage on twitter, so I started to think that maybe I could do something with them in order to marry up two of my great interests in life (like what I mentioned in the intro up there ^ – slowly, it all starts to make sense). So I began recording the usage of swear words: shit, piss, fuck, cunt, cocksucker, motherfucker & tits. It had to be those 7 really, because they’re the first thing that comes to mind when I try to list swear words. George Carlin’s ‘Seven Words You Can Never Say on Television’ is pretty much required listening for anyone in my opinion, and it proved to be the inspiration for this project.

Ironing out the creases

When I started to record those, I found I was gathering about 4 million tweets per day, each one neatly stored in this collection of database tables that I still hadn’t got ’round to sorting out properly. After a couple of weeks of experimenting with different solutions (one of the problems of working with live data like this is that I couldn’t find a better way of simulating events/timescales, so any changes I made had to be tested in real time), I had a fairly robust system that basically (there’s a rough sketch of the housekeeping side after this list):

  • Collects all tweets that match one of the 7 words
  • Assigns them to the relevant word and updates a separate count of the ongoing word total
  • Stores them (just the tweet info, not the user/mentions/urls metadata) in the database
  • Periodically checks that the scraper is running and if not, restarts it
  • Flushes the tweet cache regularly, but keeps tweets that have location data attached (because they’re a bit more interesting)
  • Takes an hourly snapshot of the word totals
  • Empties everything and restarts every 24 hours
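
The housekeeping side of that is nothing clever – just a script run from cron every few minutes. Something in this spirit, where the script name, table and column names are all made up for illustration rather than lifted from the real (140dev-derived) schema:

    <?php
    // Cron-driven housekeeping sketch (run every five minutes or so).
    // The script name, table and column names are all illustrative.
    $db = new mysqli('localhost', 'user', 'pass', 'sevenwords');

    // 1. Check the scraper is still running; restart it if not
    if (trim(shell_exec('pgrep -f get_tweets.php')) === '') {
        shell_exec('php get_tweets.php > /dev/null 2>&1 &');
    }

    // 2. Flush the tweet cache, keeping the geotagged ones
    $db->query('DELETE FROM tweets WHERE geo_lat IS NULL');

    // 3. On the hour, snapshot the running totals for the graph
    if ((int) date('i') < 5) {
        $db->query('INSERT INTO word_totals_hourly (word, total, snapshot_at)
                    SELECT word, total, NOW() FROM word_totals');
    }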

This all means that I’ve been able to keep the scraper running for extended periods of time (I’m up to a month at the time of writing) without my table size ever getting any bigger than a couple of hundred MB, and I still retain enough information to do ‘interesting’ things with the data.

Once I had all that sorted, it was just a case of determining what was ‘interesting’.

The final thing

I still somewhat liked the idea of the word counts ‘racing’ against each other, but I couldn’t really be fucked to progress with that so I stuck with just the raw numbers. Watching the numbers continually incrementing is dynamic enough for me; set an arbitrary target and make up a race between them in your head if you like. (‘Shit’ always wins though).


I also wanted a visualisation method for the actual content of the tweets. Displaying them as and when they came through obviously wasn’t an option (due to the crazy volume), so I knew I’d have to filter them somehow. I considered various options, from showing every 100th tweet to a selection every x seconds, and from rendering them as a scrolling marquee to floating them around in space. It dawned on me, though, that the number of tweets with geo data was actually low enough to be used as a filter itself (only about 3% of the tweets I’m recording are tagged like that), and once I’d recognised that, the obvious course of action was to display them on a map. The version that’s currently in place is fairly basic – dropping colour-coded markers for each sweary tweet and displaying the tweet itself momentarily as it’s being added – but it’s something that I’ll be building on shortly. I hadn’t done much with Google Maps for a while, so it’s been nice to play with v3 of the API, and there’s definitely scope for some more visual goodies.
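
The JavaScript side of the map isn’t very interesting (markers plus a briefly-shown tweet), but for completeness, the feed it polls looks roughly like this – again, the table and column names are placeholders rather than the real schema:

    <?php
    // Sketch of the JSON feed the map polls for fresh geotagged tweets.
    // Table and column names are placeholders for the real schema.
    $db = new mysqli('localhost', 'user', 'pass', 'sevenwords');
    $since = isset($_GET['since_id']) ? (int) $_GET['since_id'] : 0;

    $result = $db->query(
        "SELECT tweet_id, tweet_text, word, geo_lat, geo_long
           FROM tweets
          WHERE geo_lat IS NOT NULL AND tweet_id > {$since}
       ORDER BY tweet_id ASC
          LIMIT 50"
    );

    $tweets = array();
    while ($row = $result->fetch_assoc()) {
        $tweets[] = $row; // 'word' picks the marker colour client-side
    }

    header('Content-Type: application/json');
    echo json_encode($tweets);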


The last thing I’m using to make the data more pleasing to the naked eye is a plain old-fashioned line graph. Again, there’s not too much going on in this. It’s a graph. I’ve taken the totals for each word on an hourly basis and plotted them against each other. Because ‘shit’ and ‘fuck’ stand so clearly apart from the rest, I’ve had to plot them on a separate axis to the other words, but really, the graph is just there to show when each word peaks throughout the day – the totals aren’t too important to me.
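
Pulling the data for the graph is just a query over those hourly snapshots – something like this sketch, with the same caveat that the schema names are illustrative:

    <?php
    // Sketch of pulling the hourly snapshots for the graph: one series
    // per word, hour by hour. Schema names are illustrative again.
    $db = new mysqli('localhost', 'user', 'pass', 'sevenwords');
    $result = $db->query(
        "SELECT word, HOUR(snapshot_at) AS hr, MAX(total) AS total
           FROM word_totals_hourly
          WHERE snapshot_at >= NOW() - INTERVAL 1 DAY
       GROUP BY word, hr
       ORDER BY word, hr"
    );

    $series = array();
    while ($row = $result->fetch_assoc()) {
        $series[$row['word']][(int) $row['hr']] = (int) $row['total'];
    }

    header('Content-Type: application/json');
    echo json_encode($series); // handed off to the charting code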


And that’s that

It’s not really intended as a finished product, but it’s complete enough that I thought some people might find it interesting. It’s been fun to make at least, and I’ll be adding some bits here and there. If you’ve got any suggestions or comments in general, feel free to jump on the twitters and let me know @mattnortham.

Shout to wtflevel and the sadly defunct cursebird for being slightly similar to this. I can honestly say that neither really influenced this project, but they work with twitter and swearing so deserve a nod here.

Stop reading, go and watch people swear: 7 Words You Can Say On Twitter

Never use a big word when a little filthy one will do.