28-year-old NYU grad student Josh Begley has created an API that provides detailed data on any US drone strike since 2002. We’ve covered drones before on this site (see also this excellent series on drone ethics over at sipo blog). Also, I’ve decided to pick up the R programming language as part of my efforts to learn statistical methods for the social sciences. So, I thought Begley’s drone data might provide a good opportunity to give you a little „Werkstattbericht“ on my learning efforts on R as well as tying in with the drone theme currently running through the media and the blogosphere.
This post does three things:
- It’s a proof-of-concept for electronically obtaining the US drone data, storing it, and visualizing the data with R and Google Fusion tables. (Or if you will, it’s essentially a complicated way of saying: because I can.)
- It serves as an outlet to provide a piece of code to make life easier for someone who is a) trying to do exactly something like this (ye shall not look further) and b) has better skills than me to actually do something more interesting than me with this data. Also I’m always seeking to improve my coding qualities which are–at this point–not really existent. So any feedback is greatly appreciated.
- It visualizes some parts of the drone data. I’m going to present some mapping of the drone data in Google Fusion tables (plus some info on how to do it), as well as produce some graphics on the quantity and intensity of US drone strikes since 2002. BUT: I do not claim at any point that I have actually any expertise in the subject matter of drone strikes—this post is much more about the process of accessing and visualizing the data than actual analysis. I mean, I do provide some very basic interpretation, but I’d be happy to collaborate with someone who has more expertise in more thoroughly analyzing this kind of data.
What it doesn’t is to claim to be something new. The data I’m using has been much more effectively and impressively visualized on this website. Rather, it provides a glimpse into the mechanics of my first and VERY shaky attempts to handle this kind of data myself (and by doing so hopefully helps other that are attempting something similar).
So, why R? It’s said to be the most complicated software packages out there to learn statistical computing and data manipulation (a point with which I don’t agree btw). For three reasons: a) I had some previous (*ahem* high-school *ahem*) coding experience, b) I’m a firm believer in automatizing repetitive tasks makes life tremendously easier and R seemed to be especially well-suited for that, and c) I usually learn more when thrown into cold water (R didn’t turn out to be that cold, though).
Part One: Getting the Data from the Web
But enough of my backstory, let’s turn to the code (which you can also find here):
This piece of code achieves three things:
- It obtains the drone data from the Web API, reads it into an R data structure, and saves it as „Drone_Data.csv“ in the subfolder „output“ (I’m following this advice of a clear R project layout). This happens in the lines 11-39.
- It aggregates the data format from daily to monthly data. This allows for better visualization, but comes at the cost of time precision (lines 41-53). It also adds „empty“ months to the data, i.e. months in which no drone strikes occurred. This is necessary, because the original data source only contains positive entries, i.e. counts of events when drone strikes happened–but if we want a full overview of the drone strikes time-series we also need the empty months. This happens in lines 55-76.
- It generates two graphs: one summarizing the aggregated number of drone strikes per month since October 2002 and another one summarizing the aggregated number of monthly casualties caused by US drone strikes. This is being done in the lines 82-111.
I won’t go into technical details concerning the specifics of the code, but I know there’s a couple of issues that I’m more than happy to discuss in the comments or by email/twitter etc! (See the end of the post for known issues)
Part Two: Visualizing the Data
But let’s turn to the figures. This is the first graph, the aggregated number of drone strikes per month:
The dotted line represents the beginning of the Obama presidency in January 2009.
We can see a marked increase of the number of drone strike with Obama taking office in 2009. The number reaches an all-time high in September 2009 when there were a staggering 24 drone strikes which means a strike almost every day of this month.
The next figure represents the actual effects of these drone strike in terms of casualties per month. The black line represents the minimum estimated casualties, the red line the maximum number of estimated casualties.
The total (minimum) number of casualties as a result of US drone strikes is 3146 as of June 2013. The real number is most likely to be much higher. This is pretty outrageous, especially given the fact that the vast majority of these casualties are civilians. Again, there is a marked increase in casualties after the beginning of the Obama presidency which is driven by the increasing number of strikes after Obama taking office (we don’t see an increase in the individual lethality of drone strikes in the data which would be another explanation).
Part Three: Mapping the Data
Now, it would be interesting to know where the drone strikes happen. This is relatively easy, since the data contains both latitude and longitude of each strike (or at least an approximation based on the respective source). Although R provides some great mapping functions, I haven’t had a chance to dig deeper into those. Which is why I am relying on Google Fusion Tables to map the data.
Google Fusion Tables is an extension of Google Docs which allows you to create and manage different kinds of data in a spreadsheet-kind of environment. One type of data it can handle is geographical data. This works best when you have latitude and longitude of your observations, but given Fusion Table’s integration with Google and Google Maps you could actually only provide the name of a town or a street address and Google Maps will automatically try to find the matching coordinates and map them for you. This is pretty handy, but our case we already have the georeferenced data of each drone strike, so mapping should come pretty easy.
And it does. Just create a new Google fusion table (requires a Google account) and import the “Drone_Data.csv” provided by the R code and import the data, and setting it to import with tab as a separator (that’s what we defined in R’s write.table() function). I’ve uploaded a fusion table containing the drone data as of 18 June 2013.
[Note: for some reason that I haven’t figured out yet, I can’t import the .csv directly into fusion tables. It always stops after 5 rows. I need to import it as a Google spreadsheet first and then import it from there into fusion tables. I’m working on it. In the meantime, I’m glad about any suggestions as to why that happens or even if you can replicate the problem]
In fusion tables, you set the columns “lon” and “lat” to the type “location” (Google should automatically detect a two-column location column, meaning that fusion tables detects “lon” and “lat” as longitude and latitude of a given observation).
After that you click on the small plus right next to the “rows” and “cards” menu options and click “add map”.
Unfortunately, it is not yet possible to embed Google maps created by fusion tables in wordpress.com posts. This screenshot gives you an idea of where the majority of drone strikes are located, though. For a dynamic view, follow this link.
Drone strikes world wide:
We see that the vast majority of drone strikes happens in the Afghanistan-Pakistan border region. But persons in Yemen and Somalia have also been target by US drone strikes.
So what else could be done with this data? As mentioned before, Pitch Interactive did an impressive job in visualizing the outrageously large number of civilian casualties caused by drone strikes. The data frame produced with the code presented in this post is not yet ready to fully analyze civilian casualties in a similar way, since the data in the civilian casualties column is not yet cleaned (there are entries like “possibly” which would need to be removed). I plan to include that in future versions.
Since the API provides data on drone strikes in near real time, this could also be used to hold President Obama accountable to his promise of reducing drone strikes. Both graphs (number of strikes and casualties) provided above should flatten after May/June 2013. So, we’ll check back in 6 months to see whether that promise was actually true.
Of course there’s tons of other possible research projects one could do with this data. I hope this post helps to ease access and use of this data.
On the technical side, there are several issues that I know, but I haven’t fixed yet:
- The depiction of the time axis in both graphs is not perfect. Don’t know why it centers around April and October of each year when January and July would be more sensible. Trying to fix that.
- The data class of the data frame produced is character which is not convenient for further analysis of the data. Should fix that soon.
- If you source the script without echo in RStudio it produces black graphs. No idea why.
- There’s a package for directly writing fusion tables from the R script, so we wouldn’t need to go through exporting and importing a .csv. The website is currently down, but I try to integrate it, if it comes up again.
- Of course, the mapping could be done directly in R. Haven’t quite dived into that yet, but it’s on the to-do list.
Any suggestions for improvements, further research projects etc. are most welcome. Just drop me a line.