Monday, August 11, 2014

Data is vs data are...

I was trying to decide whether to resurrect my blog or close it out. I decided to make one more quick post and then decide later...

I have been really enjoying the fivethirtyeight blog, especially their recent stories on the proper usage of the term "data" (one story and their original story). They used Survey Monkey Audience to address this issue. I discuss this issue in the first pages of my book (available here). In statistics, the grammar usage is very clear: "data" should be used as a plural. I like to check whether "things" would work in place of "data" in a sentence and use that to decide if I used "data" correctly. "Data" is distinguished from "datum" by containing more than one observation. I have never felt a strong need to follow the whims of the general culture in modifying grammar usage, except when it comes to British English usages, but that is a story for another blog post.

The fivethirtyeight blog needed to make an editorial decision and went with the common language usage of "data is" and had their reasons for doing that. A bit odd for a bunch of quants making a living writing data analyses, but their audience would find them erudite. (My wife wondered whether that was a bad thing when we discussed this.) Being thoughtful about grammar usage reflects a concern for clarity. And this leads me to a small addition to their work. If you dig through their data set a bit, you can find the following:
Basically, what this shows is that people who have thought about this issue choose "data are" at much higher rates than people who have not thought about it. And those who have not thought about it tend to use the colloquial "data is".

I was recently talking to a colleague who uses statistics but is not statistically trained. They mentioned trying to read a statistical methods book and running into the colloquial "data is" in it. They put the book down. And that is the issue... if you use "data is", you sound like you don't know what makes up a data set, whether you do or not.

Even more fun is to try to find other relationships with the is/are choice in the following tableplot:

To fully document what I did: I removed any observations with missing values list-wise to make it easier (eventually) for my students to do this analysis when they read my book. My R code is available upon request, and the interpretation ideas for both plots are available in my book. The original data set is available in the GitHub repository related to the original article.
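For readers who want to see what list-wise deletion and the "thought about it" comparison look like in practice, here is a minimal sketch. It is written in Python rather than being my R code, and the rows and the column names (`considered`, `usage`) are made up for illustration; they are not the survey data or its actual variable names.

```python
from collections import Counter

# Hypothetical toy rows, NOT the survey data; None marks a missing answer.
rows = [
    {"considered": "Yes", "usage": "are"},
    {"considered": "Yes", "usage": "is"},
    {"considered": "No", "usage": "is"},
    {"considered": None, "usage": "is"},
    {"considered": "No", "usage": None},
    {"considered": "Yes", "usage": "are"},
]


def listwise_delete(records):
    """Keep only records with no missing (None) value in any field."""
    return [r for r in records if all(v is not None for v in r.values())]


def row_proportions(records, group_key, answer_key):
    """Proportion of each answer within each group (a row-conditioned cross-tab)."""
    counts = Counter((r[group_key], r[answer_key]) for r in records)
    totals = Counter(r[group_key] for r in records)
    return {(g, a): n / totals[g] for (g, a), n in counts.items()}


complete = listwise_delete(rows)
props = row_proportions(complete, "considered", "usage")
for (group, answer), p in sorted(props.items()):
    print(f"considered={group:3s} data {answer}: {p:.2f}")
```

List-wise deletion drops a respondent entirely if any one answer is missing, which is simple but can throw away a lot of data; that trade-off is why I mention it here.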

And I am by no means perfect in my grammar usage, but I am always looking to improve...
