Friday, March 6, 2015

Big data data analysis

In statistics, computer science, and business, "big data" is all the rage. I think I have finally figured out what has bothered me about the term besides it being thrown around often without clear definitions.

First, there are two types of "big" data - long data and wide data. Long data involve large numbers of subjects, while wide data are high dimensional, sometimes with more variables than observations (the "high dimension, small sample size" situation). I have typically been involved in research in wide data situations, which make some typical statistical methods impossible to use and create really interesting statistical challenges to solve. Long data issues arise when you cannot access all the observations at once because the data set is too large for memory; these are more computer science issues, about making it possible to calculate basic or sophisticated statistical summaries or analyses.
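As a small illustration of the long data problem, here is a minimal sketch of computing a mean and variance in a single pass, so the full data set never has to sit in memory. It is written in Python and uses Welford's one-pass update; the function name and the generated values are made up for illustration and are not from any particular data set.

```python
def streaming_mean_var(values):
    """Return (n, mean, sample variance) from one pass over an iterable."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# Hypothetical "long" data: a generator, so values are produced one at a
# time and are never all held in memory together.
observations = (float(i % 10) for i in range(1_000_000))
n, mean, variance = streaming_mean_var(observations)
print(n, round(mean, 3), round(variance, 3))
```

The same idea extends to reading a large file in chunks, although more sophisticated analyses usually need more than a running summary.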

Business, health, and many other areas are moving to have big data issues in both directions. Big data creates problems because of its volume but the term doesn't describe what you are doing or why or really even what sort of problems you might encounter. Since I focus on statistical methods, I am always thinking about the data source (implications on the scope of inference), data visualization, data analysis, and (correct) interpretation of the statistical results. All of these reflect the analysis of big data. In other areas of statistics, we describe the type of data and add "analysis" to it. So we could discuss "big data analysis" just like we say "multivariate data analysis" or "time series analysis" (the names of courses I am teaching right now) or "functional data analysis" (where much of my research is focused). Or maybe it would be good to be more specific and use terms like "wide data analysis" and "long data analysis" but I am not so sure... I've tried saying "big data data analysis" but it is a bit clunky - maybe its adoption as a term would keep people from using it so much?

Or maybe we should talk about "all data" data analysis and not just be focused on the problems created by measuring "everything".

Whatever it gets called, it would be useful to be talking about analyses that try to address research questions, whether they do, and how to interpret the results.

Tuesday, January 13, 2015

New edition of "A Second Semester Statistics Course with R"

My co-author, Katharine Banner, and I have updated and are releasing a new edition of our book, A Second Semester Statistics Course with R, Version 2.0, 2015. It is available from our institution's ScholarWorks repository at https://scholarworks.montana.edu/xmlui/handle/1/2999


This book is written primarily for students taking STAT 217 at Montana State University (nearly 300 per year), but it provides what we think is a nice treatment of intermediate statistical methods. What is more distinctive is the deep integration of R throughout the text - providing a nice venue to learn to use R with a moderately sophisticated set of graphical and statistical methods. All of the data sets and code are provided within the text, making for a nice place to learn R. I think it contains really nice summaries of some exploratory graphical techniques, one- and two-way ANOVA, Chi-square testing, and regression modeling. But I might be biased slightly...

In updating from the version from Spring 2014, there were a few goals based on feedback from "living with" that version for a year. Students thought that the previous version read like it assumed they knew things, especially the new topics (hopefully that is what they were talking about). It was interesting to re-read the book with that feedback in mind and see it from their perspective. In this version, more time is spent explaining terminology and reviewing terminology that should have been learned in the pre-requisite course (at least that is the hope).

Along with being available digitally to the students for free, the book is available as print on demand from our local printing services. In printing the first version, I discovered that much of the cost came from color pages (like 6 times more than a black and white page). So in the last year, I kept watching for alternatives to the displays that needed color that could be made to work in black and white.

The rest gets more technical...

As one example, I have always disliked the interaction.plot function in R for Two-Way ANOVA interaction assessment. First, it doesn't work with formula notation (y~x1*x2), which we use because it allows a direct tie between making graphs and fitting models (this is a goal of the mosaic project's work). Second, it doesn't provide a measure of variability for each displayed mean. While bounds may not directly relate to results from the tests in the model, they are useful to help understand potential changing variability across levels (if the sample sizes are similar) and to have some sense of how important the interaction might be. We used interaction2wt from the HH package in version 1.0, which had two advantages - it made the interaction plot both ways (switching the variable on the x-axis) and it plotted the "main effects" via boxplots of the response versus each individual explanatory variable. But it didn't contain error bounds, was hard to add annotation to since it was within a lattice display, and its default color scheme requires color to be understandable (as with most default lattice color schemes). The mosaic package provided a better lattice color scheme that could work in black and white, but the lack of error estimates was problematic. So I built a formula interface to a version of an interaction plot that I found in the sciplot R package. As I was working with it, I also added functionality to add the results of Tukey's HSD to a Two-Way ANOVA model where the interaction was found to be important. The next version of this function will add options to get the plot in both orders and to add displays of the means that estimate each main effect (writing this will hopefully remind me of the two extra tasks I just created for myself).
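To make the "measure of variability for each displayed mean" concrete, here is a minimal sketch of the numbers an interaction plot with error bars would display: the mean and standard error of the response in each combination of the two factors. It is in Python rather than the R used in the book, it is not the function described above, and the function name, factor levels, and responses are all made up for illustration. Using plus or minus one standard error as the bounds is also an assumption.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def cell_summaries(rows):
    """Group (factor_a, factor_b, response) rows by cell.

    Returns {(a, b): (cell mean, standard error of the mean)} for every
    cell that has at least two observations.
    """
    cells = defaultdict(list)
    for a, b, y in rows:
        cells[(a, b)].append(y)
    summaries = {}
    for key, ys in cells.items():
        if len(ys) >= 2:
            summaries[key] = (mean(ys), stdev(ys) / sqrt(len(ys)))
    return summaries


# Made-up responses for a 2x2 layout (hypothetical factor levels).
rows = [
    ("low", "A", 10.0), ("low", "A", 12.0), ("low", "A", 11.0),
    ("low", "B", 14.0), ("low", "B", 15.0), ("low", "B", 16.0),
    ("high", "A", 20.0), ("high", "A", 22.0), ("high", "A", 21.0),
    ("high", "B", 15.0), ("high", "B", 17.0), ("high", "B", 16.0),
]

summaries = cell_summaries(rows)
for (a, b), (m, se) in sorted(summaries.items()):
    print(f"{a:>4} {b}: mean = {m:.2f}, "
          f"mean +/- 1 SE = ({m - se:.2f}, {m + se:.2f})")
```

Plotting these means against one factor, with one line per level of the other factor, gives the interaction plot; swapping the roles of the two factors gives the plot "the other way".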

We had also used the bwplot function to make boxplots and histogram to make histograms from the mosaic package. They didn't provide many extra features over the standard boxplot and hist functions from base R for our purposes, and bwplot replaced the median with a dot that was confusing to explain. So we went back to using boxplot and hist. I added beanplots that add density curves to stripcharts and display the group means - all of which make shape and mean comparisons possible for multi-group problems and allow a view of the original observations. In an update, beanplot added a default option, which I don't really like, to log-transform the responses if non-normality is detected - partially because I have never liked tests for normality. And I think the jitter option should be the default because stacking tied observations can make it hard to see the mean values, especially in black and white.

We also dropped the do() function from mosaic and are trying to teach the for-loop that underlies repeatedly applying functions. This method actually seems simpler than using the do function because the do function added an extra label that we can avoid by directly filling our own vector of results of interest.
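The pattern of filling our own vector of results looks roughly like the following. This is a sketch in Python rather than the R code from the book, and the choice of a permutation distribution for a difference in two group means is my assumption about a typical use; the function name and the responses are made up for illustration.

```python
import random
from statistics import mean


def permuted_differences(group1, group2, n_perm=1000, seed=1):
    """Fill a list with differences in means from shuffled group labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    results = [0.0] * n_perm  # storage for the results, created up front
    for i in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        results[i] = mean(shuffled[:n1]) - mean(shuffled[n1:])
    return results


# Made-up responses for two groups.
group1 = [12.1, 14.3, 11.8, 15.2, 13.9]
group2 = [10.4, 11.1, 9.8, 12.0, 10.9]

observed = mean(group1) - mean(group2)
diffs = permuted_differences(group1, group2)
# Two-sided p-value: proportion of permuted differences at least as extreme.
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(f"observed difference = {observed:.2f}, "
      f"permutation p-value = {p_value:.3f}")
```

The loop makes it explicit that each pass produces one number and stores it in position i, which is the part that a wrapper function tends to hide.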

I did remove cor.test from the functions used but added in using the cor() function interface that mosaic has built that allows formula specifications of correlations.

The other change is a general one that I am trying to make in how I talk about "statistically significant" results. This is a vague and loaded term, and I attempted to remove it and focus on "strength of evidence" interpretations. This will require another round of writing at some point to fully implement, but I started the move.

I have found the following benefits of having my own book for the course I supervise. The instructors (typically graduate students, with MS students often only getting to teach it once or twice) are all on the same page with material to cover - and the students are too. I can make changes as R changes. This can be as simple as code interfaces changing or something like the integration of beanplots, which is a relatively new package in R. It also allows me to change the book as I change. It is surprising, after 20 years of doing statistics, how much I continue to evolve as a statistician. I am sure I am now losing and re-learning things I used to know, but I am enjoying the changes that have come with moving to focusing on quantifying evidence instead of just saying that anything with a small p-value is significant.

Monday, August 11, 2014

Data is vs data are...

I was trying to decide about resurrecting my blog or closing it out. I decided on making one more quick post and then deciding later...

I have been really liking the fivethirtyeight blog, and their recent stories on the proper usage of the term "data" have been really enjoyable (one story and their original story). They used Survey Monkey Audience to address this issue. I focus on this issue in my book (available here) in the first pages. In statistics, the grammar usage is very clear - "data" should be used as plural. I like to think of whether "things" works in sentences and use that to decide if I used "data" correctly. "Data" is distinguished from datum as containing more than one observation. I have never felt a strong need to follow the whims of the general culture in modifying grammar usage. Except when it is British English usages, but that is a story for another blog post.

The fivethirtyeight blog needed to make an editorial decision and went with the common language usage of "data is" and had their reasons for doing that. A bit odd for a bunch of quants making a living writing data analyses, but their audience would find them erudite. (My wife wondered whether that was a bad thing when we discussed this.) Being thoughtful about grammar usage reflects a concern for clarity. And this leads me to a small addition to their work. If you dig through their data set a bit, you can find the following:
Basically, what this shows is that people who have thought about this issue choose "data are" at much higher rates than people who have not thought about it. And those who have not thought about it use the colloquial "data is".

I was recently talking to a colleague who uses statistics but is not statistically trained. They mentioned trying to read a statistical methods book and running into the colloquial "data is" in it. They put the book down. And that is the issue... if you use "data is" you sound like you don't know what makes up a data set, whether you do or not.

Even more fun is to try to find other relationships with the is/are choice in the following tableplot:

To fully document what I did, I removed any missing observations list-wise to make it easier (eventually) for my students to do this analysis when they read my book. My R code is available upon request and the interpretation ideas for both plots are available in my book. The original data set is available in the GitHub repository related to the original article.
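For anyone unfamiliar with list-wise deletion, it simply means dropping every respondent who is missing a value on any variable. A minimal sketch in Python (not my R code) is below; the field names and records are hypothetical and are not taken from the survey data set.

```python
def listwise_delete(records, missing=(None, "")):
    """Keep only the records with no missing value in any field."""
    return [
        record for record in records
        if all(value not in missing for value in record.values())
    ]


# Hypothetical survey responses; None and "" both mark a missing answer.
records = [
    {"usage": "data are", "considered": "yes", "age": "30-44"},
    {"usage": "data is", "considered": None, "age": "18-29"},
    {"usage": "", "considered": "no", "age": "45-60"},
    {"usage": "data is", "considered": "no", "age": "60+"},
]

complete = listwise_delete(records)
print(f"{len(records)} records before, {len(complete)} after deletion")
```

The cost of this simplicity is that a respondent who skipped a single question is lost from every display, so the remaining sample can be noticeably smaller than the original.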

And I am by no means perfect in my grammar usage, but am always searching to improve...