Friday, March 6, 2015

Big data data analysis

In statistics, computer science, and business, "big data" is all the rage. I think I have finally figured out what bothers me about the term, besides its being thrown around so often without a clear definition.

First, there are two types of "big" data - long data and wide data. Long data involves large numbers of subjects; wide data is high dimensional, sometimes with more variables than observations (high dimension, small sample size). I have typically worked with wide data situations, where some typical statistical methods become impossible to use - creating really interesting statistical challenges to solve. Long data raises a different issue: the data set is too large to fit in memory, so even basic or sophisticated statistical summaries and analyses require solving computer science problems first.

Business, health, and many other areas are moving toward big data issues in both directions. Big data creates problems because of its volume, but the term doesn't describe what you are doing, why, or even what sort of problems you might encounter. Since I focus on statistical methods, I am always thinking about the data source (and its implications for the scope of inference), data visualization, data analysis, and (correct) interpretation of the statistical results. All of these are part of analyzing big data. In other areas of statistics, we describe the type of data and add "analysis" to it. So we could discuss "big data analysis" just like we say "multivariate data analysis" or "time series analysis" (the names of courses I am teaching right now) or "functional data analysis" (where much of my research is focused). Or maybe it would be better to be more specific and use terms like "wide data analysis" and "long data analysis", but I am not so sure.... I've tried saying "big data data analysis" but it is a bit clunky - maybe its adoption as a term would keep people from using it so much?

Or maybe we should talk about "all data" data analysis and not just be focused on the problems created by measuring "everything".

Whatever it gets called, it would be useful to be talking about analyses that try to address research questions, whether they do, and how to interpret the results.

Tuesday, January 13, 2015

New edition of "A Second Semester Statistics Course with R"

My co-author, Katharine Banner, and I have updated and are releasing a new edition of our book, A Second Semester Statistics Course with R, Version 2.0, 2015. It is available from our institution's ScholarWorks repository at https://scholarworks.montana.edu/xmlui/handle/1/2999


The audience for this book is primarily students taking STAT 217 at Montana State University (nearly 300 per year), but it provides what we think is a nice treatment of intermediate statistical methods. What is more distinctive is the deep integration of R throughout the text - providing a nice venue to learn to use R with a moderately sophisticated set of graphical and statistical methods. All of the data sets and code are provided within the text, making it a nice place to learn R. I think it contains really nice summaries of some exploratory graphical techniques, One-Way and Two-Way ANOVA, Chi-square testing, and regression modeling. But I might be slightly biased...

In updating the version from Spring 2014, we had a few goals based on feedback from "living with" that version for a year. Students thought that the previous version read as if it assumed they already knew things, especially the new topics (hopefully that is what they were talking about). It was interesting to re-read the book with that feedback in mind and see it from their perspective. In this version, more time is spent explaining new terminology and reviewing terminology that should have been learned in the prerequisite course (at least that is the hope).

Along with being available digitally to the students for free, the book is available as print-on-demand from our local printing services. In printing the first version, I discovered that much of the cost came from color pages (around 6 times the cost of a black and white page). So over the last year, I kept watching for black-and-white-friendly alternatives to the displays that needed color.

The rest gets more technical...

As one example, I have always disliked the interaction.plot function in R for assessing interactions in Two-Way ANOVA. First, it doesn't work with formula notation (y~x1*x2), which we use because it allows a direct tie between making graphs and fitting models (this is a goal of the mosaic project's work). Second, it doesn't provide a measure of variability for each displayed mean. While such bounds may not directly relate to results from the tests in the model, they are useful for understanding potentially changing variability across levels (if the sample sizes are similar) and for getting some sense of how important the interaction might be. We used the interaction2wt function from the HH package in Version 1.0, which had two advantages - it made the interaction plot both ways (switching the variable on the x-axis) and it plotted the "main effects" via boxplots for the individual explanatory variables. But it didn't contain error bounds, was hard to annotate since it was within a lattice display, and its default color scheme required color to be understandable (as with most default lattice color schemes). The mosaic package provided a better lattice color scheme that could work in black and white, but the lack of error estimates was problematic. So I built a formula interface to a version of an interaction plot that I found in the sciplot R package (a minimal sketch of the idea is below). As I was working with it, I also added functionality to add the results of Tukey's HSD to a Two-Way ANOVA model where the interaction was found to be important. The next version of this function will add options to get the plot in both orders and to add displays of the means that estimate each main effect (writing this will hopefully remind me of the two extra tasks I just created for myself).
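The wrapper in the book is more involved, but here is a minimal sketch of the formula-interface idea, assuming the sciplot package is installed; the function name intplot, its internals, and the example data set are illustrative rather than the actual code from the book:

library(sciplot)

# A formula interface wrapped around sciplot's lineplot.CI, which plots
# group means with error bars (+/- 1 SE by default). Illustrative only.
intplot <- function(formula, data, ...) {
  vars <- all.vars(formula)  # expects a formula of the form y ~ x1 * x2
  lineplot.CI(x.factor = factor(data[[vars[2]]]),
              response = data[[vars[1]]],
              group = factor(data[[vars[3]]]),
              xlab = vars[2], ylab = vars[1], trace.label = vars[3], ...)
}

# Example: interaction plot of tooth growth by dose and delivery method
intplot(len ~ dose * supp, data = ToothGrowth)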

We had also used the bwplot function from the mosaic package to make boxplots and its histogram function to make histograms. They didn't provide many extra features over the standard boxplot and hist functions from base R for our purposes, and bwplot replaced the median line with a dot that was confusing to explain. So we went back to using boxplot and hist. I added beanplots, which add density curves to stripcharts and display the group means - all of which makes shape and mean comparisons possible for multi-group problems while allowing a view of the original observations. In an update, the beanplot package added a default option to log-transform the responses if non-normality was detected, which I don't really like - partially because I have never liked tests for normality. And I think the jitter option should be the default because stacking tied observations can make it hard to see the mean values, especially in black and white.
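For illustration, here is one way to call beanplot with those two defaults overridden, assuming the beanplot package is installed (the data set is again just an example): setting log = "" turns off the automatic log-transformation and method = "jitter" spreads tied observations instead of stacking them.

library(beanplot)

# Beanplots of tooth growth by dose: log = "" disables the automatic
# log scale and method = "jitter" keeps tied points from stacking, so
# the mean lines (the default overall/bean lines) stay visible.
beanplot(len ~ dose, data = ToothGrowth, log = "", method = "jitter",
         col = "lightgray")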

We also dropped the do() function from mosaic and are instead teaching the for-loop that underlies repeatedly applying functions. This actually seems simpler than using do() because do() added an extra label that we can avoid by directly filling our own vector of results of interest.
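As a minimal sketch of that for-loop approach, here is a permutation distribution for a difference in group means; the data set and the number of shuffles are illustrative:

B <- 1000
Tstar <- numeric(B)  # pre-allocate a vector to hold the results
for (b in 1:B) {
  # Shuffle the group labels and recompute the statistic of interest
  shuffled <- sample(ToothGrowth$supp)
  Tstar[b] <- diff(tapply(ToothGrowth$len, shuffled, mean))
}
hist(Tstar, main = "Permutation distribution")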

I did remove cor.test from the functions used, but added the cor() interface that mosaic has built, which allows formula specifications of correlations.
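With mosaic loaded, a correlation can then be requested with the same formula notation used for graphs and models (the data set here is only an example):

library(mosaic)

# mosaic extends cor() with a formula interface; base R's cor() would
# require two numeric vectors instead.
cor(len ~ dose, data = ToothGrowth)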

The other changes reflect a general change I am trying to make in how I talk about "statistically significant" results. This is a vague and loaded term, and I attempted to remove it and focus on "strength of evidence" interpretations. This will require another round of writing at some point to fully implement, but I started the move.

I have found the following benefits of having my own book for the course I supervise. The instructors (typically MS graduate students, often only getting to teach it once or twice) are all on the same page with material to cover - and the students are too. I can make changes as R changes. This can be as simple as a code interface changing or as involved as integrating beanplots, a relatively new package in R. It also allows me to change the book as I change. It is surprising, after 20 years of doing statistics, how much I continue to evolve as a statistician. I am sure I am now losing and re-learning things I used to know, but I am enjoying the changes that have come with moving to focusing on quantifying evidence instead of just saying that anything with a small p-value is significant.