Friday, March 6, 2015

Big data data analysis

In statistics, computer science, and business, "big data" is all the rage. I think I have finally figured out what has bothered me about the term besides it being thrown around often without clear definitions.

First, there are two types of "big" data: long data and wide data. Long data involves large numbers of subjects, while wide data is high-dimensional, sometimes with more variables than observations (high dimension, small sample size). I have typically been involved in research in wide data situations, where some typical statistical methods become impossible to use, which creates really interesting statistical challenges to solve. The long data issues involve not being able to access all the observations at once because the data set is too large for memory; they raise more computer science issues about how to calculate basic or sophisticated statistical summaries or analyses.
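To make the distinction concrete, here is a minimal sketch of one problem from each direction. It is only an illustration, not a recommended workflow. For long data, it computes a mean and variance in a single pass (Welford's algorithm), so the observations never need to sit in memory together. For wide data, it checks the rank of a small data matrix with more variables than subjects: the rank of X'X equals the rank of X, which is at most n, so when p > n the p-by-p matrix X'X is singular and ordinary least squares has no unique solution. The names `RunningStats` and `matrix_rank`, the generator standing in for a large file, and the tiny matrix `X` are all made up for the example.

```python
from fractions import Fraction


class RunningStats:
    """Mean and sample variance from a single pass (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def matrix_rank(rows):
    """Rank of a small matrix by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Long data: values arrive one at a time. The generator is a hypothetical
# stand-in for reading a file that is too large to load at once.
stats = RunningStats()
for value in (float(i) for i in range(1, 1001)):
    stats.update(value)
print("n =", stats.n, "mean =", stats.mean, "variance =", stats.variance)

# Wide data: n = 3 subjects measured on p = 5 variables (made-up numbers).
X = [[1, 2, 0, 4, 1],
     [0, 1, 3, 1, 2],
     [2, 0, 1, 1, 5]]
n, p = len(X), len(X[0])
rank = matrix_rank(X)
# rank(X'X) = rank(X) <= n < p, so X'X cannot be inverted.
print("n =", n, "p =", p, "rank of X =", rank)
```

The two halves fail in different ways, which is part of why one label for both is unsatisfying: the long data problem goes away with a smarter pass over the data, while the wide data problem needs a different statistical method altogether.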

Business, health, and many other areas are moving to have big data issues in both directions. Big data creates problems because of its volume, but the term doesn't describe what you are doing or why, or really even what sort of problems you might encounter. Since I focus on statistical methods, I am always thinking about the data source (implications on the scope of inference), data visualization, data analysis, and (correct) interpretation of the statistical results. All of these reflect the analysis of big data. In other areas of statistics, we describe the type of data and add "analysis" to it. So we could discuss "big data analysis" just like we say "multivariate data analysis" or "time series analysis" (the names of courses I am teaching right now) or "functional data analysis" (where much of my research is focused). Or maybe it would be good to be more specific and use terms like "wide data analysis" and "long data analysis", but I am not so sure... I've tried saying "big data data analysis" but it is a bit clunky. Maybe its adoption as a term would keep people from using it so much?

Or maybe we should talk about "all data" data analysis and not just be focused on the problems created by measuring "everything".

Whatever it gets called, it would be useful to talk about analyses that try to address research questions, whether they succeed, and how to interpret the results.
