Monday, April 1, 2013

Let the data speak!

I am against the trend of letting the data, and only the data, speak for themselves. But I do have some thoughts on what they should speak to. Namely, questions!

Asking the right question is an important, important skill. What is a good question? It is relevant, and we have a good line of attack on it! In statistical science, the second point means we have the right data! If you only have 10 data points and you wish to estimate a super-complicated model involving 10 parameters, I wish you good luck.
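To make the "10 points, 10 parameters" worry concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the made-up truth (a straight line plus noise), the choice of a degree-9 polynomial as the 10-parameter model, and the helper name `lagrange_fit`. The point is only that such a model reproduces every noisy observation exactly, so the fit cannot tell signal from noise.

```python
import random

def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through all points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

random.seed(0)
xs = [float(i) for i in range(10)]
# Hypothetical truth: y = 2x plus standard normal noise.
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# A degree-9 polynomial has 10 coefficients: one per data point.
p = lagrange_fit(xs, ys)

# The fit passes through every noisy point, leaving nothing to estimate noise from.
max_residual = max(abs(p(x) - y) for x, y in zip(xs, ys))
print("largest in-sample residual:", max_residual)

# Just outside the data, compare with the true line's value of 21.0.
print("prediction at x = 10.5:", p(10.5))
```

A perfect in-sample fit here is a symptom, not a success: with as many parameters as observations there are no degrees of freedom left over.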

When I took Econometrics, I was cynical. I was unhappy with linear models. As far as I could see, the relations we try to estimate could be quite non-linear. However, there is a point that I am just beginning to see. For a long time (and still in many areas today), data were scarce. It would be unrealistic to do a highly structural estimation, and it makes perfect sense to restrict ourselves to linear models. At least we recover correlations. The caveat many miss is that once the model is misspecified, the coefficients we get might not be interpretable, or at least require more thought and more qualifiers.
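The caveat about misspecification can be illustrated with a small sketch. The setup is hypothetical: the true relation is assumed to be y = x^2 with no noise, and `ols_slope` is just an illustrative helper for simple one-regressor least squares. The fitted linear slope then depends on which x values happen to be in the sample, so it cannot be read as "the effect of x".

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of y on x (single regressor with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Assumed true relation: y = x^2 (quadratic, not linear).
xs_small = [float(x) for x in range(5)]    # x = 0, ..., 4
xs_large = [float(x) for x in range(10)]   # x = 0, ..., 9

slope_small = ols_slope(xs_small, [x ** 2 for x in xs_small])
slope_large = ols_slope(xs_large, [x ** 2 for x in xs_large])

# Same underlying relation, different "coefficient" depending on the sampled range.
print(slope_small, slope_large)
```

The linear fit still summarizes the correlation within each sample, which is the sense in which "at least we recover correlations"; it just does not describe a fixed structural quantity.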

If I indeed had lots of observations for each possible vector of independent variables, I could impose no structure at all: I could simply estimate the mean of y_i for each possible vector of independent variables, and there we go. I am not saying that's the right way to do it, but it would be OK.
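The "no structure" estimator described above can be sketched in a few lines. The data and the function name `cell_means` are made up for illustration, and the sketch assumes the independent variables take only a few discrete values, so that each combination is observed several times.

```python
from collections import defaultdict

def cell_means(xs, ys):
    """Estimate E[y | x] by averaging y within each distinct vector x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        key = tuple(x)
        sums[key] += y
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical toy data: two binary regressors, two observations per cell.
xs = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
ys = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 10.0, 12.0]

means = cell_means(xs, ys)
for cell in sorted(means):
    print(cell, means[cell])
```

No functional form is imposed, but the number of cells grows quickly with the number of variables and their possible values, which is why this only works when observations are plentiful.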

A related talk I heard last semester was on spatial statistics. It was about estimating earthquake frequency, and it discussed how fine the grid should be. The solution is to let the data decide! If you happen to have lots of observations, you can have finer grids...duh.

My view is that there is no physical probability! The world is deterministic. (For example, the coin flip is not random! At least Persi can toss it however he wants.) There is only belief. Lots of things we observe in life are just very complicated, and obtaining enough data to recover the deterministic relation becomes a bad, bad question (because it is impossible). What is our job? We try our best to collect data, and based on the available data, we try to come up with the right question! The question involves a model with parameters to be estimated. The complexity of the model should match the data. What we cannot incorporate, we treat as "noise". It is at our discretion to decide what is noise (the log term, or the quadratic term, or that variable). This is where reasoning comes in. From research in the corresponding fields, we get a sense of what might be important and what might be negligible, and we make our assumptions accordingly. Data and assumptions are complements, not substitutes!

We are in the era of big data. There is no denying it. But instead of being lazy and striving to make as few assumptions as we can, we should see that the availability of new data grants us the opportunity to ask questions that could not be asked before! Ask not what you can leave the data to do; ask what you can do with the data!
