Do your data analysts learn enough about your business environment?

16/06/22 03:33 PM
In many areas of statistics, field expertise must enter the analytical process to generate robust and reliable information. For instance, modeling the potential causes of variation in a metric of interest (e.g., daily sales, price elasticity) can benefit strongly from input by field experts (e.g., sales assistants, product managers). This is especially true when the sample size is close to the number of potential explanatory variables.

Suppose we are interested in the customer characteristics that explain how many beauty products customers buy on our platform.

Let us assume a first situation in which one of our field experts shares with our analyst his intuition that "weather" and "sex" influence customers' cosmetics purchases. If our analyst follows this advice, he will test the different models that include these two variables as potential causes and will not consider the other variables present in our customer database. If there is a causal link between those two variables and cosmetics purchases (and no other variable has a strong influence), some of these models will show good accuracy and allow us to draw reliable conclusions about the causes influencing the number of cosmetics bought.

Now, let us say that our analyst is left to fit the model without any input from a field expert. Because he does not know which variables are expected to influence cosmetics purchases, he will test all models including the weather, the customer's sex, and all other variables present in the database (e.g., sock purchases). Let us assume again that these other variables do not influence the shopping behavior that interests us.

Including all those additional variables in the model will likely lead to the detection of spurious effects. In other words, our results will suggest that some variables have a significant effect on customer behavior even though they do not. One reason is that the estimated effects of some variables might be biased away from zero in the sample. For instance, although there is no link between sock and cosmetics purchases, we might have sampled people who wrongly give the impression that a customer who bought socks in the last month buys fewer cosmetics now. Because this effect is not real, it would most likely not be detected again if we had the opportunity to redo the sampling. Unfortunately, in most cases we have access to only one sample, and there is no way to know which parameters of the model are biased. In addition, the presence of these spurious effects will lead to an overestimation of the precision of the model.
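To make the risk concrete, here is a minimal simulation sketch. It is not part of the original analysis: the function names, the sample size of 50 customers, the 40 noise variables, and the rough 5% significance threshold are all illustrative assumptions. The outcome is generated independently of every candidate variable, so any variable that passes the threshold is a spurious effect by construction.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_customers=50, n_noise_vars=40, seed=0):
    """Count unrelated variables that look 'significant' in one sample.

    All names and default values are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    # Outcome (e.g., cosmetics purchases), unrelated to every predictor below.
    purchases = [rng.gauss(0, 1) for _ in range(n_customers)]
    # Rough two-sided 5% cutoff for |r| when there is no true effect
    # (normal approximation; an exact t-based cutoff is slightly higher).
    threshold = 1.96 / math.sqrt(n_customers)
    hits = 0
    for _ in range(n_noise_vars):
        # A variable with no link to the outcome (e.g., sock purchases).
        noise_var = [rng.gauss(0, 1) for _ in range(n_customers)]
        if abs(pearson_r(noise_var, purchases)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Repeat the "single sample" many times to see the typical count.
    counts = [count_spurious(seed=s) for s in range(200)]
    print("average spurious effects per sample:", sum(counts) / len(counts))
```

With a 5% threshold, roughly 5% of the unrelated variables are expected to pass in any given sample, so screening 40 of them should flag about two on average even though none has a real effect. Which ones are flagged changes from sample to sample.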

To avoid this problem, field experts should express their a priori knowledge about the variables that they believe to have some influence (sex and weather in the example above). This information helps analysts decide which of the many conceivable models are reasonable to test. In the cosmetics example, the analyst who received field-expert input tested models with weather and sex as explanatory variables and without sock purchases. This strategy avoided selecting models with spurious effects, which would have provided information that cannot be generalized beyond the sample used.

That being said, one might be willing to take the risk of having spurious effects in the models if one can afford future investigations to confirm those effects or rule them out.
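One way such a follow-up investigation could look is sketched below, under stated assumptions: a second, independent sample is drawn, and an effect is kept only if it is flagged in both samples. The variable named "weather" is a standardized stand-in for a real driver with an assumed slope of 0.5, the 40 other variables are pure noise, and all function names and the rough 5% threshold are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def draw_sample(rng, n_customers, n_noise_vars, true_slope):
    """Simulate one sample: a real driver, unrelated variables, the outcome."""
    weather = [rng.gauss(0, 1) for _ in range(n_customers)]
    noise = [[rng.gauss(0, 1) for _ in range(n_customers)]
             for _ in range(n_noise_vars)]
    # Only 'weather' truly influences purchases (assumed slope).
    purchases = [true_slope * w + rng.gauss(0, 1) for w in weather]
    return weather, noise, purchases


def flagged(sample, threshold):
    """Names of the variables whose |correlation| exceeds the threshold."""
    weather, noise, purchases = sample
    names = set()
    if abs(pearson_r(weather, purchases)) > threshold:
        names.add("weather")
    for i, column in enumerate(noise):
        if abs(pearson_r(column, purchases)) > threshold:
            names.add(f"noise_{i}")
    return names


def confirm(n_customers=200, n_noise_vars=40, true_slope=0.5, seed=1):
    """Return (flagged in first sample, flagged again in second sample)."""
    rng = random.Random(seed)
    threshold = 1.96 / math.sqrt(n_customers)  # rough 5% cutoff
    first = flagged(
        draw_sample(rng, n_customers, n_noise_vars, true_slope), threshold)
    second = flagged(
        draw_sample(rng, n_customers, n_noise_vars, true_slope), threshold)
    return first, first & second


if __name__ == "__main__":
    first, confirmed = confirm()
    print("flagged in first sample:", sorted(first))
    print("confirmed in second sample:", sorted(confirmed))
```

In this setup the real driver should be flagged in both samples, whereas a given noise variable has only about a 5% chance of passing the second time, so most spurious effects are ruled out by the confirmation step.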

Julien Massoni