By Craig Kolb, Acentric Marketing Research (Pty) LTD, 22 September 2016
Tables sometimes just don’t cut it and more advanced techniques are needed to obtain additional insights. Whether you’re on the supply-side or wedged somewhere between suppliers and internal clients, it is useful having additional tricks up your sleeve. A challenge though is communicating the value of these techniques to end-clients. In this article I will explain how one particular technique – logistic regression – can help provide additional insights beyond what basic tables provide. Hopefully this will inspire you to get more from your surveys in future.
It is quite possible that as a research user you are unaware of what is possible. We live in the age of Survey Monkey and DIY research, which unfortunately means awareness of what’s inside the marketing research ‘box’ is limited (for some) to what they see offered by the most common survey apps. It is worthwhile remembering that many of these common survey tools are created by IT people with little knowledge of the discipline of marketing research, and that more sophisticated software (and techniques that live independent of software) are out there.
At the same time I hope to communicate some of the steps and thought processes involved in running this sort of analysis. As always, we need to keep in mind that the more complex the technique, the more there is to go wrong. Careful planning is therefore required in order to deliver valid results. This planning should aim to ensure the alignment of research objectives, questionnaire design, sample design, data collection and data preparation, as these are critical prerequisites for a model that will provide reliable answers and predictions. Too often these are ignored and statistical models are fitted in a haphazard manner after the fact.
Logistic regression has practical application in solving real-world marketing problems. Three broad types of business problem can be solved using this technique, some or all of which you are likely to have encountered before: 1.) estimating percentages, such as the percentage of customers within certain groups, or the probability that people have a specific characteristic; 2.) predicting the probability that each individual customer belongs to a specific group, or has a characteristic of interest; and/or 3.) determining the impact of different scenarios on these probabilities or percentages. Consider the following examples:
- You promote retirement products via the internet and you’re thinking of switching your target segment from black females nearing retirement – with mostly primary school education – to those who have a university education. Your target segment must have internet access, so you want to know whether this change will substantially increase the percentage with internet access. You have a theory that the age of these consumers results in ‘technology aversion’, and you are not sure whether a better education is enough to counterbalance this effect. You need at least a 5% improvement for the switch to be justified. Will this requirement be met? Logistic regression can answer this.
- You know that there is a 10% chance that any given member will cancel their membership each year. You want to offer a special promotion, but you don’t want to waste your marketing budget by targeting all the members on your database. In fact you can only afford to target 5% of members, so what do you do? Logistic regression gives you a way to calculate individual probabilities of cancellation, so that you can identify and focus on the 5% who are most likely to leave.
- Your restaurant is facing reduced demand in a poor economy and you want to stop profit losses. You can’t afford to spend more money on marketing. In particular you want to know: 1.) how far you can go in changing your most popular menu item in terms of portion size and ingredients, since you don’t want to reduce perceived quality/value so much that the lower prices don’t compensate in the consumer’s mind; and 2.) what the impact of price reductions would be, and how far you would need to go. You realise the impact on profit depends on a complex interplay between price, perceived value of the offering and unit costs. In this case a cousin of logistic regression, the multinomial logit, would be used.
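The membership example above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the log odds are made-up values standing in for the output of a fitted model, and the function names `cancellation_probability` and `select_top_share` are hypothetical.

```python
import math

def cancellation_probability(log_odds):
    """Convert predicted log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def select_top_share(probabilities, share):
    """Return the member ids with the highest probabilities, limited to `share` of the database."""
    n_targets = max(1, round(len(probabilities) * share))
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:n_targets]

# Hypothetical predicted log odds of cancellation for 40 members (illustrative values only).
predicted_log_odds = {f"member_{i:02d}": -4.0 + 0.1 * i for i in range(40)}
probabilities = {m: cancellation_probability(z) for m, z in predicted_log_odds.items()}

# Target only the 5% of members most likely to cancel.
targets = select_top_share(probabilities, share=0.05)
print(targets)
```

In a real project the log odds would come from the estimated model applied to each member's record rather than from a formula like the one used here.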
The need for logistic regression
Why not simply use cross-tabulations to answer these questions? While a simple cross-tabulation will allow you to estimate a conditional probability from your survey data, this becomes more difficult when you have more than one condition. Firstly, you may try nested tables, but as additional conditions are included the table soon becomes unmanageably large and difficult to read. Secondly, if the condition is a continuous variable (e.g. age is sometimes available as an exact value) a cross-tabulation won’t be particularly useful without recoding the variable into distinct categories or brackets, which means introducing subjective decisions into the process, not to mention the extra work and loss of precision. Lastly, the sub-sample sizes shrink very quickly as you add more dimensions to a cross-tabulation, leading to very wide confidence intervals. In these cases it may be worthwhile to accept the simplification involved in using a model in exchange for the precision that would otherwise be lost, and logistic regression provides an easy way of estimating conditional probabilities.
A quick note on terminology. Most of my work relates to data from custom-survey samples, which are common in marketing research, not ‘big data’. So I stick to the original terminology of the statistics and mathematics fields (the data sciences, as they are sometimes referred to) rather than the terminology data miners are more familiar with. I have however kept the article as generic as possible in relation to software packages. Most major statistics packages, as well as many data mining tools, include logistic regression. Some examples include: SPSS, R, NCSS, SAS, Minitab and Rapidminer.
Using logistic regression to calculate probabilities
Logistic regression is a statistical modelling (data science) technique that estimates model parameters from observations of individual sampling elements. If you have survey data indicating the outcome of interest and additional covariates describing each sample element, you can use logistic (or multinomial logistic) regression. Preferably this sample should be the result of a probability sampling process called simple random sampling (SRS), but in practice it will often be a non-probability sample.
Technically, logistic regression belongs to a class of models referred to as generalised linear models. It has a logit link function, which means that the resulting equation predicts log odds rather than the actual values observed. These can then be converted to odds, and then probabilities, using the following formula.
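The conversion can be written out as below. This is the standard relationship between log odds, odds and probability, given here as a reconstruction; the symbol z is an assumed label for the predicted log odds.

```latex
\text{odds} = e^{z}, \qquad
p = \frac{\text{odds}}{1 + \text{odds}} = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
```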
A simple example – predicting internet access
You wish to know how the probability of having internet access varies depending on the gender and age of an individual. This example is represented below in Equation 2.
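A standard way of writing a two-variable logistic regression, consistent with the symbols defined in the text that follows, is sketched here. The subscripted betas are assumed labels for the two coefficients.

```latex
\ln\!\left(\frac{p}{1 - p}\right) = a + \beta_1 x_1 + \beta_2 x_2
```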
Where p represents the probability, x1 represents an indicator for gender, while x2 represents age. The intercept a and the coefficient betas are estimated using maximum likelihood estimation (frequently using the Newton-Raphson method). The log odds predictions can be easily converted into more intuitive probability estimates. This can be expanded to include more independent variables, such as the level of education of the consumer, their ethnic group and whether they live in an urban or rural area.
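To make the conversion from log odds to probability concrete, the sketch below applies it to the gender and age example. The intercept and coefficients are hypothetical values chosen for illustration, not estimates from the survey discussed in this article, and `predict_probability` is an assumed helper name.

```python
import math

def predict_probability(intercept, coefficients, values):
    """Predict a probability from a logistic regression equation."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    odds = math.exp(log_odds)      # log odds -> odds
    return odds / (1.0 + odds)     # odds -> probability

# Hypothetical estimates (illustration only):
a = 1.0                 # intercept
betas = [0.4, -0.05]    # beta_1 for gender (1 = male, 0 = female), beta_2 for age in years

p_female_30 = predict_probability(a, betas, [0, 30])
p_male_30 = predict_probability(a, betas, [1, 30])

print(f"Female, 30: {p_female_30:.3f}")
print(f"Male, 30:   {p_male_30:.3f}")
```

Adding further independent variables only lengthens the two lists; the conversion itself stays the same.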
Once the model is estimated, it can be used to predict probabilities or percentages. A user-friendly way of visualising the model output is through a simulator, set up in Microsoft Excel.
Basic simulator tools assist in visualising the impact of changes
Using the results of the logistic regression analysis, a basic simulator tool can be created. This allows for a more interactive experience with the model, so that you can see the impact of various changes on the estimated probability. In Figure 1 the simulator predicts a very low probability of 0.03 (3%) of having internet access for a 30-year-old rural black female with primary school as the highest level of education. This individual would also be unemployed (but still looking for work).
Figure 1: Example of a basic simulator, set to a specific scenario
We can then make alterations to this scenario, step by step. So for example, we could change unemployed (looking) to unemployed (not looking). This increases the probability to 0.05 (5%); and we can carry on in this way ‘playing’ with scenarios in various ways to assess the impact on the probability of a consumer having internet access. This is shown in Figure 2 below.
Interpreting the coefficients
It is important to understand that the exponentiated coefficient for a specific independent variable needs to be interpreted in terms of the remaining coefficients, and the intercept, when attempting to understand the impact on the probability of the event. The exponentiated coefficient for a specific level, multiplied by the exponentiated intercept, gives the odds of the ‘event’ for that level, holding all other categorical independent variables at their reference levels (the exponentiated coefficients at the reference levels are 1 and so have no effect when multiplied out). For instance male may have an exponentiated coefficient of 3, meaning the odds of the event for males are 3 times the odds for females, but this only has meaning if you understand what the baseline odds of the event are. While in the case of a single condition you could simply multiply the exponentiated intercept by the exponentiated coefficient for the specific condition (level of the independent categorical variable) to obtain the odds of the event, this becomes impractical when combinations of multiple conditions are of interest.
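A short worked example shows why the baseline matters. The exponentiated coefficient of 3 is taken from the text above, while the two baseline odds are assumed values chosen purely for illustration.

```python
def odds_to_probability(odds):
    """Convert odds into a probability."""
    return odds / (1.0 + odds)

odds_ratio_male = 3.0  # exponentiated coefficient for male, as in the text

results = []
for baseline_odds in (0.05, 1.0):  # assumed baseline (female) odds of the event
    p_female = odds_to_probability(baseline_odds)
    p_male = odds_to_probability(baseline_odds * odds_ratio_male)
    results.append((baseline_odds, p_female, p_male))
    print(f"baseline odds {baseline_odds}: female {p_female:.3f}, male {p_male:.3f}")
```

With a low baseline the same odds ratio moves the probability from roughly 5% to 13%, while with even baseline odds it moves it from 50% to 75%. The odds triple in both cases, but the change in probability is very different.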
A common solution is to set the independent variables based on some or other measure of central tendency – such as the mode for categorical variables with levels or the mean for continuous variables – in order to provide a baseline; as shown in Equation 3.
The effect can then be evaluated in terms of predicted probabilities rather than odds. For instance, if gender is the variable of interest, you might then evaluate the impact on the predicted ‘event’ probability when switching between male and female, rather than attempting to interpret the coefficients on their own.
While simple to implement, approaches that set levels based on modes or other measures of central tendency are not without limitations. The values of independent variables will in reality not be fixed to single values throughout the population (unless they are alternative specific levels for products, as is the case in discrete choice models used in marketing research). Marginal standardization is preferred over conditional prediction methods that set all confounders to a specified value. In this approach the marginal distributions of independent variables are used to provide weighted probability estimates. (1) In practice this might mean making predictions for each individual observation using the actual independent variable values, while holding the independent variable of interest fixed at the level you want, and then averaging those predictions.
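The marginal standardization idea can be sketched as follows. Everything here is an assumption made for illustration: the coefficients, the five-person sample and the helper names `predict` and `marginal_probability` are hypothetical, not taken from the internet access model.

```python
import math

def predict(intercept, betas, row):
    """Predicted probability for one respondent."""
    log_odds = intercept + sum(betas[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

def marginal_probability(intercept, betas, sample, variable, level):
    """Average prediction over the sample with `variable` fixed at `level` for everyone."""
    predictions = [predict(intercept, betas, {**row, variable: level}) for row in sample]
    return sum(predictions) / len(predictions)

# Hypothetical coefficients and a tiny hypothetical sample (illustration only).
intercept = -1.0
betas = {"male": 1.1, "age": -0.03, "urban": 0.8}
sample = [
    {"male": 0, "age": 25, "urban": 1},
    {"male": 1, "age": 40, "urban": 0},
    {"male": 0, "age": 58, "urban": 0},
    {"male": 1, "age": 33, "urban": 1},
    {"male": 0, "age": 47, "urban": 1},
]

# Every respondent keeps their own age and urban/rural value; only gender is switched.
p_if_male = marginal_probability(intercept, betas, sample, "male", 1)
p_if_female = marginal_probability(intercept, betas, sample, "male", 0)
print(f"male: {p_if_male:.3f}, female: {p_if_female:.3f}")
```

The difference between the two averages is the estimated effect of gender across the sample as it actually is, rather than for a single 'typical' respondent.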
Before using the model, a variety of assumptions must be met for the model to be useful:
- Independent observations. Selection of one person into the sample shouldn’t affect the probability of another being included.
- Linearity. The logit has a linear relationship with the independent variables.
- Multi-collinearity should not exist. You shouldn’t be able to accurately predict one independent variable from another.
- The model should be correctly specified – all important variables included and extraneous excluded as best as possible.
- Independent variables shouldn’t have measurement error. In reality they will, and so this assumption is usually relaxed. However, gross violations are a problem. Outlier analysis is an important part of the screening process to identify potential measurement/capturing errors. Even if outliers are not errors, they will skew coefficients.
- The sample size should be reasonable. Rules of thumb, such as Peduzzi’s rule of at least ten events per independent variable, can be used as a guideline.
Examining how these are assessed for violations is beyond the scope of this short article. It is enough to be aware that there are a few things that can get in the way of a quick analysis. It may be the case that a different type of model is required, or that additional data needs to be collected in order to develop a successful model.
I will however look in more detail at measures of predictive accuracy and model fit, as these are among the most important ways of assessing how successful the model is once the assumptions have been met.
Measuring predictive accuracy
Perhaps the most intuitive method of evaluating model accuracy is the confusion matrix (no pun intended!). The confusion matrix uses a grid of four cells. The two columns represent the predicted outcomes while the two rows represent the actual outcomes. The percentages in each cell allow an evaluation of the number of true positives versus false positives (false alarms), and true negatives versus false negatives (failures to predict). Figure 3 below is an example of a confusion matrix for a successful model that accurately predicts the true outcome most of the time (90%). In only 10% of observations is the prediction wrong.
Figure 3: Confusion matrix
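The four cells of a confusion matrix can be counted with very little code. The sketch below uses ten made-up observations and the conventional 0.5 threshold; the data and the function name `confusion_matrix` are assumptions for illustration and are unrelated to Figure 3.

```python
def confusion_matrix(actual, predicted_probabilities, threshold=0.5):
    """Count true/false positives and negatives at a given probability threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(actual, predicted_probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1   # event predicted and observed
        elif predicted == 1 and y == 0:
            counts["FP"] += 1   # event predicted but not observed
        elif predicted == 0 and y == 0:
            counts["TN"] += 1   # non-event predicted and observed
        else:
            counts["FN"] += 1   # event observed but not predicted
    return counts

# Hypothetical outcomes (1 = event) and model probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7, 0.05, 0.35]

matrix = confusion_matrix(actual, probabilities)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix, accuracy)
```

Re-running the function with a different `threshold` shows directly how the split between false positives and false negatives shifts.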
While intuitive, the confusion matrix depends on the particular threshold used. While the standard is 0.5, you may want to test different thresholds if for some reason you want to favour specificity or sensitivity. You could evaluate these thresholds manually, but the receiver operating characteristic (ROC) curve (standard output in many statistics packages) provides a quicker way of doing this. The ROC curve plots the sensitivity (true positive rate) against 1 minus the specificity (false positive rate) over the entire range of possible thresholds from 0 to 1. The further the curve is from the diagonal, the better the predictive accuracy across the whole range of thresholds. The area under the curve (AUC) provides a quantitative way to assess how far the curve is from the diagonal. A perfect model would have an area of 1, while a model that is no better than chance would have an area of 0.5. In the example the area under the ROC curve is 0.83504; see Figure 4 (a).
Figure 4 ROC curve (a) and proportion correct vs cutoff (threshold) (b)
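For readers who want to see what the software is doing, the sketch below traces an ROC curve by sweeping the threshold over the predicted probabilities and then measures the area with the trapezoid rule. The ten observations are made up for illustration and the function names are assumptions; statistics packages produce the same quantities as standard output.

```python
def roc_points(actual, probabilities):
    """Return (false positive rate, true positive rate) pairs across all thresholds."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for y, p in zip(actual, probabilities) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(actual, probabilities) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical outcomes (1 = event) and model probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7, 0.05, 0.35]

curve = roc_points(actual, probabilities)
auc = area_under_curve(curve)
print(f"AUC = {auc:.4f}")
```

The true positive rate on the vertical axis is the sensitivity, and the false positive rate on the horizontal axis is 1 minus the specificity.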
Goodness of Fit (GOF) measures and tests
Goodness of fit measures assess how well the model fits the data. In logistic regression the binary outcome predictions depend not only on fit but also on the probability threshold selected, so assessing model fit allows the model to be evaluated independently of threshold selection. I therefore believe in looking at these measures/tests first.
R2 measures are a familiar measure of model fit since they are used in one of the oldest and most widely used of the dependence models: multiple regression. The measure ranges from 0 (no relationship) to 1 (perfect relationship with all variance explained). In practice statistical models seldom if ever go near these extremes. In fact, a model at either extreme should be viewed with suspicion. Sometimes a dependent variable is accidentally included as an independent variable, either as a direct copy or through a transformed version of the dependent variable, and the resulting R2 is then likely to be 1.
Various pseudo R2 measures have been created to evaluate the results of logistic regression. However, they are a compromise and should be treated as a rough indicator of the strength of the relationship. I say this as some pseudo R2 measures cannot reach the ceiling of 1, and they give widely differing results. When comparing models, you should stick to one pseudo R2 and use the same dependent variable across all models compared. You should also use the same dataset. (2) An easily calculated version is McFadden’s pseudo R2 (Equation 4) if your statistics package provides the null deviance (the null is the model with the intercept only) and the residual deviance of the model you are testing.
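In terms of those two deviances, McFadden’s pseudo R2 takes the standard form below. The symbols for the null and residual deviance are assumed labels.

```latex
R^2_{\text{McFadden}} = 1 - \frac{D_{\text{residual}}}{D_{\text{null}}}
```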
Where the residual deviance is calculated from the likelihood of the saturated model and the likelihood of the proposed model (Equation 5).
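The residual deviance is conventionally defined as below, with the subscripts s and m used here as assumed labels for the saturated and proposed models.

```latex
D_{\text{residual}} = -2 \ln\!\left(\frac{L_m}{L_s}\right) = -2\left(\ln L_m - \ln L_s\right)
```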
In effect this is the same as just saying “negative twice the log likelihood”, which is marked as such in some software outputs, since the log likelihood of the saturated model (the perfect model with a likelihood of 1) is in effect 0. The log likelihood is multiplied by -2 so that it approximates the Chi-Square distribution, which makes it handy for significance testing as explained under Chi-Square model test. Note that the more general case of the deviance is called the ‘likelihood ratio’, where the proposed model (the reduced model) is compared to any other model of which it is a subset (called the full model).
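The relationship between log likelihood, deviance and McFadden’s pseudo R2 can be checked by hand on a small example. In the sketch below the outcomes and probabilities are made up, and the 'model' probabilities are assumed rather than fitted, so the numbers only illustrate the arithmetic.

```python
import math

def log_likelihood(actual, probabilities):
    """Log likelihood of binary outcomes given predicted event probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(actual, probabilities))

def deviance(actual, probabilities):
    """Negative twice the log likelihood (the saturated log likelihood is taken as 0)."""
    return -2.0 * log_likelihood(actual, probabilities)

# Hypothetical outcomes (1 = event) and assumed model probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
model_probabilities = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7, 0.05, 0.35]

# The intercept-only (null) model predicts the overall event rate for everyone.
event_rate = sum(actual) / len(actual)
null_probabilities = [event_rate] * len(actual)

residual_deviance = deviance(actual, model_probabilities)
null_deviance = deviance(actual, null_probabilities)
mcfadden_r2 = 1.0 - residual_deviance / null_deviance
print(f"residual {residual_deviance:.2f}, null {null_deviance:.2f}, McFadden R2 {mcfadden_r2:.3f}")
```

A model that predicts no better than the overall event rate would give a value of 0, since its deviance would equal the null deviance.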
Hosmer & Lemeshow goodness of fit test
I include this for historical interest, and because many software packages still provide this GOF test in their output. However, as detailed below considerable caution needs to be applied in interpretation. The appeal of most goodness of fit ‘tests’ I suspect lies in the ‘either or’ nature of the result, but they do have a ‘straw man’ aspect; therefore acceptance should be regarded as a bare minimum and they should be examined in concert with other tests and measures of fit.
Perhaps the most commonly used goodness of fit (GOF) test is Hosmer and Lemeshow (which I will refer to as HL). HL is a significance test. Instead of providing a measure of the strength of fit such as an R2 measure, HL tests whether the model’s predicted probabilities deviate significantly from the observed outcomes, as they would if, for example, the linearity assumption were violated.
Since it is possible for a model to obtain a reasonable R2 and yet violate the linearity assumption, it is a useful additional check. It is NOT useful when sample sizes are small, as the probability of type II errors increases as sample sizes decrease; conversely, when sample sizes are very large it becomes overly sensitive and prone to type I errors. It works by ranking the predictions and dividing them into roughly equal groups (bins), which are then compared to the actual values. Within each group the log odds are converted to probabilities and summed to provide an estimate of the expected ‘event’ count. A Chi-Square test is used to determine if the deviations between the predicted and actual counts are significant.
Another issue to keep in mind is that the test is sensitive to the number of bins, so any model comparisons should hold the number of bins constant.
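The mechanics of the HL statistic can be sketched as follows. This is a minimal illustration under stated assumptions: the data are simulated with a fixed seed, the function name is hypothetical, and the degrees of freedom follow the conventional choice of the number of bins minus 2. Looking up the p-value is left to Chi-Square tables or a statistics package.

```python
import random

def hosmer_lemeshow_statistic(actual, probabilities, bins=10):
    """Chi-Square style statistic comparing observed and expected counts per bin."""
    ranked = sorted(zip(probabilities, actual))  # rank observations by predicted probability
    n = len(ranked)
    statistic = 0.0
    for b in range(bins):
        group = ranked[b * n // bins:(b + 1) * n // bins]
        if not group:
            continue
        expected_events = sum(p for p, _ in group)   # summed probabilities
        observed_events = sum(y for _, y in group)
        expected_non_events = len(group) - expected_events
        observed_non_events = len(group) - observed_events
        statistic += (observed_events - expected_events) ** 2 / expected_events
        statistic += (observed_non_events - expected_non_events) ** 2 / expected_non_events
    return statistic

# Simulated data (illustration only): outcomes are drawn from the stated probabilities,
# so the 'model' is well calibrated by construction.
rng = random.Random(42)
probabilities = [rng.uniform(0.05, 0.95) for _ in range(500)]
actual = [1 if rng.random() < p else 0 for p in probabilities]

bins = 10
statistic = hosmer_lemeshow_statistic(actual, probabilities, bins=bins)
degrees_of_freedom = bins - 2
print(f"HL statistic = {statistic:.2f} on {degrees_of_freedom} degrees of freedom")
```

Changing `bins` and re-running shows the sensitivity to the number of bins mentioned above.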
Chi-Square model test
Another frequently used test is the Chi-Square test. This has the same problem as HL in relation to sample size. However bins are not an issue, which is an advantage over the HL test. It can be thought of as equivalent in application to the F test in a regression context, in other words a test of model significance (3).
The residual deviance of the model approximately follows a Chi-Square distribution and so can be directly evaluated using Chi-Square tables along with the model’s degrees of freedom.
In the example of the internet access model we obtain a p-value of 1, meaning there is no evidence that the model does not fit. Given the fairly large sample of more than 1,000 interviews it seems reasonable to conclude that the model passes the bare minimum standard.
The Akaike Information Criterion (AIC) (Equation 6) measures the fit of the model while at the same time penalising for the number of parameters. Better models will have a smaller AIC, as they improve fit sufficiently to compensate for any increase in complexity. AIC is useful for comparing alternative models, so that you can identify the best from a set of fitted models. There is no absolute maximum or minimum, however, so it must always be interpreted relative to alternative models.
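The usual definition of the AIC is given below, with the hatted L an assumed symbol for the maximised likelihood of the fitted model.

```latex
\text{AIC} = 2k - 2\ln(\hat{L})
```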
Where k equals the number of parameters.
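Comparing models on AIC is then a matter of simple arithmetic. The log likelihoods and parameter counts below are hypothetical values invented for three candidate models, and the model labels are illustrative only.

```python
def aic(log_likelihood, k):
    """AIC from a model's maximised log likelihood and its number of parameters k."""
    return 2 * k - 2 * log_likelihood

# Hypothetical candidate models: (maximised log likelihood, number of parameters).
candidates = {
    "gender + age": (-520.0, 3),
    "gender + age + education": (-495.0, 5),
    "gender + age + education + region": (-494.5, 9),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("Lowest AIC:", best)
```

In this made-up comparison the largest model fits marginally better than the second, but not by enough to offset its four extra parameters, so the second model has the lowest AIC.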
References
1. Muller CJ, MacLehose RF. Estimating predicted probabilities from logistic regression: different methods correspond to different target populations. s.l. : Int J Epidemiol, 2014, Vol. 43.
2. UCLA: Statistical Consulting Group. FAQ: What are pseudo R-squareds? Institute for Digital Research and Education. [Online] [Cited: 12 September 2016.] http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm.
3. Hair JF, B Black, B Babin, RE Anderson, RL Tatham. Multivariate Statistics. s.l. : Pearson, 2006. ISBN13: 9780130329295.
4. Czepiel, Scott A. Maximum Likelihood Estimation of Logistic. [Online] [Cited: 12 September 2016.] http://czep.net.
5. 7.2.3 – Receiver Operating Characteristic Curve (ROC). Stat 504. [Online] [Cited: 17 September 2016.] https://onlinecourses.science.psu.edu/stat504/node/163.