Dummy variables

A dummy variable is a variable that takes only two values, 0 and 1, and is used to encode qualitative information. It is called a ``dummy" variable because its numerical values have no quantitative meaning of their own; they simply stand in for the categories of a categorical variable. A dummy variable is also referred to as an indicator variable.

It can be interpreted as a ``switch" variable that is either on (d=1) or off (d=0), indicating whether a condition holds or not.

Some examples where a dummy variable is useful include:

  1. educational status (college, no college). A company is interested in estimating the effect of college education on salaries paid within the company.
  2. repair type (electrical, mechanical). A company provides maintenance services for water-filtration systems. The managers believe that the repair time is a function of the number of months since last service and the type of repair problem.
  3. sales regions (A, B, C, D). A manufacturer of copy machines would like to predict the number of copiers sold per week while allowing sales to differ across regions.
  4. type of population (rural, urban)
  5. institution type (public, private)
  6. type of firm (unionized, not unionized)
  7. gender (male, female)
  8. political party (Republican, Democrat)
  9. housing data (with and without pool)
  10. method of payment (check, credit card, cash)
  11. days of the week (weekday, weekend)
  12. season (summer, other seasons)
  13. season (summer, fall, winter, spring)



Dummy variables are used in regression models to analyze and estimate differences among groups.

A dummy variable is an explanatory variable that is included in a regression like other regressors in the multiple regression framework.


Number of groups

If there are n groups, we can define n-1 dummy variables. If instead all n dummy variables are included in a regression that also has an intercept, the dummies sum to the constant term, and the resulting perfect multicollinearity makes it impossible to estimate the regression.
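The rank problem can be checked numerically. The following is a minimal sketch, assuming a made-up sample of six observations in three sales regions; the data and all variable names are illustrative and do not come from the text.

```python
import numpy as np

# Hypothetical sample: six observations spread over three regions.
regions = np.array(["A", "B", "C", "A", "B", "C"])
levels = np.unique(regions)  # sorted unique labels: A, B, C

# One 0/1 column per region (n = 3 dummies).
all_dummies = (regions[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(regions), 1))

# Intercept plus all n dummies: the dummy columns add up to the intercept.
X_trap = np.hstack([intercept, all_dummies])
# Intercept plus n-1 dummies: region A is left out.
X_ok = np.hstack([intercept, all_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_trap.shape[1], "rank:", rank_trap)  # rank below column count
print("columns:", X_ok.shape[1], "rank:", rank_ok)      # full column rank
```

With all n dummies the design matrix has more columns than its rank, so the least-squares coefficients are not uniquely determined. Dropping one dummy restores full column rank.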


Reference group

One of the two groups in the definition of a dummy variable is called the ``excluded" group and the other the ``included" group. The included group is the one identified with a value of 1 in the definition of the dummy variable; the excluded group carries a value of 0. The excluded group is also referred to as the ``control" group or the ``benchmark" group. It is the reference against which comparisons are made, and it is the category for which no dummy variable is included in the regression. For instance, if d=1 for females and d=0 for males, and we include d, then the left-out group is males, which becomes the reference group. The estimated coefficient on d must be interpreted relative to this reference group.
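To illustrate the role of the reference group, here is a small sketch that regresses a made-up salary variable on an intercept and a single dummy. The numbers are hypothetical. In this simple case the intercept equals the mean of the reference group, and the dummy coefficient equals the difference between the two group means.

```python
import numpy as np

# Hypothetical salaries (in thousands); d = 1 for females, d = 0 for males.
d = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([50.0, 54.0, 52.0, 47.0, 49.0, 48.0])

# Design matrix: a constant and the dummy variable.
X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, delta = coef

mean_males = y[d == 0].mean()    # reference group (d = 0)
mean_females = y[d == 1].mean()  # included group (d = 1)

print("intercept:", intercept, "mean of reference group:", mean_males)
print("dummy coefficient:", delta, "difference in means:", mean_females - mean_males)
```

Recoding the dummy so that d=1 for males would flip the sign of the coefficient and shift the intercept to the female mean. The fitted values would be unchanged.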


Common slope

Suppose that the effect of x on y is the same for both groups, and that regardless of the level of x there is a systematic difference between the two groups. Graphically, the situation is depicted by two parallel lines, with different intercepts.

(See images.)
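The parallel-lines case can be written as a regression equation. The notation below is an assumption introduced for illustration: d is the dummy, x the quantitative regressor, u an error term with zero mean conditional on x and d, and \beta_0, \beta_1, \delta are coefficient names not used elsewhere in the text.

```latex
\begin{align*}
y &= \beta_0 + \delta d + \beta_1 x + u \\
E[y \mid x, d=0] &= \beta_0 + \beta_1 x \\
E[y \mid x, d=1] &= (\beta_0 + \delta) + \beta_1 x
\end{align*}
```

Both lines share the slope \beta_1, and \delta is the vertical distance between them at every level of x.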



An interaction between a dummy variable and a quantitative variable allows the analyst to estimate differences in slope among groups. For instance, if we include the ``product" variable sex*experience in the salary equation, the coefficient on that variable indicates whether males and females differ in the additional salary obtained from an additional year of experience.

(See images.)
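The interaction idea can be sketched in code. The example assumes noise-free, made-up data generated from known coefficients, so the fit recovers them exactly. The variable `female` plays the role of the sex dummy, and every number here is hypothetical.

```python
import numpy as np

# Hypothetical data: years of experience and a sex dummy (1 = female).
experience = np.array([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0])
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Assumed true coefficients: intercept 30, dummy -2,
# slope 2.0, and slope difference -0.5 for the d = 1 group.
salary = 30.0 - 2.0 * female + 2.0 * experience - 0.5 * female * experience

# Regressors: constant, dummy, experience, and the product (interaction) term.
X = np.column_stack(
    [np.ones_like(experience), female, experience, female * experience]
)
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
b0, delta, b1, gamma = coef

slope_males = b1            # return to experience when female = 0
slope_females = b1 + gamma  # return to experience when female = 1

print("slope for d = 0:", slope_males)
print("slope for d = 1:", slope_females)
print("slope difference (interaction coefficient):", gamma)
```

The coefficient on the product term is the difference between the two slopes. A value of zero would bring back the common-slope case with two parallel lines.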