
This short video elaborates upon the information displayed in a boxplot.



From the online version of Little Handbook of Statistical Practice, this reading contains examples of numerous exploratory graphical displays.


- to practice comparing and contrasting distributions, and
- to help you gain more intuition about variability through the interpretation of your results in context.

The percentage of each entering freshman class that graduated on time was recorded for each of six colleges at a major university over a period of several years. (Source: These data are distributed with the software package Data Desk (1993; Ithaca, NY: Data Description, Inc.) and appeared in the Data and Story Library.)

In order to compare the graduation rates among the different colleges, we will create side-by-side boxplots (graduation rate by college), and supplement the graph with numerical measures. Answer the questions based on the SPSS output provided.



**Related SAS Tutorials**

- 7A (2:32) Numeric Summaries by Groups
- 7B (3:03) Side-By-Side Boxplots

**Related SPSS Tutorials**

- 7A (3:29) Numeric Summaries by Groups
- 7B (1:59) Side-By-Side Boxplots

Recall the role-type classification table for framing our discussion about the relationship between two variables:

We are now ready to start with Case C→Q, exploring the relationship between two variables where the explanatory variable is categorical, and the response variable is quantitative. As you’ll discover, exploring relationships of this type is something we’ve already discussed in this course, but we didn’t frame the discussion this way.

**Background:** People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat). The purpose of the study was to examine whether the **number of calories** a hot dog has is related to (or affected by) its **type**. (Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Consumer Reports, June 1986, pp. 366-367.)

Answering this question requires us to examine the relationship between the categorical variable Type and the quantitative variable Calories. Because the question of interest is whether the type of hot dog affects calorie content,

- the **explanatory** variable is **Type**, and
- the **response** variable is **Calories**.

Here is what the raw data look like:

The raw data are a list of types and calorie contents, and are not very useful in that form. To explore how the number of calories is related to the type of hot dog, we need an informative visual display of the data that will compare the three types of hot dogs with respect to their calorie content.

The visual display that we’ll use is **side-by-side boxplots** (which we’ve seen before). The side-by-side boxplots will allow us to **compare the distribution** of calorie counts within each category of the explanatory variable, hot dog type:

As before, we supplement the side-by-side boxplots with the descriptive statistics of the calorie content (response) for each type of hot dog separately (i.e., for each level of the explanatory variable separately):

Let’s summarize the results we obtained and interpret them in the context of the question we posed:

Statistic | Beef | Meat | Poultry |
---|---|---|---|
min | 111 | 107 | 86 |
Q1 | 139.5 | 138.5 | 100.5 |
Median | 152.5 | 153 | 113 |
Q3 | 179.75 | 180.5 | 142.5 |
Max | 190 | 195 | 152 |

By examining the three side-by-side boxplots and the numerical measures, we see at once that poultry hot dogs, as a group, contain fewer calories than those made of beef or meat. The median number of calories in poultry hot dogs (113) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153). The spread of the three distributions is about the same if the IQR is considered (all slightly above 40), but the full ranges vary somewhat more (beef: 79, meat: 88, poultry: 66). The general recommendation to the health-conscious consumer is to eat poultry hot dogs. It should be noted, though, that since each of the three types of hot dogs shows quite a large spread among brands, simply buying a poultry hot dog does not guarantee a low-calorie food.
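The spread figures quoted in this comparison can be re-derived from the table. The following is a minimal sketch in Python (the course itself uses SAS and SPSS, so the language and the names `summaries` and `spread` are assumptions made purely for illustration):

```python
# Five-number summaries copied from the table above:
# (min, Q1, median, Q3, max) calories for each hot dog type.
summaries = {
    "Beef":    (111, 139.5, 152.5, 179.75, 190),
    "Meat":    (107, 138.5, 153, 180.5, 195),
    "Poultry": (86, 100.5, 113, 142.5, 152),
}

def spread(summary):
    """Return (IQR, range) for a five-number summary."""
    low, q1, _median, q3, high = summary
    return q3 - q1, high - low

spreads = {name: spread(s) for name, s in summaries.items()}
for name, (iqr, full_range) in spreads.items():
    print(f"{name}: IQR = {iqr}, range = {full_range}")

# Check that the poultry median lies below the first quartile
# of both other types.
poultry_median = summaries["Poultry"][2]
below_other_q1 = all(
    poultry_median < summaries[name][1] for name in ("Beef", "Meat")
)
print("Poultry median below Beef and Meat Q1:", below_other_q1)
```

With the tabled extremes, the beef range works out to 190 - 111 = 79, and the three IQRs come out as 40.25, 42, and 42.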

What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case C→Q), we essentially **compare the distributions of the quantitative response for each category of the explanatory variable** using side-by-side boxplots supplemented by descriptive statistics. Recall that we have actually done this before when we talked about the boxplot and argued that boxplots are most useful when presented side by side for comparing distributions of two or more groups. This is exactly what we are doing here!

Here is another example:

**Background:** The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study habits, and attitudes toward learning of college students. Is there a relationship between **gender** and **SSHA** scores? In other words, is there a “gender effect” on SSHA scores? Data were collected from 40 randomly selected college students, and here is what the raw data look like:

(Reference: Moore and McCabe. (2003). Introduction to the Practice of Statistics)

Side-by-side boxplots supplemented by descriptive statistics allow us to compare the distribution of SSHA scores within each category of the explanatory variable—gender:

Statistic | Female | Male |
---|---|---|
min | 103 | 70 |
Q1 | 128.75 | 95 |
Median | 153 | 114.5 |
Q3 | 163.75 | 144.5 |
Max | 200 | 187 |

Let’s summarize our results and interpret them:

By examining the side-by-side boxplots and the numerical measures, we see that in general females perform better on the SSHA than males. The median SSHA score of females is higher than the median score for males (153 vs. 114.5), and in fact, it is even higher than the third quartile of the males’ distribution (144.5). On the other hand, the males’ scores display more variability, both in terms of IQR (49.5 vs. 35) and in terms of the full range of scores (117 vs. 97). Based on these results, it seems that there is a gender effect on SSHA score. It should be noted, though, that our sample consists of only 20 males and 20 females, so we should be cautious about making any kind of generalizations beyond this study. One interesting question that comes to mind is, “Why did we observe this relationship between gender and SSHA scores?” In other words, is there maybe an explanation for why females score higher on the SSHA? Let’s leave it to the psychologists to try and answer that one.
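As a quick check of the arithmetic behind this interpretation, the comparisons can be reproduced from the tabled summaries. This is only a sketch in Python; the helper names `iqr` and `full_range` are hypothetical and not part of the course materials:

```python
# (min, Q1, median, Q3, max) SSHA scores, copied from the table above.
female = (103, 128.75, 153, 163.75, 200)
male = (70, 95, 114.5, 144.5, 187)

def iqr(summary):
    """Interquartile range: Q3 - Q1."""
    return summary[3] - summary[1]

def full_range(summary):
    """Full range: max - min."""
    return summary[4] - summary[0]

# Center: the female median exceeds even the male third quartile.
print("Female median above male Q3:", female[2] > male[3])

# Spread: the male scores vary more on both measures.
print("IQR   (male vs. female):", iqr(male), "vs.", iqr(female))
print("Range (male vs. female):", full_range(male), "vs.", full_range(female))
```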

- The relationship between a categorical explanatory variable and a quantitative response variable is summarized using:
  - **Visual display:** side-by-side boxplots
  - **Numerical measures:** descriptive statistics used for one quantitative variable, calculated in each group

- Exploring the relationship between a categorical explanatory variable and a quantitative response variable amounts to comparing the distributions of the quantitative response for each category of the explanatory variable. In particular, we look at how the distribution of the response variable differs across the categories of the explanatory variable.

**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots

Now we introduce another graphical display of the distribution of a quantitative variable, the **boxplot**.

So far, in our discussion about measures of spread, some key players were:

- the extremes (min and Max), which provide the range covered by all the data; and
- the quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.

Recall that the combination of all five numbers (min, Q1, M, Q3, Max) is called the **five-number summary**, and provides a quick numerical description of both the center and spread of a distribution.

We will continue with the Best Actress Oscar winners example.

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

The five number summary of the age of Best Actress Oscar winners (1970-2001) is:

min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80
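The summary above can be reproduced from the raw ages. The sketch below is in Python and assumes that Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, a convention that matches the values shown here; statistical software often uses other interpolation rules and may report slightly different quartiles. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

# Ages of Best Actress Oscar winners (1970-2001), as listed above.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and
    upper halves of the sorted data (the middle value is left out of
    both halves when n is odd). Requires at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary(ages)
print(summary)  # (21, 32.0, 35.0, 41.5, 80)
```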

To sketch the boxplot we will need to know the 5-number summary as well as identify any outliers. We will also need to locate the largest and smallest values which are not outliers. The stemplot below might be helpful as it displays the data in order.

Now that you understand what each of the five numbers means, you can appreciate how much information about the distribution is packed into the five-number summary. All this information can also be represented visually by using the boxplot.

The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.


- The central box spans from Q1 to Q3. In our example, the box spans from 32 to 41.5. Note that the width of the box has no meaning.

- A line in the box marks the median M, which in our case is 35.

- Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5(IQR) criterion). In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21. Since we have three high outliers (61, 74, and 80), the top line extends only up to 49, which is the largest observation that has not been flagged as an outlier.

- Outliers are marked with asterisks (*).
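The construction steps listed above can be sketched in a few lines of Python. This is an illustration only: the quartiles are taken directly from the five-number summary given earlier, and the variable names (`lower_fence`, `whisker_high`, and so on) are assumptions rather than terms used in the course.

```python
# Ages of Best Actress Oscar winners (1970-2001).
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

q1, q3 = 32, 41.5                 # quartiles from the five-number summary

iqr = q3 - q1                     # 9.5
lower_fence = q1 - 1.5 * iqr      # 17.75
upper_fence = q3 + 1.5 * iqr      # 55.75

# Observations beyond either fence are suspected outliers.
outliers = sorted(a for a in ages if a < lower_fence or a > upper_fence)

# The lines (whiskers) end at the most extreme observations
# that are still inside the fences.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Outliers:", outliers)                      # [61, 74, 80]
print("Whiskers:", whisker_low, "to", whisker_high)  # 21 to 49
```

Note that the lines stop at actual observations (21 and 49), not at the fences themselves (17.75 and 55.75).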

To summarize: the following information is visually depicted in the boxplot:

- the five number summary (blue)
- the range and IQR (red)
- outliers (green)

As we learned earlier, the distribution of a quantitative variable is best represented graphically by a histogram. Boxplots are most useful when presented side-by-side for comparing and contrasting distributions from two or more groups.

So far we have examined the age distributions of Oscar winners for males and females separately. It will be interesting to compare the age distributions of actors and actresses who won best acting Oscars. To do that we will look at side-by-side boxplots of the age distributions by gender.

Recall also that we found the five-number summary and means for both distributions. For the Best Actress dataset, we did the calculations by hand. For the Best Actor dataset, we used statistical software, and here are the results:

- Actors: min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76
- Actresses: min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80

Based on the graph and numerical measures, we can make the following comparison between the two distributions:

**Center:** The graph reveals that the males’ ages tend to be higher than the females’ ages. This is supported by the numerical measures. The median age for females (35) is lower than for males (42.5). In fact, even the third quartile of the females’ distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than actors win the Best Actor Oscar.

**Spread:** Judging by the range of the data, there is much more variability in the females’ distribution (range = 59) than there is in the males’ distribution (range = 45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread in the ages of males (IQR = 13) than females (IQR = 9.5). We conclude that among all the winners, the actors’ ages are more alike than the actresses’ ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors’ age distribution.

**Outliers:** We see that we have outliers in both distributions. There is only one high outlier in the actors’ distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses’ distribution.
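The spread and outlier statements above can be checked from the two five-number summaries alone. The Python sketch below is a hedged illustration (the dictionary layout and key names are assumptions). A five-number summary can show whether the minimum or maximum falls beyond a 1.5(IQR) fence, but it cannot show how many outliers there are; that requires the raw data.

```python
# (min, Q1, M, Q3, max) ages, as reported above.
summaries = {
    "Actors":    (31, 37.25, 42.5, 50.25, 76),
    "Actresses": (21, 32, 35, 41.5, 80),
}

results = {}
for name, (low, q1, _m, q3, high) in summaries.items():
    iqr = q3 - q1
    results[name] = {
        "range": high - low,
        "iqr": iqr,
        # The extremes are outliers if they lie beyond the fences.
        "min_is_outlier": low < q1 - 1.5 * iqr,
        "max_is_outlier": high > q3 + 1.5 * iqr,
    }

for name, r in results.items():
    print(name, r)
```

For the actors, the upper fence is 50.25 + 1.5(13) = 69.75, so the maximum of 76 is flagged; for the actresses it is 41.5 + 1.5(9.5) = 55.75, so the maximum of 80 is flagged as well. Neither minimum falls below its lower fence.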

In order to compare the average high temperatures of Pittsburgh to those in San Francisco we will look at the following side-by-side boxplots, and supplement the graph with the descriptive statistics of each of the two distributions.

Statistic | Pittsburgh | San Francisco |
---|---|---|
min | 33.7 | 56.3 |
Q1 | 41.2 | 60.2 |
Median | 61.4 | 62.7 |
Q3 | 77.75 | 65.35 |
Max | 82.6 | 68.7 |

When looking at the graph, the similarities and differences between the two distributions are striking. Both distributions have roughly the same center (medians are 61.4 for Pittsburgh and 62.7 for San Francisco). However, the temperatures in Pittsburgh have much larger variability than the temperatures in San Francisco (range: about 49 vs. 12; IQR: about 36.5 vs. 5).
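To see where the spread figures come from, here is a small Python sketch that computes them from the table. The names `spread`, `pittsburgh`, and `san_francisco` are illustrative assumptions, and the results are rounded to two decimals only to avoid floating-point noise.

```python
# (min, Q1, median, Q3, max) average high temperatures from the table.
pittsburgh = (33.7, 41.2, 61.4, 77.75, 82.6)
san_francisco = (56.3, 60.2, 62.7, 65.35, 68.7)

def spread(summary):
    """Return (range, IQR), rounded to two decimal places."""
    low, q1, _median, q3, high = summary
    return round(high - low, 2), round(q3 - q1, 2)

pit_range, pit_iqr = spread(pittsburgh)
sf_range, sf_iqr = spread(san_francisco)

print(f"Range: {pit_range} vs. {sf_range}")  # 48.9 vs. 12.4
print(f"IQR:   {pit_iqr} vs. {sf_iqr}")      # 36.55 vs. 5.15

# The medians differ by little, while the IQRs differ by a large factor.
print("Difference in medians:", round(san_francisco[2] - pittsburgh[2], 2))
print("Ratio of IQRs:", round(pit_iqr / sf_iqr, 1))
```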

The practical interpretation of the results we obtained is that the weather in San Francisco is much more consistent than the weather in Pittsburgh, which varies a lot during the year. Also, because the temperatures in San Francisco vary so little during the year, knowing that the median temperature is around 63 is actually very informative. On the other hand, knowing that the median temperature in Pittsburgh is around 61 is practically useless, since temperatures vary so much during the year, and can get much warmer or much colder.

Note that this example provides more intuition about variability by interpreting small variability as consistency, and large variability as lack of consistency. Also, through this example we learned that the center of the distribution is more meaningful as a typical value for the distribution when there is little variability (or, as statisticians say, little “noise”) around it. When there is large variability, the center loses its practical meaning as a typical value.

- The five-number summary of a distribution consists of the median (M), the two quartiles (Q1, Q3) and the extremes (min, Max).

- The five-number summary provides a quick numerical description of a distribution. The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread.

- The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. (Some software packages indicate extreme outliers with a different symbol.)

- Boxplots are most useful when presented side-by-side to compare and contrast distributions from two or more groups.