The introductory lab asks students to develop a multiple regression model of the relationship between the value of 2005 used GM cars and various car characteristics, such as mileage, make, model, cruise control, and engine size. The structure of this lab allows students to work through the entire process of model building and assessment, thus providing a guided practice run before tackling a large data set on their own. This bridges the gap between short, focused homework problems and the open-ended nature of a project.
R and RStudio: If students are new to R and RStudio, the following handout outlines some helpful YouTube videos and short assignments to get them started: Introduction to Rstudio. Other videos specific to multiple regression in RStudio: Rstudio for Regression. Updated Chapter 3 R Instructions are also available.
All the materials on the textbook CD are also at: http://www.pearsonhighered.com/mathstatsresources/ (just select K). Additional information for instructors, such as PowerPoint slides, instructor guides, updated datasets, and software instructions, is available upon request. A description of how this lab can be used in the classroom is provided at: http://www.amstat.org/publications/jse/v16n3/kuiper.html
Below is a Chapter 3 syllabus appropriate for an introductory class. A full-semester syllabus appropriate for a senior seminar or graduate-level class is here: Advanced Syllabus
Day 1: Building a Model to Estimate Car Prices
Section 3.1 builds a simple linear regression model. This case study is an example where the t-test provides evidence that Mileage is important in predicting the price of a used car. However, the R-squared value indicates that Mileage alone is not sufficient to get an accurate estimate of price. We build on this example to construct a multiple regression model in later sections. Section 3.2 discusses the distinct goals of multiple regression and Section 3.3 discusses variable selection techniques. During class, the following questions serve as a review of the concepts in linear regression:
- In general, what happens to price when there is one more mile on the car?
- Does the fact that b1 is small (-0.17) mean mileage is not very important? Students often misinterpret the magnitude of b1 as a measure of significance. Here students get a feeling of the importance of scale. For example, "How does the price change if two cars are identical except one has 60,000 more miles?" Students see that b1 = -0.17 can be meaningful since mileage has so much variability.
- Does mileage help you predict price? What does the p-value tell you?
- Does mileage help you predict price? What does the R-Sq value tell you?
- Are there any outliers or influential observations in the data?
- Watch the VIDEO C3a: Introduction to Multiple Regression: Hypothesis Tests and R-squared Calculations
This video demonstrates the reasoning behind the calculations of R-squared and the adjusted R-squared values.
- Read: Chapter 3.1-3.3 of the textbook and work through questions C3.1-C3.4
Note: A previous introduction to simple linear regression is required. Inference for regression is helpful, but not required. I suggest reviewing hypothesis tests for regression coefficients and residual plots before this lab is given. In questions (2) and (3), students often miss the cone-shaped pattern in the residual plot and incorrectly assume that residuals balanced around zero are the most important visual sign of random scatter.
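The scale point in the b1 review question can be checked with a line of arithmetic. The minimal Python sketch below uses only the slope quoted in that question (b1 = -0.17) and its 60,000-mile gap; it assumes price is measured in dollars per mile of slope, and the function name is hypothetical.

```python
def price_difference(slope_per_mile, extra_miles):
    """Predicted price difference between two otherwise identical cars."""
    return slope_per_mile * extra_miles


b1 = -0.17            # estimated slope quoted in the review question
extra_miles = 60_000  # mileage gap from the review question

diff = price_difference(b1, extra_miles)
print(f"Predicted price difference: {diff:,.0f}")
```

A slope that looks negligible per mile corresponds to a predicted price roughly 10,200 lower over 60,000 miles, which is why the magnitude of b1 alone says little about importance.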
Day 2: Variable Selection Techniques
Students are expected to complete Questions 7-13 before class, but I usually allow them to ask questions at the beginning of class. We discuss what to look for in various residual plots, ways to choose which explanatory variables should be included in a multiple regression model, and the importance of understanding the underlying techniques software uses in building these models. When explanatory variables are correlated, stepwise procedures can exclude important terms within the model. A strong emphasis is placed upon using residual plots to validate the model.
- Watch the VIDEO C3b: Variable Selection Methods (https://www.youtube.com/watch?v=iRCjIlB3xxQ)
This video compares best subsets and stepwise variable selection techniques.
- Read: Chapter 3.3 and 3.4 of the textbook (and work through questions 5-13)
Note: Questions 5 and 6:
While students are able to use the software instructions to complete each question, they often have difficulty interpreting the results of Questions 5 and 6. It is useful to discuss variable selection techniques and work through Questions 5-6 on the instructor’s computer. I allow students to work on the next problems for the rest of the class period.
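To make the contrast between best subsets and stepwise selection concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure Minitab or R runs: the data are made up (not the textbook's cars data), adjusted R-squared is used as the only selection criterion, and all function names are hypothetical. Real software often uses other criteria, such as p-value thresholds or Mallows' Cp.

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares from least squares of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))


def adj_r2(columns, y):
    """Adjusted R-squared for a model with len(columns) explanatory terms."""
    n, k = len(y), len(columns)
    mean = sum(y) / n
    sst = sum((v - mean) ** 2 for v in y)
    return 1 - (sse(columns, y) / (n - k - 1)) / (sst / (n - 1))


def best_subset(data, y):
    """Fit every non-empty subset of variables and keep the best one."""
    names = list(data)
    candidates = [c for r in range(1, len(names) + 1) for c in combinations(names, r)]
    return max(candidates, key=lambda c: adj_r2([data[v] for v in c], y))


def forward_selection(data, y):
    """Add one variable at a time, stopping when adjusted R-squared stops improving."""
    chosen, best = [], float("-inf")
    while True:
        remaining = [v for v in data if v not in chosen]
        if not remaining:
            return chosen
        score, var = max((adj_r2([data[c] for c in chosen + [v]], y), v) for v in remaining)
        if score <= best:
            return chosen
        chosen.append(var)
        best = score


# Made-up data for illustration only (price and mileage in thousands)
data = {
    "mileage": [8, 15, 22, 30, 12, 25, 18, 35],
    "liter": [3.8, 3.8, 2.2, 2.2, 4.6, 1.6, 3.1, 2.2],
    "cruise": [1, 1, 0, 0, 1, 0, 1, 1],
}
price = [28, 26, 15, 13, 33, 12, 21, 14]

print("Best subset:      ", best_subset(data, price))
print("Forward selection:", forward_selection(data, price))
```

Best subsets examines every candidate model, so under this criterion it can never do worse than the forward search, which commits to each variable once it is added. With correlated explanatory variables the two approaches can return different models, which is a useful prompt for class discussion.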
Day 3: More on Multiple Regression
These sections discuss correlated explanatory variables, proper interpretation of model coefficients, and incorporating categorical terms in a regression model. With the Minitab and R instructions freely available, students tend not to have difficulty with the calculations, but it is worthwhile spending some time reviewing what these tests really tell us and when they are appropriate to use.
- Read: Chapter 3.4 through 3.7 of the textbook (and work through questions 14-20)
Day 4: Building a "Best" Model Challenge
Students are expected to come to class (often working in groups) with a "best" multiple regression model using the cars data. At the beginning of class students complete a table on the board that has the following columns: Name, R^2, adj R^2, and number of terms within the model. We very briefly discuss several of the better models and ask students to defend their choices and talk about specific terms they included and why. I let students vote on a "best model" and the winners are given a small prize (such as a college pen or a few points of extra credit). Through this activity students clearly understand that a truly best model does not exist; the choice depends upon the goals, objectives, and subjective decisions of the researcher.
- Submit a "Best" multiple regression model (Question 3.21 in the textbook)
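The R^2 and adj R^2 columns in the board table can be computed from each model's sums of squares. The minimal Python sketch below uses made-up sums of squares (not values from the cars data) and hypothetical function names to show why a model with more terms can have a higher R^2 but a lower adjusted R^2.

```python
def r_squared(sse, sst):
    """Proportion of total variation explained by the model."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R^2 for n observations and k explanatory terms (intercept excluded)."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Hypothetical values for illustration only
n, sst = 20, 100.0
models = {"A (2 terms)": (40.0, 2), "B (3 terms)": (39.5, 3)}

for name, (sse, k) in models.items():
    r2 = r_squared(sse, sst)
    adj = adjusted_r_squared(sse, sst, n, k)
    print(f"{name}: R^2 = {r2:.3f}, adj R^2 = {adj:.3f}")
```

With these numbers, adding a third term raises R^2 from 0.600 to 0.605 while adjusted R^2 falls from about 0.553 to about 0.531, so the two columns can rank the same pair of models differently.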
Extended Activities
Sections 3.8 (F-Tests for Multiple Regression) and 3.10 (Interaction and Terms for Curvature) are particularly useful for the student projects.
Section 3.8 provides additional mathematical details. In addition, the extra sum of squares F-test is a very valuable tool for hypothesis testing in multiple regression.
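For reference, the extra sum of squares F-test compares a full model with a reduced model that drops some of its terms. The notation below is the standard form and is an assumption about notation, not copied from the textbook: SSE is the residual sum of squares and df the residual degrees of freedom of each model.

```latex
F = \frac{\left(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}\right) / \left(df_{\text{reduced}} - df_{\text{full}}\right)}
         {\mathrm{SSE}_{\text{full}} / df_{\text{full}}}
```

Under the null hypothesis that all dropped terms have coefficients of zero, and under the usual regression assumptions, this statistic follows an F distribution with (df_reduced - df_full) numerator and df_full denominator degrees of freedom.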
Section 3.9 (Developing a Model to Confirm a Theory) describes the steps needed to identify an appropriate theoretical model when the goal is to confirm a theoretical relationship between variables. This section is often assigned as reading outside of class. Questions 32-34 are simply an exercise in learning how to use regression to test a theory. For example, a theory (or hypothesized model) could be that there is a positive linear relationship between liter and TPrice. Another hypothesized model could be that as Mileage increases, price decreases. Both relationships appear in the data, but each theory could reasonably be proposed before seeing any dataset.
Research Project
Short Project (1-2 class periods): It is possible to only assign the “Exploring the Data” section of the project and ask students to write a 1-2 page summary describing their analysis of the data provided. This can be done outside of class and does not require reading primary literature or writing research papers. Instead of holding a traditional class, I set up 10-15 minute times for each group to meet with me and discuss their model before writing up their summary.
Long Project (4-5 class periods): The book used for this project is very popular and is used in some of Grinnell College’s economics classes. It is inexpensive to put one or two copies on reserve in the college library. I tend to simply put an electronic copy of the chapter on a course site (such as Blackboard).
I have found that some students are initially frustrated or intimidated when they are asked to read a research paper outside of their major. In my experience, providing more time to read the paper has not been helpful. It may, however, be appropriate to:
- clarify that the paper review questions are worth only a small amount of the overall grade
- allow students to complete the assignment in groups
- allow students to turn in a revised version of their paper review questions after they are discussed in class
After students have worked through the provided data, we discuss possible research questions before students collect their own data. The World Bank website (www.worldbank.org) is a reliable source, and students enjoy the opportunity to easily collect data that pertains to their own research question. Student work has improved dramatically since I started using peer review of papers before I grade them.