About the Data

ID: property identification
sale: the actual sale price of the property in Canadian dollars
list: the last list price of the property in Canadian dollars
bedroom: the total number of bedrooms
bathroom: the number of bathrooms
parking: the total number of parking spots
maxsqfoot: the maximum square footage of the property
taxes: previous year’s property tax
lotwidth: the frontage in feet
lotlength: the length in feet of one side of the property
location: M - Mississauga Neighbourhood, T - Toronto Neighbourhood
Missing values are labelled as ‘NA’

The data is in the Resources folder.

I. Data Wrangling

The data used for this report is a random sample of 150 cases from the TREB data for detached homes in Toronto and Mississauga.

Below are the property IDs of the houses in the randomly selected sample.

##   [1]   1   2   3   4   6   7   8   9  10  11  13  14  15  16  17  19  21  23
##  [19]  24  25  26  27  28  30  31  32  33  35  36  37  39  40  41  42  43  44
##  [37]  45  46  48  49  50  52  53  54  55  57  58  61  62  63  64  65  67  68
##  [55]  69  70  71  72  73  74  75  76  79  81  82  84  85  86  87  88  89  90
##  [73]  91  92  93  94  95  96  97  98  99 100 101 103 105 106 107 108 109 110
##  [91] 112 114 115 116 117 118 119 125 126 131 132 135 136 137 140 142 143 144
## [109] 145 146 147 148 149 150 151 154 155 156 157 158 159 160 161 162 163 164
## [127] 165 167 168 169 170 172 174 175 177 178 179 183 185 187 188 190 194 195
## [145] 196 205 207 218 227 229

To simplify the data, I will create a new variable called “lotsize”, defined as lotwidth * lotlength, and remove the lotwidth and lotlength variables.

Since many rows in the data contain NA values, I will remove those rows, as they do not contain the data we need for regression. I will also remove the predictor “maxsqfoot”, because about half of the entries for this variable are missing. With that much missing data, I cannot accurately assess the statistical significance of this variable or how it correlates with the others, so it provides little insight into the price of detached homes; removing it will greatly reduce our uncertainty.
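As a rough sketch, these steps can be done with dplyr; here `raw_data` is a hypothetical name for the 150-case sample as read in, while `data_0309` matches the data set name in the model output later in this report.

```r
library(dplyr)

# Sketch of the cleaning steps described above; `raw_data` is a
# hypothetical name for the 150-case TREB sample as read in.
data_0309 <- raw_data %>%
  mutate(lotsize = lotwidth * lotlength) %>%    # frontage * depth
  select(-lotwidth, -lotlength, -maxsqfoot) %>% # drop originals and maxsqfoot
  na.omit()                                     # drop remaining rows with any NA

glimpse(data_0309)  # produces a summary like the one below
```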

Below is a glimpse of the variables and rows after the data has been cleaned. I will use this cleaned version for all further analysis.

## Rows: 140
## Columns: 9
## $ ID       <int> 119, 86, 13, 94, 115, 28, 112, 93, 63, 194, 143, 154, 177, 19…
## $ sale     <int> 1708000, 1280000, 1350000, 2600000, 1738000, 1360000, 1085000…
## $ list     <int> 1499000, 1348000, 1099000, 2799000, 1698000, 1099000, 849900,…
## $ bedroom  <int> 3, 2, 3, 6, 3, 4, 3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 3, 5, 3, 3, 4…
## $ bathroom <int> 3, 3, 3, 5, 4, 3, 2, 2, 3, 3, 3, 4, 4, 4, 2, 3, 1, 2, 2, 4, 7…
## $ parking  <int> 2, 1, 0, 3, 2, 3, 1, 2, 1, 6, 6, 8, 8, 6, 6, 2, 6, 2, 1, 6, 1…
## $ taxes    <dbl> 7089, 4274, 4750, 9696, 6109, 4966, 4457, 5713, 4695, 5981, 5…
## $ location <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "M", "M", "M", "…
## $ lotsize  <dbl> 1560.000, 1976.560, 1080.000, 4248.000, 2180.000, 3095.910, 9…

II. Exploratory Data Analysis

Variable     Variable Type
---------    -------------
ID           categorical
sale         continuous
list         continuous
bedroom      discrete
bathroom     discrete
parking      discrete
maxsqfoot    continuous
taxes        continuous
lotwidth     continuous
lotlength    continuous
location     categorical
lotsize      continuous

##            sale   list bedroom bathroom parking  taxes lotsize
## sale     1.0000 0.9859  0.4474   0.6002  0.1517 0.7777  0.3659
## list     0.9859 1.0000  0.4434   0.6145  0.1917 0.7574  0.3781
## bedroom  0.4474 0.4434  1.0000   0.5185  0.3378 0.4167  0.2847
## bathroom 0.6002 0.6145  0.5185   1.0000  0.3691 0.5181  0.3508
## parking  0.1517 0.1917  0.3378   0.3691  1.0000 0.3561  0.7379
## taxes    0.7777 0.7574  0.4167   0.5181  0.3561 1.0000  0.5530
## lotsize  0.3659 0.3781  0.2847   0.3508  0.7379 0.5530  1.0000
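A minimal sketch of how the correlation matrix above and the scatterplot matrix discussed below can be produced (the object name `num_vars` is mine):

```r
# Numeric variables from the cleaned data
num_vars <- data_0309[, c("sale", "list", "bedroom", "bathroom",
                          "parking", "taxes", "lotsize")]

round(cor(num_vars), 4)  # pairwise correlation matrix, as shown above
pairs(num_vars)          # scatterplot matrix of all pairs
```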

Based on the pairwise correlation coefficients above, I can rank each quantitative variable by its correlation coefficient with sales price, from highest to lowest:

1.) list - correlation coefficient: 0.9859

2.) taxes - correlation coefficient: 0.7777

3.) bathroom - correlation coefficient: 0.6002

4.) bedroom - correlation coefficient: 0.4474

5.) lotsize - correlation coefficient: 0.3659

6.) parking - correlation coefficient: 0.1517

The variable list has the highest correlation coefficient, and the scatterplot matrix likewise shows a strong positive linear relationship between list price and the sales price of detached homes.

Based on the scatterplot matrix, the plot for the predictor lotsize shows a pronounced fan shape, which is evidence of non-constant variance. Because the fan is so wide, the constant variance assumption is strongly violated here. The plot for the predictor taxes also shows a fan shape, again suggesting non-constant variance, though less severe than for lotsize.

Let’s take a closer look at the standardized residual plot for the predictor lotsize.

The standardized residual plot for lotsize shows a clear fan-shaped pattern: the variance tends to increase as lotsize increases, so we do not have constant variance. The fan is so pronounced that it almost looks like an “L” shape. This is further evidence that the constant variance assumption is strongly violated.
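Below is a sketch of one way to produce such a plot, regressing sale on lotsize alone and plotting the standardized residuals; this is a reconstruction, not necessarily the exact code behind the original figure.

```r
# Standardized residuals from a simple regression of sale on lotsize
fit_lot <- lm(sale ~ lotsize, data = data_0309)
plot(data_0309$lotsize, rstandard(fit_lot),
     xlab = "lotsize", ylab = "Standardized residuals")
abline(h = 0, lty = 2)  # reference line; the fan shape opens to the right
```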

III. Methods and Model

Additive Model for Sales Price (Part 1)

Below, I fit an additive linear regression model of sales price against all available predictors. The estimated regression coefficients and the p-values of the corresponding t-tests are listed in the table.

Regression Coefficient    Estimate     P-Value for t-Test
----------------------    ---------    ------------------
intercept                  57510.00    0.3366
list                           0.83    < 2e-16
bedroom                    18570.00    0.2014
bathroom                    9113.00    0.5236
parking                   -22730.00    0.0304
taxes                         21.93    1.7e-05
lotsize                        2.78    0.2619
location (Toronto)         70540.00    0.0928
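For reference, a sketch of the fit behind these estimates (`full_model` is my name for the object):

```r
# Full additive model with all remaining predictors
full_model <- lm(sale ~ list + bedroom + bathroom + parking +
                   taxes + lotsize + location, data = data_0309)
summary(full_model)  # coefficient estimates and t-test p-values
```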

The additive term here is location, since it is a dummy variable: when it takes the value 1, the intercept shifts, moving the fitted model up or down. According to the lm summary output, Toronto is coded as “1” and Mississauga as “0”. In the table above, 70540 is the estimated regression coefficient for location: it is the estimated mean increase in sales price when a property is located in Toronto rather than Mississauga, holding everything else constant.

Based on p-values below 0.05, the t-tests are significant for the predictors list price, taxes, and parking.

For list price, the significant t-test means this predictor contributes to the prediction of sales price over and above all the other predictors in the model. The coefficient of 0.83 means that, on average, the sales price of a detached home increases by 0.83 dollars for each 1 dollar increase in list price, holding everything else constant.

Taxes also has a p-value close to 0 and a significant t-test, so taxes contributes to the prediction of sales price over and above the other predictors. The interpretation of its estimated coefficient is similar: for a 1 dollar increase in property tax, the sales price increases by 21.93 dollars on average, holding everything else constant.

Lastly, the predictor parking also has a significant t-test. Its estimated coefficient is not really interpretable, since it is negative: logically, a detached home with more parking spots should be more expensive, not cheaper. What may have occurred is multicollinearity, since the number of parking spots may be highly correlated with, say, the number of bathrooms and bedrooms. Under multicollinearity, the estimated regression coefficients vary widely from data set to data set, and sometimes the sign does not even make sense, as we see here; the fitted regression equation will not be stable.
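One standard way to check this suspicion, not part of the original analysis, is to compute variance inflation factors, assuming the car package is available:

```r
library(car)
# VIFs (GVIFs for the factor location); values well above ~5 are a
# common rule-of-thumb signal of problematic collinearity
vif(full_model)
```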

There is not really a reason to believe that parking is a useful predictor: the scatterplot matrix showed no pattern between parking spots and sales price, and the correlation coefficient is low. Now that we also get an implausible coefficient estimate, we have further evidence that this predictor is doing us no good.

Backward Elimination with AIC to Obtain a Model

Now I will use backward elimination with the AIC criterion and try to reduce the model.
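A sketch of the call that produces the trace below; R’s step() uses the AIC penalty k = 2 by default:

```r
# Backward elimination starting from the full additive model
aic_model <- step(full_model, direction = "backward")
```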

## Start:  AIC=3319.42
## sale ~ list + bedroom + bathroom + parking + taxes + lotsize + 
##     location
## 
##            Df  Sum of Sq        RSS    AIC
## - bathroom  1 7.6691e+09 2.4833e+12 3317.9
## - lotsize   1 2.3814e+10 2.4994e+12 3318.8
## - bedroom   1 3.0925e+10 2.5066e+12 3319.2
## <none>                   2.4756e+12 3319.4
## - location  1 5.3765e+10 2.5294e+12 3320.4
## - parking   1 8.9855e+10 2.5655e+12 3322.4
## - taxes     1 3.7382e+11 2.8495e+12 3337.1
## - list      1 2.4166e+13 2.6642e+13 3650.1
## 
## Step:  AIC=3317.86
## sale ~ list + bedroom + parking + taxes + lotsize + location
## 
##            Df  Sum of Sq        RSS    AIC
## - lotsize   1 2.1497e+10 2.5048e+12 3317.1
## <none>                   2.4833e+12 3317.9
## - bedroom   1 4.5700e+10 2.5290e+12 3318.4
## - location  1 4.6285e+10 2.5296e+12 3318.4
## - parking   1 9.0682e+10 2.5740e+12 3320.9
## - taxes     1 3.7112e+11 2.8544e+12 3335.4
## - list      1 3.1867e+13 3.4350e+13 3683.6
## 
## Step:  AIC=3317.06
## sale ~ list + bedroom + parking + taxes + location
## 
##            Df  Sum of Sq        RSS    AIC
## <none>                   2.5048e+12 3317.1
## - bedroom   1 3.8702e+10 2.5435e+12 3317.2
## - location  1 4.6163e+10 2.5510e+12 3317.6
## - parking   1 6.9511e+10 2.5743e+12 3318.9
## - taxes     1 4.7752e+11 2.9823e+12 3339.5
## - list      1 3.2024e+13 3.4529e+13 3682.4
## 
## Call:
## lm(formula = sale ~ list + bedroom + parking + taxes + location, 
##     data = data_0309)
## 
## Coefficients:
## (Intercept)         list      bedroom      parking        taxes    locationT  
##   6.220e+04    8.382e-01    1.958e+04   -1.686e+04    2.351e+01    6.189e+04

From the output above, the final fitted model from backward elimination with AIC is:

\[ \widehat{\text{sales price}} = 62200 + 0.8382 \cdot \text{list price} + 19580 \cdot \text{bedroom} - 16860 \cdot \text{parking} + 23.51 \cdot \text{taxes} + 61890 \cdot \text{locationT} \]

where locationT is a dummy variable: locationT = 1 if the location is Toronto and 0 if it is Mississauga.

This result is consistent with the t-test p-values obtained from the additive model in Part 1. The predictor with the highest p-value there is bathroom, so we expect backward elimination to remove bathroom first, and the output above confirms this. The second predictor removed is lotsize, which has the second-largest p-value in the table. So the removal order from backward elimination with AIC is indeed consistent with the p-values from the additive model.

Backward Elimination with BIC to Obtain a Model

Now I will reduce the model using backward elimination again, this time with the BIC criterion.
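A sketch of the call: setting k = log(n) makes step() apply the BIC penalty, although the trace below still prints the generic label “AIC”:

```r
# Backward elimination with the BIC penalty k = log(n)
n <- nrow(data_0309)
bic_model <- step(full_model, direction = "backward", k = log(n))
```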

## Start:  AIC=3342.96
## sale ~ list + bedroom + bathroom + parking + taxes + lotsize + 
##     location
## 
##            Df  Sum of Sq        RSS    AIC
## - bathroom  1 7.6691e+09 2.4833e+12 3338.4
## - lotsize   1 2.3814e+10 2.4994e+12 3339.4
## - bedroom   1 3.0925e+10 2.5066e+12 3339.8
## - location  1 5.3765e+10 2.5294e+12 3341.0
## <none>                   2.4756e+12 3343.0
## - parking   1 8.9855e+10 2.5655e+12 3343.0
## - taxes     1 3.7382e+11 2.8495e+12 3357.7
## - list      1 2.4166e+13 2.6642e+13 3670.7
## 
## Step:  AIC=3338.45
## sale ~ list + bedroom + parking + taxes + lotsize + location
## 
##            Df  Sum of Sq        RSS    AIC
## - lotsize   1 2.1497e+10 2.5048e+12 3334.7
## - bedroom   1 4.5700e+10 2.5290e+12 3336.1
## - location  1 4.6285e+10 2.5296e+12 3336.1
## <none>                   2.4833e+12 3338.4
## - parking   1 9.0682e+10 2.5740e+12 3338.5
## - taxes     1 3.7112e+11 2.8544e+12 3353.0
## - list      1 3.1867e+13 3.4350e+13 3701.3
## 
## Step:  AIC=3334.71
## sale ~ list + bedroom + parking + taxes + location
## 
##            Df  Sum of Sq        RSS    AIC
## - bedroom   1 3.8702e+10 2.5435e+12 3331.9
## - location  1 4.6163e+10 2.5510e+12 3332.3
## - parking   1 6.9511e+10 2.5743e+12 3333.6
## <none>                   2.5048e+12 3334.7
## - taxes     1 4.7752e+11 2.9823e+12 3354.2
## - list      1 3.2024e+13 3.4529e+13 3697.1
## 
## Step:  AIC=3331.92
## sale ~ list + parking + taxes + location
## 
##            Df  Sum of Sq        RSS    AIC
## - parking   1 4.9519e+10 2.5930e+12 3329.7
## - location  1 5.5291e+10 2.5988e+12 3330.0
## <none>                   2.5435e+12 3331.9
## - taxes     1 4.9553e+11 3.0390e+12 3351.9
## - list      1 3.3831e+13 3.6375e+13 3699.4
## 
## Step:  AIC=3329.67
## sale ~ list + taxes + location
## 
##            Df  Sum of Sq        RSS    AIC
## <none>                   2.5930e+12 3329.7
## - location  1 3.7745e+11 2.9705e+12 3343.8
## - taxes     1 4.6406e+11 3.0571e+12 3347.8
## - list      1 3.4840e+13 3.7433e+13 3698.5
## 
## Call:
## lm(formula = sale ~ list + taxes + location, data = data_0309)
## 
## Coefficients:
## (Intercept)         list        taxes    locationT  
##   5.295e+04    8.372e-01    2.295e+01    1.156e+05

The final fitted model from backward elimination with BIC is:

\[ \widehat{\text{sales price}} = 52950 + 0.8372 \cdot \text{list price} + 22.95 \cdot \text{taxes} + 115600 \cdot \text{locationT} \]

with the predictors defined the same as before.
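As a hypothetical usage example, here is the predicted sales price of a Toronto home listed at $1,000,000 with $5,000 in property taxes:

```r
new_home <- data.frame(list = 1000000, taxes = 5000, location = "T")
predict(bic_model, newdata = new_home)
# By hand: 52950 + 0.8372 * 1000000 + 22.95 * 5000 + 115600 * 1 = 1120500
```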

The result is overall consistent with Part 1: bathroom has the highest p-value, so it should be removed first, and backward elimination with BIC does remove bathroom first. It then goes on to remove lotsize and bedroom, which also have high p-values in Part 1.

The difference from AIC is that more predictors are removed. This is expected, since BIC carries a heavier penalty than AIC and so favors a simpler model. It is arguably a better model here: we saw earlier that parking may suffer from multicollinearity, and with parking removed we can properly interpret all the remaining coefficients. In the AIC run, parking already had a relatively high p-value, but the penalty was not strong enough to remove it; the heavier BIC penalty removes this redundant variable. I think BIC is the right choice because, as noted before, the scatterplot of parking vs. sales price is a null plot with essentially no correlation, and the constant variance assumption was badly violated by lotsize, so a simpler model is the better choice.

IV. Discussions and Limitations

Using the simplified model obtained from backward elimination with BIC, I examine the diagnostic plots after fitting that model.
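A sketch of how these four plots can be produced from the refitted BIC model:

```r
par(mfrow = c(2, 2))  # 2x2 grid: Residuals vs Fitted, Normal Q-Q,
plot(bic_model)       # Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))
```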

Residuals vs Fitted plot: there is no obvious pattern or trend in the residuals, which are spread evenly around the horizontal line, so we have evidence that the linearity assumption of MLR is satisfied. However, there is one point with a large residual (point 95); we should keep in mind that this point may be a problem.

Normal Q-Q plot: the standardized residuals mostly follow the straight line, so we have evidence that the normality-of-errors assumption is satisfied. Again, point 95 appears in the right tail and affects normality, but otherwise I think we can assume normality holds, so we can do inference.

Scale-Location plot: the spread-location plot shows a relatively horizontal line with points that are mostly evenly spread, so the constant variance assumption should be satisfied. We again take note of point 95, which lies very far from the red line.

Residuals vs Leverage plot: this plot confirms that case 95 is an outlier, since its standardized residual is greater than 4. However, it does not have high leverage, so removing it would probably not change the fitted model significantly; it is not an influential point. No points fall outside the red dashed Cook’s distance contours, so no point should be greatly influencing the fit. One point has very high leverage, but since its standardized residual is close to 0, it is not a concern.
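Below is a sketch for pulling these flagged cases out numerically; the cutoffs used are common rules of thumb, not part of the original analysis:

```r
std_res <- rstandard(bic_model)
which(abs(std_res) > 4)                    # outliers, e.g. case 95

lev <- hatvalues(bic_model)
which(lev > 2 * mean(lev))                 # high-leverage points

which(cooks.distance(bic_model) > 4 / nrow(data_0309))  # possibly influential
```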

The next step toward a valid model would be to investigate the points highlighted in the diagnostic plots. Looking at the very-high-leverage point and at the outlier, case 95, could reveal information about variables or predictors that were left out. The MLR assumptions are mostly satisfied, as the four plots show, so we do not really need to transform variables or remove points, since there does not appear to be any influential point. I would also investigate some odd observations, such as a property with $4 in property tax, and I would like to find out why there is no relationship between parking spots and sales price. If I can learn how the variables were collected, I might find that some variables are very closely related, allowing us to remove redundant information or create interaction terms.

Overall, the assumptions of MLR are satisfied, and the final model is a simple one.