This R notebook will be exploring this dataset provided by Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction
Note: This project was worked on in the course of one afternoon. Because of this, I took a more exploratory approach in finding insights. I am sorry in advance if this excursion was a little haphazard. With that said, I hope you find tremendous value from it! Also, please offer suggestions if you have any!
We will build a simple regression model to predict the home sales price
house_data = read.csv(file.choose(), header=T)
head(house_data)
#Let's check how many observations we have
dim(house_data)
[1] 21613 21
Let’s create a train and test set for our data ## Test and Train Set
smp_size = floor(.7*nrow(house_data))
set.seed(123)
ind = sample(seq_len(nrow(house_data)),size=smp_size)
train_house_data = house_data[ind,]
test_house_data= house_data[-ind,]
dim(train_house_data)
[1] 15129 21
dim(test_house_data)
[1] 6484 21
We will do a summary to find some descriptive statistics on each of the explanatory variables within our training set
attach(train_house_data)
summary(train_house_data)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
Min. :1.000e+06 20140708T000000: 94 Min. : 75000 Min. : 0.000 Min. :0.000 Min. : 370 Min. : 520 Min. :1.000 Min. :0.000000
1st Qu.:2.123e+09 20140625T000000: 93 1st Qu.: 323500 1st Qu.: 3.000 1st Qu.:1.500 1st Qu.: 1420 1st Qu.: 5091 1st Qu.:1.000 1st Qu.:0.000000
Median :3.903e+09 20140623T000000: 92 Median : 450000 Median : 3.000 Median :2.250 Median : 1910 Median : 7633 Median :1.500 Median :0.000000
Mean :4.574e+09 20150422T000000: 90 Mean : 540145 Mean : 3.368 Mean :2.111 Mean : 2078 Mean : 15410 Mean :1.492 Mean :0.007733
3rd Qu.:7.304e+09 20140825T000000: 88 3rd Qu.: 640000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2540 3rd Qu.: 10800 3rd Qu.:2.000 3rd Qu.:0.000000
Max. :9.842e+09 20150325T000000: 88 Max. :7700000 Max. :33.000 Max. :8.000 Max. :12050 Max. :1164794 Max. :3.500 Max. :1.000000
(Other) :14584
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long
Min. :0.0000 Min. :1.000 Min. : 3.000 Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00 Min. :98001 Min. :47.16 Min. :-122.5
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
Median :0.0000 Median :3.000 Median : 7.000 Median :1552 Median : 0.0 Median :1975 Median : 0.00 Median :98065 Median :47.57 Median :-122.2
Mean :0.2371 Mean :3.407 Mean : 7.653 Mean :1787 Mean : 291.7 Mean :1971 Mean : 84.56 Mean :98078 Mean :47.56 Mean :-122.2
3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2200 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1
Max. :4.0000 Max. :5.000 Max. :13.000 Max. :8860 Max. :4820.0 Max. :2015 Max. :2015.00 Max. :98199 Max. :47.78 Max. :-121.3
sqft_living15 sqft_lot15
Min. : 460 Min. : 651
1st Qu.:1480 1st Qu.: 5108
Median :1840 Median : 7644
Mean :1984 Mean : 12922
3rd Qu.:2360 3rd Qu.: 10125
Max. :6210 Max. :858132
From this summary statistic we find some interesting qualities within our columns. First and foremost, we will remove the identifier, id, from this regression. We find that the date timestamp is in an interesting format. In order to fully use data, we will have to transform it (for the sake of simplicity I will not do this initially. A way of using date would be by turning it into a categorical variable. Another would be to use R’s timestamp features. What we will do is compute a partial F test with the transformed timestamp after the final model has been created). Also, The year renovated (represented as yr_renovated) appears to have zeros mostly within the dataset. We will remove it from our initial model. We can find the frequency of values on our variable by plotting the histograms from each. This will aid us in seeing if we need to do some transformations. I am suspecting that yr_renovated will be right skewed because the median is less than the mean.
Also, there are lots of variables that explain the same behavior (i.e. location). We may remove these variables later on in the analysis because of multicollinearity.
par(mfrow=c(4,6))
hist(price)
hist(bedrooms)
hist(bathrooms)
hist(sqft_living)
hist(sqft_lot)
hist(floors)
hist(waterfront)
hist(view)
hist(condition)
hist(grade)
hist(sqft_above)
hist(sqft_basement)
hist(yr_built)
hist(yr_renovated)
hist(zipcode)
hist(lat)
hist(long)
hist(sqft_living15)
hist(sqft_lot15)
From these histograms, we notice that a lot of our variables are incredibly right skewed. This is problematic. From the looks of it, there are many variables in which they have a high a frequency of an empty value (i.e. waterfront, view). This is why they are incredibly skewed. Because of this, we will remove waterfront, view, yr_renovated from all regression models.
Let’s apply a natural log transformation onto bedroom, and all of the sqft variables
ln_bedroom = log(bedrooms+1)
ln_sqft_basement = log(sqft_basement+1)
ln_sqft_living = log(sqft_living)
ln_sqft_above = log(sqft_above)
ln_sqft15_living = log(sqft_living15)
ln_sqft15_lot = log(sqft_lot15)
ln_price = log(price)
par(mfrow=c(2,4))
hist(ln_bedroom)
hist(ln_sqft_basement)
hist(ln_sqft_living)
hist(ln_sqft_above)
hist(ln_sqft15_living)
hist(ln_sqft15_lot)
hist(ln_price)
We will now build a scatterplot matrix to see if any of the explanatory variables need a square term.
vars1 = data.frame(cat_grade,cat_cond,bathrooms,yr_built,price)
vars2 = data.frame(ln_sqft_living,ln_sqft15_lot,cat_floors,ln_sqft_above,price)
vars3 = data.frame(zipcode,lat,long,ln_sqft15_living,ln_sqft15_lot,ln_bedroom,ln_sqft_basement,price)
pairs(vars1, upper.panel = NULL)
pairs(vars2, upper.panel = NULL)
pairs(vars3, upper.panel = NULL)
From these various scatterplot matrices we see that ln_sqft_living, ln_sqft_basement may need a square term
basementSq = ln_sqft_basement ** 2
sqft_livingSq = ln_sqft_living ** 2
Let us start by creating a simple regression model.
For starters, we will build a model using our price variable. In our final model, if we notice any heteroscedasticity then we will use the natural log of price variable for our Y
m1 = lm(price ~ cat_grade+cat_cond+cat_floors+bathrooms+ln_sqft_living+ln_sqft15_lot+ln_sqft_above+yr_built+zipcode+lat+long+ln_sqft15_living+ln_sqft15_lot+ln_bedroom+ln_sqft_basement+basementSq+sqft_livingSq)
summary(m1)
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
Residuals:
Min 1Q Median 3Q Max
-1634591 -95962 -10949 71991 4450946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.958e+07 3.732e+06 5.247 1.56e-07 ***
cat_grade4 -1.767e+05 1.537e+05 -1.150 0.25025
cat_grade5 -2.845e+04 1.468e+05 -0.194 0.84637
cat_grade6 3.839e+04 1.462e+05 0.262 0.79297
cat_grade7 1.070e+05 1.464e+05 0.731 0.46483
cat_grade8 1.714e+05 1.465e+05 1.170 0.24209
cat_grade9 2.831e+05 1.466e+05 1.930 0.05358 .
cat_grade10 4.366e+05 1.467e+05 2.975 0.00293 **
cat_grade11 7.141e+05 1.471e+05 4.854 1.22e-06 ***
cat_grade12 1.194e+06 1.488e+05 8.023 1.11e-15 ***
cat_grade13 2.379e+06 1.645e+05 14.462 < 2e-16 ***
cat_cond2 1.641e+03 4.791e+04 0.034 0.97268
cat_cond3 2.525e+04 4.450e+04 0.567 0.57048
cat_cond4 5.819e+04 4.452e+04 1.307 0.19115
cat_cond5 1.010e+05 4.481e+04 2.253 0.02428 *
cat_floors1.5 7.897e+03 6.796e+03 1.162 0.24530
cat_floors2 4.111e+00 5.728e+03 0.001 0.99943
cat_floors2.5 8.389e+04 1.988e+04 4.220 2.45e-05 ***
cat_floors3 5.869e+04 1.229e+04 4.776 1.81e-06 ***
cat_floors3.5 1.225e+05 9.255e+04 1.323 0.18576
bathrooms 5.278e+04 4.038e+03 13.071 < 2e-16 ***
ln_sqft_living -3.850e+06 1.402e+05 -27.464 < 2e-16 ***
ln_sqft15_lot -1.897e+04 2.677e+03 -7.086 1.45e-12 ***
ln_sqft_above 3.807e+05 2.931e+04 12.990 < 2e-16 ***
yr_built -2.431e+03 9.339e+01 -26.027 < 2e-16 ***
zipcode -5.468e+02 4.049e+01 -13.505 < 2e-16 ***
lat 5.510e+05 1.324e+04 41.629 < 2e-16 ***
long -2.000e+05 1.640e+04 -12.189 < 2e-16 ***
ln_sqft15_living 1.102e+05 8.635e+03 12.765 < 2e-16 ***
ln_bedroom -8.674e+04 1.084e+04 -8.004 1.29e-15 ***
ln_sqft_basement -3.395e+04 6.907e+03 -4.915 8.96e-07 ***
basementSq 9.227e+03 1.252e+03 7.368 1.82e-13 ***
sqft_livingSq 2.459e+05 9.957e+03 24.694 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 205900 on 15096 degrees of freedom
Multiple R-squared: 0.6943, Adjusted R-squared: 0.6937
F-statistic: 1072 on 32 and 15096 DF, p-value: < 2.2e-16
So, from the looks of it we have a decent model. The F p-value=2.2e-16 < alpha=.05 which means that this model is very significant. Also the R2 value=0.6943 and R2a=0.6937 difference is very small, which means we are not taking a penalty hit for the extra variables included in this model. Let’s build another model in which we will do backwards selection (or backwards regression). We also notice that the square terms for basement and sqft_living are significant since their t p-values<alpha=.05.
We will now create model 2 where we will do stepwise regression. ###Stepwise Regression
m1_0 = lm(price ~ 1)
step(m1_0, scope=list(lower=m1_0, upper=m1, direction="both"))
Start: AIC=388113.2
price ~ 1
Df Sum of Sq RSS AIC
+ cat_grade 10 1.0847e+15 1.0092e+15 377091
+ sqft_livingSq 1 8.1773e+14 1.2762e+15 380624
+ ln_sqft_living 1 7.8133e+14 1.3126e+15 381049
+ ln_sqft_above 1 6.1829e+14 1.4757e+15 382821
+ ln_sqft15_living 1 6.1620e+14 1.4777e+15 382842
+ bathrooms 1 5.8581e+14 1.5081e+15 383150
+ ln_bedroom 1 1.9531e+14 1.8986e+15 386634
+ lat 1 1.8609e+14 1.9079e+15 386707
+ cat_floors 5 1.6667e+14 1.9273e+15 386868
+ basementSq 1 1.1209e+14 1.9819e+15 387283
+ ln_sqft_basement 1 8.8724e+13 2.0052e+15 387460
+ ln_sqft15_lot 1 4.8900e+13 2.0450e+15 387758
+ cat_cond 4 1.3633e+13 2.0803e+15 388022
+ yr_built 1 9.4932e+12 2.0845e+15 388046
+ zipcode 1 7.3542e+12 2.0866e+15 388062
+ long 1 1.2730e+12 2.0927e+15 388106
<none> 2.0939e+15 388113
Step: AIC=377091.1
price ~ cat_grade
Df Sum of Sq RSS AIC
+ lat 1 1.1325e+14 8.9598e+14 375292
+ yr_built 1 9.8933e+13 9.1030e+14 375532
+ sqft_livingSq 1 8.6078e+13 9.2315e+14 375744
+ ln_sqft_living 1 7.9702e+13 9.2953e+14 375849
+ basementSq 1 6.7022e+13 9.4221e+14 376053
+ ln_sqft_basement 1 5.9877e+13 9.4936e+14 376168
+ cat_cond 4 3.6454e+13 9.7278e+14 376543
+ ln_sqft15_living 1 3.1564e+13 9.7767e+14 376612
+ bathrooms 1 2.9888e+13 9.7934e+14 376638
+ long 1 2.4877e+13 9.8436e+14 376716
+ cat_floors 5 2.4423e+13 9.8481e+14 376730
+ ln_sqft_above 1 1.4584e+13 9.9465e+14 376873
+ ln_bedroom 1 1.4375e+13 9.9486e+14 376876
+ zipcode 1 7.2627e+12 1.0020e+15 376984
+ ln_sqft15_lot 1 2.7569e+11 1.0090e+15 377089
<none> 1.0092e+15 377091
- cat_grade 10 1.0847e+15 2.0939e+15 388113
Step: AIC=375292.4
price ~ cat_grade + lat
Df Sum of Sq RSS AIC
+ sqft_livingSq 1 1.0144e+14 7.9454e+14 373477
+ ln_sqft_living 1 9.5042e+13 8.0094e+14 373598
+ yr_built 1 5.9012e+13 8.3697e+14 374264
+ basementSq 1 4.8923e+13 8.4706e+14 374445
+ ln_sqft_basement 1 4.2052e+13 8.5393e+14 374567
+ bathrooms 1 3.9665e+13 8.5632e+14 374609
+ ln_sqft15_living 1 3.7351e+13 8.5863e+14 374650
+ cat_cond 4 3.5242e+13 8.6074e+14 374693
+ ln_sqft_above 1 3.0882e+13 8.6510e+14 374764
+ ln_bedroom 1 2.1679e+13 8.7430e+14 374924
+ cat_floors 5 1.7783e+13 8.7820e+14 374999
+ long 1 1.1124e+13 8.8486e+14 375105
+ ln_sqft15_lot 1 1.5945e+12 8.9439e+14 375267
+ zipcode 1 2.5790e+11 8.9572e+14 375290
<none> 8.9598e+14 375292
- lat 1 1.1325e+14 1.0092e+15 377091
- cat_grade 10 1.0119e+15 1.9079e+15 386707
Step: AIC=373476.6
price ~ cat_grade + lat + sqft_livingSq
Df Sum of Sq RSS AIC
+ yr_built 1 5.7724e+13 7.3682e+14 372337
+ ln_sqft_living 1 4.7205e+13 7.4734e+14 372552
+ cat_cond 4 2.5443e+13 7.6910e+14 372992
+ long 1 2.4542e+13 7.7000e+14 373004
+ cat_floors 5 1.4009e+13 7.8053e+14 373217
+ basementSq 1 1.0962e+13 7.8358e+14 373268
+ ln_sqft_basement 1 9.7644e+12 7.8478e+14 373292
+ ln_sqft_above 1 4.4964e+12 7.9005e+14 373393
+ ln_sqft15_lot 1 2.1614e+12 7.9238e+14 373437
+ ln_sqft15_living 1 2.0427e+12 7.9250e+14 373440
+ ln_bedroom 1 1.9648e+12 7.9258e+14 373441
+ bathrooms 1 1.1259e+12 7.9342e+14 373457
+ zipcode 1 2.2958e+11 7.9431e+14 373474
<none> 7.9454e+14 373477
- sqft_livingSq 1 1.0144e+14 8.9598e+14 375292
- lat 1 1.2861e+14 9.2315e+14 375744
- cat_grade 10 3.2090e+14 1.1154e+15 378589
Step: AIC=372337.5
price ~ cat_grade + lat + sqft_livingSq + yr_built
Df Sum of Sq RSS AIC
+ ln_sqft_living 1 4.4196e+13 6.9262e+14 371404
+ bathrooms 1 1.6607e+13 7.2021e+14 371995
+ long 1 5.9989e+12 7.3082e+14 372216
+ cat_cond 4 6.1072e+12 7.3071e+14 372220
+ ln_sqft15_lot 1 4.7067e+12 7.3211e+14 372243
+ cat_floors 5 4.4638e+12 7.3236e+14 372256
+ ln_sqft15_living 1 2.7522e+12 7.3407e+14 372283
+ ln_bedroom 1 2.5871e+12 7.3423e+14 372286
+ basementSq 1 2.2576e+12 7.3456e+14 372293
+ zipcode 1 2.0337e+12 7.3479e+14 372298
+ ln_sqft_basement 1 1.9695e+12 7.3485e+14 372299
<none> 7.3682e+14 372337
+ ln_sqft_above 1 8.9841e+10 7.3673e+14 372338
- yr_built 1 5.7724e+13 7.9454e+14 373477
- lat 1 8.5796e+13 8.2262e+14 374002
- sqft_livingSq 1 1.0015e+14 8.3697e+14 374264
- cat_grade 10 3.4971e+14 1.0865e+15 378194
Step: AIC=371403.7
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living
Df Sum of Sq RSS AIC
+ bathrooms 1 1.2084e+13 6.8054e+14 371139
+ cat_cond 4 7.0921e+12 6.8553e+14 371256
+ ln_sqft15_lot 1 5.5076e+12 6.8712e+14 371285
+ long 1 4.7780e+12 6.8785e+14 371301
+ cat_floors 5 4.3573e+12 6.8827e+14 371318
+ ln_sqft15_living 1 2.7853e+12 6.8984e+14 371345
+ zipcode 1 2.7624e+12 6.8986e+14 371345
+ ln_sqft_basement 1 2.2340e+12 6.9039e+14 371357
+ basementSq 1 1.9839e+12 6.9064e+14 371362
+ ln_bedroom 1 9.6032e+11 6.9166e+14 371385
<none> 6.9262e+14 371404
+ ln_sqft_above 1 3.6310e+10 6.9259e+14 371405
- ln_sqft_living 1 4.4196e+13 7.3682e+14 372337
- sqft_livingSq 1 5.0350e+13 7.4297e+14 372463
- yr_built 1 5.4715e+13 7.4734e+14 372552
- lat 1 7.9747e+13 7.7237e+14 373050
- cat_grade 10 1.7405e+14 8.6668e+14 374775
Step: AIC=371139.4
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms
Df Sum of Sq RSS AIC
+ cat_cond 4 6.5211e+12 6.7402e+14 371002
+ ln_sqft15_living 1 3.9886e+12 6.7655e+14 371052
+ zipcode 1 3.5094e+12 6.7703e+14 371063
+ long 1 3.3708e+12 6.7717e+14 371066
+ ln_sqft15_lot 1 2.6617e+12 6.7788e+14 371082
+ cat_floors 5 2.8683e+12 6.7767e+14 371085
+ ln_bedroom 1 2.4324e+12 6.7811e+14 371087
+ ln_sqft_basement 1 1.2885e+12 6.7925e+14 371113
+ basementSq 1 1.1455e+12 6.7939e+14 371116
<none> 6.8054e+14 371139
+ ln_sqft_above 1 2.7682e+08 6.8054e+14 371141
- bathrooms 1 1.2084e+13 6.9262e+14 371404
- ln_sqft_living 1 3.9673e+13 7.2021e+14 371995
- sqft_livingSq 1 4.3753e+13 7.2429e+14 372080
- yr_built 1 6.6502e+13 7.4704e+14 372548
- lat 1 7.7227e+13 7.5777e+14 372764
- cat_grade 10 1.6438e+14 8.4492e+14 374393
Step: AIC=371001.7
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond
Df Sum of Sq RSS AIC
+ ln_sqft15_living 1 4.0632e+12 6.6996e+14 370912
+ long 1 3.7554e+12 6.7026e+14 370919
+ ln_sqft15_lot 1 3.3567e+12 6.7066e+14 370928
+ ln_bedroom 1 2.7126e+12 6.7131e+14 370943
+ cat_floors 5 2.9611e+12 6.7106e+14 370945
+ zipcode 1 2.4995e+12 6.7152e+14 370948
+ ln_sqft_basement 1 9.7972e+11 6.7304e+14 370982
+ basementSq 1 7.9935e+11 6.7322e+14 370986
<none> 6.7402e+14 371002
+ ln_sqft_above 1 5.2205e+10 6.7397e+14 371003
- cat_cond 4 6.5211e+12 6.8054e+14 371139
- bathrooms 1 1.1513e+13 6.8553e+14 371256
- ln_sqft_living 1 4.0692e+13 7.1471e+14 371887
- sqft_livingSq 1 4.4707e+13 7.1873e+14 371971
- yr_built 1 4.5431e+13 7.1945e+14 371987
- lat 1 7.9785e+13 7.5380e+14 372692
- cat_grade 10 1.6567e+14 8.3969e+14 374307
Step: AIC=370912.2
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living
Df Sum of Sq RSS AIC
+ long 1 5.9491e+12 6.6401e+14 370779
+ ln_sqft15_lot 1 5.8396e+12 6.6412e+14 370782
+ cat_floors 5 4.3227e+12 6.6563e+14 370824
+ ln_bedroom 1 2.5234e+12 6.6743e+14 370857
+ zipcode 1 1.5130e+12 6.6844e+14 370880
+ ln_sqft_basement 1 1.4540e+12 6.6850e+14 370881
+ basementSq 1 1.2404e+12 6.6872e+14 370886
<none> 6.6996e+14 370912
+ ln_sqft_above 1 7.1238e+08 6.6995e+14 370914
- ln_sqft15_living 1 4.0632e+12 6.7402e+14 371002
- cat_cond 4 6.5957e+12 6.7655e+14 371052
- bathrooms 1 1.2650e+13 6.8261e+14 371193
- ln_sqft_living 1 4.0517e+13 7.1047e+14 371799
- sqft_livingSq 1 4.3730e+13 7.1369e+14 371867
- yr_built 1 4.6883e+13 7.1684e+14 371934
- lat 1 7.9745e+13 7.4970e+14 372612
- cat_grade 10 1.4878e+14 8.1873e+14 373926
Step: AIC=370779.3
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long
Df Sum of Sq RSS AIC
+ zipcode 1 6.9011e+12 6.5711e+14 370623
+ ln_sqft15_lot 1 3.0215e+12 6.6098e+14 370712
+ cat_floors 5 3.1117e+12 6.6089e+14 370718
+ ln_bedroom 1 2.4481e+12 6.6156e+14 370725
+ ln_sqft_basement 1 4.6817e+11 6.6354e+14 370771
+ basementSq 1 3.5610e+11 6.6365e+14 370773
+ ln_sqft_above 1 2.6657e+11 6.6374e+14 370775
<none> 6.6401e+14 370779
- long 1 5.9491e+12 6.6996e+14 370912
- ln_sqft15_living 1 6.2568e+12 6.7026e+14 370919
- cat_cond 4 7.1113e+12 6.7112e+14 370932
- bathrooms 1 1.1144e+13 6.7515e+14 371029
- yr_built 1 3.0946e+13 6.9495e+14 371466
- ln_sqft_living 1 3.9507e+13 7.0351e+14 371652
- sqft_livingSq 1 4.2849e+13 7.0686e+14 371723
- lat 1 7.6826e+13 7.4083e+14 372434
- cat_grade 10 1.4164e+14 8.0565e+14 373685
Step: AIC=370623.2
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode
Df Sum of Sq RSS AIC
+ ln_sqft15_lot 1 3.3834e+12 6.5372e+14 370547
+ cat_floors 5 3.6042e+12 6.5350e+14 370550
+ ln_bedroom 1 2.9767e+12 6.5413e+14 370557
+ ln_sqft_basement 1 5.8808e+11 6.5652e+14 370612
+ basementSq 1 4.4213e+11 6.5666e+14 370615
+ ln_sqft_above 1 2.0812e+11 6.5690e+14 370620
<none> 6.5711e+14 370623
- ln_sqft15_living 1 5.0403e+12 6.6215e+14 370737
- cat_cond 4 5.5333e+12 6.6264e+14 370742
- zipcode 1 6.9011e+12 6.6401e+14 370779
- bathrooms 1 1.1147e+13 6.6825e+14 370876
- long 1 1.1337e+13 6.6844e+14 370880
- yr_built 1 3.4051e+13 6.9116e+14 371386
- ln_sqft_living 1 3.9997e+13 6.9710e+14 371515
- sqft_livingSq 1 4.3405e+13 7.0051e+14 371589
- lat 1 8.3389e+13 7.4049e+14 372429
- cat_grade 10 1.4015e+14 7.9726e+14 373528
Step: AIC=370547.1
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot
Df Sum of Sq RSS AIC
+ ln_bedroom 1 2.9589e+12 6.5076e+14 370481
+ cat_floors 5 2.5080e+12 6.5121e+14 370499
+ ln_sqft_basement 1 4.8473e+11 6.5324e+14 370538
+ basementSq 1 3.6977e+11 6.5335e+14 370541
+ ln_sqft_above 1 2.3393e+11 6.5349e+14 370544
<none> 6.5372e+14 370547
- ln_sqft15_lot 1 3.3834e+12 6.5711e+14 370623
- cat_cond 4 5.9974e+12 6.5972e+14 370677
- ln_sqft15_living 1 6.4982e+12 6.6022e+14 370695
- zipcode 1 7.2630e+12 6.6098e+14 370712
- long 1 7.5750e+12 6.6130e+14 370719
- bathrooms 1 8.7465e+12 6.6247e+14 370746
- yr_built 1 3.5970e+13 6.8969e+14 371355
- ln_sqft_living 1 4.1323e+13 6.9505e+14 371472
- sqft_livingSq 1 4.4993e+13 6.9872e+14 371552
- lat 1 7.7399e+13 7.3112e+14 372238
- cat_grade 10 1.4038e+14 7.9410e+14 373470
Step: AIC=370480.5
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot + ln_bedroom
Df Sum of Sq RSS AIC
+ cat_floors 5 2.4516e+12 6.4831e+14 370433
+ ln_sqft_basement 1 3.5267e+11 6.5041e+14 370474
+ ln_sqft_above 1 3.1669e+11 6.5045e+14 370475
+ basementSq 1 2.6403e+11 6.5050e+14 370476
<none> 6.5076e+14 370481
- ln_bedroom 1 2.9589e+12 6.5372e+14 370547
- ln_sqft15_lot 1 3.3657e+12 6.5413e+14 370557
- cat_cond 4 6.2077e+12 6.5697e+14 370616
- ln_sqft15_living 1 6.1763e+12 6.5694e+14 370621
- long 1 7.7499e+12 6.5851e+14 370658
- zipcode 1 7.8014e+12 6.5856e+14 370659
- bathrooms 1 1.0178e+13 6.6094e+14 370713
- yr_built 1 3.7291e+13 6.8805e+14 371322
- ln_sqft_living 1 3.8484e+13 6.8925e+14 371348
- sqft_livingSq 1 4.2871e+13 6.9363e+14 371444
- lat 1 7.6284e+13 7.2705e+14 372155
- cat_grade 10 1.3448e+14 7.8525e+14 373303
Step: AIC=370433.4
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot + ln_bedroom + cat_floors
Df Sum of Sq RSS AIC
+ ln_sqft_basement 1 7.9331e+11 6.4752e+14 370417
+ basementSq 1 6.6206e+11 6.4765e+14 370420
+ ln_sqft_above 1 1.5895e+11 6.4815e+14 370432
<none> 6.4831e+14 370433
- cat_floors 5 2.4516e+12 6.5076e+14 370481
- ln_sqft15_lot 1 2.3403e+12 6.5065e+14 370486
- ln_bedroom 1 2.9025e+12 6.5121e+14 370499
- cat_cond 4 6.0677e+12 6.5438e+14 370566
- ln_sqft15_living 1 6.8675e+12 6.5518e+14 370591
- long 1 7.1634e+12 6.5547e+14 370598
- zipcode 1 8.0536e+12 6.5636e+14 370618
- bathrooms 1 9.4956e+12 6.5781e+14 370651
- yr_built 1 3.0015e+13 6.7833e+14 371116
- ln_sqft_living 1 3.8836e+13 6.8715e+14 371312
- sqft_livingSq 1 4.3233e+13 6.9154e+14 371408
- lat 1 7.3581e+13 7.2189e+14 372058
- cat_grade 10 1.3020e+14 7.7851e+14 373182
Step: AIC=370416.9
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot + ln_bedroom + cat_floors + ln_sqft_basement
Df Sum of Sq RSS AIC
+ ln_sqft_above 1 5.1711e+12 6.4235e+14 370298
+ basementSq 1 3.1881e+11 6.4720e+14 370411
<none> 6.4752e+14 370417
- ln_sqft_basement 1 7.9331e+11 6.4831e+14 370433
- ln_sqft15_lot 1 1.7822e+12 6.4930e+14 370456
- cat_floors 5 2.8922e+12 6.5041e+14 370474
- ln_bedroom 1 2.6403e+12 6.5016e+14 370476
- cat_cond 4 5.9420e+12 6.5346e+14 370547
- long 1 6.3819e+12 6.5390e+14 370563
- ln_sqft15_living 1 7.1164e+12 6.5463e+14 370580
- bathrooms 1 8.2756e+12 6.5579e+14 370607
- zipcode 1 8.3449e+12 6.5586e+14 370609
- yr_built 1 2.9303e+13 6.7682e+14 371084
- ln_sqft_living 1 3.9351e+13 6.8687e+14 371307
- sqft_livingSq 1 4.3568e+13 6.9109e+14 371400
- lat 1 7.2103e+13 7.1962e+14 372012
- cat_grade 10 1.3095e+14 7.7847e+14 373183
Step: AIC=370297.6
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot + ln_bedroom + cat_floors + ln_sqft_basement +
ln_sqft_above
Df Sum of Sq RSS AIC
+ basementSq 1 2.3018e+12 6.4005e+14 370245
<none> 6.4235e+14 370298
- cat_floors 5 2.2922e+12 6.4464e+14 370341
- ln_sqft15_lot 1 2.0534e+12 6.4440e+14 370344
- ln_bedroom 1 2.5725e+12 6.4492e+14 370356
- ln_sqft_above 1 5.1711e+12 6.4752e+14 370417
- ln_sqft_basement 1 5.8055e+12 6.4815e+14 370432
- long 1 6.3583e+12 6.4871e+14 370445
- cat_cond 4 6.7620e+12 6.4911e+14 370448
- ln_sqft15_living 1 6.7048e+12 6.4905e+14 370453
- bathrooms 1 7.7621e+12 6.5011e+14 370477
- zipcode 1 8.1382e+12 6.5049e+14 370486
- yr_built 1 2.9505e+13 6.7185e+14 370975
- ln_sqft_living 1 4.0923e+13 6.8327e+14 371230
- sqft_livingSq 1 4.1024e+13 6.8337e+14 371232
- lat 1 7.2614e+13 7.1496e+14 371916
- cat_grade 10 1.2332e+14 7.6566e+14 372934
Step: AIC=370245.3
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_sqft15_lot + ln_bedroom + cat_floors + ln_sqft_basement +
ln_sqft_above + basementSq
Df Sum of Sq RSS AIC
<none> 6.4005e+14 370245
- ln_sqft_basement 1 1.0244e+12 6.4107e+14 370267
- cat_floors 5 2.0203e+12 6.4207e+14 370283
- ln_sqft15_lot 1 2.1286e+12 6.4217e+14 370293
- basementSq 1 2.3018e+12 6.4235e+14 370298
- ln_bedroom 1 2.7164e+12 6.4276e+14 370307
- cat_cond 4 6.4596e+12 6.4650e+14 370389
- long 1 6.2997e+12 6.4634e+14 370391
- ln_sqft15_living 1 6.9089e+12 6.4695e+14 370406
- ln_sqft_above 1 7.1541e+12 6.4720e+14 370411
- bathrooms 1 7.2440e+12 6.4729e+14 370414
- zipcode 1 7.7324e+12 6.4778e+14 370425
- sqft_livingSq 1 2.5855e+13 6.6590e+14 370842
- yr_built 1 2.8720e+13 6.6877e+14 370907
- ln_sqft_living 1 3.1981e+13 6.7203e+14 370981
- lat 1 7.3474e+13 7.1352e+14 371887
- cat_grade 10 1.2558e+14 7.6563e+14 372936
Call:
lm(formula = price ~ cat_grade + lat + sqft_livingSq + yr_built +
ln_sqft_living + bathrooms + cat_cond + ln_sqft15_living +
long + zipcode + ln_sqft15_lot + ln_bedroom + cat_floors +
ln_sqft_basement + ln_sqft_above + basementSq)
Coefficients:
(Intercept) cat_grade4 cat_grade5 cat_grade6 cat_grade7 cat_grade8 cat_grade9 cat_grade10 cat_grade11
1.958e+07 -1.767e+05 -2.845e+04 3.839e+04 1.070e+05 1.714e+05 2.831e+05 4.366e+05 7.141e+05
cat_grade12 cat_grade13 lat sqft_livingSq yr_built ln_sqft_living bathrooms cat_cond2 cat_cond3
1.194e+06 2.379e+06 5.510e+05 2.459e+05 -2.431e+03 -3.850e+06 5.278e+04 1.641e+03 2.525e+04
cat_cond4 cat_cond5 ln_sqft15_living long zipcode ln_sqft15_lot ln_bedroom cat_floors1.5 cat_floors2
5.819e+04 1.010e+05 1.102e+05 -2.000e+05 -5.468e+02 -1.897e+04 -8.674e+04 7.897e+03 4.111e+00
cat_floors2.5 cat_floors3 cat_floors3.5 ln_sqft_basement ln_sqft_above basementSq
8.389e+04 5.869e+04 1.225e+05 -3.395e+04 3.807e+05 9.227e+03
We have now found model 2. Interestingly enough, the model 2 is exactly like m1. Let’s do backwards regression on the model 1. We will discard the stepwise regression model ###Model 2
m2 = lm(price ~ cat_grade + lat + sqft_livingSq + yr_built +
ln_sqft_living + bathrooms + cat_cond + ln_sqft15_living +
long + zipcode + ln_sqft15_lot + ln_bedroom + cat_floors +
ln_sqft_basement + ln_sqft_above + basementSq)
summary(m2)
Call:
lm(formula = price ~ cat_grade + lat + sqft_livingSq + yr_built +
ln_sqft_living + bathrooms + cat_cond + ln_sqft15_living +
long + zipcode + ln_sqft15_lot + ln_bedroom + cat_floors +
ln_sqft_basement + ln_sqft_above + basementSq)
Residuals:
Min 1Q Median 3Q Max
-1634591 -95962 -10949 71991 4450946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.958e+07 3.732e+06 5.247 1.56e-07 ***
cat_grade4 -1.767e+05 1.537e+05 -1.150 0.25025
cat_grade5 -2.845e+04 1.468e+05 -0.194 0.84637
cat_grade6 3.839e+04 1.462e+05 0.262 0.79297
cat_grade7 1.070e+05 1.464e+05 0.731 0.46483
cat_grade8 1.714e+05 1.465e+05 1.170 0.24209
cat_grade9 2.831e+05 1.466e+05 1.930 0.05358 .
cat_grade10 4.366e+05 1.467e+05 2.975 0.00293 **
cat_grade11 7.141e+05 1.471e+05 4.854 1.22e-06 ***
cat_grade12 1.194e+06 1.488e+05 8.023 1.11e-15 ***
cat_grade13 2.379e+06 1.645e+05 14.462 < 2e-16 ***
lat 5.510e+05 1.324e+04 41.629 < 2e-16 ***
sqft_livingSq 2.459e+05 9.957e+03 24.694 < 2e-16 ***
yr_built -2.431e+03 9.339e+01 -26.027 < 2e-16 ***
ln_sqft_living -3.850e+06 1.402e+05 -27.464 < 2e-16 ***
bathrooms 5.278e+04 4.038e+03 13.071 < 2e-16 ***
cat_cond2 1.641e+03 4.791e+04 0.034 0.97268
cat_cond3 2.525e+04 4.450e+04 0.567 0.57048
cat_cond4 5.819e+04 4.452e+04 1.307 0.19115
cat_cond5 1.010e+05 4.481e+04 2.253 0.02428 *
ln_sqft15_living 1.102e+05 8.635e+03 12.765 < 2e-16 ***
long -2.000e+05 1.640e+04 -12.189 < 2e-16 ***
zipcode -5.468e+02 4.049e+01 -13.505 < 2e-16 ***
ln_sqft15_lot -1.897e+04 2.677e+03 -7.086 1.45e-12 ***
ln_bedroom -8.674e+04 1.084e+04 -8.004 1.29e-15 ***
cat_floors1.5 7.897e+03 6.796e+03 1.162 0.24530
cat_floors2 4.111e+00 5.728e+03 0.001 0.99943
cat_floors2.5 8.389e+04 1.988e+04 4.220 2.45e-05 ***
cat_floors3 5.869e+04 1.229e+04 4.776 1.81e-06 ***
cat_floors3.5 1.225e+05 9.255e+04 1.323 0.18576
ln_sqft_basement -3.395e+04 6.907e+03 -4.915 8.96e-07 ***
ln_sqft_above 3.807e+05 2.931e+04 12.990 < 2e-16 ***
basementSq 9.227e+03 1.252e+03 7.368 1.82e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 205900 on 15096 degrees of freedom
Multiple R-squared: 0.6943, Adjusted R-squared: 0.6937
F-statistic: 1072 on 32 and 15096 DF, p-value: < 2.2e-16
step(m1,data=train_house_data,direction="backward")
Start: AIC=370245.3
price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Df Sum of Sq RSS AIC
<none> 6.4005e+14 370245
- ln_sqft_basement 1 1.0244e+12 6.4107e+14 370267
- cat_floors 5 2.0203e+12 6.4207e+14 370283
- ln_sqft15_lot 1 2.1286e+12 6.4217e+14 370293
- basementSq 1 2.3018e+12 6.4235e+14 370298
- ln_bedroom 1 2.7164e+12 6.4276e+14 370307
- cat_cond 4 6.4596e+12 6.4650e+14 370389
- long 1 6.2997e+12 6.4634e+14 370391
- ln_sqft15_living 1 6.9089e+12 6.4695e+14 370406
- ln_sqft_above 1 7.1541e+12 6.4720e+14 370411
- bathrooms 1 7.2440e+12 6.4729e+14 370414
- zipcode 1 7.7324e+12 6.4778e+14 370425
- sqft_livingSq 1 2.5855e+13 6.6590e+14 370842
- yr_built 1 2.8720e+13 6.6877e+14 370907
- ln_sqft_living 1 3.1981e+13 6.7203e+14 370981
- lat 1 7.3474e+13 7.1352e+14 371887
- cat_grade 10 1.2558e+14 7.6563e+14 372936
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
Coefficients:
(Intercept) cat_grade4 cat_grade5 cat_grade6 cat_grade7 cat_grade8 cat_grade9 cat_grade10 cat_grade11
1.958e+07 -1.767e+05 -2.845e+04 3.839e+04 1.070e+05 1.714e+05 2.831e+05 4.366e+05 7.141e+05
cat_grade12 cat_grade13 cat_cond2 cat_cond3 cat_cond4 cat_cond5 cat_floors1.5 cat_floors2 cat_floors2.5
1.194e+06 2.379e+06 1.641e+03 2.525e+04 5.819e+04 1.010e+05 7.897e+03 4.111e+00 8.389e+04
cat_floors3 cat_floors3.5 bathrooms ln_sqft_living ln_sqft15_lot ln_sqft_above yr_built zipcode lat
5.869e+04 1.225e+05 5.278e+04 -3.850e+06 -1.897e+04 3.807e+05 -2.431e+03 -5.468e+02 5.510e+05
long ln_sqft15_living ln_bedroom ln_sqft_basement basementSq sqft_livingSq
-2.000e+05 1.102e+05 -8.674e+04 -3.395e+04 9.227e+03 2.459e+05
Once again, backwards regression picks the same model. This makes sense because in model 1, most of the t p-values were very significant in which they were less than an alpha= .05 or .01
Let’s compute an F partial test on our categorical variables to prove our assumption (that these groups add a significant effect on predicting the price of a home)
m1_floor =lm(price ~ cat_grade + cat_cond + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
m1_grade =lm(price ~ cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
m1_cond =lm(price ~ cat_grade + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
anova(m1, m1_floor)
Analysis of Variance Table
Model 1: price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Model 2: price ~ cat_grade + cat_cond + bathrooms + ln_sqft_living + ln_sqft15_lot +
ln_sqft_above + yr_built + zipcode + lat + long + ln_sqft15_living +
ln_sqft15_lot + ln_bedroom + ln_sqft_basement + basementSq +
sqft_livingSq
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15096 6.4005e+14
2 15101 6.4207e+14 -5 -2.0203e+12 9.5301 4.327e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(m1, m1_cond)
Analysis of Variance Table
Model 1: price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Model 2: price ~ cat_grade + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15096 6.4005e+14
2 15100 6.4650e+14 -4 -6.4596e+12 38.089 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(m1, m1_grade)
Analysis of Variance Table
Model 1: price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Model 2: price ~ cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15096 6.4005e+14
2 15106 7.6563e+14 -10 -1.2558e+14 296.19 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From all of these outputs, we see that the F p-value < alpha=.05 or .01. Because of this, the categorical variables are indeed highly significant.
Let’s see if we can remove any outlier and leverage points.
First let’s find the outliers. To do this we will find any standard residual errors that exceed the cutoff of the absolute value of 2.
rsta = rstandard(m1)
temp_data_out = subset(train_house_data, (rsta>=2 & rsta<=-2))
temp_data_out
It appears that there are no outliers. Let’s do the same for the leverage point
Cutoff: 3(k+1)/n
Cutoff: 3(33)/15129
Cutoff: 0.006543724
dim(train_house_data)
[1] 15129 21
cutoff = 3*(33)/15129
lev = hatvalues(m1)
temp_data = subset(train_house_data, (lev>cutoff))
nrow(temp_data)
[1] 543
removed_train = which((lev>cutoff))
We have found 543 observations that exceed the cutoff. These are leverage points and we will remove it from the training data set. We will create a new model 2 from this
We will need to create a list to remove the points. We will create a function to do so.
removal <- function(x){
vector = c()
for (i in 1:length(x)){
vector[i] <- x[i]
}
return(vector)
}
lev_points = removal(removed_train)
m2 = lm(price ~ cat_grade+cat_cond+cat_floors+bathrooms+ln_sqft_living+ln_sqft15_lot+ln_sqft_above+yr_built+zipcode+lat+long+ln_sqft15_living+ln_sqft15_lot+ln_bedroom+ln_sqft_basement+basementSq+sqft_livingSq, subset=-lev_points)
summary(m2)
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1013645 -93434 -11609 68571 2219533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.360e+07 3.487e+06 3.900 9.64e-05 ***
cat_grade6 4.061e+04 6.655e+04 0.610 0.54172
cat_grade7 1.061e+05 6.662e+04 1.592 0.11143
cat_grade8 1.752e+05 6.681e+04 2.622 0.00875 **
cat_grade9 2.965e+05 6.702e+04 4.424 9.76e-06 ***
cat_grade10 4.603e+05 6.733e+04 6.837 8.43e-12 ***
cat_grade11 7.489e+05 6.828e+04 10.967 < 2e-16 ***
cat_cond4 3.170e+04 3.910e+03 8.107 5.59e-16 ***
cat_cond5 7.604e+04 6.343e+03 11.988 < 2e-16 ***
cat_floors1.5 6.622e+03 6.318e+03 1.048 0.29462
cat_floors2 1.690e+03 5.300e+03 0.319 0.74986
cat_floors3 5.604e+04 1.142e+04 4.908 9.32e-07 ***
bathrooms 5.285e+04 3.816e+03 13.848 < 2e-16 ***
ln_sqft_living -3.214e+06 1.418e+05 -22.666 < 2e-16 ***
ln_sqft15_lot -1.755e+04 2.516e+03 -6.974 3.20e-12 ***
ln_sqft_above 3.270e+05 2.778e+04 11.774 < 2e-16 ***
yr_built -2.457e+03 8.786e+01 -27.959 < 2e-16 ***
zipcode -4.862e+02 3.751e+01 -12.960 < 2e-16 ***
lat 5.475e+05 1.220e+04 44.864 < 2e-16 ***
long -1.843e+05 1.551e+04 -11.886 < 2e-16 ***
ln_sqft15_living 1.164e+05 8.113e+03 14.353 < 2e-16 ***
ln_bedroom -8.605e+04 1.035e+04 -8.314 < 2e-16 ***
ln_sqft_basement -2.443e+04 6.567e+03 -3.720 0.00020 ***
basementSq 7.216e+03 1.196e+03 6.035 1.63e-09 ***
sqft_livingSq 2.060e+05 1.005e+04 20.506 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 187400 on 14561 degrees of freedom
Multiple R-squared: 0.6688, Adjusted R-squared: 0.6683
F-statistic: 1225 on 24 and 14561 DF, p-value: < 2.2e-16
Let’s compare this model to model1
summary(m1)
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq)
Residuals:
Min 1Q Median 3Q Max
-1634591 -95962 -10949 71991 4450946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.958e+07 3.732e+06 5.247 1.56e-07 ***
cat_grade4 -1.767e+05 1.537e+05 -1.150 0.25025
cat_grade5 -2.845e+04 1.468e+05 -0.194 0.84637
cat_grade6 3.839e+04 1.462e+05 0.262 0.79297
cat_grade7 1.070e+05 1.464e+05 0.731 0.46483
cat_grade8 1.714e+05 1.465e+05 1.170 0.24209
cat_grade9 2.831e+05 1.466e+05 1.930 0.05358 .
cat_grade10 4.366e+05 1.467e+05 2.975 0.00293 **
cat_grade11 7.141e+05 1.471e+05 4.854 1.22e-06 ***
cat_grade12 1.194e+06 1.488e+05 8.023 1.11e-15 ***
cat_grade13 2.379e+06 1.645e+05 14.462 < 2e-16 ***
cat_cond2 1.641e+03 4.791e+04 0.034 0.97268
cat_cond3 2.525e+04 4.450e+04 0.567 0.57048
cat_cond4 5.819e+04 4.452e+04 1.307 0.19115
cat_cond5 1.010e+05 4.481e+04 2.253 0.02428 *
cat_floors1.5 7.897e+03 6.796e+03 1.162 0.24530
cat_floors2 4.111e+00 5.728e+03 0.001 0.99943
cat_floors2.5 8.389e+04 1.988e+04 4.220 2.45e-05 ***
cat_floors3 5.869e+04 1.229e+04 4.776 1.81e-06 ***
cat_floors3.5 1.225e+05 9.255e+04 1.323 0.18576
bathrooms 5.278e+04 4.038e+03 13.071 < 2e-16 ***
ln_sqft_living -3.850e+06 1.402e+05 -27.464 < 2e-16 ***
ln_sqft15_lot -1.897e+04 2.677e+03 -7.086 1.45e-12 ***
ln_sqft_above 3.807e+05 2.931e+04 12.990 < 2e-16 ***
yr_built -2.431e+03 9.339e+01 -26.027 < 2e-16 ***
zipcode -5.468e+02 4.049e+01 -13.505 < 2e-16 ***
lat 5.510e+05 1.324e+04 41.629 < 2e-16 ***
long -2.000e+05 1.640e+04 -12.189 < 2e-16 ***
ln_sqft15_living 1.102e+05 8.635e+03 12.765 < 2e-16 ***
ln_bedroom -8.674e+04 1.084e+04 -8.004 1.29e-15 ***
ln_sqft_basement -3.395e+04 6.907e+03 -4.915 8.96e-07 ***
basementSq 9.227e+03 1.252e+03 7.368 1.82e-13 ***
sqft_livingSq 2.459e+05 9.957e+03 24.694 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 205900 on 15096 degrees of freedom
Multiple R-squared: 0.6943, Adjusted R-squared: 0.6937
F-statistic: 1072 on 32 and 15096 DF, p-value: < 2.2e-16
Although the residual standard error for model 2 (s=187400) decreased a lot compared to model 1 (s=205900), the R2 and R2a decreased to 0.6688 and 0.6683 respectively. Model 2 is also more parsimonious compared to model 1 as it has only 24 explanatory variables whereas model 1 has 32 variables.
Let’s see if we can remove any values using cook’s distance. The formula for Cook’s distance df, df2
df1: k+1 df2: n - (k+1)
cooks = cooks.distance(m1)
#cooks[order(cooks)]
dim(train_house_data)
[1] 15129 21
df2 = 15129 - 33
cooks_cutoff = qf(.95, 33, df2)
temp_data2 = which(cooks > cooks_cutoff)
nrow(temp_data2)
NULL
From this output, we can see that there are no observations that exceed the cutoff from cooks distance.
Thus, we can establish m2 as model 2.
Let’s perform backwards regression on model 2 to see if we can generate a more parsimonious model
step(m2,data=train_house_data,direction="backward")
Start: AIC=354198
price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Df Sum of Sq RSS AIC
<none> 5.1122e+14 354198
- ln_sqft_basement 1 4.8589e+11 5.1170e+14 354210
- cat_floors 3 1.0097e+12 5.1223e+14 354221
- basementSq 1 1.2786e+12 5.1249e+14 354232
- ln_sqft15_lot 1 1.7078e+12 5.1292e+14 354245
- ln_bedroom 1 2.4266e+12 5.1364e+14 354265
- ln_sqft_above 1 4.8670e+12 5.1608e+14 354334
- long 1 4.9599e+12 5.1618e+14 354337
- cat_cond 2 5.9274e+12 5.1714e+14 354362
- zipcode 1 5.8970e+12 5.1711e+14 354363
- bathrooms 1 6.7329e+12 5.1795e+14 354387
- ln_sqft15_living 1 7.2325e+12 5.1845e+14 354401
- sqft_livingSq 1 1.4763e+13 5.2598e+14 354611
- ln_sqft_living 1 1.8037e+13 5.2925e+14 354702
- yr_built 1 2.7444e+13 5.3866e+14 354959
- lat 1 7.0666e+13 5.8188e+14 356085
- cat_grade 6 8.0458e+13 5.9167e+14 356318
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Coefficients:
(Intercept) cat_grade6 cat_grade7 cat_grade8 cat_grade9 cat_grade10 cat_grade11 cat_cond4 cat_cond5
13601366.5 40609.2 106051.4 175162.0 296503.3 460287.4 748874.5 31697.9 76039.7
cat_floors1.5 cat_floors2 cat_floors3 bathrooms ln_sqft_living ln_sqft15_lot ln_sqft_above yr_built zipcode
6621.6 1689.8 56038.6 52846.8 -3213857.0 -17547.5 327033.7 -2456.5 -486.2
lat long ln_sqft15_living ln_bedroom ln_sqft_basement basementSq sqft_livingSq
547544.6 -184321.3 116438.0 -86052.8 -24429.0 7216.4 206045.8
m3 = lm(price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
summary(m3)
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1013645 -93434 -11609 68571 2219533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.360e+07 3.487e+06 3.900 9.64e-05 ***
cat_grade6 4.061e+04 6.655e+04 0.610 0.54172
cat_grade7 1.061e+05 6.662e+04 1.592 0.11143
cat_grade8 1.752e+05 6.681e+04 2.622 0.00875 **
cat_grade9 2.965e+05 6.702e+04 4.424 9.76e-06 ***
cat_grade10 4.603e+05 6.733e+04 6.837 8.43e-12 ***
cat_grade11 7.489e+05 6.828e+04 10.967 < 2e-16 ***
cat_cond4 3.170e+04 3.910e+03 8.107 5.59e-16 ***
cat_cond5 7.604e+04 6.343e+03 11.988 < 2e-16 ***
cat_floors1.5 6.622e+03 6.318e+03 1.048 0.29462
cat_floors2 1.690e+03 5.300e+03 0.319 0.74986
cat_floors3 5.604e+04 1.142e+04 4.908 9.32e-07 ***
bathrooms 5.285e+04 3.816e+03 13.848 < 2e-16 ***
ln_sqft_living -3.214e+06 1.418e+05 -22.666 < 2e-16 ***
ln_sqft15_lot -1.755e+04 2.516e+03 -6.974 3.20e-12 ***
ln_sqft_above 3.270e+05 2.778e+04 11.774 < 2e-16 ***
yr_built -2.457e+03 8.786e+01 -27.959 < 2e-16 ***
zipcode -4.862e+02 3.751e+01 -12.960 < 2e-16 ***
lat 5.475e+05 1.220e+04 44.864 < 2e-16 ***
long -1.843e+05 1.551e+04 -11.886 < 2e-16 ***
ln_sqft15_living 1.164e+05 8.113e+03 14.353 < 2e-16 ***
ln_bedroom -8.605e+04 1.035e+04 -8.314 < 2e-16 ***
ln_sqft_basement -2.443e+04 6.567e+03 -3.720 0.00020 ***
basementSq 7.216e+03 1.196e+03 6.035 1.63e-09 ***
sqft_livingSq 2.060e+05 1.005e+04 20.506 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 187400 on 14561 degrees of freedom
Multiple R-squared: 0.6688, Adjusted R-squared: 0.6683
F-statistic: 1225 on 24 and 14561 DF, p-value: < 2.2e-16
Backwards regression has chosen the same model as model 2.
Let’s see if stepwise also chooses the same
m2_0 = lm(price~1, subset=-lev_points)
step(m2_0, scope=list(lower=m2_0,upper=m2, direction="both"))
Start: AIC=370268.4
price ~ 1
Df Sum of Sq RSS AIC
+ cat_grade 6 7.1554e+14 8.2803e+14 361196
+ sqft_livingSq 1 5.8603e+14 9.5754e+14 363306
+ ln_sqft_living 1 5.6649e+14 9.7708e+14 363600
+ ln_sqft15_living 1 4.6334e+14 1.0802e+15 365064
+ ln_sqft_above 1 4.2976e+14 1.1138e+15 365511
+ bathrooms 1 3.7915e+14 1.1644e+15 366159
+ lat 1 1.6584e+14 1.3777e+15 368613
+ ln_bedroom 1 1.3788e+14 1.4057e+15 368906
+ cat_floors 3 1.1186e+14 1.4317e+15 369177
+ basementSq 1 7.9168e+13 1.4644e+15 369502
+ ln_sqft_basement 1 6.4390e+13 1.4792e+15 369649
+ ln_sqft15_lot 1 3.0459e+13 1.5131e+15 369980
+ cat_cond 2 7.1843e+12 1.5364e+15 370204
+ yr_built 1 4.2198e+12 1.5394e+15 370231
+ zipcode 1 4.0245e+12 1.5395e+15 370232
+ long 1 1.1981e+12 1.5424e+15 370259
<none> 1.5436e+15 370268
Step: AIC=361196.2
price ~ cat_grade
Df Sum of Sq RSS AIC
+ lat 1 1.1001e+14 7.1802e+14 359119
+ yr_built 1 9.2021e+13 7.3601e+14 359480
+ sqft_livingSq 1 7.2278e+13 7.5575e+14 359866
+ ln_sqft_living 1 6.8160e+13 7.5987e+14 359945
+ basementSq 1 5.8431e+13 7.6960e+14 360131
+ ln_sqft_basement 1 5.2960e+13 7.7507e+14 360234
+ cat_cond 2 3.3457e+13 7.9457e+14 360599
+ ln_sqft15_living 1 3.0962e+13 7.9707e+14 360642
+ long 1 2.1969e+13 8.0606e+14 360806
+ bathrooms 1 2.1533e+13 8.0650e+14 360814
+ cat_floors 3 2.0278e+13 8.0775e+14 360841
+ ln_bedroom 1 1.2615e+13 8.1541e+14 360974
+ ln_sqft_above 1 1.0756e+13 8.1727e+14 361007
+ zipcode 1 7.5943e+12 8.2043e+14 361064
<none> 8.2803e+14 361196
+ ln_sqft15_lot 1 9.8691e+10 8.2793e+14 361196
- cat_grade 6 7.1554e+14 1.5436e+15 370268
Step: AIC=359118.9
price ~ cat_grade + lat
Df Sum of Sq RSS AIC
+ sqft_livingSq 1 8.5623e+13 6.3239e+14 357269
+ ln_sqft_living 1 8.1591e+13 6.3643e+14 357361
+ yr_built 1 5.3630e+13 6.6439e+14 357989
+ basementSq 1 4.1578e+13 6.7644e+14 358251
+ ln_sqft15_living 1 3.6624e+13 6.8139e+14 358357
+ ln_sqft_basement 1 3.6285e+13 6.8173e+14 358364
+ cat_cond 2 3.2304e+13 6.8571e+14 358451
+ bathrooms 1 2.9981e+13 6.8804e+14 358499
+ ln_sqft_above 1 2.4875e+13 6.9314e+14 358607
+ ln_bedroom 1 1.9657e+13 6.9836e+14 358716
+ cat_floors 3 1.3924e+13 7.0409e+14 358839
+ long 1 8.9607e+12 7.0906e+14 358938
+ ln_sqft15_lot 1 2.0574e+12 7.1596e+14 359079
+ zipcode 1 1.6281e+11 7.1785e+14 359118
<none> 7.1802e+14 359119
- lat 1 1.1001e+14 8.2803e+14 361196
- cat_grade 6 6.5972e+14 1.3777e+15 368613
Step: AIC=357268.8
price ~ cat_grade + lat + sqft_livingSq
Df Sum of Sq RSS AIC
+ yr_built 1 5.3283e+13 5.7911e+14 355987
+ ln_sqft_living 1 2.6063e+13 6.0633e+14 356657
+ cat_cond 2 2.3934e+13 6.0846e+14 356710
+ long 1 2.0195e+13 6.1220e+14 356797
+ cat_floors 3 1.0885e+13 6.2151e+14 357021
+ basementSq 1 9.2139e+12 6.2318e+14 357057
+ ln_sqft_basement 1 8.4652e+12 6.2393e+14 357074
+ ln_sqft_above 1 3.9959e+12 6.2840e+14 357178
+ ln_sqft15_living 1 2.8759e+12 6.2952e+14 357204
+ ln_bedroom 1 1.4688e+12 6.3092e+14 357237
+ ln_sqft15_lot 1 1.2255e+12 6.3117e+14 357242
+ bathrooms 1 3.4045e+11 6.3205e+14 357263
+ zipcode 1 2.4443e+11 6.3215e+14 357265
<none> 6.3239e+14 357269
- sqft_livingSq 1 8.5623e+13 7.1802e+14 359119
- lat 1 1.2336e+14 7.5575e+14 359866
- cat_grade 6 1.7381e+14 8.0620e+14 360799
Step: AIC=355986.9
price ~ cat_grade + lat + sqft_livingSq + yr_built
Df Sum of Sq RSS AIC
+ ln_sqft_living 1 2.3848e+13 5.5526e+14 355376
+ bathrooms 1 1.2690e+13 5.6642e+14 355666
+ cat_cond 2 5.8116e+12 5.7330e+14 355844
+ long 1 4.1525e+12 5.7496e+14 355884
+ ln_sqft15_lot 1 3.5689e+12 5.7554e+14 355899
+ ln_sqft15_living 1 3.5552e+12 5.7556e+14 355899
+ cat_floors 3 2.5748e+12 5.7654e+14 355928
+ ln_bedroom 1 1.8293e+12 5.7728e+14 355943
+ zipcode 1 1.8067e+12 5.7730e+14 355943
+ basementSq 1 1.5022e+12 5.7761e+14 355951
+ ln_sqft_basement 1 1.4118e+12 5.7770e+14 355953
<none> 5.7911e+14 355987
+ ln_sqft_above 1 2.9231e+10 5.7908e+14 355988
- yr_built 1 5.3283e+13 6.3239e+14 357269
- lat 1 8.2315e+13 6.6143e+14 357923
- sqft_livingSq 1 8.5276e+13 6.6439e+14 357989
- cat_grade 6 2.0645e+14 7.8556e+14 360422
Step: AIC=355375.5
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living
Df Sum of Sq RSS AIC
+ bathrooms 1 1.0952e+13 5.4431e+14 355087
+ cat_cond 2 6.3924e+12 5.4887e+14 355211
+ ln_sqft15_lot 1 4.0632e+12 5.5120e+14 355270
+ long 1 3.5741e+12 5.5169e+14 355283
+ ln_sqft15_living 1 3.4575e+12 5.5181e+14 355286
+ cat_floors 3 2.7155e+12 5.5255e+14 355310
+ zipcode 1 2.2325e+12 5.5303e+14 355319
+ ln_sqft_basement 1 1.7437e+12 5.5352e+14 355332
+ basementSq 1 1.5054e+12 5.5376e+14 355338
+ ln_bedroom 1 9.8875e+11 5.5427e+14 355352
<none> 5.5526e+14 355376
+ ln_sqft_above 1 3.0254e+10 5.5523e+14 355377
- ln_sqft_living 1 2.3848e+13 5.7911e+14 355987
- sqft_livingSq 1 2.7703e+13 5.8297e+14 356084
- yr_built 1 5.1067e+13 6.0633e+14 356657
- lat 1 7.7882e+13 6.3315e+14 357288
- cat_grade 6 1.2305e+14 6.7831e+14 358283
Step: AIC=355087
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms
Df Sum of Sq RSS AIC
+ cat_cond 2 5.8090e+12 5.3850e+14 354934
+ ln_sqft15_living 1 4.6687e+12 5.3964e+14 354963
+ zipcode 1 2.8827e+12 5.4143e+14 355012
+ long 1 2.4101e+12 5.4190e+14 355024
+ ln_bedroom 1 2.2770e+12 5.4203e+14 355028
+ ln_sqft15_lot 1 1.7631e+12 5.4255e+14 355042
+ cat_floors 3 1.4430e+12 5.4287e+14 355054
+ ln_sqft_basement 1 9.4553e+11 5.4337e+14 355064
+ basementSq 1 8.0553e+11 5.4351e+14 355067
<none> 5.4431e+14 355087
+ ln_sqft_above 1 3.1916e+05 5.4431e+14 355089
- bathrooms 1 1.0952e+13 5.5526e+14 355376
- ln_sqft_living 1 2.2111e+13 5.6642e+14 355666
- sqft_livingSq 1 2.4773e+13 5.6908e+14 355734
- yr_built 1 6.1866e+13 6.0618e+14 356655
- lat 1 7.5336e+13 6.1965e+14 356976
- cat_grade 6 1.1796e+14 6.6227e+14 357936
Step: AIC=354934.5
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond
Df Sum of Sq RSS AIC
+ ln_sqft15_living 1 4.7234e+12 5.3378e+14 354808
+ long 1 2.7468e+12 5.3575e+14 354862
+ ln_bedroom 1 2.5218e+12 5.3598e+14 354868
+ ln_sqft15_lot 1 2.3484e+12 5.3615e+14 354873
+ zipcode 1 2.0129e+12 5.3649e+14 354882
+ cat_floors 3 1.5424e+12 5.3696e+14 354899
+ ln_sqft_basement 1 7.0215e+11 5.3780e+14 354917
+ basementSq 1 5.3778e+11 5.3796e+14 354922
<none> 5.3850e+14 354934
+ ln_sqft_above 1 5.2266e+10 5.3845e+14 354935
- cat_cond 2 5.8090e+12 5.4431e+14 355087
- bathrooms 1 1.0369e+13 5.4887e+14 355211
- ln_sqft_living 1 2.2715e+13 5.6122e+14 355535
- sqft_livingSq 1 2.5341e+13 5.6384e+14 355603
- yr_built 1 4.2306e+13 5.8081e+14 356036
- lat 1 7.7859e+13 6.1636e+14 356902
- cat_grade 6 1.1980e+14 6.5831e+14 357852
Step: AIC=354808
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living
Df Sum of Sq RSS AIC
+ long 1 4.9798e+12 5.2880e+14 354673
+ ln_sqft15_lot 1 4.6309e+12 5.2915e+14 354683
+ cat_floors 3 2.6636e+12 5.3111e+14 354741
+ ln_bedroom 1 2.3197e+12 5.3146e+14 354746
+ ln_sqft_basement 1 1.1781e+12 5.3260e+14 354778
+ zipcode 1 1.0483e+12 5.3273e+14 354781
+ basementSq 1 9.7327e+11 5.3281e+14 354783
<none> 5.3378e+14 354808
+ ln_sqft_above 1 1.1736e+08 5.3378e+14 354810
- ln_sqft15_living 1 4.7234e+12 5.3850e+14 354934
- cat_cond 2 5.8636e+12 5.3964e+14 354963
- bathrooms 1 1.1498e+13 5.4528e+14 355117
- ln_sqft_living 1 2.2502e+13 5.5628e+14 355408
- sqft_livingSq 1 2.4519e+13 5.5830e+14 355461
- yr_built 1 4.3714e+13 5.7749e+14 355954
- lat 1 7.7879e+13 6.1166e+14 356792
- cat_grade 6 1.0009e+14 6.3387e+14 357303
Step: AIC=354673.2
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long
Df Sum of Sq RSS AIC
+ zipcode 1 5.1631e+12 5.2364e+14 354532
+ ln_sqft15_lot 1 2.3178e+12 5.2648e+14 354611
+ ln_bedroom 1 2.2417e+12 5.2656e+14 354613
+ cat_floors 3 1.6559e+12 5.2714e+14 354634
+ ln_sqft_basement 1 3.5716e+11 5.2844e+14 354665
+ basementSq 1 2.5273e+11 5.2855e+14 354668
+ ln_sqft_above 1 1.9632e+11 5.2860e+14 354670
<none> 5.2880e+14 354673
- long 1 4.9798e+12 5.3378e+14 354808
- cat_cond 2 6.3424e+12 5.3514e+14 354843
- ln_sqft15_living 1 6.9564e+12 5.3575e+14 354862
- bathrooms 1 1.0155e+13 5.3895e+14 354949
- ln_sqft_living 1 2.1923e+13 5.5072e+14 355264
- sqft_livingSq 1 2.4005e+13 5.5280e+14 355319
- yr_built 1 2.9005e+13 5.5780e+14 355450
- lat 1 7.4976e+13 6.0377e+14 356605
- cat_grade 6 9.4704e+13 6.2350e+14 357064
Step: AIC=354532.1
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode
Df Sum of Sq RSS AIC
+ ln_bedroom 1 2.6889e+12 5.2095e+14 354459
+ ln_sqft15_lot 1 2.6148e+12 5.2102e+14 354461
+ cat_floors 3 1.9934e+12 5.2164e+14 354483
+ ln_sqft_basement 1 4.4669e+11 5.2319e+14 354522
+ basementSq 1 3.1518e+11 5.2332e+14 354525
+ ln_sqft_above 1 1.5369e+11 5.2348e+14 354530
<none> 5.2364e+14 354532
- cat_cond 2 5.0582e+12 5.2869e+14 354668
- zipcode 1 5.1631e+12 5.2880e+14 354673
- ln_sqft15_living 1 5.8442e+12 5.2948e+14 354692
- long 1 9.0945e+12 5.3273e+14 354781
- bathrooms 1 1.0174e+13 5.3381e+14 354811
- ln_sqft_living 1 2.2256e+13 5.4589e+14 355137
- sqft_livingSq 1 2.4383e+13 5.4802e+14 355194
- yr_built 1 3.1562e+13 5.5520e+14 355384
- lat 1 8.0055e+13 6.0369e+14 356605
- cat_grade 6 9.3797e+13 6.1743e+14 356924
Step: AIC=354459
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom
Df Sum of Sq RSS AIC
+ ln_sqft15_lot 1 2.5873e+12 5.1836e+14 354388
+ cat_floors 3 1.9371e+12 5.1901e+14 354411
+ ln_sqft_basement 1 3.1835e+11 5.2063e+14 354452
+ ln_sqft_above 1 2.2157e+11 5.2072e+14 354455
+ basementSq 1 2.1630e+11 5.2073e+14 354455
<none> 5.2095e+14 354459
- ln_bedroom 1 2.6889e+12 5.2364e+14 354532
- cat_cond 2 5.2211e+12 5.2617e+14 354600
- ln_sqft15_living 1 5.5414e+12 5.2649e+14 354611
- zipcode 1 5.6103e+12 5.2656e+14 354613
- long 1 9.2636e+12 5.3021e+14 354714
- bathrooms 1 1.1543e+13 5.3249e+14 354777
- ln_sqft_living 1 2.0893e+13 5.4184e+14 355031
- sqft_livingSq 1 2.3458e+13 5.4440e+14 355099
- yr_built 1 3.2645e+13 5.5359e+14 355344
- lat 1 7.8817e+13 5.9976e+14 356512
- cat_grade 6 8.7651e+13 6.0860e+14 356715
Step: AIC=354388.4
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom + ln_sqft15_lot
Df Sum of Sq RSS AIC
+ cat_floors 3 1.1786e+12 5.1718e+14 354361
+ ln_sqft_basement 1 2.5896e+11 5.1810e+14 354383
+ ln_sqft_above 1 2.4374e+11 5.1812e+14 354384
+ basementSq 1 1.7800e+11 5.1818e+14 354385
<none> 5.1836e+14 354388
- ln_sqft15_lot 1 2.5873e+12 5.2095e+14 354459
- ln_bedroom 1 2.6613e+12 5.2102e+14 354461
- cat_cond 2 5.6486e+12 5.2401e+14 354543
- zipcode 1 5.9148e+12 5.2427e+14 354552
- long 1 6.2290e+12 5.2459e+14 354561
- ln_sqft15_living 1 6.7968e+12 5.2516e+14 354576
- bathrooms 1 9.3428e+12 5.2770e+14 354647
- ln_sqft_living 1 2.1606e+13 5.3996e+14 354982
- sqft_livingSq 1 2.4353e+13 5.4271e+14 355056
- yr_built 1 3.4394e+13 5.5275e+14 355323
- lat 1 7.3576e+13 5.9194e+14 356322
- cat_grade 6 8.7664e+13 6.0602e+14 356655
Step: AIC=354361.2
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom + ln_sqft15_lot + cat_floors
Df Sum of Sq RSS AIC
+ ln_sqft_basement 1 6.1393e+11 5.1657e+14 354346
+ basementSq 1 4.7841e+11 5.1670e+14 354350
+ ln_sqft_above 1 1.2760e+11 5.1705e+14 354360
<none> 5.1718e+14 354361
- cat_floors 3 1.1786e+12 5.1836e+14 354388
- ln_sqft15_lot 1 1.8287e+12 5.1901e+14 354411
- ln_bedroom 1 2.6563e+12 5.1984e+14 354434
- cat_cond 2 5.5588e+12 5.2274e+14 354513
- long 1 5.7172e+12 5.2290e+14 354520
- zipcode 1 6.0756e+12 5.2326e+14 354530
- ln_sqft15_living 1 7.2688e+12 5.2445e+14 354563
- bathrooms 1 8.7111e+12 5.2589e+14 354603
- ln_sqft_living 1 2.2021e+13 5.3920e+14 354967
- sqft_livingSq 1 2.4805e+13 5.4199e+14 355043
- yr_built 1 2.8454e+13 5.4563e+14 355140
- lat 1 7.0831e+13 5.8801e+14 356231
- cat_grade 6 8.4355e+13 6.0154e+14 356553
Step: AIC=354345.9
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom + ln_sqft15_lot + cat_floors + ln_sqft_basement
Df Sum of Sq RSS AIC
+ ln_sqft_above 1 4.0721e+12 5.1249e+14 354232
+ basementSq 1 4.8375e+11 5.1608e+14 354334
<none> 5.1657e+14 354346
- ln_sqft_basement 1 6.1393e+11 5.1718e+14 354361
- cat_floors 3 1.5336e+12 5.1810e+14 354383
- ln_sqft15_lot 1 1.4004e+12 5.1797e+14 354383
- ln_bedroom 1 2.4332e+12 5.1900e+14 354412
- long 1 5.0582e+12 5.2162e+14 354486
- cat_cond 2 5.4621e+12 5.2203e+14 354495
- zipcode 1 6.2943e+12 5.2286e+14 354521
- ln_sqft15_living 1 7.4978e+12 5.2406e+14 354554
- bathrooms 1 7.6238e+12 5.2419e+14 354558
- ln_sqft_living 1 2.2393e+13 5.3896e+14 354963
- sqft_livingSq 1 2.5066e+13 5.4163e+14 355035
- yr_built 1 2.7751e+13 5.4432e+14 355107
- lat 1 6.9569e+13 5.8614e+14 356187
- cat_grade 6 8.4968e+13 6.0153e+14 356555
Step: AIC=354232.5
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom + ln_sqft15_lot + cat_floors + ln_sqft_basement +
ln_sqft_above
Df Sum of Sq RSS AIC
+ basementSq 1 1.2786e+12 5.1122e+14 354198
<none> 5.1249e+14 354232
- cat_floors 3 1.2253e+12 5.1372e+14 354261
- ln_sqft15_lot 1 1.6352e+12 5.1413e+14 354277
- ln_bedroom 1 2.3387e+12 5.1483e+14 354297
- ln_sqft_above 1 4.0721e+12 5.1657e+14 354346
- ln_sqft_basement 1 4.5585e+12 5.1705e+14 354360
- long 1 5.0191e+12 5.1751e+14 354373
- cat_cond 2 6.1584e+12 5.1865e+14 354403
- zipcode 1 6.1501e+12 5.1864e+14 354404
- ln_sqft15_living 1 7.0572e+12 5.1955e+14 354430
- bathrooms 1 7.1803e+12 5.1967e+14 354433
- ln_sqft_living 1 2.3203e+13 5.3570e+14 354876
- sqft_livingSq 1 2.3296e+13 5.3579e+14 354879
- yr_built 1 2.7978e+13 5.4047e+14 355006
- lat 1 7.0050e+13 5.8255e+14 356099
- cat_grade 6 7.9187e+13 5.9168e+14 356316
Step: AIC=354198
price ~ cat_grade + lat + sqft_livingSq + yr_built + ln_sqft_living +
bathrooms + cat_cond + ln_sqft15_living + long + zipcode +
ln_bedroom + ln_sqft15_lot + cat_floors + ln_sqft_basement +
ln_sqft_above + basementSq
Df Sum of Sq RSS AIC
<none> 5.1122e+14 354198
- ln_sqft_basement 1 4.8589e+11 5.1170e+14 354210
- cat_floors 3 1.0097e+12 5.1223e+14 354221
- basementSq 1 1.2786e+12 5.1249e+14 354232
- ln_sqft15_lot 1 1.7078e+12 5.1292e+14 354245
- ln_bedroom 1 2.4266e+12 5.1364e+14 354265
- ln_sqft_above 1 4.8670e+12 5.1608e+14 354334
- long 1 4.9599e+12 5.1618e+14 354337
- cat_cond 2 5.9274e+12 5.1714e+14 354362
- zipcode 1 5.8970e+12 5.1711e+14 354363
- bathrooms 1 6.7329e+12 5.1795e+14 354387
- ln_sqft15_living 1 7.2325e+12 5.1845e+14 354401
- sqft_livingSq 1 1.4763e+13 5.2598e+14 354611
- ln_sqft_living 1 1.8037e+13 5.2925e+14 354702
- yr_built 1 2.7444e+13 5.3866e+14 354959
- lat 1 7.0666e+13 5.8188e+14 356085
- cat_grade 6 8.0458e+13 5.9167e+14 356318
Call:
lm(formula = price ~ cat_grade + lat + sqft_livingSq + yr_built +
ln_sqft_living + bathrooms + cat_cond + ln_sqft15_living +
long + zipcode + ln_bedroom + ln_sqft15_lot + cat_floors +
ln_sqft_basement + ln_sqft_above + basementSq, subset = -lev_points)
Coefficients:
(Intercept) cat_grade6 cat_grade7 cat_grade8 cat_grade9 cat_grade10 cat_grade11 lat sqft_livingSq
13601366.5 40609.2 106051.4 175162.0 296503.3 460287.4 748874.5 547544.6 206045.8
yr_built ln_sqft_living bathrooms cat_cond4 cat_cond5 ln_sqft15_living long zipcode ln_bedroom
-2456.5 -3213857.0 52846.8 31697.9 76039.7 116438.0 -184321.3 -486.2 -86052.8
ln_sqft15_lot cat_floors1.5 cat_floors2 cat_floors3 ln_sqft_basement ln_sqft_above basementSq
-17547.5 6621.6 1689.8 56038.6 -24429.0 327033.7 7216.4
Stepwise picked the same.
We now have two models: one without the leverage points and with. Let’s compute AIC to see which model it prefers. My hunch is that the model 2 will be chosen because it is simpler compared to model 1
extractAIC(m1)
[1] 33.0 370245.3
extractAIC(m2)
[1] 25 354198
As we can see model 2 was chosen. Let’s investigate m2 further and we will provide hypothesis testing.
Ho: B1=B2=B3…B23=B24
Ha: At least 1 B != 0
summary(m2)
Call:
lm(formula = price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1013645 -93434 -11609 68571 2219533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.360e+07 3.487e+06 3.900 9.64e-05 ***
cat_grade6 4.061e+04 6.655e+04 0.610 0.54172
cat_grade7 1.061e+05 6.662e+04 1.592 0.11143
cat_grade8 1.752e+05 6.681e+04 2.622 0.00875 **
cat_grade9 2.965e+05 6.702e+04 4.424 9.76e-06 ***
cat_grade10 4.603e+05 6.733e+04 6.837 8.43e-12 ***
cat_grade11 7.489e+05 6.828e+04 10.967 < 2e-16 ***
cat_cond4 3.170e+04 3.910e+03 8.107 5.59e-16 ***
cat_cond5 7.604e+04 6.343e+03 11.988 < 2e-16 ***
cat_floors1.5 6.622e+03 6.318e+03 1.048 0.29462
cat_floors2 1.690e+03 5.300e+03 0.319 0.74986
cat_floors3 5.604e+04 1.142e+04 4.908 9.32e-07 ***
bathrooms 5.285e+04 3.816e+03 13.848 < 2e-16 ***
ln_sqft_living -3.214e+06 1.418e+05 -22.666 < 2e-16 ***
ln_sqft15_lot -1.755e+04 2.516e+03 -6.974 3.20e-12 ***
ln_sqft_above 3.270e+05 2.778e+04 11.774 < 2e-16 ***
yr_built -2.457e+03 8.786e+01 -27.959 < 2e-16 ***
zipcode -4.862e+02 3.751e+01 -12.960 < 2e-16 ***
lat 5.475e+05 1.220e+04 44.864 < 2e-16 ***
long -1.843e+05 1.551e+04 -11.886 < 2e-16 ***
ln_sqft15_living 1.164e+05 8.113e+03 14.353 < 2e-16 ***
ln_bedroom -8.605e+04 1.035e+04 -8.314 < 2e-16 ***
ln_sqft_basement -2.443e+04 6.567e+03 -3.720 0.00020 ***
basementSq 7.216e+03 1.196e+03 6.035 1.63e-09 ***
sqft_livingSq 2.060e+05 1.005e+04 20.506 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 187400 on 14561 degrees of freedom
Multiple R-squared: 0.6688, Adjusted R-squared: 0.6683
F-statistic: 1225 on 24 and 14561 DF, p-value: < 2.2e-16
We have a F p-value=2.2e-16 < alpha=.05. Because of this we reject the null hypothesis and accept the alternative hypothesis. This is a significant model that fits well. The standard residual error is lower than model 1 at 187400, the R2=.6688 and R2a=.6683 are virtually the same (meaning we are not taking a penalty hit for the extra variables).
This model explains 66.88% of the variation within price.
We also do not suspect multicollinearity, as all of the t p-values are very significant (less than alpha=.01), and the F p-value is very significant. We also do not suspect bias from omitted variables, as our R2 value is very similar compared to the original model.
Let us see if this model is homoscedastic ## Homoscedasticity
plot(residuals(m2) ~ fitted.values(m2), main="Residuals vs. Fitted Values")
This plot displays that our model has incredibly varied variance. This is bad. Let’s build another model, model 3, in which we transform our price to diminish the residual standard errors.
ln_price = log(price)
m3 = lm(ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq, subset = -lev_points)
summary(m3)
Call:
lm(formula = ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1.38524 -0.16167 -0.00347 0.15402 1.12771
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.246e+00 4.717e+00 0.900 0.368043
cat_grade6 1.673e-01 9.002e-02 1.858 0.063171 .
cat_grade7 3.399e-01 9.012e-02 3.772 0.000163 ***
cat_grade8 5.033e-01 9.037e-02 5.569 2.60e-08 ***
cat_grade9 6.834e-01 9.067e-02 7.537 5.09e-14 ***
cat_grade10 8.192e-01 9.108e-02 8.995 < 2e-16 ***
cat_grade11 9.606e-01 9.237e-02 10.400 < 2e-16 ***
cat_cond4 5.200e-02 5.289e-03 9.832 < 2e-16 ***
cat_cond5 1.175e-01 8.581e-03 13.689 < 2e-16 ***
cat_floors1.5 4.156e-02 8.547e-03 4.863 1.17e-06 ***
cat_floors2 3.624e-02 7.170e-03 5.054 4.37e-07 ***
cat_floors3 1.261e-01 1.545e-02 8.163 3.55e-16 ***
bathrooms 7.462e-02 5.162e-03 14.455 < 2e-16 ***
ln_sqft_living -2.088e+00 1.918e-01 -10.888 < 2e-16 ***
ln_sqft15_lot -2.335e-02 3.403e-03 -6.860 7.15e-12 ***
ln_sqft_above 1.102e-01 3.757e-02 2.932 0.003369 **
yr_built -4.105e-03 1.189e-04 -34.535 < 2e-16 ***
zipcode -5.725e-04 5.075e-05 -11.282 < 2e-16 ***
lat 1.308e+00 1.651e-02 79.213 < 2e-16 ***
long -1.213e-01 2.098e-02 -5.781 7.58e-09 ***
ln_sqft15_living 2.310e-01 1.097e-02 21.050 < 2e-16 ***
ln_bedroom -1.614e-01 1.400e-02 -11.524 < 2e-16 ***
ln_sqft_basement 3.865e-02 8.883e-03 4.351 1.37e-05 ***
basementSq -4.282e-03 1.618e-03 -2.647 0.008133 **
sqft_livingSq 1.575e-01 1.359e-02 11.585 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2535 on 14561 degrees of freedom
Multiple R-squared: 0.7471, Adjusted R-squared: 0.7467
F-statistic: 1792 on 24 and 14561 DF, p-value: < 2.2e-16
Booyah! Model 3 by far has the lowest residual standard error of any of the other models at 0.2535. Also, the R2 value=0.7471 and the R2a=0.7467 has greatly improved compared to the other models. We will now deem this as our current “final” model.
Let’s re-plot to see if we have homoscedasticity
plot(residuals(m3) ~ fitted.values(m3), main="Residuals vs. Fitted Values")
That is much better compared to model 2! Let’s see if extracting AIC values will agree with our model selection
extractAIC(m1)
[1] 33.0 370245.3
extractAIC(m2)
[1] 25 354198
extractAIC(m3)
[1] 25.00 -40013.93
Model 3 is significantly better than m3. Let’s continue rolling with this!
Let’s do backwards regression to see if we can find a better model from m3
step(m3, data=train_house_data, subset=-lev_points, direction = "backward")
Start: AIC=-40013.93
ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms + ln_sqft_living +
ln_sqft15_lot + ln_sqft_above + yr_built + zipcode + lat +
long + ln_sqft15_living + ln_sqft15_lot + ln_bedroom + ln_sqft_basement +
basementSq + sqft_livingSq
Df Sum of Sq RSS AIC
<none> 935.50 -40014
- basementSq 1 0.45 935.95 -40009
- ln_sqft_above 1 0.55 936.05 -40007
- ln_sqft_basement 1 1.22 936.71 -39997
- long 1 2.15 937.64 -39982
- ln_sqft15_lot 1 3.02 938.52 -39969
- cat_floors 3 5.49 940.98 -39935
- ln_sqft_living 1 7.62 943.11 -39898
- zipcode 1 8.18 943.68 -39889
- ln_bedroom 1 8.53 944.03 -39884
- sqft_livingSq 1 8.62 944.12 -39882
- bathrooms 1 13.42 948.92 -39808
- cat_cond 2 14.61 950.10 -39792
- ln_sqft15_living 1 28.47 963.96 -39579
- yr_built 1 76.63 1012.12 -38868
- cat_grade 6 147.93 1083.43 -37885
- lat 1 403.13 1338.63 -34789
Call:
lm(formula = ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Coefficients:
(Intercept) cat_grade6 cat_grade7 cat_grade8 cat_grade9 cat_grade10 cat_grade11 cat_cond4 cat_cond5
4.2463282 0.1672742 0.3399081 0.5033175 0.6833538 0.8192183 0.9606272 0.0520015 0.1174568
cat_floors1.5 cat_floors2 cat_floors3 bathrooms ln_sqft_living ln_sqft15_lot ln_sqft_above yr_built zipcode
0.0415624 0.0362386 0.1260839 0.0746207 -2.0884395 -0.0233491 0.1101831 -0.0041047 -0.0005725
lat long ln_sqft15_living ln_bedroom ln_sqft_basement basementSq sqft_livingSq
1.3077863 -0.1212724 0.2310046 -0.1613592 0.0386478 -0.0042817 0.1574770
Backwards regression picked the same model. This makes perfect sense as all of the t p-values are very significant.
Based on feature engineering, let’s remove some of the sqft terms and see what happens. First we will compute vif (variance inflation factors). Based on the F p-value and the explanatory variables t p-values, we do not suspect collinearity. However, there are lots of terms that are similar that I would like to see if we can get rid of (for example the square feet variables)
library(Rcmdr)
package ‘Rcmdr’ was built under R version 3.2.5Loading required package: splines
Loading required package: RcmdrMisc
package ‘RcmdrMisc’ was built under R version 3.2.5Loading required package: car
package ‘car’ was built under R version 3.2.5Loading required package: sandwich
package ‘sandwich’ was built under R version 3.2.5Loading required package: effects
package ‘effects’ was built under R version 3.2.5Loading required package: carData
package ‘carData’ was built under R version 3.2.5
Attaching package: ‘carData’
The following objects are masked from ‘package:car’:
Guyer, UN, Vocab
lattice theme set by effectsTheme()
See ?effectsTheme for details.
RcmdrMsg: [1] NOTE: R Commander Version 2.4-1: Sun Dec 17 13:39:01 2017
Rcmdr Version 2.4-1
vif(m3)
GVIF Df GVIF^(1/(2*Df))
cat_grade 4.955828 6 1.142685
cat_cond 1.320442 2 1.071963
cat_floors 4.072576 3 1.263703
bathrooms 3.320578 1 1.822245
ln_sqft_living 1383.824140 1 37.199787
ln_sqft15_lot 1.726189 1 1.313845
ln_sqft_above 54.561570 1 7.386580
yr_built 2.692704 1 1.640946
zipcode 1.675613 1 1.294455
lat 1.202009 1 1.096362
long 1.937453 1 1.391924
ln_sqft15_living 2.844880 1 1.686677
ln_bedroom 1.869034 1 1.367126
ln_sqft_basement 180.491596 1 13.434716
basementSq 262.635051 1 16.206019
sqft_livingSq 1583.718304 1 39.795958
Bingo! ln_sqft_living vif value=37.199787 > 10 (the cutoff), ln_sqft_basement=13.434716, basementSq=16.206019, and sqft_livingSq=39.795958 which are greater than 10. We will remove it from the model and call it model 4.
m4 = lm(ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_lot +
ln_bedroom,
subset = -lev_points)
summary(m4)
Call:
lm(formula = ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_lot + ln_bedroom, subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1.30189 -0.16369 -0.00518 0.15518 1.18977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1964450 4.6561211 0.472 0.637125
cat_grade6 0.1254809 0.0920450 1.363 0.172823
cat_grade7 0.2874687 0.0920041 3.125 0.001784 **
cat_grade8 0.4785436 0.0922400 5.188 2.15e-07 ***
cat_grade9 0.7038185 0.0926154 7.599 3.16e-14 ***
cat_grade10 0.8848906 0.0930990 9.505 < 2e-16 ***
cat_grade11 1.0891373 0.0943707 11.541 < 2e-16 ***
cat_cond4 0.0485027 0.0054080 8.969 < 2e-16 ***
cat_cond5 0.1051255 0.0087569 12.005 < 2e-16 ***
cat_floors1.5 0.0253916 0.0086716 2.928 0.003415 **
cat_floors2 0.0328272 0.0072559 4.524 6.11e-06 ***
cat_floors3 0.0804827 0.0157020 5.126 3.00e-07 ***
bathrooms 0.0850908 0.0051997 16.365 < 2e-16 ***
ln_sqft_living 0.4509102 0.0139179 32.398 < 2e-16 ***
ln_sqft15_lot -0.0129704 0.0034364 -3.774 0.000161 ***
ln_sqft_above 0.0139719 0.0133822 1.044 0.296471
yr_built -0.0042888 0.0001212 -35.399 < 2e-16 ***
zipcode -0.0006274 0.0000517 -12.135 < 2e-16 ***
lat 1.3473624 0.0168172 80.118 < 2e-16 ***
long -0.1063691 0.0213026 -4.993 6.01e-07 ***
ln_bedroom -0.1905024 0.0142709 -13.349 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2594 on 14565 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared: 0.7347
F-statistic: 2020 on 20 and 14565 DF, p-value: < 2.2e-16
summary(m3)
Call:
lm(formula = ln_price ~ cat_grade + cat_cond + cat_floors + bathrooms +
ln_sqft_living + ln_sqft15_lot + ln_sqft_above + yr_built +
zipcode + lat + long + ln_sqft15_living + ln_sqft15_lot +
ln_bedroom + ln_sqft_basement + basementSq + sqft_livingSq,
subset = -lev_points)
Residuals:
Min 1Q Median 3Q Max
-1.38524 -0.16167 -0.00347 0.15402 1.12771
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.246e+00 4.717e+00 0.900 0.368043
cat_grade6 1.673e-01 9.002e-02 1.858 0.063171 .
cat_grade7 3.399e-01 9.012e-02 3.772 0.000163 ***
cat_grade8 5.033e-01 9.037e-02 5.569 2.60e-08 ***
cat_grade9 6.834e-01 9.067e-02 7.537 5.09e-14 ***
cat_grade10 8.192e-01 9.108e-02 8.995 < 2e-16 ***
cat_grade11 9.606e-01 9.237e-02 10.400 < 2e-16 ***
cat_cond4 5.200e-02 5.289e-03 9.832 < 2e-16 ***
cat_cond5 1.175e-01 8.581e-03 13.689 < 2e-16 ***
cat_floors1.5 4.156e-02 8.547e-03 4.863 1.17e-06 ***
cat_floors2 3.624e-02 7.170e-03 5.054 4.37e-07 ***
cat_floors3 1.261e-01 1.545e-02 8.163 3.55e-16 ***
bathrooms 7.462e-02 5.162e-03 14.455 < 2e-16 ***
ln_sqft_living -2.088e+00 1.918e-01 -10.888 < 2e-16 ***
ln_sqft15_lot -2.335e-02 3.403e-03 -6.860 7.15e-12 ***
ln_sqft_above 1.102e-01 3.757e-02 2.932 0.003369 **
yr_built -4.105e-03 1.189e-04 -34.535 < 2e-16 ***
zipcode -5.725e-04 5.075e-05 -11.282 < 2e-16 ***
lat 1.308e+00 1.651e-02 79.213 < 2e-16 ***
long -1.213e-01 2.098e-02 -5.781 7.58e-09 ***
ln_sqft15_living 2.310e-01 1.097e-02 21.050 < 2e-16 ***
ln_bedroom -1.614e-01 1.400e-02 -11.524 < 2e-16 ***
ln_sqft_basement 3.865e-02 8.883e-03 4.351 1.37e-05 ***
basementSq -4.282e-03 1.618e-03 -2.647 0.008133 **
sqft_livingSq 1.575e-01 1.359e-02 11.585 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2535 on 14561 degrees of freedom
Multiple R-squared: 0.7471, Adjusted R-squared: 0.7467
F-statistic: 1792 on 24 and 14561 DF, p-value: < 2.2e-16
Although there is a very slight increase in the standard error, decrease in R2 and R2a in model 4, we still have a very significant model. Because of this we will use model 4 as our final model.
extractAIC(m1)
[1] 33.0 370245.3
extractAIC(m2)
[1] 25 354198
extractAIC(m3)
[1] 25.00 -40013.93
extractAIC(m4)
[1] 21.00 -39344.69
We now have an excellent model in model4. Let’s see how well the model does against the test dataset.
#detach(train_house_data)
attach(test_house_data)
pred = predict(m4)
temp_ln = sapply(test_house_data$price,log)
actual_preds = data.frame(cbind(actual=temp_ln, predictions=pred))
number of rows of result is not a multiple of vector length (arg 1)
min_max_accuracy = mean(apply(actual_preds, 1, min) / apply(actual_preds, 1, max))
min_max_accuracy
[1] 0.9604151
mape = mean(abs((actual_preds$predictions - actual_preds$actual))/actual_preds$actual)
mape
[1] 0.04091274
From the looks of it we essentially have a perfect model. For the MinMax accuracy, the closer the value is to 1 the better. The MinMax tells us how far off the prediction values are. In this case, since the MinMax percentage is 96.05%, we have an incredibly good model on our hands.
The Mean Absolute Percent Error (MAPE) will measures the size of the error in percentage terms. Since the MAPE is .04091274, the percent error is around 4.1 percent. This is very good and we are now done with our analysis!
If you would like go get the price of a home back into a normal scale, do an exp transformation to cancel out the natural log. For example:
example = predict(m4)[2]
# example is in the natural log scale
example
2
13.89521
# now we exponentiate it to get it back to the real scale
exp(example)
2
1082961
From this excursion, we have deemed m4 to be the best model that we built. The model is significant, does a good job of explaining the variation in price, predicts well, has a very high MinMax accuracy, is parsimonious, and it is homoscedastic. This was a very fun afternoon project to work on and I can’t wait to dive into more datasets using other techniques!!!
Data Science rules!
Until next time,
Joel Afriyie
Comments on Categorical Variables
We will treat grade, floors, and condition as categorical variables