listing_id | listing_date | listing_type | property_type | tenure | bedroom_count | location | asking_price | revision_date | active_days | is_5percent_reduced |
---|---|---|---|---|---|---|---|---|---|---|
15526443 | 2021-12-16 | buy | flat | leasehold | 2 | Gloucester | 155000 | 2021-12-16 | 0 | no |
15528531 | 2021-12-16 | buy | house | freehold | 3 | Coventry | 225000 | 2022-01-11 | 0 | no |
15542023 | 2021-09-07 | buy | house | freehold | 2 | Reading | 325000 | 2021-10-16 | 0 | no |
15550261 | 2021-12-22 | buy | house | freehold | 3 | Sheffield | 270000 | 2022-05-11 | 0 | no |
15557544 | 2021-12-22 | buy | house | freehold | 4 | Warrington | 299950 | 2022-01-25 | 0 | no |
15565980 | 2021-12-24 | buy | house | freehold | 2 | Watford | 450000 | 2022-02-25 | 0 | no |
Introduction
In the real estate market, listings are often updated after publication, either because property-specific information changes or because of market conditions.
Market conditions influence listing prices in a major way and can lead to a price increase or decrease after the original advertisement.
Of particular interest to a real estate investment company are negative price revisions, i.e. cases where the listing price decreases. Such properties can offer an immediate improvement to rental-yield investment cases and shorter times to close, owing to comparatively lower demand.
This analysis aims to understand where price changes have happened, and which factors are associated with them, using a historic listings dataset. The purpose is to validate the findings against subject matter knowledge and to expand the current understanding of price revision behaviour in the market.
A better understanding of the factors associated with price reductions can help drive strategic decisions. For example, if recent data show that prices are more likely to be reduced at a particular location, the market may be contracting there.
The analysis also extends to predicting the probability of a price reduction from the most informative inputs, so that operations teams have a mechanism to prioritise negotiations and due diligence for properties where market demand may be weaker.
Data
The data is derived from 2 files, listings.csv and revisions.csv.
Respectively, these contain basic attributes of properties and revised prices (both increases and decreases).
Both were cleaned to keep a sensible choice of attributes for listings and price revisions. There are no missing values, and the data values were checked for suitability. The script to clean up and prepare the data can be found here.
The columns are self-explanatory; the one thing worth noting is that active_days indicates the time elapsed from listing_date to revision_date.
We’ve converted bedroom_count to a qualitative (categorical) type, since bedroom count has a highly non-linear effect on prices and other property attributes (the step from 1 to 2 bedrooms is not equivalent to the step from 2 to 3).
Additionally, there are only a few distinct bedroom counts in the data, so it can be analysed without treating this information as quantitative.
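The conversion described above can be sketched in base R as follows. The data frame and its values are made up for illustration; only the column name bedroom_count comes from the dataset.

```r
# Toy data frame standing in for the prepared listings (values are made up).
listings <- data.frame(
  bedroom_count = c(1, 2, 3, 2, 4, 3),
  asking_price  = c(120000, 155000, 225000, 325000, 299950, 270000)
)

# Treat bedroom count as an unordered category rather than a number.
listings$bedroom_count <- factor(listings$bedroom_count)

# Each non-reference level gets its own indicator column, so the 1 -> 2 step
# and the 2 -> 3 step are estimated separately instead of sharing one slope.
design <- model.matrix(~ bedroom_count, data = listings)
print(colnames(design))
```

With a factor, a model estimates one coefficient per bedroom level relative to the reference level, which is what allows the non-linear effect mentioned above.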
EDA
There are unknown values in tenure, which in reality should belong to one of the other 2 categories. These listings make up about 8% of the data, too large a share to drop.
Additionally, flats are most often leasehold (sensible, given that the land is not owned outright) and houses are most often freehold. This suggests a strong association between tenure and property_type, so we can drop tenure from the analysis and let property_type carry both pieces of information.
tenure | property_type | counts |
---|---|---|
freehold | flat | 33 |
freehold | house | 40117 |
leasehold | flat | 17176 |
leasehold | house | 21 |
unknown | flat | 456 |
unknown | house | 4689 |
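As a rough check on the two claims above, the counts in the table can be used to compute the share of unknown tenure and the strength of association between the known tenure values and property_type. Cramér's V is our choice of summary here, not something the original analysis reports.

```r
# Counts copied from the tenure / property_type table above.
known <- matrix(
  c(33, 40117,
    17176, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(tenure = c("freehold", "leasehold"),
                  property_type = c("flat", "house"))
)
unknown <- c(flat = 456, house = 4689)

# Share of listings with unknown tenure.
unknown_share <- sum(unknown) / (sum(known) + sum(unknown))

# Cramer's V for the known categories: 0 means no association, 1 means perfect.
chi2 <- unname(chisq.test(known, correct = FALSE)$statistic)
cramers_v <- sqrt(chi2 / (sum(known) * (min(dim(known)) - 1)))

print(round(c(unknown_share = unknown_share, cramers_v = cramers_v), 3))
```

The unknown share comes out at roughly 8%, and V is very close to 1, which supports keeping only property_type.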
We do not find a clear relationship between asking_price and whether a property has had a 5% price reduction. Some relationship may still exist, but it is perhaps confounded by location, property_type and so on.
We expect that including location and other factors in the model will clarify the effect of asking_price on revisions.
The same conclusion holds for active_days. Because both variables have a long tail, they were log transformed, and we apply the same treatment for the rest of the analysis.
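To illustrate why a log transform helps with a long tail, the sketch below compares skewness before and after the transform on simulated prices (not the listings data). It also shows what happens at zero: the preview rows at the top list active_days of 0, and log(0) is -Inf. Using log1p as a workaround is an assumption on our part, not the documented treatment in this analysis.

```r
set.seed(1)
# Simulated long-tailed prices (hypothetical, not the listings data).
asking_price <- rlnorm(5000, meanlog = 12.3, sdlog = 0.5)

# Simple moment-based skewness; 0 means a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(asking_price)
skew_log <- skewness(log(asking_price))
print(round(c(raw = skew_raw, log = skew_log), 2))

# log(0) is -Inf, which a model cannot use; log1p(x) = log(1 + x) is one workaround.
print(c(log_zero = log(0), log1p_zero = log1p(0)))
```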
Some locations have had a higher share of properties reduced. This is encouraging, since the effects of other variables may look different once we condition on location.
We also observe some association between reductions and both bedroom count and property type.
The proportion of properties undergoing price reductions is fairly stable on average over time, across almost all locations. For simplicity, we do not focus on the time component in the rest of the analysis.
Model
Inference
is_5percent_reduced ~ property_type + bedroom_count + log(asking_price) +
log(active_days) + log(active_days) + location
Null deviance: 66829
Model deviance: 60479
Null df - Model df: 39
P-Value X^2: 0
Even without a formal test, we know that a chi-square distribution with n degrees of freedom has an expected value of n. In our case the drop in deviance (66829 - 60479 = 6350) is far larger than the 39 degrees of freedom, so we can reject the hypothesis that the model is no better than the null model with no explanatory variables.
The chi-square p-value suggests the same, and we can continue analysing the effects captured by the model.
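The test above can be reproduced directly from the reported deviances. This is a minimal sketch using only the three numbers printed in the model summary; the variable names are ours.

```r
# Values as reported in the model summary above.
null_deviance  <- 66829
model_deviance <- 60479
df_diff        <- 39

# Likelihood-ratio statistic: the drop in deviance relative to the null model.
lrt <- null_deviance - model_deviance

# Upper-tail probability of a chi-square with 39 degrees of freedom.
p_value <- pchisq(lrt, df = df_diff, lower.tail = FALSE)

# Under the null, the statistic has mean df and standard deviation sqrt(2 * df),
# so this shows how many standard deviations above expectation we are.
sd_units <- (lrt - df_diff) / sqrt(2 * df_diff)

print(c(lrt = lrt, p_value = p_value, sd_units = sd_units))
```

The p-value is too small to distinguish from zero at the precision shown, which matches the reported "P-Value X^2: 0".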
Effect of asking price and active days
Both of these variables show the curved shape characteristic of the log transform applied earlier. Increases in asking_price at the lower end of the range change the likelihood of a price reduction substantially. This reflects the fact that on a low baseline a price change can easily amount to a 5% difference, whereas at the higher end a 5% reduction requires a very large reduction in £ terms.
active_days does not seem to strongly affect the probability of a price reduction across its range of values. At the lower end of the range, the effect looks similar to that of asking_price. The small magnitude of the change in probabilities makes it hard to judge the significance of this variable from visual inspection of this single example.
We can test individual explanatory variables by fitting models that each drop one variable and computing the resulting difference in deviance.
drop1(lmod, test="Chi")
Single term deletions
Model:
is_5percent_reduced ~ property_type + bedroom_count + log(asking_price) +
log(active_days) + log(active_days) + location
Df Deviance AIC LRT Pr(>Chi)
<none> 60479 60559
property_type 1 60530 60608 51.4 7.718e-13 ***
bedroom_count 5 60698 60768 219.3 < 2.2e-16 ***
log(asking_price) 1 60677 60755 198.3 < 2.2e-16 ***
log(active_days) 1 65525 65603 5046.2 < 2.2e-16 ***
location 31 61157 61175 677.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A formal evaluation suggests that all of the explanatory variables are significant. This resolves the ambiguity around log(active_days), which in fact accounts for the largest change in deviance. This is in contrast to what was observed during EDA, suggesting that the effect was confounded by other variables now in the model.
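The single-term deletion table can be checked by hand: each LRT value is the deviance of the reduced model minus the deviance of the full model, compared against a chi-square with the term's degrees of freedom. The sketch below uses the rounded deviances printed above, so the statistics differ slightly from the LRT column.

```r
# Deviances as printed in the drop1 output above (rounded to whole numbers).
full_deviance <- 60479
dropped <- data.frame(
  term     = c("property_type", "bedroom_count", "log(asking_price)",
               "log(active_days)", "location"),
  df       = c(1, 5, 1, 1, 31),
  deviance = c(60530, 60698, 60677, 65525, 61157)
)

# Increase in deviance when the term is removed.
dropped$lrt <- dropped$deviance - full_deviance

# Upper-tail chi-square probability for each term.
dropped$p_value <- pchisq(dropped$lrt, df = dropped$df, lower.tail = FALSE)

print(dropped[order(-dropped$lrt), ])
```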
Effect of location
The plot below gives a comparative assessment of which locations are associated with higher odds of price reduction. This is useful since the summary strips away effects of other explanatory variables and allows a fair comparison between locations (unlike the plot in EDA).
For example, Hull, which appeared mid-table in EDA, is now near the top. One prominent reason is that Hull has the lowest average asking_price (~£200,000), which, as seen above, is associated with a very low probability of price reduction. This masked the effect of location itself, which is now revealed.
The overall result may be combined with other data sources to confirm market-level behaviours, which can be useful for internal know-how and targeted acquisitions.
As an example, the places with high odds of reduction in the chart:
- Sheffield
- Hull
- Mansfield
- Warrington
are all clustered together in the north, not very far from each other. This may be indicative of a broader market behaviour in the north of England.
Diagnostics
Relationship between Predicted values and Residuals
We construct the residual plot by grouping residuals into bins, where each bin holds observations with similar predicted values. The number of bins is arbitrary and is chosen so that we have roughly 500 observations per bin.
The deviance residuals are not constrained to have mean zero, so the mean level of the plot is not of interest. There is over-prediction at the top end of the predicted values, which could be a good starting point for exploring model improvements in the future.
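The binning described above can be sketched as follows on simulated data. The model, the variable names and the choice to bin on the linear predictor are assumptions for illustration; only the target of about 500 observations per bin comes from the text.

```r
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))       # simulated binary outcome
fit <- glm(y ~ x, family = binomial)

linpred <- predict(fit)                    # linear predictor (log-odds scale)
res <- residuals(fit, type = "deviance")   # deviance residuals

# Equal-count bins of the linear predictor, about 500 observations each.
n_bins <- n / 500
breaks <- quantile(linpred, probs = seq(0, 1, length.out = n_bins + 1))
bin <- cut(linpred, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mean_linpred  = as.numeric(tapply(linpred, bin, mean)),
  mean_residual = as.numeric(tapply(res, bin, mean)),
  n             = as.numeric(table(bin))
)
print(binned)
# plot(binned$mean_linpred, binned$mean_residual) would give the residual plot.
```

Averaging within bins is what makes the plot readable: raw residuals from a binary response fall on two curves and show little on their own.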
Relationship between Explanatory variables and Residuals
Among the categorical variables, only property_type appears to have a strong association with the residuals. We will not address it at this point, but again park it as an assessment to be made in future.
mod <- aov(residuals ~ location + property_type + bedroom_count, train)
summary(mod)
Df Sum Sq Mean Sq F value Pr(>F)
location 31 41 1.339 1.112 0.306
property_type 1 30 30.363 25.214 5.15e-07 ***
bedroom_count 5 9 1.845 1.532 0.176
Residuals 49962 60164 1.204
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residuals exhibit a strong pattern against log(active_days). This suggests that active_days needs a more flexible treatment in the model (we shall opt for splines in a later version of the model).
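As a sketch of what a spline treatment could look like, the example below compares a linear term with a natural cubic spline on simulated data where the true effect is curved. The data, the variable names and the choice of 4 degrees of freedom are all assumptions; this is not the later version of the model.

```r
library(splines)  # ships with the standard R distribution

set.seed(7)
n <- 4000
log_days <- runif(n, 0, 6)   # hypothetical values of log(active_days)

# Simulated truth with curvature, so a straight line in log_days is too rigid.
p <- plogis(-2 + 1.5 * log_days - 0.25 * log_days^2)
reduced <- rbinom(n, 1, p)

linear_fit <- glm(reduced ~ log_days, family = binomial)
spline_fit <- glm(reduced ~ ns(log_days, df = 4), family = binomial)

# Likelihood-ratio comparison of the two nested fits.
print(anova(linear_fit, spline_fit, test = "Chisq"))
```

Because the linear model is nested within the spline model, the same deviance-based test used elsewhere in this analysis applies to the comparison.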
Unusual points
Examining the leverages, there don’t seem to be any very unusual points that warrant further analysis. Unlike with OLS residuals, there is no reason to expect normality in this case, so the absence of linearity in the plot is not a concern.
Goodness of fit
A preliminary examination of how well the model fits the data can be performed by visualising predicted probabilities against observed proportions of price reductions in the data.
When we make a prediction with probability p, we would hope that the event occurs in practice with that proportion.
There is no consistent deviation from what is expected, and the y=x line is contained within the bounds of the majority of the data bins.
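The comparison of predicted and observed proportions can be sketched as below, again on simulated data. The ten equal-count bins and the approximate 2-standard-error bounds are our assumptions about how such a plot might be built.

```r
set.seed(3)
n <- 6000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.4 + 0.9 * x))   # simulated binary outcome
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)                        # predicted probabilities

# Ten equal-count bins of predicted probability.
bin <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, by = 0.1)),
           include.lowest = TRUE)

calib <- data.frame(
  predicted = as.numeric(tapply(p_hat, bin, mean)),
  observed  = as.numeric(tapply(y, bin, mean)),
  n         = as.numeric(table(bin))
)

# Approximate binomial bounds around the observed proportion in each bin.
se <- sqrt(calib$observed * (1 - calib$observed) / calib$n)
calib$lower <- calib$observed - 2 * se
calib$upper <- calib$observed + 2 * se

# Does the y = x line (observed equal to predicted) fall inside the bounds?
calib$covers_y_equals_x <- calib$predicted >= calib$lower &
  calib$predicted <= calib$upper
print(calib)
```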
A reason to look at the fit visually is that numerical summaries which emulate \(R^2\) from OLS generally report very low numbers, because the response is bounded and maximum likelihood estimation does not explicitly target the variance of the data.
Nagelkerke's R^2: 0.162
As explained above, this value is small, and this is partly expected. It does not, however, rule out further model improvements, such as including additional informative explanatory variables or changing functional forms in the existing model (as was observed with log(active_days)).
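For reference, Nagelkerke's \(R^2\) can be recovered from the two deviances reported earlier. The sketch assumes a training sample of n = 50000, which is inferred from the residual degrees of freedom in the aov output and the totals of the classification tables rather than stated in the text. It also relies on the fact that for a 0/1 response the deviance equals -2 times the log-likelihood.

```r
# Deviances as reported in the model summary.
null_deviance  <- 66829
model_deviance <- 60479
n <- 50000  # assumption: training sample size inferred from the outputs

# Cox-Snell R^2: 1 - (L_null / L_model)^(2 / n), written in terms of deviances.
cox_snell <- 1 - exp((model_deviance - null_deviance) / n)

# Its maximum attainable value, reached by a model that fits perfectly.
max_cox_snell <- 1 - exp(-null_deviance / n)

# Nagelkerke rescales Cox-Snell so that the maximum is 1.
nagelkerke <- cox_snell / max_cox_snell
print(round(nagelkerke, 3))
```

Under these assumptions the result agrees with the reported value to three decimal places.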
Sensitivity specificity tradeoff
The model can also be used to predict the outcome for each property in the dataset (we’ve already separated a test set). However, a 0.5 probability threshold may not be appropriate, and we can look at the sensitivity/specificity trade-off to weigh the costs of the model's classification outputs. At the default 0.5 threshold, the classification table looks as below.
predout
is_reduced no yes
no 26866 3685
yes 11634 7815
With the threshold lowered to p = 0.35, the table becomes:
predout
is_reduced no yes
no 21451 9100
yes 8009 11440
The final choice of threshold is likely to be biased in favour of sensitivity, since there may be a desire not to miss genuine price reductions. For now we simply balance the 2 measures and observe that, with a reasonable threshold of p=0.35, the classification table already looks much better.
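The trade-off can be made concrete by computing sensitivity and specificity from the two classification tables above. The counts are copied from the tables; the helper functions are ours, and the second table is taken to be the one at p = 0.35.

```r
# Rows are the observed outcome (is_reduced), columns the predicted class (predout).
make_table <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(is_reduced = c("no", "yes"),
                         predout = c("no", "yes")))
}
at_050 <- make_table(26866, 3685, 11634, 7815)   # threshold 0.5
at_035 <- make_table(21451, 9100, 8009, 11440)   # threshold 0.35 (assumed)

rates <- function(tab) {
  c(sensitivity = tab["yes", "yes"] / sum(tab["yes", ]),  # reductions caught
    specificity = tab["no", "no"] / sum(tab["no", ]))     # non-reductions kept
}

summary_tab <- rbind(`0.50` = rates(at_050), `0.35` = rates(at_035))
print(round(summary_tab, 3))
```

Lowering the threshold raises sensitivity from about 0.40 to about 0.59, while specificity falls from about 0.88 to about 0.70.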
Conclusion
There is much scope for improvement in the current model.
A few points identified in the analysis above are:
- Greater overprediction at higher predicted values
- Property type is correlated with residuals
- log(active_days) shows a second-order-like pattern in the residuals
Very fundamentally, though, predicting price reductions is a tricky problem: many subjective factors are at play for each property, along with market-related factors.
This exercise is but a small step in understanding the phenomenon, and it has clearly indicated that more information is required to improve the accuracy of the model.