Blog - Analysis of price reductions

Introduction

In the real estate market, listings are often updated after they are first advertised, either because property-specific information changes or because of market influences.
The market influences listing prices in a major way and can lead to a price increase or decrease after the original advertisement.
Of particular interest to a real estate investment company are negative price revisions, i.e. listings whose prices decrease. Such properties can offer an immediate improvement to rental yield investment cases and shorter times to close, due to comparatively lower demand.

This analysis aims to understand where, and based on what factors, price changes have happened, using a historic listings dataset. The purpose is to validate the findings against subject matter knowledge and to expand the current understanding of price revision behaviour in the market.
A better understanding of the factors associated with price reductions can help drive strategic decisions, e.g. if prices are more likely to be reduced at a particular location in recent data, then the market may be experiencing a contraction there.

The analysis also extends to predicting the probability of a price reduction from the most informative inputs, so that operations teams have a mechanism to prioritise negotiations and due diligence for such properties and benefit from their potentially weaker market demand.

Data

The data is derived from 2 different data files, listings.csv and revisions.csv.
These files contain, respectively, basic attributes of properties and revised prices (both increased and decreased).

Both files are cleaned to retain a sensible choice of attributes for listings and price revisions. There are no missing values, and the data values have been checked for suitability. The script to clean up and prepare the data can be found here.

listing_id	listing_date	listing_type	property_type	tenure	bedroom_count	location	asking_price	revision_date	is_5percent_reduced
15526443	2021-12-16	buy	flat	leasehold	2	Gloucester	155000	2021-12-16	no
15528531	2021-12-16	buy	house	freehold	3	Coventry	225000	2022-01-11	no
15542023	2021-09-07	buy	house	freehold	2	Reading	325000	2021-10-16	no
15550261	2021-12-22	buy	house	freehold	3	Sheffield	270000	2022-05-11	no
15557544	2021-12-22	buy	house	freehold	4	Warrington	299950	2022-01-25	no
15565980	2021-12-24	buy	house	freehold	2	Watford	450000	2022-02-25	no

The columns are self-explanatory. The one thing worth noting is that active_days, used later in the analysis, is the time elapsed from listing_date to revision_date.
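As a rough illustration of that definition, the sketch below derives active_days from the first three rows of the table above. The data frame name `listings` is hypothetical, and how the original preparation script treats same-day revisions is not shown in this post. The final line is only a reminder that zero-day rows need attention before a log transform.

```r
# Hypothetical mini data frame built from the first three rows shown above
listings <- data.frame(
  listing_id    = c(15526443, 15528531, 15542023),
  listing_date  = as.Date(c("2021-12-16", "2021-12-16", "2021-09-07")),
  revision_date = as.Date(c("2021-12-16", "2022-01-11", "2021-10-16"))
)

# active_days: whole days elapsed between listing and revision
listings$active_days <- as.numeric(
  difftime(listings$revision_date, listings$listing_date, units = "days")
)
print(listings$active_days)

# log(0) is -Inf in R, so rows revised on the listing day would need
# handling before log(active_days) enters a model
print(log(listings$active_days))
```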

We’ve converted bedroom_count to a qualitative (categorical) type, since bedroom count has a highly non-linear effect on prices and other attributes of a property (the step from 1 to 2 bedrooms is not equivalent to the step from 2 to 3).
Additionally, there are only a few distinct bedroom counts in the data, so it can be analysed without treating this information as quantitative.

EDA

There are unknown values in tenure, which should most likely belong to one of the other 2 categories. These listings make up about 8% of the data, a share too large to simply drop.
Additionally, flats are most often leasehold (which is sensible given that the land is not owned outright) and houses are most often freehold. This suggests a strong association between tenure and property_type, which means we can drop tenure from the analysis and use property_type alone to carry both pieces of information.

tenure	property_type	counts
freehold	flat	33
freehold	house	40117
leasehold	flat	17176
leasehold	house	21
unknown	flat	456
unknown	house	4689
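The two claims above can be checked directly from the counts in the table. The following is a minimal sketch that re-enters those counts by hand. The object names are illustrative only.

```r
# Counts copied from the table above
counts <- matrix(
  c(   33, 40117,
    17176,    21,
      456,  4689),
  nrow = 3, byrow = TRUE,
  dimnames = list(tenure = c("freehold", "leasehold", "unknown"),
                  property_type = c("flat", "house"))
)

# Share of listings with unknown tenure
unknown_share <- sum(counts["unknown", ]) / sum(counts)
print(round(unknown_share, 3))

# Among listings with known tenure: tenure split within each property type
known <- counts[c("freehold", "leasehold"), ]
tenure_split <- prop.table(known, margin = 2)
print(round(tenure_split, 3))
```

Column proportions close to 0 or 1 indicate that property_type almost fully determines tenure wherever tenure is known.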

We do not find a clear relationship between asking_price and whether a property has had a 5% price reduction. A relationship may still exist, but it is perhaps confounded by location, property_type, etc.
We expect that including location and other factors in the model will clarify the effect of asking_price on revisions.
The same conclusion holds for active_days. Since both of these variables have a long tail, they were log transformed, and we apply the same treatment for the rest of the analysis.

Some locations have had a higher share of properties reduced. This is encouraging, because once we condition on location the effects of the other variables may look different.
We also observe some association between reductions and both bedroom count and property type.

The proportion of properties undergoing price reductions is fairly stable on average over time, across almost all locations. For simplicity, we do not focus on the time component in the rest of the analysis.

Model

Inference

is_5percent_reduced ~ property_type + bedroom_count + log(asking_price) + 
    log(active_days) + location

Null deviance: 66829

Model deviance: 60479

Null df - Model df: 39

P-Value X^2: 0

Even without a formal test, we know that a chi-squared distribution with n degrees of freedom has an expected value of n. In our case the difference in deviance (66829 - 60479 = 6350) is much larger than the 39 degrees of freedom, so we can reject the hypothesis that the model is no better than the null model with no explanatory variables.
The chi-squared p-value suggests the same, and we can continue analysing the effects captured by the model.
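For reference, the test statistic and p-value can be recomputed from the three numbers reported above. This is a small sketch using only base R. The variable names are illustrative.

```r
# Values reported in the model summary above
null_deviance  <- 66829
model_deviance <- 60479
df_diff        <- 39

# Likelihood ratio statistic: the drop in deviance relative to the null model
lr_stat <- null_deviance - model_deviance

# Upper-tail probability under a chi-squared distribution with 39 df
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(lr_stat)
print(p_value)
```

The p-value is so small that it is printed as 0 at the precision shown above.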

Effect of asking price and active days

Both of these variables have effect curves with the shape characteristic of the log transform that was applied previously. It can be observed that changes in asking_price at the lower end significantly affect the likelihood of a price reduction. This reflects the fact that on a lower baseline a change in price can easily amount to a 5% difference, whereas at the higher end a 5% reduction requires a very large reduction in £ terms.


active_days does not seem to strongly affect the probability of a price reduction across its range of values. At the lower end of its range, the effect seems similar to that of asking_price. The small magnitude of the change in probabilities makes it difficult to judge the significance of this variable from visual inspection of this single example.
We can test individual explanatory variables by fitting models that drop each variable in turn and computing the resulting difference in deviance.

drop1(lmod, test="Chi")

Single term deletions

Model:
is_5percent_reduced ~ property_type + bedroom_count + log(asking_price) + 
    log(active_days) + location
                  Df Deviance   AIC    LRT  Pr(>Chi)    
<none>                  60479 60559                     
property_type      1    60530 60608   51.4 7.718e-13 ***
bedroom_count      5    60698 60768  219.3 < 2.2e-16 ***
log(asking_price)  1    60677 60755  198.3 < 2.2e-16 ***
log(active_days)   1    65525 65603 5046.2 < 2.2e-16 ***
location          31    61157 61175  677.8 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
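To make the mechanics of the table concrete: each LRT value is the deviance of the model without that term minus the deviance of the full model, e.g. 65525 - 60479 ≈ 5046 for log(active_days), up to rounding of the printed deviances. The sketch below reproduces this on simulated data. The variables x1, x2 and y are made up for illustration and have nothing to do with the listings data.

```r
# Simulated data (hypothetical): only x1 truly affects the outcome
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x1))

full    <- glm(y ~ x1 + x2, family = binomial, data = d)
reduced <- update(full, . ~ . - x1)   # same model with x1 dropped

# Manual likelihood ratio test for x1
lrt_manual <- deviance(reduced) - deviance(full)
p_manual   <- pchisq(lrt_manual, df = 1, lower.tail = FALSE)

# drop1() performs the same refit-and-compare for every term
tab <- drop1(full, test = "Chi")
print(tab)
print(c(manual = lrt_manual, drop1 = tab["x1", "LRT"]))
```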

A formal evaluation suggests that all of the explanatory variables are significant, which resolves the ambiguity around log(active_days). This contrasts with what was observed during EDA, suggesting that the effect was confounded by other variables now included in the model.

Effect of location

The plot below gives a comparative assessment of which locations are associated with higher odds of price reduction. This is useful because the summary strips away the effects of the other explanatory variables and allows a fair comparison between locations (unlike the plot in EDA).
For example, Hull, which appeared mid-table in EDA, is now near the top. One prominent reason is that Hull has the lowest average asking_price (~£200,000), which, as seen above, is associated with a very low probability of price reduction. This confounded the effect of location itself, which is now revealed.
The overall result may be combined with other data sources to confirm market-level behaviours, which can be useful for internal know-how and targeted acquisitions.
As an example in the chart, places with high odds of reduction:

Sheffield
Hull
Mansfield
Warrington

are all clustered together in the north, not very far from each other. This may be indicative of a broader market behaviour in the north of the UK.

Diagnostics

Relationship between Predicted values and Residuals

We construct the residual plot by grouping residuals into bins, where each bin holds observations with similar predicted values. The choice of the number of bins is arbitrary and is made to ensure that we have roughly 500 observations per bin.

The deviance residuals are not constrained to have mean zero, so the mean level of the plot is not of interest. There is overprediction at the top end of the predicted values, which could be a good starting point for exploring model improvements in the future.
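The binning procedure can be sketched as follows. This is not the code behind the plot in this post: the data are simulated, the names are hypothetical, and the use of quantile-based bins on the linear predictor is an assumption about how "similar predicted values" might be grouped.

```r
# Simulated logistic data (hypothetical), 5000 rows
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

eta <- predict(fit, type = "link")         # linear predictor
res <- residuals(fit, type = "deviance")   # deviance residuals

# Ten quantile-based bins, i.e. roughly 500 observations per bin
n_bins <- 10
breaks <- quantile(eta, probs = seq(0, 1, length.out = n_bins + 1))
bin <- cut(eta, breaks = breaks, include.lowest = TRUE)

# One row per bin: mean linear predictor, mean residual, bin size
binned <- data.frame(
  mean_eta = as.vector(tapply(eta, bin, mean)),
  mean_res = as.vector(tapply(res, bin, mean)),
  n        = as.vector(table(bin))
)
print(binned)
```

Plotting mean_res against mean_eta gives the binned residual plot. A systematic trend across bins, rather than the overall level, is what would signal a problem.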

Relationship between Explanatory variables and Residuals

Among the categorical variables, only property_type appears to have a strong association with the residuals. We will not address it at this point, but again park it as an assessment to be made in the future.

mod <- aov(residuals ~ location + property_type + bedroom_count, train)
summary(mod)

                 Df Sum Sq Mean Sq F value   Pr(>F)    
location         31     41   1.339   1.112    0.306    
property_type     1     30  30.363  25.214 5.15e-07 ***
bedroom_count     5      9   1.845   1.532    0.176    
Residuals     49962  60164   1.204                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residuals exhibit a strong pattern against log(active days). This suggests that active days needs a more flexible treatment in the model (we shall opt for splines in a later version of the model).
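As a hedged sketch of what such a spline treatment might look like, the example below compares a linear term with a natural cubic spline on simulated data that has a deliberately curved effect. The variable names, the choice of ns() from the splines package, and df = 3 are all assumptions for illustration, not the specification of the later model.

```r
library(splines)  # ships with R

# Simulated data (hypothetical) with a curved effect on the log-odds scale
set.seed(7)
n <- 4000
log_days <- runif(n, 0, 6)
y <- rbinom(n, 1, plogis(-2 + 1.5 * log_days - 0.25 * log_days^2))
d <- data.frame(y = y, log_days = log_days)

# Linear term versus a natural cubic spline with 3 degrees of freedom
linear_fit <- glm(y ~ log_days, family = binomial, data = d)
spline_fit <- glm(y ~ ns(log_days, df = 3), family = binomial, data = d)

# Likelihood ratio test: does the extra flexibility reduce the deviance?
cmp <- anova(linear_fit, spline_fit, test = "Chisq")
print(cmp)
```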


Unusual points

Examining the leverages, there don’t seem to be any very unusual points that warrant further analysis. Unlike with OLS residuals, there is no reason to expect normality in this case, so the absence of linearity in the plot is not a concern.

Goodness of fit

A preliminary examination of how well the model fits the data can be performed by visualising predicted probabilities against observed proportions of price reductions in the data.
When we make a prediction with probability p, we would hope that the event occurs in practice with that proportion.

There is no consistent deviation from what is expected, and the y=x line lies within the bounds of the majority of the data bins.
One reason to look at the fit visually is that numerical summaries which emulate the R^2 of OLS generally report very low numbers, because the response is bounded and maximum likelihood estimation does not explicitly target the variance of the data.

Nagelkerke's R^2:  0.162

As explained above, this value is small, and this is partly expected. It does not, however, rule out further model improvements such as including additional informative explanatory variables and changing the functional forms in the existing model (as was observed with log(active days)).
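For a binary response, Nagelkerke's R^2 can be recovered from the null and model deviances reported in the Inference section. The sketch below assumes n = 50000 training observations, a figure taken from the totals of the classification tables rather than stated explicitly alongside the model. With that assumption the result comes out close to the 0.162 reported above.

```r
# Deviances reported in the Inference section
null_deviance  <- 66829
model_deviance <- 60479
n <- 50000  # assumption: number of training rows (classification table total)

# Cox-Snell R^2: 1 - (L_null / L_model)^(2/n), written in terms of deviances
cox_snell <- 1 - exp((model_deviance - null_deviance) / n)

# Its maximum attainable value for a binary response: 1 - L_null^(2/n)
max_r2 <- 1 - exp(-null_deviance / n)

# Nagelkerke rescales Cox-Snell so that the maximum is 1
nagelkerke <- cox_snell / max_r2
print(round(nagelkerke, 3))
```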

Sensitivity specificity tradeoff

The model can also be used to predict the outcome for each property in the dataset (we’ve already set aside a test set). However, a 0.5 probability threshold may not be appropriate, and we can look at the sensitivity/specificity trade-off to make a cost judgement on the model's classification outputs. At the default 0.5 threshold, the classification table looks as below.

          predout
is_reduced    no   yes
       no  26866  3685
       yes 11634  7815

At a threshold of p=0.35, the classification table becomes:

          predout
is_reduced    no   yes
       no  21451  9100
       yes  8009 11440

In practice, the choice of threshold is likely to be biased towards sensitivity, since there may be a desire to be precise about revisions. For now we simply balance the 2 attributes and observe that at a reasonable threshold of p=0.35 the classification table is already much more balanced: more of the actual reductions are identified, at the cost of more false alarms.
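The trade-off can be quantified from the two classification tables above. The sketch below re-enters the counts by hand, treating the first table as the p=0.5 threshold and the second as p=0.35. The helper names are illustrative.

```r
# Build a 2x2 table: rows are observed outcomes, columns are predictions
make_table <- function(values) {
  matrix(values, nrow = 2, byrow = TRUE,
         dimnames = list(is_reduced = c("no", "yes"),
                         predout    = c("no", "yes")))
}

# Sensitivity: share of actual reductions that were predicted
# Specificity: share of non-reductions that were predicted as such
metrics <- function(tab) {
  c(sensitivity = tab["yes", "yes"] / sum(tab["yes", ]),
    specificity = tab["no", "no"]   / sum(tab["no", ]),
    accuracy    = sum(diag(tab))    / sum(tab))
}

tab_050 <- make_table(c(26866, 3685, 11634, 7815))   # threshold p = 0.5
tab_035 <- make_table(c(21451, 9100, 8009, 11440))   # threshold p = 0.35

m_050 <- metrics(tab_050)
m_035 <- metrics(tab_035)
print(round(rbind(`p = 0.50` = m_050, `p = 0.35` = m_035), 3))
```

Lowering the threshold raises sensitivity while lowering specificity and overall accuracy, so the preferred threshold depends on the relative cost of missed reductions versus false alarms.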

Conclusion

There is much scope for improvement in the current model.
A few points identified in the analysis above are:

Greater overprediction at higher predicted values
Property type is correlated with residuals
log(active days) has a 2nd order like pattern with residuals

Fundamentally, though, the problem of predicting price reductions is tricky: there are a lot of subjective factors at play with each property, along with market-related factors.
This exercise is but a small step in understanding this phenomenon, and it has clearly indicated that more information is required to improve the accuracy of the model.