As a Data Scientist supporting Blackstone's Real Estate investment business, you've been asked by a Portfolio Manager to build a model to predict how much a Class-A high-rise residential building is worth.
Because this is a niche asset category, you only have detailed data on a few thousand buildings. However, for each of those buildings, you have thousands of data points describing the property.
What kind of model would you build? What kind of feature-engineering work would you do, to support that model?
Step 1: Understand the Problem
What will this model ultimately be used for? Which buildings will the business be considering when asking for a prediction? Will they be combing through a large set of buildings to identify the best few deals to act on, or will they be making a decision on nearly every building they assess? If they're hunting for the top handful of deals, then protecting against wild errors is key. If the latter, then a low error on average matters most.
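For instance, those two use cases point to different evaluation metrics. A minimal sketch (the function and metric choices here are illustrative, not prescribed by the question):

```python
import numpy as np

def evaluation_summary(y_true, y_pred):
    """Compare average error with tail error for valuation predictions."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return {
        "mae": errors.mean(),                    # key if we price nearly every building
        "p95_error": np.percentile(errors, 95),  # key if we only act on the top deals
        "max_error": errors.max(),               # guards against a single wild mispricing
    }
```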
How important is interpretability? Will you need to identify comparable properties to convince the user of your estimate?
Step 2: Understand the Data
Where is this data coming from? Why do we have so few buildings to train on? Are the building valuations in the training set drawn from actual transactions (buildings bought or sold), or are they appraisals? How does this compare with the actual use case?
As for the features available, how was this data collected? When we ultimately use this model to predict, will the features be collected in a comparable manner? Be sure to understand up front any major differences between your training data and the use case, and decide which patterns in the training data you are comfortable baking into your model.
Step 3: Gain Some Intuition
First, get a broad sense of what the several thousand available features actually are. How were they collected? If you asked me to prepare a list of several thousand features describing a residential building, I'd have a hard time getting past 100! Are some of these "features" actually time series, collections of reviews, or information on current tenants? There may be sensible ways to repackage this data into something more manageable; make sure to chat with the subject-matter experts, and avoid features that will not carry well into the future.
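To make "repackaging" concrete, here is a minimal sketch assuming a hypothetical table with one row per (building, month) rent observation; the column names are illustrative:

```python
import pandas as pd

def summarize_rent_history(rent_history: pd.DataFrame) -> pd.DataFrame:
    """Collapse a monthly rent time series into a few building-level features,
    instead of feeding thousands of raw monthly columns to the model."""
    return (
        rent_history
        .sort_values("month")
        .groupby("building_id")["avg_rent_psf"]
        .agg(
            rent_latest="last",     # current rent level
            rent_mean="mean",       # long-run rent level
            rent_volatility="std",  # stability of the income stream
            rent_trend=lambda s: s.iloc[-1] / s.iloc[0] - 1,  # growth over the window
        )
        .reset_index()
    )
```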
With this context, run some exploratory data analysis (EDA) to build intuition about the patterns your model might find.
One situation to look out for is that valuations may depend heavily on time and location. Economic conditions and the character of cities change over time, and without proper consideration your model could degrade very quickly. Do you need to pull in additional features describing the macro environment or the cities and neighborhoods where these buildings sit? It also means you will need to validate your model on an out-of-time sample, as sketched below.
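A simple way to build that out-of-time sample, assuming a hypothetical `valuation_date` column:

```python
import pandas as pd

def out_of_time_split(df: pd.DataFrame, date_col: str = "valuation_date",
                      holdout_frac: float = 0.2):
    """Train on older valuations, validate on the most recent ones, so the
    validation error reflects how the model is used: predicting forward in time."""
    df = df.sort_values(date_col)
    cutoff = int(len(df) * (1 - holdout_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

# train_df, valid_df = out_of_time_split(buildings_df)
```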
Step 4: Modeling!
With so much feature space relative to the sample size (wide data), feature selection is key; otherwise we risk overfitting. We've already touched on some approaches, such as repackaging raw data with subject-matter experts and dropping features that won't carry into the future. Other common strategies include regularization (e.g., Lasso or elastic net), tree-based feature-importance ranking, and dimensionality reduction.
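As a minimal sketch of the first option, an L1-regularized (Lasso) regression zeroes out most coefficients and keeps a sparse subset of features. Predicting the log of value is an assumption made here so that errors scale roughly proportionally across large and small buildings:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1 regularization drives most coefficients to exactly zero,
# which doubles as feature selection on wide data.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))

# model.fit(X_train, np.log(y_train))
# selected = X_train.columns[model.named_steps["lassocv"].coef_ != 0]
```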
Step 5: Monitoring
Develop a plan for monitoring your model's performance. Will you have a ground-truth value for most of the buildings you predicted on, or only for the ones the business ultimately transacted on? Are there other ways (periodic appraisals, for example) to obtain the target value for everything you are predicting on?
Is the model degrading? Understanding the main patterns captured by your model will help you tell whether those patterns no longer hold in a new environment, or whether you simply need to capture the macro environment more accurately. A drift check on the inputs, as sketched below, can surface this early.
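Even before ground-truth valuations arrive, you can check whether the model's inputs (or its predictions) have drifted away from the training distribution. A sketch using the Population Stability Index, a standard drift metric:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a feature's training-time
    distribution (expected) and its current distribution (actual)."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Rule of thumb: PSI above ~0.2 suggests a shift worth investigating.
```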
At some point you may also consider rebuilding the model on newer data, incorporating the learnings from the original model build.