Feature Store: Why We Chose This Path

When we were researching Feature Stores, we talked to acquaintances, colleagues and friends, collected references and went to conferences. Almost everyone asked us the same thing: do we really need this, wouldn’t it be easier to build our own Data Warehouse (DWH)? We didn’t dismiss that path out of hand, but decided to work out for ourselves what a Feature Store is: a system made up of a set of cooperating components.

The thing is, real estate valuation at Domclick is split into two types of processes:

  1. Online processes where any user can go to the site and evaluate their home for sale or rent.
  2. Mortgage processes, where our models take part in decision-making: when a client submits an application, the model evaluates their property, and the bank decides whether to issue the mortgage.

But when I joined the team, there was already a huge pile of legacy that lived on, kept evolving and was handed down from generation to generation. By then it was in its third generation, and with every change it became harder and harder to work with.

And there were still many problems that I wanted to fix.

Slow online predictions. This is one of the most painful problems. Our models serve both processes, so it is critical for us to withstand high load and keep latency low for the frontend and backend teams that integrate with us, and there are a lot of them.

One of the reasons for slow online predictions was having a single database per model type. For secondary housing, for example, OLAP and OLTP load was mixed in one database: it had to serve online predictions first and analytics second. If there was a replica, training data could be pulled from it, but on the whole it was inconvenient and awkward.

Complicated onboarding of new sources. Adding any source the business requested for model development was unrealistically difficult.

Long and unstable calculations. Every new source meant extending the existing pipelines, which became even more complex and took even longer to run. And these were business-critical pipelines. The first thought on waking up in the morning was: “Did it finish?” If not, everything had to be restarted, and the next day an external client could impose a fine because of it. So the problem was critical.

Lack of memory on the machines. Hardware is not infinite, and neither are the resources on it. Our heavy DAGs kept other teams from working on the same machines: a DAG would eat up all the resources so nothing else could start, or a parallel DAG would crash.

Large database size. One example of a database serving a single model type: a 1.5 TB Postgres. Clearly the database can handle more, but when online traffic and OLAP live together, that is too much and too inconvenient.

Chaos in data flows. Whenever the business had tasks for the data scientists, they would come running to us asking to add a few features or tweak the logic. Such requests were daunting, because the chaos looked something like this:

Every time we had to rack our brains over how to add anything on top of third-generation legacy. At the same time no one really knew the legacy code: supposedly everything in it was super cool and thought out to the smallest detail, yet it was impossible to understand. Nevertheless, we tried. The implementation we have arrived at so far can be considered a solid MVP: there is a platform and a set of sources; we take data from the sources, run it through the platform and deliver it to the model service:

Selection of technologies

The Feature Platform at Domclick consists of several elements:

  • Engine — the main element. It is built in-house on top of DBT + Airflow. Since our teams mostly work with tabular data, a regular classic warehouse is enough for us, and that is where most of the calculations happen (a minimal orchestration sketch follows this list).
  • Data Lineage — OpenMetadata
  • Data Quality — SODA Framework
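
To give an idea of the pattern, here is a minimal sketch of orchestrating dbt runs from Airflow. It is a sketch under assumptions: the DAG id, schedule and project path are illustrative, not our in-house engine.

# Minimal sketch: orchestrating dbt from Airflow with BashOperator.
# DAG id, schedule and project path are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feature_engine_dbt",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",          # nightly build of the warehouse layers
    catchup=False,
) as dag:
    # Build the warehouse layers; most calculations happen in SQL inside the storage.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/feature_platform",
    )

    # Validate the freshly built models with dbt tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/feature_platform",
    )

    dbt_run >> dbt_test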

The Feature Store consists of two components:

  • Offline Storage — Greenplum, where all the calculations are done. It was already in our infrastructure and served the analytical loop, and since we were building an MVP, we decided Greenplum suited us: it is enough for tabular problems.
  • Online Storage — Postgres. Each model has its own small online storage containing only the data it needs. The company has a lot of expertise with this database, the data volume is small, and Postgres handles the required load. On top of that, it covers the lookup of additional similar objects that the models need in order to work.
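
For illustration, the similar-objects lookup can be pictured as a plain SQL query against the per-model Postgres. This is a hedged sketch: the table, columns and thresholds below are assumptions, not our actual schema.

# Hypothetical sketch of the "similar objects" lookup in a per-model online
# storage. Table name, columns and thresholds are illustrative assumptions.
import psycopg2

SIMILAR_FLATS_SQL = """
    SELECT flat_id, price, area, rooms
    FROM flat_features                          -- assumed table name
    WHERE building_id = %(building_id)s
      AND rooms = %(rooms)s
      AND area BETWEEN %(area)s * 0.8 AND %(area)s * 1.2
    LIMIT 20;
"""

def find_similar_flats(dsn: str, building_id: int, rooms: int, area: float) -> list:
    """Return up to 20 flats in the same building with a comparable area."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            SIMILAR_FLATS_SQL,
            {"building_id": building_id, "rooms": rooms, "area": area},
        )
        return cur.fetchall()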

Later on, we expect to bring in:

  • ClickHouse
  • Kafka
  • S3
  • Redis

Implementation steps

We solved business problems and built a Feature Store. We don’t split work into Delivery and Discovery streams, so our product team built the MVP itself, because we needed it. In small but confident steps we standardized our approach to preparing data for the models, and things got much better.

Preparation

We started by studying the sources and queries that were in use at the time. It sounds like an obvious step, but because of the legacy we first had to dig through everything just to find it all.

Next, we formed a detailed storage layer. The offline storage was simply split into layers:

  • Raw Layer, or raw data layer. This is where unnormalized data is stored.
  • Stage Layer. It currently acts as the detailed layer: it stores data normalized from Raw according to a certain data model. We did not overcomplicate things, so we chose neither Data Vault nor Anchor; in essence, our data model resembles a snowflake schema.

Then we studied the logic of the existing pipelines. Once the detailed storage layer was in place, we began studying the DAGs. Naturally, this took a long time, because the code had been written chaotically, with no standardization and no tests.

Here’s how cold storage is designed:

There are three layers: Raw, Stage and Marts. The Marts layer stores the cold-storage showcase that serves a specific model. This is the data used to train the model; the current slice is taken from there and put into Online Storage. For now this is quite convenient, because the showcase holds data for the entire period and serves one specific model.

How the Marts layer is formed:

  1. Preparation of simple features. The source passes through all the layers, where calculations, aggregations and joins are performed. After that, the showcases are ready to be connected to cold storage. Everything is simple, because the calculation is done at the SQL level.
  2. Preparation of complex features. Here the starting point is Greenplum: a batch of data is taken from it and run through a DAG.

For example, the batch is run through classification or clustering models. After a model has run, new features appear that are needed either for inference right away or for training the model later. Once the calculations are completed, the results are stored in a separate Postgres database. That Postgres then becomes a source in its own right: it is run through all the layers in Greenplum and connected to cold storage.
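
A hedged sketch of such a step is shown below; the table names, the clustering model and the connection strings are assumptions for illustration, not our production code.

# Hypothetical sketch of a "complex feature" step: take a batch from Greenplum,
# run it through a clustering model, store the result in a separate Postgres DB
# that later becomes a regular source for the Raw layer.
import pandas as pd
from sklearn.cluster import KMeans
from sqlalchemy import create_engine

greenplum = create_engine("postgresql+psycopg2://user:pass@greenplum:5432/dwh")
postgres = create_engine("postgresql+psycopg2://user:pass@postgres:5432/features")

# 1. Take the current batch from the detailed layer in Greenplum.
batch = pd.read_sql("SELECT flat_id, price, area, rooms FROM stage.flats", greenplum)

# 2. Run the batch through the model; the cluster label becomes a new feature.
model = KMeans(n_clusters=10, n_init=10, random_state=42)
batch["price_cluster"] = model.fit_predict(batch[["price", "area", "rooms"]])

# 3. Store the result; this Postgres DB is then connected as a source and
#    runs through all the Greenplum layers into cold storage.
batch[["flat_id", "price_cluster"]].to_sql(
    "flat_price_clusters",
    postgres,
    schema="complex_features",
    if_exists="replace",
    index=False,
)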

The question of delivering data to Online Storage arises.

Data delivery

For this purpose, we came up with a templated delivery DAG.

Once the cold storage is ready, the DAG kicks in: it takes the current batch and delivers it to Online Storage. The model then pulls the current data and works directly on inference.

In simplified form, the diagram looks like this:

This templated DAG is managed at the config level, where we set the parameters used on every run.

The DAG’s work consists of four tasks:

  1. Schema validation.
from typing import Optional, Sequence

from pydantic import BaseModel


class ContractMapping(BaseModel):
    conn_target_name: str                         # connection name of the target Online Storage
    schema_validation: type[BaseModel]            # pydantic schema the batch is validated against
    query_collect: str                            # collects the current delta from cold storage
    query_insert: str                             # loads the delta into Online Storage
    query_delete: str                             # removes the stale delta
    swap_queries: Optional[Sequence[str]] = None  # optional extra queries (e.g. table swap)

To make the DAG work, we decided to try the concept of a data contract. It is essentially like a REST API contract: we know exactly what the request to a specific endpoint should look like and what the response should be, and we are confident that a given request will return a structurally identical response. In practice it is just a data class described by rules that you define and then rely on when working with the data.

Next, this data contract is registered in the DAG template, and when the config is launched, the contract is applied.
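
A hedged sketch of what such a registration might look like, reusing the ContractMapping class from above; the schema, connection name and queries are illustrative assumptions.

# Hypothetical registration of a data contract in the delivery-DAG config.
# Connection name, schema and queries are illustrative assumptions.
from pydantic import BaseModel


class BuildingFeatures(BaseModel):
    building_id: int
    zhk_id: int
    avg_price: float


buildings_contract = ContractMapping(
    conn_target_name="online_storage_flats",
    schema_validation=BuildingFeatures,
    query_collect="SELECT * FROM marts.building_features WHERE load_date = CURRENT_DATE - 1",
    query_insert="INSERT INTO building_features VALUES %s",
    query_delete="DELETE FROM building_features WHERE load_date < CURRENT_DATE - 1",
)

# The DAG template iterates over registered contracts and builds one delivery
# run per contract.
CONTRACTS = {"buildings": buildings_contract}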

2. Data verification.

We check the data at the stage of delivery to production. For the model to receive high-quality data, thorough verification is needed at the early stages, and that is a fairly large and complex process that has to be approached carefully. For now we decided to focus on the delivery stage.

checks for buildings_table:
  - freshness(load_time) < 1d
  - duplicate_count(building_id):
      name: building unique
      fail: when > 0
  - missing_count(zhk_id):
      name: all buildings have zhk_id
      fail: when > 0

For quality assurance we use the SODA framework. This open-source solution works with a large number of tools. While studying data quality tools, we looked at what the company already had and saw that SODA covered our needs and was already adopted by the data engineering team, so we decided not to expand the technology stack and reused this tool for our purposes.
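
For reference, running such checks programmatically from the delivery DAG with Soda Core might look roughly like the sketch below; the data source and file names are assumptions.

# Hypothetical sketch: executing the SODA checks from Python (soda-core).
# Data source and file names are illustrative assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("online_storage")               # assumed data source name
scan.add_configuration_yaml_file("soda/configuration.yml")
scan.add_sodacl_yaml_file("soda/buildings_checks.yml")    # the checks shown above

scan.execute()
scan.assert_no_checks_fail()  # raises on failed checks, failing the Airflow task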

3-4. Loading and deleting the delta. This completes the DAG’s work.
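
Put together, a simplified version of the templated delivery DAG can be sketched like this; the task bodies are stubs, and the names and schedule are assumptions rather than the production code.

# Hypothetical sketch of the templated delivery DAG: schema validation, data
# verification, loading the delta into Online Storage, deleting the stale delta.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def validate_schema(**context):
    # Validate the collected batch against the data contract (pydantic schema).
    ...


def verify_data(**context):
    # Run the SODA checks against the delivered data.
    ...


def load_delta(**context):
    # Insert the T-1 slice into Online Storage (query_insert from the contract).
    ...


def delete_delta(**context):
    # Remove the rows the model no longer needs (query_delete from the contract).
    ...


with DAG(
    dag_id="delivery_buildings",                # one DAG per registered contract
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="schema_validation", python_callable=validate_schema)
    t2 = PythonOperator(task_id="data_verification", python_callable=verify_data)
    t3 = PythonOperator(task_id="load_delta", python_callable=load_delta)
    t4 = PythonOperator(task_id="delete_delta", python_callable=delete_delta)

    t1 >> t2 >> t3 >> t4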

Data catalog

We looked at two tools: DataHub and OpenMetadata. Their capabilities are very similar, but according to my references DataHub appeared and reached a production-ready state much earlier; many companies we know have already integrated it successfully, and it works without problems.

However, we chose OpenMetadata. There are two reasons:

  1. The UI is visually clearer, more convenient and more intuitive. It is almost impossible to get lost in it.
  2. The number of infrastructure components needed to run it. For OpenMetadata these are Airflow, Elasticsearch and Postgres; DataHub would additionally require Kafka, and MySQL instead of Postgres.

Data lineage

Let me remind you what our Data lineage looked like before:

It was a complex DAG that spawned and launched subsequent DAGs, which in turn had their own structure. It was hard to change or improve anything, or to form a single flow.

OpenMetadata’s Data Catalog has simplified data lineage. It’s now much easier to dive into the data streams of a specific model.

This is a screenshot from the official website, because our own screenshot did not pass the cybersecurity review. We worked around that by drawing a diagram instead; it is much larger, with more sources and showcases than in the screenshot.

There is a Greenplum layer where the showcases are calculated and combined – the magic happens in SQL. Then the final showcase is calculated in the Marts layer, the DAG is launched, and the data is sent to Online Storage.

A client from a frontend or backend team calls the model service and requests an online estimate. The model service consists of two components: usually a REST Python API and the Online Storage the model works with. The API queries the database and receives additional features; they are transformed there, or a calculation is performed, and the client gets a response. Offline Storage (Greenplum) feeds data into Online Storage, and it does so on an ongoing basis.
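
As a rough sketch of the service side (the framework, DSN, table and the predict() stub below are assumptions for illustration; the real service is more involved):

# Hypothetical sketch of the model service: a REST API that enriches a request
# with features from Online Storage and returns the model's estimate.
import asyncpg
from fastapi import FastAPI

app = FastAPI()


def predict(features: dict) -> float:
    """Placeholder for the real model inference call."""
    raise NotImplementedError


@app.get("/estimate/{flat_id}")
async def estimate(flat_id: int) -> dict:
    conn = await asyncpg.connect("postgresql://user:pass@online-storage:5432/flats")
    try:
        # Pull the additional features the model needs from Online Storage.
        features = await conn.fetchrow(
            "SELECT area, rooms, price_cluster FROM flat_features WHERE flat_id = $1",
            flat_id,
        )
    finally:
        await conn.close()
    return {"flat_id": flat_id, "estimate": predict(dict(features))}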

Our models work by the T-1 rule, i.e. with all current data minus one day. That is enough. How much data is kept in Online Storage, and for how long, is determined by the DS team.

We have already implemented such an MVP. But this is not enough for us – we want to move forward, develop the platform and Feature Store.

Where are we going next?

In the future we want to see a system like this:

The blocks that we intend to add or expand are highlighted in color. In addition, we plan to add Meta Store and Meta SDK, as well as expand data sources and Online Storage.

The following technologies were chosen as target technologies:

  • Meta Store. Most likely Postgres.
  • Meta SDK. Possibly some Python framework.
  • Data sources — we want to expand to ClickHouse, Kafka and S3 working together. Since we have many teams, we plan to make the Feature Store common to all of them.
  • Online Storage — Postgres is enough in some places, but in others it is clearly not. We will add something else.

Meta Store + SDK

Here is what we expect from these tools and platform elements:

  • Feature Registration. We want the Meta Store to hold all available metadata about the features used by research analysts. The key piece is feature registration, which we picture as follows:
from feature_store import MetaStore

meta_store = MetaStore(dsn=dsn)

meta_store.feature_registry(
    type='feature:model',
    feature=geocode_feature,
    ml_model_meta=flats_model,
    description='Compute city and country',
)

At the moment, the Meta Store SDK is at the research stage. We will either build an in-house solution, take something off the shelf, or combine in-house with a vendor product.

  • Feature versioning. Registering features gives us versioning. For training and retraining models it is important to know which version of a feature was used at a given moment; it matters for inference as well.
  • Declarative description of features. We want to understand what kind of feature it is, what it is about, what types it has and what it can do.
  • Feature Catalog. The final item is a catalog that a DS or ML team can use to find features that could potentially help in model development:
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    type: DataType     # assumed enum of supported feature value types, defined elsewhere
    version: str
    description: str

What have we achieved?

The Feature Store is useful. We have done a lot of work in terms of technical rework, development, refinement and standardization, and we have seen the changes. But the business needs to be told what benefit any technology or piece of development actually brings.

  1. Reduced Time-To-Market. We cut TTM to about five weeks, and that is in the worst case. The Feature Store accounts for roughly 30% or more of the gain in delivery speed; to be fair, the improvement came not only from the Feature Store but also from a number of other changes within the team. Reducing Time-To-Market lets the business test more hypotheses, and the more hypotheses a business tests, the better the outcome for the company: revenue grows, costs fall, the quality of work with users improves, and so on.
  2. We optimized data storage. We split the endlessly growing Postgres and brought it down to a state where Online Storage holds only the current data the model needs.
  3. Improved performance. We also raised the RPS available to users: we are now ready to handle about 250 RPS with the models we work with. Latency has dropped because there is less data and lookups run over a smaller volume. The pipelines have also become faster.

Conclusions

It is worth thinking about building your own feature platform (Feature Store) when the company already has models in production and it becomes harder every day to support them and add new ones. That is the key factor. But keep in mind that you will have to prepare pipelines for Online Storage.

Don’t be afraid to experiment. I’m sure that when you work on the solution, you’ll have your own vision of how to do it best.
