Feature store: what is it

What is a Feature Store?

A Feature Store for machine learning is a feature computation and storage service that enables features to be registered, discovered, and used both in ML pipelines and by online applications for model inference. Feature Stores are typically required both to store large volumes of feature data and to provide low-latency access to features for online applications. As such, they are typically implemented as a dual-database system: a low-latency online feature store (typically a key-value store or real-time database) and a scale-out SQL database that stores large volumes of feature data for training and batch applications. The online feature store enables online applications to enrich feature vectors with near real-time feature data before performing inference requests. The offline feature store holds large volumes of feature data that are used to create train/test data for model development, or by batch applications for model scoring. The Feature Store solves the following problems in ML pipelines:


The Feature Store for ML consists of both an online and an offline database, and Databricks can be used to transform raw data from backend systems into engineered features cached in the online and offline stores. Those features are made available to online and batch applications for inferencing and for creating train/test data for model training.

Engineer Features in Databricks, publish to the Feature Store

The process for ingesting and featurizing new data is separate from the process for training models using features that come from potentially many different sources. That is, there are often differences in the cadence for feature engineering compared to the cadence for model training. Some features may be updated every few seconds, while others are updated every few months. Models, on the other hand, can be trained on demand, regularly (every day or every week, for example), or when monitoring shows a model’s performance has degraded. Feature engineering pipelines are typically triggered at regular intervals when new data arrives or on-demand when source code is pushed to git because changes were made in how features are engineered.


Feature pipelines have a natural cadence for each data source, and the cached features can be reused by many downstream model training pipelines. Feature Pipelines can be developed in Spark or Pandas applications that are run on Databricks. They can be combined with data validation libraries like Deequ to ensure feature data is correct and complete.

The feature store enables feature pipelines to cache feature data for use by many downstream model training pipelines, reducing the time to create/backfill features. Groups of features are often computed together and have their own natural ingestion cadence, see figure above. Real-time features may be updated in the online feature store every few seconds using a streaming application, while batch features could be updated hourly, daily, weekly, or monthly.

In practice, feature pipelines are data pipelines whose output is cleaned, validated, featurized data. As there are typically no guarantees on the correctness of the incoming data, input data must be validated and any missing values must be handled (often by either imputing them or ignoring them). One popular framework for data validation with Spark is AWS Deequ, which lets you extend traditional schema-based validation (e.g., this column should contain integers) with data validation rules for numerical or categorical values. For example, while a schema ensures that a numerical feature is of type float, additional validation rules are needed to ensure those floats lie within an expected range. You can also check that a column’s values are unique and not null, and that its descriptive statistics are within certain ranges. Validated data is then transformed into numeric and categorical features that are cached in the feature store and subsequently used both to train models and for batch/online model inferencing.
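As a minimal sketch of the kind of rules described above, the following plain-Python example checks a range constraint, uniqueness, and non-null values on a small batch of rows. It does not use Deequ; the function name `validate_rows`, the column names, and the thresholds are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List


def validate_rows(rows: List[Dict[str, Any]], key: str, column: str,
                  low: float, high: float) -> List[str]:
    """Return a list of human-readable rule violations (empty if the batch is clean)."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        value = row.get(column)
        # completeness rule: the feature column must not be null
        if value is None:
            errors.append(f"row {i}: {column} is null")
        # range rule: a float can have the right type and still be out of range
        elif not (low <= value <= high):
            errors.append(f"row {i}: {column}={value} outside [{low}, {high}]")
        # uniqueness rule: the key column must not repeat
        if row[key] in seen_keys:
            errors.append(f"row {i}: duplicate {key}={row[key]}")
        seen_keys.add(row[key])
    return errors


batch = [
    {"user_id": 1, "age": 34.0},
    {"user_id": 2, "age": None},
    {"user_id": 2, "age": 212.0},
]
violations = validate_rows(batch, key="user_id", column="age", low=0.0, high=120.0)
```

A pipeline could refuse to write a batch to the feature store whenever the returned list is non-empty, or write it and raise an alert, depending on how strict the feature group needs to be.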

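import hsfs
from pyspark.sql import functions as F

# "prod" is the production feature store
conn = hsfs.connection(host="ea2.aws.hopsworks.ai", project="prod")
featurestore = conn.get_feature_store()

# read raw data and use Spark to engineer features
# (`spark` is the SparkSession provided by the Databricks notebook)
raw_data_df = spark.read.parquet('/parquet_partitioned')
# square every column; this assumes all columns are numeric
polynomial_features = raw_data_df.select(
    [(F.col(c) ** 2).alias(c) for c in raw_data_df.columns])

# Features computed together in a DataFrame are in the same feature group
fg = featurestore.create_feature_group(name='fg_revenue',
                                       version=1,
                                       type='offline')
# ingest the DataFrame into the feature group and compute its statistics
fg.save(polynomial_features)
fg.compute_statistics()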


In this code snippet, we connect to the Hopsworks Feature Store, read some raw data into a DataFrame from a parquet file, and transform the data into polynomial features. Then, we create a feature group; its version is ‘1’ and it is only to be stored in the ‘offline’ feature store. Finally, we ingest our new polynomial_features DataFrame into the feature group, and compute statistics over the feature group that are also stored in the Hopsworks Feature Store. Note that Pandas DataFrames are supported as well as Spark DataFrames, and there are both Python and Scala/Java APIs.

When a feature store is available, the output of feature pipelines is cached feature data, stored in the feature store. Ideally, the destination data sink will have support for versioned data, such as Apache Hudi in the Hopsworks Feature Store. In Hopsworks, feature pipelines upsert (insert or update) data into existing feature groups, where a feature group is a set of features computed together (typically because they come from the same backend system and are related by some entity or key). Every time a feature pipeline runs for a feature group, it creates a new commit in the sink Hudi dataset. This way, we can track and query different commits to feature groups in the Feature Store, and monitor changes to statistics of ingested data over time.
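To illustrate the upsert-and-commit behaviour described above, here is a small in-memory sketch. It is not the Hudi or Hopsworks API; the class `VersionedFeatureGroup` and its methods are hypothetical, and each call to `upsert` stands in for one feature pipeline run.

```python
from typing import Any, Dict, List, Optional


class VersionedFeatureGroup:
    """Toy feature group: every upsert creates a new commit (a full snapshot)."""

    def __init__(self, primary_key: str) -> None:
        self.primary_key = primary_key
        self.commits: List[Dict[Any, Dict[str, Any]]] = []

    def upsert(self, rows: List[Dict[str, Any]]) -> int:
        # start from the latest snapshot, then insert new keys or update existing ones
        snapshot = dict(self.commits[-1]) if self.commits else {}
        for row in rows:
            snapshot[row[self.primary_key]] = row
        self.commits.append(snapshot)
        return len(self.commits) - 1  # commit id

    def read(self, commit: Optional[int] = None) -> List[Dict[str, Any]]:
        # read the latest commit by default, or an older one by id
        if not self.commits:
            return []
        snapshot = self.commits[-1 if commit is None else commit]
        return sorted(snapshot.values(), key=lambda r: r[self.primary_key])


fg = VersionedFeatureGroup(primary_key="user_id")
first = fg.upsert([{"user_id": 1, "revenue": 10.0}])
second = fg.upsert([{"user_id": 1, "revenue": 12.5}, {"user_id": 2, "revenue": 3.0}])
```

Reading `fg.read(first)` returns the data as it was after the first run, while `fg.read()` returns the latest state. Copying the full snapshot on every commit keeps the sketch short but would not scale to large feature groups.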

You can find an example notebook for feature engineering with PySpark in Databricks and registering features with Hopsworks here.

Model Training Pipelines in Databricks start at the Feature Store


Model training with a feature store typically involves at least three stages:

import hsfs
conn = hsfs.connection(host="ea2.aws.hopsworks.ai", project="prod")
featurestore = conn.get_feature_store()

# get feature groups from which you want to create a training dataset
fg1 = featurestore.get_feature_group('fg_revenue', version=1)
fg2 = featurestore.get_feature_group('fg_users', version=2)
# lazily join features
joined_features = fg1.select_all() \
    .join(fg2.select(['user_id', 'age']), on=['user_id'])

# `sink` is a storage connector for the destination; the source does not show how it is obtained
td = featurestore.create_training_dataset(name='revenue_prediction',
                                          version=1,
                                          data_format='tfrecords',
                                          storage_connector=sink,
                                          splits={'train': 0.8, 'test': 0.2})
# materialize the joined features as the training dataset
td.save(joined_features)


Data Scientists are able to rely on the quality and business logic correctness in published features and can therefore quickly export and create training datasets in their favourite data format.

You can find an example notebook for getting started with creating train/test datasets from Hopsworks in Databricks here.

Deploying the Hopsworks Feature Store for Databricks

Source

Managing entity features in Apache Kafka

When working on machine learning tasks with online data, there is a need to assemble various entities into one for further analysis and evaluation. The assembly process must be convenient and fast. It often also has to provide a seamless transition from development to production use without extra effort or routine work. The Feature Store approach can be used to solve this problem. The approach is described in detail here: Meet Michelangelo: Uber’s Machine Learning Platform. This article describes an interpretation of that solution for feature management, in the form of a prototype.

A Feature Store can be viewed as a service that must perform its functions strictly according to its specification. Before defining this specification, it is worth working through a simple example.

Example

Suppose the following entities are given.

A film, which has an identifier and a title.

A film rating, which also has its own identifier, a film identifier, and a rating value. The rating changes over time.

A rating source, which also has its own rating and also changes over time.
These entities need to be combined into one.

Here is the result.

Entity diagram

As you can see, the entities are joined by their keys. That is, all of a film's ratings are looked up for the film, and all of the source's ratings are looked up for each film rating.
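The key-based join described above can be sketched in plain Python, without Kafka. The field names (`film_id`, `source_id`, and so on), the sample data, and the function `join_entities` are hypothetical; the article does not give the exact schema.

```python
from typing import Any, Dict, List

films = [{"film_id": 1, "title": "Solaris"}]
film_ratings = [
    {"rating_id": 10, "film_id": 1, "source_id": 100, "value": 8.1},
    {"rating_id": 11, "film_id": 1, "source_id": 101, "value": 7.4},
]
source_ratings = [
    {"source_id": 100, "source_rating": 0.9},
    {"source_id": 101, "source_rating": 0.6},
]


def join_entities(films: List[Dict[str, Any]],
                  film_ratings: List[Dict[str, Any]],
                  source_ratings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Join film -> film ratings on film_id, then ratings -> sources on source_id."""
    sources_by_id: Dict[Any, List[Dict[str, Any]]] = {}
    for s in source_ratings:
        sources_by_id.setdefault(s["source_id"], []).append(s)
    joined = []
    for film in films:
        for rating in film_ratings:
            if rating["film_id"] != film["film_id"]:
                continue
            for source in sources_by_id.get(rating["source_id"], []):
                # merge the three entities into one flat record
                joined.append({**film, **rating, **source})
    return joined


rows = join_entities(films, film_ratings, source_ratings)
```

Each resulting row carries the film title, one of its ratings, and the rating of the source that produced it, which is the combined entity the example asks for.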

Generalizing the example

The example can now be generalized and scaled to all entities that can be linked by keys.

There are Kafka streams that define the entities: A, B… NN.
These streams need to be joined to create new streams: AB, BCD… NM.
This process must be managed by a service: the Feature Stream Engine.

The Feature Stream Engine can join entities in Kafka streams using the Feature Stream Store metadata repository and the Feature Stream Center, which acts as a single point of entry for managing joins.

Generalized diagram of entities and the Feature Stream Engine

Feature Stream Store

The metadata store is a service that stores data about streams, entities, and their relationships.

The basic unit of the store is the feature.

A feature has its own identifier, a reference to its source, a name, and a type.

A source groups features and is bound to a specific stream.
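A minimal sketch of this metadata model, assuming nothing beyond the attributes listed above. The class names `Feature` and `Source`, the helper `add_feature`, and the sample topic name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    feature_id: int
    source_name: str  # reference to the source the feature belongs to
    name: str
    dtype: str


@dataclass
class Source:
    name: str
    topic: str  # the stream (Kafka topic) the source is bound to
    features: List[Feature] = field(default_factory=list)

    def add_feature(self, feature_id: int, name: str, dtype: str) -> Feature:
        feature = Feature(feature_id, self.name, name, dtype)
        self.features.append(feature)
        return feature


ratings = Source(name="film_rating", topic="film-ratings")
ratings.add_feature(1, "film_id", "long")
ratings.add_feature(2, "value", "double")
```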

Feature Stream Center

The control center lets you create new streams and interacts with delivery and deployment services to keep the new streams running in various runtime environments, including production.

Feature Stream Engine

The Feature Stream Engine handles work with streams, as well as interaction with external services and end users.

Components of the Feature Stream Engine

Feature Stream Engine architecture

The Feature Stream Engine is a construction kit that lets you assemble features from various streams and deliver this functionality to various environments.

The Feature Stream Engine must implement the following functions.

Describe data sources.
Bind data sources to Kafka streams.
Describe features and bind them to data sources.
Create new data sources from existing ones by joining on keys (special features).
Deploy stream processing functionality to various environments, including production.

Feature Stream Engine architecture

Prototype

To implement the idea, the functionality needs to be simplified.

Several streams will be joined by keys and written to a single stream.

Assume the metadata is described by properties files ("configration.properties").

This data implements the following model.

Data sources, given as Kafka topic names. Comma-separated.
Keys in those data sources. Comma-separated.
The name of the resulting topic.

Converting the input parameters into a structure that describes the stream join.
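The conversion step can be sketched as follows, in Python rather than the prototype's Java. The property names (`sources`, `keys`, `result`), the topic names, and the rule of one key per source topic are assumptions made for this illustration; the article does not show the actual file.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JoinSpec:
    topics: List[str]   # source topics, in join order
    keys: List[str]     # join key for each source topic
    result_topic: str


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple `name=value` lines, skipping blanks and `#` comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def to_join_spec(props: Dict[str, str]) -> JoinSpec:
    topics = [t.strip() for t in props["sources"].split(",")]
    keys = [k.strip() for k in props["keys"].split(",")]
    if len(topics) != len(keys):
        raise ValueError("expected one key per source topic")
    return JoinSpec(topics, keys, props["result"])


config = """
# hypothetical property names
sources=films,film-ratings,source-ratings
keys=film_id,film_id,source_id
result=film-features
"""
spec = to_join_spec(parse_properties(config))
```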

The main join method.

Topics are created whose keys are the keys on which the join is to be performed.

Sending to the final topic.

The stream join object (the main object) is constructed.

The Kafka wrapper is created.

Example of running the application.

Language: Java 1.8.
Libraries: kafka 2.6.0, jsoup 1.13.1.

Conclusion

The solution described has a number of limitations and does not implement the full functionality. But it also has several advantages.

First, it lets you quickly construct a join of topics.
Second, it lets you quickly run the join in various environments.

It is worth noting that the solution imposes a constraint on the structure of the input data: the topics must have a tabular structure. To overcome this limitation, an additional layer could be introduced that reduces various structures to a tabular one.

For a production implementation of the full functionality, it is worth looking at a very powerful and, most importantly, flexible tool: KSQL.

Source

Feature Store for ML


Data scientists are duplicating work because they don’t have a centralized feature store. Everybody I talk to really wants to build or even buy a feature store. If an organization had a feature store, the ramp-up period [for Data Scientists can be much faster].

Featured from the blog

Feature Store Milestones

A summary of the most important Feature Store milestones.
by Nathalia Ariza

Beyond Brainless AI with a Feature Store

Evolve your models from brain-free AI to Total Recall AI with the help of a Feature Store.
by Jim Dowling

Data Lineage Doesn’t Have To Be Hard

Organizations now have the desire, and the regulatory requirement, to keep track of their data and how it is used inside of their company.
by Jack Ploshnick

Hopsworks

The first fully open-source Feature Store, based around Dataframes (Spark/Pandas), Hive (offline), and MySQL Cluster (online) databases. Supports model training/management/serving and Provenance.

Other content:
Pydata London meetup 2019

Amazon Feature Store

Amazon launched the Amazon SageMaker Feature Store during their annual re:Invent keynote.
Documentation: Create, Store and Share Features with Amazon SageMaker Feature Store

Zipline

Airbnb uses Zipline for feature management as part of their Bighead platform for ML.

Comcast

Comcast have had 2 iterations of their Feature Store, and as of early 2020, appear to be using Redis as their online Feature Store. They have previously used Flink for online feature computation and its queryable state API.

Pinterest

Galaxy is Pinterest’s incremental dataflow-based Feature Store on AWS. It includes a DSL for Feature Engineering, Linchpin.

Aluxio Inc

Wix’ Feature Store is based on storing feature data in protobufs, with batch processing using SparkSQL on parquet files stored in S3 and online serving based on HBase/Redis. It provides a Python API for accessing training data as Pandas Dataframes.

Bigabid

The Bigabid Feature Store contains thousands of features and is a centralized software library and documentation center that “creates a single feature from a standardized input (data)”. Read more here:
https://www.bigabid.com/blog/data-the-importance-of-having-a-feature-store

Apple

Overton is Apple’s platform for managing data and models for ML. There is a publication about it: Overton: A Data System for Monitoring and Improving Machine-Learned Products

StreamSQL

StreamSQL have built a Feature Store as a commercial product based on Apache Pulsar, Cassandra, and Redis.

Feast

GoJek/Google released Feast in early 2019 and it is built around Google Cloud services: Big Query (offline) and Big Table (online) and Redis (low-latency), using Beam for feature engineering.

Tecton

Tecton are developing a managed feature store for AWS that manages both features and feature computations.

ScribbleData

ScribbleData have developed a feature store for ML.

Intuit

Intuit have built a feature store as part of their data science platform. It was developed for AWS and uses S3 and Dynamo as its offline/online feature serving layers.

Google MLops Platform with a Feature Store (Vertex AI)

Google released Vertex AI, a managed platform for developing and operating AI applications with its own feature store.
Documentation: Vertex AI, Feature Store

DoorDash

DoorDash created a large storage capacity and high read/write throughput Feature Store using Redis.

Databricks Feature Store

Databricks announced their feature store as a part of their Machine learning platform during the 2021 Data and AI summit’s keynote.

Splice Machine

Michelangelo Palette

The first Feature Store (by Uber) that provides a DSL and is heavily built around Spark/Scala with Hive (offline) and Cassandra (online) databases. It is now called Michelangelo Palette.

See also this talk about Michelangelo Palette at InfoQ:

Netflix

Netflix uses shared feature encoder libraries in their Metaflow platform to ensure consistency between training and serving, with S3 for offline features and microservices for serving online features. There are shared feature engineering libraries, written in Java. Runway, their model management platform, builds on Metaflow.

FBLearner

Not much is known about Facebook’s Feature Store; cursory information is given here.

Twitter

Twitter decided to build a library, not a store. It is a set of shared feature libraries and metadata, along with shared file systems, object stores, and databases.

Zomato

Zomato have used Flink to compute features in real time and then integrate their real-time feature store with their applications. They note that the real-time feature store needs high-throughput reads and writes at low latency (>1m writes/min). They use managed ElastiCache/Redis on AWS for the online feature store.

Survey Monkey

A Feature Store for AWS that has both an offline and an online database.

Spotify

A Feature Store for KubeFlow on GCP.


Concepts & Articles

Feature Store Articles

Feature Store Concepts

Consistent Features – Online & Offline

If feature engineering code is not the same in training and inferencing systems, there is a risk that features will not be computed consistently, and predictions may therefore be unreliable. One solution is to have feature engineering jobs write their feature data to both an online and an offline database. Both training and inferencing applications need to read their features when they make predictions; online applications may need low-latency (real-time) access to that feature data. The other solution is to use shared feature engineering libraries, which works if your online application and training application are both able to use the same shared libraries (e.g., both are JVM-based).
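The shared-library approach can be sketched as a single transformation function that both the training path and the serving path import. The feature names and the functions `transaction_features`, `build_training_row`, and `build_serving_vector` are hypothetical and only illustrate the idea.

```python
import math
from typing import Dict


def transaction_features(amount: float, typical_amount: float) -> Dict[str, float]:
    """Single definition of the features, imported by both training and serving code."""
    return {
        "log_amount": math.log1p(amount),
        "amount_ratio": amount / typical_amount if typical_amount > 0 else 0.0,
    }


def build_training_row(record: Dict[str, float]) -> Dict[str, float]:
    # offline path: historical record plus its label
    row = transaction_features(record["amount"], record["typical_amount"])
    row["label"] = record["is_fraud"]
    return row


def build_serving_vector(request: Dict[str, float]) -> Dict[str, float]:
    # online path: same function, so the feature logic cannot diverge
    return transaction_features(request["amount"], request["typical_amount"])


train_row = build_training_row({"amount": 200.0, "typical_amount": 50.0, "is_fraud": 1.0})
serve_vec = build_serving_vector({"amount": 200.0, "typical_amount": 50.0})
```

Because both paths call the same function, a change to the feature logic reaches training and serving together. This only works when both applications can load the same library, which is the limitation noted above.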

Time Travel

“Given these events in the past what were the feature values of the customer during the time of that event” Carlo Hyvönen

Time travel is not normally found in databases: you cannot typically query the value of some column at some point in time. You can work around this by ensuring all schemas defining feature data include a datetime/event-time column. However, recent data lakes have added support for time-travel queries by storing all updates, which enables queries on old feature values. Some data platforms supporting time travel functionality:
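The event-time workaround can be sketched as an "as of" lookup over a history of updates. This is an in-memory illustration, not the API of any data platform; the class `FeatureHistory` and the integer timestamps are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class FeatureHistory:
    """Keeps every update of one feature, per entity, ordered by event time."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Tuple[int, float]]] = {}

    def write(self, entity_id: str, event_time: int, value: float) -> None:
        updates = self._history.setdefault(entity_id, [])
        updates.append((event_time, value))
        updates.sort()  # keep updates ordered even if they arrive late

    def as_of(self, entity_id: str, event_time: int) -> Optional[float]:
        """Return the value that was current at `event_time`, or None if none existed yet."""
        updates = self._history.get(entity_id, [])
        times = [t for t, _ in updates]
        idx = bisect_right(times, event_time)
        return updates[idx - 1][1] if idx > 0 else None


history = FeatureHistory()
history.write("customer-7", event_time=100, value=1.0)
history.write("customer-7", event_time=200, value=2.5)
```

Asking for the value as of time 150 returns the update written at time 100, because the later update did not exist yet at that point.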

Feature Engineering

Michelangelo added a domain-specific language (DSL) to support engineering features from raw data sources (databases, data lake). However, it is also popular to use general purpose frameworks like Apache Spark/PySpark, Pandas, Apache Flink, and Apache Beam.

Materialize Train/Test Data?


Online Feature Store

Models may have been trained with hundreds of features, but online applications may receive just a few of those features from a user interaction (userId, sessionId, productId, datetime, etc.). The online feature store is used by online applications to look up the missing features and build a feature vector that is sent to an online model for predictions. Online models are typically served over the network, as this decouples the model’s lifecycle from the application’s lifecycle. The latency, throughput, security, and high availability of the online feature store are critical to its success in the enterprise. Below is shown the throughput of some key-value and in-memory databases that are used in existing feature stores.
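A minimal sketch of that lookup, with a dictionary standing in for the online store. The keys, feature names, and the function `build_feature_vector` are hypothetical, and a real online store would be reached over the network rather than held in memory.

```python
from typing import Dict, List

# hypothetical online store: entity key -> precomputed features
ONLINE_STORE: Dict[str, Dict[str, float]] = {
    "user:42": {"user_avg_basket": 31.5, "user_orders_30d": 4.0},
    "product:9": {"product_ctr_7d": 0.12},
}

# the order of features the model was trained with
FEATURE_ORDER: List[str] = ["user_avg_basket", "user_orders_30d", "product_ctr_7d", "hour_of_day"]


def build_feature_vector(request: Dict[str, int]) -> List[float]:
    """Enrich the few values in the request with features looked up by key."""
    features: Dict[str, float] = {"hour_of_day": float(request["hour"])}
    features.update(ONLINE_STORE.get(f"user:{request['userId']}", {}))
    features.update(ONLINE_STORE.get(f"product:{request['productId']}", {}))
    # features missing from the store fall back to 0.0 in this sketch
    return [features.get(name, 0.0) for name in FEATURE_ORDER]


vector = build_feature_vector({"userId": 42, "productId": 9, "hour": 14})
```

The request carries only three values, and the remaining features come from the store. Falling back to 0.0 for missing entities is a simplification; the right default depends on how the model was trained.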

Source

What is a Feature Store?

Blog co-authored with Mike Del Balso, Co-Founder and CEO of Tecton, and cross-posted here

Data teams are starting to realize that operational machine learning requires solving data problems that extend far beyond the creation of data pipelines.

In Why We Need DevOps for ML Data, Tecton highlighted some of the key data challenges that teams face when productionizing ML systems.

Production data systems, whether for large scale analytics or real-time streaming, aren’t new. However, operational machine learning — ML-driven intelligence built into customer-facing applications — is new for most teams. The challenge of deploying machine learning to production for operational purposes (e.g. recommender systems, fraud detection, personalization, etc.) introduces new requirements for our data tools.

A new kind of ML-specific data infrastructure is emerging to make that possible.

Increasingly, Data Science and Data Engineering teams are turning to feature stores to manage the datasets and data pipelines needed to productionize their ML applications. This post describes the key components of a modern feature store and how the sum of these parts acts as a force multiplier for organizations by reducing duplication of data engineering efforts, speeding up the machine learning lifecycle, and unlocking a new kind of collaboration across data science teams.


Quick refresher: in ML, a feature is data used as an input signal to a predictive model. For example, if a credit card company is trying to predict whether a transaction is fraudulent, a useful feature might be whether the transaction is happening in a foreign country, or how the size of this transaction compares to the customer’s typical transaction. When we refer to a feature, we’re usually referring to the concept of that signal (e.g. “transaction_in_foreign_country”), not a specific value of the feature (e.g. not “transaction #1364 was in a foreign country”).

Enter the feature store

“The interface between models and data”

We first introduced feature stores in our blog post describing Uber’s Michelangelo platform. Feature stores have since emerged as a necessary component of the operational machine learning stack.

Feature stores make it easy to:

Feature stores aim to solve the full set of data management problems encountered when building and operating operational ML applications.

A feature store is an ML-specific data system that:

To support simple feature management, feature stores provide data abstractions that make it easy to build, deploy, and reason about feature pipelines across environments. For example, they make it easy to define a feature transformation once, then calculate and serve its values consistently across both the development environment (for training on historical values) and the production environment (for inference with fresh feature values).

Feature stores act as a central hub for feature data and metadata across an ML project’s life-cycle. Data in a feature store is used for:

Feature stores bring economies of scale to ML organizations by enabling collaboration. When a feature is registered in a feature store, it becomes available for immediate reuse by other models across the organization. This reduces duplication of data engineering efforts and allows new ML projects to bootstrap with a library of curated production-ready features.


Effective feature stores are designed to be modular systems that can be adapted to the environment in which they’re deployed. There are five primary components that typically make up a feature store. In the rest of this post, we will walk through those components and describe their role in powering operational ML applications.

Components of a Feature Store

There are 5 main components of a modern feature store: Transformation, Storage, Serving, Monitoring, and Feature Registry.


In the following sections we’ll give an overview of the purpose and typical capabilities of each of these components.

Serving

Feature stores serve feature data to models. Those models require a consistent view of features across training and serving. The definitions of features used to train a model must exactly match the features provided in online serving. When they don’t match, training-serving skew is introduced which can cause catastrophic and hard-to-debug model performance problems.


Feature stores abstract away the logic and processing used to generate a feature, providing users an easy and canonical way to access all features in a company consistently across all environments in which they’re needed.

When retrieving data offline (i.e. for training), feature values are commonly accessed through notebook-friendly feature store SDKs. They provide point-in-time correct views of the state of the world for each example used to train a model (a.k.a. “time-travel”).

For online serving, a feature store delivers a single vector of features at a time made up of the freshest feature values. Responses are served through a high-performance API backed by a low-latency database.


Storage

Feature stores persist feature data to support retrieval through feature serving layers. They typically contain both an online and offline storage layer to support the requirements of different feature serving systems.


Offline storage layers are typically used to store months’ or years’ worth of feature data for training purposes. Offline feature store data is often stored in data warehouses or data lakes like S3, BigQuery, Snowflake, Redshift. Extending an existing data lake or data warehouse for offline feature storage is typically preferred to prevent data silos.

Online storage layers are used to persist feature values for low-latency lookup during inference. They typically only store the latest feature values for each entity, essentially modeling the current state of the world. Online stores are usually eventually consistent, and do not have strict consistency requirements for most ML use cases. They are usually implemented with key-value stores like DynamoDB, Redis, or Cassandra.


Feature stores use an entity-based data model where each feature value is associated with an entity (e.g. a user) and a timestamp. An entity-based data model provides minimal structure to support standardized feature management, fits naturally with common feature engineering workflows, and allows for simple retrieval queries in production.
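As a rough sketch of this entity-based model, the example below keys every value by entity and feature name, appends each write to an offline history, and keeps only the most recent value online. The class `TinyFeatureStore`, the integer timestamps, and the feature name are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, float]  # (entity_id, feature_name, timestamp, value)


class TinyFeatureStore:
    def __init__(self) -> None:
        self.offline: List[Row] = []                                # full history
        self.online: Dict[Tuple[str, str], Tuple[int, float]] = {}  # latest value only

    def write(self, entity_id: str, feature: str, timestamp: int, value: float) -> None:
        self.offline.append((entity_id, feature, timestamp, value))
        current = self.online.get((entity_id, feature))
        # only overwrite the online value if this write is at least as recent
        if current is None or timestamp >= current[0]:
            self.online[(entity_id, feature)] = (timestamp, value)

    def get_online(self, entity_id: str, feature: str) -> float:
        return self.online[(entity_id, feature)][1]


store = TinyFeatureStore()
store.write("user_1", "clicks_30m", timestamp=10, value=3.0)
store.write("user_1", "clicks_30m", timestamp=20, value=5.0)
store.write("user_1", "clicks_30m", timestamp=15, value=4.0)  # late-arriving update
```

The late-arriving write is kept in the offline history for training but does not replace the fresher online value, so the online side keeps modeling the current state of the world.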

Transformation


Operational ML applications require regular processing of new data into feature values so models can make predictions using an up-to-date view of the world. Feature stores both manage and orchestrate data transformations that produce these values, as well as ingest values produced by external systems. Transformations managed by feature stores are configured by definitions in a common feature registry (described below).

Feature stores commonly interact with three main types of data transformations:

Transformations that are applied only to data at rest. Input data sources: data warehouse, data lake, database. Examples: user country, product category.

Transformations that are applied to streaming sources. Input data sources: Kafka, Kinesis, PubSub. Examples: number of clicks per vertical per user in the last 30 minutes, number of views per listing in the past hour.

Transformations that are used to produce features based on data that is only available at the time of the prediction. These features cannot be pre-computed. Examples: is the user currently in a supported location? Similarity score between listing and search query.

A key benefit is to make it easy to use different types of features together in the same models.
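To make the streaming case concrete, here is a small sketch of a trailing-window count such as "clicks per user in the last 30 minutes". It runs in memory rather than on Kafka or Kinesis, assumes timestamps in seconds that arrive in order, and the class `SlidingClickCounter` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingClickCounter:
    """Counts clicks per user over the trailing `window_seconds`."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[int]] = defaultdict(deque)

    def add_click(self, user_id: str, timestamp: int) -> None:
        self._events[user_id].append(timestamp)

    def count(self, user_id: str, now: int) -> int:
        events = self._events[user_id]
        # drop clicks that have fallen out of the window
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)


counter = SlidingClickCounter(window_seconds=30 * 60)
for ts in (0, 600, 1700, 1900):
    counter.add_click("user_1", ts)
```

Calling `counter.count("user_1", now=2000)` discards the click at time 0 and counts the remaining three. Expired clicks are removed on read, so queries are expected to come with non-decreasing `now` values.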


Models need access to fresh feature values for inference. Feature stores accomplish this by recomputing features on an ongoing basis. Transformation jobs are orchestrated to ensure new data is processed and turned into fresh feature values. These jobs are executed on data processing engines (e.g. Spark or Pandas) to which the feature store is connected.

Model development introduces different transformation requirements. When iterating on a model, new features are often engineered to be used in training datasets that correspond to historical events (e.g. all purchases in the past 6 months). To support these use cases, feature stores make it easy to run “backfill jobs” that generate and persist historical values of a feature for training. Some feature stores automatically backfill newly registered features for preconfigured time ranges for registered training datasets.
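A backfill job can be sketched as running the same feature function once per historical date. The raw events, the feature `spend_last_3_days`, and the function `backfill` are hypothetical and stand in for whatever transformation and storage a real feature store would use.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Tuple

# hypothetical raw events: (purchase date, amount)
PURCHASES: List[Tuple[date, float]] = [
    (date(2021, 1, 1), 20.0),
    (date(2021, 1, 2), 5.0),
    (date(2021, 1, 4), 12.5),
]


def spend_last_3_days(as_of: date) -> float:
    """Feature value as it would have been computed on `as_of` (inclusive)."""
    start = as_of - timedelta(days=2)
    return sum(amount for day, amount in PURCHASES if start <= day <= as_of)


def backfill(feature_fn: Callable[[date], float], start: date, end: date) -> Dict[date, float]:
    """Compute and collect one historical feature value per day in [start, end]."""
    values = {}
    day = start
    while day <= end:
        values[day] = feature_fn(day)
        day += timedelta(days=1)
    return values


history = backfill(spend_last_3_days, date(2021, 1, 1), date(2021, 1, 5))
```

Each value only uses purchases up to its own date, so the backfilled history matches what a daily job would have produced had the feature existed at the time.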

Reusing transformation code across environments prevents training-serving skew and frees teams from having to rewrite code from one environment to the next.

Feature stores manage all feature-related resources (compute, storage, serving) holistically across the feature lifecycle. By automating the repetitive engineering tasks needed to productionize a feature, they enable a simple and fast path to production. Management optimizations (e.g. retiring features that aren’t being used by any models, or deduplicating feature transformations across models) can bring significant efficiencies, especially as teams grow and the complexity of managing features manually increases.

Monitoring

When something goes wrong in an ML system, it’s usually a data problem. Feature stores are uniquely positioned to detect and surface such issues. They can calculate metrics on the features they store and serve that describe correctness and quality. Feature stores monitor these metrics to provide a signal of the overall health of an ML application.


Feature data can be validated against user-defined schemas or other structural criteria. Data quality is tracked by monitoring for drift and training-serving skew. For example, the feature data served to a model is compared with the data on which the model was trained, to detect inconsistencies that could degrade model performance.
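The two checks described above can be sketched in a few lines. The schema, the feature names, and the choice of a mean-shift statistic are our assumptions for illustration; production systems typically use richer schema languages and drift metrics.

```python
from statistics import mean, pstdev

# Hypothetical user-defined schema: feature name -> expected Python type.
SCHEMA = {"user_id": int, "purchases_last_7d": int, "avg_basket_value": float}


def validate_row(row, schema):
    """Return a list of human-readable schema violations for one feature row."""
    errors = []
    for name, expected_type in schema.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
        elif not isinstance(row[name], expected_type):
            actual = type(row[name]).__name__
            errors.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return errors


def mean_shift(training_values, serving_values):
    """Distance of the serving mean from the training mean, in training std devs."""
    spread = pstdev(training_values)
    gap = abs(mean(serving_values) - mean(training_values))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread
```

A monitoring job could flag a feature when `mean_shift` exceeds a chosen threshold; the threshold itself is a tuning decision that depends on the use case.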

When running production systems, it’s also important to monitor operational metrics. Feature stores track operational metrics relating to core functionality, such as feature storage (availability, capacity, utilization, staleness) and feature serving (throughput, latency, error rates). Other metrics describe the operation of important adjacent system components, for example external data processing engines (job success rate, throughput, processing lag and rate).

Feature stores make these metrics available to existing monitoring infrastructure. This allows ML application health to be monitored and managed with existing observability tools in the production stack.

Having visibility into which features are used by which models, feature stores can automatically aggregate alerts and health metrics into views relevant to specific users, models, or consumers.
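As a small sketch of that aggregation, suppose the feature store knows which features each model consumes. Per-feature alerts can then be regrouped into one view per model. The mapping, the model names, and the alert strings below are hypothetical.

```python
from collections import defaultdict

# Hypothetical usage map: which features each model consumes.
MODEL_FEATURES = {
    "fraud_model": ["txn_count_1h", "avg_basket_value"],
    "ranking_model": ["avg_basket_value", "query_listing_similarity"],
}


def alerts_by_model(feature_alerts, model_features):
    """Group per-feature alerts into one list per model that uses the feature."""
    grouped = defaultdict(list)
    for model, features in model_features.items():
        for feature in features:
            for alert in feature_alerts.get(feature, []):
                grouped[model].append(f"{feature}: {alert}")
    return dict(grouped)  # models without alerts are simply absent
```

With a single alert on `avg_basket_value`, both models that depend on it would see the alert in their own view, while unrelated models would see nothing.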

It’s not essential that all feature stores implement such monitoring internally, but they should at least provide the interfaces into which data quality monitoring systems can plug. Different ML use cases can have different, specialized monitoring needs so pluggability here is important.

Registry

A critical component in all feature stores is a centralized registry of standardized feature definitions and metadata. The registry acts as a single source of truth for information about a feature in an organization.


The registry is a central interface for user interactions with the feature store. Teams use the registry as a common catalog to explore, develop, collaborate on, and publish new definitions within and across teams.

The definitions in the registry configure feature store system behavior. Automated jobs use the registry to schedule and configure data ingestion, transformation, and storage. It forms the basis of what data is stored in the feature store and how it is organized. Serving APIs use the registry for a consistent understanding of which feature values should be available, who should be able to access them, and how they should be served.

The registry allows for important metadata to be attached to feature definitions. This provides a route for tracking ownership, project or domain specific information, and a path to easily integrate with adjacent systems. This includes information about dependencies and versions which is used for lineage tracking.
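To make the registry idea concrete, here is a minimal in-memory sketch. The class names, fields, and methods are our own assumptions and do not correspond to any specific feature store product; the point is only to show definitions carrying ownership, version, and dependency metadata, with lineage derived from the dependencies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    owner: str
    depends_on: tuple = ()  # upstream sources or features, used for lineage
    tags: dict = field(default_factory=dict)  # project- or domain-specific metadata


class FeatureRegistry:
    def __init__(self):
        self._definitions = {}  # (name, version) -> FeatureDefinition

    def register(self, definition):
        """Add a definition; existing (name, version) pairs are never overwritten."""
        key = (definition.name, definition.version)
        if key in self._definitions:
            raise ValueError(f"{definition.name} v{definition.version} is already registered")
        self._definitions[key] = definition

    def latest(self, name):
        """Return the highest registered version of a feature."""
        versions = [v for (n, v) in self._definitions if n == name]
        if not versions:
            raise KeyError(name)
        return self._definitions[(name, max(versions))]

    def lineage(self, name):
        """Return names of all upstream dependencies, direct and transitive."""
        seen = []
        stack = list(self.latest(name).depends_on)
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.append(dep)
            # Only registered features have further dependencies to follow.
            if any(n == dep for (n, _) in self._definitions):
                stack.extend(self.latest(dep).depends_on)
        return seen
```

Refusing to overwrite an existing version is one simple way to keep earlier definitions available for debugging and auditing; a changed definition is registered under a new version number instead.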

To help with common debugging, compliance, and auditing workflows, the registry acts as an immutable record of what’s available analytically and what’s actually running in production.

So far, we’ve looked at the core minimal components of a feature store. In practice, companies often have needs like compliance, governance, and security that require additional enterprise-focused capabilities. That will be the topic of a future blog post.

Where to go to get started

We see feature stores as the heart of the data flow in modern ML applications. They are quickly proving to be critical infrastructure for data science teams putting ML into production. We expect 2021 to be a year of massive feature store adoption, as machine learning becomes a key differentiator for technology companies.

There are a few options for getting started with feature stores:

We wrote this blog post to provide a common definition of feature stores as they emerge as a primary component of the operational ML stack. We believe the industry is about to see an explosion of activity in this space.

