Datasets

Datasets YAML reference

A Spicepod can contain one or more datasets referenced by relative path, or defined inline.

datasets

Inline example:

spicepod.yaml

datasets:
  - from: spiceai:spice.ai/eth/beacon/eigenlayer
    name: strategy_manager_deposits
    params:
      app: goerli-app
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_interval: 1h
      refresh_mode: full # / append
      retention: 30m

spicepod.yaml

datasets:
  - from: databricks:databricks.com/spiceai/datasets
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_interval: 1h
      refresh_mode: full # / append
      retention: 30m

Relative path example:

spicepod.yaml

datasets:
  - from: datasets/eth_recent_transactions

datasets/eth_recent_transactions/dataset.yaml

from: spiceai:spice.ai/eth.recent_transactions
name: eth_recent_transactions
type: overwrite
acceleration:
  enabled: true
  refresh_interval: 1h

from

The from field is a string that represents the Uniform Resource Identifier (URI) for the dataset. The URI is composed of two parts: a prefix indicating the source of the dataset, and the link to the dataset within that source.

The syntax for the from field is as follows:

from: <source>:<link>

Where:

  • <source>: The source of the dataset

    Currently supported sources:

    • spiceai
    • dremio
    • databricks
    • s3
    • postgres
  • <link>: The link to the dataset within the source.
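
For example, in the first inline example above, the URI decomposes as follows (the comments are annotations only):

# source = spiceai
# link   = spice.ai/eth/beacon/eigenlayer
from: spiceai:spice.ai/eth/beacon/eigenlayer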

name

The name of the dataset. This is used to reference the dataset in the pod manifest, as well as in external data sources.

type

The type of dataset. The following types are supported:

  • overwrite - Overwrites the dataset with the contents of the dataset source.
  • append - Appends new data from the dataset source to the dataset.
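
For example, a minimal dataset.yaml for an append dataset (the source URI and dataset name are illustrative only):

from: spiceai:spice.ai/eth.recent_blocks
name: eth_recent_blocks
type: append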

acceleration

Optional. Accelerate queries to the dataset by caching data locally.

acceleration.enabled

Enable or disable acceleration. Defaults to true.

acceleration.engine

The acceleration engine to use. Defaults to arrow. The following engines are supported:

  • arrow - Accelerated in-memory, backed by Apache Arrow DataTables.
  • duckdb - Accelerated by an embedded DuckDB database.
  • postgres - Accelerated by a PostgreSQL database.

acceleration.mode

Optional. The mode of acceleration. The following values are supported:

  • memory - Store acceleration data in-memory.
  • file - Store acceleration data in a file.

mode is currently only supported for the duckdb engine.
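
For example, a minimal sketch of a file-backed acceleration using the duckdb engine:

acceleration:
  enabled: true
  engine: duckdb
  mode: file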

acceleration.refresh_mode

Optional. How to refresh the dataset. The following values are supported:

  • full - Refresh the entire dataset.
  • append - Append new data to the dataset.

acceleration.refresh_interval

Optional. How often the data should be refreshed. Only supported for the full refresh mode; for append datasets, the refresh interval is not used.

For example, 1h for 1 hour, 1m for 1 minute, 1s for 1 second, etc.
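
For example, an acceleration that fully refreshes the dataset every hour:

acceleration:
  enabled: true
  refresh_mode: full
  refresh_interval: 1h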

acceleration.retention

Optional. Only supported for append datasets. Specifies how long data from the data source is retained before it is deleted.

If not specified, the default retention is to keep all data.
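
For example, an append acceleration that retains the most recent 30 minutes of data:

acceleration:
  enabled: true
  refresh_mode: append
  retention: 30m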

acceleration.params

Optional. Parameters to pass to the acceleration engine. The parameters are specific to the acceleration engine used.
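
A sketch of the structure only; the parameter key and value below are hypothetical placeholders, and the supported keys depend on the engine in use:

acceleration:
  engine: duckdb
  params:
    example_param: example_value # hypothetical placeholder; see the engine's documentation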

acceleration.engine_secret

Optional. The secret store key that holds the connection credential for the acceleration engine. For supported data connectors, use spice login to store the secret.
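
For example, a sketch of a postgres-backed acceleration whose connection credential is read from a secret store key (the key name is illustrative):

acceleration:
  enabled: true
  engine: postgres
  engine_secret: postgres_acceleration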