Models

Models are the backbone of a vinyl project. Models are cleaned representations of your source data. Models can either be virtualized or backed by a table (common in data warehouses). Models can be published for use in the semantic layer or used internally for analysis.

To create a model, import the model decorator from vinyl and create function.

my_project/models.py
from vinyl import model

@model
def top_shops():
    pass

Models are isolated functions that handle transforms of data. Models are built in a pipelined way through mutable state. Models are deferred in their execution allowing for lazy evaluation.

Let’s add a dependency to our model. Dependencies can either be a source or another model.

my_project/models.py
from vinyl import model
from my_project.sources.flower_shops import FlowerShops

@model(deps=FlowerShops)
def top_shops(shops):
    pass

Vinyl is built on Ibis, a powerful and performant data manipulation library. This allows for a wide range of transformations to be performed on your data all within a model function. You can reference all of the model transforms in the reference section.

Let’s write a quick model to get the stores with the most transactions. Vinyl helps write data transforms for analytics use cases (like aggregations) quickly with only a few lines of code.

my_project/models.py
from vinyl import M, T, join, metric, model
from my_project.sources.flower_shops import FlowerShops
from my_project.sources.transactions import Transactions

from vinyl

@model(deps=[FlowerShops, Transactions])
def top_shops(shops, txns):
    st = join(shops, txns, on = ["store_nbr"])
    st.aggregate(
        cols={"num_transactions": st.transactions.sum()},
        by=[st.store_nbr, st.city]
    )
    st.sort(-st.num_transactions)
    return st
store_nbrcitynum_transactions
01003San Francisco, CA11026
03001Brooklyn, NY10981
06002Austin, TX8193

Use in Notebooks

Models can also be written and used outside of context of a Vinyl project (like in a jupyter notebook or a python script) using a context manager. We can write the model above as:

my_script.py
from my_project.sources.flower_shops import FlowerShops
from my_project.sources.transactions import Transactions

stores_txns = join(Stores(), StoreNumTransactions(), on = ["store_nbr"])

with stores_txns as st:
    st.aggregate(
        cols = {"num_transactions": st.transactions.sum()},
        by = [st.store_nbr, st.city]
    )
    st.sort(by= -st.num_transactions)

result = st.execute()

Metrics

Timeseries data is an important component of many analytics projects. Vinyl comes with a powerful metrics abstraction to help aggregation and slice time series data without having to write long and complicated SQL.

Let’s expand on our example above but now look at how sales for each store has changed over time. To use the metric layer of Vinyl, all we have to do is pass in the MetricStore object to our function and pass in the timestamp column to the metric method.

my_project/metrics.py
from vinyl import model, MetricStore
from my_project.sources.flower_shops import FlowerShops
from my_project.sources.transactions import Transactions


@model(deps=[FlowerShops, Transactions])
def shop_sales(shops: T, txns: T) -> M:
    st = shops.join(txns, ["store_nbr"])
    met = st.metric(
        by=[st.store_nbr, st.city],
        ts=st.purchased_at,
        cols={
            "sales": st.transactions.sum()
        }
    )
    return met

Our new metric shop_sales can now be used throughout our Vinyl project. The Metrics layer allows for dynamic queries across dimensions and time buckets. Metrics also provide useful features like fill options and trailing windows.

If we want to get the sales across each shop location over the past trailing 3 months, we can query it like:

s = shop_sales()
sales_metric = s.select([
    s.ts.floor(month=1),
    s.store_nbr,
    s.city,
    s.sales
], trailing=[None, 3])

sales_metric.execute()
tsstore_nbrcitysales
2023-12-0101003San Francisco, CA752
2023-12-0103001Brooklyn, NY1130
2023-12-0106002Austin, TX534
2023-01-0101003San Francisco, CA304
2023-01-0103001Brooklyn, NY540
2023-01-0106002Austin, TX121
2023-02-0101003San Francisco, CA910
2023-02-0103001Brooklyn, NY1520
2023-02-0106002Austin, TX862

If we want to change the time grain to weeks and look at the last 2 weeks, it’s as easy as changing the metric select to:

s = shop_sales()
sales_metric = s.select([
    s.ts.floor(weeks=1),
    s.store_nbr,
    s.city,
    s.sales
], trailing=[None, 2])

sales_metric.execute()
tsstore_nbrcitysales
2023-02-0101003San Francisco, CA619
2023-02-0103001Brooklyn, NY811
2023-02-0106002Austin, TX407
2023-02-1501003San Francisco, CA291
2023-02-1503001Brooklyn, NY709
2023-02-1506002Austin, TX455

We can wrap this metric as a model in our project:

@model(deps=[shop_sales])
def shop_sales_last_30_days(s: M) -> T:
    sales_metric = s.select([
        s.ts.truncate(days=1),
        s.store_nbr,
        s.city,
        s.sales
    ], trailing=[None, 30])

    return sales_metric

Currently Vinyl does not support deploying metrics directly and they must be wrapped as a model to be shared outside a vinyl project. This constraint will be removed in future releases.