This page leverages some basic transform syntax from later parts of the
walkthrough. Don’t worry too much about it for now: the core purpose of
this page is to understand how relations work in Vinyl.
Take the seattle_weather dataset from vinyl’s examples.
Copy
from vinyl.examples import seattle_weatherweather = seattle_weather()
weather is a Vinytable object and carries far more information that a
standard sql table or cte. The schema, for example, can be pulled easily
without running a query against the database.
Copy
print(weather.schema())
Columns are attrributes of the class, so the code below returns the
wind column.
Vinyl keeps track of its syntax lazily, and only compiles when a
variable is executed or its sql is generated.By default, vinyl will return a string representation of the ast
associated with a table unless specified otherwise. For example:
That said, there are two key cases where VinylTables are considered
mutable. This allows for a more ergonomic syntax, especially when you
are chaining several transforms together.The two cases are: 1. Within specially decorated functions (i.e. those
with @model or @metric decorator) 2. Context managersThe first case is designed to support
:
This approach can also be used inside vinyl pipeline functions to create
a sort of
“cte”.True CTEs don’t exist in vinyl. In immutable mode, you simply need to
save a subcalculation to a variable name to achieve the same
functionality. But in mutable mode, as in a pipeline function, saving
subcalculations as variables can cause issues, because it may have
off-target effects on the sources of the subcalculation.This is where the context manager comes in handy. It allows you to
create a temporary variable that is only available within the context of
the context manager.In the example below, we get the average wind by month and then use it
to calculate the wind speed deviation from the average.
This can be done more efficiently in vinyl using a window function, but
this is a good example of how to use the context manager.
Copy
from vinyl import T, model, join@model(deps=[seattle_weather])def wind_vs_avg(w: T) -> T: with w as mw: mw.aggregate( cols = {"avg_wind": mw.wind.mean()}, by = {"month": mw.date.dt.floor(months=1)} ) j = join(w, mw, on = [w.date.dt.floor(months=1) == mw.month]) j.select([w.date, w.wind - mw.avg_wind]) return jwind_vs_avg().execute("text")
Vinyl uses the Ibis library to generate SQL. This means that you can
write your queries in a dialect agnostic way. For example, the following
code works in across the dialects currently supported by Vinyl:
BigQuery
Snowflake
DuckDB
Postgres
Ibis itself supports almost 20 dialects, so we plan to add more over
time.As an example, let’s create a table that lists the max temperature of
each day by month:
Copy
with seattle_weather() as w: w.aggregate( cols = {"temp_max": w.temp_max.collect()}, sort = {"month": w.date.dt.floor(months=1)} )temp_by_month = wtemp_by_month.execute("text")
Here’s how vinyl translates the query to each dialect:
BigQuery
Snowflake
DuckDB
Postgres
Copy
SELECT `t1`.`month`, `t1`.`temp_max`FROM ( SELECT TIMESTAMP_TRUNC(`t0`.`date`, MONTH) AS `month`, ARRAY_AGG(`t0`.`temp_max` IGNORE NULLS) AS `temp_max` FROM `seattle_weather` AS `t0` GROUP BY 1) AS `t1`ORDER BY `t1`.`month` ASC NULLS LAST
In the context of a project, relations are created and managed
automatically using the source generation
process.In the context of a notebook or analysis, you can create relations from
a variety of sources, including: - Pandas DataFrames - Polars
DataFrames - PyArrow Tables - CSVs - JSONs - Parquet filesFor in-memory sources, use VinylTable.from_memory as:
Copy
from vinyl import VinylTableimport pandas as pdd = {'col1': [1, 2], 'col2': [3, 4]}df = pd.DataFrame(data=d)VinylTable.from_memory(df)
In the properties section action, you’ve already seen some of the most
common ways to visualize relations:
print() to view the AST
.schema() to view the schema
.visualize() to view the query plan
.execute() to run the query and view it as text, pyarrow, or
pandas objects
.to_sql() to view the SQL-equivalent
In addition to these, there are a few other key methods to know: -
.eda() provides a quick exploratory data analysis of the relation,
excluding its most common values - .chart() provides a grammar of
graphics interface to quickly visualize the relationLet’s view examples of each of these: