Relations
This page leverages some basic transform syntax from later parts of the walkthrough. Don’t worry too much about it for now: the core purpose of this page is to understand how relations work in Vinyl.
Properties
As in SQL, relations are the core data model. But relations in Vinyl have some unique properties:
Relations are column aware
Take the seattle_weather
dataset from vinyl’s examples.
from vinyl.examples import seattle_weather
weather = seattle_weather()
weather
is a Vinytable object and carries far more information that a
standard sql table or cte. The schema, for example, can be pulled easily
without running a query against the database.
print(weather.schema())
Columns are attrributes of the class, so the code below returns the
wind
column.
weather.wind
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
wind: r0.wind
And, after a basic transform:
wind_doubled = weather.select(weather.wind * 2)
wind_doubled
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
Project[r0]
Multiply(wind, 2): r0.wind * 2
This means you can access the information at any point in your queries. For example
temps = weather.select([col for col in weather.columns if col.startswith("temp")])
print(temps.schema())
will select the temperature columns from weather
.
Relations are lazy
Vinyl keeps track of its syntax lazily, and only compiles when a variable is executed or its sql is generated.
By default, vinyl will return a string representation of the ast associated with a table unless specified otherwise. For example:
temps
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
Project[r0]
temp_max: r0.temp_max
temp_min: r0.temp_min
If you’d like to see a graphical representation, use .visualize()
:
temps.visualize()
To execute this, you only need to run:
temps.execute()
temp_max | temp_min | |
---|---|---|
0 | 12.8 | 5.0 |
1 | 10.6 | 2.8 |
2 | 11.7 | 7.2 |
3 | 12.2 | 5.6 |
4 | 8.9 | 2.8 |
… | … | … |
1456 | 4.4 | 1.7 |
1457 | 5.0 | 1.7 |
1458 | 7.2 | 0.6 |
1459 | 5.6 | -1.0 |
1460 | 5.6 | -2.1 |
1461 rows × 2 columns
By default, this returns a pandas DataFrame. You can also return a text or pyarrow representation by specifying that as in the examples below:
temps.execute("text")
temps.execute("pyarrow")
pyarrow.Table
temp_max: double
temp_min: double
----
temp_max: [[12.8,10.6,11.7,12.2,8.9,...,4.4,5,7.2,5.6,5.6]]
temp_min: [[5,2.8,7.2,5.6,2.8,...,1.7,1.7,0.6,-1,-2.1]]
You can also execute it and save the results to various formats (csv,
json, etc.) using .save()
.
Relations are selectively mutable
By default, VinylTables are immutable. For example, the original
table ast is printed when run is the same as the original weather
.
weather.select(weather.wind * 2)
weather
DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
That said, there are two key cases where VinylTables are considered mutable. This allows for a more ergonomic syntax, especially when you are chaining several transforms together.
The two cases are: 1. Within specially decorated functions (i.e. those
with @model
or @metric
decorator) 2. Context managers
The first case is designed to support :
from vinyl import T, model
@model(deps=[seattle_weather])
def wind(w: T) -> T:
w.select(w.wind * 2)
return w
wind()
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
Project[r0]
Multiply(wind, 2): r0.wind * 2
The second use case is designed primarily for analysis use cases (i.e. in a notebook):
with seattle_weather() as w:
w.select(w.wind * 2)
w
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
Project[r0]
Multiply(wind, 2): r0.wind * 2
This approach can also be used inside vinyl pipeline functions to create a sort of “cte”.
True CTEs don’t exist in vinyl. In immutable mode, you simply need to save a subcalculation to a variable name to achieve the same functionality. But in mutable mode, as in a pipeline function, saving subcalculations as variables can cause issues, because it may have off-target effects on the sources of the subcalculation.
This is where the context manager comes in handy. It allows you to create a temporary variable that is only available within the context of the context manager.
In the example below, we get the average wind by month and then use it to calculate the wind speed deviation from the average.
This can be done more efficiently in vinyl using a window function, but this is a good example of how to use the context manager.
from vinyl import T, model, join
@model(deps=[seattle_weather])
def wind_vs_avg(w: T) -> T:
with w as mw:
mw.aggregate(
cols = {"avg_wind": mw.wind.mean()},
by = {"month": mw.date.dt.floor(months=1)}
)
j = join(w, mw, on = [w.date.dt.floor(months=1) == mw.month])
j.select([w.date, w.wind - mw.avg_wind])
return j
wind_vs_avg().execute("text")
Relations are dialect-independent
Vinyl uses the Ibis library to generate SQL. This means that you can write your queries in a dialect agnostic way. For example, the following code works in across the dialects currently supported by Vinyl:
- BigQuery
- Snowflake
- DuckDB
- Postgres
Ibis itself supports almost 20 dialects, so we plan to add more over time.
As an example, let’s create a table that lists the max temperature of each day by month:
with seattle_weather() as w:
w.aggregate(
cols = {"temp_max": w.temp_max.collect()},
sort = {"month": w.date.dt.floor(months=1)}
)
temp_by_month = w
temp_by_month.execute("text")
Here’s how vinyl translates the query to each dialect:
SELECT
`t1`.`month`,
`t1`.`temp_max`
FROM (
SELECT
TIMESTAMP_TRUNC(`t0`.`date`, MONTH) AS `month`,
ARRAY_AGG(`t0`.`temp_max` IGNORE NULLS) AS `temp_max`
FROM `seattle_weather` AS `t0`
GROUP BY
1
) AS `t1`
ORDER BY
`t1`.`month` ASC NULLS LAST
Creating relations
In the context of a project, relations are created and managed automatically using the source generation process.
In the context of a notebook or analysis, you can create relations from a variety of sources, including: - Pandas DataFrames - Polars DataFrames - PyArrow Tables - CSVs - JSONs - Parquet files
For in-memory sources, use VinylTable.from_memory
as:
from vinyl import VinylTable
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
VinylTable.from_memory(df)
InMemoryTable
data:
PandasDataFrameProxy:
col1 col2
0 1 3
1 2 4
And for file-based sources, use VinylTable.from_file
:
VinylTable.from_file("data/barley.parquet")
DatabaseTable: barley
yield float64
variety string
year int64
site string
Exploring relations
In the properties section action, you’ve already seen some of the most common ways to visualize relations:
print()
to view the AST.schema()
to view the schema.visualize()
to view the query plan.execute()
to run the query and view it as text, pyarrow, or pandas objects.to_sql()
to view the SQL-equivalent
In addition to these, there are a few other key methods to know: -
.eda()
provides a quick exploratory data analysis of the relation,
excluding its most common values - .chart()
provides a grammar of
graphics interface to quickly visualize the relation
Let’s view examples of each of these:
weather.eda(topk=2)
r0 := DatabaseTable: seattle_weather
date timestamp(6)
precipitation float64
temp_max float64
temp_min float64
wind float64
weather string
r1 := Aggregate[r0]
groups:
weather: r0.weather
metrics:
Count(weather): Count(r0.weather)
r2 := Aggregate[r0]
groups:
wind: r0.wind
metrics:
Count(wind): Count(r0.wind)
r3 := Aggregate[r0]
groups:
temp_min: r0.temp_min
metrics:
Count(temp_min): Count(r0.temp_min)
r4 := Aggregate[r0]
groups:
temp_max: r0.temp_max
metrics:
Count(temp_max): Count(r0.temp_max)
r5 := Aggregate[r0]
groups:
date: r0.date
metrics:
Count(date): Count(r0.date)
r6 := Aggregate[r0]
groups:
precipitation: r0.precipitation
metrics:
Count(precipitation): Count(r0.precipitation)
r7 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.weather)], results=[1], default=0)
r8 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.wind)], results=[1], default=0)
r9 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.temp_min)], results=[1], default=0)
r10 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.temp_max)], results=[1], default=0)
r11 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.date)], results=[1], default=0)
r12 := Project[r0]
isna: SearchedCase(cases=[IsNull(r0.precipitation)], results=[1], default=0)
r13 := Sort[r1]
desc r1['Count(weather)']
r14 := Sort[r2]
desc r2['Count(wind)']
r15 := Sort[r3]
desc r3['Count(temp_min)']
r16 := Sort[r4]
desc r4['Count(temp_max)']
r17 := Sort[r5]
desc r5['Count(date)']
r18 := Sort[r6]
desc r6['Count(precipitation)']
r19 := Limit[r13, n=2]
r20 := Limit[r14, n=2]
r21 := Limit[r15, n=2]
r22 := Limit[r16, n=2]
r23 := Limit[r17, n=2]
r24 := Limit[r18, n=2]
r25 := Aggregate[r7]
metrics:
pos: 5
name: 'weather'
type: 'string'
nullable: True
nulls: Sum(r7.isna)
non_nulls: Sum(1 - r7.isna)
null_frac: Mean(r7.isna)
r26 := Aggregate[r8]
metrics:
pos: 4
name: 'wind'
type: 'float64'
nullable: True
nulls: Sum(r8.isna)
non_nulls: Sum(1 - r8.isna)
null_frac: Mean(r8.isna)
r27 := Aggregate[r9]
metrics:
pos: 3
name: 'temp_min'
type: 'float64'
nullable: True
nulls: Sum(r9.isna)
non_nulls: Sum(1 - r9.isna)
null_frac: Mean(r9.isna)
r28 := Aggregate[r10]
metrics:
pos: 2
name: 'temp_max'
type: 'float64'
nullable: True
nulls: Sum(r10.isna)
non_nulls: Sum(1 - r10.isna)
null_frac: Mean(r10.isna)
r29 := Aggregate[r11]
metrics:
pos: 0
name: 'date'
type: 'timestamp(6)'
nullable: True
nulls: Sum(r11.isna)
non_nulls: Sum(1 - r11.isna)
null_frac: Mean(r11.isna)
r30 := Aggregate[r12]
metrics:
pos: 1
name: 'precipitation'
type: 'float64'
nullable: True
nulls: Sum(r12.isna)
non_nulls: Sum(1 - r12.isna)
null_frac: Mean(r12.isna)
r31 := Aggregate[r19]
metrics:
values: ArrayCollect(r19.weather)
counts: ArrayCollect(r19['Count(weather)'])
pos: 5
r32 := Aggregate[r20]
metrics:
values: ArrayCollect(Cast(r20.wind, to=string))
counts: ArrayCollect(r20['Count(wind)'])
pos: 4
r33 := Aggregate[r21]
metrics:
values: ArrayCollect(Cast(r21.temp_min, to=string))
counts: ArrayCollect(r21['Count(temp_min)'])
pos: 3
r34 := Aggregate[r22]
metrics:
values: ArrayCollect(Cast(r22.temp_max, to=string))
counts: ArrayCollect(r22['Count(temp_max)'])
pos: 2
r35 := Aggregate[r23]
metrics:
values: ArrayCollect(Cast(r23.date, to=string))
counts: ArrayCollect(r23['Count(date)'])
pos: 0
r36 := Aggregate[r24]
metrics:
values: ArrayCollect(Cast(r24.precipitation, to=string))
counts: ArrayCollect(r24['Count(precipitation)'])
pos: 1
r37 := JoinChain[r25]
JoinLink[inner, r31]
r25.pos == r31.pos
values:
pos: r25.pos
name: r25.name
type: r25.type
nullable: r25.nullable
nulls: r25.nulls
non_nulls: r25.non_nulls
null_frac: r25.null_frac
values: r31.values
counts: r31.counts
r38 := JoinChain[r26]
JoinLink[inner, r32]
r26.pos == r32.pos
values:
pos: r26.pos
name: r26.name
type: r26.type
nullable: r26.nullable
nulls: r26.nulls
non_nulls: r26.non_nulls
null_frac: r26.null_frac
values: r32.values
counts: r32.counts
r39 := JoinChain[r27]
JoinLink[inner, r33]
r27.pos == r33.pos
values:
pos: r27.pos
name: r27.name
type: r27.type
nullable: r27.nullable
nulls: r27.nulls
non_nulls: r27.non_nulls
null_frac: r27.null_frac
values: r33.values
counts: r33.counts
r40 := JoinChain[r28]
JoinLink[inner, r34]
r28.pos == r34.pos
values:
pos: r28.pos
name: r28.name
type: r28.type
nullable: r28.nullable
nulls: r28.nulls
non_nulls: r28.non_nulls
null_frac: r28.null_frac
values: r34.values
counts: r34.counts
r41 := JoinChain[r29]
JoinLink[inner, r35]
r29.pos == r35.pos
values:
pos: r29.pos
name: r29.name
type: r29.type
nullable: r29.nullable
nulls: r29.nulls
non_nulls: r29.non_nulls
null_frac: r29.null_frac
values: r35.values
counts: r35.counts
r42 := JoinChain[r30]
JoinLink[inner, r36]
r30.pos == r36.pos
values:
pos: r30.pos
name: r30.name
type: r30.type
nullable: r30.nullable
nulls: r30.nulls
non_nulls: r30.non_nulls
null_frac: r30.null_frac
values: r36.values
counts: r36.counts
r43 := Project[r37]
name: r37.name
type: r37.type
nullable: r37.nullable
nulls: r37.nulls
non_nulls: r37.non_nulls
null_frac: r37.null_frac
values: r37.values
counts: r37.counts
r44 := Project[r38]
name: r38.name
type: r38.type
nullable: r38.nullable
nulls: r38.nulls
non_nulls: r38.non_nulls
null_frac: r38.null_frac
values: r38.values
counts: r38.counts
r45 := Project[r39]
name: r39.name
type: r39.type
nullable: r39.nullable
nulls: r39.nulls
non_nulls: r39.non_nulls
null_frac: r39.null_frac
values: r39.values
counts: r39.counts
r46 := Project[r40]
name: r40.name
type: r40.type
nullable: r40.nullable
nulls: r40.nulls
non_nulls: r40.non_nulls
null_frac: r40.null_frac
values: r40.values
counts: r40.counts
r47 := Project[r41]
name: r41.name
type: r41.type
nullable: r41.nullable
nulls: r41.nulls
non_nulls: r41.non_nulls
null_frac: r41.null_frac
values: r41.values
counts: r41.counts
r48 := Project[r42]
name: r42.name
type: r42.type
nullable: r42.nullable
nulls: r42.nulls
non_nulls: r42.non_nulls
null_frac: r42.null_frac
values: r42.values
counts: r42.counts
r49 := Union[r47, r48, distinct=False]
r50 := Union[r49, r46, distinct=False]
r51 := Union[r50, r45, distinct=False]
r52 := Union[r51, r44, distinct=False]
r53 := Union[r52, r43, distinct=False]
Project[r53]
name: r53.name
type: r53.type
nullable: r53.nullable
nulls: r53.nulls
non_nulls: r53.non_nulls
null_frac: r53.null_frac
values: r53.values
counts: r53.counts
Charting relies on the awesome lets-plot library from JetBrains
from vinyl.chart import geom
weather.chart(geom.scatter, x=weather.date, y=weather.temp_max, interactive = False)
Charts made with Vinyl are interactive by default, but we turn that feature off so our docs site can render them properly.
Was this page helpful?