Relations
This page leverages some basic transform syntax from later parts of the walkthrough. Don’t worry too much about it for now: the core purpose of this page is to understand how relations work in Vinyl.
Properties
As in SQL, relations are the core data model. But relations in Vinyl have some unique properties:
Relations are column aware
Take the seattle_weather
dataset from vinyl’s examples.
weather
is a Vinytable object and carries far more information that a
standard sql table or cte. The schema, for example, can be pulled easily
without running a query against the database.
Columns are attrributes of the class, so the code below returns the
wind
column.
And, after a basic transform:
This means you can access the information at any point in your queries. For example
will select the temperature columns from weather
.
Relations are lazy
Vinyl keeps track of its syntax lazily, and only compiles when a variable is executed or its sql is generated.
By default, vinyl will return a string representation of the ast associated with a table unless specified otherwise. For example:
If you’d like to see a graphical representation, use .visualize()
:
To execute this, you only need to run:
temp_max | temp_min | |
---|---|---|
0 | 12.8 | 5.0 |
1 | 10.6 | 2.8 |
2 | 11.7 | 7.2 |
3 | 12.2 | 5.6 |
4 | 8.9 | 2.8 |
… | … | … |
1456 | 4.4 | 1.7 |
1457 | 5.0 | 1.7 |
1458 | 7.2 | 0.6 |
1459 | 5.6 | -1.0 |
1460 | 5.6 | -2.1 |
1461 rows × 2 columns
By default, this returns a pandas DataFrame. You can also return a text or pyarrow representation by specifying that as in the examples below:
You can also execute it and save the results to various formats (csv,
json, etc.) using .save()
.
Relations are selectively mutable
By default, VinylTables are immutable. For example, the original
table ast is printed when run is the same as the original weather
.
That said, there are two key cases where VinylTables are considered mutable. This allows for a more ergonomic syntax, especially when you are chaining several transforms together.
The two cases are: 1. Within specially decorated functions (i.e. those
with @model
or @metric
decorator) 2. Context managers
The first case is designed to support :
The second use case is designed primarily for analysis use cases (i.e. in a notebook):
This approach can also be used inside vinyl pipeline functions to create a sort of “cte”.
True CTEs don’t exist in vinyl. In immutable mode, you simply need to save a subcalculation to a variable name to achieve the same functionality. But in mutable mode, as in a pipeline function, saving subcalculations as variables can cause issues, because it may have off-target effects on the sources of the subcalculation.
This is where the context manager comes in handy. It allows you to create a temporary variable that is only available within the context of the context manager.
In the example below, we get the average wind by month and then use it to calculate the wind speed deviation from the average.
This can be done more efficiently in vinyl using a window function, but this is a good example of how to use the context manager.
Relations are dialect-independent
Vinyl uses the Ibis library to generate SQL. This means that you can write your queries in a dialect agnostic way. For example, the following code works in across the dialects currently supported by Vinyl:
- BigQuery
- Snowflake
- DuckDB
- Postgres
Ibis itself supports almost 20 dialects, so we plan to add more over time.
As an example, let’s create a table that lists the max temperature of each day by month:
Here’s how vinyl translates the query to each dialect:
Creating relations
In the context of a project, relations are created and managed automatically using the source generation process.
In the context of a notebook or analysis, you can create relations from a variety of sources, including: - Pandas DataFrames - Polars DataFrames - PyArrow Tables - CSVs - JSONs - Parquet files
For in-memory sources, use VinylTable.from_memory
as:
And for file-based sources, use VinylTable.from_file
:
Exploring relations
In the properties section action, you’ve already seen some of the most common ways to visualize relations:
print()
to view the AST.schema()
to view the schema.visualize()
to view the query plan.execute()
to run the query and view it as text, pyarrow, or pandas objects.to_sql()
to view the SQL-equivalent
In addition to these, there are a few other key methods to know: -
.eda()
provides a quick exploratory data analysis of the relation,
excluding its most common values - .chart()
provides a grammar of
graphics interface to quickly visualize the relation
Let’s view examples of each of these:
Charting relies on the awesome lets-plot library from JetBrains
Charts made with Vinyl are interactive by default, but we turn that feature off so our docs site can render them properly.
Was this page helpful?