Advanced: KeyPaths
A small audience of readers may have noticed that our use of attr.
, msr.
and rel.
have some special behavior compared with other values in Python.
If this is you, then you are very observant!
Other guides intentionally omit technical detail about how some magic works to get you started with Hashquery.
If you are building abstractions on top of Hashquery, you may want to better understand how these work.
Integrating attr
, msr
and rel
into your APIs
If you are building abstractions on top of Hashquery, you may want your
APIs to accept the lazy KeyPath references emitted by attr
, msr
and rel
.
That way, using your extended functions will work just as well as using the
built-in ones.
Deferring KeyPaths
If your API takes in an attr
, msr
or rel
as an input, but not a model
reference, then you likely want to defer evaluation until the references
are resolved.
To do so, decorate your function with
defer_keypath_args()
.
This decorator will defer evaluation of its implementation until all its
arguments are ready. If the function without the decorator returned T
, it
now will return KeyPath[T]
. The function is not made asynchronous, it just
now generates KeyPaths. You do not need to modify the implementation.
For example, the implementation of
func.count_if()
is wrapped withdefer_keypath_args()
, since it accepts an expression as input, that may have references toattr.
orrel.
.@defer_keypath_args def count_if(condition: ColumnExpression) -> ColumnExpression: return func.sum(func.cases((condition, 1), other=0))
Resolving KeyPaths
If your API takes in an attr
, msr
or rel
as an input, and also a model
reference, then you likely want to resolve references against that input model.
If you want to resolve references against another model that isn’t a parameter of the function, then you will still use
defer_keypath_args()
.
To do so, decorate your function with
resolve_keypath_args_from()
.
Most methods on
Model
use this decorator, resolving any passed parameters againstself
. For example:class Model(): @resolve_keypath_args_from(_.self) def sort(self, expr: ColumnExpression, dir: str) -> "Model": ...
The WHY and HOW of KeyPaths
Some users may be more curious to learn about why Keypaths are even included, since they’re kinda unconventional. The following discussion presents a small slice of our reasoning behind their inclusion.
Warning
This section is an advanced guide that assumes more familiarity with Python
than other articles on this page. We anticipate most consumers to never need
to fully understand how attr
, msr
and rel
work at a language-level.
For those brave enough to stick around, let’s get into it.
The problem: Building self-referential data structures
Users in Hashquery are constructing immutable, self-referential data structure.
A method like Model.with_attributes()
adds new attributes
and a method like Model.aggregate()
may use those attributes.
Let’s naively write how a user may call these functions, and pay specific attention to the line where we reference an attribute. This is not valid Hashquery code:
sales_by_year = (
Model()
.with_source(project.connections.uploads, "sales.parquet")
.with_attributes("timestamp")
.aggregate(
groups=[sales_by_year.get_attribute("timestamp").by_year],
measures=[func.count()]
)
)
There’s something off here. We can’t reference sales_by_year
in the definition
of sales_by_year
! It doesn’t actual exist yet.
Okay, but perhaps we can declare it first:
sales_by_year = Model()
sales_by_year = (
sales_by_year
.with_source(project.connections.uploads, "sales.parquet")
.with_attributes("timestamp")
.aggregate(
groups=[sales_by_year.get_attribute("timestamp").by_year],
measures=[func.count()]
)
)
This actually still won’t work, since sales_by_year
, on the line where
we call sales_by_year.get_attribute
will still only be equal to Model()
,
which doesn’t have any timestamp
attribute!
If we really wanted this syntax, we’d have to break up every expression
to its own line, building up sales_by_year
statement by statement:
sales_by_year = Model()
sales_by_year = sales_by_year.with_source(project.connections.uploads, "sales.parquet")
sales_by_year = sales_by_year.with_attributes("timestamp")
sales_by_year = (
sales_with_attr
.aggregate(
groups=[sales_by_year.get_attribute("timestamp").by_year],
measures=[func.count()]
)
)
That’s … pretty ugly, and prone to error.
Another potential approach could be to mutate our sales model with each
function, but we found that created really difficult to understand bugs
in user code which required manual deepcopy
s to resolve, and in some cases
still didn’t fully fix the issue.
The solution: deferred evaluation
We want a way to allow users to define what to access before any access actually occurs. This is a common pattern in functional programming contexts, where lazy evaluation is often preferred, such as in Publisher/Consumer or Reactive state libraries.
Let’s see how that would look with our query:
1sales_by_year = (
2 Model()
3 .with_source(project.connections.uploads, "sales.parquet")
4 .with_attributes(lambda model: "timestamp")
5 .aggregate(
6 groups=[lambda model: model.get_attribute("timestamp").by_year],
7 measures=[lambda model: func.count()]
8 )
9)
As you can see, in every place where we might want to self-reference the model, we’re now passing a function which declares how to access some value. At some later point, the library will actually run these lambdas and get real values.
This is more correct, but quite the headache.
So we built KeyPath
.
KeyPaths
KeyPath
is just a structured
helper for making the lambdas in the prior example.
KeyPath().get_attribute("timestamp").by_year
# declares the same thing as
lambda root: root.get_attribute("timestamp").by_tear
Instead of a lambda
, we store a structured representation of the accessor.
Unlike a lambda
, these can be serialized, immutably modified, and can be
reflected to aid in error messages or the automatic naming of resources.
In Hashquery, there are three root KeyPaths, attr
, msr
and rel
.
These are each shorthand to access a model’s attributes, measures, and
relations respectively:
attr.timestamp.by_year
# is roughly
lambda model: model._private_api_get_attribute("timestamp").by_year