Your first analysis
This tutorial will use the demo project to get started.
Getting the count of items
Let’s start simple. We’ll import a connection and count the number of records in a table.
First, let’s import our connection. This does not actually connect to your data warehouse; it is just a reference to the data connection that will be used later on.
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
Next, we’ll define a simple Model for a table. This model represents all the records in the table named sales.parquet.
sales = Model().with_source(connection, "sales.parquet")
Let’s now sub-model this to form a new model. While the previous model represented all records in the table, this new model will represent the total count of the records in the table.
We can do so by calling Model.aggregate() with no groups and a single measure: func.count().
total_sales = sales.aggregate(groups=[], measures=[func.count()])
Finally, we’ll run the model (Model.run()) and collect the result as a Pandas DataFrame (df):
print(total_sales.run().df)
When you run this script, you should see a printed item containing a value for the count of records in the table.
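Since .df is an ordinary Pandas DataFrame, you can pull the count out with plain pandas indexing. A minimal sketch using a stand-in frame (the actual column name and value in your result may differ):

```python
import pandas as pd

# Stand-in for the frame returned by total_sales.run().df:
# a single row holding the count (the value here is hypothetical).
result_df = pd.DataFrame({"count": [1234]})

# Extract the scalar with positional indexing.
total = int(result_df.iloc[0, 0])
print(total)  # 1234
```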
If you want to see the SQL that ran, you can reference .sql_query on the value returned by Model.run():
result = total_sales.run()
print(result.sql_query) # the query
print(result.df) # the data
Code so far
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = Model().with_source(connection, "sales.parquet")
total_sales = sales.aggregate(groups=[], measures=[func.count()])
print(total_sales.run().df)
Breaking out by groups
Let’s revisit our analysis. Instead of gathering a count of all records, let’s group the records into buckets by year, and count the totals within each bucket.
sales_by_year = (
    sales.aggregate(
        groups=[column("timestamp").by_year],
        measures=[func.count()],
    )
    .sort(column("timestamp"))
)
print(sales_by_year.run().df)
This should now show you one row per year on record, with the count of sales in each year.
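If you did want only the busiest years, note that the result is again just a Pandas DataFrame, so standard pandas applies to it. A sketch with hypothetical data and column names (check your actual result’s columns before relying on them):

```python
import pandas as pd

# Hypothetical year counts standing in for sales_by_year.run().df
# (both column names and values are assumptions for illustration).
by_year = pd.DataFrame({
    "sale_year": [2020, 2021, 2022, 2023],
    "count": [120, 340, 290, 410],
})

# The 3 years with the most sales, using plain pandas.
top3 = by_year.nlargest(3, "count")
print(top3)
```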
Code so far
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = Model().with_source(connection, "sales.parquet")
sales_by_year = (
    sales.aggregate(
        groups=[column("timestamp").by_year],
        measures=[func.count()],
    )
    .sort(column("timestamp"))
)
print(sales_by_year.run().df)
Modeling & referencing properties
Our queries work well, but they aren’t very reusable. Any time somebody needs to reference a column, they need to find its physical name in the database table, which could change. Similarly, our measure may grow more complex, perhaps accounting for business logic about double counting, and if that logic were spread across many queries, you would have to update it in many places.
What we want to do is have a layer between the raw expressions and the semantics of the model. We’ll use the model as a centralized, shared definition of what’s interesting about the table.
For this tutorial, we’ll just attach attributes and measures.
Attributes are expressions which are a property of an individual record.
Measures are expressions which are a property of a group of records.
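The distinction can be sketched in plain pandas terms (a toy frame, not the Hashquery API): an attribute yields one value per individual record, while a measure yields one value per group of records.

```python
import pandas as pd

# Toy stand-in for the sales table (timestamps are hypothetical).
sales_df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01", "2021-07-15", "2022-01-09"]),
})

# Attribute-like: an expression evaluated per individual record.
sale_year = sales_df["timestamp"].dt.year   # one value per row

# Measure-like: an expression evaluated per group of records.
total_sales = sales_df.groupby(sale_year).size()  # one value per group
print(total_sales)  # 2 records in 2021, 1 in 2022
```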
You can attach attributes and measures onto a model using Model.with_attributes() and Model.with_measures() respectively. We’ll attach these to our base model, so any analysis using this table can reuse them.
sales = (
    Model()
    .with_source(connection, "sales.parquet")
    .with_attributes(sale_year=column("timestamp").by_year)
    .with_measures(total_sales=func.count())
)
We can then update the references in our sub-model to use the new definitions. In Hashquery, we reference attributes on models using the attr. prefix, measures using msr., and relations using rel.
sales_by_year = (
    sales.aggregate(
        groups=[attr.sale_year],
        measures=[msr.total_sales],
    )
    .sort(attr.sale_year)
)
print(sales_by_year.run().df)
Now our sub-model query will automatically adjust if we change the definition for sale_year or total_sales. In addition, sales now has more metadata about what’s interesting about the table, which allows tools in Hashboard to offer better UIs for non-technical consumers of your data.
Final code
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = (
    Model()
    .with_source(connection, "sales.parquet")
    .with_attributes(sale_year=column("timestamp").by_year)
    .with_measures(total_sales=func.count())
)
sales_by_year = (
    sales.aggregate(
        groups=[attr.sale_year],
        measures=[msr.total_sales],
    )
    .sort(attr.sale_year)
)
print(sales_by_year.run().df)
Next Steps
You can learn about the core concepts and principles of Hashquery under the Concepts sidebar. For further examples, check out Common Patterns. API documentation can be found under API Reference.
Have fun querying! Please let us know if you have any feedback or encounter any issues.