Your first analysis
This tutorial will use the demo project to get started.
Getting the count of items
Let’s start simple. We’ll import a connection and count the number of records in a table.
First, let’s import our connection. This does not actually connect to your data warehouse; it is just a reference to the data connection that will be used later on.
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
Next, we’ll define a simple Model for a table. This model represents all the records in the table named sales.parquet.
sales = Model().with_source(connection, "sales.parquet")
Let’s now sub-model this to form a new model. While the previous model represented all records in the table, this new model will represent the total count of the records in the table.
We can do so by calling Model.aggregate() with no groups and a single measure: func.count().
total_sales = sales.aggregate(groups=[], measures=[func.count()])
Finally, we’ll run the model (Model.run()) and collect the result as a Pandas DataFrame (df):
print(total_sales.run().df)
When you run this script, you should see a printed item containing a value for the count of records in the table.
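Since .df is an ordinary Pandas DataFrame, you can pull the count out with plain pandas indexing. A minimal sketch using a stand-in frame (the actual column name and value in your result may differ):

```python
import pandas as pd

# Stand-in for the frame returned by total_sales.run().df:
# a single row holding the count (the value here is hypothetical).
result_df = pd.DataFrame({"count": [1234]})

# Extract the scalar with positional indexing.
total = int(result_df.iloc[0, 0])
print(total)  # 1234
```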
If you want to see the SQL that ran, you can reference .sql_query on the value returned by Model.run():
result = total_sales.run()
print(result.sql_query) # the query
print(result.df) # the data
Code so far
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = Model().with_source(connection, "sales.parquet")
total_sales = sales.aggregate(groups=[], measures=[func.count()])
print(total_sales.run().df)
Breaking out by groups
Let’s revisit our analysis. Instead of gathering a count of all records, let’s group the records into buckets by year, and count the totals within each bucket.
sales_by_year = (
    sales.aggregate(
        groups=[column("timestamp").by_year],
        measures=[func.count()],
    )
    .sort(column("timestamp"))
)
print(sales_by_year.run().df)
This should now show you one row per year on record, with the count of sales in each year.
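If you did want only the busiest years, note that the result is again just a Pandas DataFrame, so standard pandas applies to it. A sketch with hypothetical data and column names (check your actual result’s columns before relying on them):

```python
import pandas as pd

# Hypothetical year counts standing in for sales_by_year.run().df
# (both column names and values are assumptions for illustration).
by_year = pd.DataFrame({
    "sale_year": [2020, 2021, 2022, 2023],
    "count": [120, 340, 290, 410],
})

# The 3 years with the most sales, using plain pandas.
top3 = by_year.nlargest(3, "count")
print(top3)
```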
Code so far
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = Model().with_source(connection, "sales.parquet")
sales_by_year = (
    sales.aggregate(
        groups=[column("timestamp").by_year],
        measures=[func.count()],
    )
    .sort(column("timestamp"))
)
print(sales_by_year.run().df)
Modeling & referencing properties
Our queries work well, but they aren’t very reusable. Any time somebody needs to reference a column, they need to find its physical name in the database table, which could change. Similarly, our measure may grow more complex, perhaps accounting for business logic about double counting, and if that logic were spread across many queries, you would have to update it in many places.
What we want to do is have a layer between the raw expressions and the semantics of the model. We’ll use the model as a centralized, shared definition of what’s interesting about the table.
For this tutorial, we’ll just attach attributes and measures.
Attributes are expressions which are a property of an individual record.
Measures are expressions which are a property of a group of records.
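The distinction can be sketched in plain pandas terms (a toy frame, not the Hashquery API): an attribute yields one value per individual record, while a measure yields one value per group of records.

```python
import pandas as pd

# Toy stand-in for the sales table (timestamps are hypothetical).
sales_df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01", "2021-07-15", "2022-01-09"]),
})

# Attribute-like: an expression evaluated per individual record.
sale_year = sales_df["timestamp"].dt.year   # one value per row

# Measure-like: an expression evaluated per group of records.
total_sales = sales_df.groupby(sale_year).size()  # one value per group
print(total_sales)  # 2 records in 2021, 1 in 2022
```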
You can attach attributes and measures onto a model using Model.with_attributes() and Model.with_measures() respectively. We’ll attach these to our base model, so any analysis using this table can reuse them.
sales = (
    Model()
    .with_source(connection, "sales.parquet")
    .with_attributes(sale_year=column("timestamp").by_year)
    .with_measures(total_sales=func.count())
)
We can then update the references in our sub-model to use the new definitions. In Hashquery, we reference attributes on models using the attr. prefix, measures using msr., and relations using rel.
sales_by_year = (
    sales.aggregate(
        groups=[attr.sale_year],
        measures=[msr.total_sales],
    )
    .sort(attr.sale_year)
)
print(sales_by_year.run().df)
Now our sub-model query will automatically adjust if we change the definition for sale_year or total_sales. In addition, sales now has more metadata about what’s interesting about the table, which allows tools in Hashboard to offer better UIs for non-technical consumers of your data.
Final code
from hashquery import *
from hashquery.demo import demo_project

connection = demo_project.connections.uploads
sales = (
    Model()
    .with_source(connection, "sales.parquet")
    .with_attributes(sale_year=column("timestamp").by_year)
    .with_measures(total_sales=func.count())
)
sales_by_year = (
    sales.aggregate(
        groups=[attr.sale_year],
        measures=[msr.total_sales],
    )
    .sort(attr.sale_year)
)
print(sales_by_year.run().df)
Next Steps
You can learn about the core concepts and principles of Hashquery under the Concepts sidebar. For further examples, check out Common Patterns. API documentation can be found under API Reference.
Have fun querying! Please let us know if you have any feedback or encounter any issues.