Event Analysis
Hashquery provides out of the box support for event analytics such as funnels and retention curves.
Modeling event analytics
We recommend defining a model’s activity schema
using Model.with_activity_schema()
. This attaches a semantic description
to your model on how to apply event analytics functions to it. It requires:
group – This will split the events for each actor that invokes the event, such as
user_id
orcustomer_id
.timestamp – When the event was detected, such as
created_at
ortimestamp
.event_key – Typically this is a column like
event_name
orevent_type
denoting the kind of event that occurred.
Funnels
The simplest event analytics primitive is Model.funnel()
. Funnel
takes a list of events and returns a summary table indicating how many distinct
actors (split by activity_schema.group
) reached each stage of the funnel.
events.funnel(
# replace these with the events sequence you care about
top_of_funnel="users",
steps=["ad_impression", "visit", "purchase"],
)
May result in a table like the following:
step |
count |
---|---|
users |
30,040 |
ad_impression |
27,028 |
visit |
8,794 |
purchase |
2,341 |
This means 30,040
distinct users exist in our dataset, and 27,028
of them
were served at least one ad. Of those users, 8,794
went to the website and
then 2,341
made at least one purchase.
Note that if a user purchased something, but didn’t see an ad before their
purchase, they wouldn’t be included in the funnel. They would only counted in
the users
row. Users must match every proceeding step in order to be
included in step in the funnel. For this reason, funnels strictly decrease
as they move downward.
Comparing Funnels
Let’s imagine we want to compare how well one funnel behaves compares to another. For example, we may be running two ad campaigns at once, and curious to see which is resulting in more conversions.
Let’s call seeing the first ad event ad_impression_a
and the second ad
ad_impression_b
. If your events table does not capture this inside of
a single column, you might form the expression using an expression
like the following, concatenating the event name with the user group:
events = events.with_activity_schema(
# ...existing
event_key=func.cases(
((attr.event_name == "ad_impression"), attr.event_name + "_" + rel.user.ab_test_group)
other=attr.event_name
)
)
Once we have our events set up, let’s define two funnels:
funnel_a = events.funnel(["ad_impression_a", "visit", "purchase"])
funnel_b = events.funnel(["ad_impression_b", "visit", "purchase"])
and then we’ll join them together to form a new table:
funnels = (
funnel_a.with_join_one(
funnel_b,
named="funnel_b",
condition=func.or_(
attr.step == rel.funnel_b.step,
func.and_(
attr.step == "ad_impression_a",
rel.funnel_b.step == "ad_impression_b",
)
)
)
.pick(
attr.step,
attr.count.named("A funnel"),
rel.funnel_b.count.named("B funnel"),
)
)
step |
A funnel |
B funnel |
---|---|---|
count |
15,020 |
15,020 |
ad_impression |
14,308 |
12,720 |
visit |
2,594 |
6,200 |
purchase |
2,101 |
240 |
Looks like ad B may have been more evocative, as more people clicked into the website, but it lead to less purchases overall. Maybe the ad misinformed users? Meanwhile, while A didn’t capture as many clicks, it lead to a lot of purchases. Perhaps we could make a new ad that has the benefits of both. Time to contact marketing.
Retention/Survival Curves
In future, Hashquery will provide Model.retention
to
achieve retention and survival curves quickly. For now, you will need to use
.match_steps
to form your own retention curves.
Advanced: .match_steps
Model.match_steps()
is the core building block that other analytics
functions use. It allows you to efficiently collect and join together an ordered
list of events for a user.
Similar to .funnel
we pass it an ordered list of steps to match:
events.match_steps([
"ad_impression",
"visit",
"purchase",
])
Unlike .funnel
, which aggregates into a summary table, .match_steps
actually
returns a table aggregated by activity_schema.group
. For each row, you have
columns representing each event matched by the analysis, modeled as a JOIN
to the original events table. Effectively this creates a series of “first after” temporal joins.
That’s a little hard to picture, so let’s draw it out.
Imagine our events table looks like the following, containing a single user.
This table now includes all the properties of our event, not just the ones
we added to .with_activity_schema
.
group |
timestamp |
event_name |
value |
event_id |
---|---|---|---|---|
user_a |
2024-01-01 |
ad_impression |
-$0.04 |
a1 |
user_a |
2024-01-02 |
visit |
NULL |
b2 |
user_a |
2024-01-03 |
visit |
NULL |
c3 |
user_a |
2024-01-04 |
purchase |
+$40 |
d4 |
user_a |
2024-01-05 |
ad_impression |
-$0.05 |
e5 |
Now we run .match_steps(["ad_impression", "purchase"])
. For each user,
this will construct one row. Let’s look at the user_a
row:
group | ad_impression | purchase | ||||
---|---|---|---|---|---|---|
timestamp | value | event_id | timestamp | value | event_id | |
user_a |
2024-01-01 |
-$0.04 |
a1 |
2024-01-04 |
+$40 |
d4 |
As you can see, we selected the first matching step for ad_impression
and joined that event in, and then selected the first matching purchase
event
after that and joined that in. We can access each event’s data through
rel.ad_impression
and rel.purchase
respectively.
For a user which doesn’t match all the steps, that relation’s data will contain
all NULL
s, similar to the result of a LEFT INNER JOIN
like in SQL.
This table is sometimes not useful on its own. However, you can perform lots of
neat aggregation on top. For example, we could compute the time between ad and
purchase (rel.purchase.timestamp - rel.ad_impression.timestamp
), or look for
correlations between how much we paid for an ad rel.ad_impression.value
, and
conversion rate rel.purchase.timestamp != None
.