Reconstructing customer journeys from transactional records using Apache Spark

spark
python
plotly
frequent-itemset-mining
sankey-diagram
Author

wovago

Published

August 8, 2023

Abstract

In this notebook we will perform customer touchpoint analysis using Apache Spark. We will automatically reconstruct customer journeys from transactional data consisting of purchase events and customer touchpoint events.

Introduction

In this notebook we will perform customer touchpoint analysis using Apache Spark. A customer touchpoint refers to any interaction or point of contact that a customer has with a business, brand, product, or service throughout the entire customer journey. It encompasses all the different ways through which a customer can engage with or experience a company, both online and offline. Customer touchpoints play a crucial role in shaping a customer’s perception, satisfaction, and overall experience with a brand.

In this analysis we will automatically reconstruct customer journeys from transactional data consisting of purchase events and customer touchpoint events. We will perform frequent itemset mining on those transactional records to identify combinations of customer touchpoints that constitute common customer journeys leading to purchases or conversions. To perform frequent itemset mining we will use the FP-Growth algorithm provided by Apache Spark’s MLlib library.

The dataset which we will use for this analysis can be downloaded at https://www.kaggle.com/datasets/kishlaya18/customer-purchase-journey-netherlands. This dataset contains transactional records of travel purchases together with corresponding customer touchpoint events. So let’s start with the analysis.

Data preprocessing with Apache Spark

code
import findspark
import matplotlib.pyplot as plt
import pandas as pd
import plotly.graph_objects as go
import pyspark
import pyspark.pandas as ps
import seaborn as sns
from pyspark.ml.fpm import FPGrowth, PrefixSpan
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col

First we will start the analysis by initializing the Spark session.

code
findspark.init()
spark = SparkSession.builder.appName("customer_journey").getOrCreate()

We will then read in the transactional data, containing purchases together with the corresponding customer touchpoints leading to each purchase.

code
df = spark.read.csv('./data/TravelData.csv', header=True, inferSchema=True)

Next, we perform some type casting and inspect the imported data schema.

code
df = df.withColumn("duration", col("Duration").cast("int")).drop('Duration')
df.printSchema()
root
 |-- UserID: integer (nullable = true)
 |-- PurchaseID: integer (nullable = true)
 |-- TIMESPSS: timestamp (nullable = true)
 |-- DEVICE_TYPE: string (nullable = true)
 |-- type_touch: integer (nullable = true)
 |-- purchase_own: integer (nullable = true)
 |-- purchase_any: integer (nullable = true)
 |-- MobilePanel: integer (nullable = true)
 |-- FixedPanel: integer (nullable = true)

We can now inspect some records to see what the data looks like.

code
df.show(10)
+------+----------+-------------------+-----------+----------+------------+------------+-----------+----------+
|UserID|PurchaseID|           TIMESPSS|DEVICE_TYPE|type_touch|purchase_own|purchase_any|MobilePanel|FixedPanel|
+------+----------+-------------------+-----------+----------+------------+------------+-----------+----------+
|  9558|         1|2016-03-12 22:18:34|     MOBILE|         7|           0|           0|          0|         0|
|  9558|         1|2016-03-13 05:25:49|     MOBILE|         7|           0|           0|          0|         0|
|  9558|         1|2016-03-13 12:26:21|     MOBILE|         7|           0|           0|          0|         0|
|  9657|         2|2015-07-18 13:00:38|      FIXED|         4|           0|           0|          0|         0|
|     1|         3|2015-08-07 19:16:59|      FIXED|         7|           0|           0|          0|        17|
|     1|         4|2015-11-03 12:31:35|      FIXED|         1|           0|           0|          0|        17|
|     1|         4|2015-11-03 12:39:22|      FIXED|         1|           0|           0|          0|        17|
|     1|         4|2015-11-24 22:29:39|      FIXED|         1|           0|           0|          0|        17|
|     1|         4|2015-11-24 22:30:39|      FIXED|         1|           0|           0|          0|        17|
|     1|         4|2015-11-24 22:30:44|      FIXED|         1|           0|           0|          0|        17|
+------+----------+-------------------+-----------+----------+------------+------------+-----------+----------+
only showing top 10 rows

From the table above, we can observe that a purchase consists of multiple events. Each event is listed as a separate record and describes a customer touchpoint that was involved in the purchase. To prepare the data for our frequent itemset analysis, we will group all events with their corresponding purchase by aggregating all records per purchase ID using Apache Spark. We will partition the data by UserID and PurchaseID to improve performance. The type of customer touchpoint is recorded in the variable type_touch. Furthermore, we order each partition by TIMESPSS, which provides a timestamp for each event, so that the touchpoint events of each purchase are processed chronologically. Note that the final aggregation uses collect_set, which keeps each touchpoint at most once per purchase: Spark’s FPGrowth implementation requires the items in a transaction to be unique.

code
w = Window.partitionBy("UserID", "PurchaseID").orderBy("TIMESPSS")

touch_sequences_df = (
    # collect the chronologically ordered touch events of each purchase
    df.withColumn("sorted_touch_points", F.collect_list("type_touch").over(w))
    # aggregate to one row per purchase; collect_set de-duplicates the touchpoints,
    # since FPGrowth requires each item to occur at most once per transaction
    .groupBy("PurchaseID")
    .agg(F.collect_set("type_touch").alias("sorted_touch_points"))
)

touch_sequences_df.show(truncate=False)
+----------+----------------------+
|PurchaseID|sorted_touch_points   |
+----------+----------------------+
|1         |[7]                   |
|2         |[4]                   |
|3         |[7]                   |
|4         |[1]                   |
|5         |[16, 13, 18, 7]       |
|6         |[16, 6, 4]            |
|7         |[16]                  |
|8         |[1, 16, 3, 10, 7, 4]  |
|9         |[5]                   |
|10        |[5]                   |
|11        |[16]                  |
|12        |[1, 16, 7]            |
|13        |[1]                   |
|14        |[9, 1, 16, 2, 3, 7, 4]|
|15        |[7, 4]                |
|16        |[1]                   |
|17        |[1]                   |
|18        |[1]                   |
|19        |[7]                   |
|20        |[10]                  |
+----------+----------------------+
only showing top 20 rows

For our subsequent analysis we will use Apache Spark to perform frequent itemset mining on the data frame with touch events. Frequent itemset mining is a data mining technique used to discover sets of items that frequently occur together in a dataset. It is a fundamental concept in association rule mining, which aims to find interesting relationships or patterns in large transactional or categorical datasets. In this case we will use it to see whether we can discover interesting patterns in the touch event sequences that lead to purchases. Before we start the analysis, let’s have a look at a commonly used algorithm for frequent itemset mining, the FP-Growth algorithm.

FP-Growth algorithm

The FP-Growth (Frequent Pattern Growth) algorithm [Han et al., 2000] is a popular algorithm used for mining frequent itemsets and discovering association rules in transactional databases or datasets with a similar structure. It’s particularly useful for large datasets where traditional algorithms like Apriori might be inefficient due to their high computational complexity.

FP-Growth works by building a compact data structure called an FP-Tree (Frequent Pattern Tree), which allows efficient mining of frequent itemsets without generating candidate itemsets explicitly. Here’s how the algorithm works (a small toy example follows the outline below):

1. Constructing the FP-Tree:

  • Scan the Dataset: In the first pass over the dataset, count the frequency of each item. Items with support above a predefined threshold are considered frequent.

  • Sort Items: Sort the frequent items in descending order of their support.

  • Build the FP-Tree: Construct the FP-Tree by scanning the dataset again and adding each transaction to the tree. Each transaction is represented as a path in the tree. The tree nodes represent items, and their paths represent the sequence of items in a transaction.

2. Mining Frequent Itemsets:

  • Mining Conditional Pattern Bases: For each frequent item in the dataset, mine the Conditional Pattern Base (CPB). The CPB is a set of paths in the FP-Tree that contain the item. These paths are used to construct a smaller FP-Tree, called the Conditional FP-Tree.

  • Recursion: Recursively mine the Conditional FP-Tree to extract frequent itemsets.

3. Generating Association Rules:

  • From Frequent Itemsets: Once frequent itemsets are discovered, generate association rules based on these itemsets. Association rules express relationships between items and are typically in the form “If X, then Y.”

  • Calculating Confidence: Calculate the confidence of each association rule. Confidence measures how often the rule has been found to be true in the dataset.

  • Pruning and Filtering: Prune association rules based on user-defined thresholds, such as minimum confidence and minimum support.
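
To make the three phases above more concrete, here is a minimal sketch that runs Spark’s FPGrowth implementation on a handful of made-up grocery transactions. The items and thresholds are purely illustrative, and the sketch reuses the Spark session initialized earlier.

code
# Toy example (illustrative only): five hand-crafted transactions
toy_df = spark.createDataFrame(
    [
        (0, ["bread", "milk"]),
        (1, ["bread", "butter", "milk"]),
        (2, ["bread", "butter"]),
        (3, ["milk"]),
        (4, ["bread", "butter", "milk"]),
    ],
    ["id", "items"],
)

# Phases 1 and 2: build the FP-Tree and mine frequent itemsets.
# minSupport=0.4 keeps itemsets occurring in at least 2 of the 5 transactions.
toy_fp = FPGrowth(itemsCol="items", minSupport=0.4, minConfidence=0.6)
toy_model = toy_fp.fit(toy_df)
toy_model.freqItemsets.show()

# Phase 3: association rules derived from the frequent itemsets,
# together with their confidence and lift values.
toy_model.associationRules.show()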

Frequent itemset mining

We will run the FP-growth algorithm on the touch event data to generate frequent itemsets, which could help us identify interesting patterns. To run the FP-Growth algorithm we can use Apache Spark’s built-in FPGrowth method.

We will set the parameter minSupport to 0.1, meaning that a pattern needs to be present in at least 10% of the transactions. Furthermore, we set the parameter minConfidence to 0.9. Minimum confidence indicates how often an association rule has been found to be true. By setting this parameter quite high, we only retain high-quality patterns that are backed by strong association rules.

Once the algorithm has been initialized, we are ready to run FP-Growth on the preprocessed purchases with their constituent touchpoint sequences.

code
fpGrowth = FPGrowth(itemsCol="sorted_touch_points", minSupport=0.1, minConfidence=0.9)
model = fpGrowth.fit(touch_sequences_df)

After the FP-Growth algorithm has finished, we can inspect the frequent itemsets that were discovered, together with their corresponding frequencies. We will also convert the Spark data frame to a Pandas data frame for further processing and visualization.

code
itemset_df = model.freqItemsets.sort(col("freq").desc()).toPandas()
itemset_df.head(40)
items freq
0 [1] 19157
1 [7] 17185
2 [7, 1] 11670
3 [4] 10396
4 [16] 7857
5 [4, 1] 7742
6 [13] 7395
7 [4, 7] 7162
8 [4, 7, 1] 6128
9 [16, 1] 5685
10 [13, 1] 5583
11 [13, 7] 5455
12 [16, 7] 5269
13 [13, 7, 1] 4781
14 [16, 7, 1] 4432
15 [16, 4] 3991
16 [13, 4] 3677
17 [16, 4, 1] 3467
18 [16, 4, 7] 3310
19 [13, 4, 1] 3236
20 [13, 4, 7] 3196
21 [10] 3066
22 [16, 4, 7, 1] 3013
23 [13, 4, 7, 1] 2946

We now append a trailing zero to each itemset, denoting the actual conversion event. This will be useful for visualization purposes later on.

code
itemset_df['items'] = itemset_df['items'].apply(lambda x: x + [0])
itemset_df.head()
items freq
0 [1, 0] 19157
1 [7, 0] 17185
2 [7, 1, 0] 11670
3 [4, 0] 10396
4 [16, 0] 7857

We will have a further look at the discovered itemsets later on, but first let’s have a quick look at the extracted association rules.

Association rule extraction

Association rules are a type of pattern or relationship that can be discovered from transactional or categorical data using data mining techniques. Association rule mining aims to find interesting relationships between items in a dataset, particularly between items that frequently co-occur.

We can now inspect the association rules that were extracted by the FP-Growth algorithm. We can observe that all derived association rules have a confidence higher than 90%, as dictated by the minConfidence parameter. Again, we immediately convert the result to a Pandas data frame.

code
association_rules_df = model.associationRules.toPandas()
association_rules_df.head()
antecedent consequent confidence lift support
0 [13, 4, 1] [7] 0.910383 1.536924 0.101544
1 [13, 4, 7] [1] 0.921777 1.395970 0.101544
2 [16, 4, 7] [1] 0.910272 1.378546 0.103854

To help interpret these association rules, we will import a mapping table that maps touchpoint IDs to more verbose descriptions.

code
lookup_df = pd.read_csv('./data/touch_types.tsv', sep='\t', names=['touch_point', 'description', 'initiation'])
lookup_df.head(20)
touch_point description initiation
0 1 Accomodations Website Consumer initiated touchpoint
1 2 Accomodations App Consumer initiated touchpoint
2 3 Accomodations Search Consumer initiated touchpoint
3 4 Information / comparison Website Consumer initiated touchpoint
4 5 Information / comparison App Consumer initiated touchpoint
5 6 Information / comparison Search Consumer initiated touchpoint
6 7 Touroperator / Travel agent Website Competitor Consumer initiated touchpoint
7 8 Touroperator / Travel agent App Competitor Consumer initiated touchpoint
8 9 Touroperator / Travel agent Search Competitor Consumer initiated touchpoint
9 10 Touroperator / Travel agent Website Focus brand Consumer initiated touchpoint
10 12 Touroperator / Travel agent Search Focus brand Consumer initiated touchpoint
11 13 Flight tickets Website Consumer initiated touchpoint
12 14 Flight tickets App Consumer initiated touchpoint
13 15 Flight tickets Search Consumer initiated touchpoint
14 16 generic search Consumer initiated touchpoint
15 18 AFFILIATES Company intiated touchpoint (Focus brand)
16 19 BANNER Company intiated touchpoint (Focus brand)
17 20 EMAIL Company intiated touchpoint (Focus brand)
18 21 PREROLLS Company intiated touchpoint (Focus brand)
19 22 RETARGETING Company intiated touchpoint (Focus brand)

To facilitate lookups, we convert the Pandas data frame to a dictionary. Note that we also add an additional entry 0 => 'Conversion' to annotate conversion events as well.

code
lookup_dict = lookup_df.set_index('touch_point').to_dict()['description']
lookup_dict[0] = 'Conversion'
for k,v in lookup_dict.items():
    print(k,":", v)
1 : Accomodations Website
2 : Accomodations App
3 : Accomodations Search
4 : Information / comparison Website
5 : Information / comparison App
6 : Information / comparison Search
7 : Touroperator / Travel agent Website Competitor
8 : Touroperator / Travel agent App Competitor
9 : Touroperator / Travel agent Search Competitor
10 : Touroperator / Travel agent Website Focus brand
12 : Touroperator / Travel agent Search Focus brand
13 : Flight tickets Website
14 : Flight tickets App
15 : Flight tickets Search
16 : generic search
18 : AFFILIATES
19 : BANNER
20 : EMAIL
21 : PREROLLS
22 : RETARGETING
0 : Conversion
code
association_rules_df["joined"] = (
    association_rules_df["antecedent"] + association_rules_df["consequent"]
)
association_rules_df["joined"] = association_rules_df["joined"].apply(lambda x: x + [0])
association_rules_df.head()
antecedent consequent confidence lift support descriptions joined
0 [13, 4, 1] [7] 0.910383 1.536924 0.101544 [Flight tickets Website, Information / compari... [13, 4, 1, 7, 0]
1 [13, 4, 7] [1] 0.921777 1.395970 0.101544 [Flight tickets Website, Information / compari... [13, 4, 7, 1, 0]
2 [16, 4, 7] [1] 0.910272 1.378546 0.103854 [generic search, Information / comparison Webs... [16, 4, 7, 1, 0]

To better interpret these results, it helps to understand the following variables (a small worked example follows the list):

  • Antecedent: The item or set of items that appear in the “if” part of the rule. It represents the condition or premise of the rule.

  • Consequent: The item or set of items that appear in the “then” part of the rule. It represents the outcome or consequence of the rule.

  • Support: The support of a rule is the proportion of transactions that contain both the antecedent and the consequent. It measures the frequency of the rule in the dataset.

  • Confidence: The confidence of a rule is the proportion of transactions containing the antecedent that also contain the consequent. It measures the strength of the implication from the antecedent to the consequent.

  • Lift: The lift of a rule measures the degree of association between the antecedent and the consequent, taking into account the support of both items. It indicates whether the presence of the antecedent increases the likelihood of the consequent beyond what would be expected by chance.
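
As a small worked check of these definitions, we can reproduce the support, confidence, and lift of the first extracted rule, [13, 4, 1] => [7], from the itemset frequencies listed above. The total number of transactions is not printed anywhere in this notebook, so it is inferred (approximately) from freq/support of that same rule.

code
# Frequencies taken from the freqItemsets table above
freq_antecedent = 3236    # freq([13, 4, 1])
freq_rule = 2946          # freq([13, 4, 7, 1]) = antecedent plus consequent
freq_consequent = 17185   # freq([7])

# Approximate total number of purchases, inferred from freq / support (~29,012)
n_transactions = round(freq_rule / 0.101544)

support = freq_rule / n_transactions                    # ~0.1015
confidence = freq_rule / freq_antecedent                # ~0.9104
lift = confidence / (freq_consequent / n_transactions)  # ~1.54

print(f"support={support:.4f}, confidence={confidence:.4f}, lift={lift:.2f}")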

We can now translate and interpret the derived association rules, which can be used to predict future events based on prior events. In this case the association rules predict that a visit to a tour operator or accommodations website precedes a conversion. The lift values of all association rules are higher than 1, indicating that the antecedent touchpoints increase the likelihood of the consequent beyond what would be expected by chance.
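
As a brief aside, the fitted FPGrowth model can also be applied to transactions directly: its transform method checks each transaction against the mined association rules and adds a prediction column with the consequents those rules suggest. A minimal sketch, reusing the model and touch_sequences_df from above:

code
# Predict likely additional touchpoints for each purchase based on the mined rules
predicted_df = model.transform(touch_sequences_df)
predicted_df.show(5, truncate=False)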

We can convert the association rules to text as follows.

code
# Map touchpoint IDs to readable descriptions and print each rule as a chain of touchpoints
association_rules_df['descriptions'] = association_rules_df['joined'].apply(lambda x: [lookup_dict[i] for i in x])

print("Extracted association rules:\n")
for i,v in enumerate(association_rules_df['descriptions'].apply(lambda x: "\n   => ".join(x))):
    print(i+1, ': ', v, sep='')
Extracted association rules:

1: Flight tickets Website
   => Information / comparison Website
   => Accomodations Website
   => Touroperator / Travel agent Website Competitor
   => Conversion
2: Flight tickets Website
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
3: generic search
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion

Customer journey visualization

Frequent itemsets represent combinations of customer touch events that frequently occur together. These combinations can be seen as customer journeys across various touchpoints, eventually leading to a purchase or conversion. We can derive those customer journeys from the frequent itemsets extracted earlier, so let’s have a look.

code
itemset_df['touch_description'] = itemset_df['items'].apply(lambda x: [lookup_dict[i] for i in x])

print("Frequent customer journeys:\n")
for i,v in enumerate(itemset_df['touch_description'].apply(lambda x: "\n   => ".join(x)).to_list()):
    print(i+1, ': ', v, sep='')
Frequent customer journeys:

1: Accomodations Website
   => Conversion
2: Touroperator / Travel agent Website Competitor
   => Conversion
3: Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
4: Information / comparison Website
   => Conversion
5: generic search
   => Conversion
6: Information / comparison Website
   => Accomodations Website
   => Conversion
7: Flight tickets Website
   => Conversion
8: Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Conversion
9: Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
10: generic search
   => Accomodations Website
   => Conversion
11: Flight tickets Website
   => Accomodations Website
   => Conversion
12: Flight tickets Website
   => Touroperator / Travel agent Website Competitor
   => Conversion
13: generic search
   => Touroperator / Travel agent Website Competitor
   => Conversion
14: Flight tickets Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
15: generic search
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
16: generic search
   => Information / comparison Website
   => Conversion
17: Flight tickets Website
   => Information / comparison Website
   => Conversion
18: generic search
   => Information / comparison Website
   => Accomodations Website
   => Conversion
19: generic search
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Conversion
20: Flight tickets Website
   => Information / comparison Website
   => Accomodations Website
   => Conversion
21: Flight tickets Website
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Conversion
22: Touroperator / Travel agent Website Focus brand
   => Conversion
23: generic search
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion
24: Flight tickets Website
   => Information / comparison Website
   => Touroperator / Travel agent Website Competitor
   => Accomodations Website
   => Conversion

To provide more insight and help users identify patterns, we will generate a Sankey diagram from the frequent itemsets. A Sankey diagram is a type of data visualization that illustrates the flow of quantities or values between multiple entities. It is particularly useful for showing the distribution of values, proportions, or quantities across different stages or categories in a system. As such, Sankey diagrams provide a clear and intuitive way to visualize complex data flows and relationships. In our case, the Sankey diagram will depict common sequences of touchpoint events, i.e. customer journeys, leading to conversions.

Before creating the Sankey diagram, we first define an auxiliary function that converts hex color codes to RGBA colors. This allows us to specify the opacity of the colors as well. Using this function, we create the color palettes that we will use for coloring groups in our Sankey diagram.

code
def hex_to_rgba(hex_color, alpha):
    # convert a '#rrggbb' hex color to an 'rgba(r, g, b, a)' string with the given opacity
    hex_color = hex_color.lstrip('#')
    r, g, b = tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"

colors = sns.color_palette("Dark2_r", 7).as_hex()
palette = [hex_to_rgba(p, 0.8) for p in colors]
palette_opacity = [hex_to_rgba(p, 0.4) for p in colors]

We can now create the Sankey diagram as follows, using Plotly as the visualization library.

code
# Extract distinct touch points from the data frame with itemsets
unique_touch_points = list(set(event for seq in itemset_df["items"] for event in seq))

# Create a node dictionary mapping each touch point to a node index
node_dict = {event: idx for idx, event in enumerate(unique_touch_points)}

# Prepare one link per adjacent pair of touch points, weighted by the
# frequency of the itemset the pair belongs to
sources = []
targets = []
values = []
for seq, freq in zip(itemset_df["items"], itemset_df["freq"]):
    if len(seq) > 1:
        for i in range(len(seq) - 1):
            sources.append(node_dict[seq[i]])
            targets.append(node_dict[seq[i + 1]])
            values.append(freq)

# Create Sankey diagram
fig = go.Figure(
    go.Sankey(
        arrangement="snap",
        # configure the node properties
        node=dict(
            x=[1.0, 0.9, 0.1, 0.65, 0.1, 0.1],
            y=[0.5, 0.0, 0.3, 0.3, 0.55, 0.75],
            pad=0,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=[lookup_dict[i] for i in unique_touch_points],
            color=palette,
        ),
        # configure the link properties
        link=dict(
            source=sources,
            target=targets,
            value=values,
            color=[palette_opacity[s] for s in sources],
        ),
    )
)

fig.update_layout(
    title_text="Frequent customer journey paths",
    autosize=False,
    width=800,
    hovermode="x",
    font=dict(size=15, color="black"),
    height=400,
    margin={"t": 30, "b": 20, "l": 0},
)

fig.show()
(Interactive Plotly output: Sankey diagram of frequent customer journey paths.)

The Sankey diagram provides a nice and intuitive overview of typical customer journeys leading to conversions. Note also that the heights of the node bars and link bands immediately convey the relative proportions of the different trajectories.
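
As a textual complement to the diagram, we can also print each frequent journey together with its support, i.e. the fraction of all purchases that contain that combination of touchpoints. This is a quick sketch that reuses touch_sequences_df, itemset_df, and lookup_dict from above.

code
# Fraction of all purchases covered by each frequent journey
n_purchases = touch_sequences_df.count()
for _, row in itemset_df.iterrows():
    journey = " => ".join(lookup_dict[i] for i in row["items"])
    print(f"{row['freq'] / n_purchases:7.2%}  {journey}")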

Conclusion

In this analysis we have demonstrated that frequent itemset mining can be a useful approach to discover meaningful patterns in transactional data records. Furthermore, Apache Spark provides a powerful framework for data preprocessing and a scalable library for frequent itemset extraction with the built-in FP-Growth algorithm. Finally, Sankey diagrams provide an intuitive way to visualize sequences of events.

References

  • Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2), 1-12.