Introduction
In this notebook we will perform customer touchpoint analysis using Apache Spark. A customer touchpoint refers to any interaction or point of contact that a customer has with a business, brand, product, or service throughout the customer journey. It encompasses all the different ways in which a customer can engage with or experience a company, both online and offline. Customer touchpoints play a crucial role in shaping a customer's perception, satisfaction, and overall experience with a brand.
In this analysis we will automatically reconstruct customer journeys from transactional data consisting of purchase events and customer touchpoint events. We will perform frequent itemset mining on those transactional records to identify combinations of customer touchpoints that constitute common customer journeys leading to purchases or conversions. For the frequent itemset mining we will use the FP-Growth algorithm provided by Apache Spark's MLlib library.
The dataset we will use for this analysis can be downloaded from https://www.kaggle.com/datasets/kishlaya18/customer-purchase-journey-netherlands. It contains transactional records of travel purchases together with the corresponding customer touchpoint events. So let's start with the analysis.
Data preprocessing with Apache Spark
code
import findspark
import matplotlib.pyplot as plt
import pandas as pd
import plotly.graph_objects as go
import pyspark
import pyspark.pandas as ps
import seaborn as sns
from pyspark.ml.fpm import FPGrowth, PrefixSpan
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col
First we will start the analysis by initializing the Spark session and loading the dataset.
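A minimal sketch of this step is shown below. The application name and the CSV file name are assumptions, since the original notebook does not show them here.
code
# Locate the local Spark installation and start a session
findspark.init()
spark = SparkSession.builder.appName("touchpoint-analysis").getOrCreate()

# Load the Kaggle dataset (file name assumed) and inspect the first records
df = spark.read.csv("customer_journey.csv", header=True, inferSchema=True)
df.show(5)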
From the above table, we can observe that a purchase consists of multiple events. Each event is listed as a separate record and describes a certain customer touchpoint that was involved in the purchase. To prepare the data for our frequent itemset analysis, we will group all events with their corresponding purchase by aggregating all records per purchase ID using Apache Spark. We will partition the data by UserID and PurchaseID to improve performance. The type of customer touchpoint is recorded in the variable type_touch. Furthermore, we will order each partition by TIMESPSS, which provides a timestamp for each event. This ensures that the touchpoint events for each purchase are processed in chronological order.
code
# Order events chronologically within each purchase
w = Window.partitionBy("UserID", "PurchaseID").orderBy("TIMESPSS")

touch_sequences_df = (
    df.withColumn("sorted_touch_points", F.collect_list("type_touch").over(w))
    .groupBy("PurchaseID")
    # collect_set gathers the distinct touchpoints per purchase; FP-Growth
    # treats a transaction as an unordered set of unique items
    .agg(F.collect_set("type_touch").alias("sorted_touch_points"))
)
touch_sequences_df.show(truncate=False)
For our subsequent analysis we will use Apache Spark to perform frequent itemset mining on the data frame with touch events. Frequent itemset mining is a data mining technique used to discover sets of items that frequently co-occur in a dataset. It is a fundamental concept in association rule mining, which aims to find interesting relationships or patterns in large transactional or categorical datasets. In this case we will use it to see whether we can discover interesting patterns in the touch event sequences that lead to purchases. Before we start the analysis, let's have a look at a commonly used algorithm for frequent itemset mining, called the FP-Growth algorithm.
FP-Growth algorithm
The FP-Growth (Frequent Pattern Growth) algorithm [Han et al., 2000] is a popular algorithm used for mining frequent itemsets and discovering association rules in transactional databases or datasets with a similar structure. It’s particularly useful for large datasets where traditional algorithms like Apriori might be inefficient due to their high computational complexity.
FP-Growth works by building a compact data structure called an FP-Tree (Frequent Pattern Tree), which allows efficient mining of frequent itemsets without explicitly generating candidate itemsets. Here's how the algorithm works (a small Spark example follows the steps below):
1. Constructing the FP-Tree:
Scan the Dataset: In the first pass over the dataset, count the frequency of each item. Items with support above a predefined threshold are considered frequent.
Sort Items: Sort the frequent items in descending order of their support.
Build the FP-Tree: Construct the FP-Tree by scanning the dataset again and adding each transaction to the tree. Each transaction is represented as a path in the tree. The tree nodes represent items, and their paths represent the sequence of items in a transaction.
2. Mining Frequent Itemsets:
Mining Conditional Pattern Bases: For each frequent item in the dataset, mine the Conditional Pattern Base (CPB). The CPB is a set of paths in the FP-Tree that contain the item. These paths are used to construct a smaller FP-Tree, called the Conditional FP-Tree.
Recursion: Recursively mine the Conditional FP-Tree to extract frequent itemsets.
3. Generating Association Rules:
From Frequent Itemsets: Once frequent itemsets are discovered, generate association rules based on these itemsets. Association rules express relationships between items and are typically in the form “If X, then Y.”
Calculating Confidence: Calculate the confidence of each association rule. Confidence measures how often the rule has been found to be true in the dataset.
Pruning and Filtering: Prune association rules based on user-defined thresholds, such as minimum confidence and minimum support.
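To make these steps concrete, here is a small, self-contained toy example of Spark's FPGrowth on hand-made transactions. The data and the thresholds are made up purely for illustration.
code
# Toy transactions: each row is one transaction with a list of unique items
toy_df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"])],
    ["id", "items"],
)
toy_model = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6).fit(toy_df)
toy_model.freqItemsets.show()       # frequent itemsets with their counts
toy_model.associationRules.show()   # rules derived from those itemsets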
Frequent itemset mining
We will run the FP-Growth algorithm on the touch event data to generate frequent itemsets, which could help us identify interesting patterns. To run the FP-Growth algorithm we can use Apache Spark's built-in FPGrowth class.
We will set the parameter minSupport to 0.1, meaning that a pattern must be present in at least 10% of transactions. Furthermore, we set the parameter minConfidence to 0.9. Minimum confidence indicates how often an association rule has been found to be true. By setting this parameter quite high, we will only retain high-quality patterns that are supported by strong association rules.
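A sketch of the initialization under these settings; the variable name fp_growth is an assumption, and the items column matches the preprocessing step above.
code
fp_growth = FPGrowth(itemsCol="sorted_touch_points", minSupport=0.1, minConfidence=0.9)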
Once we have initialized the algorithm, we are ready to run the FP-Growth algorithm on the preprocessed purchases with their constituent touchpoint sequences.
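Fitting the model on the touch sequences then comes down to a single call:
code
model = fp_growth.fit(touch_sequences_df)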
After the FP-Growth algorithm has finished, we can inspect the frequent itemsets that were discovered, together with their corresponding frequencies. We will also convert the Spark data frame to a Pandas data frame for further processing and visualization.
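A sketch of this step; sorting by frequency and the variable name itemset_df are assumptions, chosen to match the table shown below.
code
# Pull the discovered itemsets out of the fitted model and convert to Pandas
itemset_df = model.freqItemsets.sort(F.desc("freq")).toPandas()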
We will also append a trailing zero to each itemset, denoting the actual conversion event. This will be useful for visualization purposes later on.
code
# Append item 0 (the conversion event) to every frequent itemset
itemset_df['items'] = itemset_df['items'].apply(lambda x: x + [0])
itemset_df.head()
   items       freq
0  [1, 0]      19157
1  [7, 0]      17185
2  [7, 1, 0]   11670
3  [4, 0]      10396
4  [16, 0]      7857
We will have a further look at the discovered itemsets later on, but first let’s have a quick look at the extracted association rules.
Association rule extraction
Association rules are a type of pattern or relationship that can be discovered from transactional or categorical data using data mining techniques. Association rule mining aims to find interesting relationships between items in a dataset, particularly between items that frequently co-occur.
We can also inspect the association rules that were extracted by the FP-Growth algorithm. We can observe that all derived association rules have a confidence higher than 90%. We will convert the result to a Pandas data frame immediately as well.
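A sketch of this step is shown below. The joined column, concatenating antecedent and consequent per rule, is an assumption added here because the lookup code further down relies on it.
code
association_rules_df = model.associationRules.toPandas()
# Combine antecedent and consequent into one item list per rule
# (assumed helper column; used later to translate rules to text)
association_rules_df['joined'] = (
    association_rules_df['antecedent'] + association_rules_df['consequent']
)
association_rules_df.head()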
To facilitate lookups, we will further convert the Pandas data frame with touchpoint descriptions to a dictionary. Note that we also add an additional dictionary entry 0 => 'Conversion' to annotate conversion events as well.
code
lookup_dict = lookup_df.set_index('touch_point').to_dict()['description']
lookup_dict[0] = 'Conversion'
for k, v in lookup_dict.items():
    print(k, ":", v)
To interpret the results, it is helpful to understand the following variables (a small worked example follows the list):
Antecedent: The item or set of items that appear in the “if” part of the rule. It represents the condition or premise of the rule.
Consequent: The item or set of items that appear in the “then” part of the rule. It represents the outcome or consequence of the rule.
Support: The support of a rule is the proportion of transactions that contain both the antecedent and the consequent. It measures the frequency of the rule in the dataset.
Confidence: The confidence of a rule is the proportion of transactions containing the antecedent that also contain the consequent. It measures the strength of the implication from the antecedent to the consequent.
Lift: The lift of a rule measures the degree of association between the antecedent and the consequent, taking into account the support of both items. It indicates whether the presence of the antecedent increases the likelihood of the consequent beyond what would be expected by chance.
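As a quick illustration of these measures, consider the following made-up numbers:
code
# Worked example with made-up numbers: out of 1000 transactions, the
# antecedent appears in 400, the consequent in 500, and both together in 360
n, n_ant, n_con, n_both = 1000, 400, 500, 360
support = n_both / n             # 0.36: rule holds in 36% of transactions
confidence = n_both / n_ant      # 0.90: 90% of antecedent transactions also contain the consequent
lift = confidence / (n_con / n)  # 1.80: consequent is 1.8x more likely given the antecedent
print(support, confidence, lift)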
We can now convert and interpret the derived association rules, which can be used to predict certain future events based on prior events. In this case the association rules predict an interaction with a tour operator or accommodations website leading to a conversion. The lift values for all association rules are higher than 1, indicating that the antecedent touchpoints increase the likelihood of the consequent beyond what would be expected by chance.
We can convert the association rules to text as follows.
code
association_rules_df['descriptions'] = association_rules_df['joined'].apply(
    lambda x: [lookup_dict[i] for i in x]
)
print("Extracted association rules:\n")
for i, v in enumerate(association_rules_df['descriptions'].apply(lambda x: "\n => ".join(x))):
    print(i + 1, ': ', v, sep='')
Frequent itemsets represent common combinations of customer touch events that frequently occur together. These combinations can be seen as customer journeys across various touchpoints, eventually leading to a purchase or conversion. We can derive those customer journeys from the frequent itemsets we extracted earlier. So let's have a look.
code
itemset_df['touch_description'] = itemset_df['items'].apply(
    lambda x: [lookup_dict[i] for i in x]
)
print("Frequent customer journeys:\n")
for i, v in enumerate(itemset_df['touch_description'].apply(lambda x: "\n => ".join(x)).to_list()):
    print(i + 1, ': ', v, sep='')
To provide more insight and help users identify patterns, we will generate a Sankey diagram from the frequent itemsets. A Sankey diagram is a type of data visualization that illustrates the flow of quantities or values between multiple entities. It is particularly useful for showing the distribution of values, proportions, or quantities across different stages or categories in a system. As such, Sankey diagrams provide a clear and intuitive way to visualize complex data flows and relationships. In our case, the Sankey diagram will depict common sequences of touchpoint events, i.e. customer journeys, leading to conversions.
Before creating the Sankey diagram we will first define an auxiliary function that converts hex color codes to rgba colors. This allows us to specify the opacity of the colors as well. Using this function, we can create the color palettes that we will use for coloring groups in our Sankey diagram.
code
def hex_to_rgba(hex_color, alpha):
    # Convert a hex color code to an rgba string with the given opacity
    hex_color = hex_color.lstrip('#')
    r, g, b = tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {alpha})"

colors = sns.color_palette("Dark2_r", 7).as_hex()
palette = [hex_to_rgba(p, 0.8) for p in colors]
palette_opacity = [hex_to_rgba(p, 0.4) for p in colors]
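With the palettes in place we can build the diagram. The construction below is a minimal sketch: it turns each consecutive pair of touchpoints in a frequent itemset into a link weighted by the itemset's frequency. Treating the itemsets as ordered is an approximation, since FP-Growth itemsets are unordered sets, and the layout details are assumptions as well.
code
# Build node labels from the lookup dictionary and map item codes to indices
keys = sorted(lookup_dict)
labels = [lookup_dict[k] for k in keys]
index = {k: i for i, k in enumerate(keys)}

# Each consecutive pair of touchpoints becomes a weighted link
sources, targets, values = [], [], []
for items, freq in zip(itemset_df['items'], itemset_df['freq']):
    for a, b in zip(items, items[1:]):
        sources.append(index[a])
        targets.append(index[b])
        values.append(freq)

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=15, thickness=20,
              color=[palette[i % len(palette)] for i in range(len(labels))]),
    link=dict(source=sources, target=targets, value=values,
              color=[palette_opacity[s % len(palette_opacity)] for s in sources]),
))
fig.update_layout(title_text="Frequent customer journeys leading to conversion")
fig.show()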
The Sankey diagram provides a nice and intuitive overview of typical customer journeys leading to conversions. Note also that we can immediately read off the proportions of the trajectories from the heights of the grouping bars.
Conclusion
In this analysis we have demonstrated that frequent itemset mining can be a useful approach to discover meaningful patterns in transactional data records. Furthermore, Apache Spark provides a powerful framework for data preprocessing and a scalable library for frequent itemset extraction using the built-in FP-Growth algorithm. Finally, Sankey diagrams provide an intuitive way to visualize sequences of events.
References
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2), 1-12.