JavaScript Tracker for Machine Learning
Introduction
Machine learning requires different types of data, which can be acquired from sources such as websites, databases, and CRM systems. The goal of this project was to build a tool that collects anonymized user interactions from websites.
Executive Summary
A JavaScript tracker was developed. Its main features are:
- Single-click deployment and build process
- Schemaless data collection mechanism
- Asynchronous implementation
- Modularity
Results
It was possible to reverse engineer most of the features that high-end tracking solutions such as Mixpanel, Google Analytics, or Amplitude offer. Most of the tracking functionality is already available or can be easily added. Thanks to the single-click deployment and automated build process, setup takes 5 to 10 minutes. Because this is open-source software, there are no fees to pay and the cost savings are significant. Companies own the data and do not need to ship it to external providers, which improves privacy. Users can opt out of tracking, and no personally identifiable information is sent anywhere. Because the tracker is loaded asynchronously, it does not block page loading.
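The asynchronous loading and the opt-out described above can be illustrated with a small sketch. It is not taken from the project's code: `createStub`, `attachTracker`, `createOptOutTracker`, and the `optedOut` callback are hypothetical names. The assumed idea is that the page only receives a lightweight stub that records calls in a queue. Once the full tracker script has loaded, the queued calls are replayed, and nothing is sent for users who have opted out.

```javascript
// Hypothetical sketch: a stub that buffers calls until the real tracker loads.
// In a browser, the full tracker would typically be loaded through a script
// element with the async attribute, so the page never waits for it.
function createStub() {
  const queue = [];
  let tracker = null; // set once the full tracker script has loaded

  return {
    // Page code calls this right away; it only records or forwards the call.
    push(method, ...args) {
      if (tracker) {
        tracker[method](...args);
      } else {
        queue.push([method, args]);
      }
    },
    // Called by the full tracker once it is available; replays queued calls.
    attachTracker(realTracker) {
      tracker = realTracker;
      while (queue.length > 0) {
        const [method, args] = queue.shift();
        tracker[method](...args);
      }
    },
  };
}

// Hypothetical tracker that checks an opt-out flag before sending anything.
// `send` stands in for the request to the tracking endpoint.
function createOptOutTracker(send, optedOut) {
  return {
    track(event, properties = {}) {
      if (optedOut()) return; // opted-out users produce no requests
      send({ event, properties });
    },
  };
}
```

With this arrangement, a call such as `stub.push('track', 'page_view', { path: '/' })` made before the tracker has loaded is not lost; it is delivered when `attachTracker` runs.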
Architecture
Figure 1.0 shows the architecture of the deployment and build module. The JavaScript tracker can be installed directly on the website or via any tag management solution. Step-by-step instructions can be found in the project repository.
Figure 1.0: JavaScript Tracker Deployment Module
The whole solution is split into two modules. The first module is the JavaScript tracker itself. The second module is the deployment and build module. Each module can be customized independently.
The JavaScript tracker uses a first-party cookie to identify the user anonymously. The anonymous ID can be used to stitch events together on the backend. The current solution is GDPR compliant. The anonymous ID can be combined with a user ID from other systems, e.g. the backend, by sending an identity call during the login or registration process. All events are schemaless, which means there is no fixed schema. It’s up to the data pipeline owner to decide what is tracked and how. This flexibility has a cost: without a fixed schema, keeping event names and properties consistent is left to whoever implements the tracking.
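The following sketch illustrates the three ideas in this paragraph: an anonymous ID kept in a first-party cookie, an identity call that links it to a user ID, and schemaless events. It is an illustration under assumptions, not the project's actual API. The names `createIdentityTracker` and `generateAnonymousId`, the cookie key `anon_id`, and the payload fields are all hypothetical, and `storage` is a simple key-value object standing in for the cookie so that the sketch also runs outside a browser.

```javascript
// Hypothetical sketch; names and payload fields are assumptions.

// Generates a random 32-character hex ID that carries no personal information.
// Math.random keeps the sketch dependency-free; a cryptographically strong
// generator would be preferable in production.
function generateAnonymousId() {
  let id = '';
  for (let i = 0; i < 32; i += 1) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// `storage` stands in for the first-party cookie (get/set by key).
// `send` stands in for the request to the tracking endpoint.
function createIdentityTracker(storage, send) {
  let anonymousId = storage.get('anon_id');
  if (!anonymousId) {
    anonymousId = generateAnonymousId();
    storage.set('anon_id', anonymousId);
  }

  return {
    // Schemaless: `properties` is passed through without any validation.
    track(event, properties = {}) {
      send({ type: 'track', anonymousId, event, properties });
    },
    // Links the anonymous ID to an ID from another system, e.g. at login.
    identify(userId) {
      send({ type: 'identify', anonymousId, userId });
    },
  };
}
```

Because every payload carries the same `anonymousId`, a backend could attribute events recorded before the identity call to the user ID that arrives with it.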
The deployment module was developed to speed up the implementation. It can be deployed via AWS CloudFormation. During deployment, a so-called custom resource triggers the process of combining the JavaScript files. The whole deployment happens without any human intervention, since one of the project requirements was to remove the human from the build chain. The JavaScript tracker can be built only after the Pipes Core module is deployed. The Pipes Core module stores the raw event data.
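To make the custom resource step more concrete, here is a minimal sketch of what such a handler might do. It is not the project's build code: `buildBundle`, `handleCustomResource`, the `BundleSize` output, and the default physical ID `tracker-bundle` are assumptions. Only the request fields (`RequestType`, `StackId`, `RequestId`, `LogicalResourceId`) and the response fields (`Status`, `Reason`, `PhysicalResourceId`, `Data`) follow the CloudFormation custom resource format. Fetching the source files, storing the bundle, and sending the response to the pre-signed `ResponseURL` are left out.

```javascript
// Hypothetical sketch of the build step behind a CloudFormation custom resource.

// Combines JavaScript sources into one bundle, in the given order.
// The ";" separator guards against files that lack a trailing semicolon.
function buildBundle(sources) {
  return sources
    .map(({ name, code }) => `/* ${name} */\n${code.trim()}`)
    .join('\n;\n');
}

// Builds the bundle on Create/Update and returns the response body that
// CloudFormation expects. The caller would still have to PUT `response`
// as JSON to event.ResponseURL.
function handleCustomResource(event, sources) {
  const response = {
    Status: 'SUCCESS',
    PhysicalResourceId: event.PhysicalResourceId || 'tracker-bundle',
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
    Data: {},
  };

  try {
    if (event.RequestType === 'Create' || event.RequestType === 'Update') {
      const bundle = buildBundle(sources);
      response.Data = { BundleSize: String(bundle.length) };
      return { response, bundle };
    }
    // Delete: there is nothing to build.
    return { response, bundle: null };
  } catch (error) {
    // A FAILED status with a reason makes the stack operation fail visibly.
    response.Status = 'FAILED';
    response.Reason = String(error.message);
    return { response, bundle: null };
  }
}
```

Reporting failures explicitly matters in a fully automated chain: if the handler never responds, the stack operation waits until it times out.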
Conclusion
Almost all of the functionality that high-end tracking tools offer is included in this solution. The tracker was written in such a way that it works in every browser. Users who don’t wish to be tracked can opt out.
Recommendations
This solution was tested on production systems. It’s a robust working prototype. Some other aspects need to be considered in the future:
- Browsers are moving towards automatically deleting first-party cookies that were set from JavaScript after 7 days. The cookie should therefore be set server-side, because cookies set by the server are not deleted by the browsers in this way.
- Events are not batched and offline tracking is not supported. Currently, every call is sent to the tracking endpoint individually, which is inefficient. A better solution is to send events in batches. Offline tracking could use the same mechanism, with events held back until the connection returns.
- The tracking code is not hosted on a CDN (content delivery network). To decrease latency, the tracking code should be served from a CDN.
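The batching recommendation in the list above could be approached along the following lines. This is a sketch under assumptions and not part of the current tracker: `createBatcher`, `maxBatchSize`, and `isOnline` are hypothetical names, and `sendBatch` stands in for a single request that carries several events. Events are held back while the client is offline and go out with the next flush.

```javascript
// Hypothetical sketch of an event batcher; names and defaults are assumptions.
function createBatcher(sendBatch, { maxBatchSize = 10, isOnline = () => true } = {}) {
  let buffer = [];

  // Sends all buffered events in one call, unless the client is offline.
  function flush() {
    if (buffer.length === 0 || !isOnline()) return;
    const batch = buffer;
    buffer = [];
    sendBatch(batch);
  }

  return {
    // Buffers the event and flushes once the batch size is reached.
    add(event) {
      buffer.push(event);
      if (buffer.length >= maxBatchSize) flush();
    },
    flush,
    size: () => buffer.length,
  };
}
```

In a browser, `flush` would probably also be called on a timer, when the page is hidden, and when the `online` event fires, so that small or delayed batches are not lost. The buffer in this sketch lives in memory only; surviving a page reload would require persisting it.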
Due to time constraints, the code is not yet covered by tests. In future versions of the project, all of the aspects mentioned above will be improved or developed further.