Sharing Is Caring: Our First Dataset Release for the AI Community

Today, we are beyond thrilled to announce the release of our first anonymous shopping dataset to the AI community. This release follows the recent publication of our article on intent detection in the peer-reviewed journal Nature Scientific Reports. Now that we’ve shared the exciting news, we’ll share the somewhat less exciting title of our article: Shopper Intent Prediction from Clickstream E‑Commerce Data with Minimal Browsing Information. We promise it’s great.

To make a long story short, a team of researchers from across academia and Coveo extensively tested the hypothesis that user intent can be detected from clickstream data. More specifically, we wanted to answer the following question: if I record every click on a target ecommerce website, after how many clicks can I reliably predict whether the user is a buyer or just a window shopper?

Our work is mostly about drawing a “map of the territory”, trading off model accuracy for the sake of model simplicity. And a core focus for us is ensuring that our research findings are actually valuable for real shops out there – and not just for the Amazons and Rakutens of the world.

If this sounds complicated, we have some background reading for you:

Read: Powerful Personalization in Ecommerce – No Big Data Required


So what’s in this dataset of ours? More than 5 million individual shopping events from an “average” website. The events are partitioned into sessions, and each session includes dwell times and product information.

By releasing it to the world, we hope to fuel more research into digital user behavior – prediction, recommendation, personalization. We also want to set a good example for how to democratize cutting-edge applied machine learning, by providing the community with not only ideas (published papers), but also the means to act upon them (research-friendly datasets).

The dataset has been released with very friendly T&Cs: fair research use is permitted provided that (1) basic precautions are observed and (2) any derivative work that uses the data and/or ideas from the research paper cites the original work.

Given its structure, the dataset is interesting for many models and use cases, not just the original time-series classification task. Some of the many ways it can be utilized include:

Replicating our results and evaluating model effectiveness by simulating a conversion rate similar to your shop (following the methodology we put forward in the paper);
Improving on our models, with the usual two tricks: adding more features and/or testing different architectures;
Exploiting the rich dataset to investigate other, complementary tasks, e.g. next-event prediction.

But, as always, the most interesting uses are the ones we haven’t even thought about yet: surprise us!
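
If you want a starting point for the first use case in the list above, here is a minimal sketch in Python. It shows two ideas from this post: keeping only the first k clicks of each session (the "after how many clicks" question) and resampling buyers so the data roughly matches a target conversion rate. Everything in it is an assumption made for illustration: the `Session` representation, the event names, and both function names are hypothetical and do not reflect the actual schema of the released files, and the simple downsampling shown here is not necessarily the procedure used in the paper.

```python
import random
from typing import List, Tuple

# Hypothetical representation (NOT the released schema):
# a session is a list of event names plus a flag saying whether it ended in a purchase.
Session = Tuple[List[str], bool]


def truncate_sessions(sessions: List[Session], k: int) -> List[Session]:
    """Keep only the first k events of each session (early-prediction setup)."""
    return [(events[:k], bought) for events, bought in sessions]


def resample_to_conversion_rate(
    sessions: List[Session], target_rate: float, seed: int = 0
) -> List[Session]:
    """Downsample buyer sessions so they make up roughly target_rate of the result.

    All non-buyer sessions are kept. If there are too few buyers to reach the
    target, every buyer is kept and the resulting rate is lower than requested.
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")

    buyers = [s for s in sessions if s[1]]
    browsers = [s for s in sessions if not s[1]]

    # Solve b / (b + len(browsers)) = target_rate for the number of buyers b.
    wanted = round(target_rate * len(browsers) / (1 - target_rate))
    wanted = min(wanted, len(buyers))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(buyers, wanted) + browsers
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    # Toy data: 40 buyer sessions and 80 window-shopper sessions.
    toy = [(["view", "add", "purchase"], True)] * 40
    toy += [(["view", "view", "view", "view"], False)] * 80

    shop_like = resample_to_conversion_rate(toy, target_rate=0.2, seed=1)
    first_two_clicks = truncate_sessions(shop_like, k=2)
    rate = sum(bought for _, bought in shop_like) / len(shop_like)
    print(f"{len(shop_like)} sessions, conversion rate {rate:.2f}")
```

Downsampling buyers (instead of duplicating window shoppers) keeps every session in the sample unique, at the cost of discarding some data. A classifier trained on the truncated sessions can then be evaluated for several values of k to see how early the prediction becomes reliable.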

Data Details

Original research: Shopper Intent Prediction from Clickstream E‑Commerce Data with Minimal Browsing Information

Authors:

Borja Requena – Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology
Giovanni Cassani – Department of Cognitive Science and Artificial Intelligence, Tilburg University
Jacopo Tagliabue – Coveo AI Labs
Ciro Greco – Coveo AI Labs
Lucas Lacasa – School of Mathematical Sciences, Queen Mary University of London

Dataset Information: GitHub

Dig Deeper

To learn more about how to personalize the online shopping experience without a ton of data, check out some of our previous blog posts. We promise they’re great too:

Solving the Cold Start Problem

(Definitely Not) Lost in Translation: ‘Translating’ Products for Multi-Brand Personalization

How to Grow a (Product) Tree: Building personalized category suggestions with Ludwig

Mind-reading as you type: how to provide personalized query suggestions to new shoppers

Making computers as smart as…babies: What Artificial Intelligence can learn from Diaper Intelligence

Clothes in Space: Real-time personalization in less than 100 lines of code

Acknowledgments

All authors wish to thank Richard Tessier and the Coveo legal team for supporting our research and believing in this data-sharing initiative.