I, Sitecore: Integrating machine learning

August 26, 2014

I've more recently released a video recap of the presentation I gave to the New England Sitecore User Group.

Homage

You've unlocked a door with a key of imagination. Beyond it is another dimension. A dimension of clouds. A dimension of sites. A dimension of mind. You're moving into a land of both shadow and substance, of things and ideas. You've just crossed over into, the Sitecore zone. (shocking music)


On a rather nice day in November, at a bar in Cambridge, after an event known as the Laidlaw Classic (annual birthday football game), an Oasis employee mentioned to me that he was taking a class on Genetic Algorithms. After questioning him to death, since I'd never heard of it, I looked up what he recommended and found a hello world genetic algorithm on GitHub. With it, I was able to take it apart and understand the mechanics. After finally realizing what it was doing, I was able to hotwire it, abstract it a bit, tweak it, and make it provide content for a website based on what type of links I'd been clicking.  

If this sounds like I'm headed in a particular direction, then you're right, it's personalization. I've written before about some aspects of personalization dealing with the rules engine to provide content but this could prove to be much easier. Instead of several steps including analyzing your analytics, a task list better managed by an algorithm, marketers and content editors would be free to focus on what they do best: creating killer content. In turn the technical tasks would fall to those who prefer to do them: developers.  

At the time of the release of the article, what I've built isn't ready to be deployed. It's a prototype. This is an open call to developers who are interested, to take some time and experiment with it. There's a few different uses for an algorithm like this that I can think of (Personalization, Device Detection, Caching and Security) but I'm sure there's more. Also this project isn't tied to Sitecore's personalization engine. It is free standing and the algorithm data and click events are only stored in session. It was a lot easier to build the protoype by limiting my scope and there are still some problems yet to be solved for this to be a package-able, turnkey solution. 

That being said, let's not be deterred. Let's shuffle on. 

Machine Learning

Ok, so at the core of this technology I'm using a Genetic Algorithm. This is branch of Machine Learning which is a branch of AI. It's not gonna get up and walk away on it's own but it can traverse a large field of data that you yourself couldn't. The field of data I'm referring to is the combination of all your users, their interactions and all the content available to them to view. Given a sufficiently large user base and content repository, you're looking at quite a large number of combinations. 

Manually creating personalization schemas will take you forever to get the level of detail the algorithm can produce. Sure, the algorithm isn't perfect but it will improve over time and since a user can change it's mind at any given time, what constitutes a "good" solution is a moving target. This is the real motivation for allowing an adaptive algorithm to do the work. Instead of constantly tweaking and tuning your personalization, the algorithm will continually adjust itself, leaving you free to spend the day doing more productive things.   

Genetic Algorithm

So what's going on behind the scenes? How does the algorithm work? Well, it's using a combination of random chance and probability. It's using random chance to create variability (randomized solutions). Then it's using probability to focus those variations into a productive direction. There's a number of percentage values that are used by the algorithm to set limits on things like mating, mutation and elitism. I've also added a few into my implementation for additional functionality. 

Populations

The algorithm starts by generating a population of entities known as Karyotypes. These entities will have a sequence of Genes. Genes can represent anything. In this case, they represent content categories; tags. The number of genes relates to the number of renderings you need filled. The Gene list of tags tells you what category of content to provide for the renderings. This is so that each rendering can have its datasource set with content from that specific content category. Given two renderings and three categories, an entity's Gene could look something like this: Category1-Category1 or Category3-Category2. The population will start with around 100 different entities and each entity will be a complete random mix of tags for genes. Over time, the population will partition a set of entities that best represent your clicking behavior. 

Fitness

The way the algorithm determines which entity to use as a solution, is to compare their fitness. Fitness can be anything you want but in this case, I've decided that the the total clicks for a category is it's value and that list of categories an entity contains can then be summed up. In my sample site (provided in the source code) I've used colors as tag categories. This was to make it easier to visualize what was happening. As the user clicks on navigation items, values are stored by category. The more clicks, the more valuable it is. The image below shows how a number of clicks for colors would be used to measure the fitness of entities for a page with 2 renderings. For example, since Red was clicked twice, it's value is 2 and a if an entity is Red-Red, then it's sum is 4. 

fitness

In this algorithm, I'm shrinking the value of a click over time so that more recent activity is more valued than further in the past. Big shout out to Andrew Blackmore at Velir for his post that I used as a model. By default it will not reach zero but a fraction of the original value. Only when you've clicked on something a sufficient number of times will the value remain above the threshold. In the image here, you can see the decay of two separate clicks on Red and how on the second click, the value begins to grow:

fitness threshold

Genetic Algorithm Lifecycle

The fitness values are used to sort the entities and provide the best solution. The algorithm then starts to mate (mix genes) between entities and then re-sort. This is repeated until an "optimal" solution is found but in this case the "optimal" solution changes with the user's habits. 

ga lifecycle

Elitism

A percentage of the population at the "most fit" end is not replaced during the mating process so that the population will maintain its fitness or improve but never decline. This percentage is known as the elites.  

Mating

Mating will loop through proletariat (non-elites) in the population and randomly select a partner to mate with and replace itself with the child. The probability that a proletariat will end up mating is determined by the Crossover Ratio, which generally is greater than 50% since you want a good chance of mating. If an entity is chosen to mate, it randomly selects a partner and continues to try looking for a more fit parent a certain number of times. The number of times it will try is the Tournament Size. During this process it's possible the other parent could end up being an elite. When mating finally does occur, the process of mixing the genes between two parents, in a biological sense, is known as Meiosis.

 

elitism

Meiosis

To understand Meiosis you'll have to know a bit more about how the Genes are stored in an entity. This diagram shows the class hierarchy within my version of the algorithm. Each subsequent row expands the object into it's components: 

relationship

For an individual entity (Karyotype) there are two Haploids, one from each parent. They are made up of a Dictionary of Chromosomes. Each Chromosome is a List of Genes. Meiosis produces a new Haploid by randomly mixing the two in itself. This means that each parent will provide a completely new Haploid for a new entity. Here's an example of how a single Chromosome is combined:

meiosis

Phenotype

When a new entity is created, it needs to determine what characteristics it will display. Only half of it's available Genes will be used. The other half remains hidden. The half that is shown is known as the Phenotype. It will compare the Genes from both parent by using the Dominance setting to determine the Phenotype. The Phenotype of an entity is what is used to judge its fitness. Each Gene has a Dominance setting, which is set randomly in population initiation. If the Gene's Dominance settings are the same, one is chosen randomly, but if they differ, the Dominant wins over the Recessive.  

phenotype 

Mutating

There's also a bit of random mutation thrown into the mix which provides yet more variability but too much can cause instability. It, like mating, is reserved for the proletariat section of the population. The Mutation Ratio is the probability that an entity will mutate.

mutation

Page Lifecycle

So now that you know how the algorithm works, how does it apply this to the page content? The crux of the situation is setting the datasource for a rendering to a value defined in the config and then during a pipeline processor, the datasource is switched out with a content item based on what the algorithm has chosen. Then of course there's a click event handler, which stores what tags have been clicked. Here's a high level breakdown of the chain of events. 

page lifecycle

For a developer, you'll first want to create an algorithm definition for your site. This contains all the values needed for the algorithm to run and then add an attribute to your site node that tells it which algorithm definition to use. Each site can have it's own or they can share. From there, you'll just need to determine which renderings you'll want this be used on and set the datasource to a common value in the config. Above and beyond setting up the algorithm, you'll also need to have a tagging system. At least a folder full of tags and a field on both content items and navigation items. I don't recommend tagging any more navigation than you need to. Top level tagging is really enough. The concept of having the algorithm choose sub-category content would require a separate list of click event data and running the algorithm on it separately. I'm not sure how computationally expensive that will be but I imagine it could grow to be quite taxing. For now it's just a top-level category selector. 

For a content editor, their life is actually pretty cushy. All they need to do is create tags, tag content and then tag the navigation to line up with the content. The goal really was to make this a more user friendly experience.

The 411

Ok, so I've got the source code for my project, called SitecoreGeneticAlgorithm, on GitHub. It's broken up into several C# projects: GA.Nucleus (genetic algorithm), GA.SC (Sitecore specific adaptation), GA.SC.UI (Sitecore specific config, web service, layouts, and packages) and GA.SC.UI.SampleSite (generic site adaptation). The sample site should be replaced by your web app. It doesn't take a lot to wire up the system and it provides a working example of what you'll need to do. Here's a breakdown of the libraries starting at the top:

GA.SC.UI.SampleSite

The project really only contains a few files with little custom code that you'll need to change.

 

GA.SC.UI.SampleSite.config

The config is a standard site config definition with one additional attribute: gaName. This tells the system which algorithm definition to use. In this case, it's set to use the "gamanager", which is used in a diagnostic tool which I'll explain in the next library. You will likely want to create your own algorithm definition and tweak the values from there.

GASampleSite-releasedate.zip

The GASampleSite*.zip is the installation package for the site including the content sublayouts and templates.

 

GAMaster.aspx

The GAMaster.aspx is standard but with one addition: the click event handler. The way that user clicks are registered is through a web service. I'm connecting to it using jQuery. 

BaseSublayout

This class is used for easy access to the datasource.

 

GADrivenContent.aspx

The GADrivenContent.aspx is actually completely standard. The name implies it does more than just simply display the datasource but it doesn't. It's only responsible for displaying the datasource that gets set. Keeping this agnostic will require the search part of the Sitecore adaptation to be tweaked to match content types to rendering types. Currently it's expecting a known type. 

 

GAPage.aspx

The GAPage.aspx actually only displays the field content and doesn't have any algorithm specific function.

 

TagUtil.cs

The TagUtil.cs is used on the GADrivenContent page to retrieve the tag value from a datasource. It does this by taking the "Tag" field (search box field) from the datasource, converting it to items and retrieving their tag names. The "Tag" field is used to pair content with navigation elements so that when a navigation element is clicked a corresponding content item can be chosen. An example would be to have navigation elements like electronics, home goods and books. The electronics link would be tagged with "electronics" and then computers, phones etc. would also be tagged as "electronics" so that when the electronics section was clicked on, more computers, phones etc. would be displayed.

GA.SC.UI

This project also doesn't contain many files. I don't forsee a lot of changes to what's already there but there will likely be additions over time.

GA.config

The GA.config file contains all the information needed to run a Genetic Algorithm for Sitecore as well as class references for the algorithm, engagement values and a pipeline event that sets datasources on renderings. Here's a breakdown of the values:

  • tagFolder - The ID of the folder (Bucket) where tags are stored
  • contentFolder - The ID of the folder (Bucket) where content is stored
  • contentTagField - The field on the content items that tags are set
  • datasourceValue - The value of the datasource field that indicates it should be replaced
  • chromosomeName - The name of the chromosome in the Haploid dictionary for an entity (only works with one so far)
  • populationType - The class type of the population
  • karyotypeType - The class type of the karyotype
  • evProviderType - The class type of the EV Provider
  • crossoverRatio - The probability the proletariat will mate
  • elitismRatio - The percentage of population that is elite
  • fitnessRatio - The percentage of the most fit value that is allowed to be chosen as a solution
  • fitnessThreshold - The threshold the combined click events must be to initiate the algorithm
  • mutatioRatio - The probability the proletariat will mutate
  • tourneySize - The number of times a better parent is searched for
  • popSize - The size of the population
  • evModifier - Defines how click events decay over time
    • minRatio - The lower bound percentage a click value will reach
    • decimalPlaces - The number of decimal places a value should be calculated to
    • timespanAspect - The time frame the half life is based on (milliseconds, seconds, minutes, hours, days) 
    • halfLife - Half life in terms of Timespan Aspect. IE: 2 days. 

 

GAManager.aspx

The GAManager.aspx is a diagnostic tool to help understand the mechanics and test the algorithm without having to implement a site. 

 

GAModule-releasedate.zip

The GAModule*.zip installs the GAManager under /sitecore/system/Modules and the core content items, templates and sublayouts it requires. 

 

EventTracking.asmx

The EventTracking.asmx is the web service that is being called by the Sample Site to store click event data. 

GA.SC

This project does contain a good deal of the functionality and will be where most of the changes will be coming to. 

EngagementValues (EV)

There are three classes: DefaultEngagementValueProvider, DefaultValueModifier and EngagementValue. Then there are three interfaces these are based on so that you can provide your own. 

An EngagementValue stores a Value and a DateTime. The DateTime can be used to determine if a value is still relevant.

The DefaultValueModifier will take each EngagementValue and measure the value over time on a curve. The properties for this curve can be set in the config. The concept for this was pulled from Andrew Blackmore's post on Velir's blog. The idea behind it is to make a user's older clicking habits obsolete at some point in time. By gradually bringing the value down to zero, you're able to work against current activity. My particular implementation doesn't bring them to zero but instead a fraction of the original value. This allows the algorithm to still imply a value on something that has been clicked a lot in the past but not recently. Calculating values on a curve is an increasing taxing task and I currently am disabling it because it's too computationally intensive. It will likely be moved to a separate server to be run and queried.

The DefaultEngagementValueProvider is using a static (singleton) to store the values. This is hacky and will be replaced with a provider that supports a database for querying. 

Pipeline

There is an InsertRenderings pipeline processor called SetGADatasource, that fishes out all the renderings who need their source set by the genetic algorithm. It starts the algorithm and then tells it how many genes it needs (length of chromosome). Then provides it all the tags available to use as genes (IGene). The combination of the length and content for an entity is the Genotype. From a biological perspective, I believe I am using this correctly but it might be off. Feel free to correct me.

Web Service

The way that click events are tracked is by hitting a web service that stores them. This is implemented in using jQuery in the Sample Site. 

ConfigUtil

All of the config values are accessible through this class.

Algorithm Specific

The PageKaryotype, SCPopulation and TagGene are the implementation of the core algorithm library. The PageKaryotype is an entity in the SCPopulation. It defines a Fitness method using the TagGenes and EVs to sort the entities, which is the heart of how to guide the algorithm. 

The SCPopulation class manages storing/retrieving the population in session. This will eventually be changed to a provider system so it can be stored in session, file or database. Otherwise it doesn't actually modify the functionality in the DefaultPopulation class that it extends. 

The TagGene is the implementation of the IGene interface by providing the tag name as the GeneID. Genes are the building blocks of the life-imitating entity that is created. 

GA.Nucleus

This is, as the name implies, the core of the algorithm. The focus of this class is around Genes, Chromosomes, Haploids, Karyotypes and a Population. The biological jargon that I'm using is an attempt to be accurate to the model of nature, not solely as an attempt to be nerdful. But definitely somewhat because I'm a nerd... Aaaanyway.

Populations and Karyotypes

An entity in a Population is a Karyotype. The Karyotype is made up of two Haploids and a Gender(bool). The Gender prevents mating with oneself. It's two Haploids are from it's parents. Both the mother and father provide mixtures of their Haploids. 

Haploids

A Haploid is a dictionary of Chromosomes which adds depth to the detail an entity has. The sample site only uses one: PageContent. Although there could be multiple chromosomes for a page to represent different page areas (right and left column) or the types of content for a page (ad, form, informational). This is the base for institutional memory. The state of this information at any given time is the result of the time provided for it to traverse the information you've provided, the group of entities that constitute progress (elites) and the mixture of random possibilities (proletariat).

Chromosomes

The Chromosomes are a List of Genes. In the sample site the List of Genes are a list of tags that tell you what category of content to provide for that rendering. The content is currently done by simple Sitecore Query but will be replaced by an implementation of a search engine, like Lucene. 

Genes

Genes are only responsible for having an ID which is used to identify what it represents and a Dominance property which is used when combining the genes of the parents to know what it will express in the Phenotype.

Phenotype

The Phenotype is the set of Genes an entity expresses. It is used to judge it's fitness. It is a Haploid that is defined by comparing the parent's Genes and their Dominance. These are all set randomly in the population initiation. A combination of two Dominant Genes will choose one by 50/50 chance. Same if both are Recessive. But in a match between Dominant and Recessive, the Dominant will win out. The purpose for adding this into the system is to increase the amount of variation. An entity won't only have a set of Genes that it's displaying but will also contain an entirely hidden set. This information, although hidden, can resurface in a child upon Mating. It really allows a small population the ability to adapt. 

Genotype

This stores the length and list of Genes to be used per Chromosome. The GenotypeList is a Dictionary of Genotypes that pairs with a Haploid. When the Haploid is created it uses the matching Genotype as a guide for what to construct.

Overall Usage

The Population, Haploid, Chromosome and Gene all provide a default class that can be used. The only thing you'll need to create, is an extension of the abstract class BaseKaryotype. I've provided this implementation in the GA.SC.PageKaryotype. This implements the fitness function and defines what makes a good result. These also are based on an interface so that you can replace them with your own implementations if desired. 

Eventually population size becomes a limiting factor. Data storage and algorithm processing will grow with a population. I have not yet found this limit and don't have recommendations on what is required to support any given number of users.

Another limiting factor is the Engagement Value Modifier. When calculating values on a curve, eventually the cost of measuring the asymptote can be demanding on a processor. There may be ways to cut it short through the calculation or simply by not calculating values past a certain date. It's likely this will need to be passed off and run as a batch on another machine to work at scale.

The end? Hardly. 

Although that's the end of this article, that's not where this ends. This is only the beginning of a long and interesting journey. I'm not going to stop until some robot from the future and crazy lady come busting into my home trying to make me stop. Your move universe. (squints at the sky)

References

In no particular order: