
Catalog: Using Elasticsearch to Build Content Personalization


At Iterable, we want to empower our customers to embrace truly personalized content in marketing campaigns. Customers can now upload non-user data to Iterable Catalog and run powerful queries over it. They can build Catalog Collections using individual recipient data as parameters of the collection.

For example, a food delivery service can use Iterable's Catalog feature to create a "new restaurants near you" Collection and send out a campaign that finds newly opened restaurants within a mile of each user's location. Catalog Collections are fully managed by the marketer, enabling complex personalization with zero engineering resources.

The idea of storing non-user data inside Iterable is not new. Historically, such data was stored as S3 blobs, making it accessible only by ID. With Catalog, we've moved this data into the same Elasticsearch backend powering Iterable's user and event segmentation. The new backend supports rich query semantics on all properties of a catalog item, including things like geolocation, star rating, and product category.

Technical Requirements

The technical requirements for Iterable Catalog include:

  • Legacy system feature parity, plus data migration and support for conflicting data types
  • Fast write speeds for small documents
  • Scalable fast and complex queries (up to tens of thousands per second)
  • Partial document updates

High-level Architecture

Iterable has never restricted the kind of data each customer can store in any Catalog. It's important for us to store whatever data our customers need, whether that's restaurants, menus, product inventory, songs, etc. The Catalog ingestion pipeline is fairly simple: requests come in via Iterable's asynchronous API and are published to Kafka; Kafka consumers transform, batch, and index the requests into Elasticsearch (ES).

We use Kafka and Akka Streams in our ingestion service to batch updates together, for less I/O and faster writes into Elasticsearch. This was something never implemented in the S3 version. We also created new bulk operation endpoints at the Iterable API level to make it easier for customers to feed in more data.

Elasticsearch is a great tool for this system because it's a distributed NoSQL search engine that can perform many small, fast queries. Its indices simply store JSON documents, and it has a powerful query DSL that can express complex and granular queries against those indices.

This pipeline has a similar shape to our user and event ingestion pipeline. You can read about how we optimized its performance here.

Elasticsearch Mappings and Types

In order to optimize query speed against the fields of each catalog item, ES needs to know what type each field is: does it represent a number, a string, an object, a date, or location data? At index time, ES builds secondary data structures based on the type of each field in order to perform fast lookups at query time. The type of a field also determines the class and nature of queries that can be performed against it. Elasticsearch is smart; by default, it looks at the first occurrence of a field in an index and creates a mapping for it based on the inferred type. Once it has inferred that a field is a number, future occurrences of that field must all be numbers, or the update will be dropped.

However, we have frequently seen customers make mistakes when sending us data, and Elasticsearch will create mappings based on those mistakes. When this happens, future "corrected" document writes may not get stored because they no longer match the original document type. Another consideration is that the legacy data in S3 was never strictly typed, and that system didn't care if a customer changed a field's type. Perhaps one document has {"serialNumber": 123}, and another has {"serialNumber": "123four"}. The difference is subtle, but this state is not allowed once the documents are migrated into ES. Is serialNumber a string or a long? It would depend on which document was indexed first. This non-deterministic behavior is clearly undesirable and must be avoided.

Controlling Mappings

Automation is great… until it isn't. We realized that we need to prevent Elasticsearch from wrongly inferring mapping types. We want our customers, who understand their data best, to tell us explicitly what type a field is, if and only if they want to perform queries against that field. Until they define the field mappings, those mappings shouldn't exist in Elasticsearch. We didn't want to introduce a whole new source-of-truth database or storage system for pending mappings data, because it's impossible to keep an external system consistent with the asynchronous, eventually consistent nature of Elasticsearch. We hoped to leverage Elasticsearch itself to store this data.

At a high level, we achieved this by breaking each document into raw and defined parts, storing both inside Elasticsearch. For each catalog, we bootstrap an Elasticsearch index with a dynamic template that turns off automatic type inference for raw fields and instead designates each one as a disabled object:
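A minimal sketch of such a dynamic template (the template name raw_as_disabled_objects is illustrative, not the exact one from our codebase):

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "raw_as_disabled_objects": {
          "path_match": "raw.*",
          "mapping": {
            "type": "object",
            "enabled": false
          }
        }
      }
    ]
  }
}
```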

Let’s break this down:

  • path_match performs a comparison against the JSON path of a field.
    • For example, in {"foo": {"bar": true}}, the path of bar is foo.bar
  • Any field whose path begins with raw will have the following properties applied:
    • "type": "object"
    • "enabled": false
  • Together, these mean that any child of raw can itself have children, but ES builds no supporting data structures for the field or its children, so none of the data in the field is enabled for searching.
  • We can still store the raw data, and at the mappings level we have a record of all the field names that exist.

Document Transformation During Ingestion

Let's use a concrete example to describe what happens during Catalog updates. Say a customer creates a "Restaurants" catalog. Iterable will make sure the catalog has the required dynamic template as described above.

The customer can now send a new catalog item update to the "Restaurants" catalog, for id uuid123:
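For illustration, assume the item payload looks like this (the field names mirror those used later in this post; the values are hypothetical):

```json
{
  "name": "Fried Chicken Palace",
  "category": "fried chicken",
  "grandOpeningDate": "2019-05-01",
  "location": {"lat": 40.01, "lon": -70.01}
}
```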

Iterable's Kafka consumer (the ingestion service) actually transforms each original document by wrapping it inside a raw field, then sends it to ES to be stored as the following:
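Assuming an item with fields like name, category, grandOpeningDate, and location, the stored document would be nested under raw:

```json
{
  "raw": {
    "name": "Fried Chicken Palace",
    "category": "fried chicken",
    "grandOpeningDate": "2019-05-01",
    "location": {"lat": 40.01, "lon": -70.01}
  }
}
```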

When ES stores the transformed document, it updates the mappings for the "Restaurants" catalog to include the new raw fields:
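A sketch of the resulting mappings (abridged; per the dynamic template, each child of raw is recorded as a disabled object):

```json
{
  "properties": {
    "raw": {
      "properties": {
        "name":             {"type": "object", "enabled": false},
        "category":         {"type": "object", "enabled": false},
        "grandOpeningDate": {"type": "object", "enabled": false},
        "location":         {"type": "object", "enabled": false}
      }
    }
  }
}
```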

Note that although the fields usually are not outlined with their pure varieties, their existence is recorded, and any discipline’s information could be overwritten with out affecting every other discipline in a subsequent partial replace name.

Again, none of these fields can be searched at this point. They're all children of a raw field that the customer didn't create. However, the customer can still retrieve their data by performing GET requests on the id of the restaurant. In this case, they'd retrieve the data for uuid123, and we would simply return the contents of the raw field on the document with that id.

Because the data in each of these raw fields is an "unsearchable object," Elasticsearch doesn't care that the true values of these fields may have conflicting types. One can think of this concept like the Object super type in Java.

Any and all data can be upcast, so to speak, to an "unsearchable object." The problematic state described above, one document having {"serialNumber": 123} and another having {"serialNumber": "123four"}, is fully supported: both 123 and "123four" are "unsearchable objects." With these simple rules and transformations, we have replicated the functionality of our S3 system and can migrate all legacy data into the new database.

Allowing Customers to Explicitly Define Mappings

It's nice to have legacy feature parity, but now we have to improve on the experience. What do we do if we want customers to perform queries on their restaurants? There will clearly be more than one item in each catalog, and we want customers to search for only the items that are relevant. Recall that before allowing searching on a given field, ES needs to know the field's true type; it can't be an unsearchable object.

We created both API endpoints and a user interface that let customers see existing defined and undefined fields, as well as define previously undefined fields. A customer can define fields that may or may not already exist on an item in the catalog, so if they'd like, they can define mappings before ingesting any data.

Keeping with the above example, let's say the customer is now sure that location will always be a lat/lon geographical field, and they want to query on that field. They can use our "update mappings" endpoint to send us a definition for the location field, and we'd update the ES mappings for the catalog accordingly to look like this:
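Again a sketch: with location defined, the mappings gain a typed field alongside raw:

```json
{
  "properties": {
    "location": {"type": "geo_point"},
    "raw": {
      "properties": {
        "name":             {"type": "object", "enabled": false},
        "category":         {"type": "object", "enabled": false},
        "grandOpeningDate": {"type": "object", "enabled": false},
        "location":         {"type": "object", "enabled": false}
      }
    }
  }
}
```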

Observe that there is now a new location field that is a sibling of the raw field, and it has an explicit type: geo_point. Once location is known to be of this type, all subsequent updates to the location field in the "Restaurants" catalog will populate efficient data structures for searching or aggregating against this field's location data. Customers can then perform searches like "find all restaurants whose location field is within 1 mile of a lat/lon point."

Also note that we didn't (and can't) remove the raw.location field. Technically, all defined fields exist twice within the mappings.

Ingestion With Defined Types

Once explicit mappings exist in the catalog, it's insufficient to transform the original document by merely wrapping the data inside the raw field as we did before. We also need to duplicate the defined fields' data to match the mappings: for every defined field, we extract its data into a top-level field. So if the user sent the uuid123 document with the same data again, our ingestion pipeline would now transform it to look like this:
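Continuing the hypothetical restaurant payload from earlier, the defined location field is copied out of raw to the top level:

```json
{
  "location": {"lat": 40.01, "lon": -70.01},
  "raw": {
    "name": "Fried Chicken Palace",
    "category": "fried chicken",
    "grandOpeningDate": "2019-05-01",
    "location": {"lat": 40.01, "lon": -70.01}
  }
}
```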

Any queries against location will be performed against the outer location field.

Caching Mappings

Those familiar with Elasticsearch may have noticed that we just introduced a new bottleneck into our ingestion pipeline: each Kafka consumer has to perform an ES get-mappings call for every catalog item update we receive! This is a very expensive call that can easily overwhelm Elasticsearch at scale, so we wanted some sort of cache for these mappings.

Here are some observations and requirements for the cache:

  • Each Kafka consumer can index into multiple catalogs; each catalog has its own mappings
  • Fields and their types in each mapping cannot be modified, but new fields can be added
  • Because customers explicitly update mappings, the cache must never serve stale mappings data

The mappings cache we built deserves a much more detailed deep dive in a future blog post. This post covers only the high-level solution we implemented for the mappings cache system.

Each consumer holds an in-memory cache mapping catalog id to mappings. A separate Redis instance holds information on when each local cache needs to invalidate a Catalog entry. Before a consumer reads the field type mappings for a particular Catalog, it checks Redis to see if it needs to invalidate its local cache entry for that Catalog.
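As a rough Python sketch of this check-before-read pattern (a plain dict stands in for Redis, and fetch_mappings stands in for the expensive ES get-mappings call; the real implementation differs):

```python
class MappingsCache:
    """Per-consumer cache of catalog id -> mappings, invalidated via a shared version store."""

    def __init__(self, version_store, fetch_mappings):
        self.version_store = version_store    # stands in for Redis: catalog id -> version number
        self.fetch_mappings = fetch_mappings  # stands in for an ES get-mappings call
        self.local = {}                       # catalog id -> (version, mappings)

    def get(self, catalog_id):
        # Check the shared store first so we never serve a stale entry.
        current = self.version_store.get(catalog_id, 0)
        cached = self.local.get(catalog_id)
        if cached is not None and cached[0] == current:
            return cached[1]  # local entry is still fresh
        # Miss or invalidated: pay for the expensive call and refresh the local entry.
        mappings = self.fetch_mappings(catalog_id)
        self.local[catalog_id] = (current, mappings)
        return mappings
```

When a customer defines a new field, the mappings-update path bumps the catalog's version in the shared store, so every consumer refetches on its next read.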

Catalog Collections

At this point, we have everything we need to store our customers' Catalog data. They can easily index and get documents as they always have, but how do they actually build Collections, or subsets of their Catalogs? Our front-end team built a powerful and beautiful UI for defining and saving Catalog Collections. Under the hood, a Catalog Collection is simply a data model representing Catalog search criteria, which can be translated into an Elasticsearch query at campaign send time. It can have placeholder "dynamic" values that are resolved from fields on a user profile.

An example Catalog Collection could be defined by the following search:

New fried chicken restaurants within 5 miles of each user.

In the UI, the marketer composes this search with the Collections builder.

For the "Restaurants" Catalog, a pseudo-model for this search could be defined as something like:
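A hypothetical pseudo-model (the key names here are invented for illustration and are not Iterable's actual schema):

```json
{
  "catalog": "Restaurants",
  "criteria": [
    {"field": "category",         "match": "fried chicken"},
    {"field": "grandOpeningDate", "range": {"gte": "now-30d"}},
    {"field": "location",         "withinRadius": "5mi",
     "of": {"dynamic": "user.home_location"}}
  ]
}
```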

Those familiar with ES will recognize that most of this translates easily into a bool query with a must clause and a filter. The first element in the must clause is a term or terms query against the category field, with "fried chicken" as the value. It asks whether "fried chicken" appears in the category field of any document.

The second query is a date range query where the grandOpeningDate value is greater than 30 days ago. The geo_distance query checks the location field to see if it's within a 5-mile radius of lat/lon geo data retrieved from the recipient profile's home_location field. If home_location on the recipient's user profile resolves to {"lat": 40, "lon": -70}, the query against the index backing the "Restaurants" catalog might resemble:
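A reconstruction of that query under the assumptions above (the exact clause placement in our production query may differ):

```json
{
  "query": {
    "bool": {
      "must": [
        {"term":  {"category": "fried chicken"}},
        {"range": {"grandOpeningDate": {"gte": "now-30d"}}}
      ],
      "filter": {
        "geo_distance": {
          "distance": "5mi",
          "location": {"lat": 40, "lon": -70}
        }
      }
    }
  }
}
```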

Results and Future

All non-user/event data inside Iterable now lives in this new Catalog system. We were able to seamlessly migrate customer data from S3 without data loss or customer interruptions. At the time of writing, the Catalog feature has just been released to all customers, and since its announcement we have seen tremendous interest in Catalog and Catalog Collections from existing and potential customers.

In the future, we see this becoming a fully automated content recommendation engine that our customers can plug into. Technically, there are a few more things we can add to our ingestion pipeline to cover corner cases that we know Elasticsearch can't handle well. We may also add downstream publishes to feed data into Spark or another machine learning streaming system for model training. And we can add ways for customers to provide more information about each catalog, so that Iterable knows how to most efficiently spread their data across Elasticsearch nodes and clusters.

Check out the feature in action in our 2019 Activate demo to see how Iterable can power personalization in growth marketing.
