Recipe Recommendation Engine with Traditional ML + Genie on Databricks Free Edition
For the Databricks Free Edition Hackathon, I wanted to show that traditional machine learning still has a big role to play today, and how it can work hand in hand with Databricks’ newer AI tooling. As a concrete use case, I built a recipe recommendation engine that suggests relevant recipes to users: classic natural language processing (NLP) and topic modelling structure the data, and AI/BI Genie helps surface that value for end users. Both approaches work together rather than replacing one another.
I have always been interested in using NLP tools to analyse classical Arabic texts, but I had never built an end to end solution in Databricks that brings an NLP pipeline to life. This felt like the perfect opportunity to do exactly that.
What I Built
I built a recipe recommendation engine with the following components:
- Lakeflow Declarative Pipelines (LDP) to ingest and prepare text data using a medallion architecture
- PySpark ML with an LDA topic model to discover themes in recipes, packaged as a Databricks job
- AI/BI Genie on top, to explore the data and get recipe recommendations via natural language
All of this runs on Databricks Free Edition.
Data and preparation (LDP + tokenising)
The starting point was a Kaggle recipes dataset with titles, descriptions and ingredients.
Using LDP, I set up a simple pipeline:
- Bronze: ingest the raw data
- Silver: clean obvious issues such as duplicates and missing key fields
- Silver (NLP focused): apply the crucial tokenisation step to the text
- Gold: build aggregates and downstream tables linking each recipe title to the set of words (tokens) associated with it
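The medallion flow above can be sketched in a few lines of pure Python. This is not the actual LDP code (the real pipeline expresses each layer as a declarative table definition), and the field names and sample rows are illustrative only, but the shape of the bronze → silver → gold transformations is the same:

```python
# Bronze: raw ingested rows, warts and all (illustrative sample data).
raw_recipes = [
    {"title": "Spicy tomato pasta", "description": "Pasta with fresh basil"},
    {"title": "Spicy tomato pasta", "description": "Pasta with fresh basil"},  # duplicate
    {"title": None, "description": "Row with a missing key field"},
]

def silver(rows):
    # Silver: drop duplicates and rows missing key fields.
    seen, out = set(), []
    for row in rows:
        key = (row["title"], row["description"])
        if row["title"] and key not in seen:
            seen.add(key)
            out.append(row)
    return out

def gold(rows):
    # Gold: link each recipe title to the set of words associated with it.
    return {row["title"]: set(row["description"].lower().split()) for row in rows}

print(gold(silver(raw_recipes)))
```

In the real pipeline each of these functions corresponds to a table in the LDP graph, so Databricks handles the orchestration and incremental refresh rather than plain function calls.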
By tokenising, I mean:
- Breaking text into meaningful words
- Example: “Spicy tomato pasta with fresh basil”
→ [“spicy”, “tomato”, “pasta”, “fresh”, “basil”]
- Removing noise (stopwords)
- Stripping out filler words like “and”, “with”, “the”. These stopwords do not add much value for topic modelling.
- Normalising similar word forms
- Treating “cook”, “cooks”, “cooked”, “cooking” as variants of the same underlying concept, so the model focuses on meaning rather than inflected forms.
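The three steps above can be sketched in pure Python. The stopword list and the crude suffix-stripping rules here are illustrative stand-ins (a real pipeline would use proper stopword lists and stemming/lemmatisation inside the LDP transformations):

```python
import re

# Tiny stopword list for illustration only.
STOPWORDS = {"and", "with", "the", "a", "of", "in", "to"}

# Naive suffix stripping as a stand-in for real stemming/lemmatisation.
SUFFIXES = ("ing", "ed", "s")

def normalise(token: str) -> str:
    # Map "cooks", "cooked", "cooking" onto the shared stem "cook".
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenise(text: str) -> list[str]:
    # Break text into lowercase words, drop stopwords, normalise the rest.
    words = re.findall(r"[a-z]+", text.lower())
    return [normalise(w) for w in words if w not in STOPWORDS]

print(tokenise("Spicy tomato pasta with fresh basil"))
# → ['spicy', 'tomato', 'pasta', 'fresh', 'basil']
```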
After building the initial pipeline, I did not assume the data was finished. I kept monitoring and iterating:
- I used Genie to ask questions such as “Which very common words bring little or no value to the corpus?” to surface additional stopwords that were not helpful for modelling.
- I generated a word cloud over the tokens and discovered that some values were actually encoded fractions, with literal escape sequences like \u00bd (½) leaking into the text. That insight fed back into my LDP cleaning logic so these artefacts were stripped out.
This reinforced a key point: working with text data is rarely straightforward. You usually need an iterative loop of inspecting the data, tightening the cleaning and tokenisation, and re-running the pipeline until the corpus is genuinely meaningful for downstream machine learning.
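That cleaning fix can be sketched as a small text-scrubbing step. The regex and the fraction-to-word mapping below are assumptions for illustration, not the exact rules in my pipeline:

```python
import re

# A few common fraction escapes mapped to words, so quantity information
# survives; any other matching escape is simply dropped.
FRACTION_WORDS = {"bd": "half", "bc": "quarter", "be": "three quarters"}

# Matches literal escape artefacts such as "u00bd" or "\u00bd" in the text.
ESCAPE_RE = re.compile(r"\\?u00([0-9a-fA-F]{2})")

def clean_escapes(text: str) -> str:
    def repl(match: re.Match) -> str:
        return FRACTION_WORDS.get(match.group(1).lower(), "")
    cleaned = ESCAPE_RE.sub(repl, text)
    # Collapse any double spaces left behind by dropped artefacts.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(clean_escapes("u00bd cup sugar"))  # → 'half cup sugar'
```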
Topic modelling with PySpark ML (LDA)
Before going into the details, it is worth explaining what LDA actually does in plain language.
Latent Dirichlet Allocation (LDA) is an algorithm that automatically groups similar items together based on the words they contain. In this case, the “items” are recipes, and LDA looks at the words used in each recipe to discover themes like “pasta dishes”, “curries” or “desserts” without anyone manually labelling them. Once the recipes were tokenised, I used PySpark ML to apply this classic NLP technique.
At a high level, LDA assumes that:
- Each recipe is a mixture of a few underlying topics, such as “pasta”, “curries” or “baking”
- Each topic is defined by a characteristic set of words that tend to appear together, for example a “pasta” topic might be dominated by words like pasta, tomato, garlic, olive oil
The approach I took was:
- Turn tokens into numeric features. From the token lists, I built a simple count based representation: for each recipe, how often each word appears. This gives the model a structured, numerical view of the language in the dataset instead of raw text.
- Fit an LDA model over the whole corpus. LDA scans across all recipes and learns a fixed number of topics. It does not know about Italian food or desserts upfront; it discovers topics purely from patterns in which words tend to appear together.
- Assign each recipe a topic profile. The model outputs a topic distribution per recipe. For example, a dish might be 70% “pasta/Italian”, 20% “quick midweek meals”, 10% “vegetarian”. This topic profile becomes a compact, semantic fingerprint for that recipe.
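The count based feature step is easy to sketch in pure Python. In the actual pipeline this is done with Spark ML's CountVectorizer before fitting LDA; the recipes and tokens below are toy data:

```python
from collections import Counter

# Toy corpus: recipe title → token list (as produced by the tokenising step).
recipes = {
    "Spicy tomato pasta": ["spicy", "tomato", "pasta", "fresh", "basil"],
    "Garlic shrimp pasta": ["garlic", "shrimp", "pasta", "olive", "oil"],
}

# Vocabulary: every distinct token across the corpus, in a fixed order,
# so each position in a vector always refers to the same word.
vocab = sorted({tok for tokens in recipes.values() for tok in tokens})

def count_vector(tokens: list[str]) -> list[int]:
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = {title: count_vector(tokens) for title, tokens in recipes.items()}
```

Each recipe is now a fixed-length row of word counts, which is exactly the numerical shape LDA expects as input.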
These topic profiles drive the recommendation engine. Recipes with similar topic distributions are treated as similar, so I can recommend dishes that share underlying themes and flavour profiles, not just recipes that happen to have identical ingredients.
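Similarity between topic distributions can be measured in several ways; a minimal sketch using cosine similarity over hypothetical topic profiles looks like this (the titles and numbers are illustrative, not model output):

```python
import math

# Hypothetical topic profiles: fractions over three topics, e.g.
# ("pasta/Italian", "quick midweek meals", "desserts/baking").
profiles = {
    "Spicy tomato pasta": [0.70, 0.20, 0.10],
    "Garlic shrimp pasta": [0.65, 0.25, 0.10],
    "Chocolate brownies": [0.05, 0.15, 0.80],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def recommend(title: str, k: int = 1) -> list[str]:
    # Rank every other recipe by similarity of its topic profile.
    query = profiles[title]
    ranked = sorted(
        ((cosine(query, vec), other) for other, vec in profiles.items() if other != title),
        reverse=True,
    )
    return [other for _, other in ranked[:k]]

print(recommend("Spicy tomato pasta"))  # → ['Garlic shrimp pasta']
```

The two pasta dishes score close to 1.0 against each other while the brownies score much lower, which is the behaviour that lets the engine surface dishes with shared themes rather than identical ingredients.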
Bringing it together with Genie
To make the solution useful beyond notebooks, I added AI/BI Genie on top of the curated Delta tables.
Genie is Databricks’ natural language interface for your data. In simple terms, it lets you ask questions about your tables in plain English, then automatically turns those questions into the right SQL queries and visualisations behind the scenes.
In this project:
- Genie understands the recipe attributes and the topic features created by LDA
- You can ask questions like:
- “Recommend three vegetarian recipes similar to Spicy Chickpea Curry.”
- “Show quick pasta dishes with a similar flavour profile to Garlic Shrimp Pasta.”
Genie converts these prompts into SQL over the Delta tables and uses the topic information to return tailored recommendations. From a user’s perspective, they just describe what they feel like cooking. Under the hood, traditional ML is doing the heavy lifting in a way that is still explainable and auditable.
Why traditional NLP still matters
For me, a big takeaway is that so called “traditional” NLP is still a very powerful way to derive meaning from text. For many client use cases, classic NLP pipelines can be a more efficient and controllable way to process large volumes of text than jumping straight to generative AI.
They are often:
- Cheaper to run at scale
- Easier to explain to stakeholders
- More transparent in terms of what the model is doing and why it produces a given result
In scenarios where you want clear rules and transparency in particular, a classic pipeline can process text more efficiently than a GenAI model, and its behaviour is much easier to reason about.
Overall, this project shows how traditional ML (LDA and feature engineering) and modern AI interfaces (Genie) can work together to deliver an end to end recipe recommendation engine on Databricks Free Edition, and how an iterative approach to data preparation is often key to getting good results.
Watch the full Hackathon demo entry here
What’s next
Moving forward, I would like to revisit classifying Arabic texts and use Databricks to analyse classical works. NLP behaves quite differently in Arabic compared with English, so the next step is to see whether I can build a truly end to end pipeline that respects the unique linguistic structure of Arabic while still taking advantage of the same Databricks building blocks.
Ultimately, I would like to reach a point where we can routinely apply these NLP patterns for clients, choosing between classic NLP and generative AI (or combining them) based on cost, transparency and the kind of insight they actually need, rather than defaulting to GenAI for every text problem.

