Introducing Umbraco AI Search


If you’ve used the built-in search in Umbraco, you’ll know it’s keyword-based. You type “deployment guide” and it finds pages containing those exact words. That works well enough for many cases, but it has some fundamental limitations.

What if someone searches for “how to release my site”? Same intent, completely different words. Keyword search won’t find your “deployment guide” page. And what about typos, synonyms, or queries in a different language? Traditional search struggles with all of these.

That’s the gap Umbraco.AI.Search fills. It’s a new add-on for Umbraco.AI that brings semantic vector search to Umbraco — meaning it searches by meaning, not just matching words.

Note: Umbraco.AI.Search is currently in beta, alongside the Umbraco.Cms.Search framework it builds on. The APIs and behaviour may evolve as both packages mature.

Why Embeddings Beat Keywords

To understand why this matters, it helps to know what’s happening under the hood.

When content is published, Umbraco.AI.Search takes the text from your content fields, splits it into manageable chunks, and sends each chunk to an embedding model (configured via an Umbraco.AI embedding profile). The model returns a vector — essentially a long list of numbers that represents the meaning of that text in a high-dimensional space.

These vectors get stored in a vector store, and when someone searches, their query goes through the same embedding process. The search then finds content whose vectors are closest to the query vector, where closeness is measured by cosine similarity.
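To make “closest” concrete: cosine similarity compares the direction of two vectors, where a score near 1 means very similar meaning and a score near 0 means unrelated. A minimal illustrative sketch (the package computes this for you, either server-side or via TensorPrimitives):

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Close to 1 means the vectors point the same way (similar meaning);
// close to 0 means they are unrelated.
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, magA = 0f, magB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}
```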

The practical result:

| Query | Keyword Search | Semantic Search |
| --- | --- | --- |
| “deployment guide” | ✅ Exact match | ✅ Meaning match |
| “how to release my site” | ❌ No keyword overlap | ✅ Semantically similar |
| “hjälp med publicering” (Swedish) | ❌ Wrong language | ✅ Cross-lingual (with multilingual models) |
| Typos, abbreviations | ❌ Usually misses | ✅ Understands intent |
| “docs like this one” | ❌ Not possible | ✅ Find similar by document |

This isn’t a replacement for keyword search — there are times when exact matching is what you want. But for content discovery, natural language queries, and “find more like this” scenarios, embeddings are significantly more capable.

Built on Umbraco.Cms.Search

Umbraco.Cms.Search is Umbraco’s new search framework — the future of how search will work in Umbraco. It defines a provider-agnostic abstraction over search with IIndexer and ISearcher interfaces, decoupling Umbraco from any specific search technology.

We made a deliberate decision to build Umbraco.AI.Search against Umbraco.Cms.Search rather than the existing Examine-based infrastructure. Umbraco.Cms.Search isn’t the default search provider yet — it’s still in beta — but by building on it now, semantic vector search will be ready to go as soon as the framework ships. That said, you don’t have to wait. You can install Umbraco.Cms.Search today alongside your existing setup — just be aware that as a beta package, things may shift under you.

The indexer (AIVectorIndexer) hooks into content publish events. When content or media is saved, it extracts text from your fields, chunks it using a recursive text splitter, batch-generates embeddings, and stores the vectors with metadata — including the object type, culture, and any access protection IDs.
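To illustrate the chunking step, here is a deliberately simplified sliding-window chunker. It is character-based and fixed-size, whereas the actual recursive splitter is token-aware and prefers natural boundaries like paragraphs, so treat this only as a sketch of how chunk size and overlap interact:

```csharp
using System;
using System.Collections.Generic;

// Simplified chunker: fixed-size windows with overlap, so text near a
// boundary appears in two adjacent chunks and no context is lost.
// (The real splitter is recursive and token-aware; this is character-based.)
static List<string> Chunk(string text, int chunkSize, int overlap)
{
    var chunks = new List<string>();
    int step = chunkSize - overlap;
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(chunkSize, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break;
    }
    return chunks;
}
```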

The searcher (AIVectorSearcher) embeds your query text, performs a similarity search against the store, deduplicates results by document (since multiple chunks from the same page might match), filters by minimum score, and respects Umbraco’s content access protection. The result is a standard SearchResult — the same type you’d get from any Umbraco.Cms.Search provider.
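The deduplication and filtering steps can be sketched in a few lines of LINQ. The tuple shape here is invented for illustration; the real searcher works with its own internal result types:

```csharp
using System.Collections.Generic;
using System.Linq;

// Collapse chunk-level hits to one result per document (best chunk wins),
// drop anything below the minimum score, and rank the rest by score.
static List<(string DocumentId, float Score)> DeduplicateAndFilter(
    IEnumerable<(string DocumentId, float Score)> chunkHits, float minScore) =>
    chunkHits.GroupBy(h => h.DocumentId)
        .Select(g => g.OrderByDescending(h => h.Score).First())
        .Where(h => h.Score >= minScore)
        .OrderByDescending(h => h.Score)
        .ToList();
```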

Because it implements the standard interfaces, you query it through the Umbraco.Cms.Search API just like any other searcher:

// Resolve the AI vector searcher from the search framework
var searcher = searcherResolver.GetSearcher("UmbAI_Search");

// Perform a semantic search — same API as any other searcher
var results = await searcher.SearchAsync(
    indexAlias: "UmbAI_Search",
    query: "how to release my site",
    culture: "en-US",
    skip: 0,
    take: 10);

No special APIs to learn. It’s just another search provider — one that happens to understand what you mean, not just what you typed.

[Screenshot: Search Index]

The Semantic Search Agent Tool

Beyond direct API usage, Umbraco.AI.Search also registers a semantic search tool that AI agents can use. This is where it gets interesting for anyone using the Umbraco.AI Agent add-on.

The tool supports two modes:

Text query — The agent searches by meaning. “Find me pages about getting started with Umbraco” will surface onboarding content, setup guides, and quickstart pages, even if none of them contain that exact phrase.

Document similarity — The agent can say “find content similar to this page.” It retrieves the existing embeddings for a document, averages them, and searches for nearby vectors. This enables “related content” and “you might also like” scenarios without any manual tagging.
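Averaging the chunk embeddings produces a single “centroid” vector that stands in for the whole document; that centroid is then used as the query vector. A minimal sketch:

```csharp
using System.Collections.Generic;

// Average a document's chunk embeddings into one "centroid" vector,
// which can then be used as the query for a nearest-neighbour search.
static float[] AverageVectors(IReadOnlyList<float[]> vectors)
{
    var centroid = new float[vectors[0].Length];
    foreach (var v in vectors)
        for (int i = 0; i < centroid.Length; i++)
            centroid[i] += v[i];
    for (int i = 0; i < centroid.Length; i++)
        centroid[i] /= vectors.Count;
    return centroid;
}
```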

The tool respects backoffice user permissions (start nodes) and content access protection, so agents can only surface content the current user is allowed to see.

[Screenshot: Copilot Semantic Search]

The Vector Store

Vectors need to live somewhere, and Umbraco.AI.Search ships with a database-backed vector store out of the box. It uses EF Core and stores vectors as JSON arrays in your existing Umbraco database — no external services required.

For SQL Server 2025, the store automatically detects native vector support and uses the VECTOR_DISTANCE() function for server-side cosine similarity. On older SQL Server versions (or SQLite), it falls back to brute-force cosine similarity in .NET using TensorPrimitives. That fallback is fine for smaller sites, but it loads and compares every vector on each query — so as your content grows, query times grow with it. For anything beyond a small site, you’ll want either SQL Server 2025 for its native vector support, or a dedicated vector store implementation.

The good news is that swapping out the store is straightforward. The interface is intentionally simple:

public interface IAIVectorStore
{
    Task UpsertAsync(string indexName, string documentId, string? culture,
        int chunkIndex, ReadOnlyMemory<float> vector,
        IDictionary<string, object>? metadata = null,
        CancellationToken cancellationToken = default);

    Task DeleteAsync(string indexName, string documentId, string? culture,
        CancellationToken cancellationToken = default);

    Task DeleteDocumentAsync(string indexName, string documentId,
        CancellationToken cancellationToken = default);

    Task<IReadOnlyList<AIVectorSearchResult>> SearchAsync(string indexName,
        ReadOnlyMemory<float> queryVector, string? culture = null,
        int topK = 10, CancellationToken cancellationToken = default);

    Task<IReadOnlyList<AIVectorEntry>> GetVectorsByDocumentAsync(string indexName,
        string documentId, string? culture = null,
        CancellationToken cancellationToken = default);

    Task ResetAsync(string indexName, CancellationToken cancellationToken = default);

    Task<long> GetDocumentCountAsync(string indexName,
        CancellationToken cancellationToken = default);
}

Seven methods: upsert, two delete variants, search, retrieve, reset, and count. If you want to back this with Qdrant, Pinecone, Azure AI Search, or any other vector database, you implement this interface and register it in the DI container.
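Registration might look like this in an Umbraco composer. This is a sketch under assumptions: MyQdrantVectorStore is a hypothetical IAIVectorStore implementation, and I’m assuming the standard AddUnique replacement pattern applies here:

```csharp
// Sketch: swap in a custom vector store via an Umbraco composer.
// MyQdrantVectorStore is a hypothetical IAIVectorStore implementation.
public class CustomVectorStoreComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.Services.AddUnique<IAIVectorStore, MyQdrantVectorStore>();
}
```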

Culture-Aware and Access-Protected

Two details worth calling out explicitly.

Culture support is baked in. Content variants are indexed separately by culture, so a search in da-DK returns Danish content, not English. When a culture is specified, the searcher includes both culture-specific and invariant (culture-neutral) results — matching the conventions established by Umbraco.Cms.Search.

Access protection is enforced at search time. When content is indexed, any access restriction IDs are stored as metadata alongside the vectors. At query time, these are checked against the current user’s access context. Protected content doesn’t leak into results for users who shouldn’t see it.
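Conceptually, that check is a set intersection between the protection IDs stored with the vector and the IDs the current user holds. A sketch with deliberately simplified types (the real implementation works with Umbraco’s access context):

```csharp
using System.Collections.Generic;
using System.Linq;

// Conceptual access check (types simplified for illustration): a result is
// visible when it carries no protection IDs, or when the user holds at
// least one of the IDs stored alongside the vector.
static bool CanSee(ISet<string> userAccessIds, IReadOnlyCollection<string> protectionIds)
    => protectionIds.Count == 0 || protectionIds.Any(userAccessIds.Contains);
```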

Configuration

The defaults are sensible, but everything is configurable via appsettings.json:

{
    "Umbraco": {
        "AI": {
            "Search": {
                "ChunkSize": 512,
                "ChunkOverlap": 50,
                "DefaultTopK": 100,
                "MinScore": 0.3
            }
        }
    }
}
  • ChunkSize — Maximum tokens per text chunk. Smaller chunks give more granular results but generate more vectors.
  • ChunkOverlap — Token overlap between chunks so context isn’t lost at boundaries.
  • DefaultTopK — How many candidates to retrieve from the vector store before filtering and deduplication.
  • MinScore — Minimum cosine similarity (0.0-1.0) to include a result. Raise this if you’re getting too many loosely related results.

Getting Started

Install the package, configure an embedding profile in the Umbraco.AI backoffice (any provider that supports embeddings will work), then head to the Umbraco.AI settings and assign it as the default embedding profile.

Once that’s in place, rebuild your search index. That’s it. Your content will start being indexed as vectors, and semantic search is available both through the Umbraco.Cms.Search API and as an agent tool.

As mentioned, this is beta software — both Umbraco.AI.Search and Umbraco.Cms.Search are still maturing. But the foundations are solid, and I’m keen to hear how people use it. If you hit rough edges, that feedback is valuable.

Until next time 👋