From RAG to Riches: A Practical Guide to Building Semantic Search Using Embeddings and the OpenSearch Vector Database

The shared technology foundation

Natural Language Processing (NLP) technologies have been growing in capability by leaps and bounds in recent years, as any witness to ChatGPT’s meteoric rise can attest. Dense-vector semantic search is increasingly used to address LLM/chatbot hallucinations: a Retrieval Augmented Generation (RAG) technique first finds a small set of close sentence embedding matches to, for example, a user question, and then an LLM is asked to answer only from that reduced set of provided materials.

One of the most important breakthroughs in the development of modern Large Language Models (LLMs) like ChatGPT is the transformer - GPT stands for Generative Pre-trained Transformer. For transformers, the first step of processing a text input is converting it into a text embedding, along with a positional encoding. These text embeddings are foundational to the transformer architecture’s key advantage over preceding techniques: parallel processing. 

Text embeddings are also the core of the semantic search approach we will show in this article, so these embeddings are a shared foundation of both semantic search and LLMs. In fact, the model we will use to generate embeddings, SBERT (Sentence Bidirectional Encoder Representations from Transformers), is itself an LLM based on transformers, which are in turn trained through embeddings generated dynamically during training; but we don’t need to follow that loop to build our semantic search.

The increasing need for Semantic Search

Data volumes have been growing massively year over year, and users now expect access to those huge stores of information at their fingertips. NLP algorithms have become much more powerful with transformers, embeddings, and large language models, in part because parallel processing algorithms such as those used in transformers can take advantage of today’s increasingly capable CPUs and GPUs. 

While chat interfaces have certainly grown rapidly in popularity, there is still a strong need to search collections for the specific items (such as documents, images, or videos) that best match a user’s query. With the increasing pervasiveness and flexibility of LLMs, the brittle exact-matching nature of keyword search is no longer adequate. Semantic search, which leverages contextual understanding to extend beyond exact matching, can fill this need.

Imagine you have some data with associated text strings that you may want to search: perhaps a database of chat messages, or image captions that have been added to image metadata. How would you search this data? First, let’s take a closer look at the different types of search we might consider implementing.

How does Semantic Search differ from Keyword Search?

Keyword search and semantic search are two different approaches to retrieve information from databases or search engines. While both methods aim to provide relevant results, they differ significantly in how they handle language and interpret user queries, especially if the query text doesn’t exactly match the text in the item being searched.

  • Keyword Search: Keyword search is the traditional method of searching, where users input specific words or phrases relevant to their query. Search engines match these keywords against their indexed content and return results that contain the exact keywords provided by the user. Keyword search relies heavily on exact matches and does not consider the context or meaning behind the words used.

To prepare to quickly answer keyword queries across large sets of documents, keyword search engines create an inverted “full-text” index, which allows quick discovery of documents containing given keywords but captures no “meaning” of the keywords themselves. The popular open-source search platform Elasticsearch uses this kind of index by default.

As an example, if a user searches for "car," the search engine will return results that include the word "car" but may not consider results containing related terms like "automobile" or "vehicle."

More sophisticated keyword search engines do include some handling of direct synonyms but still lack a greater contextual understanding.

  • Semantic Search: Semantic search, on the other hand, is an advanced search technique that aims to understand the intent behind the user's query and deliver more contextually relevant results. It goes beyond simple keyword matching and tries to comprehend the meaning of the query within its context, considering relationships between words and concepts.

In contrast to keyword search which matches exact strings, Semantic Search attempts to model concepts in a virtual “universe” and understand that (for example) “automobile” and “engine” are near “car” in that universe, but “cat” is somewhere else. Therefore, if a user searches for "automobile," a semantic search engine will not only return results containing the word "automobile" but also recognize that "car," "vehicle," and "auto" are related terms and include results using those words. It may even include results with “engine” or “tires” if there are few results with direct synonyms.

Overall, semantic search offers significant advantages over keyword search by providing more precise and contextually relevant results. It accounts for related concepts and considers the intent behind the user's query, enhancing the user's search experience and delivering more comprehensive information.

How modern Semantic Search leverages embeddings

In order to be able to create a context-aware search, a semantic search engine must be able to capture that context in a searchable way. To understand how this works, we need to understand the concept of embeddings.

Embeddings are a way to create a “map” of the conceptual distance between various items represented in a virtual space, which can be encoded numerically as coordinates in that space.

They are learned through techniques like deep learning and aim to encode similarity relationships between data points, enabling tasks such as information retrieval, recommendation systems, and natural language understanding. Embeddings have revolutionized various fields by transforming raw data into meaningful and compact representations, facilitating more efficient and accurate machine learning (ML) algorithms.

Embeddings can be applied to any type of data, including words, sentences or even images, audio recordings, or video. To form an embedding, a learning process will place the items in an n-dimensional space. If we simply embed words (a “word embedding”), the “map” that a deep learning process produces from raw data might look something like the figure below, shown in a 2-dimensional space for simple visualization.

Modern semantic search often uses a more sophisticated technique called a “sentence embedding,” which is a mathematical representation of a sentence that captures its semantic meaning and context in a high-dimensional vector space. In other words, it's a way to transform a sentence of variable length and content into a fixed-length vector of numbers that can be used in semantic search. Note that the concept of a “sentence” is different from a proper English language sentence - it can be any series of words, like “Romeo and Juliet”.

In a vector space plot, the embedding model places related concepts close together and more dissimilar concepts farther apart. To determine the place of each sentence in embedding space, one would train a model, such as a neural network, against a corpus of sentences, placing each sentence from the corpus in the embedding space as seen below. Subsequently, new entries can be placed among that corpus in the embedding space on demand.

[Figure: an example two-dimensional embedding space, with related concepts placed near each other. Source: Google]

The space where these concepts are stored (called an “embedding space”) is not actually two-dimensional as pictured, but typically has hundreds of dimensions, for example 384. This may be hard to visualize, but imagine each concept as a point our embedding model places in N-dimensional space. To locate that point in our conceptual space, the model specifies a set of N coordinates. Just as we need two coordinates (e.g. x and y) to locate “Romeo and Juliet” in the two-dimensional space above, we can locate an embedding in N-dimensional space with N coordinates, which could look something like this:

[0.009473263, -0.02898928, -0.05890525, 0.0379… <380 more>]

To be able to search, conceptually we will need to:

  • Populate this N-dimensional space with points for all of the “sentences” we want to search against.

  • When a search query/sentence comes in, find the point where it would live in this N-dimensional space

  • From that new point, find the nearest neighbors that we had populated the space with, and return those as the closest semantic search results.

Calculating sentence embeddings with Sentence-BERT

Let’s walk through actually creating embeddings and storing them in a specialized database called a vector database; you’ll see it’s quite straightforward with the latest tools.

First, let’s learn how to actually place our sentences in an N-dimensional space with Python code. We will use the Sentence-BERT (SBERT) implementation in Hugging Face’s Sentence Transformers library. To install the library, run:

$ pip install sentence-transformers

While we’re at it, let’s also install the opensearch-py Python library we’ll be using below:

$ pip install opensearch-py

We will walk through the code step by step so you may just want to read along, but full sample code is also available by cloning this GitHub repo:

$ git clone https://github.com/xyonix/semantic-search-xyonix-blog
$ pip install jupyterlab # if you don’t already have it
$ cd semantic-search-xyonix-blog
$ jupyter notebook Semantic_Search_Article.ipynb

In a Python script, you can create a model like this:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

If you have a supported GPU, you may be able to change ‘cpu’ to ‘cuda’ or ‘mps’ for better performance. To calculate our embeddings (placing our sentences in N-space), we are using the 'all-MiniLM-L6-v2' model, which provides a good balance between model size, speed, and performance. All of the pretrained models available in this library are listed, with size and performance information, on the SBERT pretrained models page (see Sources).

Next, let’s calculate our first embedding. It’s as simple as:

embedding = model.encode('The quick brown fox jumped over the lazy dog')
print(embedding)

Output:

[ 3.77293453e-02  9.12235305e-02  4.25090231e-02  7.49403313e-02
  6.44289255e-02 -1.82191860e-02  1.61316935e-02 -3.05981878e-02
 -1.42457336e-02  2.05015950e-02  4.40286398e-02  3.76272500e-02
…<368 more>]

This output can be thought of as our coordinates for the supplied sentence in N-space. The number of coordinates is something that was chosen when the model was trained. In this case, a 384 dimensional space was chosen.
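If you want to confirm the dimensionality programmatically, the Sentence Transformers model object reports it directly, and the length of any embedding will match:

# the embedding length matches the model's configured output dimension
print(len(embedding))                            # 384
print(model.get_sentence_embedding_dimension())  # 384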

To calculate positions in this space for all “sentences” in a Python list named “descriptions”, we can simply run the following code. We are using only words here to map to our simple diagram earlier in the article, but each of these descriptions could be any number of words:

descriptions = ['car', 'bus', 'house', 'cat', 'dog']
embeddings = [model.encode(description) for description in descriptions]

The variable ‘embeddings’ now contains a list of embeddings for each input description. Note that if you’re doing a large number of these, you’ll want to parallelize the computation of embeddings. We have an example of how to do this in the sample code.
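The sample code linked above includes one approach to parallelizing; as a simpler sketch, model.encode also accepts a whole list and batches the work internally, which is usually much faster than encoding one description at a time (the batch size below is illustrative):

# encode all descriptions in one call; the library batches them internally
embeddings = model.encode(
    descriptions,
    batch_size=64,           # tune to your hardware
    show_progress_bar=True,  # useful feedback for large lists
)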

Making our embeddings searchable

Now that we have made these embeddings, we want to make them searchable so we can find their nearest neighbors in N-space. In order to do that, we need what is called a “vector database” which allows us to not only find their nearest neighbors, but do it efficiently. 

Vector databases have recently become popular among ML engineers due to their efficiency in handling dense vector representations of data, such as the embeddings we’ve been discussing, storing those representations compactly in memory and on disk. We won’t go into how vector databases work here, but suffice it to say that searching high-dimensional vectors efficiently is a hard problem, and the ability to easily scale these searches to large data sets is a relatively recent development. These databases offer a range of advanced search functionalities, enabling tasks such as similarity matching, nearest neighbor search, and clustering, which are critical for recommendation systems, natural language processing, and computer vision applications.

There are several such databases out there, including Chroma, Pinecone, Weaviate, Milvus, and Faiss. We will use OpenSearch, but the same approach could be applied to any vector database. You may be familiar with Elasticsearch, the project from which OpenSearch was forked in 2021 after Elastic (the company) moved to a more restrictive license; OpenSearch retains the Apache License 2.0. The project began as a traditional keyword indexing platform, but in recent years has seen development extending well beyond that space, particularly in the k-nearest neighbors (k-NN) functionality we’re interested in here.

You will first need to install OpenSearch, which is outside the scope of this article but easy to do, as covered in the official documentation. We recommend a Docker install:

https://opensearch.org/docs/latest/install-and-configure/install-opensearch/index/
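Once OpenSearch is running, you can connect to it from Python with the opensearch-py client we installed earlier. The exact connection parameters depend on your deployment; as a minimal sketch, assuming a local Docker install on the default port (host, port, and credentials below are placeholders you should replace):

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=('admin', '<your admin password>'),  # placeholder credentials
    use_ssl=True,
    verify_certs=False,    # acceptable only for a local test install
    ssl_show_warn=False,
)
print(client.info())       # quick sanity check that the cluster is reachable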

We will load the embeddings we calculated into an index that we create in OpenSearch. OpenSearch provides several options for k-NN search; in this case we will be using the nmslib vector search library with the squared L2 (‘l2’) distance function, as you can see in the configuration below. Apache Lucene is also an engine option, which would be better for hybrid keyword and vector search, but that’s a discussion for another time. First, we define and create an index as seen in the code below.

# create index
index_name = "semantic_index"
mapping = \
{
  "settings": {
    "index": {
      "knn": True,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "description_emb": {
          "type": "knn_vector",
          "dimension": 384,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "nmslib",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
      },
      "media_url": {
        "type": "binary"
      },
      "description": {
        "type": "text"
      }
    }
  }
}

client = OpenSearch(<all relevant parameters for your server>)
client.indices.create(index=index_name, body=mapping)


You should get a response like:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'semantic_index'}



Note that we set the ef_search parameter to 100, which is also the default. This parameter is the size of the dynamic list used during approximate k-NN searches and is a trade-off between accuracy and speed; it can be tuned down to reduce latency and CPU usage. We use binary as the field type for media_url because we want to be able to retrieve it but are not interested in searching it. We use the “text” field type for the description, which gives us keyword search capabilities, and we create a knn_vector field for the embeddings themselves.
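If you do want to tune ef_search later, it is a dynamic index setting, so it can be updated on an existing index; a minimal sketch (the value 50 is illustrative):

# lower ef_search to trade some recall for lower latency and CPU usage
client.indices.put_settings(
    index=index_name,
    body={"index": {"knn.algo_param.ef_search": 50}}
)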

Now that we have an index created, we need to put our descriptions and computed embeddings in it, which is quite easy. For large data sets you will likely want to use bulk indexing (a sketch follows the loop below), but for our handful of descriptions a simple loop is fine:

for i in range(len(descriptions)):
    document = {
        'description': descriptions[i],
        'media_url': 'http://this.would/point/to/the/media.jpg',
        'description_emb': embeddings[i]
    }
    
    client.index(index=index_name, body=document)
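For the larger data sets mentioned above, a sketch of bulk indexing using the helpers module that ships with opensearch-py might look like the following; the refresh at the end simply makes the new documents searchable right away:

from opensearchpy import helpers

# build one action per document and send them in a single bulk request
actions = [
    {
        "_index": index_name,
        "_source": {
            "description": descriptions[i],
            "media_url": "http://this.would/point/to/the/media.jpg",
            "description_emb": embeddings[i].tolist(),
        },
    }
    for i in range(len(descriptions))
]
helpers.bulk(client, actions)

# make the newly indexed documents visible to search immediately
client.indices.refresh(index=index_name)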

Querying your indexed embeddings

So, now that we have our embeddings and descriptions indexed, let’s try this semantic search! First, from a user query (we’ll say the user searched for “motorcycle”) we will need to compute the embedding for that query the same way we did above with the descriptions.

# calculate embedding for query
user_query = "motorcycle"
user_query_emb = model.encode(user_query)


Once we have the embedding, we can formulate the database query and send it to OpenSearch. The value of k defines how many results are returned from each index shard; we choose a safe value equal to our desired number of results:

# construct opensearch query and submit
desired_results = 2
opensearch_query = {
    "size": desired_results,
    "query": {
        "knn": {
            "description_emb": {
                "vector": user_query_emb,
                "k": desired_results
            }
        }
    }
}

results = client.search(index=index_name, body=opensearch_query)

for result in results['hits']['hits']:
    print(result['_source']['description'])


This will result in the following output, indeed the closest index items to “motorcycle”:

car
bus


So, what’s happening here? Embeddings are working their magic. The SBERT model that we used to compute our embeddings “knows” that, in its universe of training data, “motorcycle” is closer to “car” and, to a lesser extent, “bus” than it is to “house”, “cat”, or “dog”. The data we indexed included nothing at all about “motorcycle”, so a keyword search would have simply returned no results. However, our semantic search, leveraging the SBERT training, can find the closest concepts in our data set. For simplicity we’ve used single words, but this is equally applicable to sentences and longer texts.
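You can sanity-check this intuition directly with the Sentence Transformers utility functions, without involving the database at all. Note we use cosine similarity here for readability, whereas the index above ranks by L2 distance, so treat this as a rough check rather than an exact reproduction of the search scores:

from sentence_transformers import util

# cosine similarity between the query and each indexed description
for description, embedding in zip(descriptions, embeddings):
    similarity = util.cos_sim(user_query_emb, embedding).item()
    print(f"{description}: {similarity:.3f}")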

An example of semantic search in action

The flexibility to match related concepts can be particularly compelling in a case like our example where we have media that is captioned with descriptions, but users searching for that media may not use exactly the same keywords as the caption uses. 

For example, the following image was captioned with the description: “Traveling Man Urban Nomad Enjoy Beer Next To Warm Campfire During Hiking Trip Man And His Best Friend Dog In Trendy Outdoor Outfits Watch Sunset Over Lake In Forest Camping Ground Staycation Concept”

[Image: man and his dog by a campfire at a lakeside campsite. Source: storyblocks.com]

In a media search for “Man drinking beer near dog,” the keyword search did not find this image, presumably because the keyword “drinking” does not appear in the description. Instead, the index returned many images of men drinking beer, but none that included a dog to satisfy the user’s query.

On the other hand, the semantic search returned this as one of the top few results, and it was in fact the best match we were able to find in a large database. Semantic search found that the embedding of the description and the embedding of the query were closest in N-space, and was able to return a much better result than keyword search in this and many other cases.

You can also play with passing your set of results through to OpenAI or your favorite LLM API to answer a specific question. You can say something in your prompt like: “Only answer from these reference materials” – this is a common technique used to prevent chatbot hallucinations and put strong guardrails up when using an LLM.
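None of this is tied to a particular LLM provider. As a hedged sketch using the openai Python package (v1+ client interface; the model name below is illustrative), you might assemble the retrieved descriptions into a constrained prompt like this:

from openai import OpenAI

# gather the descriptions returned by the semantic search above
context = "\n".join(hit['_source']['description'] for hit in results['hits']['hits'])

prompt = (
    "Only answer from these reference materials:\n"
    f"{context}\n\n"
    f"Question: {user_query}"
)

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)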

Conclusion

In a world where users are increasingly expecting computers to seamlessly understand their imperfect text inputs due to the ubiquity of LLMs like ChatGPT, document search needs to be brought beyond keyword matching.

The transition to Semantic Search is not just an upgrade; it is a necessary evolution to meet the demands of modern data retrieval and analysis.

As you’ve seen, semantic search can leverage this LLM understanding of the world by using the latest sentence embedding models like SBERT, an LLM itself. Semantic Search is also often used to address LLM hallucinations by using a Retrieval Augmented Generation (RAG) technique to first find a small set of close sentence matches, for example, to a user question, and then leverage an LLM to only answer from the provided materials.

In order to get semantic results to users quickly, we need a specialized tool that can find the k nearest neighbors in large data sets fast. Fortunately, a variety of vector database options have rapidly emerged and gained adoption, driven by the demand for search and data analysis tools that can handle complex ML data types at increasingly large scale. As we’ve shown, we can use a vector database like OpenSearch to find the indexed embedding representations that best match the embedding of a user query, even if there are no keyword matches. This can provide more relevant matches to user queries, especially in cases where language flexibility is important and exact matches are less common, as in our image caption example.

The transition from keyword to Semantic Search represents not just an advancement, but a transformation in how we interact with the vast reservoirs of data at our fingertips. We hope this article inspires you to embrace the challenge of designing more intuitive, responsive, and context-aware search experiences using these new technologies. Your contributions can help users discover key information they otherwise might never find. It’s an exciting time to be working in information retrieval, and we look forward to seeing what you’ll create!




Sources:

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805v2

  • Announcing ScaNN: Efficient Vector Similarity Search: https://blog.research.google/2020/07/announcing-scann-efficient-vector.html

  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks https://arxiv.org/abs/1908.10084

  • SBERT pretrained models overview: https://www.sbert.net/docs/pretrained_models.html

  • Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs: https://arxiv.org/abs/1603.09320

  • OpenSearch Approximate k-NN documentation: https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/