
2024-10-14

Implementing Vector and Hybrid Search with OpenSearch and the Neural Plugin - Part 2

In this article, we will show you how to create an Ingest Pipeline and a kNN-capable index in OpenSearch to automate the creation of Vector Embeddings as much as possible.

In the previous part of this series, we used the OpenSearch Neural Plugin to register and deploy a transformer model capable of creating Vector Embeddings from text. We installed a free, openly available model from Hugging Face. Now it’s time to use and test it. In this article, we will show you how to create an Ingest Pipeline and a kNN-capable index to automate this process as much as possible. Let’s dive in!

Create an Ingest Pipeline

With our model successfully registered and deployed, the next step is to set up an ingest pipeline. This pipeline processes incoming documents and automatically generates a vector embedding for each of them.
Create the pipeline using the following API call:
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "kJGw1pAB2ktECB55NhDC",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}
This ingest pipeline configuration accomplishes several key objectives:
    It defines a processor that will generate text embeddings using our deployed model. Note that we used the same model ID as in the previous part of this series; OpenSearch generates this ID when you register a model, so yours will differ.
    It specifies which field in the input document (“text”) should be used to generate the embedding.
    It determines where the resulting vector embedding will be stored (“passage_embedding”).
By establishing this pipeline, we ensure that all documents indexed through it automatically have their textual content transformed into vector embeddings, ready for neural and hybrid search operations.
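Before attaching the pipeline to an index, it is worth a quick sanity check. The simulate API runs the pipeline against an inline test document without indexing anything; the sample text below is just a placeholder:
POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "text": "A quick test sentence."
      }
    }
  ]
}
If the model is deployed and reachable, the simulated document in the response contains a populated “passage_embedding” field; otherwise, the response reports the processor error.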

Create a k-NN Index

Following the creation of our ingest pipeline, we must now create an index capable of storing and efficiently querying our vector data. This step is crucial for enabling k-Nearest Neighbors (k-NN) searches on our embeddings. Execute the following command to create the index:
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {
        "type": "text"
      }
    }
  }
}
This index configuration is designed to support our vector search requirements:
    It enables k-NN capabilities for the index.
    It sets our previously created ingest pipeline (“nlp-ingest-pipeline”) as the default for all documents ingested to this index.
    It defines the structure of our documents, including the crucial “passage_embedding” field, which will store our vector data.
    It specifies the dimensionality of our embeddings (768 in this case, matching our chosen model’s output).
    It configures the k-NN algorithm to use the Lucene engine with HNSW (“Hierarchical Navigable Small World”) graph for efficient nearest neighbor search.
By setting up this index, we create an environment optimized for both traditional text search and vector-based similarity search.
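To verify the result, the standard index APIs work as usual, as nothing here is specific to the Neural plugin:
GET /my-nlp-index
The response should show “index.knn” set to true, “nlp-ingest-pipeline” as the default pipeline, and “passage_embedding” mapped as a “knn_vector” with dimension 768.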

Ingest Data into Index

With our index and pipeline in place, we can now begin ingesting data. This step involves inserting documents into our newly created index, where they will be automatically processed by our ingest pipeline. Here’s an example of how to insert a document:
PUT /my-nlp-index/_doc/1
{
  "text": "[SOME TEXTUAL DATA]",
  "id": "1775029934.jpg"
}
When executing this command:
    The specified text will be automatically embedded using our registered model.
    The resulting embedding will be stored in the “passage_embedding” field.
    Additional metadata, such as the “id” field, will be indexed alongside the embedding.
This process allows us to build up a corpus of documents that are simultaneously searchable via traditional methods and via vector similarity.
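Since the pipeline is set as the index default, the same applies to bulk requests, so building up a larger corpus needs no extra plumbing. Here is a minimal sketch with two made-up documents:
POST /_bulk
{ "index": { "_index": "my-nlp-index", "_id": "2" } }
{ "text": "The dog chased the ball.", "id": "2" }
{ "index": { "_index": "my-nlp-index", "_id": "3" } }
{ "text": "Rain fell softly on the roof.", "id": "3" }
Each of these documents passes through the embedding processor on its way in, exactly like a single PUT.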
Note that all the magic actually happens in our ingest pipeline, which is connected to our kNN-capable index. Hence, indexing a document here is no different from indexing any other document into any other index. And that is one of the beauties of using the Neural plugin: it does all the heavy lifting under the hood, so you can concentrate on implementing your business requirements. Let’s give it a shot, ingest some text, and see what happens:
PUT /my-nlp-index/_doc/1
{
  "text": "The cat sat on the mat.",
  "id": "1"
}
The response does not look any different from any other index operation:
{
  "_index": "my-nlp-index",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
However, if we query our index for the created document, we can see that it contains the original text plus the generated vector embedding (shortened for brevity).
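Fetching it is a plain document GET, nothing specific to the Neural plugin:
GET /my-nlp-index/_doc/1
The response: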
{
  "_index": "my-nlp-index",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "passage_embedding": [
      -0.19641775,
      0.116648935,
      0.14530957,
      0.12820506,
      ...
      -0.0011875179,
      0.29987383,
      0.13531543
    ],
    "text": "The cat sat on the mat.",
    "id": "1"
  }
}
As expected, the ingest pipeline uses our deployed model as a processor. The transformer model converts the text in the “text” field into a vector, which is then saved into the “passage_embedding” field. Sweet, isn’t it?
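With the embeddings in place, the index is already queryable. As a first taste of what the next article covers in depth, here is a minimal neural query; the query text and k are placeholders, and the model ID is the one we deployed earlier:
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "Where does the cat sit?",
        "model_id": "kJGw1pAB2ktECB55NhDC",
        "k": 5
      }
    }
  }
}
The plugin embeds the query text with the same model and returns the k documents whose embeddings are closest to it.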
In the next article, we will ingest more real-world data and try lexical, vector, and hybrid search techniques. See you there!
