Unleashing Great Potential for Your AI Applications With Vector Embedding Models

2024-12-02

MyScale has introduced the EmbedText function in the latest version of its integrated SQL vector database. This feature combines the efficiency of SQL querying with state-of-the-art AI text embedding technology, so you can use familiar SQL syntax for both precise text matching and efficient semantic similarity computation.

With full integration of the Jina Embeddings v2 models, MyScale EmbedText lets users call Jina AI's models from within MyScale, using standard SQL syntax, to process text inputs of up to 8K (8192) tokens. This makes it possible to embed much longer texts than was previously practical. Whether processing complex multilingual data or creating advanced AI applications, developers can take advantage of Jina AI's top embedding models through MyScale at every point in the development process.

What Is MyScale?

MyScale is a cloud-native SQL vector database that enables developers familiar with SQL to build production-quality generative AI applications. Built on top of ClickHouse, MyScale integrates vector search and storage with a scalable relational database. It stores and processes structured and unstructured data efficiently and streamlines complex database engineering while maintaining high reliability and performance for AI applications.

MyScale's EmbedText function uses familiar SQL syntax to simplify the generation of text embedding vectors, so users can adopt popular AI models in their projects. With EmbedText's automated batch processing, developers can process large amounts of data much faster without relying on external tools or writing complex code.

What Is Jina Embeddings?

Jina Embeddings v2 is the world's first and, so far, only open-source text embedding model that supports inputs of 8192 tokens. It is available in three versions: English-only, bilingual Chinese-English, and bilingual German-English.

Features:

Industry-leading performance comparable to OpenAI's closed-source Ada 2 model.
Support for texts of over 8 thousand tokens, breaking the barrier to long text vector representations and allowing developers to fully represent the semantics of texts at multiple scales.
Multilingual support, with one model that represents Chinese and English in a single embedding space and another that does the same for German and English, with more languages to come. Jina Embeddings enables cross-language applications using models specialized in those specific languages rather than a massive, inefficient AI model with unequal and unclear performance across large numbers of different languages.
Ranked by LlamaIndex among the world's best embedding models for RAG (Retrieval-Augmented Generation) applications.

Using Jina Embeddings v2 in MyScale

Developers can use Jina Embeddings with the EmbedText function in MyScale for two operations: data insertion and embedding-based querying. This section covers both in detail.

Create a Simplified Function

One practical strategy is to declare an SQL User-Defined Function (UDF) that creates text embeddings and contains the relevant model name, provider, and API key so that this information doesn't have to be repeated and can be easily changed when needed.

The SQL statement below declares the function JinaAIEmbedText for that purpose. Insert your own API key in the appropriate place.

SQL

    CREATE FUNCTION JinaAIEmbedText ON CLUSTER '{cluster}'
    AS (x) -> EmbedText(x, 'Jina', '', 'YOUR_API_KEY', '{"model":"jina-embeddings-v2-base-en"}')
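
Because the point of the UDF is to keep the model name and API key in one place, it can help to generate the statement instead of editing it by hand. The Python snippet below is a minimal sketch of that idea, not part of MyScale: the helper name `render_create_function` and its parameters are hypothetical, and it only builds the statement text shown above. Running the statement against a cluster is left to whichever client you already use.

```python
import json


def render_create_function(api_key,
                           model="jina-embeddings-v2-base-en",
                           function_name="JinaAIEmbedText",
                           cluster="{cluster}"):
    """Return the CREATE FUNCTION statement from this section as a string.

    Hypothetical helper: it only formats text and never contacts MyScale.
    """
    # Refuse keys that would need escaping inside a single-quoted SQL literal.
    if "'" in api_key or "\\" in api_key:
        raise ValueError("API key contains characters that need SQL escaping")
    # Compact JSON, matching the options string used in the article.
    options = json.dumps({"model": model}, separators=(",", ":"))
    return (
        f"CREATE FUNCTION {function_name} ON CLUSTER '{cluster}'\n"
        f"AS (x) -> EmbedText(x, 'Jina', '', '{api_key}', '{options}')"
    )


statement = render_create_function(api_key="YOUR_API_KEY")
print(statement)
```

Switching to another Jina Embeddings v2 model then only means passing a different `model` argument, and the API key can come from an environment variable rather than from source code.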

Now, to get an embedding for a text, you just have to call JinaAIEmbedText:

SQL

    SELECT JinaAIEmbedText('YOUR_TEXT')

Optimizing Vector Searches Using Jina Embeddings

Once you have created the simplified function, you can use Jina Embeddings in MyScale to optimize the vector search. Querying using embeddings follows standard SQL methods. It's very simple using JinaAIEmbedText:

SQL

    SELECT id, distance(vector_column_name, JinaAIEmbedText('YOUR_QUERY_TEXT')) AS dist
    FROM table_name ORDER BY dist LIMIT 10

This returns the ten records whose embedding vectors are closest to the embedding of your query text, ordered from the best match down.
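
To make the ranking concrete, here is a small Python sketch of what a distance-ordered `LIMIT` query computes conceptually. It is a brute-force illustration under stated assumptions: it uses cosine distance, whereas the metric MyScale applies depends on how the vector column is indexed, and a real vector database answers such queries through an index rather than by scanning every row. The function names and the tiny two-dimensional vectors are made up for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(rows, query_vector, k=10):
    """rows is an iterable of (id, vector) pairs.

    Returns up to k (id, dist) pairs sorted by ascending distance,
    mirroring ORDER BY dist LIMIT k.
    """
    scored = [(row_id, cosine_distance(vec, query_vector)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data standing in for stored embeddings.
rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
result = top_k(rows, query_vector=[1.0, 0.0], k=2)
print(result)
```

In this toy data, row 1 points in the same direction as the query and comes first, followed by row 3, while the orthogonal row 2 falls outside the top two.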

Data Insertion

You can create a table whose vector column is filled in automatically from the text column by the JinaAIEmbedText function defined above. For example:

SQL

    CREATE TABLE jina_embedding
    (
      id UInt32,
      paragraph String,
      vector Array(Float32) DEFAULT JinaAIEmbedText(paragraph),
      CONSTRAINT check_length CHECK length(vector) = 768
    )
    ENGINE = MergeTree
    ORDER BY id

Then, insert data into this table to automatically generate embeddings:

SQL

    INSERT INTO jina_embedding (id, paragraph)
    VALUES (1, 'YOUR_TEXT_1'), (2, 'YOUR_TEXT_2')
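
The `check_length` constraint in the table definition ties the table to models that output 768-element vectors, which is the output size of jina-embeddings-v2-base-en. The Python sketch below illustrates what the constraint guards against if the embedding function were later pointed at a model with a different output size. Everything in it is illustrative: the lookup table, the placeholder zero vectors, and the helper names are assumptions for the example, and no real embeddings are computed.

```python
# Output sizes of two Jina Embeddings v2 models (illustrative lookup table).
MODEL_DIMENSIONS = {
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-small-en": 512,
}

EXPECTED_LENGTH = 768  # same value as in the check_length constraint


def placeholder_embedding(model):
    """Stand-in for a real embedding: a zero vector of the model's size."""
    return [0.0] * MODEL_DIMENSIONS[model]


def passes_check_length(vector, expected=EXPECTED_LENGTH):
    """Python mirror of the SQL condition length(vector) = 768."""
    return len(vector) == expected


for model in MODEL_DIMENSIONS:
    ok = passes_check_length(placeholder_embedding(model))
    print(f"{model}: {'accepted' if ok else 'rejected'}")
```

If you change the model used by the embedding function, the constant in the table's constraint has to change with it.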

Benefits to AI Developers

MyScale's integration of Jina Embeddings v2 models offers developers a robust framework for building database-driven generative AI applications, saving time, effort, and money when bringing new applications to market.

Its specific benefits include:

Reduced computing costs: MyScale delivers superior database performance with a remarkable reduction in memory consumption compared to its competitors, making it a highly cost-effective choice to back an AI application. Jina Embeddings, by giving developers a choice between different model sizes and embedding vector sizes, offers them tools to manage their computing and storage costs.
Enhanced flexibility: The synergy between MyScale and Jina Embeddings provides developers with enhanced flexibility, particularly in challenging application scenarios like long documents and large document collections.
More accurate searching: MyScale achieves powerful metadata-filtered search through its unique MSTG algorithm, while Jina Embeddings delivers more precise representations of text semantics, improving accuracy in information retrieval. This leads to more informed decision-making and better application performance, especially in the accuracy of RAG applications.

Combining MyScale with Jina Embeddings opens up practical applications, especially for RAG-enhanced chatbots. MyScale, enhanced with Jina Embeddings, can act as a single data source for your chatbot, ensuring data security, consistency, and integrity. MyScale also reduces data redundancy by storing references to records, improves accessibility, and offers advanced access control.

Jina Embeddings v2's ability to process long texts makes it ideal for managing inputs to dialog systems. Chatbots made with Jina Embeddings have a greater understanding of conversational context, dramatically improving performance in long chats and complex scenarios.

Looking into the Future

The deep integration of MyScale and Jina Embeddings v2 empowers developers to bring AI into their projects. This includes creating intelligent customer service bots, developing more accurate cross-language search applications, and optimizing legal and business document analysis and management processes. Developers can explore a wider range of application scenarios with MyScale and Jina Embeddings and build more innovative and practical AI applications that provide users with greater value.