Understanding OpenAI Embeddings

What Are Embeddings at Their Core?

OpenAI embeddings convert text into numbers. They turn words, sentences, or whole documents into lists of numbers called vectors, and these numbers represent the meaning of the text in a form that computers can understand and work with.

When text is converted into these number lists, texts with similar meanings get similar numbers. This is the key property: the numbers are not random. They encode meaning mathematically, so closeness between vectors reflects closeness in meaning.

How Embeddings Work Fundamentally

When you send text to OpenAI's embedding API, a neural network maps the text to a point in a high-dimensional space based on the relationships between words and ideas. The network learned these relationships by training on a very large amount of text.

The result is a vector, which is a list of numbers like -0.006929 and 0.005336. For text-embedding-3-small, the list has 1536 numbers, and for text-embedding-3-large, it has 3072 numbers.

The important property is that the distance between two vectors reflects how related the texts are. Smaller distances mean the texts are more related; in practice, relatedness is usually measured with cosine similarity, which OpenAI recommends for its embeddings.
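
To make that concrete, here is a minimal sketch of cosine similarity using numpy. The three-value vectors are made up for illustration; real OpenAI embeddings have 1536 or 3072 values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: closer to 1.0 means more similar direction, hence more related text."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up three-value vectors for illustration only; real embeddings
# have 1536 (text-embedding-3-small) or 3072 (text-embedding-3-large) values.
cat = [0.20, 0.80, 0.10]
kitten = [0.25, 0.75, 0.15]
car = [0.90, 0.05, 0.40]

print(cosine_similarity(cat, kitten))  # high score: related meanings
print(cosine_similarity(cat, car))     # lower score: less related
```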

Why Embeddings Matter

Embeddings address a basic mismatch: computers work with numbers, but people communicate in text. Embeddings bridge the two by converting text into numbers that preserve its meaning.

This change lets computers:

  • Understand when two different phrases mean the same thing

  • Recognize related ideas even if they use different words

  • See how closely different pieces of text are related

Using Embeddings

To use OpenAI embeddings, you send your text to their API with your chosen model, such as text-embedding-3-small. The API returns the vector version of your text.
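
As a minimal sketch, the request looks like this with the official openai Python package (1.x), assuming your API key is set in the OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog.",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 values for text-embedding-3-small
print(vector[:3])   # the first few floats of the vector
```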

You can store these vectors in databases built specifically for them, called vector databases, and use them for tasks such as:

  • Search by finding vectors that are closest to your query vector (see the sketch after this list)

  • Group similar texts by finding clusters of vectors

  • Spot unusual texts by finding isolated vectors

  • Classify texts by comparing them to labeled examples
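
As a minimal sketch of the search use case, here is semantic search without a vector database: embed a few documents and a query in memory, then rank the documents by cosine similarity. The documents and query are made up; the code assumes the openai Python package and numpy, with an API key in the OPENAI_API_KEY environment variable.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up documents for illustration.
documents = [
    "How to reset your password",
    "Troubleshooting login errors",
    "Our refund and returns policy",
]

# One API call can embed a whole list of texts.
response = client.embeddings.create(model="text-embedding-3-small", input=documents)
doc_vectors = [item.embedding for item in response.data]

query = "I can't sign in to my account"
query_vector = client.embeddings.create(
    model="text-embedding-3-small", input=query
).data[0].embedding

# Rank documents by similarity to the query, most related first.
scores = [cosine_similarity(v, query_vector) for v in doc_vectors]
print(documents[int(np.argmax(scores))])  # likely "Troubleshooting login errors"
```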

OpenAI's Current Embedding Models

OpenAI has two main third-generation embedding models:

text-embedding-3-small:

  • It's faster and cheaper

  • Has 1536 dimensions by default

  • Works well for most uses

  • Can process about 62,500 pages for each dollar spent

text-embedding-3-large:

  • Offers the strongest performance, with measurably better results on benchmarks

  • Comes with 3072 dimensions by default (both models let you request fewer; see the sketch after this list)

  • Best for when you need the highest accuracy

  • Can process about 9,615 pages for each dollar spent
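
Both pages-per-dollar figures assume roughly 800 tokens per page. The third-generation models also accept an optional dimensions parameter that returns a shortened vector, trading a little accuracy for less storage and faster comparisons. A minimal sketch, again assuming the openai Python package and an API key in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request a shorter vector than text-embedding-3-large's 3072-value default.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Shortened embeddings trade a little accuracy for less storage.",
    dimensions=256,
)

print(len(response.data[0].embedding))  # 256
```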

Both models accept up to 8191 tokens of input, which is roughly 6,000 words, per request.
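
To check whether a text fits within that limit before sending it, you can count tokens locally with OpenAI's tiktoken package; a small sketch, assuming the cl100k_base encoding these models use:

```python
import tiktoken

# The text-embedding-3 models use the cl100k_base tokenizer.
encoding = tiktoken.get_encoding("cl100k_base")

text = "Both embedding models accept up to 8191 tokens per request."
num_tokens = len(encoding.encode(text))

print(num_tokens)           # token count for this text
print(num_tokens <= 8191)   # True: within the input limit
```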

In Summary

OpenAI embeddings turn text into numbers while keeping the meaning. These vectors help computers understand and work with text in ways that wouldn't be possible otherwise. They are the basis for modern search, recommendations, and many other AI applications that use text.