In this blog post, we will explore the advantages of combining semantic search with GPT QnA, as opposed to relying solely on fine-tuned GPT models. By pairing semantic search with GPT QnA, we can achieve question answering that is more accurate, contextually aware, and adaptable. Further, we will explore the creation of an advanced document-based question-answering system using LangChain and Pinecone. By capitalizing on the latest advancements in large language models (LLMs) like OpenAI GPT-4 and ChatGPT, we'll construct a document question-answering system grounded in your own content.
LangChain and Pinecone are two powerful tools that enable the development of applications driven by language models and provide efficient vector search capabilities, respectively. With LangChain's framework and Pinecone's vector database, we can harness the full potential of LLMs to build a highly accurate and context-aware QnA system.
The objective of this blog post is to guide developers and AI enthusiasts in constructing an advanced document QnA system. We will delve into the integration of LangChain and Pinecone, demonstrating how to leverage these tools to generate precise answers based on specific documents. We will cover the terminology and core concepts before building the QnA system with OPL (OpenAI, Pinecone, and LangChain).
- Models: LangChain supports various model types and model integrations, allowing users to leverage different language models for their tasks.
- Indexes: LangChain facilitates the combination of language models with user-provided text data, offering guidance on best practices for incorporating such data effectively.
- Chains: LangChain goes beyond a single call to an LLM and enables sequences of calls, either to an LLM or to other utilities. It provides a standardized interface for creating chains, integrates with multiple tools, and offers end-to-end chains for common applications (see the minimal sketch below).
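To make the Models and Chains concepts concrete before we dive in, here is a minimal sketch of a chat model wrapped in a chain, the same pattern used throughout this post. The prompt is illustrative, and an OpenAI API key is assumed to be configured in your environment.
# Minimal Models + Chains sketch (illustrative prompt; assumes OPENAI_API_KEY is set)
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)   # Model
prompt = PromptTemplate(
    input_variables=['topic'],
    template='Explain {topic} in one sentence.'
)
chain = LLMChain(llm=llm, prompt=prompt)                      # Chain: prompt -> LLM
print(chain.run({'topic': 'vector embeddings'}))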
In this post, we explore a reference architecture for the large language model (LLM) app stack, with a special focus on the use of Pinecone and LangChain. We will outline the steps required to create vector embeddings, which enable efficient storage and retrieval of textual data.
In the era of data-driven applications and artificial intelligence, converting text into numerical representations called vector embeddings has become a crucial step for various natural language processing tasks. Pinecone, a powerful service designed for storing and querying vector embeddings at scale, offers a robust infrastructure to streamline this process. In this blog post, we will explore the journey of text as it undergoes the vectorization process and how Pinecone fits into this workflow.
Text to vector embeddings is the process of converting textual data into numerical representations that capture its semantic meaning or contextual information, enabling efficient processing and analysis across a wide range of NLP tasks. The steps to convert text to vector embeddings and store them in Pinecone are as follows:
# create a file requirements.txt
openai
langchain
pinecone-client
python-dotenv
tiktoken
wikipedia
pypdf
docx2txt
# install all requirements
!pip install -r requirements.txt -q
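Before running any of the code below, the OpenAI and Pinecone credentials need to be available as environment variables. A minimal sketch, assuming you keep them in a local .env file; the variable names OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_ENV are the ones used later in this post.
# load API keys from a local .env file (example setup; adjust to your environment)
# .env contents:
#   OPENAI_API_KEY=...
#   PINECONE_API_KEY=...
#   PINECONE_ENV=...
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)
print('OpenAI key loaded:', bool(os.environ.get('OPENAI_API_KEY')))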
#The load_document function is a utility function that allows loading text data from different document formats such as PDF and DOCX.
# It determines the file format based on the file extension and uses the appropriate document loader to extract the text content.
import os
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader
def load_document(file):
    name, extension = os.path.splitext(file)
    if extension == '.pdf':
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    else:
        print('Document format is not supported!')
        return None
    data = loader.load()
    return data
# wikipedia docs
#The load_from_wikipedia function is a utility function that allows loading text data from Wikipedia articles.
# It uses the WikipediaLoader class from the langchain.document_loaders module to retrieve Wikipedia content based on the
# provided query and language.
#Parameters
#query (string): The query or topic for which the Wikipedia content needs to be retrieved.
#lang (string, optional): The language code for the Wikipedia content. Defaults to 'en' (English).
#load_max_docs (int, optional): The maximum number of Wikipedia documents to load. Defaults to 2.
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data
To transform text into vector representations, it is necessary to break it down into smaller meaningful units called tokens. Tokenization can be performed at the word level, character level, or even using subword units, depending on the chosen approach. By splitting the text into tokens, we create the building blocks for generating vector embeddings.
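As a quick illustration of subword tokenization, here is a small sketch using tiktoken, the same tokenizer library we use later for cost estimation. The sentence is only an example.
# A quick look at subword tokenization with tiktoken (illustrative sentence)
import tiktoken
enc = tiktoken.encoding_for_model('text-embedding-ada-002')
tokens = enc.encode('Semantic search pairs well with GPT question answering.')
print(len(tokens))            # number of tokens the embedding model will see
print(enc.decode(tokens))     # decoding the tokens reproduces the original text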
#The chunk_data function is a utility function that allows splitting text data into smaller chunks or segments.
# It uses the RecursiveCharacterTextSplitter class from the langchain.text_splitter module to split the provided text data.
# Parameters
# data (list): The loaded documents that need to be chunked or split into smaller segments.
# chunk_size (int, optional): The desired size of each chunk in terms of the number of characters. Defaults to 256.
# The chunking process is based on the specified chunk_size parameter, which determines the size of each chunk. Adjust the chunk_size parameter according to your requirements.
# The function splits the text data into chunks without any overlap (chunk_overlap=0). If you need overlapping chunks, you can modify the chunk_overlap parameter in the RecursiveCharacterTextSplitter initialization.
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    return chunks
For tasks that focus on individual words, pre-trained word embeddings come into play. Models like Word2Vec, GloVe, or FastText provide vector representations for words based on their contextual meaning within a large corpus of text. These pre-trained word embeddings capture semantic relationships and enable the creation of high-dimensional representations for each word.
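Pre-trained word vectors are not part of the OPL stack we build in this post, but as a quick illustration, they can be loaded with a library such as gensim. This is a sketch only and assumes gensim and its downloader data are available.
# Sketch only: pre-trained GloVe word vectors via gensim (not used in the rest of this post)
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-100')       # downloads the vectors on first use
print(glove['king'][:5])                          # first few dimensions of the 'king' vector
print(glove.most_similar('king', topn=3))         # semantically related words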
#The print_embedding_cost function is a utility function that calculates and prints the embedding cost for a given list of texts.
# It utilizes the tiktoken library to encode the texts and calculate the total number of tokens.
# It then calculates the corresponding embedding cost in USD based on the total number of tokens.
# texts (list): A list of text objects or strings for which the embedding cost needs to be calculated.
# The function uses the 'text-embedding-ada-002' model for token encoding. If you want to use a different model, you can modify the model name in the encoding_for_model function call.
# The embedding cost calculation assumes a rate of 0.0004 USD per 1000 tokens (* 0.0004). You can adjust this rate according to your requirements.
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')
Sentence Embeddings: Unveiling the Semantic Meaning
In scenarios where we aim to represent entire sentences or documents, pre-trained models like BERT, Universal Sentence Encoder, or Doc2Vec step in. These models generate vector representations for complete sentences or paragraphs, capturing the semantic meaning of the text. Sentence embeddings unlock the potential for understanding the overall context and enable more comprehensive analysis.
Training Your Own Embeddings: Tailoring to Your Needs
Alternatively, you can train your own embeddings using deep learning models equipped with an embedding layer. This approach involves constructing a neural network architecture, feeding it labeled or unlabeled text data, and optimizing the model to learn meaningful vector representations during training. Training your own embeddings allows you to tailor them specifically to your application or domain.
At this stage, each token, word, or sentence is converted into a corresponding vector representation based on the chosen embedding model or approach. By assigning numerical vectors, the text data transforms into a format that can be processed by machine learning algorithms. Vector representations encode the semantic meaning or contextual information of the text, enabling efficient analysis and comparison.
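In this post, the embedding model is OpenAI's text-embedding-ada-002, accessed through LangChain's OpenAIEmbeddings wrapper. A minimal sketch of turning a single sentence into its vector, assuming the OpenAI key is set:
# Turn a single sentence into a vector with the same embedding model used below
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()   # defaults to text-embedding-ada-002
vector = embeddings.embed_query('The quick brown fox jumps over the lazy dog.')
print(len(vector))    # 1536 dimensions, matching the Pinecone index we create later
print(vector[:5])     # a few of the raw floating-point components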
With the vector embeddings generated, Pinecone comes into the picture. Pinecone is designed to handle the storage, indexing, and retrieval of vector embeddings efficiently. By leveraging Pinecone, you can store and manage your vector embeddings in a distributed and scalable manner. Pinecone's infrastructure allows for fast indexing, incremental updates, and seamless retrieval of embeddings.
#Please ensure you have the necessary libraries (pinecone, langchain.vectorstores, and langchain.embeddings.openai) installed and
# the required environment variables (PINECONE_API_KEY and PINECONE_ENV) properly set for this code to work correctly.
# The insert_or_fetch_embeddings function is a utility function that inserts or fetches embeddings using the Pinecone Vector Indexing service. It leverages the OpenAIEmbeddings class from the langchain.embeddings.openai module to generate embeddings for the provided chunks of text.
# Parameters
# index_name (string): The name of the index where the embeddings will be stored or fetched.
# chunks (list): A list of text chunks for which embeddings will be generated and inserted into the index.
# Return Value
# The function returns a Pinecone vector store object that can be used to perform operations such as similarity search and retrieval using the generated embeddings.
# The function assumes that you have set up the Pinecone API key and environment variables (PINECONE_API_KEY and PINECONE_ENV) for authentication and configuration. Ensure that you have properly set these variables in your environment.
# The function checks if the specified index_name already exists in the Pinecone service. If it does, it loads the existing embeddings into a Pinecone vector store object. If not, it creates a new index and generates embeddings for the provided chunks.
# The function assumes a dimension of 1536 for the embeddings and uses the cosine metric for similarity calculations. You can modify these parameters according to your requirements.
import pinecone
import os
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
def insert_or_fetch_embeddings(index_name, chunks):
    embeddings = OpenAIEmbeddings()
    pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment=os.environ.get('PINECONE_ENV'))
    if index_name in pinecone.list_indexes():
        print(f'Index {index_name} already exists. Loading embeddings ...', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print(' Ok')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        pinecone.create_index(index_name, dimension=1536, metric='cosine')
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print(' Ok')
    return vector_store
Pinecone Console
#delete index and cleanup as needed
import pinecone
import os
def delete_pinecone_index(index_name='all'):
    pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment=os.environ.get('PINECONE_ENV'))
    if index_name == 'all':
        indexes = pinecone.list_indexes()
        print('Deleting all indexes ...')
        for index in indexes:
            pinecone.delete_index(index)
        print('All indexes deleted successfully.')
    else:
        print(f'Deleting index {index_name} ...', end='')
        pinecone.delete_index(index_name)
        print('Index deleted successfully.')
Once the vector embeddings are stored and indexed in Pinecone, you can leverage its capabilities for querying. By submitting a query vector, Pinecone enables similarity searches and nearest neighbor retrieval based on vector similarity. This functionality is crucial for tasks such as recommendation systems, content matching, and clustering, where finding similar embeddings is paramount.
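Once a vector store has been created with insert_or_fetch_embeddings (as we do below for the sample document), a raw similarity search can be run directly against it. A small sketch; the query string is only an example.
# Raw similarity search against the Pinecone-backed vector store (query is illustrative)
query = 'What does the document say about freedom of speech?'
similar_chunks = vector_store.similarity_search(query, k=3)   # 3 nearest chunks by cosine similarity
for chunk in similar_chunks:
    print(chunk.page_content[:100], '...')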
# The ask_and_get_answer function is a utility function that performs question-answering using a retrieval-based approach.
# It uses the provided vector_store to retrieve relevant documents based on the input question q.
# It then utilizes the ChatGPT model from OpenAI to generate an answer using the retrieved documents.
# Parameters
# vector_store (Pinecone vector store): The Pinecone vector store object containing the indexed embeddings.
# q (string): The question for which an answer needs to be generated.
# Return Value
# The function returns the generated answer as a string
# The function only requires an existing vector_store; it creates the llm, retriever, and chain internally for the question-answering process.
# The function utilizes the RetrievalQA class from the langchain.chains module to run the question-answering process. Modify the parameters and configurations of the models according to your requirements.
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
def ask_and_get_answer(vector_store, q):
    llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    answer = chain.run(q)
    return answer

def ask_with_memory(vector_store, question, chat_history=None):
    if chat_history is None:
        chat_history = []
    llm = ChatOpenAI(temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    crc = ConversationalRetrievalChain.from_llm(llm, retriever)
    result = crc({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    return result, chat_history
#https://constitutioncenter.org/the-constitution/full-text
#create a pdf version of the above
data = load_document('constitution.pdf')
print(f'You have {len(data)} pages in your data')
print(f'There are {len(data[20].page_content)} characters in the page')
Loading constitution.pdf
You have 41 pages in your data
There are 1137 characters in the page
chunks = chunk_data(data)
print(len(chunks))
# print(chunks[10].page_content)
190
print_embedding_cost(chunks)
Total Tokens: 16711
Embedding Cost in USD: 0.006684
#delete_pinecone_index()
index_name = 'askadocument'
vector_store = insert_or_fetch_embeddings(index_name, chunks)
Creating index askadocument and embeddings ...Ok
q = 'What is the document about?'
answer = ask_and_get_answer(vector_store, q)
print(answer)
# The ask_questions function is a utility function that allows users to ask questions interactively and get answers in real-time. It repeatedly prompts the user for questions until the user enters "Quit" or "Exit" to end the session. For each question, it calls the ask_and_get_answer function to generate an answer using the provided vector_store and displays the answer.
# Parameters
# vector_store (Pinecone vector store): The Pinecone vector store object containing the indexed embeddings.
# Return Value
# The function does not return any value. It continuously prompts the user for questions and displays the corresponding answers until the user chooses to quit.
# The function relies on the ask_and_get_answer function to generate answers for each question. Ensure that you have defined and imported the ask_and_get_answer function properly.
# The function continuously prompts the user for questions in an interactive manner. It checks if the user has entered "Quit" or "Exit" to terminate the session and then exits after a short delay.
# Modify the prompt, question numbering, and other user interaction aspects of the function according to your requirements.
import time
def ask_questions(vector_store):
    print('Write Quit or Exit to quit.')
    i = 1
    while True:
        q = input(f'Question #{i}: ')
        i += 1
        if q.lower() in ['quit', 'exit']:
            print('Quitting ... bye bye!')
            time.sleep(2)
            break
        answer = ask_and_get_answer(vector_store, q)
        print(f'\nAnswer: {answer}')
        print(f'\n{"-" * 50}\n')
ask_questions(vector_store)
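The ask_with_memory function defined earlier can be exercised the same way. A short usage sketch showing how the returned chat_history is threaded through a follow-up question; the questions are only examples.
# Follow-up questions with conversational memory (questions are illustrative)
chat_history = []
result, chat_history = ask_with_memory(vector_store, 'How many amendments are in the US Constitution?', chat_history)
print(result['answer'])
result, chat_history = ask_with_memory(vector_store, 'Multiply that number by 2', chat_history)
print(result['answer'])   # the chain resolves "that number" from the previous turn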
We will briefly discuss different summarization techniques using LangChain's built-in chain types: stuff, map_reduce, and refine.
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
text= """
he Infinite Game explores leadership choices, and provides guidelines to implement an "infinite game" plan. Finite mindsets focus on winning, whereas infinite mindsets develop a more significant cause than ourselves or our business. Collaborating with others, around a shared purpose, builds a better resourced, and more caring world.\
Sinek wrote The Infinite Game in 2019, as part of his own infinite game. His previous books, Start with Why, and Leaders Eat Last, lay the foundation for this, and his inspiration came from a book called Finite and Infinite Games. The author of Finite and Infinite Games, James Carse, explains how finite-minded leaders play to win, whereas infinite-minded leaders play to keep playing, ultimately for the good of the game. Building from this, Sinek evaluates finite and infinite leadership within different institutions, and he illustrates how remarkably different the results of each mindset can be.\
This summary briefly guides us through the choice to adopt a finite or infinite mindset. As the players in the game of life, this is a crucial choice that we all have to make. Whether you're in business, politics, academics, or a parent, we all need to realize that an infinite or finite mindset will impact others, long after we're gone. So if you're interested in understanding the qualities required for leadership, The Infinite Game provides crucial insights.\
"""
messages = [
    SystemMessage(content='You are an expert copywriter with expertise in summarizing documents'),
    HumanMessage(content=f'Please provide a short and concise summary of the following text:\n TEXT: {text}')
]
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
llm.get_num_tokens(text)
276
summary_output = llm(messages)
print(summary_output.content)
"The Infinite Game" by Simon Sinek explores the concept of adopting an infinite mindset in leadership. It contrasts finite mindsets focused on winning with infinite mindsets that prioritize a greater cause. Sinek draws inspiration from James Carse's book "Finite and Infinite Games" and evaluates the impact of these mindsets in various institutions. This book offers valuable insights for anyone interested in understanding the qualities required for effective leadership.
from langchain import PromptTemplate
from langchain.chains import LLMChain
template = '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''
prompt = PromptTemplate(
    input_variables=['text', 'language'],
    template=template
)
llm.get_num_tokens(prompt.format(text=text, language='English'))
297
chain = LLMChain(llm=llm, prompt=prompt)
summary = chain.run({'text': text, 'language':'hindi'})
print(summary)
The provided code demonstrates how to use a prompt template to generate a concise and short summary of a given text using a summarization chain. It utilizes various components from the langchain library, including prompt templates, chat models, and the summarize chain.
# Usage
# Ensure that you have the required dependencies installed and import the necessary modules.
# Load the input text that needs to be summarized.
# Define a prompt template with the desired structure, incorporating the text placeholder.
# Create an instance of the PromptTemplate class, specifying the input variables and the template.
# Load the summarization chain using the load_summarize_chain function, providing the chat model, chain type, prompt, and any other relevant parameters.
# Run the summarization chain on the input text using the run method.
# Access the output summary generated by the chain.
from langchain import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
with open('simonsinek.txt', encoding='utf-8') as f:
    text = f.read()
# text
docs = [Document(page_content=text)]
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
template = '''Write a concise and short summary of the following text.
TEXT: `{text}`
'''
prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)
chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=False
)
output_summary = chain.run(docs)
print(output_summary)
The provided code demonstrates how to perform summarization on a large text using text splitting and a map-reduce summarization chain. It utilizes components from the langchain library, including prompt templates, chat models, text splitters, and the summarize chain.
# Usage
# Ensure that you have the required dependencies installed and import the necessary modules.
# Load the input text that needs to be summarized.
# Create an instance of the chat model (ChatOpenAI in this case) for generating summaries.
# Determine the number of tokens in the input text using the get_num_tokens method of the chat model.
# Configure the text splitting process by creating an instance of RecursiveCharacterTextSplitter, specifying the desired chunk_size and chunk_overlap.
# Split the input text into smaller chunks using the create_documents method of the text splitter.
# Obtain the number of chunks generated.
# Load the summarization chain using the load_summarize_chain function, providing the chat model, chain type, and any other relevant parameters.
# Run the summarization chain on the chunks using the run method.
# Access the output summary generated by the chain.
from langchain import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
with open('simonsinek.txt', encoding='utf-8') as f:
    text = f.read()
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
num_tokens = llm.get_num_tokens(text)
print(num_tokens) # 3435
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = text_splitter.create_documents([text])
num_chunks = len(chunks)
print(num_chunks) # 2
chain = load_summarize_chain(llm, chain_type='map_reduce', verbose=False)
output_summary = chain.run(chunks)
print(output_summary)
chain.llm_chain.prompt.template
chain.combine_document_chain.llm_chain.prompt.template
'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
Custom Prompts
map_prompt = '''
Write a short and concise summary of the following:
Text: `{text}`
CONCISE SUMMARY:
'''
map_prompt_template = PromptTemplate(
    input_variables=['text'],
    template=map_prompt
)
combine_prompt = '''
Write a concise summary of the following text that covers the key points.
Add a title to the summary.
Start your summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
Text: `{text}`
'''
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=['text'])
summary_chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=False
)
output = summary_chain.run(chunks)
print(output)
Title: "The Infinite Game" by Simon Sinek: Embracing an Infinite Mindset in Leadership
Introduction:
"The Infinite Game" by Simon Sinek explores the concept of adopting an infinite mindset in leadership, contrasting it with the finite-minded approach. Sinek emphasizes the importance of prioritizing a greater cause and long-term sustainability over short-term gains.
Key Points:
- Finite-minded leaders focus on winning and short-term gains, while infinite-minded leaders prioritize a greater cause and long-term sustainability.
- Guidelines for implementing an infinite game plan include advancing a just cause, building trusting teams, studying worthy rivals, preparing for flexibility, and demonstrating courage.
- Clarity of purpose is crucial, and the current form of capitalism is criticized for prioritizing profits over employee well-being and long-term success.
- Examples of organizations successfully embracing an infinite mindset include Lego and Patagonia.
Conclusion:
"The Infinite Game" highlights the importance of business leaders focusing on a Just Cause and having a clear vision. It emphasizes the need for building trusting teams, studying rivals, embracing existential flexibility, and demonstrating courage. By challenging convention and creating lasting companies, leaders can contribute to a better world.
# Ensure that you have the required dependencies installed and import the necessary modules.
# Load the document using the desired document loader (in this case, UnstructuredPDFLoader).
# Split the loaded document into smaller chunks using the RecursiveCharacterTextSplitter with specified chunk_size and chunk_overlap parameters.
# Determine the number of chunks generated.
# Create an instance of the chat model (ChatOpenAI in this case) for generating summaries.
# Define a function (print_embedding_cost in this case) to calculate and print the embedding cost for the chunks.
# Calculate and print the embedding cost using the defined function.
# Load the summarization chain using the load_summarize_chain function, providing the chat model, chain type, and any other relevant parameters.
# Run the summarization chain on the chunks, producing an initial summary.
# Print the initial summary.
# Define prompt templates for refinement-based summarization, including an initial prompt template and a refinement prompt template.
# Load a new summarization chain using the load_summarize_chain function, specifying the prompt templates and other relevant parameters.
# Run the new summarization chain on the chunks, producing a refined summary.
# Print the refined summary.
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)
!pip install unstructured -q
!pip install pdf2image
loader = UnstructuredPDFLoader('data-catalogs.pdf')
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
num_chunks = len(chunks)
print(num_chunks)
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.002:.6f}')
print_embedding_cost(chunks)
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    verbose=True
)
output_summary = chain.run(chunks)
print(output_summary)
prompt_template = """Write a concise summary of the following extracting the key information:
Text: `{text}`
CONCISE SUMMARY:"""
initial_prompt = PromptTemplate(template=prompt_template, input_variables=['text'])
refine_template = '''
Your job is to produce a final summary.
I have provided an existing summary up to a certain point: {existing_answer}.
Please refine the existing summary with some more context below.
------------
{text}
------------
Start the final summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
'''
refine_prompt = PromptTemplate(
    template=refine_template,
    input_variables=['existing_answer', 'text']
)
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=False
)
output_summary = chain.run(chunks)
print(output_summary)
Introduction:
The demand for data catalogs is increasing as organizations struggle to manage and analyze diverse and distributed data assets. Data and analytics leaders are advised to adopt machine learning (ML)-augmented data catalogs as part of their data management strategy. Implementing data catalogs without a strategic plan can lead to metadata silos. Data catalogs offer a solution to inventory and classify data assets, but selecting the right solution can be challenging. Without a comprehensive strategy, organizations may struggle to manage and monetize data assets and comply with regulations. By 2021, organizations with curated data catalogs will realize twice the business value compared to those without. Over 80% of data lake projects will fail due to difficulties in finding and curating data. Traditional IT-led data catalog projects without ML assistance will fail to be delivered on time. ML-augmented data catalogs simplify metadata management tasks and reduce time to insight. They assist with discovering metadata, understanding and enriching data, contributing and governing data, and consuming data. Modern ML-enabled data catalogs allow for a crowdsourced model to be developed, enabling users to rate datasets and tag risky attributes. Data catalogs also help create a trust-based governance model and provide a graphical representation of data lineage. Business users can consume data through embedded query editors and REST-based APIs. ML-augmented data catalogs have evolved to collect and share all forms of metadata, enabling automation of data management and integration tasks. There are various types of data catalog tools available, including stand-alone data catalogs and broader metadata management tools with data catalog capabilities. Organizations should identify use cases for metadata management and inventory all forms of metadata to improve data management activities. The abundance of data catalog tool choices can create confusion, so selecting the most appropriate tool category is crucial. Some metadata management tools are simply being rebranded as data catalogs, so organizations need to be careful when selecting a vendor. Data integration and data virtualization tools, data lake enablement tools, data preparation tools, and modern analytics/BI and data science platforms are embedding data catalog capabilities. Cloud infrastructure providers also offer data catalogs for their environments.
Bullet Points:
- Passive metadata is static in nature and requires manual updates, while active metadata provides continuous access, analysis, and feedback on all metadata parameters.
- Customer interest in data catalogs has increased significantly, making it a lucrative market for vendors.
- Organizations should consider ML-augmented data catalogs to simplify metadata management tasks and reduce time to insight.
- ML-enabled data catalogs allow for a crowdsourced model, enabling users to rate datasets and tag risky attributes.
- Data catalogs help create a trust-based governance model and provide a graphical representation of data lineage.
- Various types of data catalog tools are available, including stand-alone data catalogs and broader metadata management tools with data catalog capabilities.
- Organizations should identify use cases for metadata management and inventory all forms of metadata to improve data management activities.
- Careful consideration is needed when selecting a vendor, as some metadata management tools are simply rebranded as data catalogs.
- Data integration and data virtualization tools, data lake enablement tools, data preparation tools, and modern analytics/BI and data science platforms are embedding data catalog capabilities.
- Cloud infrastructure providers also offer data catalogs for their environments.
Conclusion:
In conclusion, adopting ML-augmented data catalogs and selecting the right tool category are essential for organizations to effectively manage and analyze their data assets. The increasing demand for data catalogs and the potential business value they offer make it a lucrative market for vendors. Organizations should carefully consider their metadata management needs and select a data catalog tool that best suits their requirements.
The provided code demonstrates how to use the langchain library to initialize an agent that can provide a short summary of a specific topic using zero-shot reaction and Wikipedia information retrieval. It utilizes components such as chat models, agents, and utility functions from the langchain library.
# Usage
# Ensure that you have the required dependencies installed and import the necessary modules.
# Create an instance of the chat model (ChatOpenAI in this case) to provide language generation capabilities.
# Create an instance of the WikipediaAPIWrapper utility class to retrieve information from Wikipedia.
# Define a list of Tool objects, where each object represents a specific tool or information source that the agent can use.
# Initialize the agent using the initialize_agent function, providing the list of tools, chat model, and agent type.
# Run the agent by calling the run method of the agent executor, passing the query or topic for which you want a short summary.
# Access the output generated by the agent.
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.utilities import WikipediaAPIWrapper
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
wikipedia = WikipediaAPIWrapper()
tools = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Useful for when you need to get information from Wikipedia about a single topic"
    )
]
agent_executor = initialize_agent(tools, llm, agent='zero-shot-react-description', verbose=True)
output = agent_executor.run('Can you please provide a short summary of Simon Sinek?')
print(output)
> Entering new chain...
I should use Wikipedia to get information about Simon Sinek.
Action: Wikipedia
Action Input: Simon Sinek
Observation: Page: Simon Sinek
Summary: Simon Oliver Sinek (born October 9, 1973) is a British-born American author and inspirational speaker. He is the author of five books, including Start With Why (2009) and The Infinite Game (2019).
Page: Sinek
Summary: Sinek is a surname. Notable people with the surname include:
Charles Sinek (born 1968), American ice dancer
Simon Sinek (born 1973), British-American author
Page: The Infinite Game
Summary: The Infinite Game is a 2019 book by Simon Sinek, applying ideas from James P. Carse's similarly titled book, Finite and Infinite Games to topics of business and leadership.The book is based on Carse's distinction between two types of games: finite games and infinite games. As Sinek explains, finite games (e.g. chess and football) are played with the goal of getting to the end of the game and winning, while following static rules. Every game has a beginning, middle, and end, and a final winner is distinctly recognizable. In contrast, infinite games (e.g. business and politics) are played for the purpose of continuing play rather than to win. Sinek claims that leaders who embrace an infinite mindset, aligned with infinite play, will build stronger, more innovative, inspiring, resilient organizations, though these benefits may accrue over larger timescales than benefits associated with a finite mindset.
Thought:I now know the final answer.
Final Answer: Simon Sinek is a British-born American author and inspirational speaker. He is the author of five books, including "Start With Why" and "The Infinite Game."
> Finished chain.
output:
Simon Sinek is a British-born American author and inspirational speaker. He is the author of five books, including "Start With Why" and "The Infinite Game."
Integrating Pinecone, a high-performance vector database, with LangChain, a framework for LLM-powered applications, enables the development of sophisticated applications. This integration adds "long-term memory" to LLMs, enhancing autonomous agents, chatbots, and question answering systems by leveraging scalable, real-time recommendation and search systems based on vector similarity search. The process of converting text to vector embeddings involves several stages, including preprocessing, tokenization, utilizing pre-trained word and sentence embeddings, or training your own embeddings. Pinecone plays a vital role in this journey by offering a scalable infrastructure for storing, indexing, and querying vector embeddings. By leveraging Pinecone's capabilities, developers can efficiently manage and retrieve embeddings, enabling tasks such as recommendation systems and content matching.
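As a recap, the complete flow from raw document to grounded answer reduces to a handful of calls to the helper functions defined in this post; the file name, index name, and question below are placeholders.
# End-to-end recap using the helpers defined above (file, index name, and question are placeholders)
data = load_document('your-document.pdf')                                  # 1. load the document
chunks = chunk_data(data, chunk_size=256)                                  # 2. split it into chunks
print_embedding_cost(chunks)                                               # 3. estimate the embedding cost
vector_store = insert_or_fetch_embeddings('your-index-name', chunks)       # 4. embed and index in Pinecone
print(ask_and_get_answer(vector_store, 'What is this document about?'))    # 5. retrieve and answer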
🔗 Read more about Pinecone here
🔗 Read more about Langchain here