emergingtrajectories.factsrag3¶
This is an experimental approach to tracking information regardless of source type. It is also meant to support queries over all stored facts, not just recent updates. Here’s how it works…
All “Content Sources” (a new class type that obtains content) will send content directly to the Facts DB.
The “Facts DB” will then extract all relevant facts for a prediction or research theme. It will cache the original content, track the sources, and add all the facts to a RAG database.
We can then query the DB for relevant facts on an ad hoc basis, rather than only for new content.
Attributes¶
_DEFAULT_NUM_SEARCH_RESULTS |
facts_base_system_prompt |
facts_base_user_prompt |
Classes¶
VectorDBDict | Initialize the database. |
FactRAGFileCache | This is a RAG-based fact database. We build a database of facts available in JSON and via RAG and use this as a basic search engine for information. We use our own DB to index all facts, but also maintain a list of facts, sources, etc. in a JSON file. Finally, we keep a cache of all content and assume URLs do not get updated; we'll change this process in the future. |
FactBot | The FactBot is like a ChatBot but enables you to ask questions that reference an underlying RAG database (a FactRAGFileCache), which then enables the chatbot to cite sourceable facts. |
Functions¶
uri_to_local(uri) | Convert a URI to a local file name. In this case, we typically will use an MD5 sum. |
clean_fact_citations(knowledge_db, text_to_clean) | Converts fact IDs referenced in a piece of text to relevant source links, appending sources as end notes in the document/text. |
Module Contents¶
- emergingtrajectories.factsrag3._DEFAULT_NUM_SEARCH_RESULTS = 10¶
- emergingtrajectories.factsrag3.facts_base_system_prompt = Multiline-String¶
"""You are a researcher tasked with helping forecast economic and social trends. The title of our research project is: {statement_title}. The project description is as follows... {statement_description} We will provide you with content from reports and web pages that is meant to help with the above. We will ask you to review these documents, create a set of bullet points to inform your thinking. Rather than using bullet points, please list each as F1, F2, F3, etc... So that we can reference it. The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you. For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the number. Thus, a bullet point would look like this: F1: (information) [1] F2: (information) [1] F3: (information) [2] ... and so on, where F1, F2, F3, etc. are facts, and [1], [2] are the source documents you are extracting the facts from. """
- emergingtrajectories.factsrag3.facts_base_user_prompt = Multiline-String¶
"""Today's date is {the_date}. We will now provide you with all the content we've managed to collect. ---------------------- {scraped_content} ---------------------- Please think step-by-step by (a) extracting critical bullet points from the above, and (b) share any insights you might have based on the facts. The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you. For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the actual number. DO NOT PROVIDE A FORECAST, BUT SIMPLY STATE AND SHARE THE FACTS AND INSIGHTS YOU HAVE GATHERED. """
- emergingtrajectories.factsrag3.uri_to_local(uri: str) str¶
Convert a URI to a local file name. In this case, we typically will use an MD5 sum.
- Parameters:
uri (str) – The URI to convert.
- Returns:
The MD5 sum of the URI.
- Return type:
str
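A minimal sketch of the MD5 mapping described above, using only the standard library. The function name `uri_to_local_sketch` is hypothetical, and the UTF-8 encoding and bare hex digest (no file extension) are assumptions rather than confirmed details of `uri_to_local`.

```python
import hashlib


def uri_to_local_sketch(uri: str) -> str:
    """Return the hex MD5 digest of a URI, usable as a flat local file name."""
    # Assumption: the URI is encoded as UTF-8 before hashing.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


local_name = uri_to_local_sketch("https://example.com/report?id=1")
same_name = uri_to_local_sketch("https://example.com/report?id=1")
```

Because the digest is deterministic, the same URI always maps to the same file name, which is what makes a cache lookup by URI possible.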
- class emergingtrajectories.factsrag3.VectorDBDict(db_file_path: str, openai_api_key: str, error_out_on_conflict: bool = False)¶
Initialize the database.
- Parameters:
db_file_path (str) – The path to the database file. Will be created if it does not exist.
openai_api_key (str) – The OpenAI API key.
error_out_on_conflict (bool) – If True, we will error out if the database tries to write to a file that doesn’t align with the SHA hash of the DB file when it was loaded. Basically like a lock without actually being a lock. When it’s set to False, it will simply print an error.
- VECTOR_SIZE = 1536¶
- MAX_BATCH_SIZE = 100¶
- db_file_path¶
- error_out_on_conflict = False¶
- openai_api_key¶
- openai_client¶
- encoding¶
- db¶
- db_hash = ''¶
- add_vectors(vectors: numpy.array, texts: list, metadata: list = None) list¶
Add vectors to the database.
- Parameters:
vectors (np.array) – The vectors to add.
texts (list) – The text for each vector.
metadata (list, optional) – The metadata for each vector.
- Returns:
The IDs of the vectors.
- Return type:
list
- add_vector(vector: numpy.array, text: str, metadata: dict = None) int¶
Adds a vector to the database.
- Parameters:
vector (np.array) – The vector to add.
text (str) – The text for the vector.
metadata (dict) – The metadata for the vector.
- Returns:
The ID of the vector.
- Return type:
int
- shorten_text(text, max_token_length: int = 8000) str¶
Shortens text to a maximum token length. This is useful for OpenAI API calls, which have a token limit.
- Parameters:
text (str) – The text to shorten.
max_token_length (int) – The maximum token length.
- Returns:
The shortened text; should ONLY be used for encoding.
- Return type:
str
- add_texts(texts: list, metadata: list = None) list¶
Adds text to the database. Calls an embedding function and then adds via add_vectors().
- Parameters:
texts (list) – The texts to add.
metadata (list) – Metadata to add.
- Returns:
The IDs of the texts.
- Return type:
list
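The `MAX_BATCH_SIZE = 100` attribute suggests that texts are embedded in batches. The sketch below shows one way such batching could work; it is an assumption, not the confirmed behaviour of `add_texts()`. The names `embed_in_batches` and `fake_embed` are hypothetical, and `fake_embed` stands in for a real embedding API call.

```python
MAX_BATCH_SIZE = 100  # mirrors the class attribute documented above

batch_sizes_seen = []  # records how many texts each embedding call received


def fake_embed(batch):
    """Hypothetical stand-in for an embedding call: one tiny vector per text."""
    batch_sizes_seen.append(len(batch))
    return [[float(len(text))] for text in batch]


def embed_in_batches(texts, embed_batch, batch_size=MAX_BATCH_SIZE):
    """Embed texts in slices of at most batch_size, preserving input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors


texts = [f"fact number {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed)
```

With 250 texts, the stand-in embedder is called three times, with 100, 100, and 50 texts.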
- add_text(text: str, metadata: dict = None) int¶
Adds text to the database. Calls an embedding function and then adds via add_vectors().
- Parameters:
text (str) – The text to add.
metadata (dict) – Metadata to add.
- Returns:
The ID of the text.
- Return type:
int
- get_file_sha256(file_path: str) str¶
Get the SHA256 hash of a file. We use this to warn the user if/when the database is being saved and potentially conflicts with the underlying file. Written by GitHub Copilot! 🙌
- Parameters:
file_path (str) – The path to the file.
- Returns:
The SHA256 hash of the file.
- Return type:
str
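A sketch of the hash-based conflict check described for `error_out_on_conflict`: hash the file on load, then compare before saving. The helper names `file_sha256` and `has_conflict` and the chunked read are illustrative assumptions, not the library's code.

```python
import hashlib
import os
import tempfile


def file_sha256(file_path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_conflict(file_path: str, hash_at_load: str) -> bool:
    """True if the file on disk no longer matches the hash recorded at load time."""
    return file_sha256(file_path) != hash_at_load


with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "vector_db.pickle")
    with open(db_path, "wb") as handle:
        handle.write(b"version 1")
    hash_at_load = file_sha256(db_path)
    unchanged = has_conflict(db_path, hash_at_load)  # nothing wrote to the file
    with open(db_path, "wb") as handle:
        handle.write(b"version 2")  # simulates another process saving
    changed = has_conflict(db_path, hash_at_load)
```

A caller could raise an exception when `has_conflict` returns True and `error_out_on_conflict` is set, and print a warning otherwise.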
- save()¶
Save the database to disk.
- count() int¶
Returns the size of the DB.
- get(index: int) dict¶
Returns the text and metadata for a given index.
- Parameters:
index (int) – The index to get.
- Returns:
The text and metadata for the index.
- Return type:
dict
- query(text: str, n: int = 10) list¶
Returns the closest vector IDs to a specific query/text.
- Parameters:
text (str) – The text to search for.
n (int) – The number of results to return.
- Returns:
The IDs of the vectors (in order of closest to farthest).
- Return type:
list
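The source does not state which distance metric `query()` uses. The sketch below assumes cosine similarity and uses toy two-dimensional vectors in place of 1536-dimensional embeddings; `cosine_similarity` and `closest_ids` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_ids(query_vector, stored_vectors, n=10):
    """Return the indices of the n stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(stored_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # closest first
    return [index for _, index in scored[:n]]


stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
result = closest_ids([1.0, 0.1], stored, n=2)
```

Here the query points almost along the first axis, so vector 0 ranks first and the diagonal vector 2 ranks second.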
- query_min_date(text: str, min_date: datetime.datetime, n: int = 10, date_field: str = 'datetime') list¶
Returns the closest vector IDs to a specific query/text, with a minimum date filter.
- Parameters:
text (str) – The text to search for.
min_date (datetime) – The minimum date to search from.
n (int) – The number of results to return.
date_field (str) – The field in the metadata to use for the date.
- Returns:
The IDs of the vectors (in order of closest to farthest).
- Return type:
list
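One way a minimum-date filter could be layered on a ranked result list is sketched below. It assumes each vector's metadata stores a `datetime` object under `date_field`; how the library actually stores dates is not stated in the source, and `filter_ids_by_min_date` is a hypothetical helper.

```python
from datetime import datetime


def filter_ids_by_min_date(ranked_ids, metadata_by_id, min_date, n=10,
                           date_field="datetime"):
    """Keep up to n IDs, in ranked order, whose date field is on or after min_date."""
    kept = []
    for vector_id in ranked_ids:
        if len(kept) >= n:
            break
        value = metadata_by_id.get(vector_id, {}).get(date_field)
        # Entries without a date are skipped rather than treated as recent.
        if value is not None and value >= min_date:
            kept.append(vector_id)
    return kept


metadata = {
    0: {"datetime": datetime(2024, 1, 1)},
    1: {"datetime": datetime(2024, 3, 1)},
    2: {},  # no date recorded
    3: {"datetime": datetime(2024, 2, 15)},
}
ranked = [2, 0, 3, 1]  # closest to farthest, e.g. from a similarity search
recent = filter_ids_by_min_date(ranked, metadata, datetime(2024, 2, 1))
```

Filtering after ranking preserves the closest-to-farthest order of the surviving IDs.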
- class emergingtrajectories.factsrag3.FactRAGFileCache(folder_path: str, openai_api_key: str, cache_file: str = 'cache.json', rag_db_file='vector_db.pickle', crawler=None, chunker=None)¶
This is a RAG-based fact database. We build a database of facts available in JSON and via RAG and use this as a basic search engine for information. We use our own DB to index all facts, but also maintain a list of facts, sources, etc. in a JSON file. Finally, we keep a cache of all content and assume URLs do not get updated; we’ll change this process in the future.
- Parameters:
folder_path (str) – The folder where everything will be stored.
openai_api_key (str) – The OpenAI API key. Used for RAG embeddings.
cache_file (str, optional) – The name of the cache file. Defaults to “cache.json”.
rag_db_file (str, optional) – The file name of the RAG vector database. Defaults to “vector_db.pickle”.
crawler (optional) – The crawler to use. Defaults to None, in which case a Playwright crawler will be used.
chunker (optional) – The sort of chunker to use. Defaults to None, in which case a GPT-4 chunker will be used.
- root_path¶
- root_parsed¶
- root_original¶
- cache_file¶
- rag_db_file¶
- openai_api_key¶
- cache¶
- vector_db¶
- get_facts_as_dict(n_results=-1, min_date: datetime.datetime = None) list¶
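Get all facts as a list of dictionaries.
- Parameters:
n_results (int, optional) – The number of results to return. Defaults to -1, in which case all results are returned.
min_date (datetime, optional) – The earliest date to include facts from. Defaults to None, in which case no date filter is applied.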
- Returns:
A list of fact dictionaries containing content, source, and added (the date string for when the fact was added).
- Return type:
list
- get_facts_as_list() list¶
Get all facts as a list.
- Parameters:
None
- Returns:
A list of facts (as strings).
- Return type:
list
- count_facts() int¶
Returns the number of facts in the knowledge database.
- Returns:
The number of facts in the knowledge database.
- Return type:
int
- get_fact_details(fact_id: str) dict¶
Returns a structure similar to that of query_to_fact_list(), but for a specific fact ID. Returns None if the fact ID is not found.
- Parameters:
fact_id (str) – The fact ID to get.
- Returns:
A dictionary with the content, source, added date, and added timestamp.
- Return type:
dict
- query_to_fact_list(query: str, n_results: int = 10, since_date: datetime.datetime = None) dict¶
Takes a query and finds the closest semantic matches to the query in the knowledge base.
- Parameters:
query (str) – The query to search for.
n_results (int, optional) – The number of results to return. Defaults to 10.
since_date (datetime, optional) – The date to search from. Defaults to None, in which case all dates are searched.
- Returns:
A dictionary of the facts found, keyed by fact ID, with each fact having its source, add date, and content info.
- Return type:
dict
- query_to_fact_content(query: str, n_results: int = 10, since_date=None, skip_separator=False) str¶
Takes a query and finds the closest semantic matches to the query in the knowledge base.
- Parameters:
query (str) – The query to search for.
n_results (int, optional) – The number of results to return. Defaults to 10.
since_date (datetime, optional) – The date to search from. Defaults to None, in which case all dates are searched.
skip_separator (bool, optional) – If True, omits the horizontal line and title that are otherwise prepended and appended to the returned string. Defaults to False.
- Returns:
The content of the facts found, along with the fact IDs.
- Return type:
str
- get_all_recent_facts(days: float = 1, skip_separator=False) str¶
Returns all facts and sources added in the last `days` days, formatted as a single string.
- Parameters:
days (float, optional) – The number of days to search back. Defaults to 1. Can be fractional as well.
skip_separator (bool, optional) – If True, omits the horizontal line and title that are otherwise prepended and appended to the returned string. Defaults to False.
- Returns:
The content of the facts found, along with the fact IDs.
- Return type:
str
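Because `days` may be fractional, the cutoff can be computed with `timedelta`, which accepts float days. The sketch below assumes each fact carries a `datetime` under a hypothetical `added_at` key; the library's real storage format (it mentions a date string and a timestamp) may differ, and `facts_added_since` is not part of the library.

```python
from datetime import datetime, timedelta


def facts_added_since(facts, days=1.0, now=None):
    """Return the facts whose 'added_at' datetime falls within the last `days` days."""
    reference = now if now is not None else datetime.now()
    cutoff = reference - timedelta(days=days)  # timedelta handles fractional days
    return [fact for fact in facts if fact["added_at"] >= cutoff]


now = datetime(2024, 5, 2, 12, 0)
facts = [
    {"content": "A", "added_at": datetime(2024, 5, 2, 9, 0)},    # 3 hours earlier
    {"content": "B", "added_at": datetime(2024, 5, 1, 18, 0)},   # 18 hours earlier
    {"content": "C", "added_at": datetime(2024, 4, 30, 12, 0)},  # 48 hours earlier
]
last_half_day = facts_added_since(facts, days=0.5, now=now)
last_day = facts_added_since(facts, days=1, now=now)
```

Passing `now` explicitly keeps the example deterministic; a real call would normally rely on the current time.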
- get_fact_source(fact_id: str) str¶
Returns the source of a fact given its ID.
- Parameters:
fact_id (str) – The fact ID to get.
- Returns:
The source of the fact.
- Return type:
str
- get_fact_content(fact_id: str) str¶
Returns the content of a fact given its ID.
- Parameters:
fact_id (str) – The fact ID to get.
- Returns:
The content of the fact.
- Return type:
str
- add_fact(fact: str, url: str) bool¶
Adds a fact to the knowledge base.
- Parameters:
fact (str) – The fact to add.
url (str) – The URL source of the fact.
- Returns:
True if the fact was added, False otherwise.
- Return type:
bool
- add_facts(facts: list, sources: list) bool¶
Adds facts to the knowledge base.
- Parameters:
facts (list) – List of strings. Each string is a fact.
sources (list) – List of sources (e.g., URLs) for the facts.
- Returns:
True if the facts were added, False otherwise.
- Return type:
bool
- facts_from_url(url: str, topic: str) None¶
Given a URL, extract facts from it and save them to our DB and the facts dictionary. Also returns the facts in an array, in case one wants to analyze new facts.
- Parameters:
url (str) – Location of the content.
topic (str) – a brief description of the research you are undertaking.
- new_get_rss_links(rss_url, topic) None¶
Crawls an RSS feed and its posts.
- Parameters:
rss_url (str) – The URL of the RSS feed.
topic (str) – a brief description of the research you are undertaking.
- new_get_new_bing_news(api_key, subscription_endpoint, topic, queries) None¶
Uses Bing to get recent news on the queries and associated topics.
- Parameters:
api_key (str) – The Bing API key.
subscription_endpoint (str) – The Bing subscription endpoint.
topic (str) – a brief description of the research you are undertaking.
queries (list[str]) – A list of queries to search for.
- new_get_new_info_news(newsapi_api_key, topic, queries, top_headlines=False) None¶
Uses the News API to find new information and extract facts from it.
- Parameters:
newsapi_api_key (str) – The News API key.
topic (str) – a brief description of the research you are undertaking.
queries (list[str]) – A list of queries to search for.
top_headlines (bool, optional) – Whether to search for top headlines. Defaults to False.
- get_ft_news(ft_user, ft_pass, topic) None¶
Uses the Financial Times Agent to find new information and extract facts from it.
- Parameters:
ft_user (str) – The Financial Times username.
ft_pass (str) – The Financial Times password.
topic (str) – a brief description of the research you are undertaking.
- new_get_new_info_google(google_api_key, google_search_id, google_search_queries, topic) None¶
Uses Google search to find new information and extract facts from it.
- Parameters:
google_api_key (str) – The Google API key.
google_search_id (str) – The Google search ID.
google_search_queries (list[str]) – A list of queries to search for.
topic (str) – a brief description of the research you are undertaking.
- save_state() None¶
Saves the in-memory changes to the knowledge base to the JSON cache file.
- load_cache() None¶
Loads the cache from the cache file, or creates the relevant files and folders if one does not exist.
- in_cache(uri: str) bool¶
Checks if a URI is in the cache already.
- Parameters:
uri (str) – The URI to check.
- Returns:
True if the URI is in the cache, False otherwise.
- Return type:
bool
- update_cache(uri: str, obtained_on: datetime.datetime, last_accessed: datetime.datetime) None¶
Updates the cache file for a given URI, specifically when it was obtained and last accessed.
- Parameters:
uri (str) – The URI to update.
obtained_on (datetime) – The date and time when the content was obtained.
last_accessed (datetime) – The date and time when the content was last accessed.
- log_access(uri: str) None¶
Saves the last accessed time and updates the accessed tracker for a given URI.
- Parameters:
uri (str) – The URI to update.
- get_unaccessed_content() list[str]¶
Returns a list of URIs that have not been accessed by the agent.
- Returns:
A list of URIs that have not been accessed by the agent.
- Return type:
list[str]
- force_content(uri: str, content: str, check_exists: bool = True) bool¶
Forces a specific URI to have specific content (both HTML and text content). Used to fill old links that we don’t actually want to crawl.
- Parameters:
uri (str) – The URI to force content for.
content (str) – The content to force.
check_exists (bool) – checks if content has already been included in the cache before forcing the new content.
- Returns:
True if the content was forced, False otherwise.
- Return type:
bool
- get(uri: str) str¶
Returns the content for a given URI. If the content is not in the cache, it will be scraped and added to the cache.
- Parameters:
uri (str) – The URI to get the content for.
- Returns:
The content for the given URI.
- Return type:
str
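The cache-or-scrape behaviour of `get()` can be sketched with an in-memory dictionary and an injected fetch function. This is an illustration under assumptions: the real class persists content to disk and uses a crawler, and `ContentCacheSketch` and `fake_fetch` are hypothetical names.

```python
class ContentCacheSketch:
    """In-memory stand-in for a content cache keyed by URI."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that retrieves content for a URI
        self._store = {}

    def in_cache(self, uri: str) -> bool:
        return uri in self._store

    def get(self, uri: str) -> str:
        # Fetch only on a cache miss; later calls reuse the stored content.
        if uri not in self._store:
            self._store[uri] = self._fetch(uri)
        return self._store[uri]


fetch_calls = []


def fake_fetch(uri):
    """Hypothetical fetcher that records each call instead of crawling."""
    fetch_calls.append(uri)
    return f"content of {uri}"


cache = ContentCacheSketch(fake_fetch)
first = cache.get("https://example.com/a")
second = cache.get("https://example.com/a")
```

The second `get()` is served from the cache, which matches the documented assumption that content at a URL does not change.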
- add_content(content: str, uri: str = None) None¶
Adds content to cache.
- Parameters:
content (str) – The content to add to the cache.
uri (str, optional) – The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
- add_content_from_file(filepath: str, uri: str = None) None¶
Adds content from a text file to the cache.
- Parameters:
filepath (str) – The path to the file to add to the cache.
uri (str, optional) – The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
- class emergingtrajectories.factsrag3.FactBot(knowledge_db: FactRAGFileCache, openai_api_key: str = None, chatbot: emergingtrajectories.chunkers.ChatBot = None)¶
The FactBot is like a ChatBot but enables you to ask questions that reference an underlying RAG database (a FactRAGFileCache), which then enables the chatbot to cite sourceable facts.
- Parameters:
knowledge_db (FactRAGFileCache) – The knowledge database to use.
openai_api_key (str, optional) – The OpenAI API key. Defaults to None.
chatbot (ChatBot, optional) – The PhaseLLM chatbot to use. Defaults to None, in which case an OpenAI chatbot is used (and the OpenAI API key must be provided).
- knowledge_db¶
- ask(question: str, clean_sources: bool = True) str¶
Ask a question to the FactBot. This will query the underlying knowledge database and use the returned facts to answer the question.
- Parameters:
question (str) – The question to ask.
clean_sources (bool, optional) – Whether to clean the sources in the response. Defaults to True; in this case, it will replace fact IDs with relevant source links at the end of the response.
- Returns:
The response to the question.
- Return type:
str
- source(fact_id: str) str¶
Returns the URL source for a given fact ID.
- Parameters:
fact_id (str) – The fact ID to get the source for.
- Returns:
The URL source for the given fact ID.
- Return type:
str
- clean_and_source_to_html(text_to_clean: str, start_count: int = 0) list¶
Returns a formatted response with sourced HTML. This is used for emergingtrajectories.com and acts as a base for anyone else wanting to build similar features.
- Parameters:
text_to_clean (str) – The text to clean/cite/source.
start_count (int) – The starting count for the sources. Defaults to 0.
- Returns:
A list containing two strings (the response, then the sources) and an integer representing the new source count.
- Return type:
list
- emergingtrajectories.factsrag3.clean_fact_citations(knowledge_db: FactRAGFileCache, text_to_clean: str) str¶
Converts fact IDs referenced in a piece of text to relevant source links, appending sources as end notes in the document/text.
- Parameters:
knowledge_db (FactRAGFileCache) – The knowledge database to use for fact lookups.
text_to_clean (str) – The text to clean.
- Returns:
The cleaned text.
- Return type:
str
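The sketch below shows the general idea of replacing fact IDs with numbered end notes. The fact ID format (`[f1]`, `[f2]`, ...), the end-note layout, and the name `clean_citations_sketch` are all assumptions made for illustration; the source does not specify how `clean_fact_citations` formats its output.

```python
import re


def clean_citations_sketch(text, sources_by_fact_id):
    """Replace [fN] fact IDs with numbered notes and append the source list."""
    numbering = {}  # source URL -> end-note number, in order of first appearance

    def replace(match):
        url = sources_by_fact_id.get(match.group(1))
        if url is None:
            return match.group(0)  # leave unknown fact IDs untouched
        if url not in numbering:
            numbering[url] = len(numbering) + 1
        return f"[{numbering[url]}]"

    cleaned = re.sub(r"\[(f\d+)\]", replace, text)
    if numbering:
        notes = "\n".join(f"{number}: {url}" for url, number in numbering.items())
        cleaned += "\n\nSources:\n" + notes
    return cleaned


sources = {
    "f1": "https://a.example",
    "f2": "https://b.example",
    "f3": "https://a.example",
}
text = "Rates fell [f1]. Prices rose [f2]. Demand held [f3]. Unknown [f9]."
cleaned = clean_citations_sketch(text, sources)
```

Facts that share a source reuse the same end-note number, so each source is listed only once.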