emergingtrajectories.factsrag
=============================

.. py:module:: emergingtrajectories.factsrag

.. autoapi-nested-parse::

   This is an experimental approach to tracking information regardless of source type. It is also meant to support ad hoc queries over all collected facts, not only recent updates. Here's how it works...

   1. All "Content Sources" (a new class type that obtains content) will send content directly to the Facts DB.
   2. The "Facts DB" will then extract all relevant facts for a prediction or research theme. It will cache the original content, track the sources, and input all the facts into a RAG database.
   3. We can then query the DB for relevant facts on an ad hoc basis, rather than only for new content.


Attributes
----------

.. autoapisummary::

   emergingtrajectories.factsrag._DEFAULT_NUM_SEARCH_RESULTS
   emergingtrajectories.factsrag.facts_base_system_prompt
   emergingtrajectories.factsrag.facts_base_user_prompt
   emergingtrajectories.factsrag.fact_system_prompt


Classes
-------

.. autoapisummary::

   emergingtrajectories.factsrag.FactRAGFileCache
   emergingtrajectories.factsrag.FactBot


Functions
---------

.. autoapisummary::

   emergingtrajectories.factsrag.uri_to_local
   emergingtrajectories.factsrag.clean_fact_citations


Module Contents
---------------

.. py:data:: _DEFAULT_NUM_SEARCH_RESULTS
   :value: 10


.. py:data:: facts_base_system_prompt
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """You are a researcher tasked with helping forecast economic and social trends. The title of our research project is: {statement_title}.
      
      The project description is as follows...
      {statement_description}
      
      We will provide you with content from reports and web pages that is meant to help with the above. We will ask you to review these documents, create a set of bullet points to inform your thinking. Rather than using bullet points, please list each as F1, F2, F3, etc... So that we can reference it.
      
      The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you.
      
      For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the number.
      
      Thus, a bullet point would look like this:
      F1: (information) [1]
      F2: (information) [1]
      F3: (information) [2]
      
      ... and so on, where F1, F2, F3, etc. are facts, and [1], [2] are the source documents you are extracting the facts from.
      """

   .. raw:: html

      </details>
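
   The prompt above asks the model to label facts ``F1``, ``F2``, ... and to cite sources as ``[n]``. The following is a minimal sketch of how such a response could be parsed. The helper ``parse_fact_lines`` and its regular expression are illustrative assumptions and are not part of this module.

   .. code-block:: python

      import re

      # Hypothetical pattern: "F<number>: <text> [<source number>]",
      # with an optional trailing period after the citation.
      FACT_LINE = re.compile(r"^F(\d+):\s*(.*?)\s*\[(\d+)\]\.?\s*$")

      def parse_fact_lines(response: str) -> list[tuple[str, str, int]]:
          """Return (fact_label, fact_text, source_number) for each matching line."""
          parsed = []
          for line in response.splitlines():
              match = FACT_LINE.match(line.strip())
              if match:
                  label = f"F{match.group(1)}"
                  parsed.append((label, match.group(2), int(match.group(3))))
          return parsed

      sample = "F1: Rates rose in March [1]\nF2: Inflation cooled [2]\nSome insight text."
      print(parse_fact_lines(sample))

   Lines that do not follow the ``F#: ... [#]`` layout, such as free-form insights, are skipped in this sketch.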


.. py:data:: facts_base_user_prompt
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """Today's date is {the_date}. We will now provide you with all the content we've managed to collect. 
      
      ----------------------
      {scraped_content}
      ----------------------
      
      Please think step-by-step by (a) extracting critical bullet points from the above, and (b) share any insights you might have based on the facts.
      
      The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you.
      
      For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the actual number.
      
      DO NOT PROVIDE A FORECAST, BUT SIMPLY STATE AND SHARE THE FACTS AND INSIGHTS YOU HAVE GATHERED.
      """

   .. raw:: html

      </details>


.. py:data:: fact_system_prompt
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """You are a researcher helping extract facts about {topic}, trends, and related observations. We will give you a piece of content scraped on the web. Please extract facts from this. Each fact should stand on its own, and can be several sentences long if need be. You can have as many facts as needed. For each fact, please start it as a new line with "---" as the bullet point. For example:
      
      --- Fact 1... This is the fact.
      --- Here is a second fact.
      --- And a third fact.
      
      Please do not include new lines between bullet points. Make sure you write your facts in ENGLISH. Translate any foreign language content/facts/observations into ENGLISH.
      
      We will simply provide you with content and you will just provide facts."""

   .. raw:: html

      </details>
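
   Because the prompt asks for one fact per line, each starting with ``---``, a response can be split back into individual facts with a few lines of code. This is only a sketch under that assumption; ``split_facts`` is a hypothetical helper, and the module's own parsing may differ.

   .. code-block:: python

      def split_facts(response: str) -> list[str]:
          """Collect the text of every line that starts with the '---' marker."""
          facts = []
          for line in response.splitlines():
              line = line.strip()
              if line.startswith("---"):
                  fact = line[3:].strip()
                  if fact:  # ignore a bare '---' with no text after it
                      facts.append(fact)
          return facts

      reply = "--- Fact one.\n--- Fact two spans a sentence. And another.\n"
      print(split_facts(reply))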


.. py:function:: uri_to_local(uri: str) -> str

   Convert a URI to a local file name. In this case, we typically will use an MD5 sum.

   :param uri: The URI to convert.
   :type uri: str

   :returns: The MD5 sum of the URI.
   :rtype: str
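
   As a rough illustration of the idea, an MD5-based file name can be derived with the standard library. The function name ``uri_to_local_sketch``, the UTF-8 encoding, and the hexadecimal digest are assumptions; the actual implementation may differ.

   .. code-block:: python

      import hashlib

      def uri_to_local_sketch(uri: str) -> str:
          # Assumption: UTF-8 encoding and a hexadecimal digest.
          return hashlib.md5(uri.encode("utf-8")).hexdigest()

      name = uri_to_local_sketch("https://example.com/report")
      print(len(name))  # 32 hexadecimal characters

   The same URI always maps to the same name, which is what makes the digest usable as a cache key.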


.. py:class:: FactRAGFileCache(folder_path: str, openai_api_key: str, cache_file: str = 'cache.json', sources_file: str = 'sources.json', facts_file: str = 'facts.json', rag_db_folder='cdb', crawler=None)

   This is a RAG-based fact database. We build a database of facts available in JSON and via RAG and use this as a basic search engine for information. We use ChromaDB to index all facts, but also maintain a list of facts, sources, etc. in a JSON file. Finally, we keep a cache of all content and assume URLs do not get updated; we'll change this process in the future.

   :param folder_path: The folder where everything will be stored.
   :type folder_path: str
   :param openai_api_key: The OpenAI API key. Used for RAG embeddings.
   :type openai_api_key: str
   :param cache_file: The name of the cache file. Defaults to "cache.json".
   :type cache_file: str, optional
   :param sources_file: The name of the sources file. Defaults to "sources.json".
   :type sources_file: str, optional
   :param facts_file: The name of the facts file. Defaults to "facts.json".
   :type facts_file: str, optional
   :param rag_db_folder: The folder where the ChromaDB database will be stored. Defaults to "cdb".
   :type rag_db_folder: str, optional
   :param crawler: The crawler to use. Defaults to None, in which case a Playwright crawler will be used.
   :type crawler: optional


   .. py:attribute:: root_path


   .. py:attribute:: root_parsed


   .. py:attribute:: root_original


   .. py:attribute:: cache_file


   .. py:attribute:: sources_file


   .. py:attribute:: facts_file


   .. py:attribute:: rag_db_folder


   .. py:attribute:: openai_api_key


   .. py:attribute:: chromadb_client


   .. py:attribute:: facts_rag_collection


   .. py:attribute:: cache


   .. py:attribute:: facts


   .. py:attribute:: sources


   .. py:method:: query_to_fact_list(query: str, n_results: int = 10, since_date: datetime.datetime = None) -> dict

      Takes a query and finds the closest semantic matches to the query in the knowledge base.

      :param query: The query to search for.
      :type query: str
      :param n_results: The number of results to return. Defaults to 10.
      :type n_results: int, optional
      :param since_date: The date to search from. Defaults to None, in which case all dates are searched.
      :type since_date: datetime, optional

      :returns: A dictionary of the facts found, keyed by fact ID, with each fact carrying its source, add date, and content info.
      :rtype: dict
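
      To illustrate the role of ``since_date``, the sketch below filters a dictionary shaped like the described return value. The key names (``source``, ``added``, ``content``), the fact IDs, and the inclusive comparison are assumptions made for this example.

      .. code-block:: python

         import datetime
         from typing import Optional

         # Hypothetical result shape: fact ID -> source, add date, and content.
         facts = {
             "f1": {"source": "https://example.com/a",
                    "added": datetime.datetime(2024, 1, 5),
                    "content": "Fact A"},
             "f2": {"source": "https://example.com/b",
                    "added": datetime.datetime(2024, 2, 20),
                    "content": "Fact B"},
         }

         def facts_since(facts: dict,
                         since_date: Optional[datetime.datetime] = None) -> dict:
             """Keep facts added on or after since_date; None keeps everything."""
             if since_date is None:
                 return dict(facts)
             return {fid: f for fid, f in facts.items() if f["added"] >= since_date}

         print(list(facts_since(facts, datetime.datetime(2024, 2, 1))))  # ['f2']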


   .. py:method:: query_to_fact_content(query: str, n_results: int = 10, since_date=None, skip_separator=False) -> str

      Takes a query and finds the closest semantic matches to the query in the knowledge base.

      :param query: The query to search for.
      :type query: str
      :param n_results: The number of results to return. Defaults to 10.
      :type n_results: int, optional
      :param since_date: The date to search from. Defaults to None, in which case all dates are searched.
      :type since_date: datetime, optional
      :param skip_separator: Whether to omit the horizontal line and title that are otherwise prepended and appended to the returned string. Defaults to False.
      :type skip_separator: bool, optional

      :returns: The content of the facts found, along with the fact IDs.
      :rtype: str


   .. py:method:: get_all_recent_facts(days: float = 1, skip_separator=False) -> str

      Returns a list of all facts and sources added in the last n days.

      :param days: The number of days to search back. Defaults to 1. Can be fractional as well.
      :type days: float, optional
      :param skip_separator: Whether to omit the horizontal line and title that are otherwise prepended and appended to the returned string. Defaults to False.
      :type skip_separator: bool, optional

      :returns: The content of the facts found, along with the fact IDs.
      :rtype: str
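
      A fractional ``days`` value maps naturally onto ``datetime.timedelta``. The sketch below shows how a cutoff could be computed; ``recent_cutoff`` is a hypothetical helper, and passing ``now`` explicitly is a choice made here to keep the example deterministic.

      .. code-block:: python

         import datetime

         def recent_cutoff(days: float, now: datetime.datetime) -> datetime.datetime:
             """Return the earliest timestamp that still counts as recent."""
             return now - datetime.timedelta(days=days)

         now = datetime.datetime(2024, 3, 10, 12, 0, 0)
         print(recent_cutoff(0.5, now))  # 2024-03-10 00:00:00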


   .. py:method:: save_facts_and_sources() -> None

      Saves facts and sources to their respective files.


   .. py:method:: add_fact(fact: str, url: str) -> bool

      Adds a fact to the knowledge base.

      :param fact: The fact to add.
      :type fact: str
      :param url: The URL source of the fact.
      :type url: str

      :returns: True if the fact was added, False otherwise.
      :rtype: bool


   .. py:method:: facts_from_url(url: str, topic: str) -> None

      Given a URL, extract facts from it and save them to ChromaDB and the facts dictionary. Also returns the facts in an array, in case one wants to analyze new facts.

      :param url: Location of the content.
      :type url: str
      :param topic: A brief description of the research you are undertaking.
      :type topic: str


   .. py:method:: new_get_rss_links(rss_url, topic) -> None

      Crawls an RSS feed and its posts.

      :param rss_url: The URL of the RSS feed.
      :type rss_url: str
      :param topic: A brief description of the research you are undertaking.
      :type topic: str


   .. py:method:: new_get_new_info_news(newsapi_api_key, topic, queries, top_headlines=False) -> None

      Uses the News API to find new information and extract facts from it.

      :param newsapi_api_key: The News API key.
      :type newsapi_api_key: str
      :param topic: A brief description of the research you are undertaking.
      :type topic: str
      :param queries: A list of queries to search for.
      :type queries: list[str]
      :param top_headlines: Whether to search for top headlines. Defaults to False.
      :type top_headlines: bool, optional


   .. py:method:: get_ft_news(ft_user, ft_pass, topic) -> None

      Uses the Financial Times Agent to find new information and extract facts from it.

      :param ft_user: The Financial Times username.
      :type ft_user: str
      :param ft_pass: The Financial Times password.
      :type ft_pass: str
      :param topic: A brief description of the research you are undertaking.
      :type topic: str


   .. py:method:: new_get_new_info_google(google_api_key, google_search_id, google_search_queries, topic) -> None

      Uses Google search to find new information and extract facts from it.

      :param google_api_key: The Google API key.
      :type google_api_key: str
      :param google_search_id: The Google search ID.
      :type google_search_id: str
      :param google_search_queries: A list of queries to search for.
      :type google_search_queries: list[str]
      :param topic: A brief description of the research you are undertaking.
      :type topic: str


   .. py:method:: save_state() -> None

      Saves the in-memory changes to the knowledge base to the JSON cache file.


   .. py:method:: load_facts() -> dict

      Loads the facts from the facts file.


   .. py:method:: load_sources() -> dict

      Loads the sources from the sources file.


   .. py:method:: load_cache() -> None

      Loads the cache from the cache file, or creates the relevant files and folders if one does not exist.


   .. py:method:: in_cache(uri: str) -> bool

      Checks if a URI is in the cache already.

      :param uri: The URI to check.
      :type uri: str

      :returns: True if the URI is in the cache, False otherwise.
      :rtype: bool


   .. py:method:: update_cache(uri: str, obtained_on: datetime.datetime, last_accessed: datetime.datetime) -> None

      Updates the cache file for a given URI, specifically when it was obtained and last accessed.

      :param uri: The URI to update.
      :type uri: str
      :param obtained_on: The date and time when the content was obtained.
      :type obtained_on: datetime
      :param last_accessed: The date and time when the content was last accessed.
      :type last_accessed: datetime


   .. py:method:: log_access(uri: str) -> None

      Saves the last accessed time and updates the accessed tracker for a given URI.

      :param uri: The URI to update.
      :type uri: str


   .. py:method:: get_unaccessed_content() -> list[str]

      Returns a list of URIs that have not been accessed by the agent.

      :returns: A list of URIs that have not been accessed by the agent.
      :rtype: list[str]


   .. py:method:: force_content(uri: str, content: str, check_exists: bool = True) -> bool

      Forces a specific URI to have specific content (both HTML and text content). Used to fill old links that we don't actually want to crawl.

      :param uri: The URI to force content for.
      :type uri: str
      :param content: The content to force.
      :type content: str
      :param check_exists: Whether to check if the content is already in the cache before forcing the new content. Defaults to True.
      :type check_exists: bool

      :returns: True if the content was forced, False otherwise.
      :rtype: bool
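
      The interaction between ``check_exists`` and the return value can be pictured with a plain dictionary standing in for the cache. This is a sketch of the assumed semantics only (an existing entry blocks the write when ``check_exists`` is true); ``force_content_sketch`` and ``store`` are hypothetical names.

      .. code-block:: python

         def force_content_sketch(store: dict, uri: str, content: str,
                                  check_exists: bool = True) -> bool:
             """Write content for uri; refuse to overwrite when check_exists is set."""
             if check_exists and uri in store:
                 return False
             store[uri] = content
             return True

         store = {"https://example.com/old": "original text"}
         print(force_content_sketch(store, "https://example.com/old", "new text"))  # False
         print(force_content_sketch(store, "https://example.com/old", "new text",
                                    check_exists=False))  # True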


   .. py:method:: get(uri: str) -> str

      Returns the content for a given URI. If the content is not in the cache, it will be scraped and added to the cache.

      :param uri: The URI to get the content for.
      :type uri: str

      :returns: The content for the given URI.
      :rtype: str
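
      The behaviour described here is a read-through cache: look the URI up first and scrape only on a miss. The sketch below shows that pattern with an in-memory dictionary. ``TinyContentCache`` and its ``fetch`` callback are hypothetical stand-ins for the file-backed cache and the crawler.

      .. code-block:: python

         from typing import Callable

         class TinyContentCache:
             """In-memory stand-in for a file-backed content cache."""

             def __init__(self, fetch: Callable[[str], str]):
                 self._fetch = fetch              # hypothetical scraper callback
                 self._store: dict[str, str] = {}
                 self.fetch_count = 0             # how many times we had to scrape

             def in_cache(self, uri: str) -> bool:
                 return uri in self._store

             def get(self, uri: str) -> str:
                 # Scrape only on a cache miss, then serve from the cache.
                 if not self.in_cache(uri):
                     self._store[uri] = self._fetch(uri)
                     self.fetch_count += 1
                 return self._store[uri]

         cache = TinyContentCache(lambda uri: f"content of {uri}")
         first = cache.get("https://example.com")
         second = cache.get("https://example.com")
         print(cache.fetch_count)  # 1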


   .. py:method:: add_content(content: str, uri: str = None) -> None

      Adds content to cache.

      :param content: The content to add to the cache.
      :type content: str
      :param uri: The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
      :type uri: str, optional


   .. py:method:: add_content_from_file(filepath: str, uri: str = None) -> None

      Adds content from a text file to the cache.

      :param filepath: The path to the file to add to the cache.
      :type filepath: str
      :param uri: The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
      :type uri: str, optional


.. py:class:: FactBot(knowledge_db: FactRAGFileCache, openai_api_key: str = None, chatbot: phasellm.llms.ChatBot = None)

   The FactBot is like a ChatBot, but it lets you ask questions that reference an underlying RAG database (a ``FactRAGFileCache``), so the chatbot can cite sourceable facts.

   :param knowledge_db: The knowledge database to use.
   :type knowledge_db: FactRAGFileCache
   :param openai_api_key: The OpenAI API key. Defaults to None.
   :type openai_api_key: str, optional
   :param chatbot: The PhaseLLM chatbot to use. Defaults to None, in which case an OpenAI chatbot is used (and the OpenAI API key must be provided).
   :type chatbot: ChatBot, optional


   .. py:attribute:: knowledge_db


   .. py:method:: ask(question: str, clean_sources: bool = True) -> str

      Ask a question to the FactBot. This will query the underlying knowledge database and use the returned facts to answer the question.

      :param question: The question to ask.
      :type question: str
      :param clean_sources: Whether to clean the sources in the response. Defaults to True; in this case, it will replace fact IDs with relevant source links at the end of the response.
      :type clean_sources: bool, optional

      :returns: The response to the question.
      :rtype: str


   .. py:method:: source(fact_id: str) -> str

      Returns the URL source for a given fact ID.

      :param fact_id: The fact ID to get the source for.
      :type fact_id: str

      :returns: The URL source for the given fact ID.
      :rtype: str


   .. py:method:: clean_and_source_to_html(text_to_clean: str, start_count: int = 0) -> list

      Returns a formatted response with sourced HTML. This is used for emergingtrajectories.com and acts as a base for anyone else wanting to build similar features.

      :param text_to_clean: The text to clean/cite/source.
      :type text_to_clean: str
      :param start_count: The starting count for the sources. Defaults to 0.
      :type start_count: int, optional

      :returns: Two strings (the response first, the sources second) followed by an integer representing the new source count.
      :rtype: list


.. py:function:: clean_fact_citations(knowledge_db: FactRAGFileCache, text_to_clean: str) -> str

   Converts fact IDs referenced in a piece of text to relevant source links, appending sources as end notes in the document/text.

   :param knowledge_db: The knowledge database to use for fact lookups.
   :type knowledge_db: FactRAGFileCache
   :param text_to_clean: The text to clean.
   :type text_to_clean: str

   :returns: The cleaned text.
   :rtype: str
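
   The general technique, replacing inline fact IDs with numbered end notes, can be sketched as follows. The bracketed ``[f1]`` ID format, the ``fact_sources`` mapping, the ``Sources:`` heading, and the helper name ``cite_with_end_notes`` are all assumptions for illustration; the real function looks facts up in the ``FactRAGFileCache`` and may format its output differently.

   .. code-block:: python

      import re

      def cite_with_end_notes(text: str, fact_sources: dict[str, str]) -> str:
          """Replace [fact_id] markers with numbered notes and append the sources."""
          numbering: dict[str, int] = {}  # source URL -> note number

          def replace(match: re.Match) -> str:
              url = fact_sources.get(match.group(1))
              if url is None:
                  return match.group(0)  # leave unknown markers untouched
              # Reuse the note number when the same source is cited again.
              number = numbering.setdefault(url, len(numbering) + 1)
              return f"[{number}]"

          body = re.sub(r"\[(f\d+)\]", replace, text)
          notes = [f"{number}: {url}" for url, number in numbering.items()]
          if not notes:
              return body
          return body + "\n\nSources:\n" + "\n".join(notes)

      sources = {
          "f1": "https://example.com/a",
          "f2": "https://example.com/b",
          "f3": "https://example.com/a",
      }
      print(cite_with_end_notes("Prices rose [f1] and wages followed [f2]; see also [f3].", sources))

   Facts that share a source receive the same note number, so each URL is listed once in the end notes.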