emergingtrajectories.facts
==========================

.. py:module:: emergingtrajectories.facts

.. autoapi-nested-parse::

   Facts agent. Similar to the knowledge agent, but it simply provides a list of facts and their associated sources.

   This separates fact generation from forecast creation, which allows us to test different prompting strategies and LLMs.


Attributes
----------

.. autoapisummary::

   emergingtrajectories.facts._DEFAULT_NUM_SEARCH_RESULTS
   emergingtrajectories.facts.facts_base_system_prompt
   emergingtrajectories.facts.facts_base_user_prompt


Classes
-------

.. autoapisummary::

   emergingtrajectories.facts.FactBaseFileCache


Functions
---------

.. autoapisummary::

   emergingtrajectories.facts.uri_to_local
   emergingtrajectories.facts.clean_citations


Module Contents
---------------

.. py:data:: _DEFAULT_NUM_SEARCH_RESULTS
   :value: 10


.. py:data:: facts_base_system_prompt
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """You are a researcher tasked with helping forecast economic and social trends. The title of our research project is: {statement_title}.
      
      The project description is as follows...
      {statement_description}
      
      We will provide you with content from reports and web pages that is meant to help with the above. We will ask you to review these documents, create a set of bullet points to inform your thinking. Rather than using bullet points, please list each as F1, F2, F3, etc... So that we can reference it.
      
      The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you.
      
      For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the number.
      
      Thus, a bullet point would look like this:
      F1: (information) [1]
      F2: (information) [1]
      F3: (information) [2]
      
      ... and so on, where F1, F2, F3, etc. are facts, and [1], [2] are the source documents you are extracting the facts from.
      """

   .. raw:: html

      </details>


.. py:data:: facts_base_user_prompt
   :value: Multiline-String

   .. raw:: html

      <details><summary>Show Value</summary>

   .. code-block:: python

      """Today's date is {the_date}. We will now provide you with all the content we've managed to collect. 
      
      ----------------------
      {scraped_content}
      ----------------------
      
      Please think step-by-step by (a) extracting critical bullet points from the above, and (b) share any insights you might have based on the facts.
      
      The content we provided you contains source numbers in the format 'SOURCE: #'. When you extract facts, please include the citation in square brackets, with the #, like [#], but replace "#" with the actual Source # from the crawled content we are providing you.
      
      For example, if you are referring to a fact that came under --- SOURCE: 3 ---, you would write something like: "Data is already trending to hotter temperatures [3]." Do not include the "#" in the brackets, just the actual number.
      
      DO NOT PROVIDE A FORECAST, BUT SIMPLY STATE AND SHARE THE FACTS AND INSIGHTS YOU HAVE GATHERED.
      """

   .. raw:: html

      </details>
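
   The prompt above expects ``{scraped_content}`` to carry numbered source markers such as ``--- SOURCE: 3 ---``. The following is a minimal sketch of how such content could be assembled and substituted into a template. The helper ``build_scraped_content`` and the shortened ``user_template`` are hypothetical and only illustrate the placeholder mechanics; they are not the module's own code.

   .. code-block:: python

      # Hypothetical helper, not part of emergingtrajectories.facts.
      def build_scraped_content(pages: dict[str, str]) -> tuple[str, dict[int, str]]:
          """Join page texts under numbered source headers.

          Returns the combined text and a counter-to-URL mapping.
          """
          sections = []
          ctr_to_source = {}
          for ctr, (url, text) in enumerate(pages.items(), start=1):
              ctr_to_source[ctr] = url
              sections.append(f"--- SOURCE: {ctr} ---\n{text}")
          return "\n\n".join(sections), ctr_to_source


      # Stand-in template using the same placeholder names as
      # facts_base_user_prompt; the real prompt text is longer.
      user_template = "Today's date is {the_date}.\n{scraped_content}"

      content, ctr_to_source = build_scraped_content(
          {
              "https://example.com/a": "Alpha text.",
              "https://example.com/b": "Beta text.",
          }
      )
      prompt = user_template.format(the_date="2024-01-01", scraped_content=content)

   A counter-to-URL mapping of this shape matches what ``clean_citations`` takes as its ``ctr_to_source`` argument.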


.. py:function:: uri_to_local(uri: str) -> str

   Convert a URI to a local file name, typically the MD5 sum of the URI.

   :param uri: The URI to convert.
   :type uri: str

   :returns: The MD5 sum of the URI.
   :rtype: str
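
   A minimal sketch of this kind of mapping, assuming the URI is UTF-8 encoded and the hexadecimal digest is used as the file name. ``uri_to_local_sketch`` is an illustrative stand-in, not the function's actual body.

   .. code-block:: python

      import hashlib


      def uri_to_local_sketch(uri: str) -> str:
          """Return the hexadecimal MD5 digest of a URI (illustrative stand-in)."""
          # Assumption: the URI is encoded as UTF-8 before hashing.
          return hashlib.md5(uri.encode("utf-8")).hexdigest()


      name = uri_to_local_sketch("https://example.com/report.pdf")
      # `name` is a 32-character hex string that is safe to use as a file name.

   Hashing gives a fixed-length name free of characters such as ``/`` or ``?`` that are awkward in file paths, and the same URI always maps to the same name.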


.. py:function:: clean_citations(assistant_analysis: str, ctr_to_source: dict) -> str

   The analysis contains numerical citations that are likely out of order, and some source numbers may not be used at all. This function renumbers the citations so they follow proper numerical order and appends the URLs at the very end.

   :param assistant_analysis: The analysis text from the assistant.
   :type assistant_analysis: str
   :param ctr_to_source: The mapping of citation number to source URL.
   :type ctr_to_source: dict

   :returns: The cleaned analysis text, with citations following a proper numerical format and URIs at the end of the analysis.
   :rtype: str
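
   The renumbering described above can be sketched as follows. This is an illustration under stated assumptions, not the function's actual implementation: it assumes citations look like ``[3]``, that new numbers are assigned in order of first appearance, that every cited number is a key in ``ctr_to_source``, and that the source list is appended under a ``Sources:`` heading (a hypothetical format).

   .. code-block:: python

      import re


      def clean_citations_sketch(analysis: str, ctr_to_source: dict) -> str:
          """Renumber [n] citations by first appearance and append a source list."""
          old_to_new = {}

          def renumber(match):
              old = int(match.group(1))
              # Assign the next free number the first time a source is cited.
              if old not in old_to_new:
                  old_to_new[old] = len(old_to_new) + 1
              return f"[{old_to_new[old]}]"

          body = re.sub(r"\[(\d+)\]", renumber, analysis)
          # Only sources that were actually cited end up in the list.
          # Assumption: every cited number is a key in ctr_to_source.
          sources = [f"[{new}] {ctr_to_source[old]}" for old, new in old_to_new.items()]
          return body + "\n\nSources:\n" + "\n".join(sources)


      text = "F1: Prices rose [3]\nF2: Demand fell [1]\nF3: Prices rose again [3]"
      mapping = {
          1: "https://example.com/a",
          2: "https://example.com/b",
          3: "https://example.com/c",
      }
      cleaned = clean_citations_sketch(text, mapping)

   In this example, source 3 becomes ``[1]``, source 1 becomes ``[2]``, and the unused source 2 is left out of the appended list.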


.. py:class:: FactBaseFileCache(folder_path: str, cache_file: str = 'cache.json', crawler=None)

   The FactBaseFileCache is a simple file-based cache for web content and local files. The cache stores the original HTML, PDF, or TXT content and tracks when (if ever) an agent actually accessed the content.

   :param folder_path: The folder where the cache will be stored.
   :type folder_path: str
   :param cache_file: The name of the cache file. Defaults to "cache.json".
   :type cache_file: str, optional


   .. py:attribute:: root_path


   .. py:attribute:: root_parsed


   .. py:attribute:: root_original


   .. py:attribute:: cache_file


   .. py:attribute:: cache


   .. py:method:: summarize_new_info_multiple_queries(statement, chatbot, google_api_key, google_search_id, google_search_queries, fileout=None) -> str


   .. py:method:: summarize_new_info(statement, chatbot, google_api_key, google_search_id, google_search_query, fileout=None) -> str


   .. py:method:: save_state() -> None

      Saves the in-memory changes to the knowledge base to the JSON cache file.


   .. py:method:: load_cache() -> None

      Loads the cache from the cache file, or creates the relevant files and folders if one does not exist.
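
      A minimal sketch of the load-or-create and save pattern described for ``load_cache`` and ``save_state``, assuming the cache is a single JSON object stored in ``cache_file`` under ``folder_path``. The function names and the record stored below are hypothetical, and the real class may also create additional folders.

      .. code-block:: python

         import json
         import tempfile
         from pathlib import Path


         def load_cache_sketch(folder_path: str, cache_file: str = "cache.json") -> dict:
             """Read the JSON cache, creating the folder and an empty file if needed."""
             root = Path(folder_path)
             root.mkdir(parents=True, exist_ok=True)
             path = root / cache_file
             if not path.exists():
                 path.write_text("{}", encoding="utf-8")
             return json.loads(path.read_text(encoding="utf-8"))


         def save_state_sketch(
             cache: dict, folder_path: str, cache_file: str = "cache.json"
         ) -> None:
             """Write the in-memory cache back to the JSON file."""
             path = Path(folder_path) / cache_file
             path.write_text(json.dumps(cache, indent=2), encoding="utf-8")


         with tempfile.TemporaryDirectory() as tmp:
             cache = load_cache_sketch(tmp)  # first call creates cache.json
             cache["https://example.com/a"] = {"accessed": 0}
             save_state_sketch(cache, tmp)
             reloaded = load_cache_sketch(tmp)  # second call reads it back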


   .. py:method:: in_cache(uri: str) -> bool

      Checks if a URI is in the cache already.

      :param uri: The URI to check.
      :type uri: str

      :returns: True if the URI is in the cache, False otherwise.
      :rtype: bool


   .. py:method:: update_cache(uri: str, obtained_on: datetime.datetime, last_accessed: datetime.datetime) -> None

      Updates the cache file for a given URI, specifically when it was obtained and last accessed.

      :param uri: The URI to update.
      :type uri: str
      :param obtained_on: The date and time when the content was obtained.
      :type obtained_on: datetime.datetime
      :param last_accessed: The date and time when the content was last accessed.
      :type last_accessed: datetime.datetime


   .. py:method:: log_access(uri: str) -> None

      Saves the last accessed time and updates the accessed tracker for a given URI.

      :param uri: The URI to update.
      :type uri: str


   .. py:method:: get_unaccessed_content() -> list[str]

      Returns a list of URIs that have not been accessed by the agent.

      :returns: A list of URIs that have not been accessed by the agent.
      :rtype: list[str]
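
      How ``update_cache``, ``log_access``, ``in_cache``, and ``get_unaccessed_content`` could fit together is sketched below with an in-memory stand-in. The record layout (``obtained_on``, ``last_accessed``, and an ``accessed`` flag) and the class name are assumptions made for illustration; the actual cache format may differ.

      .. code-block:: python

         from datetime import datetime


         class AccessTrackerSketch:
             """In-memory stand-in for the cache's access bookkeeping."""

             def __init__(self):
                 self.cache = {}

             def update_cache(self, uri, obtained_on, last_accessed):
                 self.cache[uri] = {
                     "obtained_on": obtained_on.isoformat(),
                     "last_accessed": last_accessed.isoformat(),
                     "accessed": False,  # assumed flag: not yet read by an agent
                 }

             def log_access(self, uri):
                 self.cache[uri]["last_accessed"] = datetime.now().isoformat()
                 self.cache[uri]["accessed"] = True

             def in_cache(self, uri):
                 return uri in self.cache

             def get_unaccessed_content(self):
                 return [uri for uri, rec in self.cache.items() if not rec["accessed"]]


         tracker = AccessTrackerSketch()
         now = datetime.now()
         tracker.update_cache("https://example.com/a", now, now)
         tracker.update_cache("https://example.com/b", now, now)
         tracker.log_access("https://example.com/a")
         unaccessed = tracker.get_unaccessed_content()

      After these calls, only the second URI remains in ``unaccessed``, since the first has been logged as accessed.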


   .. py:method:: force_empty(uri: str) -> None

      Saves an empty file for a given URI. Used when fetching the page fails with an error.

      :param uri: The URI to empty the cache for.
      :type uri: str


   .. py:method:: get(uri: str) -> str

      Returns the content for a given URI. If the content is not in the cache, it will be scraped and added to the cache.

      :param uri: The URI to get the content for.
      :type uri: str

      :returns: The content for the given URI.
      :rtype: str


   .. py:method:: add_content(content: str, uri: str = None) -> None

      Adds content to the cache.

      :param content: The content to add to the cache.
      :type content: str
      :param uri: The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
      :type uri: str, optional
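
      The default-key behavior described for ``uri`` can be sketched as below. ``default_uri_sketch`` is a hypothetical helper, and the UTF-8 encoding and hexadecimal digest are assumptions rather than confirmed details of the method.

      .. code-block:: python

         import hashlib


         def default_uri_sketch(content: str, uri: str = None) -> str:
             """Return the given URI, or an MD5 digest of the content if none is given."""
             if uri is None:
                 # Assumption: content is UTF-8 encoded and the hex digest is used.
                 return hashlib.md5(content.encode("utf-8")).hexdigest()
             return uri


         explicit = default_uri_sketch("Some report text.", uri="reports/q1")
         derived = default_uri_sketch("Some report text.")

      Deriving the key from the content means that adding identical text twice without a URI resolves to the same key.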


   .. py:method:: add_content_from_file(filepath: str, uri: str = None) -> None

      Adds content from a text file to the cache.

      :param filepath: The path to the file to add to the cache.
      :type filepath: str
      :param uri: The URI to use for the content. Defaults to None, in which case an MD5 sum of the content will be used.
      :type uri: str, optional