emergingtrajectories.crawlers
=============================

.. py:module:: emergingtrajectories.crawlers

.. autoapi-nested-parse::

   Crawlers provide a standardized way to interact with web pages and extract information. Some crawlers are based on PhaseLLM (Python requests), others use Playwright (headless or with a visible browser) to enable flexible scraping, and one uses ScrapingBee.

   Every crawler's ``get_content`` method returns the raw HTML content and the extracted text content, in that order.
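
   Because each crawler exposes the same ``get_content`` method, calling code can treat crawlers interchangeably. The sketch below illustrates that contract only; ``SupportsGetContent``, ``StubCrawler``, and ``summarize`` are hypothetical names that are not part of this module, and the stub returns canned data instead of fetching anything.

   .. code-block:: python

      from typing import Protocol


      class SupportsGetContent(Protocol):
          """Hypothetical protocol describing the shared crawler interface."""

          def get_content(self, url: str) -> tuple[str, str]:
              ...


      class StubCrawler:
          """Hypothetical stand-in that returns canned data without network I/O."""

          def get_content(self, url: str) -> tuple[str, str]:
              text = f"Content of {url}"
              html = f"<html><body><p>{text}</p></body></html>"
              return html, text  # raw HTML first, extracted text second


      def summarize(crawler: SupportsGetContent, url: str) -> str:
          """Unpack the (html, text) pair and report the size of each part."""
          html, text = crawler.get_content(url)
          return f"{len(html)} chars of HTML, {len(text)} chars of text"


      summary = summarize(StubCrawler(), "https://example.com")

   Under this assumption, any of the crawler classes documented here could be passed to ``summarize`` in place of the stub.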


Classes
-------

.. autoapisummary::

   emergingtrajectories.crawlers.crawlerPlaywright
   emergingtrajectories.crawlers.crawlerPhaseLLM
   emergingtrajectories.crawlers.crawlerScrapingBee


Functions
---------

.. autoapisummary::

   emergingtrajectories.crawlers._bs4_childtraversal
   emergingtrajectories.crawlers._get_text_bs4


Module Contents
---------------

.. py:function:: _bs4_childtraversal(html: str) -> str

   Recursively traverse the DOM to extract content.

   :param html: HTML content
   :type html: str

   :returns: Extracted content
   :rtype: str
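
   As a rough illustration of recursive DOM traversal with BeautifulSoup, the sketch below walks each node's children and collects non-empty text. It is not the module's implementation: ``collect_text``, ``traverse``, the set of skipped tags, and the newline separator are all assumptions made for the example.

   .. code-block:: python

      from bs4 import BeautifulSoup
      from bs4.element import Comment, Doctype, NavigableString, Tag

      # Assumption: script and style content is not wanted in the output.
      SKIPPED_TAGS = {"script", "style"}


      def collect_text(node: Tag) -> list[str]:
          """Depth-first walk that gathers stripped text from a node's subtree."""
          parts = []
          for child in node.children:
              if isinstance(child, (Comment, Doctype)):
                  continue  # comments and doctypes are not page content
              if isinstance(child, NavigableString):
                  stripped = child.strip()
                  if stripped:
                      parts.append(stripped)
              elif isinstance(child, Tag) and child.name not in SKIPPED_TAGS:
                  parts.extend(collect_text(child))  # recurse into the element
          return parts


      def traverse(html: str) -> str:
          """Parse the HTML and join the collected text, one fragment per line."""
          soup = BeautifulSoup(html, "html.parser")
          return "\n".join(collect_text(soup))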


.. py:function:: _get_text_bs4(html: str) -> str

   Extract text content from HTML using BeautifulSoup.

   :param html: HTML content
   :type html: str

   :returns: Extracted text content
   :rtype: str
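
   For comparison, BeautifulSoup's built-in ``get_text`` can do the extraction in one call. The sketch below is an assumption about how such a helper might look, not the module's code; ``extract_text`` is a hypothetical name, and dropping ``script`` and ``style`` elements first is a choice made for the example.

   .. code-block:: python

      from bs4 import BeautifulSoup


      def extract_text(html: str) -> str:
          """Return the visible text of an HTML document as a single string."""
          soup = BeautifulSoup(html, "html.parser")
          # Remove elements whose content is code or styling, not page text.
          for tag in soup(["script", "style"]):
              tag.decompose()
          # strip=True trims each text fragment and skips empty ones.
          return soup.get_text(separator=" ", strip=True)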


.. py:class:: crawlerPlaywright(headless: bool = True)

   Crawler that uses Playwright to scrape web pages.

   :param headless: Run the browser in headless mode. Defaults to True.
   :type headless: bool, optional


   .. py:attribute:: headless
      :value: True


   .. py:method:: get_content(url: str) -> tuple[str, str]

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]
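
      To show what fetching a rendered page with Playwright can involve, here is a minimal sketch using Playwright's synchronous API. It is not this class's implementation: ``fetch_rendered_html`` is a hypothetical helper, and the choice of Chromium is an assumption. Running it requires Playwright and its browser binaries to be installed.

      .. code-block:: python

         def fetch_rendered_html(url: str, headless: bool = True) -> str:
             """Load a page in a browser and return its HTML after rendering."""
             # Imported lazily so this sketch can be loaded without Playwright.
             from playwright.sync_api import sync_playwright

             with sync_playwright() as p:
                 # Assumption: Chromium; Firefox and WebKit are also available.
                 browser = p.chromium.launch(headless=headless)
                 try:
                     page = browser.new_page()
                     page.goto(url)
                     return page.content()  # full HTML of the loaded page
                 finally:
                     browser.close()

      Passing ``headless=False`` opens a visible browser window, which can help when debugging a page that behaves differently under automation.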


.. py:class:: crawlerPhaseLLM

   PhaseLLM scraper. Uses Python requests and does not execute JS.


   .. py:attribute:: scraper


   .. py:method:: get_content(url)

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]
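
      Since no JavaScript runs, content that a page injects client-side will be absent from the result. The offline sketch below illustrates this with the standard library's ``html.parser`` rather than PhaseLLM or requests; ``VisibleTextParser`` and ``STATIC_HTML`` are hypothetical names, and the markup stands in for what a plain HTTP fetch might return.

      .. code-block:: python

         from html.parser import HTMLParser


         class VisibleTextParser(HTMLParser):
             """Collects text outside <script> and <style>; executes nothing."""

             def __init__(self):
                 super().__init__()
                 self.parts = []
                 self._skip_depth = 0

             def handle_starttag(self, tag, attrs):
                 if tag in ("script", "style"):
                     self._skip_depth += 1

             def handle_endtag(self, tag):
                 if tag in ("script", "style") and self._skip_depth:
                     self._skip_depth -= 1

             def handle_data(self, data):
                 if not self._skip_depth and data.strip():
                     self.parts.append(data.strip())


         # The script would fill #app in a browser, but it never runs here.
         STATIC_HTML = """
         <html><body>
           <h1>Prices</h1>
           <div id="app"></div>
           <script>document.getElementById("app").textContent = "Loaded by JS";</script>
         </body></html>
         """

         parser = VisibleTextParser()
         parser.feed(STATIC_HTML)
         static_text = " ".join(parser.parts)

      Here ``static_text`` contains only the heading, because the ``div`` is still empty in the served markup. For pages like this, a browser-based crawler is the better fit.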


.. py:class:: crawlerScrapingBee(api_key: str)

   Crawler that uses ScrapingBee to scrape web pages.

   :param api_key: ScrapingBee API key
   :type api_key: str


   .. py:attribute:: client


   .. py:method:: get_content(url)

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]