emergingtrajectories.crawlers
=============================

.. py:module:: emergingtrajectories.crawlers

.. autoapi-nested-parse::

   Crawlers provide a standardized approach to interacting with web pages and extracting information. The module includes crawlers based on PhaseLLM (Python requests), Playwright (run headlessly or with a visible front-end), and ScrapingBee to enable flexible scraping.

   Every crawler's ``get_content`` method returns the raw HTML content and the extracted text content, in that order.


Classes
-------

.. autoapisummary::

   emergingtrajectories.crawlers.crawlerPlaywright
   emergingtrajectories.crawlers.crawlerPhaseLLM
   emergingtrajectories.crawlers.crawlerScrapingBee


Functions
---------

.. autoapisummary::

   emergingtrajectories.crawlers._bs4_childtraversal
   emergingtrajectories.crawlers._get_text_bs4


Module Contents
---------------

.. py:function:: _bs4_childtraversal(html: str) -> str

   Recursively traverse the DOM to extract content.

   :param html: HTML content
   :type html: str

   :returns: Extracted content
   :rtype: str


.. py:function:: _get_text_bs4(html: str) -> str

   Extract text content from HTML using BeautifulSoup.

   :param html: HTML content
   :type html: str

   :returns: Extracted text content
   :rtype: str


.. py:class:: crawlerPlaywright(headless: bool = True)

   Crawler that uses Playwright to scrape web pages.

   :param headless: Run the browser in headless mode. Defaults to True.
   :type headless: bool, optional


   .. py:attribute:: headless
      :value: True


   .. py:method:: get_content(url: str) -> tuple[str, str]

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]


.. py:class:: crawlerPhaseLLM

   PhaseLLM scraper. Uses Python requests and does not execute JavaScript.


   .. py:attribute:: scraper


   .. py:method:: get_content(url)

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]


.. py:class:: crawlerScrapingBee(api_key: str)

   Crawler that uses ScrapingBee to scrape web pages.


   .. py:attribute:: client


   .. py:method:: get_content(url)

      Gets content for a specific URL.

      :param url: URL to scrape
      :type url: str

      :returns: Raw HTML content and extracted text content (in this order)
      :rtype: tuple[str, str]
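
Because all three crawler classes document the same ``get_content(url)`` contract (raw HTML first, extracted text second), calling code can be written against that contract rather than against a specific crawler. The following is a minimal sketch of that idea, not the package's implementation: ``Crawler``, ``StaticCrawler``, ``extract_text``, and ``collect_text`` are hypothetical names introduced for illustration, and text extraction uses the standard library's ``html.parser`` in place of the BeautifulSoup-based helpers so the example runs without extra dependencies.

```python
from __future__ import annotations

from html.parser import HTMLParser
from typing import Protocol


class Crawler(Protocol):
    """Hypothetical protocol for the documented get_content contract."""

    def get_content(self, url: str) -> tuple[str, str]:
        """Return (raw_html, extracted_text), in that order."""
        ...


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Stand-in for a text extractor (assumption: stdlib, not BeautifulSoup)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


class StaticCrawler:
    """Hypothetical crawler that serves canned HTML instead of fetching it."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages

    def get_content(self, url: str) -> tuple[str, str]:
        html = self.pages[url]
        return html, extract_text(html)


def collect_text(crawler: Crawler, urls: list[str]) -> dict[str, str]:
    """Fetch each URL with any conforming crawler and keep the text only."""
    results: dict[str, str] = {}
    for url in urls:
        _raw_html, text = crawler.get_content(url)
        results[url] = text
    return results
```

Assuming the package and its dependencies are installed, an instance such as ``crawlerPlaywright(headless=True)`` or ``crawlerPhaseLLM()`` could be passed to ``collect_text`` in place of ``StaticCrawler``, since each documents the same return order.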