emergingtrajectories.crawlers

Crawlers provide a standardized approach to interacting with web pages and extracting information. We provide crawlers based on PhaseLLM (Python requests) and on Playwright (running headlessly or with a visible browser front-end) to enable flexible scraping.

All scraping agents return the raw HTML content and the extracted text content.
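Because every crawler returns the same pair, calling code can be written against that shape rather than against one specific class. The sketch below illustrates the idea; `Crawler`, `StaticCrawler`, and `summarize` are hypothetical names introduced for this example and are not part of the module.

```python
from typing import Protocol


class Crawler(Protocol):
    """Hypothetical protocol describing the shape shared by the crawlers."""

    def get_content(self, url: str) -> tuple[str, str]:
        ...


class StaticCrawler:
    """Hypothetical stand-in serving canned pages, so no network is needed."""

    def __init__(self, pages: dict[str, tuple[str, str]]) -> None:
        self.pages = pages

    def get_content(self, url: str) -> tuple[str, str]:
        # Same order as the documented crawlers: raw HTML first, text second.
        return self.pages[url]


def summarize(crawler: Crawler, url: str) -> dict[str, int]:
    """Works with any object exposing get_content(url) -> (html, text)."""
    raw_html, text = crawler.get_content(url)
    return {"html_chars": len(raw_html), "text_chars": len(text)}


pages = {"https://example.com": ("<p>Hello</p>", "Hello")}
summary = summarize(StaticCrawler(pages), "https://example.com")
```

Under this assumption, swapping `StaticCrawler` for one of the crawlers documented below should require no change to `summarize`.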

Classes

`crawlerPlaywright`	Crawler that uses Playwright to scrape web pages.
`crawlerPhaseLLM`	PhaseLLM scraper. Uses Python requests and does not execute JS.
`crawlerScrapingBee`	Crawler that uses ScrapingBee to scrape web pages.

Functions

`_bs4_childtraversal`(→ str)	Recursively traverse the DOM to extract content.
`_get_text_bs4`(→ str)	Extract text content from HTML using BeautifulSoup.

Module Contents

emergingtrajectories.crawlers._bs4_childtraversal(html: str) → str

Recursively traverse the DOM to extract content.

Parameters: html (str) – HTML content
Returns: Extracted content
Return type: str
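The implementation of `_bs4_childtraversal` is not shown here, so the following is only a sketch of the general technique: walk the tree depth-first and collect the text of each node. To keep it dependency-free it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, which means it only accepts well-formed markup. The names `collect_text`, `traverse`, and `SKIPPED_TAGS` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: script and style bodies are not useful as extracted content.
SKIPPED_TAGS = {"script", "style"}


def collect_text(node: ET.Element) -> list[str]:
    """Depth-first walk that gathers non-empty text fragments in document order."""
    if node.tag in SKIPPED_TAGS:
        return []
    parts: list[str] = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())
    for child in node:
        # Recurse into the child, then pick up any text that follows it.
        parts.extend(collect_text(child))
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


def traverse(html: str) -> str:
    """Parse well-formed markup and join the collected fragments with newlines."""
    root = ET.fromstring(html)
    return "\n".join(collect_text(root))


sample = (
    "<div><h1>Title</h1><script>var x = 1;</script>"
    "<p>Some <b>bold</b> text.</p></div>"
)
extracted = traverse(sample)
```

Real-world HTML is rarely well-formed XML, which is one reason a tolerant parser such as BeautifulSoup is the more practical choice for the actual function.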

emergingtrajectories.crawlers._get_text_bs4(html: str) → str

Extract text content from HTML using BeautifulSoup.

Parameters: html (str) – HTML content
Returns: Extracted text content
Return type: str
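To illustrate what extracting text content from HTML involves, here is a minimal stand-in built on the standard library's `html.parser` rather than BeautifulSoup. It is not the module's implementation; `TextExtractor` and `get_text` are hypothetical names, and skipping `script` and `style` bodies is an assumption about what counts as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def get_text(html: str) -> str:
    """Return the text fragments of an HTML document joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello<br>world</p></body></html>"
)
text = get_text(page)
```

Unlike an XML parser, `html.parser` tolerates unclosed tags such as `<br>`, which is closer to how a BeautifulSoup-based extractor behaves on real pages.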

class emergingtrajectories.crawlers.crawlerPlaywright(headless: bool = True)

Crawler that uses Playwright to scrape web pages.

Parameters: headless (bool, optional) – Run the browser in headless mode. Defaults to True.

headless = True

get_content(url: str) → tuple[str, str]

Gets content for a specific URL.

Parameters: url (str) – URL to scrape
Returns: Raw HTML content and extracted text content (in this order)
Return type: tuple[str, str]

class emergingtrajectories.crawlers.crawlerPhaseLLM

PhaseLLM scraper. Uses Python requests and does not execute JS.

scraper

get_content(url)

Gets content for a specific URL.

Parameters: url (str) – URL to scrape
Returns: Raw HTML content and extracted text content (in this order)
Return type: tuple[str, str]

class emergingtrajectories.crawlers.crawlerScrapingBee(api_key: str)

Crawler that uses ScrapingBee to scrape web pages.

client

get_content(url)

Gets content for a specific URL.

Parameters: url (str) – URL to scrape
Returns: Raw HTML content and extracted text content (in this order)
Return type: tuple[str, str]
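Since the three classes share `get_content(url)`, choosing between them can be isolated in one place. The helper below is a sketch under that assumption: `make_crawler` and its `kind` strings are hypothetical and not part of the module, while the constructor arguments follow the signatures documented above. The package is imported lazily so that argument validation works even where it is not installed.

```python
from typing import Optional

KNOWN_KINDS = ("playwright", "phasellm", "scrapingbee")  # hypothetical labels


def make_crawler(kind: str, api_key: Optional[str] = None, headless: bool = True):
    """Hypothetical factory returning one of the documented crawlers."""
    if kind not in KNOWN_KINDS:
        raise ValueError(f"Unknown crawler kind: {kind!r}")
    if kind == "scrapingbee" and not api_key:
        raise ValueError("crawlerScrapingBee requires an API key")

    # Lazy import: only needed once the arguments have been validated.
    from emergingtrajectories import crawlers

    if kind == "playwright":
        # Renders pages in a browser, so JavaScript-driven content is available.
        return crawlers.crawlerPlaywright(headless=headless)
    if kind == "phasellm":
        # Plain HTTP requests; JavaScript is not executed.
        return crawlers.crawlerPhaseLLM()
    return crawlers.crawlerScrapingBee(api_key=api_key)
```

With the package and its dependencies installed, usage would look like `raw_html, text = make_crawler("phasellm").get_content("https://example.com")`, with the returned pair in the order documented above.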