emergingtrajectories.crawlers
Crawlers provide a standardized approach to interacting with web pages and extracting information. The module includes a crawler based on PhaseLLM (Python requests), one using Playwright (headless or with a visible browser window), and one using ScrapingBee, to enable flexible scraping.
Every crawler's get_content(url) method returns a tuple of the raw HTML content and the extracted text content, in that order.
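Because every crawler exposes the same get_content(url) method, calling code can be written against that interface rather than against a specific class. The following is a minimal sketch of that idea, not code from the package: `SupportsGetContent`, `StubCrawler`, and `summarize` are hypothetical names introduced here, and the stub stands in for a real crawler so the example runs without network access.

```python
from typing import Protocol


class SupportsGetContent(Protocol):
    """Structural type for anything with the documented get_content method."""

    def get_content(self, url: str) -> tuple[str, str]: ...


class StubCrawler:
    """Hypothetical stand-in for a real crawler; performs no network I/O."""

    def get_content(self, url: str) -> tuple[str, str]:
        text = f"Page at {url}"
        html = f"<html><body><p>{text}</p></body></html>"
        return html, text  # raw HTML first, extracted text second


def summarize(crawler: SupportsGetContent, url: str) -> dict:
    """Fetch a page through any crawler and report basic size statistics."""
    html, text = crawler.get_content(url)
    return {
        "url": url,
        "html_chars": len(html),
        "text_chars": len(text),
        "preview": text[:80],
    }


if __name__ == "__main__":
    print(summarize(StubCrawler(), "https://example.com"))
```

Any of the crawler classes documented below could presumably be passed in place of `StubCrawler`, since they document the same return shape.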
Classes
crawlerPlaywright | Crawler that uses Playwright to scrape web pages.
crawlerPhaseLLM | PhaseLLM scraper. Uses Python requests and does not execute JS.
crawlerScrapingBee | Crawler that uses ScrapingBee to scrape web pages.
Functions
_bs4_childtraversal | Recursively traverse the DOM to extract content.
_get_text_bs4 | Extract text content from HTML using BeautifulSoup.
Module Contents
- emergingtrajectories.crawlers._bs4_childtraversal(html: str) -> str
Recursively traverse the DOM to extract content.
- Parameters:
html (str) – HTML content
- Returns:
Extracted content
- Return type:
str
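The package's actual traversal is not shown in this reference, so the following is only an illustrative sketch of recursive DOM traversal in general. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party dependencies. `Node`, `TreeBuilder`, `traverse`, and `extract_content` are hypothetical names, and skipping script and style elements is an assumption about what "content" should exclude.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # assumed non-content elements
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node: a tag name plus child nodes and text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node instances and str


class TreeBuilder(HTMLParser):
    """Builds a small tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def traverse(node):
    """Recursively collect non-empty text beneath node, in document order."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            text = child.strip()
            if text:
                parts.append(text)
        elif child.tag not in SKIP_TAGS:
            parts.extend(traverse(child))
    return parts


def extract_content(html: str) -> str:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join(traverse(builder.root))
```

This sketch does not repair unclosed tags the way a full HTML parser does, so malformed markup may nest differently than in a browser.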
- emergingtrajectories.crawlers._get_text_bs4(html: str) -> str
Extract text content from HTML using BeautifulSoup.
- Parameters:
html (str) – HTML content
- Returns:
Extracted text content
- Return type:
str
- class emergingtrajectories.crawlers.crawlerPlaywright(headless: bool = True)
Crawler that uses Playwright to scrape web pages.
- Parameters:
headless (bool, optional) – Run the browser in headless mode. Defaults to True.
- headless = True
- get_content(url: str) -> tuple[str, str]
Gets content for a specific URL.
- Parameters:
url (str) – URL to scrape
- Returns:
Raw HTML content and extracted text content (in this order)
- Return type:
tuple[str, str]
- class emergingtrajectories.crawlers.crawlerPhaseLLM
PhaseLLM scraper. Uses Python requests and does not execute JS.
- scraper
- get_content(url)
Gets content for a specific URL.
- Parameters:
url (str) – URL to scrape
- Returns:
Raw HTML content and extracted text content (in this order)
- Return type:
tuple[str, str]
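Since this crawler does not execute JavaScript, pages that render their content client-side may yield little extracted text. One possible way to handle that, sketched here under assumptions, is to try the requests-based crawler first and fall back to a browser-based one only when the text looks too short. `get_content_with_fallback`, `FixedCrawler`, and the `min_text_chars` threshold of 200 are hypothetical and not part of the package; the stub class exists only so the sketch runs offline.

```python
def get_content_with_fallback(primary, fallback, url, min_text_chars=200):
    """Return (html, text, source), using fallback if primary text is short.

    Both crawlers are expected to expose get_content(url) -> (html, text).
    """
    html, text = primary.get_content(url)
    if len(text.strip()) >= min_text_chars:
        return html, text, "primary"
    # Too little text: assume the page needs JS rendering and retry.
    html, text = fallback.get_content(url)
    return html, text, "fallback"


class FixedCrawler:
    """Hypothetical stub that always returns the same (html, text) pair."""

    def __init__(self, html, text):
        self._result = (html, text)

    def get_content(self, url):
        return self._result


if __name__ == "__main__":
    no_js = FixedCrawler("<div id='app'></div>", "")
    with_js = FixedCrawler("<div id='app'><p>Rendered</p></div>", "Rendered")
    print(get_content_with_fallback(no_js, with_js, "https://example.com"))
```

In real use, `primary` might be a crawlerPhaseLLM instance and `fallback` a crawlerPlaywright instance; the threshold would need tuning per site.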
- class emergingtrajectories.crawlers.crawlerScrapingBee(api_key: str)
Crawler that uses ScrapingBee to scrape web pages.
- client
- get_content(url)
Gets content for a specific URL.
- Parameters:
url (str) – URL to scrape
- Returns:
Raw HTML content and extracted text content (in this order)
- Return type:
tuple[str, str]
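The three classes differ only in how they are constructed, so a small factory can hide that difference from calling code. The sketch below relies only on the constructor signatures documented above. `make_crawler` and its `kind` strings are hypothetical, and the imports are done lazily inside the function so the sketch can be loaded even where the package is not installed.

```python
def make_crawler(kind: str, **kwargs):
    """Build one of the documented crawlers from a short name.

    kind: "playwright", "phasellm", or "scrapingbee" (names chosen here).
    kwargs: headless (bool) for Playwright, api_key (str) for ScrapingBee.
    """
    if kind == "playwright":
        from emergingtrajectories.crawlers import crawlerPlaywright
        return crawlerPlaywright(headless=kwargs.get("headless", True))
    if kind == "phasellm":
        from emergingtrajectories.crawlers import crawlerPhaseLLM
        return crawlerPhaseLLM()
    if kind == "scrapingbee":
        api_key = kwargs.get("api_key")
        if not api_key:
            raise ValueError("scrapingbee crawler requires api_key")
        from emergingtrajectories.crawlers import crawlerScrapingBee
        return crawlerScrapingBee(api_key)
    raise ValueError(f"unknown crawler kind: {kind!r}")


def fetch(kind: str, url: str, **kwargs):
    """Convenience wrapper: returns (html, text) in the documented order."""
    return make_crawler(kind, **kwargs).get_content(url)
```

A call such as `fetch("playwright", "https://example.com", headless=False)` would open a visible browser window, assuming Playwright and its browsers are installed.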