emergingtrajectories.news
=========================

.. py:module:: emergingtrajectories.news


Classes
-------

.. autoapisummary::

   emergingtrajectories.news.RSSAgent
   emergingtrajectories.news.NewsBingAgent
   emergingtrajectories.news.NewsAPIAgent
   emergingtrajectories.news.FinancialTimesAgent


Functions
---------

.. autoapisummary::

   emergingtrajectories.news.force_empty_content


Module Contents
---------------

.. py:function:: force_empty_content(rss_url: str, content, cache_function) -> None

   Force the crawler to visit every URL in the RSS feed and save it as a blank content file. We do this because some RSS feeds contain many old URLs that do not need crawling; we only want to crawl the delta over some period.

   :param rss_url: The URL of the RSS feed.
   :type rss_url: str
   :param content: The content string to save.
   :param cache_function: The function called to save ``content`` for a URL.


.. py:class:: RSSAgent(rss_url, crawler=None)

   A simple wrapper for an RSS feed, so we can query it for URLs.

   :param rss_url: The URL of the RSS feed.
   :type rss_url: str
   :param crawler: The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.
   :type crawler: Crawler, optional


   .. py:attribute:: rss_url


   .. py:method:: get_news_as_list() -> list

      Query the RSS feed for news articles, and return them as a list of dictionaries.

      :returns: A list of URLs.
      :rtype: list


.. py:class:: NewsBingAgent(api_key: str, endpoint: str)

   Creates a new Bing News API agent. To learn more, see: https://github.com/microsoft/bing-search-sdk-for-python/

   :param api_key: The Bing News API key.
   :type api_key: str
   :param endpoint: The Bing News API endpoint.
   :type endpoint: str


   .. py:attribute:: api_key


   .. py:attribute:: endpoint


   .. py:method:: get_news_as_list(query: str, market: str = 'en-us') -> list

      Gets a list of URLs from the Bing News API.

      For more information on markets, see: https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/market-codes

      :param query: The query to search for.
      :type query: str
      :param market: The market to search in. Defaults to "en-us" (US English).
      :type market: str, optional

      :returns: A list of URLs.
      :rtype: list


.. py:class:: NewsAPIAgent(api_key, top_headlines=False, crawler=None)

   A simple wrapper for the News API, so we can query it for URLs.

   :param api_key: The News API key.
   :type api_key: str
   :param top_headlines: Whether to get top headlines. Defaults to False.
   :type top_headlines: bool, optional
   :param crawler: The crawler to use. Defaults to None, in which case we will use crawlerPlaywright in headless mode.
   :type crawler: Crawler, optional


   .. py:attribute:: api_key


   .. py:attribute:: top_headlines
      :value: False


   .. py:method:: get_news_as_list(query: str) -> list

      Query the News API for news articles, and return them as a list of dictionaries.

      :param query: The query to search for.
      :type query: str

      :returns: A list of dictionaries, where each dictionary represents a news article.
      :rtype: list


.. py:class:: FinancialTimesAgent(user_email, user_password)

   This is a POC agent that uses Playwright to crawl the Financial Times articles you are interested in. Note that you *NEED* to be a subscriber to the FT to make this work, and thus need to provide your FT user name and password.

   :param user_email: Your FT email.
   :type user_email: str
   :param user_password: Your FT password.
   :type user_password: str


   .. py:attribute:: ft_rss_feed_urls
      :value: ['https://www.ft.com/rss/home', 'https://www.ft.com/world?format=rss',...


   .. py:attribute:: ft_login_url
      :value: 'https://ft.com/login'


   .. py:attribute:: ft_main_url
      :value: 'https://ft.com/'


   .. py:attribute:: user_email


   .. py:attribute:: user_password


   .. py:method:: get_news(urls: list[str] = None) -> list

      Get the news from the Financial Times as a list of tuples, where each tuple contains the URL and the extracted text content.

      :param urls: A list of URLs to get content for.

      :returns: A list of lists: URLs, HTML, and text content.
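
The delta-crawl idea described for ``force_empty_content`` can be illustrated with a minimal, self-contained sketch. It does not call this module: ``seed_blank_content``, ``urls_to_crawl``, ``save_to_cache``, the in-memory ``cache`` and the ``example.com`` URLs are all hypothetical stand-ins, and the two-argument ``(url, content)`` shape of the cache function is an assumption rather than the documented signature. The sketch also takes a ready-made list of URLs instead of fetching an RSS feed, so it runs without network access.

```python
from typing import Callable, Dict, Iterable, List


def seed_blank_content(
    urls: Iterable[str],
    content: str,
    cache_function: Callable[[str, str], None],
) -> None:
    """Record `content` for every URL so a later crawl treats it as already seen."""
    for url in urls:
        cache_function(url, content)


def urls_to_crawl(urls: Iterable[str], cache: Dict[str, str]) -> List[str]:
    """Return only the URLs with no cache entry yet (the delta)."""
    return [url for url in urls if url not in cache]


# Hypothetical in-memory cache standing in for whatever store a crawler uses.
cache: Dict[str, str] = {}


def save_to_cache(url: str, content: str) -> None:
    """Assumed cache function shape: store `content` under `url`."""
    cache[url] = content


# Hypothetical feed snapshots: the feed today, and the feed after one new post.
old_feed = ["https://example.com/a", "https://example.com/b"]
new_feed = old_feed + ["https://example.com/c"]

# Mark everything currently in the feed as seen, with blank content.
seed_blank_content(old_feed, "", save_to_cache)

# Only the article that appeared after seeding still needs crawling.
delta = urls_to_crawl(new_feed, cache)
print(delta)
```

Seeding with blank content trades away the text of older articles in exchange for skipping their crawl entirely, which matches the stated goal of crawling only what is new over some period.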