Newspaper API
Function calls
- newspaper.article(url: str, language: str | None = None, **kwargs) → Article
Shortcut function to fetch and parse a newspaper article from a URL.
- Parameters:
url (str) – The URL of the article to download and parse.
language (str) – The language of the article to download and parse.
input_html (str) – The HTML of the article to parse. This is used for pre-downloaded articles. If this is set, then there will be no download requests made.
kwargs – Any other keyword arguments to pass to the Article constructor.
- Returns:
The article downloaded and parsed.
- Return type:
Article
- Raises:
ArticleException – If the article could not be downloaded or parsed.
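A minimal sketch of the shortcut with basic error handling. The helper names build_article_kwargs and fetch_article are made up for illustration; only newspaper.article, its language and input_html arguments, and ArticleException come from the reference above. The library import sits inside the function so the helpers can be loaded without the package or network access.

```python
def build_article_kwargs(html=None, language=None):
    """Collect only the optional arguments the caller actually supplied."""
    kwargs = {}
    if language is not None:
        kwargs["language"] = language
    if html is not None:
        # With input_html set, no download request is made.
        kwargs["input_html"] = html
    return kwargs


def fetch_article(url, html=None, language=None):
    """Return a downloaded and parsed Article, or None on failure."""
    import newspaper  # imported lazily (illustrative choice, not required)

    try:
        return newspaper.article(url, **build_article_kwargs(html, language))
    except newspaper.ArticleException:
        # Raised when the article could not be downloaded or parsed.
        return None
```

Calling fetch_article("https://www.cnn.com/some-article", language="en") would download and parse in one step, while passing html=cached_page parses pre-downloaded HTML instead; the URL here is a placeholder.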
- newspaper.build(url='', dry=False, only_homepage=False, only_in_path=False, input_html=None, config=None, **kwargs) → Source
Returns a constructed Source object without downloading or parsing the articles.
- Parameters:
url (str) – The url of the source (news website) to build. For example, https://www.cnn.com.
dry (bool) – If true, the source object will be constructed but not downloaded or parsed.
only_homepage (bool) – If true, the source object will only parse the homepage of the source.
only_in_path (bool) – If true, the source object will only parse the articles that are in the same path as the source’s homepage. You can scrape a specific category this way. Defaults to False.
input_html (str) – The HTML of the source to parse. Use this to pass cached HTML to the source object.
config (Configuration) – A configuration object to use for the source.
kwargs – Any other keyword arguments to pass to the Source constructor. If you omit the config object, you can add any configuration options here.
- Returns:
The constructed Source object.
- Return type:
Source
- newspaper.mthreading.fetch_news(news_list: list[str | Article | Source], threads: int = 5) → list[Article | Source]
Fetch news from a list of sources, articles, or both. Threads are allocated to download and parse the sources or articles. If URLs are passed in the list, a new Article object is created for each one, then downloaded and parsed. No NLP is performed on the articles. If the language of a URL cannot be detected correctly, instantiate the Article object yourself with the language parameter and pass it in.
- Parameters:
news_list (list[Union[str, Article, Source]]) – List of sources, articles, urls or a mix of them.
threads (int) – Number of threads to use for fetching. This affects how many items from the news_list are fetched at once. To control how many threads are used inside a Source object, use the Configuration.number_threads setting. Together, these can result in a high number of threads: the maximum is threads * Configuration.number_threads.
- Returns:
List of articles or sources.
- Return type:
list[Article | Source]
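The thread arithmetic described for fetch_news can be sketched as follows. The helper names max_total_threads and fetch_all are hypothetical; the formula is the one stated in the threads parameter description, and the import is placed inside the function so the arithmetic can be tried without the library.

```python
def max_total_threads(threads, number_threads):
    """Upper bound on threads when news_list contains Source objects.

    threads is the fetch_news argument; number_threads is the
    Configuration.number_threads value used inside each Source.
    """
    return threads * number_threads


def fetch_all(news_list, threads=5):
    """Fetch a mixed list of URLs, Article objects and Source objects."""
    from newspaper.mthreading import fetch_news  # imported lazily

    # URLs are turned into Article objects, then downloaded and parsed.
    # No NLP step is run on the results.
    return fetch_news(news_list, threads=threads)
```

For instance, max_total_threads(5, 10) evaluates to 50, which is worth keeping in mind before raising both settings at once.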
- newspaper.hot()
Returns a list of hot (trending) terms via Google Trends.
- newspaper.languages()
Prints a list of the supported languages
Configuration
- class newspaper.configuration.Configuration
Modifies Article / Source properties.
- min_word_count
minimum number of word tokens in an article text. When building a list of articles for a Source (using parse_articles), any article with fewer words than this will be ignored. Default 300.
- Type:
- min_sent_count
minimum number of sentences in an article text. When building a list of articles for a Source (using parse_articles), any article with fewer sentences than this will be ignored. Default 7.
- Type:
- max_title
Maximum number of characters in Article.title. The title is truncated to this length.
- Type:
- max_text
Maximum number of characters in Article.text. The text is truncated to this length.
- Type:
- max_keywords
maximum number of keywords inferred by Article.nlp()
- Type:
- max_authors
maximum number of authors returned in Article.authors
- Type:
- max_summary
maximum number of characters in Article.summary; the summary is truncated to this length
- Type:
- max_summary_sent
maximum number of sentences in Article.summary
- Type:
- top_image_settings
settings for finding top image. You can set the following:
min_width: minimum width of image (default 300) in order to be considered top image
min_height: minimum height of image (default 200) in order to be considered top image
min_area: minimum area of image (default 10000) in order to be considered top image
max_retries: maximum number of retries to download the image (default 2)
- Type:
- memorize_articles
If True, it will remember the URLs of already parsed articles between different Source.generate_articles() runs. The article contents themselves are NOT cached; only the parsed article URLs are saved. Default True.
- Type:
- fetch_images
If False, it will not download images to verify if they abide by the settings in top_image_settings. Default True.
- Type:
- follow_meta_refresh
if True, it will follow meta refresh redirect when downloading an article. default False.
- Type:
- clean_article_html
if True, it will clean ‘unnecessary’ tags from the article body html. Affected property is Article.article_html. Default True.
- Type:
- http_success_only
if True, it will raise an ArticleException if the HTTP status code is >= 400 (e.g. a 404 page). Default True.
- Type:
- verbose
if True, it will output debugging information.
Deprecated: Use the standard Python logging module instead.
- Type:
- allow_binary_content
if True, it will allow binary content to be downloaded and stored in Article.html. Allowing this for Source building can lead to longer processing times and could hang the process due to huge binary files (such as movies). Default False.
- Type:
- ignored_content_types_defaults
dictionary of content-types and a default stub content. These content types will not be downloaded.
Note: If allow_binary_content is False, binary content will lead to ArticleBinaryDataException for Article.download() and will be skipped in Source.build(). This will override the defaults in ignored_content_types_defaults if these match binary files.
- Type:
- use_cached_categories
if set to False, the cached categories will be ignored and the Source will recompute the category list every time you build it.
- Type:
- MIN_WORD_COUNT
Deprecated since version 0.9.2: use Configuration.min_word_count instead
- Type:
- MIN_SENT_COUNT
Deprecated since version 0.9.2: use Configuration.min_sent_count instead
- Type:
- MAX_TITLE
Deprecated since version 0.9.2: use Configuration.max_title instead
- Type:
- MAX_TEXT
Deprecated since version 0.9.2: use Configuration.max_text instead
- Type:
- MAX_KEYWORDS
Deprecated since version 0.9.2: use Configuration.max_keywords instead
- Type:
- MAX_AUTHORS
Deprecated since version 0.9.2: use Configuration.max_authors instead
- Type:
- MAX_SUMMARY
Deprecated since version 0.9.2: use Configuration.max_summary instead
- Type:
- MAX_SUMMARY_SENT
Deprecated since version 0.9.2: use Configuration.max_summary_sent instead
- Type:
- MAX_FILE_MEMO
Deprecated since version 0.9.2: use Configuration.max_file_memo instead
- Type:
- __getstate__()
Return state values to be pickled.
- __setstate__(state)
Restore state from the unpickled state values.
- property browser_user_agent
The user agent string sent to web servers when downloading articles. If not set, it will default to newspaper/x.x.x, e.g. newspaper/0.9.1.
- Type:
- property headers
The headers sent to web servers when downloading articles. It will set the headers for the get call from the requests library. Note: If you set the browser_user_agent property, it will override the User-Agent header.
- Type:
- property language
the ISO 639-1 two-letter code of the language. If not set, Article will try to use the meta information of the website to get the language. English is the fallback.
- Type:
- property proxies
The proxies for the get call from the requests library. If not set, it will default to no proxies.
- Type:
Optional[dict]
- property request_timeout
The timeout for the get call from the requests library. If not set, it will default to 7 seconds.
- update(**kwargs)
Update the configuration object with the given keyword arguments.
- Parameters:
**kwargs – The keyword arguments to update.
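A short sketch of creating a Configuration and adjusting it through update(), plus the length rule documented for min_word_count and min_sent_count. The helper names make_config and passes_thresholds are hypothetical. passes_thresholds only restates the documented rule on word and sentence counts that are already known; how the library counts words and sentences is not covered here.

```python
def make_config(**overrides):
    """Create a Configuration and apply keyword overrides via update()."""
    from newspaper.configuration import Configuration  # imported lazily

    config = Configuration()
    config.update(**overrides)
    return config


def passes_thresholds(word_count, sent_count,
                      min_word_count=300, min_sent_count=7):
    """Restate the documented rule: articles with fewer words or fewer
    sentences than the minimums are ignored when a Source builds its list.

    The defaults are the documented ones (300 words, 7 sentences).
    """
    return word_count >= min_word_count and sent_count >= min_sent_count
```

A call such as make_config(min_word_count=150, fetch_images=False) would return an object that can be passed as the config argument of Article or Source.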
Article
Article objects can also be created with the shortcut method:
a = newspaper.article(url, language='en', ...)
which is equivalent to:
a = newspaper.Article(url, language='en', ...)
a.download()
a.parse()
You can pass any of the Article constructor arguments to the shortcut method.
- class newspaper.Article(url: str, title: str = '', source_url: str = '', read_more_link: str = '', config: Configuration | None = None, **kwargs: Any)
Article abstraction for newspaper.
This object fetches and holds information for a single article. In order to download the article, call download(). Then call parse() to extract the information.
- config
the active configuration for this article instance. You can use different settings for any article instance.
- Type:
- extractor
Content parsing object.
- Type:
ContentExtractor
- url
The article link. This was used to download the current article. In case of a redirect (through meta refresh or read more link), this will be different from the original url.
- Type:
- original_url
The original url of the article. This is the url that was passed to the constructor. It will not change in case of a redirect.
- Type:
- title
Parsed title of the article. It can be forced/overridden by providing a title in the constructor.
- Type:
- read_more_link
An XPath selector for the link to the full article. Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several XPath selectors separated by |.
- Type:
- top_image
The top image url of the article. It will try to guess the best fit for a main image from the images found in the article.
- Type:
- text
a parsed version of the article body. It will be truncated to the first config.max_text characters.
- Type:
- text_cleaned
a parsed version of the clean_top_node content. It will be truncated to the first config.max_text characters.
Deprecated since version 0.9.3: is now same as Article.text; clean_top_node is removed.
- Type:
- keywords
An inferred list of keywords for this article. This will be generated by the nlp method. It will be truncated to the first config.max_keywords keywords.
- meta_keywords
A list of keywords provided by the meta data. It will be truncated to the first config.max_keywords keywords.
- authors
The author list parsed from the article. It will be truncated to the first config.max_authors authors.
- publish_date
The parsed publishing date from the article. If no valid date is found, it will be an empty string.
- Type:
- summary
The summarization of the article as generated by the nlp method. It will be truncated to the first config.max_summary_sent sentences.
- Type:
- download_state
ArticleDownloadState.SUCCESS if download() was successful, ArticleDownloadState.FAILED_RESPONSE if download() failed, ArticleDownloadState.NOT_STARTED if download() was not called.
- Type:
- meta_lang
The language extracted from the meta data. If config.language is not set, this value will be used to parse the article instead of the config.language value.
- Type:
- top_node
Top node of the original DOM tree. It contains the text nodes for the detected article body. This node is on the doc DOM tree.
- Type:
lxml.html.HtmlElement
- doc
the full DOM of the downloaded html. It is the original DOM tree.
- Type:
lxml.html.HtmlElement
- clean_doc
a cleaned version of the DOM tree.
Deprecated since version 0.9.3: is now same as Article.doc.
- Type:
lxml.html.HtmlElement
- Article.__init__(url: str, title: str = '', source_url: str = '', read_more_link: str = '', config: Configuration | None = None, **kwargs: Any)
Constructs the article class. Will not download or parse the article
- Parameters:
url (str) – The input url to parse. Can be a URL or a file path.
title (str, optional) – Default title if none can be extracted from the webpage. Defaults to empty string.
source_url (str, optional) – URL of the main website that originates the article. If left empty, it will be inferred from the url. Defaults to “”.
read_more_link (str, optional) – An XPath selector for the link to the full article, in case there is a ‘preview’ with a read-more button that leads to another url (even on another domain). Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several XPath selectors separated by |. Defaults to “”.
config (Configuration, optional) – Configuration settings for this article's download/parsing/nlp. If left empty, it will use the default settings. Defaults to None.
- Keyword Arguments:
**kwargs – Any Configuration class property can be overwritten through init keyword params. Additionally, you can specify any of the following requests.get parameters: headers, cookies, auth, timeout, allow_redirects, proxies, verify, cert. For other requests parameters, you can use the Configuration.requests_params dictionary.
- Raises:
ArticleException – Error parsing and preparing the article
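As a rough illustration of the constructor arguments, the sketch below joins several XPath selectors with | for read_more_link and forwards extra keyword arguments as configuration overrides. The helper names combine_xpaths and make_article, and the selectors in the usage note, are hypothetical.

```python
def combine_xpaths(*selectors):
    """Join several XPath selectors into one string separated by |."""
    return "|".join(selectors)


def make_article(url, read_more_selectors=(), **config_options):
    """Construct an Article without downloading or parsing it.

    config_options may hold any Configuration property (for example
    language or fetch_images) or one of the listed requests.get parameters.
    """
    from newspaper import Article  # imported lazily

    return Article(
        url,
        read_more_link=combine_xpaths(*read_more_selectors),
        **config_options,
    )
```

For example, combine_xpaths("//a[@class='more']", "//a[@id='full']") produces a single selector string that matches either link.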
- Article.download()
Downloads the link’s HTML content. Don’t use this if you are batch async downloading articles.
- Parameters:
input_html (str, optional) – A cached version of the article to parse. It will load the html from this string without attempting to access the article url. If you have a read_more_link xpath set up in the constructor, and do not set ignore_read_more to true, it will attempt to follow the found read_more link (if any). Defaults to None.
title (str, optional) – Force an article title. Defaults to None.
recursion_counter (int, optional) – Used to prevent infinite recursions due to meta_refresh. Defaults to 0.
ignore_read_more (bool, optional) – If true, the download process will ignore any kind of "read_more" xpath set up in the constructor. Defaults to False.
- Returns:
self
- Return type:
Article
- Article.parse()
Parse the previously downloaded article. If download() wasn’t called, it will raise an ArticleException. Populates the article properties such as: title, authors, publish_date, text, top_image, etc.
- Returns:
self
- Return type:
Article
- Article.nlp()
Method expects download() and parse() to have been run. It will perform the keyword extraction and summarization
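The required call order (download, then parse, then nlp) can be sketched as below. The function names run_pipeline and article_from_url are hypothetical; the methods and attributes they touch are the ones documented in this section. run_pipeline accepts any object with those methods, so it can be tried with a stand-in object before pointing it at a real URL.

```python
def run_pipeline(article):
    """Run the three documented steps in order and collect a few results."""
    article.download()  # fetch the HTML
    article.parse()     # populate title, authors, publish_date, text, ...
    article.nlp()       # populate keywords and summary
    return {
        "title": article.title,
        "keywords": article.keywords,
        "summary": article.summary,
    }


def article_from_url(url, **kwargs):
    """Construct (but do not download) an Article for the given URL."""
    from newspaper import Article  # imported lazily

    return Article(url, **kwargs)
```

With the library installed, run_pipeline(article_from_url(url, language="en")) would perform the download, parsing and NLP steps for a single article.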
Source
- class newspaper.Source(url: str, read_more_link: str = '', config: Configuration | None = None, **kwargs)
Sources are abstractions of online news websites such as huffpost or cnn. The object will create a list of article urls that belong to the source and the list of categories of news (world, politics, etc.) that the source has. These categories are inferred from the source’s homepage structure.
- url
The url of the source’s homepage. e.g. https://www.cnn.com
- Type:
- config
The configuration object for this source.
- Type:
- feeds
A list of Feed objects that belong to the source containing information about the source’s RSS feeds.
- Type:
- doc
The parsed lxml root of the source’s homepage.
- Type:
lxml.html.HtmlElement
- Source.__init__(url: str, read_more_link: str = '', config: Configuration | None = None, **kwargs)
The config object for this source will be passed into all of this source’s children articles unless specified otherwise or re-set.
- Parameters:
url (str) – The url of the source’s homepage. e.g. https://www.cnn.com
read_more_link (str, optional) – An XPath selector for the link to the full article. Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several XPath selectors separated by |. Defaults to “”.
config (Configuration, optional) – The configuration object for this source. Defaults to None.
- Keyword Arguments:
**kwargs – Any Configuration class property can be overwritten through init keyword params. Additionally, you can specify any of the following requests parameters: headers, cookies, auth, timeout, allow_redirects, proxies, verify, cert
- Source.build()
Encapsulates download and basic parsing with lxml. Executes download, parse, gets categories and article links, parses rss feeds and finally creates a list of Article objects. Articles are not yet downloaded.
- Parameters:
input_html (str, optional) – The cached html of the source to parse. Leave None to download the html. Defaults to None.
only_homepage (bool, optional) – If true, the source object will only parse the homepage of the source. Defaults to False.
only_in_path (bool, optional) – If true, the source object will only parse the articles that are in the same path as the source’s homepage. You can scrape a specific category this way. Defaults to False.
- Source.feeds_to_articles()
Returns a list of Article objects based on articles found in the Source’s RSS feeds
- Source.categories_to_articles()
Takes the categories, splays them into a big list of urls and churns the articles out of each url with the url_to_article method
- Source.generate_articles()
Creates the Source.articles list of Article objects. It gets the URLs from all detected categories and RSS feeds, and checks them for plausibility based on their URL (using some heuristics defined in the urls.valid_url function). These can be further downloaded using Source.download_articles().
- Parameters:
- Source.download_articles()
Starts the download() for all Article objects in the Source.articles property. It can run single threaded or multi-threaded.
- Returns:
A list of downloaded articles.
- Return type:
list[Article]
- Source.download()
Downloads html of source, i.e. the news site homepage
- Source.size()
Returns the number of articles linked to this news source
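One possible way to chain the Source methods is sketched below. The names first_titles and source_from_url are hypothetical; the calls to build(), download_articles() and Article.parse() follow the descriptions in this section. Note that download_articles() starts downloads for every article the source found, which may be a large number on a real news site.

```python
def first_titles(source, limit=3):
    """Build a source, download its articles and return a few parsed titles."""
    source.build()  # categories, feeds and article list; no article downloads yet
    downloaded = source.download_articles()  # list of downloaded Article objects
    titles = []
    for article in downloaded[:limit]:
        article.parse()  # download() has already run for these articles
        titles.append(article.title)
    return titles


def source_from_url(url, **kwargs):
    """Construct a Source for a news site homepage without building it."""
    from newspaper import Source  # imported lazily

    return Source(url, **kwargs)
```

With the library installed, first_titles(source_from_url("https://www.cnn.com"), limit=5) would return up to five article titles from that source.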
Category
- class newspaper.source.Category(url: str, html: str | None = None, doc: Element | None = None)
A category object is a representation of a category of news on a news source’s homepage. For example, on cnn.com, the category “world” would be a category object.
- url
The url of the category’s homepage. e.g. https://www.cnn.com/world
- Type:
- doc
The parsed lxml root of the category’s homepage.
- Type:
lxml.html.HtmlElement
Feed
- class newspaper.source.Feed(url: str, rss: str | None = None)
A feed object is a representation of an RSS feed on a news source’s homepage. For example, on cnn.com, the feed “http://rss.cnn.com/rss/edition_world.rss” would represent a feed object.
- url
The url of the feed’s homepage. e.g. http://rss.cnn.com/rss/edition_world.rss
- Type:
Exceptions
- class newspaper.ArticleException
Generic Article Exception thrown by the article package.
- class newspaper.ArticleBinaryDataException
Exception raised for binary data in urls. Will be raised if Configuration.allow_binary_content is False.