Newspaper API

Function calls

newspaper.article(url: str, language: str | None = None, **kwargs) → Article

Shortcut function to fetch and parse a newspaper article from a URL.

Parameters:

url (str) – The URL of the article to download and parse.
language (str) – The language of the article to download and parse.
input_html (str) – The HTML of the article to parse. This is used for pre-downloaded articles. If this is set, then there will be no download requests made.
kwargs – Any other keyword arguments to pass to the Article constructor.

Returns:

The article downloaded and parsed.

Return type:

Article

Raises:

ArticleException – If the article could not be downloaded or parsed.
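
As a rough illustration, the shortcut can be wrapped so that a failed download or parse yields None instead of an exception. The helper name fetch_article and the decision to return None are assumptions made for this sketch, not part of the library; newspaper is imported inside the function only so the snippet loads where the package is absent.

```python
from typing import Any, Optional


def fetch_article(url: str, language: Optional[str] = None, **kwargs: Any):
    """Download and parse one article; return None if that fails.

    `fetch_article` is a hypothetical helper, not a newspaper function.
    """
    import newspaper  # imported lazily; requires the newspaper package

    try:
        # Extra keyword arguments (for example input_html) are forwarded
        # to the shortcut unchanged.
        return newspaper.article(url, language=language, **kwargs)
    except newspaper.ArticleException:
        # Raised when the article could not be downloaded or parsed.
        return None
```

Passing input_html=... through the keyword arguments parses pre-downloaded HTML without a network request, as described for the input_html parameter.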

newspaper.build(url='', dry=False, only_homepage=False, only_in_path=False, input_html=None, config=None, **kwargs) → Source

Returns a constructed Source object without downloading or parsing the articles

Parameters:

url (str) – The url of the source (news website) to build. For example, https://www.cnn.com.
dry (bool) – If true, the source object will be constructed but not downloaded or parsed.
only_homepage (bool) – If true, the source object will only parse the homepage of the source.
only_in_path (bool) – If true, the source object will only parse the articles that are in the same path as the source’s homepage. You can scrape a specific category this way. Defaults to False.
input_html (str) – The HTML of the source to parse. Use this to pass cached HTML to the source object.
config (Configuration) – A configuration object to use for the source.
kwargs – Any other keyword arguments to pass to the Source constructor. If you omit the config object, you can add any configuration options here.

Returns:

The constructed Source object.

Return type:

Source
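
A minimal sketch of building a Source for a single section of a site, assuming the section has its own URL path. The helper name build_section is hypothetical; only newspaper.build and its documented parameters come from the reference above.

```python
from typing import Any


def build_section(section_url: str, **config_options: Any):
    """Build a Source limited to articles under the path of `section_url`.

    `build_section` is a hypothetical helper, not part of newspaper.
    """
    import newspaper  # imported lazily; requires the newspaper package

    # only_in_path=True keeps only articles in the same path as section_url.
    # With no config object given, extra keyword arguments act as
    # configuration options (for example memorize_articles=False).
    return newspaper.build(section_url, only_in_path=True, **config_options)
```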

newspaper.mthreading.fetch_news(news_list: list[str | Article | Source], threads: int = 5) → list[Article | Source]

Fetch news from a list of sources, articles, or both. Threads will be allocated to download and parse the sources or articles. If URLs are passed in the list, a new Article object is created for each one, then downloaded and parsed. No NLP is performed on the articles. If there is a problem detecting the language of a URL, instantiate the Article object yourself with the language parameter and pass it in.

Parameters:

news_list (list[Union[str, Article, Source]]) – List of sources, articles, urls or a mix of them.
threads (int) – Number of threads to use for fetching. This affects how many items from the news_list are fetched at once. To control how many threads are used inside a Source object, use the Configuration.number_threads setting. This can result in a high number of threads: the maximum is threads * Configuration.number_threads.

Returns:

List of articles or sources.

Return type:

list[Union[Article, Source]]
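
The thread arithmetic described for the threads parameter can be sketched as follows. Both helper names are hypothetical, and the numbers in the usage note are arbitrary example values rather than library defaults.

```python
from typing import Any, List


def max_total_threads(threads: int, number_threads: int) -> int:
    """Worst-case thread count: each of `threads` workers may process a
    Source that itself uses up to `number_threads` threads."""
    return threads * number_threads


def fetch_mixed(items: List[Any], threads: int = 5) -> List[Any]:
    """Fetch URLs, Article objects and Source objects in one call."""
    from newspaper.mthreading import fetch_news  # requires newspaper

    return fetch_news(items, threads=threads)
```

For instance, max_total_threads(5, 4) gives 20, so lowering either value is the way to cap concurrency.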

newspaper.hot(): Returns a list of hit terms via google trends

newspaper.languages(): Prints a list of the supported languages

Configuration

class newspaper.configuration.Configuration

Modifies Article / Source properties.

min_word_count

minimum number of word tokens in an article text. When building a list of articles for a Source (using parse_articles), any article with fewer words than this will be ignored. Default 300.

Type:: int

min_sent_count

minimum number of sentences in an article text. When building a list of articles for a Source (using parse_articles), any article with fewer sentences than this will be ignored. Default 7.

Type:: int
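
The effect of min_word_count and min_sent_count can be pictured with a small stand-alone filter. This is only a sketch: the function name is hypothetical and the whitespace and punctuation splitting used here is a naive assumption, so the library's own tokenisation may count words and sentences differently.

```python
import re


def passes_length_filter(text: str, min_word_count: int = 300,
                         min_sent_count: int = 7) -> bool:
    """Return True if `text` is long enough to be kept."""
    # Naive tokenisation for illustration only (assumption).
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words) >= min_word_count and len(sentences) >= min_sent_count
```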

max_title

Article.title max number of chars. title is truncated to this length

Type:: int

max_text

Article.text max number of chars. text is truncated to this length

Type:: int

max_keywords

maximum number of keywords inferred by Article.nlp()

Type:: int

max_authors

maximum number of authors returned in Article.authors

Type:: int

max_summary

max number of chars in Article.summary, truncated to this length

Type:: int

max_summary_sent

maximum number of sentences in Article.summary

Type:: int

max_file_memo

max number of urls we cache for each news source

Type:: int

top_image_settings

settings for finding top image. You can set the following:

min_width: minimum width of image (default 300) in order to be considered top image

min_height: minimum height of image (default 200) in order to be considered top image

min_area: minimum area of image (default 10000) in order to be considered top image

max_retries: maximum number of retries to download the image (default 2)

Type:: dict
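
One way to read the size thresholds is as a simple predicate over an image's dimensions. The sketch below uses the documented keys and defaults, but the function name and the exact comparison logic are assumptions; the library's real check may differ.

```python
from typing import Dict

# Documented keys with their documented default values.
DEFAULT_TOP_IMAGE_SETTINGS: Dict[str, int] = {
    "min_width": 300,
    "min_height": 200,
    "min_area": 10000,
    "max_retries": 2,
}


def meets_top_image_thresholds(
    width: int, height: int,
    settings: Dict[str, int] = DEFAULT_TOP_IMAGE_SETTINGS,
) -> bool:
    """Hypothetical check: is an image large enough to be a top image?"""
    return (
        width >= settings["min_width"]
        and height >= settings["min_height"]
        and width * height >= settings["min_area"]
    )
```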

memorize_articles

If True, the URLs of parsed articles are saved between different Source.generate_articles() runs. The articles themselves are NOT cached. Default True.

Type:: bool

disable_category_cache

If True, it will not cache the Source category urls. default False.

Type:: bool

fetch_images

If False, it will not download images to verify if they abide by the settings in top_image_settings. Default True.

Type:: bool

follow_meta_refresh

if True, it will follow meta refresh redirect when downloading an article. default False.

Type:: bool

clean_article_html

if True it will clean ‘unnecessary’ tags from the article body html. Affected property is Article.article_html. Default True.

Type:: bool

http_success_only

If True, it will raise an ArticleException if the HTTP status code is >= 400 (e.g. a 404 page). Default True.

Type:: bool

requests_params

Any of the params for the get call from requests library

Type:: dict

number_threads

number of threads to use for multi-threaded downloads

Type:: int

verbose

If True, it will output debugging information. Deprecated: use the standard Python logging module instead.

Type:: bool

thread_timeout_seconds

timeout for threads

Type:: int

allow_binary_content

If True, it will allow binary content to be downloaded and stored in Article.html. Allowing this for Source building can lead to longer processing times and could hang the process due to huge binary files (such as movies). Default False.

Type:: bool

ignored_content_types_defaults

Dictionary of content types and a default stub content for each. These content types will not be downloaded.

Note: If allow_binary_content is False, binary content will lead to ArticleBinaryDataException for Article.download() and will be skipped in Source.build(). This will override the defaults in ignored_content_types_defaults if these match binary files.

Type:: dict

use_cached_categories

If set to False, the cached categories will be ignored and the Source will recompute the category list every time you build it.

Type:: bool

MIN_WORD_COUNT

Deprecated since version 0.9.2: use Configuration.min_word_count instead

Type:: int

MIN_SENT_COUNT

Deprecated since version 0.9.2: use Configuration.min_sent_count instead

Type:: int

MAX_TITLE

Deprecated since version 0.9.2: use Configuration.max_title instead

Type:: int

MAX_TEXT

Deprecated since version 0.9.2: use Configuration.max_text instead

Type:: int

MAX_KEYWORDS

Deprecated since version 0.9.2: use Configuration.max_keywords instead

Type:: int

MAX_AUTHORS

Deprecated since version 0.9.2: use Configuration.max_authors instead

Type:: int

MAX_SUMMARY

Deprecated since version 0.9.2: use Configuration.max_summary instead

Type:: int

MAX_SUMMARY_SENT

Deprecated since version 0.9.2: use Configuration.max_summary_sent instead

Type:: int

MAX_FILE_MEMO

Deprecated since version 0.9.2: use Configuration.max_file_memo instead

Type:: int

__getstate__(): Return state values to be pickled.

__setstate__(state): Restore state from the unpickled state values.

property browser_user_agent

The user agent string sent to web servers when downloading articles. If not set, it will default to newspaper/x.x.x, e.g. newspaper/0.9.1.

Type:: str

property headers

The headers sent to web servers when downloading articles. It will set the headers for the get call from requests library. Note: If you set the browser_user_agent property, it will override the User-Agent header.

Type:: str

property language

The ISO 639-1 two-letter code of the language. If not set, Article will try to use the meta information of the website to get the language. English is the fallback.

Type:: str
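
The precedence described for the language property can be written as a small function. It mirrors the documented behaviour only; the function name and arguments are hypothetical and this is not the library's own code.

```python
from typing import Optional


def resolve_language(configured: Optional[str],
                     meta_lang: Optional[str]) -> str:
    """Pick the language code used for parsing (illustrative sketch)."""
    if configured:
        # An explicitly set language always wins.
        return configured
    if meta_lang:
        # Otherwise fall back to the language found in the page's meta data.
        return meta_lang
    # English is the final fallback.
    return "en"
```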

property proxies

The proxies for the get call from requests library. If not set, it will default to no proxies.

Type:: Optional[dict]

property request_timeout

The timeout for the get call from requests library. If not set, it will default to 7 seconds.

Type:: Optional[Union[int, Tuple[int, int]]]

update(**kwargs)

Update the configuration object with the given keyword arguments.

Parameters:: **kwargs – The keyword arguments to update.
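
A short sketch of creating a configuration and overriding a few of the documented settings through update(). The helper name make_config is hypothetical, and the import is done inside the function so the snippet loads where newspaper is not installed.

```python
from typing import Any


def make_config(**overrides: Any):
    """Return a Configuration with the given settings applied.

    `make_config` is a hypothetical helper, not part of newspaper.
    """
    from newspaper.configuration import Configuration  # requires newspaper

    config = Configuration()
    # Keys are Configuration attribute names, e.g. min_word_count=100.
    config.update(**overrides)
    return config
```

A call such as make_config(min_word_count=100, fetch_images=False) would then yield an object that can be passed as the config argument of Article, Source or newspaper.build.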

property use_meta_language

Read-only property that indicates whether the meta language read from the website was used or the language was explicitly set.

Returns:: True if the meta language was used, False if the language was explicitly set.
Return type:: bool

Article

Article objects can also be created with the shortcut method:

a = newspaper.article(url, language='en', ...)

which is equivalent to:

a = newspaper.Article(url, language='en', ...)
a.download()
a.parse()

You can pass any of the Article constructor arguments to the shortcut method.

class newspaper.Article(url: str, title: str = '', source_url: str = '', read_more_link: str = '', config: Configuration | None = None, **kwargs: Any)

Article abstraction for newspaper.

This object fetches and holds information for a single article. In order to download the article, call download(). Then call parse() to extract the information.

config

the active configuration for this article instance. You can use different settings for any article instance.

Type:: Configuration

extractor

Content parsing object.

Type:: ContentExtractor

source_url

URL to the main page of the news source which owns this article

Type:: str

url

The article link. This was used to download the current article. In case of a redirect (through meta refresh or read more link), this will be different from the original url.

Type:: str

original_url

The original url of the article. This is the url that was passed to the constructor. It will not change in case of a redirect.

Type:: str

title

Parsed title of the article. It can be forced/overridden by providing a title in the constructor.

Type:: str

read_more_link

An xpath selector for the link to the full article. Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several xpath selectors separated by |.

Type:: str

top_image

The top image url of the article. It will try to guess the best fit for a main image from the images found in the article.

Type:: str

meta_img

Image url provided by metadata

Type:: str

images

List of all image urls in the current article

Type:: list[str]

movies

List of video links in the article body

Type:: list[str]

text

a parsed version of the article body. It will be truncated to the first config.max_text characters.

Type:: str

text_cleaned

A parsed version of the clean_top_node content. It will be truncated to the first config.max_text characters.

Deprecated since version 0.9.3: is now the same as Article.text; clean_top_node is removed.

Type:: str

keywords

An inferred list of keywords for this article. This will be generated by the nlp method. It will be truncated to the first config.max_keywords keywords.

Type:: list[str]

keyword_scores

A dictionary of keywords and their scores.

Type:: dict[str, float]

meta_keywords

A list of keywords provided by the meta data. It will be truncated to the first config.max_keywords keywords.

Type:: list[str]

tags

Extracted tag list from the article body

Type:: Set[str]

authors

The author list parsed from the article. It will be truncated to the first config.max_authors authors.

Type:: list[str]

publish_date

The parsed publishing date from the article. If no valid date is found, it will be an empty string.

Type:: str

summary

The summarization of the article as generated by the nlp method. It will be truncated to the first config.max_summary_sent sentences.

Type:: str

html

The raw html of the article page.

Type:: str

article_html

The raw html of the article body.

Type:: str

is_parsed

True if parse() has been called.

Type:: bool

download_state

ArticleDownloadState.SUCCESS if download() was successful, ArticleDownloadState.FAILED_RESPONSE if download() failed, ArticleDownloadState.NOT_STARTED if download() was not called.

Type:: int

download_exception_msg

The exception message if download() failed.

Type:: str

history

Redirection history from the requests.get call.

Type:: list[str]

meta_description

The description extracted from the meta data.

Type:: str

meta_lang

The language extracted from the meta data. If config.language is not set, this value will be used to parse the article instead of the config.language value.

Type:: str

meta_favicon

Website’s favicon url extracted from the meta data.

Type:: str

meta_site_name

Website’s name extracted from the meta data.

Type:: str

meta_data

additional meta data extracted from the meta tags.

Type:: dict[str, str]

canonical_link

Canonical URL for the article extracted from the metadata

Type:: str

top_node

Top node of the original DOM tree. It contains the text nodes for the detected article body. This node is on the doc DOM tree.

Type:: lxml.html.HtmlElement

doc

the full DOM of the downloaded html. It is the original DOM tree.

Type:: lxml.html.HtmlElement

clean_doc

A cleaned version of the DOM tree.

Deprecated since version 0.9.3: is now the same as Article.doc.

Type:: lxml.html.HtmlElement

Article.__init__(url: str, title: str = '', source_url: str = '', read_more_link: str = '', config: Configuration | None = None, **kwargs: Any)

Constructs the article class. Will not download or parse the article

Parameters:

url (str) – The input url to parse. Can be a URL or a file path.
title (str, optional) – Default title if none can be extracted from the webpage. Defaults to empty string.
source_url (str, optional) – URL of the main website that originates the article. If left empty, it will be inferred from the url. Defaults to “”.
read_more_link (str, optional) – An xpath selector for the link to the full article, in case there is a ‘preview’ with a read-more button that leads to another url (even on another domain). Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several xpath selectors separated by |. Defaults to “”.
config (Configuration, optional) – Configuration settings for this article's download/parsing/nlp. If left empty, it will use the default settings. Defaults to None.

Keyword Arguments:

**kwargs – Any Configuration class property can be overwritten through init keyword params. Additionally, you can specify any of the following requests.get parameters: headers, cookies, auth, timeout, allow_redirects, proxies, verify, cert. For other requests parameters, you can use the Configuration.requests_params dictionary.

Raises:

ArticleException – Error parsing and preparing the article

Article.download()

Downloads the link’s HTML content. Do not use this if you are batch-downloading articles asynchronously.

Parameters:

input_html (str, optional) – A cached version of the article to parse. It will load the html from this string without attempting to access the article url. If you have a read_more_link xpath set up in the constructor, and do not set ignore_read_more to true, it will attempt to follow the found read_more link (if any). Defaults to None.
title (str, optional) – Force an article title. Defaults to None.
recursion_counter (int, optional) – Used to prevent infinite recursions due to meta_refresh. Defaults to 0.
ignore_read_more (bool, optional) – If true, the download process will ignore any kind of "read_more" xpath set up in the constructor. Defaults to False.

Returns:

self

Return type:

Article

Article.parse()

Parse the previously downloaded article. If download() wasn’t called, it will raise an ArticleException. Populates the article properties such as: title, authors, publish_date, text, top_image, etc.

Returns:: self
Return type:: Article

Article.nlp(): Method expects download() and parse() to have been run. It will perform the keyword extraction and summarization

Source

class newspaper.Source(url: str, read_more_link: str = '', config: Configuration | None = None, **kwargs)

Sources are abstractions of online news websites such as huffpost or cnn. The object will create a list of article urls that belong to the source and the list of categories of news (world, politics, etc.) that the source has. These categories are inferred from the source’s homepage structure.

url

The url of the source’s homepage. e.g. https://www.cnn.com

Type:: str

config

The configuration object for this source.

Type:: Configuration

domain

The domain of the source’s homepage. e.g. cnn.com

Type:: str

scheme

The scheme of the source’s homepage. e.g. https

Type:: str

categories

A list of Category objects that belong to the source.

Type:: list

feeds

A list of Feed objects that belong to the source containing information about the source’s RSS feeds.

Type:: list

articles

A list of Article objects that belong to the source.

Type:: list

brand

The domain name root of the source. e.g. cnn

Type:: str

description

The description of the source as found in the source’s meta tags

Type:: str

doc

The parsed lxml root of the source’s homepage.

Type:: lxml.html.HtmlElement

html

The html of the source’s homepage as downloaded by requests.

Type:: str

favicon

The url of the source’s favicon.

Type:: str

logo_url

The url of the source’s logo.

Type:: str

Source.__init__(url: str, read_more_link: str = '', config: Configuration | None = None, **kwargs)

The config object for this source will be passed into all of this source’s children articles unless specified otherwise or re-set.

Parameters:

url (str) – The url of the source’s homepage. e.g. https://www.cnn.com
read_more_link (str, optional) – An xpath selector for the link to the full article. Make sure that the selector works for all cases, not only for one specific article. If needed, you can use several xpath selectors separated by |. Defaults to “”.
config (Configuration, optional) – The configuration object for this source. Defaults to None.

Keyword Arguments:

**kwargs – Any Configuration class property can be overwritten through init keyword params. Additionally, you can specify any of the following requests parameters: headers, cookies, auth, timeout, allow_redirects, proxies, verify, cert.

Source.build()

Encapsulates download and basic parsing with lxml. Executes download, parse, gets categories and article links, parses rss feeds and finally creates a list of Article objects. Articles are not yet downloaded.

Parameters:

input_html (str, optional) – The cached html of the source to parse. Leave None to download the html. Defaults to None.
only_homepage (bool, optional) – If true, the source object will only parse the homepage of the source. Defaults to False.
only_in_path (bool, optional) – If true, the source object will only parse the articles that are in the same path as the source’s homepage. You can scrape a specific category this way. Defaults to False.

Source.feeds_to_articles(): Returns a list of Article objects based on articles found in the Source’s RSS feeds

Source.categories_to_articles(): Takes the categories, splays them into a big list of urls and churns the articles out of each url with the url_to_article method

Source.generate_articles()

Creates the Source.articles list of Article objects. It gets the URLs from all detected categories and RSS feeds and checks them for plausibility based on their URL (using heuristics defined in the urls.valid_url function). The resulting articles can then be downloaded using Source.download_articles().

Parameters:

limit (int, optional) – The maximum number of articles to generate. Defaults to 5000.
only_in_path (bool, optional) – If true, the source object will only parse the articles that are in the same path as the source’s homepage. You can scrape a specific category this way. Defaults to False.

Source.download_articles()

Starts the download() for all Article objects in the Source.articles property. It can run single threaded or multi-threaded.

Returns:: A list of downloaded articles.
Return type:: list[Article]
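
Put together, the Source methods above might be used as in the following sketch. The helper name download_source_urls is hypothetical, and returning only the article URLs is an arbitrary choice for illustration.

```python
from typing import List


def download_source_urls(homepage_url: str) -> List[str]:
    """Build a Source, download its articles and return their URLs."""
    import newspaper  # imported lazily; requires the newspaper package

    source = newspaper.Source(homepage_url)
    # build() downloads and parses the homepage, finds categories and
    # feeds, and creates the Article objects without downloading them.
    source.build()
    # download_articles() returns the list of downloaded articles.
    downloaded = source.download_articles()
    return [article.url for article in downloaded]
```

The articles returned this way are downloaded but not yet parsed; call parse() on each one before reading properties such as title or text.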

Source.download(): Downloads html of source, i.e. the news site homppage

Source.size(): Returns the number of articles linked to this news source

Category

class newspaper.source.Category(url: str, html: str | None = None, doc: Element | None = None)

A category object is a representation of a category of news on a news source’s homepage. For example, on cnn.com, the category “world” would be a category object.

url

The url of the category’s homepage. e.g. https://www.cnn.com/world

Type:: str

html

The html of the category’s homepage as downloaded by requests.

Type:: str

doc

The parsed lxml root of the category’s homepage.

Type:: lxml.html.HtmlElement

Feed

class newspaper.source.Feed(url: str, rss: str | None = None)

A feed object is a representation of an RSS feed on a news source’s homepage. For example, on cnn.com, the feed “http://rss.cnn.com/rss/edition_world.rss” would represent a feed object.

url

The url of the feed’s homepage. e.g. http://rss.cnn.com/rss/edition_world.rss

Type:: str

rss

The rss of the feed’s content (xml) as downloaded by requests.

Type:: str

Exceptions

class newspaper.ArticleException: Generic Article Exception thrown by the article package.

class newspaper.ArticleBinaryDataException: Exception raised for binary data in URLs. It is raised if Configuration.allow_binary_content is False.
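
A hedged sketch of telling the two exceptions apart when downloading. The helper name download_status and its string return values are assumptions for illustration; the binary-data exception is caught first because nothing above states how the two classes are related.

```python
def download_status(article) -> str:
    """Try to download `article` and report what happened."""
    import newspaper  # imported lazily; requires the newspaper package

    try:
        article.download()
    except newspaper.ArticleBinaryDataException:
        # The url points at binary content and allow_binary_content is False.
        return "binary"
    except newspaper.ArticleException:
        # Any other download problem, e.g. an HTTP error status.
        return "failed"
    return "ok"
```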