Quickstart

Eager to get started? This page gives a good introduction to newspaper. It assumes you already have newspaper installed. If you do not, head over to the Installation section.

Building a news source

Source objects are an abstraction of online news media websites like CNN or ESPN. You can initialize them in two different ways.

Building a Source object for a news site will extract its categories, feeds, articles, brand, and description for you.

You may also provide configuration parameters such as language and browser_user_agent. See the advanced section for details.

import newspaper
cnn_paper = newspaper.build('http://cnn.com')

other_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

If you need finer control, you can use the advanced features and parameters of the Source object described in the advanced section.

Extracting articles from a news source

Every news source has a set of recent articles, mainly linked from its homepage and category pages. The Source object extracts references to these articles and stores them as a list of Article objects in its Source.articles property.

The following examples assume that a news source has been initialized and built.

import newspaper
cnn_paper = newspaper.build('http://cnn.com')

for article in cnn_paper.articles:
    print(article.url)

# 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
# 'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
...

print(cnn_paper.size()) # cnn has 3100 articles
# 3100

Article caching

By default, newspaper caches all previously extracted articles and will not redownload any article which it has already extracted.

This feature exists to prevent duplicate articles and to increase extraction speed. For instance, if you run the build command twice on a news source, the second run will download and parse only the new articles:

cbs_paper = newspaper.build('http://cbs.com')
cbs_paper.size()
# 1030

cbs_paper = newspaper.build('http://cbs.com')
cbs_paper.size()
# 2

The return value of cbs_paper.size() changes from 1030 to 2 because when we first crawled cbs we found 1030 articles. However, on our second crawl, we eliminate all articles which have already been crawled.

This means only 2 new articles have been published since our first extraction.

You can disable this feature by setting the memorize_articles parameter to False.

This can also be achieved by setting the memorize_articles property of the Configuration object to False. More examples are available in the advanced section.

import newspaper

cbs_paper = newspaper.build('http://cbs.com', memorize_articles=False)
cbs_paper.size()
# 1030

cbs_paper = newspaper.build('http://cbs.com', memorize_articles=False)
cbs_paper.size()
# 1030
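The caching behaviour described in this section can be pictured with a small stand-alone sketch. It is not newspaper's implementation: the function name select_new_urls and the in-memory set used as a cache are hypothetical, chosen only to show how remembering URLs makes a second run return just the new ones, and how turning the flag off returns everything again.

```python
def select_new_urls(candidate_urls, seen_urls, memorize_articles=True):
    """Return the URLs to process on this run, in their original order.

    With memorize_articles=True, URLs already in seen_urls are dropped and
    the remaining ones are recorded so that a later run skips them too.
    With memorize_articles=False, nothing is filtered or recorded.
    """
    unique = list(dict.fromkeys(candidate_urls))  # drop duplicates within one run
    if not memorize_articles:
        return unique
    fresh = [url for url in unique if url not in seen_urls]
    seen_urls.update(fresh)
    return fresh


seen = set()  # hypothetical in-memory stand-in for the article cache
first_run = select_new_urls(["/a", "/b", "/c"], seen)
second_run = select_new_urls(["/a", "/b", "/c", "/d"], seen)
print(len(first_run), len(second_run))  # 3 1
```

In this toy model the second run shrinks to the single unseen URL, which mirrors the drop from 1030 to 2 in the cbs example.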

Extracting Source categories

One important feature of the Source object is the ability to extract the website categories from the main page of a news source. This way you can extract articles from a specific category.

for category in cnn_paper.category_urls():
    print(category)

# 'http://lifestyle.cnn.com'
# 'http://cnn.com/world'
# 'http://tech.cnn.com'
...
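One possible way to narrow a Source down to a single category is to keep only the articles whose URL starts with the category URL. The sketch below relies on just the articles and url attributes shown on this page; the helper articles_in_category is hypothetical, and the prefix match is an assumption that will not hold for sites whose article URLs do not sit under their category URL. Stand-in objects are used so that it runs offline.

```python
from types import SimpleNamespace


def articles_in_category(source, category_url):
    """Return the articles whose URL lies under category_url (plain prefix match)."""
    prefix = category_url.rstrip("/") + "/"
    return [article for article in source.articles if article.url.startswith(prefix)]


# Stand-in objects so the sketch runs without network access.
demo_source = SimpleNamespace(articles=[
    SimpleNamespace(url="http://cnn.com/world/story-1"),
    SimpleNamespace(url="http://cnn.com/worldwide-sports/story-2"),
    SimpleNamespace(url="http://tech.cnn.com/story-3"),
])

for article in articles_in_category(demo_source, "http://cnn.com/world"):
    print(article.url)  # http://cnn.com/world/story-1
```

With a built Source, the same call would look like articles_in_category(cnn_paper, 'http://cnn.com/world').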

Extracting Source feeds

RSS feeds play an important role in the news ecosystem. They allow news to propagate and be shared across the web. The Source object will extract the RSS feeds for you:

for feed_url in cnn_paper.feed_urls():
    print(feed_url)

# 'http://rss.cnn.com/rss/cnn_crime.rss'
# 'http://rss.cnn.com/rss/cnn_tech.rss'
...

Extracting Source brand & description

You can use the Source object to extract the source's website base name (e.g. bbc from bbc.co.uk) and its description from known meta tags:

print(cnn_paper.brand)
# 'cnn'

print(cnn_paper.description)
# 'CNN.com delivers the latest breaking news and information on the latest...'
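To make the "bbc from bbc.co.uk" idea concrete, here is a rough standard-library sketch of deriving a base name from a URL. It is an illustration only, not how newspaper computes Source.brand: the function guess_brand and the tiny TWO_PART_SUFFIXES set are hypothetical, and a robust version would consult the full Public Suffix List.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample for illustration; real code would use the Public Suffix List.
TWO_PART_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def guess_brand(url):
    """Guess a site's base name from its URL, e.g. 'bbc' for http://bbc.co.uk."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Skip a two-part public suffix such as "co.uk" when present.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return labels[-3]
    if len(labels) >= 2:
        return labels[-2]
    return host


print(guess_brand("http://bbc.co.uk"))      # bbc
print(guess_brand("http://www.cnn.com/"))   # cnn
```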

Extracting individual News Articles

Article objects are abstractions of news articles (news stories). For example, CNN (cnn.com) is a news Source, while a news article is a specific link containing a news story, like https://edition.cnn.com/2023/11/09/tech/… You can take any Article from an existing (and initialized) news Source, or use the Article object by itself: pass in the URL of the article, then call Article.download() and Article.parse(). The shortcut newspaper.article() creates the Article object for you and calls Article.download() and Article.parse().

Referencing an article from a Source object:

first_article = cnn_paper.articles[0]

Alternatively, initializing an Article object on its own:

first_article = newspaper.Article(url="http://www.lemonde.fr/...", language='fr')

All the initialization parameters that work for Source objects also work for Article objects. There are some differences, however. For example, the title parameter is available only for Article objects.
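As a rough mental model of the newspaper.article() shortcut described in this section, the sketch below creates an article, then downloads and parses it. The helper fetch_article and the FakeArticle stand-in are hypothetical and exist only so the example runs offline; the real shortcut may do more than this.

```python
def fetch_article(article_class, url, **kwargs):
    """Create an article, then download and parse it, in that order."""
    article = article_class(url=url, **kwargs)
    article.download()
    article.parse()
    return article


class FakeArticle:
    """Offline stand-in that records which stages were called."""

    def __init__(self, url, **kwargs):
        self.url = url
        self.kwargs = kwargs
        self.calls = []

    def download(self):
        self.calls.append("download")

    def parse(self):
        self.calls.append("parse")


demo = fetch_article(FakeArticle, "https://example.com/story", language="fr")
print(demo.calls)  # ['download', 'parse']
```

With the library installed, passing newspaper.Article as article_class would be the corresponding real call.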

Ignoring particular content-types for `Source` objects and `Article` objects

Using the ignored_content_types_defaults parameter, it is possible to ignore particular content-types for Source objects and Article objects. This parameter is also available as a property of the Configuration object.

You can provide a dictionary of content-types and their placeholder values. Any article with one of those content-types will be ignored, and the placeholder value will be used instead of the actual content.

import newspaper

pdf_defaults = {
    "application/pdf": "%PDF-",
    "application/x-pdf": "%PDF-",
    "application/x-bzpdf": "%PDF-",
    "application/x-gzpdf": "%PDF-",
}
pdf_article = newspaper.article(
    url='https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf',
    ignored_content_types_defaults=pdf_defaults,
)
print(pdf_article.html)
# %PDF-
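The placeholder substitution can be sketched as a plain dictionary lookup on the media type of the response. This is only a model of the behaviour described in this section, not the library's code: the function html_or_placeholder is hypothetical, and stripping parameters such as charset from the Content-Type value is an assumption.

```python
def html_or_placeholder(content_type, downloaded_html, ignored_defaults):
    """Return the placeholder for an ignored content-type, else the downloaded HTML."""
    # Keep only the media type, e.g. "text/html" from "text/html; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return ignored_defaults.get(media_type, downloaded_html)


pdf_defaults = {"application/pdf": "%PDF-"}
print(html_or_placeholder("application/pdf", "<binary data>", pdf_defaults))
# %PDF-
print(html_or_placeholder("text/html; charset=utf-8", "<html></html>", pdf_defaults))
# <html></html>
```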

Most important `Article` methods

The stages of an Article extraction are as follows:

Downloading an Article

A freshly initialized Article has no html, title, or text. You must first call download(). Downloads can also be run in a multi-threaded fashion; see the advanced section for more details.

first_article = cnn_paper.articles[0]

first_article.download()

print(first_article.html)
# '<!DOCTYPE HTML><html itemscope itemtype="http://...'

print(cnn_paper.articles[7].html)
# will fail, since article is not downloaded yet

Parsing an Article

To extract the meaningful plain text of an article, along with its title, publication date, authors, top image, and so on, call parse() on it. Calling parse() before download() raises an ArticleException.

first_article.parse()

print(first_article.text)
# 'Three sisters who were imprisoned for possibly...'

print(first_article.top_image)
# 'http://some.cdn.com/3424hfd4565sdfgdg436/'

print(first_article.authors)
# ['Eliott C. McLaughlin', 'Some CoAuthor']

print(first_article.title)
# 'Police: 3 sisters imprisoned in Tucson home'

print(first_article.images)
# ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

print(first_article.movies)
# ['url_to_youtube_link_1', ...] # youtube, vimeo, etc

Performing NLP on an Article

Finally, you can process the text obtained above and extract some natural language features using the nlp() method. This will populate the summary and keywords properties of the article.

Note: nlp() is a computationally expensive operation. Use it only when needed; running it on every article in a Source object is not recommended.

You must have called both download() and parse() on the article before calling nlp().

As of the current build, nlp() features only work on western languages.

first_article.nlp()

print(first_article.summary)
# '...imprisoned for possibly a constant barrage...'

print(first_article.keywords)
# ['music', 'Tucson', ... ]

print(cnn_paper.articles[100].nlp()) # fails: this article has not been downloaded and parsed yet
# Traceback (...
# ArticleException: You must parse an article before you try to..
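Since nlp() is expensive, one option is to download and parse articles in order and run nlp() only on those you care about. The sketch below uses only the download(), parse(), and nlp() methods named on this page; the helper summarize_selected, its wanted predicate, and the StubArticle class are hypothetical and exist so the example runs offline.

```python
def summarize_selected(articles, wanted, limit=5):
    """Download and parse articles in order; run nlp() only on wanted ones.

    Stops once `limit` articles have been summarized.
    """
    summarized = []
    for article in articles:
        if len(summarized) >= limit:
            break
        article.download()
        article.parse()
        if wanted(article):
            article.nlp()  # the expensive step, run selectively
            summarized.append(article)
    return summarized


class StubArticle:
    """Offline stand-in that records the stages run on it."""

    def __init__(self, title):
        self.title = title
        self.stages = []

    def download(self):
        self.stages.append("download")

    def parse(self):
        self.stages.append("parse")

    def nlp(self):
        self.stages.append("nlp")


demo = [StubArticle("Tucson police report"), StubArticle("Sports roundup")]
picked = summarize_selected(demo, wanted=lambda a: "Tucson" in a.title)
print([a.title for a in picked])  # ['Tucson police report']
print(demo[1].stages)             # ['download', 'parse']
```

With a built Source, cnn_paper.articles would take the place of the stub list.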

Additional methods

Here are some miscellaneous but hopefully useful features. hot() returns a list of the top trending terms on Google using a public API. popular_urls() returns a list of popular news source URLs, in case you need help choosing a news source!

import newspaper

newspaper.hot()
# ['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]

newspaper.popular_urls()
# ['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]