Installation
Pip
Install newspaper4k with pip:
pip install newspaper4k
Best practice is to use a virtual environment, such as virtualenv:
virtualenv venv
source venv/bin/activate
pip install newspaper4k
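As a quick sanity check before installing, the sketch below reports whether the current interpreter appears to be running inside a virtual environment. The helper name `in_virtual_environment` is purely illustrative and not part of newspaper4k; it uses only the standard `sys` module.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Legacy virtualenv releases set sys.real_prefix; venv and current
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtual_environment())
```

If this prints False after running the activate script, the environment was most likely not activated in the current shell.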
Latest version from GitHub
To install the latest version directly from GitHub:
pip install git+https://github.com/AndyTheFactory/newspaper4k.git
Requirements
newspaper4k requires Python 3.8 or later. It has not been tested on
earlier versions.
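A script that depends on newspaper4k could guard against an unsupported interpreter up front. The following is a minimal sketch, assuming the 3.8 minimum stated above; `MIN_PYTHON` and `python_is_supported` are hypothetical names, not something the package provides.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the requirements above


def python_is_supported(version_info=None) -> bool:
    """Return True if the given (or running) interpreter meets the minimum version."""
    # Compare only (major, minor); the micro version does not matter here.
    current = tuple(version_info or sys.version_info)[:2]
    return current >= MIN_PYTHON


if not python_is_supported():
    print("newspaper4k requires Python 3.8 or later; found", sys.version.split()[0])
```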
The newspaper4k package has the following dependencies:
beautifulsoup4
Pillow
PyYAML
lxml[html_clean]
nltk
requests
feedparser
tldextract
python-dateutil
typing-extensions
brotli
Additionally, for extended language support, you may need to install the following:
Chinese: jieba
Thai: pythainlp
Japanese: tinysegmenter
Bengali, Hindi, Nepali, Tamil: indic-nlp-library
Other optional dependencies include:
Cloudflare-protected sites: cloudscraper
Google News API: gnews
To install with specific optional dependencies, you can use extras in pip. For example, to install with Chinese and Thai support:
pip install newspaper4k[zh,th]
To install cloudscraper for Cloudflare support:
pip install newspaper4k[cloudflare]
To install all optional dependencies:
pip install newspaper4k[all]
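To see which of the optional packages listed above are already available in the current environment, something like the sketch below may help. The import names are assumptions inferred from the package names (in particular, indic-nlp-library is assumed to import as `indicnlp`), and the helper names are illustrative only.

```python
from importlib.util import find_spec

# Feature -> import name. The import names are assumptions based on the
# package names listed above (indic-nlp-library is assumed to import as "indicnlp").
OPTIONAL_MODULES = {
    "Chinese": "jieba",
    "Thai": "pythainlp",
    "Japanese": "tinysegmenter",
    "Bengali, Hindi, Nepali, Tamil": "indicnlp",
    "Cloudflare-protected sites": "cloudscraper",
    "Google News API": "gnews",
}


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


def optional_support() -> dict:
    """Map each optional feature to whether its module is installed."""
    return {feature: module_available(mod) for feature, mod in OPTIONAL_MODULES.items()}


for feature, available in optional_support().items():
    print(f"{feature}: {'installed' if available else 'missing'}")
```

Anything reported as missing can then be installed through the matching pip extra or with the `all` extra.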
Usage
The fastest way to get started is to import the newspaper module and to call
the article function:
import newspaper
a = newspaper.article('https://edition.cnn.com/2023/11/08/china/china-blizzard-disruption-intl-hnk/index.html')
print(a.title)
The article function creates an Article object, downloads the article
and parses it. The Article object has several attributes, such as
title, authors, text and top_image.
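To illustrate how these attributes might be collected in one place, here is a small sketch. The `summarize` helper and the stand-in object with made-up values are hypothetical; they are used so that the example runs without network access or newspaper4k installed.

```python
from types import SimpleNamespace

FIELDS = ("title", "authors", "text", "top_image")  # attributes named above


def summarize(article, max_chars: int = 80) -> dict:
    """Collect the attributes named above from any Article-like object."""
    summary = {name: getattr(article, name, None) for name in FIELDS}
    # Truncate the body text so the summary stays readable when printed.
    text = summary["text"] or ""
    summary["text"] = text[:max_chars]
    return summary


# Stand-in object with made-up values, so the sketch runs without network access.
# With newspaper4k installed, pass the object returned by newspaper.article(url).
fake = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    text="Body text " * 20,
    top_image="https://example.com/image.jpg",
)
print(summarize(fake))
```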
The same result can be achieved step by step by creating the Article object yourself:
import newspaper
a = newspaper.Article('https://edition.cnn.com/2023/11/08/china/china-blizzard-disruption-intl-hnk/index.html')
a.download()
a.parse()
print(a.title)