Command Line Interface (CLI)
Newspaper4k provides a command-line interface (CLI) that lets you download and parse news articles without writing any Python code. You can process a single URL, a list of URLs from a file, or a stream of URLs piped from another command, and output the results as JSON, CSV, or plain text.
Usage:
python -m newspaper [OPTIONS] (--url URL | --urls-from-file FILE | --urls-from-stdin)
Download and parse news articles.
usage: python -m newspaper [-h]
(--url URL | --urls-from-file URLS_FROM_FILE | --urls-from-stdin)
[--html-from-file HTML_FROM_FILE]
[--language LANGUAGE]
[--output-format {csv,json,text}]
[--output-file OUTPUT_FILE]
[--read-more-link READ_MORE_LINK]
[--skip-fetch-images] [--follow-meta-refresh]
[--browser-user-agent BROWSER_USER_AGENT]
[--proxy PROXY] [--request-timeout REQUEST_TIMEOUT]
[--cookies COOKIES] [--skip-ssl-verify]
[--max-nr-keywords MAX_NR_KEYWORDS] [--skip-nlp]
Named Arguments
- --url, -u
The URL of the article to download and parse.
- --urls-from-file, -uf
The file containing the URLs of the articles to download and parse.
- --urls-from-stdin, -us
Read URLs from stdin.
Default:
False
- --html-from-file, -hf
The HTML file to parse. The article is not downloaded; the HTML file is parsed directly.
- --language, -l
The language of the article to download and parse.
Default:
'en'
- --output-format, -of
Possible choices: csv, json, text
The output format of the parsed article.
Default:
'json'
- --output-file, -o
The file to write the parsed article to.
- --read-more-link
An XPath selector for the link to the full article, for cases where the page shows only a summary and you need to follow a “read more” link to reach the full text.
- --skip-fetch-images
Whether to skip fetching images when identifying the top image. This option speeds up parsing, but can lead to erroneous top image identification.
Default:
False
- --follow-meta-refresh
Whether to follow meta refresh links when downloading the article.
Default:
False
- --browser-user-agent, -ua
The user agent string to use when downloading the article.
- --proxy
The proxy to use when downloading the article. The format is: http://<proxy_host>:<proxy_port> e.g.: http://10.10.1.1:8080
- --request-timeout
The timeout, in seconds, to use when downloading the article.
Default:
7
- --cookies
The cookies to use when downloading the article. The format is: cookie1=value1; cookie2=value2; …
- --skip-ssl-verify
Whether to skip the certificate verification for the article URL.
Default:
False
- --max-nr-keywords
The maximum number of keywords to extract from the article.
Default:
10
- --skip-nlp
Whether to skip the NLP step.
Default:
False
Arguments Reference
Input Source (mutually exclusive)
Exactly one of the following three arguments must be provided.
--url URL, -u URL
The URL of the article to download and parse. Can be a standard http:// or https:// address, or a file:// path to a local HTML file.
Example:
python -m newspaper --url=https://www.bbc.com/news/world-12345
--urls-from-file FILE, -uf FILE
Path to a plain-text file that contains one URL per line. Every URL in the file is downloaded and parsed in order.
Example:
python -m newspaper --urls-from-file=url_list.txt --output-format=csv
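The input file is plain text with one URL per line. As a minimal sketch, such a file could be produced from a Python list as shown here; the helper name write_url_list and the choice to drop blank and duplicate entries are illustrative assumptions, not Newspaper4k behavior.

```python
from pathlib import Path


def write_url_list(urls, path):
    """Write URLs to `path`, one per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    # Each URL is terminated by a newline so every entry is a complete line.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

The resulting file can then be passed to --urls-from-file.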
--urls-from-stdin, -us
Read URLs from standard input (one per line). This is useful for piping the output of another command directly into Newspaper4k.
Example:
grep "bbc.com" my_urls.txt | python -m newspaper --urls-from-stdin --output-format=json
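The same piping can be done from a Python script by feeding URLs to the CLI's standard input. The sketch below assumes Newspaper4k is installed in the current interpreter's environment; the function names are hypothetical, and only flags documented on this page are used.

```python
import subprocess
import sys


def build_stdin_command(output_format="json"):
    """Return the argv list for a run that reads URLs from standard input."""
    return [
        sys.executable, "-m", "newspaper",
        "--urls-from-stdin",
        f"--output-format={output_format}",
    ]


def urls_to_stdin_text(urls):
    """Join URLs one per line, the layout --urls-from-stdin reads."""
    return "".join(f"{url}\n" for url in urls)


def run_on_urls(urls, output_format="json"):
    """Run the CLI on `urls` and return whatever it prints to stdout."""
    result = subprocess.run(
        build_stdin_command(output_format),
        input=urls_to_stdin_text(urls),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Using sys.executable keeps the child process on the same interpreter, and therefore the same installed packages, as the calling script.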
HTML Content
--html-from-file FILE, -hf FILE
Instead of downloading the article from the network, read the HTML from a local file. The URL supplied via --url is still used as the canonical address of the article (for metadata and relative-URL resolution), but no HTTP request is made. When processing multiple URLs (--urls-from-file / --urls-from-stdin), this file is applied only to the first URL; subsequent URLs are downloaded normally.
Example:
python -m newspaper \
--url=https://www.bbc.com/news/world-12345 \
--html-from-file=/tmp/cached_page.html \
--output-format=json
Output
--output-format {json,csv,text}, -of {json,csv,text}
Format used to write the parsed article data. Default: json.
- json: a JSON array where each element is an article object containing all extracted fields (title, text, authors, publish date, keywords, summary, images, …).
- csv: a CSV file with a header row followed by one row per article.
- text: plain text containing the article title followed by the article body, suitable for quick human reading.
--output-file FILE, -o FILE
Write the output to FILE instead of printing to standard output. If the file already exists it will be overwritten. If omitted, results are printed to stdout.
Example:
python -m newspaper \
--url=https://www.bbc.com/news/world-12345 \
--output-format=json \
--output-file=article.json
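A file written with --output-format=json can be read back with Python's standard json module. The following is a rough sketch: the key name "title" is an assumption based on the field list above, so check the actual output for the exact key names before relying on it.

```python
import json


def load_articles(path):
    """Load the CLI's JSON output and return a list of article dicts."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    # The output is described as a JSON array; tolerate a single object too.
    return data if isinstance(data, list) else [data]


def titles(articles):
    """Collect article titles; "title" is an assumed key name."""
    # .get returns None instead of raising KeyError if the key is absent.
    return [article.get("title") for article in articles]
```

For example, titles(load_articles("article.json")) would list the titles of the articles saved by the command above, provided the key name matches.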
Network & HTTP
--browser-user-agent STRING, -ua STRING
Override the HTTP User-Agent header sent when downloading articles. Some websites block requests with the default user-agent; setting a browser-like string can help bypass those restrictions.
Example:
--browser-user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
--proxy URL
Route all HTTP/HTTPS traffic through this proxy. The expected format is http://<host>:<port>, for example http://10.10.1.1:8080.
--request-timeout SECONDS
Maximum number of seconds to wait for an HTTP response. Default: 7. Increase this value for slow or rate-limited websites.
--cookies STRING
Cookies to include in every HTTP request, formatted as a semicolon-separated list of name=value pairs, e.g. session=abc123; consent=true.
--skip-ssl-verify
Disable SSL/TLS certificate verification. Use with caution: this makes connections vulnerable to man-in-the-middle attacks. Useful when accessing internal or self-signed HTTPS endpoints.
--follow-meta-refresh
Follow <meta http-equiv="refresh"> redirects when downloading an article. Some pages use this tag to redirect visitors to the actual content page.
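The value expected by --cookies is a list of name=value pairs separated by semicolons. As an illustration, it could be assembled from a Python dict as below; the helper format_cookies is hypothetical and the cookie names are the ones from the example above.

```python
def format_cookies(cookies):
    """Format a mapping as 'name1=value1; name2=value2'."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


cookie_arg = "--cookies=" + format_cookies({"session": "abc123", "consent": "true"})
print(cookie_arg)  # --cookies=session=abc123; consent=true
```

Because the value contains spaces and semicolons, it needs quoting when typed in a shell. No extra quoting is needed when it is passed as a single element of an argument list, for example to subprocess.run.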
Content Parsing
--language LANG, -l LANG
Two-letter ISO 639-1 language code of the article (e.g. en, fr, de). Default: en. Providing the correct language improves text segmentation, stopword filtering, and NLP quality. See languages for the full list of supported languages.
--read-more-link XPATH
An XPath expression that identifies the “read more” or “full article” link present on summary/teaser pages. When supplied, Newspaper4k will follow that link and parse the full article instead of the truncated preview.
Example:
--read-more-link="//a[@class='read-more']"
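To see which element an expression like this selects, it can be tried on a small snippet. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so the path is written relative (.//a) rather than absolute (//a). It illustrates the selector only and makes no claim about how Newspaper4k evaluates it; the markup and function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical teaser markup with a "read more" link and an unrelated link.
TEASER = """
<div>
  <p>First paragraph of the teaser.</p>
  <a class="read-more" href="/news/full-story">Read more</a>
  <a class="share" href="/share">Share</a>
</div>
"""


def read_more_hrefs(markup):
    """Return the href of every <a> whose class attribute is exactly 'read-more'."""
    root = ET.fromstring(markup.strip())
    return [a.get("href") for a in root.findall(".//a[@class='read-more']")]


print(read_more_hrefs(TEASER))  # ['/news/full-story']
```

The predicate [@class='read-more'] compares the whole attribute value, so a link with class="read-more btn" would not match it.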
--skip-fetch-images
Do not download images when selecting the article’s top image. This speeds up parsing because no additional HTTP requests are made, but may result in a less accurate top-image selection.
--max-nr-keywords N
Maximum number of keywords to extract from the article during NLP processing. Default: 10.
--skip-nlp
Skip the Natural Language Processing step entirely. When set, the keywords and summary fields in the output will be empty. Use this flag when NLP is not needed and you want faster processing.
Examples
Download a single article and save it as JSON:
python -m newspaper \
--url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html \
--output-format=json \
--output-file=cli_cnn_article.json
Process a list of URLs from a text file (one URL per line) and save all results as CSV:
python -m newspaper --urls-from-file=url_list.txt --output-format=csv --output-file=articles.csv
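A CSV file produced this way has a header row followed by one row per article, so it can be read with Python's csv.DictReader. The sketch below does not assume any particular column names; the helper read_csv_rows is hypothetical.

```python
import csv


def read_csv_rows(path):
    """Return the CSV rows as dicts keyed by the header row."""
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which article bodies are likely to contain.
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

Very long article bodies can exceed the csv module's default field size limit, which csv.field_size_limit can raise.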
Use pipe redirection to read URLs from stdin:
grep "cnn" huge_url_list.txt | python -m newspaper --urls-from-stdin --output-format=csv --output-file=articles.csv
Parse a locally cached HTML file while preserving the original article URL:
python -m newspaper \
--url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html \
--html-from-file=/home/user/myfile.html \
--output-format=json
Read a local HTML file directly using a file:// URL (the canonical article
URL will be derived from the file path):
python -m newspaper --url=file:///home/user/myfile.html --output-format=json
The command above prints the JSON representation of the article parsed from
/home/user/myfile.html.
Download a French article and skip NLP processing:
python -m newspaper \
--url=https://www.lemonde.fr/international/article/2023/11/01/example \
--language=fr \
--skip-nlp \
--output-format=text