Command Line Interface (CLI)
Download and parse news articles.
usage: python -m newspaper [-h]
(--url URL | --urls-from-file URLS_FROM_FILE | --urls-from-stdin)
[--html-from-file HTML_FROM_FILE]
[--language LANGUAGE]
[--output-format {csv,json,text}]
[--output-file OUTPUT_FILE]
[--read-more-link READ_MORE_LINK]
[--skip-fetch-images] [--follow-meta-refresh]
[--browser-user-agent BROWSER_USER_AGENT]
[--proxy PROXY] [--request-timeout REQUEST_TIMEOUT]
[--cookies COOKIES] [--skip-ssl-verify]
[--max-nr-keywords MAX_NR_KEYWORDS] [--skip-nlp]
Named Arguments
- --url, -u
The URL of the article to download and parse.
- --urls-from-file, -uf
The file containing the URLs of the articles to download and parse.
- --urls-from-stdin, -us
Read URLs from stdin.
Default: False
- --html-from-file, -hf
The HTML file to parse. This does not download the article; the HTML file is parsed directly. Combine it with --url if you want the parsed article to keep its original web address.
- --language, -l
The language of the article to download and parse.
Default: "en"
- --output-format, -of
Possible choices: csv, json, text
The output format of the parsed article.
Default: "json"
- --output-file, -o
The file to write the parsed article to.
- --read-more-link
An XPath selector for the link to the full article, for cases where the page contains only a summary and you must follow a "read more" link to reach the full text.
- --skip-fetch-images
Whether to skip fetching images when identifying the top image. This option speeds up parsing, but can lead to erroneous top image identification.
Default: False
- --follow-meta-refresh
Whether to follow meta refresh links when downloading the article.
Default: False
- --browser-user-agent, -ua
The user agent string to use when downloading the article.
- --proxy
The proxy to use when downloading the article. The format is http://&lt;proxy_host&gt;:&lt;proxy_port&gt;, e.g. http://10.10.1.1:8080
- --request-timeout
The timeout, in seconds, to use when downloading the article.
Default: 7
- --cookies
The cookies to use when downloading the article. The format is: cookie1=value1; cookie2=value2; …
- --skip-ssl-verify
Whether to skip SSL certificate verification for the article URL.
Default: False
- --max-nr-keywords
The maximum number of keywords to extract from the article.
Default: 10
- --skip-nlp
Whether to skip the NLP step (keyword extraction and summarization).
Default: False
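The string passed to --cookies uses the common "name=value; name=value" form shown above. As an illustration of that format (the helper below is our own sketch, not part of the newspaper CLI), such a string can be split into name/value pairs like this:

```python
# Sketch: turning a "cookie1=value1; cookie2=value2" string into a dict.
# This mirrors the format expected by --cookies; the helper name is ours,
# not part of the newspaper CLI.
def parse_cookie_string(cookie_string: str) -> dict:
    cookies = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, _, value = part.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies

print(parse_cookie_string("sessionid=abc123; theme=dark"))
# {'sessionid': 'abc123', 'theme': 'dark'}
```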
Examples
For instance, you can download an article from CNN and save it as a JSON file:
python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --output-format=json --output-file=cli_cnn_article.json
Or use a list of URLs from a text file (one URL per line) and store all results as a CSV:
python -m newspaper --urls-from-file=url_list.txt --output-format=csv --output-file=articles.csv
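The resulting CSV can be read back with Python's standard csv module. A minimal sketch, assuming "title" and "text" columns (the actual column names depend on your newspaper version; an inline sample stands in for articles.csv here):

```python
import csv
import io

# Sketch: iterating over the CSV produced by --output-format=csv.
# Column names ("title", "text") are assumptions for illustration; the
# inline sample stands in for a real file. In practice, replace `sample`
# with open("articles.csv", newline="", encoding="utf-8").
sample = io.StringIO("title,text\nExample headline,Example body text\n")
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["title"])
```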
You can also use a pipe to read URLs from stdin:
grep "cnn" huge_url_list.txt | python -m newspaper --urls-from-stdin --output-format=csv --output-file=articles.csv
To read the content of a local HTML file, use the --html-from-file option:
python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --html-from-file=/home/user/myfile.html --output-format=json
Files can also be read as file:// URLs. If you want to preserve the original webpage URL, use the previous example with --html-from-file:
python -m newspaper --url=file:///home/user/myfile.html --output-format=json
will print the JSON representation of the article parsed from the HTML file stored in /home/user/myfile.html.
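The JSON output can be post-processed with standard tools. A minimal sketch, assuming "title" and "authors" fields (the actual schema depends on your newspaper version; an inline string stands in for the output file):

```python
import json

# Sketch: post-processing the JSON written by --output-file.
# The field names ("title", "authors") are assumptions for illustration;
# inspect your own output file for the actual schema. In practice, load
# the file with json.load(open("cli_cnn_article.json", encoding="utf-8")).
raw = '{"title": "Example headline", "authors": ["A. Writer"]}'
article = json.loads(raw)
print(article["title"])
```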