Command Line Interface (CLI)
Download and parse news articles.
usage: python -m newspaper [-h]
(--url URL | --urls-from-file URLS_FROM_FILE | --urls-from-stdin)
[--html-from-file HTML_FROM_FILE]
[--language LANGUAGE]
[--output-format {csv,json,text}]
[--output-file OUTPUT_FILE]
[--read-more-link READ_MORE_LINK]
[--skip-fetch-images] [--follow-meta-refresh]
[--browser-user-agent BROWSER_USER_AGENT]
[--proxy PROXY] [--request-timeout REQUEST_TIMEOUT]
[--cookies COOKIES] [--skip-ssl-verify]
[--max-nr-keywords MAX_NR_KEYWORDS] [--skip-nlp]
Named Arguments
- --url, -u
The URL of the article to download and parse.
- --urls-from-file, -uf
The file containing the URLs of the articles to download and parse.
- --urls-from-stdin, -us
Read URLs from stdin.
Default: False
- --html-from-file, -hf
The HTML file to parse. This does not download the article; the HTML file is parsed directly. Combine it with --url if you want the parsed article to keep its original web address.
- --language, -l
The language of the article to download and parse.
Default: "en"
- --output-format, -of
Possible choices: csv, json, text
The output format of the parsed article.
Default: "json"
- --output-file, -o
The file to write the parsed article to.
- --read-more-link
An XPath selector for the link to the full article, for cases where the page contains only a summary and you must follow a "read more" link to reach the full text.
- --skip-fetch-images
Whether to skip fetching images when identifying the top image. This option speeds up parsing, but can lead to erroneous top image identification.
Default: False
- --follow-meta-refresh
Whether to follow meta refresh links when downloading the article.
Default: False
- --browser-user-agent, -ua
The user agent string to use when downloading the article.
- --proxy
The proxy to use when downloading the article. The format is http://&lt;proxy_host&gt;:&lt;proxy_port&gt;, e.g. http://10.10.1.1:8080
- --request-timeout
The timeout, in seconds, to use when downloading the article.
Default: 7
- --cookies
The cookies to use when downloading the article. The format is: cookie1=value1; cookie2=value2; …
- --skip-ssl-verify
Whether to skip SSL certificate verification for the article URL.
Default: False
- --max-nr-keywords
The maximum number of keywords to extract from the article.
Default: 10
- --skip-nlp
Whether to skip the NLP step (keyword extraction and summarization).
Default: False
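The string passed to --cookies uses the common "name=value; name=value" form shown above. As an illustration of that format (the helper below is our own sketch, not part of the newspaper CLI), such a string can be split into name/value pairs like this:

```python
# Sketch: turning a "cookie1=value1; cookie2=value2" string into a dict.
# This mirrors the format expected by --cookies; the helper name is ours,
# not part of the newspaper CLI.
def parse_cookie_string(cookie_string: str) -> dict:
    cookies = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, _, value = part.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies

print(parse_cookie_string("sessionid=abc123; theme=dark"))
# {'sessionid': 'abc123', 'theme': 'dark'}
```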
Examples
For instance, you can download an article from CNN and save it as a JSON file:
python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --output-format=json --output-file=cli_cnn_article.json
Or use a list of URLs from a text file (one URL per line) and store all results as a CSV:
python -m newspaper --urls-from-file=url_list.txt --output-format=csv --output-file=articles.csv
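The resulting CSV can be read back with Python's standard csv module. A minimal sketch, assuming "title" and "text" columns (the actual column names depend on your newspaper version; an inline sample stands in for articles.csv here):

```python
import csv
import io

# Sketch: iterating over the CSV produced by --output-format=csv.
# Column names ("title", "text") are assumptions for illustration; the
# inline sample stands in for a real file. In practice, replace `sample`
# with open("articles.csv", newline="", encoding="utf-8").
sample = io.StringIO("title,text\nExample headline,Example body text\n")
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["title"])
```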
You can also use a pipe to read URLs from stdin:
grep "cnn" huge_url_list.txt | python -m newspaper --urls-from-stdin --output-format=csv --output-file=articles.csv
To read the content of a local HTML file, use the --html-from-file option:
python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --html-from-file=/home/user/myfile.html --output-format=json
Files can also be read as file:// URLs. If you want to preserve the original webpage URL, use the previous example with --html-from-file:
python -m newspaper --url=file:///home/user/myfile.html --output-format=json
will print the JSON representation of the article parsed from the HTML file stored in /home/user/myfile.html.
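The JSON output can be post-processed with standard tools. A minimal sketch, assuming "title" and "authors" fields (the actual schema depends on your newspaper version; an inline string stands in for the output file):

```python
import json

# Sketch: post-processing the JSON written by --output-file.
# The field names ("title", "authors") are assumptions for illustration;
# inspect your own output file for the actual schema. In practice, load
# the file with json.load(open("cli_cnn_article.json", encoding="utf-8")).
raw = '{"title": "Example headline", "authors": ["A. Writer"]}'
article = json.loads(raw)
print(article["title"])
```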