.. _cli:

Command Line Interface (CLI)
============================

Newspaper4k provides a command-line interface (CLI) that lets you download and
parse news articles without writing any Python code. You can process a single
URL, a list of URLs from a file, or a stream of URLs piped from another command,
and output the results as JSON, CSV, or plain text.

**Usage**::

    python -m newspaper [OPTIONS] (--url URL | --urls-from-file FILE | --urls-from-stdin)

.. argparse::
    :module: newspaper.cli
    :func: get_arparse
    :prog: python -m newspaper

Arguments Reference
-------------------

Input Source (mutually exclusive)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Exactly one of the following three arguments must be provided.

``--url URL``, ``-u URL``
    The URL of the article to download and parse. Can be a standard
    ``http://`` / ``https://`` address or a ``file://`` path to a local HTML
    file.

    Example::

        python -m newspaper --url=https://www.bbc.com/news/world-12345

``--urls-from-file FILE``, ``-uf FILE``
    Path to a plain-text file that contains one URL per line. Every URL in the
    file will be downloaded and parsed in order.

    Example::

        python -m newspaper --urls-from-file=url_list.txt --output-format=csv

``--urls-from-stdin``, ``-us``
    Read URLs from standard input (one per line). This is useful for piping the
    output of another command directly into Newspaper4k.

    Example::

        grep "bbc.com" my_urls.txt | python -m newspaper --urls-from-stdin --output-format=json

HTML Content
^^^^^^^^^^^^

``--html-from-file FILE``, ``-hf FILE``
    Instead of downloading the article from the network, read the HTML from a
    local file. The URL supplied via ``--url`` is still used as the canonical
    address of the article (for metadata and relative-URL resolution), but no
    HTTP request is made.

    When processing multiple URLs (``--urls-from-file`` /
    ``--urls-from-stdin``), this file is applied only to the *first* URL;
    subsequent URLs are downloaded normally.

    Example::

        python -m newspaper \
            --url=https://www.bbc.com/news/world-12345 \
            --html-from-file=/tmp/cached_page.html \
            --output-format=json

Output
^^^^^^

``--output-format {json,csv,text}``, ``-of {json,csv,text}``
    Format used to write the parsed article data. Default: ``json``.

    * ``json`` – a JSON array where each element is an article object
      containing all extracted fields (title, text, authors, publish date,
      keywords, summary, images, …).
    * ``csv`` – a CSV file with a header row followed by one row per article.
    * ``text`` – plain text containing the article title followed by the
      article body, suitable for quick human reading.

``--output-file FILE``, ``-o FILE``
    Write the output to *FILE* instead of printing to standard output. If the
    file already exists it will be **overwritten**. If omitted, results are
    printed to stdout.

    Example::

        python -m newspaper --url=https://www.bbc.com/news/world-12345 \
            --output-format=json --output-file=article.json

Network & HTTP
^^^^^^^^^^^^^^

``--browser-user-agent STRING``, ``-ua STRING``
    Override the HTTP ``User-Agent`` header sent when downloading articles.
    Some websites block requests with the default user-agent; setting a
    browser-like string can help bypass those restrictions.

    Example::

        --browser-user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

``--proxy URL``
    Route all HTTP/HTTPS traffic through this proxy. The expected format is
    ``http://<ip>:<port>``, for example ``http://10.10.1.1:8080``.

``--request-timeout SECONDS``
    Maximum number of seconds to wait for an HTTP response. Default: ``7``.
    Increase this value for slow or rate-limited websites.

``--cookies STRING``
    Cookies to include in every HTTP request, formatted as a
    semicolon-separated list of ``name=value`` pairs, e.g.
    ``session=abc123; consent=true``.

``--skip-ssl-verify``
    Disable SSL/TLS certificate verification. Use with caution: this makes
    connections vulnerable to man-in-the-middle attacks. Useful when accessing
    internal or self-signed HTTPS endpoints.

``--follow-meta-refresh``
    Follow ``<meta http-equiv="refresh">`` redirects when downloading an
    article. Some pages use this tag to redirect visitors to the actual
    content page.

Content Parsing
^^^^^^^^^^^^^^^

``--language LANG``, ``-l LANG``
    Two-letter ISO 639-1 language code of the article (e.g. ``en``, ``fr``,
    ``de``). Default: ``en``. Providing the correct language improves text
    segmentation, stopword filtering, and NLP quality. See :ref:`languages`
    for the full list of supported languages.

``--read-more-link XPATH``
    An XPath expression that identifies the *"read more"* or *"full article"*
    link present on summary/teaser pages. When supplied, Newspaper4k will
    follow that link and parse the full article instead of the truncated
    preview.

    Example::

        --read-more-link="//a[@class='read-more']"

``--skip-fetch-images``
    Do not download images when selecting the article's top image. This speeds
    up parsing because no additional HTTP requests are made, but may result in
    a less accurate top-image selection.

``--max-nr-keywords N``
    Maximum number of keywords to extract from the article during NLP
    processing. Default: ``10``.

``--skip-nlp``
    Skip the Natural Language Processing step entirely. When set, the
    ``keywords`` and ``summary`` fields in the output will be empty. Use this
    flag when NLP is not needed and you want faster processing.

Examples
--------

Download a single article and save it as JSON:

.. code-block:: bash

    python -m newspaper \
        --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html \
        --output-format=json \
        --output-file=cli_cnn_article.json

Process a list of URLs from a text file (one URL per line) and save all results
as CSV:

.. code-block:: bash

    python -m newspaper --urls-from-file=url_list.txt --output-format=csv --output-file=articles.csv

Use pipe redirection to read URLs from stdin:

.. code-block:: bash

    grep "cnn" huge_url_list.txt | python -m newspaper --urls-from-stdin --output-format=csv --output-file=articles.csv

Parse a locally cached HTML file while preserving the original article URL:

.. code-block:: bash

    python -m newspaper \
        --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html \
        --html-from-file=/home/user/myfile.html \
        --output-format=json

Read a local HTML file directly using a ``file://`` URL (the canonical article
URL will be derived from the file path):

.. code-block:: bash

    python -m newspaper --url=file:///home/user/myfile.html --output-format=json

The command above prints the JSON representation of the article parsed from
``/home/user/myfile.html``.

Download a French article and skip NLP processing:

.. code-block:: bash

    python -m newspaper \
        --url=https://www.lemonde.fr/international/article/2023/11/01/example \
        --language=fr \
        --skip-nlp \
        --output-format=text
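
Post-process a saved JSON result in Python. The following is a minimal sketch
and not part of Newspaper4k itself: it assumes the file holds a JSON array as
described for ``--output-format=json``, and that each article object uses the
keys ``title`` and ``authors`` (these key names are an assumption, so check
them against your own output). The helper names ``load_articles`` and
``format_headlines`` are hypothetical:

.. code-block:: python

    import json
    from pathlib import Path


    def load_articles(path):
        """Load the article objects from a file written with --output-format=json."""
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        # The output is documented as a JSON array; tolerate a single object too.
        return data if isinstance(data, list) else [data]


    def format_headlines(articles):
        """Return one 'title (authors)' line per article."""
        lines = []
        for article in articles:
            # "title" and "authors" are assumed key names, hence the fallbacks.
            title = article.get("title") or "(no title)"
            authors = ", ".join(article.get("authors") or []) or "unknown"
            lines.append(f"{title} ({authors})")
        return lines

For instance, after saving a result with ``--output-file=cli_cnn_article.json``,
calling ``format_headlines(load_articles("cli_cnn_article.json"))`` would return
one line per parsed article.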