.. _examples: Examples and Tutorials ====================== 1. Building and Crawling a News Sources using a Multithreaded approach ---------------------------------------------------------------------- Building and crawling news websites can require the handling of multiple sources simultaneously and processing a large volume of articles. You can significantly improve the performance of this process by using multiple threads when crawling. Even if Python is not truly multithreaded (due to the GIL), i/o requests can be handled in parallel. .. code-block:: python from newspaper import Source from newspaper.mthreading import fetch_news import threading class NewsCrawler: def __init__(self, source_urls, config=None): self.sources = [Source(url, config=config) for url in source_urls] self.articles = [] def build_sources(self): # Multithreaded source building threads = [threading.Thread(target=source.build) for source in self.sources] for thread in threads: thread.start() for thread in threads: thread.join() def crawl_articles(self): # Multithreaded article downloading self.articles = fetch_news(self.sources, threads=4) def extract_information(self): # Extract information from each article for source in self.sources: print(f"Source {source.url}") for article in source.articles[:10]: article.parse() print(f"Title: {article.title}") print(f"Authors: {article.authors}") print(f"Text: {article.text[:150]}...") # Printing first 150 characters of text print("-------------------------------") if __name__ == "__main__": source_urls = ['https://slate.com', 'https://time.com'] # Add your news source URLs here crawler = NewsCrawler(source_urls) crawler.build_sources() crawler.crawl_articles() crawler.extract_information() 2. Getting Articles with Scrapy -------------------------------- Install Necessary Packages ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python pip install scrapy pip install newspaper4k Create the scrapy project: .. code-block:: bash scrapy startproject news_scraper This command creates a new folder news_scraper with the necessary Scrapy files. Code the Scrapy Spider ^^^^^^^^^^^^^^^^^^^^^^ Navigate to the news_scraper/spiders folder and create a new spider. For example, news_spider.py: .. code-block:: python import scrapy import newspaper class NewsSpider(scrapy.Spider): name = 'news' start_urls = ['https://abcnews.go.com/elections'] # Replace with your target URLs def parse(self, response): # Extract URLs from the response and yield Scrapy Requests for href in response.css('a::attr(href)'): yield response.follow(href, self.parse_article) def parse_article(self, response): # Use Newspaper4k to parse the article article = newspaper.article(response.url, language='en', input_html=response.text) article.parse() article.nlp() # Extracted information yield { 'url': response.url, 'title': article.title, 'authors': article.authors, 'text': article.text, 'publish_date': article.publish_date, 'keywords': article.keywords, 'summary': article.summary, } Run the Spider ^^^^^^^^^^^^^^ .. code-block:: bash scrapy crawl news -o output.json 3. Using Playwright to Scrape Websites built with Javascript ------------------------------------------------------------- Install Necessary Packages ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python pip install newspaper4k pip install playwright playwright install Scrape with Playwright ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python from playwright.sync_api import sync_playwright import newspaper import time def scrape_with_playwright(url): # Using Playwright to render JavaScript with sync_playwright() as p: browser = p.chromium.launch() page = browser.new_page() page.goto(url) time.sleep(1) # Allow the javascript to render content = page.content() browser.close() # Using Newspaper4k to parse the page content article = newspaper.article(url, input_html=content, language='en') return article # Example URL url = 'https://ec.europa.eu/commission/presscorner/detail/en/ac_24_84' # Replace with the URL of your choice # Scrape and process the article article = scrape_with_playwright(url) article.nlp() print(f"Title: {article.title}") print(f"Authors: {article.authors}") print(f"Publication Date: {article.publish_date}") print(f"Summary: {article.summary}") print(f"Keywords: {article.keywords}") 4. Using Playwright to Scrape Websites that require login ---------------------------------------------------------- .. code-block:: python from playwright.sync_api import sync_playwright import newspaper def login_and_fetch_article(url, login_url, username, password): # Using Playwright to handle login and fetch article with sync_playwright() as p: browser = p.chromium.launch(headless=True) # Set headless=False to watch the browser actions page = browser.new_page() # Automating login page.goto(login_url) page.fill('input[name="log"]', username) # Adjust the selector as per the site's HTML page.fill('input[name="pwd"]', password) # Adjust the selector as per the site's HTML page.click('input[type="submit"][value="Login"]') # Adjust the selector as per the site's HTML # Wait for navigation after login page.wait_for_url('/') # Navigating to the article page.goto(url) content = page.content() browser.close() # Using Newspaper4k to parse the page content article = newspaper.article(url, input_html=content, language='en') return article # Example URLs and credentials login_url = 'https://www.undercurrentnews.com/login/' # Replace with the actual login URL article_url = 'https://www.undercurrentnews.com/2024/01/08/editors-choice-farmed-shrimp-output-to-drop-in-2024-fallout-from-us-expanded-russia-ban/' # Replace with the URL of the article you want to scrape username = 'tester_news' # Replace with your username password = 'test' # Replace with your password # Fetch and process the article article = login_and_fetch_article(article_url, login_url, username, password) article.nlp() print(f"Title: {article.title}") print(f"Authors: {article.authors}") print(f"Publication Date: {article.publish_date}") print(f"Summary: {article.summary}") print(f"Keywords: {article.keywords}")