Web scraping automates the process of extracting information from websites, making large-scale data gathering possible. Python is a popular choice for the task thanks to its straightforward syntax and powerful libraries such as BeautifulSoup, Scrapy, and Selenium. Proficiency with web scraping is valuable for analysis tasks such as price tracking, review gathering, and news monitoring. In this article, we'll cover web scraping basics and the use of BeautifulSoup, Selenium, Scrapy, and Requests for API scraping.
Understanding Web Scraping
Web scraping makes use of a variety of tools and techniques to extract information.
Tools
- BeautifulSoup: A Python library that parses HTML and XML documents, building a parse tree from a page's source code so that information can be extracted easily.
- Scrapy: A Python framework for web crawling and scraping that uses self-contained crawlers called spiders to extract data from websites.
- Selenium: A browser automation tool that drives a real web browser, which makes it possible to scrape dynamic content generated with JavaScript.
- Requests: An HTTP library for Python for sending HTTP requests and handling the responses.
Legal Considerations
- Many websites' Terms of Service (TOS) specifically ban scraping.
- Scraping data may breach copyright laws by extracting and using material without authorization.
- Always examine and respect a website's robots.txt file, which specifies which portions of the site may and may not be scraped (a quick programmatic check is sketched below).
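As a quick illustration, Python's built-in urllib.robotparser module can check whether a given path may be fetched; the URL below is only a placeholder:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt') # placeholder site; point this at the site you intend to scrape
rp.read() # downloads and parses the robots.txt file
print(rp.can_fetch('*', 'https://example.com/some/page')) # True if any user agent is allowed to fetch this path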
Installing and Configuring Web Scraping Libraries
The following commands are run in a terminal (for example, the integrated terminal of an editor or IDE like Visual Studio Code).
The libraries can be installed with pip, so make sure pip is installed on your system.
Working in a virtual environment is recommended since it reduces conflicts with other projects and helps manage dependencies.
Creating a virtual environment:
python -m venv sampenv
Activation:
- Windows: sampenv\Scripts\activate
- macOS/Linux: source sampenv/bin/activate
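With the virtual environment activated, install the libraries from PyPI (webdriver-manager is included because the Selenium example later in this article relies on it):
pip install requests beautifulsoup4 selenium scrapy webdriver-manager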
Basic Web Scraping with BeautifulSoup
BeautifulSoup is a Python library that works on HTML and XML files to extract data by creating a parse tree from the page source, which can then be traversed to find elements, attributes, or text using methods like find() and find_all().
Extracting Tags
You can extract HTML tags like <p>, <div>, <a>, etc. using the find() and find_all() methods.
Syntax:
- find(): soup.find(tag_name, attrs={})
- find_all(): soup.find_all(tag_name, attrs={})
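Both methods accept an attrs dictionary to filter matches by attribute values. A minimal sketch with a made-up HTML snippet:
from bs4 import BeautifulSoup

snippet = '<div id="main"><p class="note">First</p><p class="note">Second</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')
main_div = soup.find('div', attrs={'id': 'main'}) # first <div> whose id is "main"
notes = soup.find_all('p', attrs={'class': 'note'}) # every <p> with class "note"
print(main_div.name, len(notes)) # prints: div 2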
Extracting Attributes
HTML tag attributes can be extracted using the get() method or dictionary-style access on the tag returned by find().
Syntax:
- tag = soup.find('tag_name')
- get(): tag.get('attr_name')
- Dictionary-style access: tag['attr_name']
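The difference matters when an attribute may be missing: get() returns None in that case, while dictionary-style access raises a KeyError. A small sketch with a made-up link:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">Example</a>', 'html.parser')
link = soup.find('a')
print(link.get('href')) # https://example.com
print(link['href']) # https://example.com
print(link.get('target')) # None, because the attribute is absent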
Extracting Text
Text content of HTML elements can be accessed using the get_text() method or the .text attribute.
Syntax:
- tag = soup.find('tag_name')
- get_text(): tag.get_text()
- .text: tag.text
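Both return the same string, but get_text() also accepts separator and strip arguments, which help when an element contains nested tags or extra whitespace. A minimal sketch:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>\n  Milk\n</li><li>\n  Bread\n</li></ul>', 'html.parser')
first_item = soup.find('li')
print(repr(first_item.text)) # '\n  Milk\n' -- same as get_text() with no arguments
print(first_item.get_text(strip=True)) # 'Milk' -- surrounding whitespace removed
print(soup.find('ul').get_text(separator=', ', strip=True)) # 'Milk, Bread'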
Let's take a look at an example to understand the methods better:
from bs4 import BeautifulSoup
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Grocery List</title>
</head>
<body>
<div id="content">
<h1>Welcome to my List</h1>
<p>This is a list of my <strong>grocery items</strong>.</p>
<ul>
<li>Milk</li>
<li>Bread</li>
<li>Carrots</li>
<li>Butter</li>
</ul>
<a href="https://samp.com">Sample</a>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
title_tag = soup.find('title') # extracts Title tag
h1_tag = soup.find('h1') # extracts H1 tag
p_tag = soup.find('p')
ul_tag = soup.find('ul')
a_tag = soup.find('a')
print("Title:", title_tag.text) # extracts text from Title tag
print("H1:", h1_tag.get_text()) # extracts text from H1 tag
print("Paragraph:", p_tag.text)
print("List Items:")
for li in ul_tag.find_all('li'): # extracts list items
    print("-", li.text)
print("Link Href:", a_tag['href']) # displays 'href' attribute's value
The code defines a sample HTML document, from which tags, attributes, and text are extracted.
Output:
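Title: Grocery List
H1: Welcome to my List
Paragraph: This is a list of my grocery items.
List Items:
- Milk
- Bread
- Carrots
- Butter
Link Href: https://samp.com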
Scraping Dynamic Websites with Selenium
Selenium is a great tool for scraping dynamic webpages that render content with JavaScript. It allows programmatic control of a web browser, enabling button clicks, form submissions, and page navigation. To use Selenium, you need to set up the Selenium WebDriver for your browser:
- Google Chrome
- Mozilla Firefox
- Microsoft Edge
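Alternatively, the webdriver-manager package (also used in the full example later in this section) can download and configure a matching driver automatically. A minimal sketch for Chrome, using a placeholder URL:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(ChromeDriverManager().install()) # downloads a chromedriver build matching the installed Chrome
driver = webdriver.Chrome(service=service)
driver.get('https://example.com') # placeholder page
print(driver.title)
driver.quit()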
Extracting Elements
find_element() Method
- Use the find_element() method to locate a single web element that matches the given criteria.
Syntax:
- element = driver.find_element(By.ID, 'ID_value')
find_elements() Method
- Use the find_elements() method to locate multiple web elements that match the given criteria. It returns a list of WebElement objects representing every match.
Syntax:
- elements = driver.find_elements(By.CLASS_NAME, 'Class_name_val')
get_attribute() Method
- It retrieves the value of an attribute such as href, src, etc. from a web element.
Syntax:
- link = driver.find_element(By.TAG_NAME, 'Tag_value')
- href_value = link.get_attribute('attr_name')
text Property
- It extracts text content from elements like paragraphs, headings, or list items.
Syntax:
- element = driver.find_element(By.CLASS_NAME, 'Class_val')
- txt = element.text
send_keys() Method
- This method is helpful for input fields since it mimics typing into text fields or search boxes.
Syntax:
- inp_box = driver.find_element(By.NAME, 'Name_value')
- inp_box.send_keys('Hello!')
click() Method
- This method is used to click buttons, links, checkboxes, or any clickable element.
Syntax:
- button = driver.find_element(By.ID, 'ID_value')
- button.click()
execute_script() Method
- JavaScript code for advanced interactions, such as scrolling and event triggering, is executed via this method.
Syntax:
- driver.execute_script("your_JS_code")
- Example for scrolling:
- driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Let's take a look at an example to understand it better:
samplepage.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample Page for Selenium</title>
<script>
document.addEventListener('DOMContentLoaded', function() {
document.getElementById('dynamicButton').addEventListener('click', function() {
document.getElementById('dynamicContent').innerHTML = '<p class="dynamicText">I hope you are doing well.</p>';
document.getElementById('dynamicContent').innerHTML += '<a href="https://samp.com" class="dynamicLink">Example Link</a>';
});
});
</script>
</head>
<body>
<h1 id="pageTitle">Welcome!</h1>
<input type="text" id="inputField" name="inputName" placeholder="Enter something...">
<button id="dynamicButton">Load Content</button>
<div id="dynamicContent"></div>
</body>
</html>
Python script for data extraction:
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get("samplepage.html") # opens the webpage
wait = WebDriverWait(driver, 20)
page_title = wait.until(EC.presence_of_element_located((By.ID, 'pageTitle'))) # waits up to 20 seconds for the element to appear
print("Page Title:", page_title.text)
input_field = driver.find_element(By.NAME, 'inputName')
input_field.send_keys('Hello, Alayne!') # enters given data
print("Input Field Value:", input_field.get_attribute('value'))
dynamic_button = driver.find_element(By.ID, 'dynamicButton')
dynamic_button.click() # clicks the button
wait = WebDriverWait(driver, 10)
dynamic_text = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'dynamicText')))
print("Dynamic Text:", dynamic_text.text)
dynamic_link = driver.find_element(By.CLASS_NAME, 'dynamicLink')
print("Dynamic Link Href:", dynamic_link.get_attribute('href'))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scrolls
driver.quit() # close the browser
The script opens the sample HTML page, extracts data from it, enters text into the input field, clicks the button, reads the dynamically loaded content, and scrolls the page.
Output:
Scraping with Scrapy
Scrapy is a Python framework for web crawling and scraping that provides a number of tools for extracting data from webpages efficiently. It pulls data, follows links, and manages requests with minimal effort.
An item is a container that item pipelines use to process and store scraped data. An item pipeline component is a Python class that implements the process_item() method, which is called for every item the spider yields.
Let's walk through how to set up a Scrapy project and define item pipelines using an example. We will be scraping http://quotes.toscrape.com.
Setting up Scrapy Projects and Defining Item Pipelines
- Open your terminal and go to the directory where you want to create a Python project.
Run the following command to create a Scrapy project, which generates the files needed to configure it.
scrapy startproject scrapy_test
This creates a Scrapy project named scrapy_test.
Go to items.py in the scrapy_test folder. Here, you'll define a Scrapy item class, acting as a blueprint for your extracted data. Use scrapy.Field() to create field instances capable of storing any data type.
import scrapy

class ScrapyTestItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
This code creates text, author, and tags fields to store data from the webpage.
To configure the project settings, navigate to settings.py. This is where you register the item-processing pipelines and set the order in which they execute. Lower values indicate higher priority.
ITEM_PIPELINES = {
    'scrapy_test.pipelines.JsonWriterPipeline': 300,
    'scrapy_test.pipelines.HtmlWriterPipeline': 800,
}
This ensures that JsonWriterPipeline executes before HtmlWriterPipeline. These pipelines are defined in the next step.
Navigate to pipelines.py. This file handles data processing, including cleansing, verifying, storing, and exporting. Multiple pipelines can be defined here.
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('quotes.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n" # the item is converted to a JSON-formatted string, and non-ASCII characters are handled properly
        self.file.write(line)
        return item

class HtmlWriterPipeline:
    def open_spider(self, spider):
        self.file = open('quotes.html', 'w')
        self.file.write('<html><body><ul>')

    def close_spider(self, spider):
        self.file.write('</ul></body></html>')
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(f"<li><p>{item['text']}</p><p>{item['author']}</p><p>{', '.join(item['tags'])}</p></li>")
        return item
This code formats the fields and stores the extracted data in two files, quotes.json and quotes.html.
Open scrapy_test/spiders, then create scrapy_code.py. This file contains spider classes that define the URLs to crawl, process the responses, and retrieve data using XPath or CSS selectors. The spider uses the item classes defined in items.py.
import scrapy
from scrapy_test.items import ScrapyTestItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ScrapyTestItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item
This code defines a spider named quotes. Using a CSS selector, it selects all elements with the class quote and creates a ScrapyTestItem instance for each. The get() method retrieves the first matching text node and stores it in the corresponding field, while getall() returns every match (used for the tags).
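If you want to experiment with selectors before committing them to the spider, Scrapy's interactive shell is convenient. From a terminal:
scrapy shell 'http://quotes.toscrape.com'
Inside the shell, the selectors used by the spider can be tried directly:
response.css('div.quote span.text::text').get() # text of the first quote
response.css('div.tags a.tag::text').getall() # all tags on the page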
In your terminal, navigate to the root directory of the project (scrapy_test) and run the command:
scrapy crawl quotes
where quotes is the name of the spider we created.
The extracted data is stored in quotes.json and quotes.html in the project's root directory.
Output:
Scraping APIs and JSON Data
APIs provide an organized way to obtain structured data. They expose endpoints that return data in formats like XML and JSON. Using Requests, you can retrieve the required data by sending requests to these endpoints and parsing the responses.
Accessing APIs may require an API key or token, which is specified in the request headers.
Making HTTP Requests and Accessing APIs
Acquire an API key or token by generating one on the platform of your choice.
Define the URL of the endpoint you want to query and include the key/token in the request headers, along with the content type.
url = 'https://website_name.com/data'
api_key = 'abcxyz'
header = { 'Authorization': f'Bearer {api_key}', 'Content-Type': 'application/json' }
Send an HTTP GET request to the specified URL with the provided headers using the get() method. It returns a Response object containing the server's response, which can be used to extract the status code, content, and more.
response = requests.get(url, headers=header)
If the status code of the response is 200, the request was successful and the contents can be extracted in text or JSON format.
if response.status_code == 200:
    print(response.headers)
    print(response.text) # or response.json()
else:
    print(f"Error: {response.status_code}")
Let's take a simple example using the URL 'https://jsonplaceholder.typicode.com/todos/12':
import requests
url = 'https://jsonplaceholder.typicode.com/todos/12'
header = {'Content-type': 'application/json'}
response = requests.get(url, headers=header)
if response.status_code == 200:
    print(response.headers)
    print(response.json())
else:
    print(f"Error: {response.status_code}")
This code does not involve the use of keys/tokens.
Output:
You can also use Requests to download ZIP files. For example, to download a Kaggle dataset:
import os
import requests
key = '82jfvyg3j4jngjr' # Kaggle API key
dataset_url = 'https://www.kaggle.com/datasets/vivek468/superstore-dataset-final'
save_path = 'C:\\Downloads' # replace with the path you want to store the file in
response = requests.get(dataset_url, headers={'Authorization': f'Bearer {key}'}, stream=True)
if response.status_code == 200:
    with open(os.path.join(save_path, 'dataset.zip'), 'wb') as f: # write the downloaded file to disk in chunks
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
    print('Download successful.')
else:
    print('Failed to download dataset. Status code:', response.status_code)
Output:
Conclusion
Web scraping is a powerful method for obtaining data from websites, making it possible to gather and analyze data effectively. Requests, BeautifulSoup, Selenium, and Scrapy each offer distinct benefits for different scraping requirements and enable developers to explore the web and retrieve valuable data with ease. I encourage you to explore these tools further to unlock the full potential of web scraping for interacting with and analyzing web data.