Web Scraping with Python and Selenium
This project post explores a web scraper built with Python and Selenium for my third-year individual project. It formed part of the data collection process and automates the download of lecture resources from Minerva Blackboard.
Project Details
Libraries and Setup:
This project uses several libraries: os and json for file-system operations and JSON data respectively, requests for HTTP requests, and selenium for browser automation. ChromeDriver is installed automatically through webdriver_manager, which downloads a driver suited to the installed Chrome browser, so the driver never has to be updated by hand:

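from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Download (or reuse a cached) ChromeDriver and start Chrome with it
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)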
                
Login Functionality:
The login function automates the login process by navigating to the specified URL, waiting for the email input field to appear, and entering user credentials:

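driver.get('https://minerva.leeds.ac.uk/ultra/courses/')  # Navigate to the login page
# wait is assumed to be a WebDriverWait bound to driver; 'i0116' is the ID of the email input field
wait.until(EC.visibility_of_element_located((By.ID, 'i0116'))).send_keys(email)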
                
Identifying and Collecting Files:
After login, the getUnits function extracts the URL of each course unit, collecting every hyperlink and its title for later use:

unit_links = container.find_elements(By.TAG_NAME, 'a')
units_info = [{'url': link.get_attribute('href'), 'title': link.text} for link in unit_links]
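
A list built this way may contain anchors with no href, anchors with no visible text, or the same URL more than once. The following is a minimal sketch of a clean-up step, not part of the original scraper: clean_units is a hypothetical helper, and dropping untitled links is an assumption that may not suit every page layout.

```python
def clean_units(units_info):
    """Drop unusable entries and duplicate URLs, keeping first-seen order."""
    seen = set()
    cleaned = []
    for unit in units_info:
        url = unit.get('url')
        title = (unit.get('title') or '').strip()
        # get_attribute('href') returns None when the anchor has no href
        if not url or not title or url in seen:
            continue
        seen.add(url)
        cleaned.append({'url': url, 'title': title})
    return cleaned
```

Running the collected units_info through clean_units before visiting each unit would avoid loading the same page twice.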
                
File Download Process:
Files are selected by their link names. The script checks whether a name contains “lecture” or “transcript”, ignoring case, before starting the download:

link_name = data_bbfile_json['linkName'].lower()  # Lower-case once for a case-insensitive match
if 'lecture' in link_name or 'transcript' in link_name:
    download_file(download_url, download_directory, filename, session)
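
The body of download_file is not shown above, so the following is only a sketch of how the filter and the download might fit together. It assumes session is a requests.Session (or anything with a compatible get method). The is_wanted helper, the 30-second timeout and the 8192-byte chunk size are illustrative choices, not taken from the original script.

```python
import os

KEYWORDS = ('lecture', 'transcript')


def is_wanted(link_name):
    """Return True if the link name mentions a lecture or a transcript."""
    name = link_name.lower()
    return any(keyword in name for keyword in KEYWORDS)


def download_file(download_url, download_directory, filename, session):
    """Stream download_url into download_directory/filename and return the path."""
    os.makedirs(download_directory, exist_ok=True)
    # basename() guards against path separators in names taken from the page
    path = os.path.join(download_directory, os.path.basename(filename))
    # stream=True avoids holding a large slide deck in memory all at once
    response = session.get(download_url, stream=True, timeout=30)
    response.raise_for_status()
    with open(path, 'wb') as handle:
        for chunk in response.iter_content(chunk_size=8192):
            handle.write(chunk)
    return path
```

Streaming in chunks keeps memory use flat for large files, and raise_for_status stops an error page from being saved under a lecture's filename.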
                
Cookie Management:
The downloads are made with requests rather than through the browser, so the cookies from the logged-in Selenium session are copied into the requests session to keep those requests authenticated:

for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])
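
The loop above can be wrapped in a small function so the synchronisation is easy to repeat if the login expires. This is a sketch under assumptions: sync_cookies is a hypothetical name, driver is a Selenium WebDriver, session is a requests.Session, and copying the cookie path as well as the domain is an addition to the original loop.

```python
def sync_cookies(driver, session):
    """Copy every cookie from the Selenium driver into the requests session."""
    for cookie in driver.get_cookies():
        session.cookies.set(
            cookie['name'],
            cookie['value'],
            domain=cookie['domain'],
            # Fall back to the site root if the browser reports no path
            path=cookie.get('path', '/'),
        )
    return session
```

It would be called once after login, for example as session = sync_cookies(driver, requests.Session()), and the returned session passed to the download step.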
                
The scraper selected files with either “lecture” or “transcript” in the title, which filtered out other material posted by the module leader, such as further readings or videos. It retrieved and downloaded all lecture slides as pptx files and all transcripts as pdf files.