Web scraping can be a rewarding skill, but beginners often encounter challenges, especially when it comes to locating elements within a web page's structure. In this article, we'll explore a real-world experience of overcoming one such challenge: finding shadow DOM elements using Python and Selenium.
The Challenge: Unlocatable Element
As a novice web scraper, you may have encountered a frustrating roadblock when attempting to locate a specific element on a web page using standard XPath or CSS selectors. This scenario is not uncommon, and it often arises due to the intricacies of web page structures.
One of the reasons you couldn't easily find the element is that web pages can be complex, with elements nested within other elements, some of which might be dynamically generated. Moreover, websites today often employ technologies that make it challenging to access elements straightforwardly, such as single-page applications (SPAs) and shadow DOMs.
In the quest to scrape data from websites, you might have tried various selectors, inspected the web page's source code, and experimented with different XPath expressions. Yet, despite your best efforts, the elusive element remained hidden, seemingly immune to your scraping attempts.
This situation can be both perplexing and time-consuming, leaving you wondering whether there's a way to overcome this obstacle. Fortunately, there are strategies and tools available that can help you navigate such challenges and ultimately succeed in extracting the data you need.
Understanding Shadow DOM:
What Is It: Shadow DOM is a crucial concept in modern web development that provides encapsulation and isolation for web components.
Why It's Used: It's used to create reusable web components with scoped styling, preventing style and behaviour conflicts. It ensures that the internal structure of a component remains hidden and separate from the global web page.
How to Access It: Accessing elements within Shadow DOM typically requires using JavaScript to traverse the shadow DOM boundary and interact with the encapsulated elements programmatically. Traditional methods like XPath and CSS selectors may not work directly due to encapsulation.
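As a quick illustration of that pattern, here is a minimal Selenium sketch using a hypothetical <my-widget> shadow host and a hypothetical page; the real selectors for our target page appear later in the article:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical page, for illustration only
host = driver.find_element(By.CSS_SELECTOR, "my-widget")  # the shadow host element
shadow = driver.execute_script("return arguments[0].shadowRoot", host)
button = shadow.find_element(By.CSS_SELECTOR, ".inner-button")  # CSS only; XPath cannot cross the boundary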
To learn more about Shadow DOM, you can refer to the following article:
Tool To Find the XPath of an Element:
Manually finding an XPath is challenging when the structure of the webpage is complicated. To generate the XPath automatically, there are some open-source tools already available.
SelectorHub is one of the top-rated browser extensions that simplify the process of extracting the Xpath of complex web elements for web scraping. It enhances the web scraping workflow by assisting users in generating accurate and unique selectors, making it easier to locate and interact with elements on web pages. SelectorHub is particularly valuable when dealing with intricate HTML structures and helps streamline the development of web scraping scripts.
Accessing Shadow DOM Elements:
The above image shows the shadow DOM structure of a webpage; such elements can only be accessed by traversing the shadow DOM boundary programmatically and extracting the desired data.
Let's Dive Into the Coding Part
Install the necessary modules and libraries
The Chrome browser updates automatically unless you disable updates in its settings. Each update can break your script, forcing you to download a ChromeDriver compatible with your browser every time the browser gets updated.
To resolve this issue, the Python community provides the webdriver-manager library, which installs a driver compatible with the browser installed on your device, so you don't need to download the driver yourself every time the browser gets updated.
pip install webdriver-manager
You just need one line of code to create a driver object:
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
# On Selenium 3 you can pass the driver path directly like this; see the Selenium 4 equivalent below
driver = webdriver.Chrome(ChromeDriverManager().install())
You can refer to the following documents for creating a driver object based on your Selenium version.
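For reference, here is a minimal sketch of the same setup on Selenium 4, where the driver path has to be wrapped in a Service object (and on Selenium 4.6+ the bundled Selenium Manager can usually fetch a matching driver even without webdriver-manager):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))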
Launch browser
url='https://dev145950.service-now.com'
driver.maximize_window()
driver.get(url)
Now, if you look at the HTML shown in the image above, there are multiple #shadow-root (open) nodes.
You have to select the element just above that #shadow-root (the shadow host) and then traverse through its shadowRoot.
# Select the shadow host: the custom element just above the first #shadow-root (open)
shadow_host = driver.find_element(By.TAG_NAME, "macroponent-f51912f4c700201072b211d4d8c26010")
# Access the shadow DOM
shadow_root = driver.execute_script('return arguments[0].shadowRoot', shadow_host)
# Traverse the elements inside the shadow root using CSS selectors
root = shadow_root.find_element(By.CSS_SELECTOR, "sn-canvas-appshell-root")
appshell = root.find_element(By.CSS_SELECTOR, "sn-canvas-appshell-layout")
polaris = appshell.find_element(By.CSS_SELECTOR, "sn-polaris-layout")
So far we have only traversed through the first #shadow-root.
Now observe the sn-polaris-layout tag: right after it there is again a #shadow-root.
So, to access the content inside that tag, we first have to select the tag and then execute the JavaScript again.
# Access the 2nd shadow DOM
polaris = appshell.find_element(By.CSS_SELECTOR, "sn-polaris-layout")
polaris = driver.execute_script('return arguments[0].shadowRoot', polaris)
polaris_enable = polaris.find_element(By.CSS_SELECTOR, '[class="sn-polaris-layout polaris-enabled"]')
NOTE: Accessing elements inside a shadow DOM is only possible with CSS selectors; if you try to access them with XPath, it will throw an error.
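As a side note, recent Selenium 4 releases expose a shadow_root property on WebElement (supported in Chromium-based browsers), so the execute_script call can usually be replaced with a direct attribute access. A sketch of the same traversal using that property:
# Selenium 4 alternative: use the element's shadow_root property instead of execute_script
shadow_host = driver.find_element(By.TAG_NAME, "macroponent-f51912f4c700201072b211d4d8c26010")
shadow_root = shadow_host.shadow_root  # returns a ShadowRoot object
root = shadow_root.find_element(By.CSS_SELECTOR, "sn-canvas-appshell-root")  # still CSS selectors only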
Accessing an Invisible Element:
In the above image, if you want to find the XPath for the "laptop bag" entry in the suggestion box,
the suggestion disappears as soon as you click on the selector arrow to highlight the XPath for that text.
To tackle this problem, open the Event Listeners panel on the right side of the Inspect tab and remove all the listeners registered under the blur event:
Event Listeners → blur → Remove
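Inside a Selenium script itself, the suggestion box usually stays open because nothing steals focus from the search field, so you can often read the suggestions directly after typing. A minimal sketch, assuming a hypothetical search box and suggestion-list selector (not taken from the page above):
# Hypothetical selectors for illustration only; adjust them to the real page structure
search_box = driver.find_element(By.CSS_SELECTOR, "input[name='q']")
search_box.send_keys("laptop bag")  # typing keeps focus on the field, so suggestions stay visible
time.sleep(2)                       # crude wait for the suggestion box to render
suggestions = driver.find_elements(By.CSS_SELECTOR, "ul.suggestions li")
print([s.text for s in suggestions])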
Conclusion
Web scraping is a skill that evolves with experience. Challenges like locating shadow DOM elements can be daunting for beginners, but with perseverance, research, and continuous learning of web technologies, you can overcome them. Researching is one of the main skills any developer needs in order to find a solution. There is abundant free knowledge available on the internet, and in my experience only passionate problem solvers reach out to every corner of it; if you master the skill of researching, you are already in the top 10% of developers.
Happy coding😊!