Created
March 7, 2019 09:14
Example of using Selenium to parse an HTML table
import lxml.html
from selenium.webdriver.common.by import By

# `driver` is assumed to be an existing Selenium WebDriver with the page loaded
# Get the table element
table = driver.find_element(By.CLASS_NAME, 'my_table_class')
# Get all rows, including rows of any nested tables
rows = table.find_elements(By.TAG_NAME, 'tr')
# The header is in the first row; read the text of its <th> elements
header = rows[0]
header_text_list = [th.text for th in header.find_elements(By.TAG_NAME, 'th')]
# Get only the direct rows of the top-level table (not rows of nested tables)
direct_trs = table.find_elements(By.XPATH, './tbody/tr')
# Parse each row's HTML source with lxml to pull the raw data out of its <td> tags
tr_table = []
for tr in direct_trs:
    # Fetch the row's HTML once and parse it locally, which is faster than
    # querying every cell through the WebDriver
    root = lxml.html.fromstring(tr.get_attribute('innerHTML'))
    tr_text = []
    # Assumes the <tr> holds several <td> cells, so they parse as children of root
    for td in root.xpath('td'):
        tr_text.append(td.text)
    # Collect this row's cell texts
    tr_table.append(tr_text)
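The snippet above speeds things up by handing each row's HTML to lxml. A possible variation, sketched here under assumptions, is to fetch the whole table's HTML once (for example with `table.get_attribute('outerHTML')`) and do all of the parsing locally. `SAMPLE_HTML` and `parse_table` are hypothetical names used only for illustration; the sample markup stands in for what the browser would return, so the sketch runs without a WebDriver.

```python
import lxml.html

# Hypothetical markup standing in for table.get_attribute('outerHTML').
# Browsers normally insert <tbody> themselves; lxml does not, so it is
# written out explicitly here.
SAMPLE_HTML = """
<table class="my_table_class">
  <thead><tr><th>Name</th><th>Qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td><b>5</b></td></tr>
  </tbody>
</table>
"""


def parse_table(table_html):
    """Return (header, rows) for the HTML source of a single <table>."""
    root = lxml.html.fromstring(table_html.strip())
    # fromstring may return the <table> itself or a wrapper around it
    table = root if root.tag == 'table' else root.find('.//table')
    # The header is taken from the first row, as in the snippet above
    first_row = table.xpath('.//tr')[0]
    header = [th.text_content().strip() for th in first_row.xpath('./th')]
    rows = []
    # Only direct rows of the top-level table, not rows of nested tables
    for tr in table.xpath('./tbody/tr'):
        # text_content() also picks up text inside child tags such as <b>
        rows.append([td.text_content().strip() for td in tr.xpath('./td')])
    return header, rows


header, rows = parse_table(SAMPLE_HTML)
# Pair each row with the header to get one dict per row
records = [dict(zip(header, row)) for row in rows]
```

With a live page, `parse_table(table.get_attribute('outerHTML'))` would replace the sample string. This costs one WebDriver round trip for the whole table instead of one per row, at the price of working on a static copy of the markup.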