@BitAndQuark
Created March 7, 2019 09:14
Example of using Selenium to parse an HTML table
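import lxml.html
from selenium.webdriver.common.by import By

# `driver` is assumed to be a Selenium WebDriver that has already loaded the page.
# The find_element_by_* helpers were removed in Selenium 4.3, so By locators are used.
# Get the table element
table = driver.find_element(By.CLASS_NAME, 'my_table_class')
# Get all rows, including rows of any nested tables
rows = table.find_elements(By.TAG_NAME, 'tr')
# The header is in the first row; read the text of its <th> elements
header = rows[0]
header_text_list = [th.text for th in header.find_elements(By.TAG_NAME, 'th')]
# Get only the direct rows of the top-level table (skips rows of nested tables)
direct_trs = table.find_elements(By.XPATH, './tbody/tr')
# Parse each row's HTML with lxml instead of querying every <td> through Selenium
tr_table = []
for tr in direct_trs:
    # Fetch the row's HTML source once; parsing it locally is faster
    root = lxml.html.fromstring(tr.get_attribute('innerHTML'))
    tr_text = []
    # Assumes the <tr> holds several <td> cells, so lxml wraps them in one root element
    for td in root.xpath('td'):
        # td.text is the text before any child element (None if there is none)
        tr_text.append(td.text)
    tr_table.append(tr_text)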
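
Once the header texts and the row texts are collected, a common next step is to pair them up. The following is a minimal sketch, not part of the original snippet: `rows_to_records` is a hypothetical helper, and the sample values only stand in for `header_text_list` and `tr_table` so that the block runs without a browser.

```python
def rows_to_records(header, rows):
    """Pair each row's cells with the header names, one dict per row."""
    records = []
    for cells in rows:
        # zip stops at the shorter sequence: extra cells are dropped,
        # and a short row simply leaves the remaining keys out.
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample data standing in for header_text_list and tr_table
header_text_list = ['name', 'price']
tr_table = [['apple', '1.20'], ['pear', None]]

records = rows_to_records(header_text_list, tr_table)
```

Each record can then be looked up by column name, for example `records[0]['price']`. Cells whose `td.text` was `None` stay `None`, so filter or replace them before further processing if needed.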