
Webpage Loading and Scraping#

Sports Worked Today#

More often than not, I’ll point my request at a page and get back very little of the HTML code. Let’s see an example!

import requests
from bs4 import BeautifulSoup
import pandas as pa

url = "https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning

tables = soup.find_all('table')

pa.read_html(str(tables[0]))[0]
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Playing Time Performance Per 90 Minutes Expected Per 90 Minutes Unnamed: 29_level_0
Player Nation Pos Age MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls Ast G+A G-PK G+A-PK xG npxG xA npxG+xA xG xA xG+xA npxG npxG+xA Matches
0 Hugo Lloris fr FRA GK 35-033 20 20 1800.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.5 0.5 0.00 0.03 0.03 0.00 0.03 Matches
1 Pierre Højbjerg dk DEN MF 26-176 19 19 1697.0 18.9 2.0 1.0 2.0 0.0 0.0 1.0 0.0 0.11 0.05 0.16 0.11 0.16 1.6 1.6 1.0 2.5 0.08 0.05 0.13 0.08 0.13 Matches
2 Eric Dier eng ENG DF 28-013 19 19 1631.0 18.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.4 0.4 1.3 1.7 0.02 0.07 0.09 0.02 0.09 Matches
3 Harry Kane eng ENG FW 28-184 19 18 1612.0 17.9 5.0 2.0 4.0 1.0 1.0 3.0 0.0 0.28 0.11 0.39 0.22 0.33 8.9 8.1 3.3 11.4 0.50 0.19 0.68 0.45 0.64 Matches
4 Son Heung-min kr KOR FW,MF 29-204 17 17 1483.0 16.5 8.0 3.0 8.0 0.0 0.0 0.0 0.0 0.49 0.18 0.67 0.49 0.67 6.9 6.9 3.5 10.4 0.42 0.21 0.63 0.42 0.63 Matches
5 Sergio Reguilón es ESP DF 25-043 18 17 1362.0 15.1 1.0 3.0 1.0 0.0 0.0 3.0 0.0 0.07 0.20 0.26 0.07 0.26 1.4 1.4 3.3 4.6 0.09 0.22 0.31 0.09 0.31 Matches
6 Lucas Moura br BRA FW,MF 29-168 19 15 1334.0 14.8 2.0 3.0 2.0 0.0 0.0 2.0 0.0 0.13 0.20 0.34 0.13 0.34 3.2 3.2 2.5 5.7 0.21 0.17 0.39 0.21 0.39 Matches
7 Oliver Skipp eng ENG MF 21-134 18 14 1350.0 15.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.00 0.00 0.00 0.00 0.00 0.2 0.2 0.9 1.1 0.01 0.06 0.07 0.01 0.07 Matches
8 Emerson br BRA DF 23-014 15 14 1238.0 13.8 0.0 1.0 0.0 0.0 0.0 3.0 0.0 0.00 0.07 0.07 0.00 0.07 0.4 0.4 1.5 2.0 0.03 0.11 0.14 0.03 0.14 Matches
9 Davinson Sánchez co COL DF 25-230 14 12 1111.0 12.3 2.0 0.0 2.0 0.0 0.0 3.0 0.0 0.16 0.00 0.16 0.16 0.16 1.6 1.6 0.1 1.7 0.13 0.01 0.14 0.13 0.14 Matches
10 Ben Davies wls WAL DF 28-279 11 10 931.0 10.3 0.0 1.0 0.0 0.0 0.0 2.0 0.0 0.00 0.10 0.10 0.00 0.10 1.1 1.1 0.5 1.6 0.11 0.04 0.16 0.11 0.16 Matches
11 Japhet Tanganga eng ENG DF 22-303 11 10 735.0 8.2 0.0 0.0 0.0 0.0 0.0 4.0 1.0 0.00 0.00 0.00 0.00 0.00 0.4 0.4 0.4 0.8 0.05 0.05 0.09 0.05 0.09 Matches
12 Dele Alli eng ENG MF,FW 25-292 10 8 657.0 7.3 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.14 0.00 0.14 0.00 0.00 1.5 0.7 0.8 1.5 0.20 0.11 0.31 0.10 0.21 Matches
13 Cristian Romero ar ARG DF 23-276 7 6 533.0 5.9 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 Matches
14 Harry Winks eng ENG MF 25-360 9 6 526.0 5.8 0.0 1.0 0.0 0.0 0.0 2.0 0.0 0.00 0.17 0.17 0.00 0.17 0.1 0.1 1.4 1.5 0.01 0.24 0.25 0.01 0.25 Matches
15 Tanguy Ndombele fr FRA MF 25-031 9 6 484.0 5.4 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.19 0.19 0.37 0.19 0.37 0.5 0.5 0.4 0.9 0.09 0.07 0.16 0.09 0.16 Matches
16 Steven Bergwijn nl NED FW 24-112 10 4 414.0 4.6 2.0 1.0 2.0 0.0 0.0 1.0 0.0 0.43 0.22 0.65 0.43 0.65 1.9 1.9 0.5 2.4 0.42 0.11 0.53 0.42 0.53 Matches
17 Giovani Lo Celso ar ARG MF,FW 25-294 9 2 238.0 2.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.7 0.7 0.6 1.4 0.28 0.24 0.52 0.28 0.52 Matches
18 Ryan Sessegnon eng ENG DF,MF 21-255 4 2 221.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.2 0.2 0.02 0.07 0.09 0.02 0.09 Matches
19 Matt Doherty ie IRL DF,MF 30-012 6 1 237.0 2.6 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.00 0.38 0.38 0.00 0.38 1.1 1.1 0.3 1.4 0.43 0.11 0.54 0.43 0.54 Matches
20 Bryan es ESP FW,MF 20-351 9 0 94.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.1 0.1 0.03 0.07 0.10 0.03 0.10 Matches
21 Joe Rodon wls WAL DF 24-098 1 0 79.0 0.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00 Matches
22 Brandon Austin eng ENG GK 23-020 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
23 Pierluigi Gollini it ITA GK 26-316 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
24 Dilan Markanday eng ENG FW,MF 20-161 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
25 Tobi Omole eng ENG DF 22-042 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
26 Dane Scarlett eng ENG FW 17-310 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
27 Moussa Sissoko fr FRA MF 32-165 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
28 Harvey White eng ENG MF 20-131 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Matches
29 Squad Total NaN NaN 27.0 20 220 1800.0 20.0 24.0 18.0 22.0 2.0 2.0 36.0 1.0 1.20 0.90 2.10 1.10 2.00 31.6 30.1 23.0 53.1 1.58 1.15 2.73 1.50 2.65 NaN
30 Opponent Total NaN NaN 27.5 20 220 1800.0 20.0 23.0 19.0 22.0 1.0 1.0 44.0 5.0 1.15 0.95 2.10 1.10 2.05 21.4 20.6 16.2 36.8 1.07 0.81 1.88 1.03 1.84 NaN

I usually have issues with this site but it worked today. Great place for all sports data!
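The table above comes back with a two-level header, which is where names like Unnamed: 0_level_0 come from. Here is a minimal sketch of flattening such a header. It works on plain (top, bottom) tuples, which is what iterating over a two-level pandas column index yields; the function name flatten_columns is my own, and the sample pairs are copied from the header above.

```python
def flatten_columns(columns):
    """Join two-level column labels into single strings.

    Top-level labels that pandas auto-generated ("Unnamed: ...") are dropped,
    so ("Unnamed: 0_level_0", "Player") becomes "Player" and
    ("Per 90 Minutes", "Gls") becomes "Per 90 Minutes Gls".
    """
    flat = []
    for top, bottom in columns:
        if str(top).startswith("Unnamed"):
            flat.append(str(bottom))
        else:
            flat.append(f"{top} {bottom}")
    return flat


# A few (top, bottom) pairs copied from the header of the table above
sample = [
    ("Unnamed: 0_level_0", "Player"),
    ("Playing Time", "MP"),
    ("Performance", "Gls"),
    ("Per 90 Minutes", "Gls"),
]
flat_names = flatten_columns(sample)
print(flat_names)
```

With the scraped frame in hand, something like df.columns = flatten_columns(df.columns) should apply it, assuming the frame really has two header levels. A side benefit is that the two Gls columns no longer share a name.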

News Source with Javascript#

Here is a scrape I did a while back that still has the issues I want to discuss. I found a table from National Public Radio covering the individuals involved in the attack at the Capitol on 1/6/21. https://www.npr.org/2021/02/09/965472049/the-capitol-siege-the-arrested-and-their-stories#database

The link will take you directly to the table I am interested in scraping. From this page, I found where the table is actually loaded from, which gets me around parsing through all the extra frames. That address is stored as my url below. Take a look at it.

url = "https://apps.npr.org/dailygraphics/graphics/capitol-riot-table-20210204/table.html"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

soup.find_all('table')
[]

Looking at the developer tools, there is clearly a table called ‘riot-table’, but I get nothing on this call! Not even any tables!
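One likely explanation, hinted at by the section title, is that the table is built by JavaScript after the page loads, so the HTML that requests receives never contains a table tag. The sketch below illustrates the idea with the standard-library HTML parser on two made-up strings: one standing in for what the server sends, and one for what the browser holds after the script has run. Neither string is the real NPR markup, and count_tags is a helper name I made up.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times one tag opens in a piece of HTML."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag="table"):
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Made-up stand-in for the raw HTML the server sends: an empty container and a script
static_html = (
    '<html><body><div id="riot-table"></div>'
    '<script src="app.js"></script></body></html>'
)

# Made-up stand-in for the page after the script has filled the container
rendered_html = (
    '<html><body><div id="riot-table">'
    "<table><tr><th>Name</th></tr><tr><td>Example</td></tr></table>"
    "</div></body></html>"
)

print(count_tags(static_html))    # tables in the raw HTML
print(count_tags(rendered_html))  # tables once the script has run
```

In the notebook, the same check could be run on r.text and later on driver.page_source to compare what each approach actually sees.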

This is where the big guns come in! The code below was taken from https://colab.research.google.com/github/nestauk/im-tutorials/blob/3-ysi-tutorial/notebooks/Web-Scraping/Web Scraping Tutorial.ipynb. It will install the packages we need to execute the scrape of the data. Chromium is the open-source browser that Google Chrome is built on; it is how we will go to the websites via the code. Selenium is the big tool: it lets us interact with the webpage and with its JavaScript. Essentially, the machine opens a browser and goes to the website with that browser open!

# RUN THIS CELL WHEN USING THE NOTEBOOK ON COLAB - NO PREVIOUS INSTALLATION OF SELENIUM IS NEEDED
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium
# set options to be headless
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
# (recent Selenium releases no longer accept the driver path as the first argument)
driver = webdriver.Chrome(options=options)

Successfully installed async-generator-1.10 cryptography-36.0.1 h11-0.13.0 outcome-1.1.0 pyOpenSSL-21.0.0 selenium-4.1.0 sniffio-1.2.0 trio-0.19.0 trio-websocket-0.9.2 urllib3-1.26.8 wsproto-1.0.0

Now with all that done, I use the driver to go to the exact same website and, voilà, we have the table!


driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.find_all('table')
#tables
pa.read_html(str(tables[0]))[0]
Name Summary Charges Case Updates
0 Stefanie Nicole Chiguer New 36 years old Dracu... Stefanie Nicole Chiguer came to the attention ... Federal: Entering and Remaining in a Restricte... NaN
1 Alan Fischer New 27 years old Tampa, Florida According to court documents, Alan Fischer III... Federal: Assaulting, Resisting, or Impeding Ce... NaN
2 Juliano Gross New 28 years old Kearny, New Jersey The FBI alleges that Juliano Gross documented ... Federal: Knowingly Entering or Remaining in an... NaN
3 Kirstyn Niemela New 33 years old Hudson, New H... “The whole nation will watch it happen,” Kirst... Federal: Entering and Remaining in a Restricte... NaN
4 Jeffrey Schaefer New 35 years old Milton, Dela... Law enforcement claimed in a court filing that... Federal: Entering and Remaining in a Restricte... NaN
... ... ... ... ...
743 Kyle James Young 37 years old Redfield, Iowa Federal prosecutors accused Kyle Young of assa... Federal: Obstruction of an Official Proceeding... Young pleaded Not Guilty to all charges
744 Philip S. Young 59 years old Sewell, New Jersey Philip Young allegedly pushed a barricade into... Federal: Assaulting, Resisting, or Impeding Ce... Young pleaded Not Guilty to all charges
745 Darrell Alan Youngers 31 years old Cleveland, ... “Non violent disobedience is how you take your... Federal: Entering and Remaining in any Restric... Youngers pleaded Not Guilty to all charges.
746 Ryan Scott Zink 32 years old Lubbock, Texas On Jan. 6, Ryan Zink posted at least three vid... Federal: Obstruction of an Official Proceeding... Zink pleaded Not Guilty to all charges
747 Joseph Elliott Zlab 51 years old Lake Forest P... The man who federal agents believe to be Josep... Federal: Entering and Remaining in a Restricte... Zlab pleaded Guilty to one charge: Parading, D...

748 rows × 4 columns

Of course we can do much more with Selenium. You can actually program it to interact with the page: clicking on links, scrolling to the bottom of the page, etc. These tools are very useful on some webpages!

Crawling Along Our Directory#



url2 = "https://www.ecok.edu/directory"

driver.get(url2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.find_all('table')
pa.read_html(str(tables[0]))[0]
Last Name Department Position Phone Email Building Office
0 Adams, Larame Police Department Police Officer 580-559-5760 ladams@ecok.edu Chickasaw Business & Conference Center ECU Police Department
1 Adamson, Ashley Housing & Residence Life Residence Director 580-559-5127 aadamson@ecok.edu Chokka-Chaffa’ Hall 131
2 Aguirre-Berman, Alexandra Department of Performing Arts Adjunct Instructor of Flute 580-559-5736 aaguirre@ecok.edu Hallie Brown Ford Fine Arts Center 146
3 Allen, Debbie Administration and Finance Administrative Assistant, 580-559-5539 dallen@ecok.edu Administration Building 163
4 Ananga, Erick Department of Politics, Law and Society Assistant Professor 580-559-5413 eananga@ecok.edu Horace Mann 237D
5 Anderson, Destini Athletics Head Softball Coach 580-559-5363 dfanderson@ecok.edu Women's Athletic Facility 119
6 Anderson, Ty Employment Services Director, Assistant VP of Admin & Finance 580-559-5217 tydand@ecok.edu Administration Building 160
7 Anderson, Reanna Oka' Institute Public Service Operations Assistant 580-559-5151 reahand@ecok.edu Fentem Hall 114
8 Andrews, Kenneth College of Health and Sciences Dean College of Health and Sciences 580-559-5496 kandrews@ecok.edu Physical and Environmental Science Center 101A
9 Arcos, Jaime Facilities Management Custodian 580-559-5377 jarcos@ecok.edu Physical Plant NaN
10 Armstrong, Pamla Child Care Resource & Referral Agency Assistant Director 580-559-5303 parmstrong@ecok.edu Fentem Hall 131
11 Armstrong, Darrel Police Department Police Officer 580-559-5555 darmstrong@ecok.edu Chickasaw Business & Conference Center NaN
12 Autrey, LaDonna Department of Psychology Instructor 580-559-5328 lautrey@ecok.edu Lanoy Education Building 211A
13 Baggech, Melody Department of Performing Arts Professor of Music 580-559-5464 mbaggech@ecok.edu Hallie Brown Ford Fine Arts Center 150
14 Bailey, Jessika Office of International Student Services Director 580-559-5252 jbailey@ecok.edu Administration Building 152
15 Bailey, Shelley Department of Professional Programs and Human ... Instructor 580-559-5460 rrbailey@ecok.edu Horace Mann 220 C
16 Bailey III, Riley W. Athletics Head Coach Women's Soccer 580-559-5747 rbailey@ecok.edu Women's Athletic Facility 117
17 Baker, Nelaina (Lainie) Department of Psychology Secretary 580-559-5319 nelbak@ecok.edu Lanoy Education Building 204
18 Baker, Amy Office of Purchasing Purchasing Specialist 580-559-5264 amycbak@ecok.edu Administration Building 164

Here I have gathered the first page of the ECU directory. Say I want to gather all the tables. I am going to click on the link for each letter of the alphabet and add that letter’s table to my collection.

from selenium.webdriver.common.by import By

elem = driver.find_element(By.LINK_TEXT, 'A')

elem.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find_all('table')

df = pa.read_html(str(table[0]))[0]
df['Table'] = 'A'
df
Last Name Department Position Phone Email Building Office Table
0 Adams, Larame Police Department Police Officer 580-559-5760 ladams@ecok.edu Chickasaw Business & Conference Center ECU Police Department A
1 Adamson, Ashley Housing & Residence Life Residence Director 580-559-5127 aadamson@ecok.edu Chokka-Chaffa’ Hall 131 A
2 Aguirre-Berman, Alexandra Department of Performing Arts Adjunct Instructor of Flute 580-559-5736 aaguirre@ecok.edu Hallie Brown Ford Fine Arts Center 146 A
3 Allen, Debbie Administration and Finance Administrative Assistant, 580-559-5539 dallen@ecok.edu Administration Building 163 A
4 Ananga, Erick Department of Politics, Law and Society Assistant Professor 580-559-5413 eananga@ecok.edu Horace Mann 237D A
5 Anderson, Destini Athletics Head Softball Coach 580-559-5363 dfanderson@ecok.edu Women's Athletic Facility 119 A
6 Anderson, Ty Employment Services Director, Assistant VP of Admin & Finance 580-559-5217 tydand@ecok.edu Administration Building 160 A
7 Anderson, Reanna Oka' Institute Public Service Operations Assistant 580-559-5151 reahand@ecok.edu Fentem Hall 114 A
8 Andrews, Kenneth College of Health and Sciences Dean College of Health and Sciences 580-559-5496 kandrews@ecok.edu Physical and Environmental Science Center 101A A
9 Arcos, Jaime Facilities Management Custodian 580-559-5377 jarcos@ecok.edu Physical Plant NaN A
10 Armstrong, Darrel Police Department Police Officer 580-559-5555 darmstrong@ecok.edu Chickasaw Business & Conference Center NaN A
11 Armstrong, Pamla Child Care Resource & Referral Agency Assistant Director 580-559-5303 parmstrong@ecok.edu Fentem Hall 131 A
12 Autrey, LaDonna Department of Psychology Instructor 580-559-5328 lautrey@ecok.edu Lanoy Education Building 211A A

Now that I have the A’s I’ll do a for loop and automate. Why type out all the letters when you can spend 5 minutes googling a solution?

import string
# every capital letter except 'A', which was gathered above
alpha = list(string.ascii_uppercase)[1:]

Now I move through the for loop. I hit a problem at “X”: there were no people in the directory under that letter and it would not go forward, so I made it go back when it hit that error.



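for i in alpha:
  elem = driver.find_element(By.LINK_TEXT, i)
  elem.click()
  soup = BeautifulSoup(driver.page_source, 'html.parser')
  table = soup.find_all('table')
  try:
    df1 = pa.read_html(str(table[0]))[0]
    df1['Table'] = i
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    df = pa.concat([df, df1], ignore_index=True)
  except Exception:
    # nothing to read for this letter (e.g. "X"), so go back a page
    driver.back()
df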
Last Name Department Position Phone Email Building Office Table
0 Adams, Larame Police Department Police Officer 580-559-5760 ladams@ecok.edu Chickasaw Business & Conference Center ECU Police Department A
1 Adamson, Ashley Housing & Residence Life Residence Director 580-559-5127 aadamson@ecok.edu Chokka-Chaffa’ Hall 131 A
2 Aguirre-Berman, Alexandra Department of Performing Arts Adjunct Instructor of Flute 580-559-5736 aaguirre@ecok.edu Hallie Brown Ford Fine Arts Center 146 A
3 Allen, Debbie Administration and Finance Administrative Assistant, 580-559-5539 dallen@ecok.edu Administration Building 163 A
4 Ananga, Erick Department of Politics, Law and Society Assistant Professor 580-559-5413 eananga@ecok.edu Horace Mann 237D A
... ... ... ... ... ... ... ... ...
306 Yoncha, Anne Art + Design : Media + Communication Assistant Professor of Art 580-559-5355 ayoncha@ecok.edu Hallie Brown Ford Fine Arts Center 175 Y
307 York, Christopher Department of English and Languages Adjunct Instructor of English 580-559-5471 chrfyor@ecok.edu Horace Mann 329A Y
308 Youngblood, Susan Career Center Director 580-559-5890 susryou@ecok.edu Administration Building 155 Y
309 Zachary, Kimberly Business Administration Adjunct NaN kzachary@ecok.edu NaN NaN Z
310 Zhang, Hongkai Business Administration Professor 580-559-5561 hzhang@ecok.edu Chickasaw Business & Conference Center 359 Z

311 rows × 8 columns
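The same skip-on-failure pattern can be written so that letters with no table are recorded rather than silently passed over. This is only a sketch: fetch_rows is a hypothetical stand-in for the click-and-parse step, and the tiny fake directory (a few names from the output above) exists so the example runs without a browser.

```python
import string


def collect_by_letter(letters, fetch_rows):
    """Gather rows for each letter, noting the letters that had nothing to parse."""
    rows, skipped = [], []
    for letter in letters:
        try:
            found = fetch_rows(letter)
        except LookupError:
            # no table on this page, so remember the letter and move on
            skipped.append(letter)
            continue
        rows.extend({"Last Name": name, "Table": letter} for name in found)
    return rows, skipped


# Tiny stand-in for the directory pages, using a few names from the output above
fake_directory = {
    "A": ["Adams, Larame"],
    "Z": ["Zachary, Kimberly", "Zhang, Hongkai"],
}


def fake_fetch(letter):
    # a dict lookup raises KeyError (a kind of LookupError) for a missing letter
    return fake_directory[letter]


rows, skipped = collect_by_letter(string.ascii_uppercase, fake_fetch)
print(len(rows), "rows; skipped:", "".join(skipped))
```

Keeping a list of skipped letters makes it easy to check afterwards whether a letter was truly empty or whether the scrape simply failed there.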

Overall, I would say this is very challenging and super specific to the website you are working on. I tried many times to get Selenium to find the links in many different ways. I think the agreed-upon method is XPath.

XPath#

XPath is great for finding boxes and forms on webpages (maybe links too!). You start with the tag name from the HTML to get just those elements of the page and then, if needed, add another identifier such as an attribute. Here is a nice image.

xpath.png

Let’s see this in action on a simple page. DuckDuckGo is a search engine. We see that it has a box for entering a search. The entry box is an input element with id ‘search_form_input_homepage’, sitting inside a form with id ‘search_form_homepage’. I can get Selenium to it by using the XPath //input[@id="search_form_input_homepage"]

url2 = "https://duckduckgo.com"

driver.get(url2)

elem = driver.find_element(By.XPATH,'//input[@id="search_form_input_homepage"]' )

I can also get to the icon that runs the search with the XPath //input[@id="search_button_homepage"]
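To see how the pieces of such an XPath fit together without opening a browser, here is a small sketch using Python’s built-in xml.etree.ElementTree, which understands a limited subset of XPath. The snippet is made up (only the id values come from the text above), and ElementTree needs well-formed markup, so this is for practice rather than for real pages.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; only the id values come from the text above
snippet = """
<html>
  <body>
    <form id="search_form_homepage">
      <input id="search_form_input_homepage" type="text" name="q" />
      <input id="search_button_homepage" type="submit" value="S" />
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(snippet)

# ".//input" means any input element below the root;
# [@id='...'] keeps only the one whose id attribute matches
box = root.find(".//input[@id='search_form_input_homepage']")
button = root.find(".//input[@id='search_button_homepage']")

print(box.attrib)
print(button.get("type"))
```

The leading dot is an ElementTree requirement for searching relative to the root; in Selenium the expression is written starting with // as shown above.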

Other Selenium Commands#

We used the click() command above. I think it is really cool to see the driver interacting with the web page! Let me show you a few others that are nice.

send_keys will allow you to send information to the element you have selected! You can use this to send text or even hit certain keys on your keyboard. The one we’ll need is Keys.ENTER, which hits the carriage return (yes, I am old enough that I still call it that!). Let me search for myself.

from selenium.webdriver.common.keys import Keys #this did take another part of selenium not yet loaded!

elem.send_keys('Nicholas Jacob')
elem.send_keys(Keys.ENTER)

So where did this get me?

driver.current_url
'https://duckduckgo.com/?q=Nicholas+Jacob&t=h_&ia=web'
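The URL itself already tells part of the story. A quick sketch with the standard library’s urllib.parse pulls the query string apart and shows that the search text ended up in the q parameter. The other parameters are left uninterpreted here.

```python
from urllib.parse import urlparse, parse_qs

# The address returned by driver.current_url above
current_url = "https://duckduckgo.com/?q=Nicholas+Jacob&t=h_&ia=web"

parts = urlparse(current_url)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.netloc)
print(params["q"][0])  # a "+" in a query string decodes to a space
```

In the notebook you could pass driver.current_url straight in instead of the copied string.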

What is here?

soup = BeautifulSoup(driver.page_source, 'html.parser')

for i in soup.find_all('a', class_="result__check"):
  print(i['href'])
https://www.kinopoisk.ru/name/2801548/
https://www.imdb.com/name/nm4966502/
https://www.facebook.com/nicholas.jacob1
https://ru.kinorium.com/name/2727104/
https://en.wikipedia.org/wiki/Stoneman_Douglas_High_School_shooting
https://www.film.ru/person/nicholas-jacob
https://www.tumblr.com/tagged/nicholas-jacob
https://www.youtube.com/watch?v=AwJj9D9neUo
https://www.wikitree.com/wiki/Jacob-2
https://www.listal.com/nicholas-jacob

You can also navigate the page with

  1. refresh(), back(), forward()

  2. scrolling, by running JavaScript through execute_script()

If you need to!
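A minimal sketch of the scrolling step, assuming the usual approach of running one line of JavaScript through the driver’s execute_script method. The helper name scroll_to_bottom is my own. So that it runs without a browser, it is exercised here with a tiny fake driver that only records the script it was handed; with a real Selenium driver you would pass driver instead.

```python
def scroll_to_bottom(driver):
    """Ask the browser to scroll to the end of the current page."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


class FakeDriver:
    """Stand-in for a Selenium driver that just remembers the scripts it receives."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)


fake = FakeDriver()
scroll_to_bottom(fake)
print(fake.scripts)
```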

Waiting Is Hard#

Lots of times I find that my Python output is behind where I am browsing. If that is the case, use a wait to slow things down. An implicit wait tells the driver how long to keep looking for an element before giving up. The integer here is seconds.

driver.implicitly_wait(10)

Or you could do it by a condition!

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


WebDriverWait(driver,1).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

I found the best condition, especially when I am navigating the web, was the staleness_of command.

from selenium.webdriver.support.expected_conditions import staleness_of

old_page = driver.find_element(By.XPATH,'//html')
driver.find_element(By.PARTIAL_LINK_TEXT,'some_text').click()

WebDriverWait(driver, 10).until(staleness_of(old_page))
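Conceptually, a conditional wait is a polling loop: check the condition, sleep briefly, and give up after a timeout. The sketch below shows that idea in plain Python. It is not Selenium’s implementation, and wait_until and the fake condition are names I made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Call condition() repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Fake condition that only succeeds on the third check, like a slow-loading element
checks = {"count": 0}


def element_ready():
    checks["count"] += 1
    return "ready" if checks["count"] >= 3 else None


outcome = wait_until(element_ready, timeout=2.0, poll=0.01)
print(outcome, "after", checks["count"], "checks")
```

The appeal over a fixed pause is that the loop returns as soon as the condition holds, so a fast page costs almost nothing and a slow one still gets its full timeout.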

Your Turn#

Use Python to browse to Google. Search for something of interest and go to the first link. Return the HTML from that first page.