Reading Data From the Web

Some data is easy to gather directly from the web! The UCI Machine Learning Repository has lots of data cleaned up and ready for us to use. We can also upload csv and other files to GitHub and access the ‘raw’ version to get to the data. We did that with the iris dataset earlier! How then can we get data from a table in a web page? We can of course copy and paste, but if there are multiple tables or a table has an odd shape, this sometimes just won’t work! Instead we want to read that data directly from the web.

Reading data from the web is an important task for some data analysis projects. Web scraping is the gathering of that data. There are lots of fantastic packages to read and parse html. requests is going to gather the raw html for me. BeautifulSoup will help me parse the code.

import requests
import pandas as pa
from bs4 import BeautifulSoup

Next I am going to look at a simple web page from Wikipedia. I have been a big fan of The Simpsons for many years. Let’s look at the Wikipedia page for them. https://en.wikipedia.org/wiki/The_Simpsons

Let’s gather that html!

r = requests.get('https://en.wikipedia.org/wiki/The_Simpsons')  # download the page
html_contents = r.text  # the raw html as one long string
html_soup = BeautifulSoup(html_contents, "lxml")  # parse the html with the lxml parser
#html_soup
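Here is a minimal sketch of how I might wrap the download step so that a failed request is noticed right away. The function name fetch_soup, the User-Agent string, and the 10 second timeout are my own assumptions for illustration, and I use Python's built-in "html.parser" here only so the sketch does not depend on lxml being installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a parsed BeautifulSoup object.

    fetch_soup is a hypothetical helper, not part of requests or bs4.
    """
    # Hypothetical User-Agent string; some sites reject requests without one.
    headers = {"User-Agent": "class-scraping-demo/0.1 (educational use)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example call (needs a network connection):
# html_soup = fetch_soup("https://en.wikipedia.org/wiki/The_Simpsons")
```

Calling raise_for_status means a blocked or missing page stops the notebook with an error rather than quietly handing back html that has none of the tables we expect.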

Basic Building Blocks

I do not print the html because it is very long! Let’s examine some aspects of the html that we have gathered.

html_soup.title
<title>The Simpsons - Wikipedia</title>

I think the title is rather obvious. It is what shows in my tab!

html_soup.a
<a id="top"></a>

The a tag is an anchor. Normally that is a hyperlink, but this one has only an id and no href, so it marks a spot on the page rather than linking anywhere!

html_soup.p
<p class="mw-empty-elt">
</p>

p stands for paragraph. This one happens to be empty.

html_soup.img
<img alt="Featured article" data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>

img is an image. Next are the heading tags, h1 through h6, six levels in total.

html_soup.h2
<h2 id="mw-toc-heading">Contents</h2>
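To see these building blocks without downloading anything, here is a small sketch on a made-up page. The html in sample_html and all the variable names are my own inventions for illustration. It shows that soup.tagname returns the first matching tag, and that get_text, attrs, and square brackets pull out the pieces.

```python
from bs4 import BeautifulSoup

# A tiny made-up page (not the real Wikipedia html).
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a id="top"></a>
    <h2 id="contents">Contents</h2>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="/wiki/Example">Example link</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.get_text()      # text inside <title>
first_anchor_attrs = soup.a.attrs       # attributes of the FIRST <a> only
first_anchor_href = soup.a.get("href")  # None, because the first <a> has no href
first_paragraph = soup.p.get_text()     # text of the FIRST <p> only
heading_id = soup.h2["id"]              # square brackets read one attribute

print(page_title)          # Sample Page
print(first_anchor_attrs)  # {'id': 'top'}
print(first_paragraph)     # First paragraph.
print(heading_id)          # contents
```

Using .get("href") instead of ["href"] returns None when the attribute is missing, which is handy for anchors like the first one here.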

The tag we will use the most for this class is table.

html_soup.table
<table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em; font-size: 125%;"><i>The Simpsons</i></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:The_Simpsons_yellow_logo.svg"><img alt="The Simpsons yellow logo.svg" data-file-height="206" data-file-width="464" decoding="async" height="111" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/250px-The_Simpsons_yellow_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/375px-The_Simpsons_yellow_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/500px-The_Simpsons_yellow_logo.svg.png 2x" width="250"/></a></td></tr><tr><th class="infobox-label" scope="row">Genre</th><td class="infobox-data category"><div class="plainlist">
<ul><li><a href="/wiki/Animated_sitcom" title="Animated sitcom">Animated sitcom</a></li>
<li><a href="/wiki/Satire" title="Satire">Satire</a></li></ul>
</div></td></tr><tr><th class="infobox-label" scope="row">Created by</th><td class="infobox-data"><a href="/wiki/Matt_Groening" title="Matt Groening">Matt Groening</a></td></tr><tr><th class="infobox-label" scope="row">Based on</th><td class="infobox-data"><a href="/wiki/The_Simpsons_shorts" title="The Simpsons shorts"><i>The Simpsons</i> shorts</a><br/>by Matt Groening</td></tr><tr><th class="infobox-label" scope="row">Developed by</th><td class="infobox-data"><div class="plainlist">
<ul><li><a href="/wiki/James_L._Brooks" title="James L. Brooks">James L. Brooks</a></li>
<li>Matt Groening</li>
<li><a href="/wiki/Sam_Simon" title="Sam Simon">Sam Simon</a></li></ul>
</div></td></tr><tr><th class="infobox-label" scope="row">Voices of</th><td class="infobox-data attendee"><div class="plainlist">
<ul><li><a href="/wiki/Dan_Castellaneta" title="Dan Castellaneta">Dan Castellaneta</a></li>
<li><a href="/wiki/Julie_Kavner" title="Julie Kavner">Julie Kavner</a></li>
<li><a href="/wiki/Nancy_Cartwright" title="Nancy Cartwright">Nancy Cartwright</a></li>
<li><a href="/wiki/Yeardley_Smith" title="Yeardley Smith">Yeardley Smith</a></li>
<li><a href="/wiki/Hank_Azaria" title="Hank Azaria">Hank Azaria</a></li>
<li><a href="/wiki/Harry_Shearer" title="Harry Shearer">Harry Shearer</a></li>
<li>(<a href="/wiki/List_of_The_Simpsons_cast_members" title="List of The Simpsons cast members">Complete list</a>)</li></ul>
</div></td></tr><tr><th class="infobox-label" scope="row">Theme music composer</th><td class="infobox-data"><a href="/wiki/Danny_Elfman" title="Danny Elfman">Danny Elfman</a></td></tr><tr><th class="infobox-label" scope="row">Opening theme</th><td class="infobox-data">"<a href="/wiki/The_Simpsons_Theme" title="The Simpsons Theme"><i>The Simpsons</i> Theme</a>"</td></tr><tr><th class="infobox-label" scope="row">Composers</th><td class="infobox-data"><a href="/wiki/Richard_Gibbs" title="Richard Gibbs">Richard Gibbs</a> (1989–1990)<br/><a href="/wiki/Alf_Clausen" title="Alf Clausen">Alf Clausen</a> (1990–2017)<br/><a href="/wiki/Bleeding_Fingers_Music" title="Bleeding Fingers Music">Bleeding Fingers Music</a> (2017–present)</td></tr><tr><th class="infobox-label" scope="row">Country of origin</th><td class="infobox-data">United States</td></tr><tr><th class="infobox-label" scope="row">Original language</th><td class="infobox-data">English</td></tr><tr><th class="infobox-label" scope="row"><abbr title="Number">No.</abbr> of seasons</th><td class="infobox-data">33</td></tr><tr><th class="infobox-label" scope="row"><abbr title="Number">No.</abbr> of episodes</th><td class="infobox-data">717 <span class="nowrap">(<a href="/wiki/List_of_The_Simpsons_episodes" title="List of The Simpsons episodes">list of episodes</a>)</span></td></tr><tr><th class="infobox-header summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em;">Production</th></tr><tr><th class="infobox-label" scope="row">Executive producers</th><td class="infobox-data"><div class="mw-collapsible mw-collapsed" style="text-align: center; font-size: 95%;">
<div style="line-height: 1.6em; font-weight: bold; font-size: 100%; text-align: left;"><div>List</div></div>
<ul class="mw-collapsible-content" style="font-size: 105%; margin-top: 0; margin-bottom: 0; line-height: inherit; text-align: left;"><li style="line-height: inherit; margin: 0">  James L. Brooks (entire run)
 </li><li style="line-height: inherit; margin: 0">  Matt Groening (entire run)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Al_Jean" title="Al Jean">Al Jean</a> (1992–1993; 1995–present)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Matt_Selman" title="Matt Selman">Matt Selman</a> (2005–present)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/John_Frink" title="John Frink">John Frink</a> (2009–present)
 </li><li style="line-height: inherit; margin: 0">  Sam Simon (1989–1993)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Mike_Reiss" title="Mike Reiss">Mike Reiss</a> (1992–1993; 1995–1998)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/David_Mirkin" title="David Mirkin">David Mirkin</a> (1993–1995)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Bill_Oakley" title="Bill Oakley">Bill Oakley</a> (1995–1997)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Josh_Weinstein" title="Josh Weinstein">Josh Weinstein</a> (1995–1997)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Mike_Scully" title="Mike Scully">Mike Scully</a> (1997–2001)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/George_Meyer" title="George Meyer">George Meyer</a> (1999–2001)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Carolyn_Omine" title="Carolyn Omine">Carolyn Omine</a> (2005–2006)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Tim_Long" title="Tim Long">Tim Long</a> (2005–2008)
 </li><li style="line-height: inherit; margin: 0"> <a href="/wiki/Ian_Maxtone-Graham" title="Ian Maxtone-Graham">Ian Maxtone-Graham</a> (2005–2012)
</li></ul>
</div></td></tr><tr><th class="infobox-label" scope="row">Running time</th><td class="infobox-data">21–24 minutes</td></tr><tr><th class="infobox-label" scope="row">Production companies</th><td class="infobox-data"><div class="plainlist">
<ul><li><a href="/wiki/Gracie_Films" title="Gracie Films">Gracie Films</a></li>
<li><a href="/wiki/20th_Television" title="20th Television">20th Television</a><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[a]</a></sup> (seasons 1–32)</li>
<li><a href="/wiki/20th_Television_Animation" title="20th Television Animation">20th Television Animation</a> (season 33–present)</li></ul>
</div></td></tr><tr><th class="infobox-label" scope="row">Distributor</th><td class="infobox-data">20th Television</td></tr><tr><th class="infobox-header summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em;">Release</th></tr><tr><th class="infobox-label" scope="row">Original network</th><td class="infobox-data"><a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox</a></td></tr><tr><th class="infobox-label" scope="row">Picture format</th><td class="infobox-data"><a href="/wiki/NTSC" title="NTSC">NTSC</a> (1989–2009)<br/><a class="mw-redirect" href="/wiki/HDTV" title="HDTV">HDTV</a> <a href="/wiki/720p" title="720p">720p</a> (2009–present)</td></tr><tr><th class="infobox-label" scope="row">Audio format</th><td class="infobox-data">Stereo (1989–1991)<br/><a class="mw-redirect" href="/wiki/Dolby_Surround" title="Dolby Surround">Dolby Surround</a> (1991–2009)<br/><a href="/wiki/Dolby_Digital" title="Dolby Digital">Dolby Digital</a> (2009–present)</td></tr><tr><th class="infobox-label" scope="row">Original release</th><td class="infobox-data">December 17, 1989<span style="display:none"> (<span class="bday dtstart published updated">1989-12-17</span>)</span> –<br/>present</td></tr><tr><th class="infobox-header summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em;">Chronology</th></tr><tr><th class="infobox-label" scope="row">Preceded by</th><td class="infobox-data"><a href="/wiki/The_Simpsons_shorts" title="The Simpsons shorts"><i>The Simpsons</i> shorts</a> from <i><a href="/wiki/The_Tracey_Ullman_Show" title="The Tracey Ullman Show">The Tracey Ullman Show</a></i></td></tr><tr><th class="infobox-header summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em;">External links</th></tr><tr><td class="infobox-full-data url" colspan="2"><a class="external text" href="https://www.fox.com/the-simpsons/" rel="nofollow">Official website</a></td></tr></tbody></table>

You can combine these commands!

html_soup.table.a['href']
'/wiki/File:The_Simpsons_yellow_logo.svg'

This looks like an image on top of the table. href is the link to the file that gives that image. You should go check out the webpage and see if you can find it!
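Chaining works nicely until one link in the chain is missing. Here is a small sketch on a made-up table (sample_html and the variable names are assumptions of mine) showing what comes back when a tag or an attribute is not there.

```python
from bs4 import BeautifulSoup

# A made-up table with one linked image, loosely shaped like the infobox.
sample_html = """
<table>
  <tr><td><a href="/wiki/File:Logo.svg"><img alt="Logo" src="logo.png"/></a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each step returns the first matching tag inside the previous one.
first_href = soup.table.a["href"]
image_alt = soup.table.a.img["alt"]

# A tag that is not present comes back as None, so chaining past it would fail.
missing_tag = soup.table.h2

# .get returns None for a missing attribute, where ["title"] would raise KeyError.
missing_attr = soup.table.a.get("title")

print(first_href)    # /wiki/File:Logo.svg
print(image_alt)     # Logo
print(missing_tag)   # None
print(missing_attr)  # None
```

If a chain like html_soup.table.a['href'] ever fails with an error about NoneType, checking each step on its own like this is a quick way to find the missing piece.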

Notice how I keep getting only the first of something? There are many more links and tables on the webpage! Use find_all to get all of them.

html_soup.table.find_all('a')
[<a class="image" href="/wiki/File:The_Simpsons_yellow_logo.svg"><img alt="The Simpsons yellow logo.svg" data-file-height="206" data-file-width="464" decoding="async" height="111" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/250px-The_Simpsons_yellow_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/375px-The_Simpsons_yellow_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/500px-The_Simpsons_yellow_logo.svg.png 2x" width="250"/></a>,
 <a href="/wiki/Animated_sitcom" title="Animated sitcom">Animated sitcom</a>,
 <a href="/wiki/Satire" title="Satire">Satire</a>,
 <a href="/wiki/Matt_Groening" title="Matt Groening">Matt Groening</a>,
 <a href="/wiki/The_Simpsons_shorts" title="The Simpsons shorts"><i>The Simpsons</i> shorts</a>,
 <a href="/wiki/James_L._Brooks" title="James L. Brooks">James L. Brooks</a>,
 <a href="/wiki/Sam_Simon" title="Sam Simon">Sam Simon</a>,
 <a href="/wiki/Dan_Castellaneta" title="Dan Castellaneta">Dan Castellaneta</a>,
 <a href="/wiki/Julie_Kavner" title="Julie Kavner">Julie Kavner</a>,
 <a href="/wiki/Nancy_Cartwright" title="Nancy Cartwright">Nancy Cartwright</a>,
 <a href="/wiki/Yeardley_Smith" title="Yeardley Smith">Yeardley Smith</a>,
 <a href="/wiki/Hank_Azaria" title="Hank Azaria">Hank Azaria</a>,
 <a href="/wiki/Harry_Shearer" title="Harry Shearer">Harry Shearer</a>,
 <a href="/wiki/List_of_The_Simpsons_cast_members" title="List of The Simpsons cast members">Complete list</a>,
 <a href="/wiki/Danny_Elfman" title="Danny Elfman">Danny Elfman</a>,
 <a href="/wiki/The_Simpsons_Theme" title="The Simpsons Theme"><i>The Simpsons</i> Theme</a>,
 <a href="/wiki/Richard_Gibbs" title="Richard Gibbs">Richard Gibbs</a>,
 <a href="/wiki/Alf_Clausen" title="Alf Clausen">Alf Clausen</a>,
 <a href="/wiki/Bleeding_Fingers_Music" title="Bleeding Fingers Music">Bleeding Fingers Music</a>,
 <a href="/wiki/List_of_The_Simpsons_episodes" title="List of The Simpsons episodes">list of episodes</a>,
 <a href="/wiki/Al_Jean" title="Al Jean">Al Jean</a>,
 <a href="/wiki/Matt_Selman" title="Matt Selman">Matt Selman</a>,
 <a href="/wiki/John_Frink" title="John Frink">John Frink</a>,
 <a href="/wiki/Mike_Reiss" title="Mike Reiss">Mike Reiss</a>,
 <a href="/wiki/David_Mirkin" title="David Mirkin">David Mirkin</a>,
 <a href="/wiki/Bill_Oakley" title="Bill Oakley">Bill Oakley</a>,
 <a href="/wiki/Josh_Weinstein" title="Josh Weinstein">Josh Weinstein</a>,
 <a href="/wiki/Mike_Scully" title="Mike Scully">Mike Scully</a>,
 <a href="/wiki/George_Meyer" title="George Meyer">George Meyer</a>,
 <a href="/wiki/Carolyn_Omine" title="Carolyn Omine">Carolyn Omine</a>,
 <a href="/wiki/Tim_Long" title="Tim Long">Tim Long</a>,
 <a href="/wiki/Ian_Maxtone-Graham" title="Ian Maxtone-Graham">Ian Maxtone-Graham</a>,
 <a href="/wiki/Gracie_Films" title="Gracie Films">Gracie Films</a>,
 <a href="/wiki/20th_Television" title="20th Television">20th Television</a>,
 <a href="#cite_note-1">[a]</a>,
 <a href="/wiki/20th_Television_Animation" title="20th Television Animation">20th Television Animation</a>,
 <a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox</a>,
 <a href="/wiki/NTSC" title="NTSC">NTSC</a>,
 <a class="mw-redirect" href="/wiki/HDTV" title="HDTV">HDTV</a>,
 <a href="/wiki/720p" title="720p">720p</a>,
 <a class="mw-redirect" href="/wiki/Dolby_Surround" title="Dolby Surround">Dolby Surround</a>,
 <a href="/wiki/Dolby_Digital" title="Dolby Digital">Dolby Digital</a>,
 <a href="/wiki/The_Simpsons_shorts" title="The Simpsons shorts"><i>The Simpsons</i> shorts</a>,
 <a href="/wiki/The_Tracey_Ullman_Show" title="The Tracey Ullman Show">The Tracey Ullman Show</a>,
 <a class="external text" href="https://www.fox.com/the-simpsons/" rel="nofollow">Official website</a>]

This gives all the links from this table, which includes the talent for the show. You can access each link by indexing into the list:

html_soup.table.find_all('a')[1]['href']
'/wiki/Animated_sitcom'
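Some anchors, like the very first one on the page, have no href at all, and asking for ['href'] on them raises an error. Here is a sketch, on a made-up table of my own (sample_html is not the real page), of how find_all can keep only the anchors that actually carry a link.

```python
from bs4 import BeautifulSoup

# A made-up table: two real links and one anchor with no href.
sample_html = """
<table class="infobox">
  <tr><th>Genre</th><td><a href="/wiki/Animated_sitcom">Animated sitcom</a></td></tr>
  <tr><th>Creator</th><td><a href="/wiki/Matt_Groening">Matt Groening</a></td></tr>
  <tr><th>Note</th><td><a id="marker"></a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Every <a> inside the first table, with or without a link.
all_anchors = soup.table.find_all("a")

# href=True keeps only the anchors that have an href attribute.
linked_anchors = soup.table.find_all("a", href=True)

hrefs = [a["href"] for a in linked_anchors]
texts = [a.get_text() for a in linked_anchors]

print(len(all_anchors))  # 3
print(hrefs)             # ['/wiki/Animated_sitcom', '/wiki/Matt_Groening']
print(texts)             # ['Animated sitcom', 'Matt Groening']
```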

If you wanted to do some crawling along the web you might do something like:

links = html_soup.table.find_all('a')
listOfURLS = []

for link in links:
  listOfURLS.append('https://en.wikipedia.org' + link['href'])

listOfURLS
['https://en.wikipedia.org/wiki/File:The_Simpsons_yellow_logo.svg',
 'https://en.wikipedia.org/wiki/Animated_sitcom',
 'https://en.wikipedia.org/wiki/Satire',
 'https://en.wikipedia.org/wiki/Matt_Groening',
 'https://en.wikipedia.org/wiki/The_Simpsons_shorts',
 'https://en.wikipedia.org/wiki/James_L._Brooks',
 'https://en.wikipedia.org/wiki/Sam_Simon',
 'https://en.wikipedia.org/wiki/Dan_Castellaneta',
 'https://en.wikipedia.org/wiki/Julie_Kavner',
 'https://en.wikipedia.org/wiki/Nancy_Cartwright',
 'https://en.wikipedia.org/wiki/Yeardley_Smith',
 'https://en.wikipedia.org/wiki/Hank_Azaria',
 'https://en.wikipedia.org/wiki/Harry_Shearer',
 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_cast_members',
 'https://en.wikipedia.org/wiki/Danny_Elfman',
 'https://en.wikipedia.org/wiki/The_Simpsons_Theme',
 'https://en.wikipedia.org/wiki/Richard_Gibbs',
 'https://en.wikipedia.org/wiki/Alf_Clausen',
 'https://en.wikipedia.org/wiki/Bleeding_Fingers_Music',
 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes',
 'https://en.wikipedia.org/wiki/Al_Jean',
 'https://en.wikipedia.org/wiki/Matt_Selman',
 'https://en.wikipedia.org/wiki/John_Frink',
 'https://en.wikipedia.org/wiki/Mike_Reiss',
 'https://en.wikipedia.org/wiki/David_Mirkin',
 'https://en.wikipedia.org/wiki/Bill_Oakley',
 'https://en.wikipedia.org/wiki/Josh_Weinstein',
 'https://en.wikipedia.org/wiki/Mike_Scully',
 'https://en.wikipedia.org/wiki/George_Meyer',
 'https://en.wikipedia.org/wiki/Carolyn_Omine',
 'https://en.wikipedia.org/wiki/Tim_Long',
 'https://en.wikipedia.org/wiki/Ian_Maxtone-Graham',
 'https://en.wikipedia.org/wiki/Gracie_Films',
 'https://en.wikipedia.org/wiki/20th_Television',
 'https://en.wikipedia.org#cite_note-1',
 'https://en.wikipedia.org/wiki/20th_Television_Animation',
 'https://en.wikipedia.org/wiki/Fox_Broadcasting_Company',
 'https://en.wikipedia.org/wiki/NTSC',
 'https://en.wikipedia.org/wiki/HDTV',
 'https://en.wikipedia.org/wiki/720p',
 'https://en.wikipedia.org/wiki/Dolby_Surround',
 'https://en.wikipedia.org/wiki/Dolby_Digital',
 'https://en.wikipedia.org/wiki/The_Simpsons_shorts',
 'https://en.wikipedia.org/wiki/The_Tracey_Ullman_Show',
 'https://en.wikipedia.orghttps://www.fox.com/the-simpsons/']

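Not all of these worked: the footnote link (#cite_note-1) and the external Fox link were not site-relative paths, so adding the Wikipedia prefix broke them. You should still get the general idea! We could visit each of these sites just like we did above!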
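One way to avoid the broken URLs is to let the standard library resolve each href against the page it came from instead of gluing strings together. This is a sketch using urljoin from urllib.parse. The three hrefs are copied from the list above, and page_url and resolved are names I made up for the example.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
page_url = "https://en.wikipedia.org/wiki/The_Simpsons"

# Three kinds of href seen in the table: site-relative, fragment, and absolute.
hrefs = [
    "/wiki/Animated_sitcom",
    "#cite_note-1",
    "https://www.fox.com/the-simpsons/",
]

# urljoin resolves each href the way a browser would.
resolved = [urljoin(page_url, href) for href in hrefs]

for url in resolved:
    print(url)
# https://en.wikipedia.org/wiki/Animated_sitcom
# https://en.wikipedia.org/wiki/The_Simpsons#cite_note-1
# https://www.fox.com/the-simpsons/
```

The fragment link now points back into the Simpsons page itself, and the Fox link is left alone because it was already a complete URL.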

Developer Tools

Your favorite web browser will have developer tools! These will allow you to examine the raw html code while also highlighting the rendered output in your browser. This is very useful for web scraping and figuring out how a website has been constructed! I accessed the developer tools with the F12 key but it may vary for you!

Here is a screenshot of me highlighting the first table.

[Screenshot: simpsonsDevTools]

The developer tools are at the bottom and I have grabbed the first table that we have also scraped. The html is well organized in the developer tools, but the browser might also have called a server and gotten external data from somewhere. So be aware that what you see here and what you get from requests.get may not be the same.
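Once the developer tools show you a table's class, you can ask BeautifulSoup for that table by name rather than hoping it is the first one. Below is a sketch on made-up html. The class "infobox vevent" is the one from the table output earlier, while the "sidebar" and "wikitable" tables and the variable names are my own assumptions. Checking for None matters because the html from requests.get may not contain everything the browser shows.

```python
from bs4 import BeautifulSoup

# Made-up html with two tables; only the second has the class we want.
sample_html = """
<table class="sidebar"><tr><td>Navigation</td></tr></table>
<table class="infobox vevent"><tr><th>Genre</th><td>Animated sitcom</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches a tag if ANY one of its classes equals the given name.
infobox = soup.find("table", class_="infobox")

# find returns None when nothing matches, so test before using the result.
missing = soup.find("table", class_="wikitable")

if infobox is not None:
    infobox_classes = infobox["class"]      # class comes back as a list
    infobox_header = infobox.th.get_text()  # first header cell in that table

print(infobox_classes)  # ['infobox', 'vevent']
print(infobox_header)   # Genre
print(missing)          # None
```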

Your Turn

Navigate to the Wikipedia page for your favorite television show or sports club.

  1. Display the title for the page

  2. Within an interesting table, retrieve all links and store them in a list