{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Untitled73.ipynb", "provenance": [], "authorship_tag": "ABX9TyNVPMaATxgg++GisGG3KZbs", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Reading Data From the Web" ], "metadata": { "id": "doK0wCaB_THI" } }, { "cell_type": "markdown", "source": [ "Some data is easy to gather directly from the web! The [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/index.php) has lots of data cleaned up and ready for us to use. We can also upload CSV and other files to GitHub and access the 'raw' version to get to the data. We did that with the [iris](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/iris.csv) dataset earlier! How then can we get data from a table in a web page? We can of course [copy and paste](https://en.wikipedia.org/wiki/Copypasta), but if there are multiple tables or the table is of an odd shape, this sometimes just won't work! Instead we want to read that data directly from the web." ], "metadata": { "id": "ao_B4IAdovzG" } }, { "cell_type": "markdown", "source": [ "Reading data from the web is an important task for some data analysis projects. Web scraping is the gathering of that data. There are lots of fantastic packages to read and parse HTML. `requests` is going to gather the raw HTML for me. `BeautifulSoup` will help me parse it." ], "metadata": { "id": "u-lO-CaE_Zo0" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SgZe3pB0_Sd3" }, "outputs": [], "source": [ "import requests\n", "import pandas as pa\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "source": [ "Next I am going to look at a simple web page from Wikipedia. 
I have been a big fan of **The Simpsons** for many years. Let's look at the Wikipedia page for the show. [https://en.wikipedia.org/wiki/The_Simpsons](https://en.wikipedia.org/wiki/The_Simpsons)\n", "\n", "Let's gather that HTML!" ], "metadata": { "id": "vTiqcQ2jBCjq" } }, { "cell_type": "code", "source": [ "r = requests.get('https://en.wikipedia.org/wiki/The_Simpsons')\n", "html_contents = r.text\n", "html_soup = BeautifulSoup(html_contents, \"lxml\")\n", "#html_soup" ], "metadata": { "id": "MPoK5-3XCKco" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Basic Building Blocks" ], "metadata": { "id": "5aUDKftFHS2z" } }, { "cell_type": "markdown", "source": [ "I do not print the HTML because it is very long! Let's examine some aspects of the HTML that we have gathered." ], "metadata": { "id": "RSWpqzbPDYhR" } }, { "cell_type": "code", "source": [ "html_soup.title" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "13_wcrXlCRT9", "outputId": "ba58dece-f0bc-405c-ff37-208fe8848e30" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "The Simpsons - Wikipedia" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "markdown", "source": [ "I think the `title` is rather obvious. It is what shows in my tab!" ], "metadata": { "id": "G4zdeafhDtFg" } }, { "cell_type": "code", "source": [ "html_soup.a" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KMyx7iKLDkwt", "outputId": "351d61d5-4e56-4efb-acbd-5ea7e74571aa" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "The `a` is an anchor. Normally that is a hyperlink, but this one does not appear to be one!" 
], "metadata": { "id": "v_hgXKDWE6vd" } }, { "cell_type": "code", "source": [ "html_soup.p" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Pm5mS68TDoee", "outputId": "ebcb128d-9c59-4dae-e27f-f058454d9436" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "

\n", "

" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "source": [ "`p` stands for paragraph. This one happens to be empty. " ], "metadata": { "id": "cwF4ttq9FHhz" } }, { "cell_type": "code", "source": [ "html_soup.img" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eRJ_Ab9HFGcU", "outputId": "ff391dd2-5efc-4896-d0cc-7420cfee5a9c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "\"Featured" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "markdown", "source": [ "`img` is an image. Next are several classes of headers, six in total." ], "metadata": { "id": "dUcYVALuFpF2" } }, { "cell_type": "code", "source": [ "html_soup.h2" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MRww5VOEFoPj", "outputId": "effb6c1f-7d32-44e0-be62-3c57601c020e" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "

Contents

" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "The element we will use the most in this class is `table`." ], "metadata": { "id": "PeLWvFEXGV8W" } }, { "cell_type": "code", "source": [ "html_soup.table" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "q6Q1OuzsGUjP", "outputId": "a04ea5ee-ba76-4a07-b163-729247bad75a" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "
The Simpsons
\"The
Genre
\n", "\n", "
Created byMatt Groening
Based onThe Simpsons shorts
by Matt Groening
Developed by
\n", "\n", "
Voices of
Theme music composerDanny Elfman
Opening theme\"The Simpsons Theme\"
ComposersRichard Gibbs (1989–1990)
Alf Clausen (1990–2017)
Bleeding Fingers Music (2017–present)
Country of originUnited States
Original languageEnglish
No. of seasons33
No. of episodes717 (list of episodes)
Production
Executive producers
\n", "
List
\n", "\n", "
Running time21–24 minutes
Production companies
\n", "\n", "
Distributor20th Television
Release
Original networkFox
Picture formatNTSC (1989–2009)
HDTV 720p (2009–present)
Audio formatStereo (1989–1991)
Dolby Surround (1991–2009)
Dolby Digital (2009–present)
Original releaseDecember 17, 1989 (1989-12-17) –
present
Chronology
Preceded byThe Simpsons shorts from The Tracey Ullman Show
External links
Official website
" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "You can combine these commands!" ], "metadata": { "id": "fLcS5XT6Gj2O" } }, { "cell_type": "code", "source": [ "html_soup.table.a['href']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "8kVhhpigGdy8", "outputId": "f8794bc2-bfe7-477b-b89d-72222e1093c0" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'/wiki/File:The_Simpsons_yellow_logo.svg'" ] }, "metadata": {}, "execution_count": 16 } ] }, { "cell_type": "markdown", "source": [ "This looks like an image on top of the table. `href` is the link to the file that gives that image. You should go check out the web page and see if you can find it!\n", "\n", "Notice how I keep getting only the first of something? There are many more links and tables on the web page! Use `find_all` to get all of them." ], "metadata": { "id": "Z2kcaOC3G1KW" } }, { "cell_type": "code", "source": [ "html_soup.table.find_all('a')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tb2z0szvJx97", "outputId": "a56bccef-0712-44bd-c343-f23f58fb3e9c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[\"The,\n", " Animated sitcom,\n", " Satire,\n", " Matt Groening,\n", " The Simpsons shorts,\n", " James L. 
Brooks,\n", " Sam Simon,\n", " Dan Castellaneta,\n", " Julie Kavner,\n", " Nancy Cartwright,\n", " Yeardley Smith,\n", " Hank Azaria,\n", " Harry Shearer,\n", " Complete list,\n", " Danny Elfman,\n", " The Simpsons Theme,\n", " Richard Gibbs,\n", " Alf Clausen,\n", " Bleeding Fingers Music,\n", " list of episodes,\n", " Al Jean,\n", " Matt Selman,\n", " John Frink,\n", " Mike Reiss,\n", " David Mirkin,\n", " Bill Oakley,\n", " Josh Weinstein,\n", " Mike Scully,\n", " George Meyer,\n", " Carolyn Omine,\n", " Tim Long,\n", " Ian Maxtone-Graham,\n", " Gracie Films,\n", " 20th Television,\n", " [a],\n", " 20th Television Animation,\n", " Fox,\n", " NTSC,\n", " HDTV,\n", " 720p,\n", " Dolby Surround,\n", " Dolby Digital,\n", " The Simpsons shorts,\n", " The Tracey Ullman Show,\n", " Official website]" ] }, "metadata": {}, "execution_count": 49 } ] }, { "cell_type": "markdown", "source": [ "This gives all the links from this table, including those for the show's talent. You can access an individual link by indexing into the list:" ], "metadata": { "id": "oRda-UOPKECG" } }, { "cell_type": "code", "source": [ "html_soup.table.find_all('a')[1]['href']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "RnnzT33SKX12", "outputId": "c9d86c7c-24ac-495c-dc84-b04384e4d09d" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'/wiki/Animated_sitcom'" ] }, "metadata": {}, "execution_count": 53 } ] }, { "cell_type": "markdown", "source": [ "If you wanted to do some crawling along the web you might do something like:" ], "metadata": { "id": "Q6sPpzHlKW0n" } }, { "cell_type": "code", "source": [ "links = html_soup.table.find_all('a')\n", "listOfURLS = []\n", "\n", "# Naively prepend the Wikipedia domain to every href in the table\n", "for link in links:\n", "    listOfURLS.append('https://en.wikipedia.org' + link['href'])\n", "\n", "listOfURLS" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": 
"gADVAMDiK4iN", "outputId": "34ec2c6d-d0e4-4b4c-a382-52f0d6f84f2e" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['https://en.wikipedia.org/wiki/File:The_Simpsons_yellow_logo.svg',\n", " 'https://en.wikipedia.org/wiki/Animated_sitcom',\n", " 'https://en.wikipedia.org/wiki/Satire',\n", " 'https://en.wikipedia.org/wiki/Matt_Groening',\n", " 'https://en.wikipedia.org/wiki/The_Simpsons_shorts',\n", " 'https://en.wikipedia.org/wiki/James_L._Brooks',\n", " 'https://en.wikipedia.org/wiki/Sam_Simon',\n", " 'https://en.wikipedia.org/wiki/Dan_Castellaneta',\n", " 'https://en.wikipedia.org/wiki/Julie_Kavner',\n", " 'https://en.wikipedia.org/wiki/Nancy_Cartwright',\n", " 'https://en.wikipedia.org/wiki/Yeardley_Smith',\n", " 'https://en.wikipedia.org/wiki/Hank_Azaria',\n", " 'https://en.wikipedia.org/wiki/Harry_Shearer',\n", " 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_cast_members',\n", " 'https://en.wikipedia.org/wiki/Danny_Elfman',\n", " 'https://en.wikipedia.org/wiki/The_Simpsons_Theme',\n", " 'https://en.wikipedia.org/wiki/Richard_Gibbs',\n", " 'https://en.wikipedia.org/wiki/Alf_Clausen',\n", " 'https://en.wikipedia.org/wiki/Bleeding_Fingers_Music',\n", " 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes',\n", " 'https://en.wikipedia.org/wiki/Al_Jean',\n", " 'https://en.wikipedia.org/wiki/Matt_Selman',\n", " 'https://en.wikipedia.org/wiki/John_Frink',\n", " 'https://en.wikipedia.org/wiki/Mike_Reiss',\n", " 'https://en.wikipedia.org/wiki/David_Mirkin',\n", " 'https://en.wikipedia.org/wiki/Bill_Oakley',\n", " 'https://en.wikipedia.org/wiki/Josh_Weinstein',\n", " 'https://en.wikipedia.org/wiki/Mike_Scully',\n", " 'https://en.wikipedia.org/wiki/George_Meyer',\n", " 'https://en.wikipedia.org/wiki/Carolyn_Omine',\n", " 'https://en.wikipedia.org/wiki/Tim_Long',\n", " 'https://en.wikipedia.org/wiki/Ian_Maxtone-Graham',\n", " 'https://en.wikipedia.org/wiki/Gracie_Films',\n", " 
'https://en.wikipedia.org/wiki/20th_Television',\n", " 'https://en.wikipedia.org#cite_note-1',\n", " 'https://en.wikipedia.org/wiki/20th_Television_Animation',\n", " 'https://en.wikipedia.org/wiki/Fox_Broadcasting_Company',\n", " 'https://en.wikipedia.org/wiki/NTSC',\n", " 'https://en.wikipedia.org/wiki/HDTV',\n", " 'https://en.wikipedia.org/wiki/720p',\n", " 'https://en.wikipedia.org/wiki/Dolby_Surround',\n", " 'https://en.wikipedia.org/wiki/Dolby_Digital',\n", " 'https://en.wikipedia.org/wiki/The_Simpsons_shorts',\n", " 'https://en.wikipedia.org/wiki/The_Tracey_Ullman_Show',\n", " 'https://en.wikipedia.orghttps://www.fox.com/the-simpsons/']" ] }, "metadata": {}, "execution_count": 54 } ] }, { "cell_type": "markdown", "source": [ "Not all of these worked: the footnote anchor (`#cite_note-1`) and the external Fox link are not paths under `/wiki/`, so simply prepending the Wikipedia domain produces invalid URLs. Still, you should get the general idea! We could visit each of the valid pages just like we did above!" ], "metadata": { "id": "3TkJJQQJLd3u" } }, { "cell_type": "markdown", "source": [ "## Developer Tools" ], "metadata": { "id": "FUrLnRixILVJ" } }, { "cell_type": "markdown", "source": [ "Your favorite web browser will have developer tools! These allow you to examine the raw HTML while also highlighting the corresponding rendered output in your browser. This is very useful for web scraping and for figuring out how a website has been constructed! I accessed the developer tools with the F12 key, but it may vary for you!\n", "\n", "Here is a screenshot of me highlighting the first table.\n", "\n", "![simpsonsDevTools](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Content/devtools.png)" ], "metadata": { "id": "gLCMQs1qy-E7" } }, { "cell_type": "markdown", "source": [ "The developer tools are at the bottom, and I have selected the first table, the same one we scraped above. The HTML is well organized in the developer tools, but the browser may also have called a server and pulled in external data after the page loaded. So be aware that what you see here and what `requests.get` returns may not be the same." 
], "metadata": { "id": "Gvh1mL5VJUKv" } }, { "cell_type": "markdown", "source": [ "## Your Turn" ], "metadata": { "id": "Fc3r6g-LL8Ar" } }, { "cell_type": "markdown", "source": [ "Navigate to the Wikipedia page for your favorite television show or sports club.\n", "\n", "1. Display the title for the page.\n", "2. Within an interesting table, retrieve all links and store them in a list." ], "metadata": { "id": "eOPCzF-KL91V" } } ] }