Web Scraping Using Python For Beginners

This document assumes you have already installed Python 3, and you have used both pip and venv. If not, refer to these instructions.

Sweigart briefly covers scraping in chapter 12 of Automate the Boring Stuff with Python (second edition).

This chapter and the two that follow provide additional context and examples for beginners.

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Setup for BeautifulSoup¶

BeautifulSoup is a scraping library for Python. We want to run all our scraping projects in a virtual environment, so we will set that up first. (Students have already installed Python 3.)

Create a directory and change into it¶

The first step is to create a new folder (directory) for all your scraping projects. Mine is:
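
(The author’s actual folder isn’t shown; a hypothetical example:)

    Documents/scraping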

Do not use any spaces in your folder names. If you must use punctuation, do not use anything other than an underscore _. It’s best if you use only lowercase letters.

Change into that directory. For me, the command would be:
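
(Again hypothetical, matching the folder above:)

    cd Documents/scraping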

Create a new virtualenv in that directory and activate it¶

Create a new virtual environment there (this is done only once).

MacOS:
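
    python3 -m venv env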

Windows:
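
    python -m venv env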

Activate the virtual environment:

MacOS:
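
    source env/bin/activate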

Windows:
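
    env\Scripts\activate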

Important: You should now see (env) at the far left side of your prompt. This indicates that the virtual environment is active. For example (MacOS):
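
    (env) MacBook-Pro:scraping user$

(The machine and user names in that prompt are made up; yours will show your own names.)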

When you are finished working in a virtual environment, you should deactivate it. The command is the same in MacOS or Windows (DO NOT DO THIS NOW):
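
    deactivate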

You’ll know it worked because (env) will no longer be at the far left side of your prompt.

Install the BeautifulSoup library¶

In MacOS or Windows, at the command prompt, type:
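
    pip install beautifulsoup4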

This is how you install any Python library that exists in the Python Package Index. Pretty handy. pip is a tool for installing Python packages, which is what you just did.

Note

You have installed BeautifulSoup in the Python virtual environment that is currently active. When that virtual environment is not active, BeautifulSoup will not be available to you. This is ideal, because you will create different virtual environments for different Python projects, and you won’t need to worry about updated libraries in the future breaking your (past) code.

Test BeautifulSoup¶

Start Python. Because you are already in a Python 3 virtual environment, Mac users need only type python (NOT python3). Windows users also type python as usual.

You should now be at the >>> prompt — the Python interactive shell prompt.

In MacOS or Windows, type (or copy/paste) one line at a time:
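
(The page scraped in the original isn’t shown; example.com stands in for it below.)

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    page = urlopen('https://example.com/')
    soup = BeautifulSoup(page, 'html.parser')
    print(soup.h1)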

  1. You imported two Python modules, urlopen and BeautifulSoup (the first two lines).

  2. You used urlopen to copy the entire contents of the URL given into a new Python variable, page (line 3).

  3. You used the BeautifulSoup function to process the value of that variable (the plain-text contents of the file at that URL) through a built-in HTML parser called html.parser.

  4. The result: All the HTML from the file is now in a BeautifulSoup object with the new Python variable name soup. (It is just a variable name.)

  5. Last line: Using the syntax of the BeautifulSoup library, you printed the first h1 element (including its tags) from that parsed value.

If it works, you’ll see:
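
    <h1>Example Domain</h1>

(That is the output for the stand-in URL; the original page would show its own first h1.)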

Check out the page on the web to see what you scraped.

Attention

If you got an error about SSL, quit Python (quit() or Command-D) and COPY/PASTE this at the command prompt (MacOS only):
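
    /Applications/Python\ 3.8/Install\ Certificates.command

(This is the commonly used fix for the python.org installer; adjust 3.8 to match your installed Python version.)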

Then return to the Python prompt and retry the five lines above.

The command soup.h1 would work the same way for any HTML tag (if it exists in the file). Instead of printing it, you might stash it in a variable:
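
    heading = soup.h1  # "heading" is an arbitrary variable name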

Then, to see the text in the element without the tags:
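
    print(heading.text)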

Understanding BeautifulSoup¶

BeautifulSoup is a Python library that enables us to extract information from web pages and even entire websites.

We use BeautifulSoup commands to create a well-structured data object (more about objects below) from which we can extract, for example, everything with an <li> tag, or everything with class='book-title'.

After extracting the desired information, we can use other Python commands (and libraries) to write the data into a database, CSV file, or other usable format — and then we can search it, sort it, etc.

What is the BeautifulSoup object?¶

It’s important to understand that many of the BeautifulSoup commands work on an object, which is not the same as a simple string.

Many programming languages include objects as a data type. Python does, JavaScript does, etc. An object is an even more powerful and complex data type than an array (JavaScript) or a list (Python) and can contain many other data types in a structured format.

When you extract information from an object with a BeautifulSoup command, sometimes you get a single Tag object, and sometimes you get a Python list (similar to an array in JavaScript) of Tag objects. The way you treat that extracted information will be different depending on whether it is one item or a list (usually, but not always, containing more than one item).

That last paragraph is REALLY IMPORTANT, so read it again. For example, you cannot call .text on a list. You’ll see an error if you try it.

How BeautifulSoup handles the object¶

In the previous code, when this line ran:
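
    page = urlopen('https://example.com/')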

… you copied the entire contents of a file into a new Python variable named page. The contents were stored as an HTTPResponse object. We can read the contents of that object like this:
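
    print(page.read())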

… but that’s not going to be very usable, or useful — especially for a file with a lot more content in it.

When you transform that HTTPResponse object into a BeautifulSoup object — with the following line — you create a well-structured object from which you can extract any HTML element and the text and/or attributes within any HTML element.
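
    soup = BeautifulSoup(page, 'html.parser')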

Some basic BeautifulSoup commands¶

Let’s look at a few examples of what BeautifulSoup can do.

Finding elements that have a particular class¶

Deciding the best way to extract what you want from a large HTML file requires you to dig around in the source, using Developer Tools, before you write the Python/BeautifulSoup commands. In many cases, you’ll see that everything you want has the same CSS class on it. After creating a BeautifulSoup object (here, as before, it is soup), this line will create a Python list containing all the <td> elements that have the class city.
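
    city_list = soup.find_all('td', class_='city')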

Attention

The word class is a reserved word in Python. Using class (alone) in the code above would give you a syntax error. So when we search by CSS class with BeautifulSoup, we use the keyword argument class_ — note the added underscore. Other HTML attributes DO NOT need the underscore.

Maybe there were 10 cities in <td> tags in that HTML file. Maybe there were 10,000. No matter how many, they are now in a list (assigned to the variable city_list), and you can search them, print them, write them out to a database or a JSON file — whatever you like. Often you will want to perform the same actions on each item in the list, so you will use a normal Python for-loop:
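
    for city in city_list:
        print(city.get_text())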

.get_text() is a handy BeautifulSoup method that will extract the text — and only the text — from the Tag object. If instead you wrote just print(city), you’d get the complete <td> — and any other tags inside that as well.

Note

The BeautifulSoup methods .get_text() and .getText() are the same. The BeautifulSoup property .text is a shortcut to .get_text() and is acceptable unless you need to pass arguments to .get_text().

Finding all vs. finding one¶

The BeautifulSoup find_all() method you just saw always produces a list. (Note: findAll() will also work.) If you know there will be only one item of the kind you want in a file, you should use the find() method instead.

For example, maybe you are scraping the address and phone number from every page in a large website. In this case, there is only one phone number on the page, and it is enclosed in a pair of tags with the attribute id='call'. One line of your code gets the phone number from the current page:
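
    phone_number = soup.find(id='call')
    print(phone_number.get_text())  # test: prints the text without the tags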

You don’t need to loop through that result — the variable phone_number will contain only one Tag object, for whichever HTML tag had that ID. To test what the text alone will look like, just print it using get_text() to strip out the tags.

Notice that you’re often using soup. Review above if you’ve forgotten where that came from. (You may use another variable name instead, but soup is the usual choice.)

Finding the contents of a particular attribute¶

One last example from the example page we have been using.

Say you’ve made a BeautifulSoup object from a page that has dozens of images on it. You want to capture the path to each image file on that page (perhaps so that you can download all the images). I would do this in two steps:

  1. First, you make a Python list containing all the img elements that exist in the soup object.

  2. Second, you loop through that list and print the contents of the src attribute from each img tag in the list.
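
(A minimal sketch of those two steps, using a hypothetical list name image_list:)

    image_list = soup.find_all('img')
    for image in image_list:
        print(image.get('src'))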

It is possible to condense that code and do the task in two lines, or even one line, but for beginners it is clearer to get the list of elements and name it, then use the named list and get what is wanted from it.

Important

We do not need get_text() in this case, because the contents of the src attribute (or any HTML attribute) are nothing but text. There are never tags inside the src attribute. So think about exactly what you’re trying to get and what it looks like inside the HTML of the page.

You can see the code from above all in one file.

There’s a lot more to learn about BeautifulSoup, and we’ll be working with various examples. You can always read the docs. Most of what we do with BeautifulSoup, though, involves these tasks:

  • Find everything with a particular class

  • Find everything with a particular attribute

  • Find everything with a particular HTML tag

  • Find one thing on a page, often using its id attribute

  • Find one thing that’s inside another thing

A BeautifulSoup scraping example¶

To demonstrate the process of thinking through a small scraping project, I made a Jupyter Notebook that shows how I broke down the problem step by step, and tested one thing at a time, to reach the solution I wanted. Open the notebook here on GitHub to follow along and see all the steps. (If that link doesn’t work, try this instead.)

The code in the final cell of the notebook produces this 51-line CSV file by scraping 10 separate web pages.

To run the notebook, you will need to have installed the Requests module and also Jupyter Notebook.
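
Both can be installed with pip:

    pip install requests jupyter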

See these instructions for information about how to run Jupyter Notebooks.

Attention

After this introduction, you should NOT use from urllib.request import urlopen or the urlopen() function. Instead, you will use the Requests library as demonstrated in the notebook linked above.
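
A minimal sketch of the Requests version, reusing the stand-in URL from earlier:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://example.com/')
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup.h1)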

Next steps¶

In the next chapter, we’ll look at how to handle common web scraping projects with BeautifulSoup and Requests.

In this digital world, data is now everything. Data can help you gain useful insights and notice things that are too valuable to let go of. To effectively utilize the data, we need a good way to first collect the massive amount of data. Web Scraping helps us achieve this. There are many ways to scrape a website like APIs, online services, and writing your code.

In this web scraping tutorial, I will show you how to scrape almost any kind of website with Python. This tutorial is a little different: we will explore a library called SelectorLib, which makes it very easy to scrape websites. It is aimed at beginners, so even if you know only the basics of Python you are good to go. If you want to learn how to scrape data from a website using Python 3, this tutorial is for you. In this tutorial, you will learn how to:

  • Create a web scraping template using Chrome extension
  • Use Requests and SelectorLib for scraping the data from the created template
  • Build a Python script that scrapes Top Rated Movies from IMDB and stores them in a JSON file

What is Web Scraping?

Web Scraping is the process of extracting information from websites using the power of automation. The data on websites is usually unstructured, so we scrape it and store it in a structured form, such as a database. For example, let's say we want to collect the data of top-rated movies for research purposes. If we did it manually, it would take many hours if not days. Instead, we create a web scraper that automatically scrapes all the data of top-rated movies from the website and stores it in a database. This takes only a matter of seconds, as the computer does all the heavy lifting for us.

Why is Python best for Web Scraping?

You can create a web scraper in any programming language, like JavaScript, Java, or C++. But here are the reasons why Python is preferred for web scraping.

  • Easy to code - Python is one of the most beginner-friendly languages to learn because it is so much easier to write than languages like Java and C++.
  • Huge collection of libraries - Python has one of the largest collections of libraries and frameworks. Interested in web development? You have frameworks like Django and Flask. Into game development? You have Pygame.
  • Good community support - Constantly getting an error that you just aren't able to fix? Worry not: there are many Python communities on different platforms, and many people will be willing to help you fix it.

How does web scraping work exactly?

Have you ever seen the source code of a webpage? When you right-click on any web page, you will see the option 'View Page Source'. Click on it and you can see the whole HTML of the current webpage. Fetching that HTML is exactly what our code does.

When we run our code, it will make an HTTP request to the specified URL. Then it will store the whole source code of the web page in a variable. Then we query that variable for the data we want. In our case, we will have already created a template that fetches only the data we require from the source code. So it's much easier than the conventional methods of querying data.

Web Scraping Example: Scraping Top Rated Movies from IMDB

In this example, we will scrape top-rated movies from the IMDB top rated page. We aim to save the details of all movies along with their title, rating, and year. Here are some things that you should have installed on your system before diving into the tutorial.

  • Python 3.x with the Selectorlib and Requests libraries installed (see the install command after this list)
  • Google Chrome Browser
  • SelectorLib extension installed on Google Chrome Browser
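
Both libraries can be installed with pip:

    pip install selectorlib requests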

Once you have downloaded and installed everything, let us get right into the tutorial.

Step 1: Creating the web scraping template using Chrome extension

First, we will go to this link and right-click anywhere on the page. You will see an option called Inspect; click it.

After you click Inspect, click on the small arrow in the top right corner of the panel. You will see the option to open SelectorLib. Click on it and then create a new template. You can name it anything you like; I will name it 'Top Movies'.

Step 2: Extracting title, rating, and year of the movies

Now we will begin by adding a selector to our template which will contain the title, rating, and year of the movie. Web scraping relies heavily on using the right CSS selector for your type of data. If you want to learn more about CSS selectors, you can go to this link. When we inspect the source code of the page, we can see that all our data is in a table, so each movie's data must be in a <tr> tag. The class of the parent element of those <tr> tags is lister-list. So first click Add and name the selector 'movies'. Then set the selector to .lister-list tr. Finally, make sure that the 'multiple' option is checked and click Save.

Our main selector is now created. All that remains is to create children of this selector, which live inside each <tr> tag. So click 'Add a child to this selector', right next to the selector we just created.

On further inspection, we can observe that the title is in a <td> tag with the class titleColumn. So name the child selector 'title' and set its selector to td.titleColumn a (the a is the link tag that contains the title text). Then click Save. Next we inspect the rating. Like the title, it is in a <td> tag, with the class imdbRating. So again create a child selector, name it 'rating', and set its selector to td.imdbRating. Click Save again. Finally, all that remains is a selector for the year. Inspecting it reveals that it is in a <span> tag with the class secondaryInfo. So name the child selector 'year' and set its selector to span.secondaryInfo.

Now the last part of creating the template is to export the YAML file from the extension. In the top right corner of the extension, there is a button for exporting the file. You have to click on it and download the YAML file.
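
The exported template (hypothetically saved as top_movies.yml) would look roughly like this:

    movies:
        css: '.lister-list tr'
        multiple: true
        type: Text
        children:
            title:
                css: 'td.titleColumn a'
                type: Text
            rating:
                css: 'td.imdbRating'
                type: Text
            year:
                css: 'span.secondaryInfo'
                type: Text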

Step 3: Coding our Web Scraper using Python

Now comes the easy part: coding our web scraper. Create a scraper.py file and place the YAML file that you downloaded earlier in the same directory as the Python file you just created.

We will now import all our required modules.
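
    from selectorlib import Extractor
    import requests
    import json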

The Extractor module will be used to load our YAML file and convert our unstructured data to structured data. The Requests module will be used to make a GET request to the specified URL. Finally, the json module will be used to save all our data in a .json file.
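
A sketch of the scraping step, assuming the template was exported as top_movies.yml (the URL is IMDB's Top Rated Movies chart):

    extractor = Extractor.from_yaml_file('top_movies.yml')
    r = requests.get('https://www.imdb.com/chart/top/')
    data = extractor.extract(r.text)
    print(data)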

The extractor variable loads the template file into our code. Then we make a GET request to the IMDB page that contains the list of Top Rated Movies. Finally, we use the extract function to pull out the data we need and print it to the Python console. You can run the script from your terminal with the command python scraper.py; it prints the extracted movies as a Python dictionary.

Now all that remains is to save all our data in a JSON file. So we will use the following code to save our data.
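
A sketch, writing to a hypothetical top_movies.json:

    with open('top_movies.json', 'w') as f:
        json.dump(data, f, indent=4)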

Complete Source Code

Here is the complete code in Python which we used for our web scraping tutorial.
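
(Reconstructed from the steps above, with the same assumed file names.)

    from selectorlib import Extractor
    import requests
    import json

    # load the YAML template exported from the Chrome extension
    extractor = Extractor.from_yaml_file('top_movies.yml')

    # fetch the IMDB Top Rated Movies page
    r = requests.get('https://www.imdb.com/chart/top/')

    # apply the template to the page's HTML
    data = extractor.extract(r.text)

    # save the structured data to a JSON file
    with open('top_movies.json', 'w') as f:
        json.dump(data, f, indent=4)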

Conclusion

SelectorLib, combined with its Chrome extension, is a very handy Python library for quickly scraping websites. It is much easier to scrape data with this library than with approaches where you have to use regular expressions or complicated syntax. So go ahead and scrape as many websites as you want with your newly learned skill! I hope you found this web scraping with Python tutorial informative and learned something new.

Thank you for reading !!

Tags:

#python
#beginners



