{"id":1059,"date":"2022-07-16T01:34:00","date_gmt":"2022-07-16T01:34:00","guid":{"rendered":"https:\/\/blog.ngocha.biz\/?p=1059"},"modified":"2022-07-16T01:34:00","modified_gmt":"2022-07-16T01:34:00","slug":"python-web-scrapping","status":"publish","type":"post","link":"https:\/\/blog.ngocha.biz\/?p=1059","title":{"rendered":"Python Web Scraping Tutorial: Step by Step Guide for Beginners"},"content":{"rendered":"<p>In this <strong>Python Web Scraping Tutorial<\/strong>, you will learn web scraping techniques using Python libraries.<\/p>\n<p>One of the most important skills in the <strong>field of Data Science<\/strong> is getting the right data for the problem you want to solve. Data scientists don&#8217;t always have a prepared database to work with; often they have to pull data from the right sources. <strong>APIs<\/strong> and <strong>Web Scraping<\/strong> are the two common ways to do this.<\/p>\n<ol>\n<li><strong><a href=\"https:\/\/en.wikipedia.org\/wiki\/Application_programming_interface?ref=devopscube.com\" rel=\"noreferrer noopener\">API <\/a>(Application Programming Interface)<\/strong>: An API is a set of methods and tools that lets you query and retrieve data dynamically. <strong>Reddit, Spotify, Twitter, Facebook<\/strong>, and many other companies provide free APIs that enable developers to access the information they store on their servers; others charge for access to their APIs.<\/li>\n<li><strong>Web Scraping<\/strong>: A lot of data isn&#8217;t accessible through data sets or APIs but exists on the internet as <strong>web pages<\/strong>. Through web scraping, one can access that data without waiting for the provider to create an API.<\/li>\n<\/ol>\n<h2 id=\"what-is-web-scraping\">What is Web Scraping?<\/h2>\n<p>Web scraping is a technique for fetching data from websites. 
While surfing the web, you will find that many websites don\u2019t offer a way to save their data for private use.<\/p>\n<p>One option is to copy and paste the data manually, which is both tedious and time-consuming.<\/p>\n<p>Web scraping automates the extraction of data from websites. It is done with the help of software known as web scrapers.<\/p>\n<p>Web scrapers automatically load and extract data from websites based on user requirements. They can be custom built for one site or configured to work with any website.<\/p>\n<p>One classic<strong> real-world use case for web scraping<\/strong> is price comparison apps and websites. The data these sites present is scraped from multiple e-commerce websites.<\/p>\n<h2 id=\"why-python-for-web-scrapping\">Why Python for Web Scraping?<\/h2>\n<p>There are many web scraping tools available, and several languages have libraries that support web scraping.<\/p>\n<p>Among these languages, <strong><a href=\"https:\/\/www.python.org\/?ref=devopscube.com\" rel=\"noreferrer noopener\">Python<\/a><\/strong> is considered one of the best for web scraping because of features such as <strong>a rich library ecosystem, ease of use, and dynamic typing<\/strong>.<\/p>\n<h2 id=\"python-web-scrapping-libraries\">Python Web Scraping Libraries<\/h2>\n<p>Here are some of the most commonly used Python 3 web scraping libraries.<\/p>\n<ol>\n<li>Beautiful Soup<\/li>\n<li>Selenium<\/li>\n<li>Requests<\/li>\n<li>lxml<\/li>\n<li>MechanicalSoup<\/li>\n<li>urllib (urllib2 in Python 2)<\/li>\n<li>Scrapy<\/li>\n<\/ol>\n<p>Now let&#8217;s discuss the steps involved in web scraping by implementing <strong>Web Scraping in Python with Beautiful Soup<\/strong>.<\/p>\n<h2 id=\"how-to-build-web-scraper-using-python\">How to Build a Web Scraper Using Python?<\/h2>\n<p>In this section, we will look at a <strong>step by step guide<\/strong> on how to <strong>build a basic web scraper using the Python<\/strong> Beautiful 
Soup module.<\/p>\n<ol>\n<li>First, to get the HTML source code of the web page you want to access, send an HTTP request to its URL. The server responds by returning the HTML content of the page. For this task, we will use a third-party HTTP library called <strong>requests<\/strong>.<\/li>\n<li>After fetching the HTML content, the next task is <strong>parsing the data<\/strong>. Because most HTML data is nested, it cannot be extracted reliably through simple string processing. We need a parser that can build a nested\/tree structure from the HTML, e.g. <strong>html5lib, lxml,<\/strong> etc.<\/li>\n<li>The last task is navigating and searching the parse tree that the parser created. For this task, we will use another third-party Python library called <strong>Beautiful Soup<\/strong>. It is a very popular Python library for pulling data out of HTML and XML files.<\/li>\n<\/ol>\n<h3 id=\"step-1-import-required-third-party-libraries\"><strong>Step 1:<\/strong> Install required third-party libraries<\/h3>\n<p>Before starting with the code, install the required third-party libraries with pip.<\/p>\n<pre><code>pip install requests\npip install lxml\npip install beautifulsoup4<\/code><\/pre>\n<h3 id=\"step-2-get-the-html-content-from-the-web-page\">Step 2: Get the HTML content from the web page<\/h3>\n<p>To get the HTML source code of the web page, use the requests library as shown in the code that follows. I am using <a href=\"https:\/\/devopscube.com\/project-management-software\/\" rel=\"noreferrer noopener\">this<\/a> web page as the example.<\/p>\n<pre><code>import requests\nsource = requests.get('https:\/\/devopscube.com\/project-management-software').text<\/code><\/pre>\n<h3 id=\"step-3-parsing-the-html-content\">Step 3: Parsing the HTML content<\/h3>\n<p>Pass the HTML to Beautiful Soup and specify which parser it should use. 
Here we use the <strong>lxml<\/strong> parser.<\/p>\n<pre><code>from bs4 import BeautifulSoup\nsoup = BeautifulSoup(source, 'lxml')<\/code><\/pre>\n<p>To print a readable, indented representation of the parse tree created from the raw HTML content, use this code.<\/p>\n<pre><code>print(soup.prettify())<\/code><\/pre>\n<h3 id=\"step-4-navigating-and-searching-the-parse-tree\">Step 4: Navigating and searching the parse tree<\/h3>\n<p>Now we would like to extract some useful data from the HTML content. The <strong>soup<\/strong> object contains all the data in a nested structure that can be extracted programmatically. In our example, we are scraping a web page that contains headlines and their corresponding websites.<\/p>\n<p>We can now start pulling out the information we want. Let&#8217;s start with the headline and its official website.<\/p>\n<p>To grab the headline and official website of the first post on <a href=\"https:\/\/devopscube.com\/project-management-software\/\" rel=\"noreferrer noopener\">this<\/a> page, let&#8217;s inspect the page and work out its structure.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/python-weg-scapping-min-1.jpg\" class=\"kg-image\" alt=\"python web scrapping inspect web page\" loading=\"lazy\" width=\"631\" height=\"220\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/python-weg-scapping-min-1.jpg 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/python-weg-scapping-min-1.jpg 631w\"><\/figure>\n<p>From the screenshot above, you can see that all the content, including the headline and the official website, sits inside the <strong>article<\/strong> tag. 
So let&#8217;s start by grabbing the first article, which contains all of this information.<\/p>\n<pre><code>article = soup.find('article')<\/code><\/pre>\n<p>Now let&#8217;s grab the <strong>headline<\/strong>. In the HTML source code, the headline sits in an &lt;h3&gt; tag inside a &lt;div&gt; tag. So the code for grabbing the headline is<\/p>\n<pre><code>headline = article.div.h3.text\nprint(headline)<\/code><\/pre>\n<p><strong>Output:<\/strong><\/p>\n<p>Backlog.com<\/p>\n<p>Next, let&#8217;s grab the <strong>website<\/strong>. In the HTML source code, there is a &lt;div&gt; tag with class = &#8220;entry-content&#8221;, and inside it an &lt;a&gt; tag whose link text is the official website. So the code for grabbing the website is<\/p>\n<pre><code>official_website = article.find('div', class_='entry-content').a.text\nprint(official_website)<\/code><\/pre>\n<p><strong>Output:<\/strong><\/p>\n<p>www.backlog.com<\/p>\n<p>The complete Python web scraping code is given below.<\/p>\n<pre><code># Python program to illustrate web scraping\n\nimport requests\nfrom bs4 import BeautifulSoup\n\n# Fetch the page and parse it with the lxml parser (lxml must be installed)\nsource = requests.get('https:\/\/devopscube.com\/project-management-software').text\nsoup = BeautifulSoup(source, 'lxml')\n\n# Grab the first article, then its headline and official website\narticle = soup.find('article')\nheadline = article.div.h3.text\nprint(headline)\nofficial_website = article.find('div', class_='entry-content').a.text\nprint(official_website)<\/code><\/pre>\n<p><strong>Output:<\/strong><\/p>\n<p>Backlog.com<br>www.backlog.com<\/p>\n<h2 id=\"realworld-python-web-scrapping-projects\">Real-World Python Web Scraping Projects<\/h2>\n<p>Here are some real-world project ideas you can try for web scraping with Python.<\/p>\n<ol>\n<li>Price monitoring in e-commerce websites<\/li>\n<li>News syndication from multiple news websites and blogs.<\/li>\n<li>Competitor content analysis<\/li>\n<li>Social media analysis for trending 
content.<\/li>\n<li>COVID-19 data tracker<\/li>\n<\/ol>\n<p>Also look at some of the <a href=\"https:\/\/github.com\/kjam\/python-web-scraping-tutorial?ref=devopscube.com\" rel=\"noreferrer noopener\">Python web scraping examples on GitHub<\/a>.<\/p>\n<blockquote><p><strong>Important<\/strong> <strong>Note<\/strong>: Scraping web pages without the website owner&#8217;s consent is not considered good practice. It may also cause a website to block your IP address permanently.<\/p><\/blockquote>\n<h2 id=\"python-web-scrapping-courses\">Python Web Scraping Courses<\/h2>\n<p>If you want to learn web scraping techniques in depth, you can try the following on-demand courses.<\/p>\n<ol>\n<li><a href=\"https:\/\/devopscube.com\/recommends\/datacamp-web-scrapping\/\" rel=\"noreferrer noopener\">Web Scraping in Python<\/a> [Datacamp &#8211; Check <a href=\"https:\/\/devopscube.com\/datacamp-discount\/\" rel=\"noreferrer noopener\">Datacamp discounts<\/a> for latest offers]<\/li>\n<li><a href=\"https:\/\/devopscube.com\/recommends\/python-web-scrapping-2\/\" rel=\"noreferrer noopener\">APIs and Web Scraping in Python<\/a> &#8211; [Check <a href=\"https:\/\/devopscube.com\/dataquest-coupon-offers\/\">DataQuest Coupons<\/a> for latest offers]<\/li>\n<li><a href=\"https:\/\/devopscube.com\/recommends\/python-web-scrapping-3\/\" rel=\"noreferrer noopener\">Predictive Data Analysis With Python<\/a><\/li>\n<li><a href=\"https:\/\/devopscube.com\/recommends\/udemy-python-web-scapping\/\" rel=\"noreferrer noopener\">Web scraping courses<\/a> [Udemy]<\/li>\n<li><a href=\"https:\/\/devopscube.com\/recommends\/python-web-scrapping\/\" rel=\"noreferrer noopener\">Using Python to Access Web Data<\/a> [Coursera]<\/li>\n<\/ol>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>So, in this <strong>Python web scraping tutorial<\/strong>, we learned how to create a web scraper. 
I hope you now have a basic idea of web scraping and understand this simple example.<\/p>\n<p>From here, you can try to scrape any other website of your choice.<\/p>\n<p>Also, if you are starting your coding journey, check out <a href=\"https:\/\/devopscube.com\/top-websites-to-learn-programming-online\/\">30+ top websites to learn coding online.<\/a><\/p>\n<hr>\n<p><strong>Source:<\/strong> <a href=\"https:\/\/devopscube.com\/python-web-scrapping\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python Web Scrapping Tutorial: Step by Step Guide for Beginners \u2014 DevOpsCube<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Source: https:\/\/devopscube.com\/python-web-scrapping\/<\/p>\n","protected":false},"author":0,"featured_media":1060,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1059","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/1059","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1059"}],"version-history":[{"count":0,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/1059\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/media\/1060"}],"wp:attachment":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1059"},{"taxonomy":"post_tag","embeddable":
true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}