Web Scraping

In this article we will see how you can scrape any website and import its content into your Python program. From there you can easily dump that data into a file, a database, or a CSV. For web scraping we will use the BeautifulSoup module of Python.


What is Web Scraping?

Whenever we request a page, say www.google.com, the web server returns a raw HTML file, which the browser then renders as a web page. In this tutorial we will fetch that HTML in our Python program and extract some of its data without rendering it in a browser. This is called web scraping.

 

Complete code: GitHub


Overview of steps:

  1. Setting up the environment
  2. Get the HTML
  3. Parse the HTML
  4. HTML tree traversal
  5. Let's go

 

Related Articles: CRUD website using node & mongoose, Customize Github landing page

 

We will install the requests (fetches content from any website), html5lib (parses the HTML), and bs4 (provides functions to search and traverse the parsed data) Python libraries.

 

In order to get the HTML as a string we will use the requests module, then parse it (using BeautifulSoup) into a tree-like structure that we can traverse.

 

Now open your terminal and install all the required modules by typing the commands below:

  1. pip install requests
  2. pip install bs4
  3. pip install html5lib

 

Create main.py, import the modules in the file, and store the URL of the website you want to scrape in a variable. Next we will get the HTML.


import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'  # replace with the URL of the website you want to scrape
r = requests.get(url)
htmlContent = r.content
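Optionally, you can verify that the request succeeded before parsing. raise_for_status() is part of the standard requests API:

r.raise_for_status()   # raises requests.HTTPError if the server answered with a 4xx/5xx status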

 

Then we will parse the HTML.

soup = BeautifulSoup(htmlContent, 'html.parser')  # 'html5lib' (installed above) also works as the parser

 

Now if you print htmlContent you will see the website's source code in the console window. You can then parse the HTML content and store the result in a variable, as above; printing soup.prettify() shows the same source code in an indented manner.
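For example, with the two variables defined above (prettify() is a standard BeautifulSoup method that re-serializes the parsed tree with indentation):

print(htmlContent)        # the raw bytes exactly as the server returned them
print(soup.prettify())    # the same document, one tag per line and indented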


Other Articles: Data Structures, LeetCode


Some commands with their functions which we have used (a consolidated sketch follows the list):

  1. print(type(title)) : Prints the type of title (here title = soup.title). In our case it's bs4.element.Tag.
  2. print(type(title.string)) : Prints bs4.element.NavigableString.
  3. print(type(soup)) : Prints bs4.BeautifulSoup.
  4. paras = soup.find_all('p') : Gets all the paragraphs from the page. Similarly you can pass 'a' (for anchor tags) or 'img' (for images).
  5. print(soup.find('p')) : Gets the first paragraph tag.
  6. print(soup.find('p')['class']) : Gets the classes of an element in the HTML page.
  7. print(soup.find_all("div", class_="rating-number")) : Finds all divs with the class rating-number.
  8. print(soup.find('div').get_text()) : Gets the text inside the first div.
  9. print(soup.get_text()) : Gets all the text of the source code.
  10. anchors = soup.find_all('a') : Finds all anchor tags on the page.
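Putting a few of these together, here is a minimal sketch. It assumes the soup object built above; since find() returns None when nothing matches, the paragraph lookup is guarded:

title = soup.title
print(type(title))           # <class 'bs4.element.Tag'>
print(type(title.string))    # <class 'bs4.element.NavigableString'>
print(type(soup))            # <class 'bs4.BeautifulSoup'>

paras = soup.find_all('p')             # every <p> tag on the page
first_para = soup.find('p')            # only the first <p> tag
if first_para is not None:
    print(first_para.get('class'))     # list of classes, or None if the tag has no class

anchors = soup.find_all('a')           # every <a> tag on the page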

Next, by running a loop you can get all the links present on the page.

for i in anchors:
    print(i.get('href'))  # prints the href of every anchor on the page

Now we will modify the above statement: we will skip the pound signs (#), and in order to keep all the links unique we will collect them in a set. Important note: make sure that your linkText and url variables contain the same base URL, since relative hrefs must be prefixed with it. A sketch of this follows.
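Here is a minimal sketch of that modification. The names all_links and linkText are illustrative, and prefixing with url assumes the page uses site-relative hrefs:

all_links = set()
for link in anchors:
    href = link.get('href')
    if href is None or href == '#':    # skip anchors without a target and bare pound signs
        continue
    linkText = url + href              # build the full link from the same base url
    all_links.add(linkText)            # the set keeps every link unique

for linkText in all_links:
    print(linkText)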
 

Other commands:

  • anyID = soup.find(id='anyID')
    print(anyID)                              # Prints the first element with the id 'anyID'.

 


If you have any doubts or suggestions, let me know in the comments section.
