Web Scraping

In this article we will see how you can scrape any website and import its content into your Python program. From there you can easily dump that data into a file, a database, or a CSV. For web scraping we will use the BeautifulSoup module of Python.


What is Web Scraping?

Whenever we request a page, say www.google.com, the web server returns a raw HTML file, which the browser then renders as a web page. In this tutorial we will fetch that HTML in our Python program and extract some of its data without rendering it in a browser. This is called web scraping.

 

Complete code: GitHub


Overview of steps:

  1. Setting up the environment
  2. Get the HTML
  3. Parse the HTML
  4. HTML tree traversal
  5. Let's go

 

Related Articles: CRUD website using node & mongoose, Customize Github landing page

 

We will install the requests (fetches content from any website), html5lib (parses the HTML), and bs4 (provides functions to search and traverse the parsed data) Python libraries.

 

In order to get the HTML as a string we will use the requests module, then parse it (using BeautifulSoup) into a tree-like structure that we can traverse.

 

Now open your terminal and install all the required modules by typing the commands below:

  1. pip install requests
  2. pip install bs4
  3. pip install html5lib

 

Create main.py, import the modules in the file, and store the URL of the website you want to scrape in a variable. Next we will get the HTML.


import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'  # replace with the URL of the website you want to scrape
r = requests.get(url)
htmlContent = r.content
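Optionally, you can verify that the request succeeded before parsing. raise_for_status() is part of the standard requests API:

r.raise_for_status()   # raises requests.HTTPError if the server answered with a 4xx/5xx status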

 

Then we will parse the HTML.

soup = BeautifulSoup(htmlContent, 'html.parser')  # 'html5lib' (installed above) also works as the parser

 

Now if you print htmlContent you will see the website's source code in the console window. You can then parse the HTML content and store the result in a variable, as above; printing soup.prettify() shows the same source code in an indented manner.
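For example, with the two variables defined above (prettify() is a standard BeautifulSoup method that re-serializes the parsed tree with indentation):

print(htmlContent)        # the raw bytes exactly as the server returned them
print(soup.prettify())    # the same document, one tag per line and indented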


Other Articles: Data Structures, LeetCode


Some commands with their functions which we have used (a consolidated sketch follows the list):

  1. print(type(title)) : Prints the type of title (here title = soup.title). In our case it's bs4.element.Tag.
  2. print(type(title.string)) : Prints bs4.element.NavigableString.
  3. print(type(soup)) : Prints bs4.BeautifulSoup.
  4. paras = soup.find_all('p') : Gets all the paragraphs from the page. Similarly you can pass 'a' (for anchor tags) or 'img' (for images).
  5. print(soup.find('p')) : Gets the first paragraph tag.
  6. print(soup.find('p')['class']) : Gets the classes of an element in the HTML page.
  7. print(soup.find_all("div", class_="rating-number")) : Finds all divs with the class rating-number.
  8. print(soup.find('div').get_text()) : Gets the text inside the first div.
  9. print(soup.get_text()) : Gets all the text of the source code.
  10. anchors = soup.find_all('a') : Finds all anchor tags on the page.
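Putting a few of these together, here is a minimal sketch. It assumes the soup object built above; since find() returns None when nothing matches, the paragraph lookup is guarded:

title = soup.title
print(type(title))           # <class 'bs4.element.Tag'>
print(type(title.string))    # <class 'bs4.element.NavigableString'>
print(type(soup))            # <class 'bs4.BeautifulSoup'>

paras = soup.find_all('p')             # every <p> tag on the page
first_para = soup.find('p')            # only the first <p> tag
if first_para is not None:
    print(first_para.get('class'))     # list of classes, or None if the tag has no class

anchors = soup.find_all('a')           # every <a> tag on the page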

Next, by running a loop you can get all the links present on the page.

for i in anchors:
    print(i.get('href'))  # prints the href of every anchor on the page

Now we will modify the above statement: we will skip the pound signs (#), and in order to keep all the links unique we will collect them in a set. Important note: make sure that your linkText and url variables contain the same base URL, since relative hrefs must be prefixed with it. A sketch of this follows.
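Here is a minimal sketch of that modification. The names all_links and linkText are illustrative, and prefixing with url assumes the page uses site-relative hrefs:

all_links = set()
for link in anchors:
    href = link.get('href')
    if href is None or href == '#':    # skip anchors without a target and bare pound signs
        continue
    linkText = url + href              # build the full link from the same base url
    all_links.add(linkText)            # the set keeps every link unique

for linkText in all_links:
    print(linkText)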
 

Other commands:

  • anyID = soup.find(id='anyID')
    print(anyID)                              # Prints the first element with the id 'anyID'.

 


If you have any doubts or suggestions, let me know in the comments section.
