ETL for Children Fiction Books Details from Open Trolley Online Bookstore

Fatin Haziqah
15 min readJan 22, 2021

WEB SCRAPING
Web scraping also known as data scraping refers to the process of extracting and processing information or data from a website. This information is collected and then exported into a spreadsheet or local file saved on the computer. In this project Python is used for web scraping because it can handle almost all processes related to data extraction smoothly.

ETL ?
ETL (Extract, Transform and Load) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the sources or in different context than the sources. The first phase of ETL is extract data. During this phase, the organization identifies the required data sources with the rows and columns to be extracted from those sources. During transform phase, the extracted data is convert from its previous form into the form it needs to be in so that it can be placed
into another database. Lastly, during the load phase, the cleaned and formated data then loaded into a storage system, such as a cloud data warehouse.

Open Trolley Online Bookstore
Open Trolley Online Bookstore is an online book purchase system that as a variety of book category such as children fiction, cooking, science, games and business. For this project, we choose children fiction category that has 8 attributes which is:-
> Title
> Author
> Type
> Description
> Price(RM)
> Discount Price(RM)
> Link
> Availability

Before we start, we need to import the required libraries which is BeautifulSoup. Other than that, we also used the requests module to allows us to send HTTP request using Python . The HTTP request returns a Response Object with all the response data.

!pip install beautifulsoup4
!pip install requests
from bs4 import BeautifulSoup
import requests
url = "https://opentrolley.com.my/Search.aspx?bestseller=children-fiction&page=1&pgsz=20&stockstatus=Available&sorttype=2"
response = requests.get(url)
response
data = response.text
data

The output :

\r\n\r\n<!DOCTYPE html>\r\n\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head id="ctl00_Head1"><meta charset="utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta name="viewport" content="width=device-width, initial-scale=1" /><meta name="author" /><title>\r\n\tChildren Fiction - OpenTrolley Bookstore Malaysia\r\n</title><link rel="icon" type="image/png" href="assets/img_favicon.png" /><link href="lib/w3.css" rel="stylesheet" /><link href="lib/unslider/css/unslider-dots.css" rel="stylesheet" /><link href="lib/unslider/css/unslider.css" rel="stylesheet" /><link href="css/ig.css" rel="stylesheet" /><link href="css/fonts.css" rel="stylesheet" /><link href="css/ot.css" rel="stylesheet" /><link href="css/ot-nav.css" rel="stylesheet" /><link href="css/ot-footer.css" rel="stylesheet" /><link href="css/submitting.css" rel="stylesheet" /><link href="css/benefits.css" rel="stylesheet" />\r\n    \r\n    <script src="../../js/jquery.js"></script>\r\n\r\n    \r\n    <scri

So lets start for our ETL!

1. EXTRACT

a) Title

soup = BeautifulSoup(data, 'html.parser')title_tag = []for div in soup.findAll("div", {"class":"book-title"}):
titles = div.findAll("a")
for title in titles:
title_tag.append(title.text)
for title in title_tag:
print(title)

The output:

Room on the Broom 
Hatchet
Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter, Book 1), Volume 1: The Illustrated Edition Everything Is Mama
Antiracist Baby Board Book
Dog Man: Grime and Punishment
The Deep End
Harry Potter: The Illustrated Collection
The Boxcar Children Mysteries Boxed Set #1-4
The Little Engine That Could: The Complete, Original Edition Strega Nona
The Spider Who Saved Christmas
Goodnight Moon
Spy School Revolution
The Very Hungry Caterpillar
The Silver Arrow Cat Kid Comic Club: From the Creator of Dog Man 365 Bedtime Stories and Rhymes The Enchanted Forest Chronicles: (boxed Set)
The Biggest Snowman Ever

b) Author

author_tag = []
divs = soup.find_all("div", {"class":"book-detail w3-col l7 m7 s8"})
for div in divs:
authors=div.find("div",{"book-author"}).text.replace(' ','').replace(u'\xa0', '').replace('\n\r','').replace('\n\n\n','').replace('\r','').replace('\n','')
author_tag.append(authors)
for author in author_tag:
s=author
print(s)

The output:

byDonaldson,JuliaScheffler,Axel 
byPaulsen,Gary
byRowling,J.K.Kay,Jim
byFallon,JimmyOrdóñez,Miguel
byKendi,IbramX.Lukashevsky,Ashley
byPilkey,DavPilkey,Dav
byKinney,Jeff
byKay,JimRowling,J.K.
byWarner,GertrudeChandler
byPiper,Watty
bydePaola,TomiedePaola,Tomie
byArroyo,Raymond
byBrown,MargaretWiseHurd,Clement
byGibbs,Stuart
byCarle,Eric
byGrossman,Lev
byPilkey,DavPilkey,Dav
byCottageDoorPressParragonBooks
byWrede,PatriciaC.
byKroll,StevenBassett,Jeni

c) Types

types_tag = []divs = soup.find_all("div", {"class":"book-detail w3-col l7 m7 s8"})for div in divs:
types = div.find("div", {"class":"book-type"}).text[1:-50]
types = types.strip()
types_tag.append(types)
for type in types_tag:
print(type)

The output:

Paperback 
Paperback
Hardcover
Board Books
Board Books
Hardcover
Hardcover
Boxed Set
Boxed Set
Hardcover
Paperback
Hardcover
Board Books
Hardcover
Board Books
Hardcover
Hardcover
Hardcover
Boxed Set
Paperback

d) Description

descs_tag = []divs = soup.find_all("div", {"class":"book-detail w3-col l7 m7 s8"})for div in divs:
descs = div.find("div", {"class":"book-desc"})
desc = descs.text if descs else "N/A"
desc = desc.strip()
descs_tag.append(desc)
for description in descs_tag:
print(description)

The output:

This fun family read-aloud is a story of quick wits, friendship, and inclusivity from the creators of The Gruffalo      The witch and her cat are happily flying through the sky on a broomstick when the wind picks up and blows away the witch's hat, then her bow, and then her wand  Luckily, three helpful animals find the missing items, and all they want in return is a ride on the broom. But is there room on the broom for so many friends? And when disaster strikes, will they be able to save the witch from a hungry dragon?  Julia Donaldson and Axel Scheffler are the creators of many beloved picture books including The Gruffalo ,  The Gruffalo's Child ,  The Snail an .. This award-winning contemporary classic is the survival story with which all others are compared--and a page-turning, heart-stopping adventure, recipient of the Newbery Honor. Hatchet  has also been nominated as one of America's best-loved novels by PBS's The Great American Read.    Thirteen-year-old Brian Robeson, haunted by his secret knowledge of his mother's infidelity, is traveling by single-engine plane to visit his father for the first time since the divorce. When the plane crashes, killing the pilot, the sole survivor is Brian. He is alone in the Canadian wilderness with nothing but his clothing, a tattered windbreaker, and the hatchet his mother had giv .. The beloved first book of the Harry Potter series, now fully illustrated by award-winning artist Jim Kay.    For the first time, J.K. Rowling's beloved Harry Potter books will be presented in lavishly illustrated full-color editions. Kate Greenaway-award-winning artist Jim Kay has created over 100 stunning illustrations, making this deluxe format a perfect gift as much for a child being introduced to the series, as for the dedicated fan.   Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.   All he k .. Everything is . . .   MAMA    Jimmy Fallon , one of the most popular entertainers in the world and NBC's Tonight Show  host, was on a mission with his first children's book to have every baby's first word be DADA. And it worked  A lot of babies' first words were DADA. However, everything after that was MAMA.  So take a lighthearted look at the world from your baby's point of view as different animals try to teach their children that there are other words in addition to MAMA for familiar objects and activities, now available in board book.  .. A #1 New York Times  Bestseller   br Featured in its own episode in the Netflix original show Bookmarks: Celebrating Black Voices   br Featured on Good Morning America, NPR's Morning Edition, CBS This Morning, and more    From the National Book Award-winning author of Stamped from the Beginning  and How to Be an Antiracist  comes a fresh new board book that empowers parents and children to uproot racism in our society and in ourselves.  Take your first steps with Antiracist Baby  Or rather, follow Antiracist Baby 's nine easy steps for building a more equitable world.  With bold art and thoughtful yet playful text, Antiracist Baby  introduces the youngest reader .. The mayor has had enough of Dog Man's shenanigans in the ninth book from worldwide bestselling author and artist Dav Pilkey.  br   br Dog Man's really done it this time  He hands over his badge and clears out his desk, but while he may be out of a job, he's not yet out of hope. With his friends at his side, can Dog Man dig himself out of this hole and paw his way back onto the force?  br Dav Pilkey's wildly popular Dog Man series appeals to readers of all ages and explores universally positive themes, including empathy, kindness, persistence, and the importance of doing good.  .. An instant #1 USA Today , Wall Street Journal , and New York Times  bestseller    In The Deep End , book 15 of the Diary of a Wimpy Kid series from #1 international bestselling author Jeff Kinney, Greg Heffley and his family hit the road for a cross-country camping trip, ​ready for the adventure of a lifetime.     But things take an unexpected turn, and they find themselves stranded at an RV park that's not exactly a summertime paradise. When the skies open up and the water starts to rise, the Heffleys wonder if they can save their vacation--or if they're already in too deep.   And don't miss Rowley Jefferson's Awesome Friendly Adventure , an all-new fanta .. The first three books in the Harry Potter series, gorgeously illustrated in full color by Jim Kay, now available in a collectible boxed set   br   br This beautifully produced boxed set is the perfect introduction to the Harry Potter series, and an impressive gift for new readers and lifelong fans alike. It contains the first three books in the series (Harry Potter and the Sorcerer's Stone , Harry Potter and the Chamber of Secrets , and Harry Potter and the Prisoner of Azkaban ) in large-scale editions, gorgeously illustrated in full color by award-winning artist Jim Kay. These editions are a pleasure to read, with generously sized pages, color on every page, an .. The paperback editions of The Boxcar Children Mysteries: #1, The Boxcar Children ; #2, Surprise Island ; #3, The Yellow House Mystery ; and #4, Mystery Ranch  are offered together in a cardboard case. Follow Henry, Jessie, Violet, and Benny Alden as they explore the world on their own, solve mysteries, and search for home.  .. bthinkcan,thinkcan,thinkcan... Discover the inspiring story of the Little Blue Engine as she makes her way over the mountain in this beloved classic    The kindness and determination of the Little Blue Engine have inspired millions of children around the world since the story was first published in 1930. Cherished by readers for ninety years, The Little Engine That Could  is a classic tale of the little engine that, despite her size, triumphantly pulls a train full of wonderful things to the children waiting on the other side of a mountain. .. When Strega Nona leaves him alone with her magic pasta pot, Big Anthony is determined to show the townspeople how it works in this classic Caldecott Honor book from Tomie dePaola.  Strega Nona--Grandma Witch--is the source for potions, cures, magic, and comfort in her Calabrian town. Her magical everfull pasta pot is especially intriguing to hungry Big Anthony. He is supposed to look after her house and tend her garden but one day, when she goes over the mountain to visit Strega Amelia, Big Anthony recites the magic verse over the pasta pot, with disastrous results.  In this retelling of an old tale, author-illustrator Tomie dePaola combines humor in the writing .. The Spider That Saved Christmas tells the tale of what happened to the Holy Family on their way to Egypt after receiving the message of the angel. When Joseph, Mary, and Jesus are in danger of being discovered and harmed by Herod's murderous soldiers, a cave-dwelling spider named Nephila risks her and her children's safety to help her hallowed visitors.  Majestically illustrated by artist Randy Gallegos, EWTN host Raymond Arroyo's moving story sheds new light on a family of Golden Silk Orb Weavers, whose silk is considered the most precious of all and is displayed at Christmastime in the sparkling tinsel that glints from evergreen trees the world over. After rea .. strong In this classic of children's literature, beloved by generations of readers and listeners, the quiet poetry of the words and the gentle, lulling illustrations combine to make a perfect book for the end of the day.  strong  In a great green room, tucked away in bed, is a little bunny. Goodnight room, goodnight moon. And to all the familiar things in the softly lit room--to the picture of the three little bears sitting on chairs, to the clocks and his socks, to the mittens and the kittens, to everything one by one--the little bunny says goodnight. One of the most beloved books of all time, Goodnight Moon  is a must for every bookshelf. This board book edit .. Superspy middle schooler Ben Ripley faces the Croatoan--a new evil organization that's so mysterious, the only proof it exists is from the American Revolution--in this latest addition to the New York Times  bestselling Spy School series.   With SPYDER defeated, Ben Ripley is looking forward to his life getting back to normal, or as normal as possible when you're a superspy in training. For once, everything seems to be right in Ben's world...until someone bombs the CIA conference room next door.  To Ben's astonishment, the attacker is none other than Erica Hale, the spy-in-training he respects more than any other. Ben refuses to believe Erica is working for the e .. THE all-time classic picture book, from generation to generation, sold somewhere in the world every 30 seconds  A sturdy and beautiful book to give as a gift for new babies, baby showers, birthdays, and other new beginnings    Featuring interactive die-cut pages, this board book edition is the perfect size for little hands and great for teaching counting and days of the week.   The very hungry caterpillar literally eats his way through the pages of the book--and right into your child's heart... br  --Mother's Manual   Gorgeously illustrated, brilliantly innovative...--The New York Times Book Review   .. A New York Times  bestseller  br    br   "I loved every page. This is middle grade fiction at its best."-- Ann Patchett  From the #1 New York Times  bestselling author of The Magicians  comes a must-read, wholly original middle-grade debut perfect for fans of The Chronicles of Narnia and Roald Dahl.  br    br   Dear Uncle Herbert,   br   You've never met me, but I'm your niece Kate, and since it is my birthday tomorrow and you are super-rich could you please send me a present?  br    br   Kate and her younger brother Tom lead dull, uninteresting lives. And if their dull, uninteresting parents are anything to go by, they don't have much to look forward to. Why ca .. A pioneering new graphic novel series by Dav Pilkey, the author and illustrator of the internationally bestselling Dog Man and Captain Underpants series.  br   br In Cat Kid Comic Club , Li'l Petey, Flippy, and Molly introduce twenty-one rambunctious, funny, and talented baby frogs to the art of comic making. As the story unwinds with mishaps and hilarity, readers get to see the progress, mistakes, and improvements that come with practice and persistence.  br   br Squid Kid and Katydid , Baby Frog Squad , Monster Cheese Sandwich , and Birds Flowers Trees: A Haiku Photo Comic  are just some of the mini-comics that are included as stories-within-the-story, each do .. Snuggle up with your favorite nursery rhymes and drift into the magical worlds of all your most beloved fairytale characters. This  strong beautifully illustrated  strong  storytime treasury padded keepsake brings together the talents of illustrators from around the world with well-loved stories and rhymes both kiddos and their grown-ups will enjoy. Find new stories and adventures to fill your imagination every night with this  strong bedtime family favorite  strong . Sweet Dreams    ul   li Short stories designed to quickly entertain without cutting into precious bedtime schedules  li   li Bond with your child through family storytime and strengthen a love for  .. Collected together for the first time are Patricia C. Wrede's hilarious adventure stories about Cimorene, the princess who refuses to be proper. Every one of Cimorene's adventures is included in its paperback edition--Dealing with Dragons, Searching for Dragons, Calling on Dragons,   and Talking to Dragons --in one handsome package that's perfect for gift giving.  br  .. Kids will adore this sequel to the New York Times bestseller The Biggest Pumpkin Ever   When the mayor of Mouseville announces the town snowman contest, Clayton and Desmond claim that they will each make the biggest snowman ever. But building a huge snowman alone is hard  They work and work, but their snowmen just aren't big enough.  Soon they have an idea. As the day of the contest approaches, Clayton and Desmond join forces to build the biggest snowman ever. br  ..

e) Price

price_before_tag=[]
divs=soup.find_all("div",{"class":"book-control w3-col l3 m3 s-margin-top-sm"})
for div in divs:
prices=div.find("div",{"class":"price-before-disc"})
price= prices.text.replace("RM","")
price=price.strip()
price_before_tag.append(price)
print(price)

The output:

39.26 
42.25
224.59
44.37
49.61
76.20
76.05
732.00
121.64
50.77
43.48
89.71
45.11
89.87
53.35
83.59
71.19
92.43
164.42
24.14

f) Discount Price

price_after_tag=[]
divs=soup.find_all("div",{"class":"book-control w3-col l3 m3 s-margin-top-sm"})
for div in divs:
prices=div.find("div",{"class":"price-after-disc"})
price= prices.text.replace("RM","")
price=price.strip()
price_after_tag.append(price)
print(price)

The output:

35.29 
37.98
201.91
39.89
44.60
68.50
68.37
658.07
109.35
45.64
39.09
80.65
40.55
80.79
47.96
75.15
64.00
83.09
147.81
21.70

g) Link

link_tag = []for div in soup.findAll("div", {"class":"book-detail w3-col l7 m7 s8"}):
tags = div.find("a")
tag = tags.get("href")
link_tag.append("https://opentrolley.com.my/" + tag)
for link in link_tag:
print(link)

The output:

https://opentrolley.com.my/book/9780142501122/room-on-the-broom https://opentrolley.com.my/book/9781416936473/hatchet-r-e https://opentrolley.com.my/book/9780545790352/harry-potter-the-sorcerers-s https://opentrolley.com.my/book/9781250125835/everything-is-mama-board https://opentrolley.com.my/book/9780593110416/antiracist-baby https://opentrolley.com.my/book/9781338535624/dog-man-grime-punishment-fro https://opentrolley.com.my/book/9781419748684/diary-of-a-wimpy-kid-diary https://opentrolley.com.my/book/9781338312911/boxed-harry-potter-the-illu-3v https://opentrolley.com.my/book/9780807508541/boxed-boxc-mysteries-boxed-4v https://opentrolley.com.my/book/9780448405209/letc https://opentrolley.com.my/book/9780671666064/strega-nona-r-e https://opentrolley.com.my/book/9781644132111/spider-who-saved-xmas https://opentrolley.com.my/book/9780694003617/goodnight-moon-board https://opentrolley.com.my/book/9781534443785/spy-school-revolution https://opentrolley.com.my/book/9780399226908/very-hungry-caterpillar-board https://opentrolley.com.my/book/9780316539531/silver-arrow https://opentrolley.com.my/book/9781338712766/cat-kid-comic-club https://opentrolley.com.my/book/9781680524048/365-bedtime-stories-rhymes https://opentrolley.com.my/book/9780544542648/boxed-enchanted-forest-chron https://opentrolley.com.my/book/9780439627689/biggest-snowman-ever

h) Availability

availability_tag = []for divs_tag in soup.findAll("div", {"class":"book-control w3-col l3 m3 s-margin-top-sm"}):
available = soup.findAll("div", {"class":"stock-status"})
status = available[0]
availability_tag.append(status.text)
for availability in availability_tag:
print(availability)

The output:

Available 
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available
Available

To show all the extracted attributes combine in a table, we need to do a dataframe.

import pandas as pddf = pd.DataFrame({"Title":title_tag, "Author":author_tag,"Type":types_tag,"Description":descs_tag,"Price(RM)":price_before_tag,"Discount Price(RM)":price_after_tag,"Link":link_tag, "Availability":availability_tag})df

2. TRANSFORM

As we known that transformation is a phase that we need to transform the extraction data to something that easier to read. So the first transformation is to replace the type and availability attributes from string to number for easier understanding. For example the type hardcover transformed to number ‘1’ , Boxed Set to number ‘2’ and so on.

import numpy as np
df.replace("Available", 1, inplace = True)
df.replace("Unavailable", 0, inplace = True)
df.replace("Hardcover", 1, inplace = True)
df.replace("Boxed Set", 2, inplace = True)
df.replace("Board Books", 3, inplace = True)
df.replace("Paperback", 4, inplace = True)
df.replace("", "Not Available", inplace = True)
df

DATA VISUALIZATION

Next, in transformation phase, we also need to visualize the dataframe into graphical information such as Histogram, Boxplot, Scatterplot and etc. First of all, we must to import the matplotlib to perform the graphical information.

%matplotlib inlineimport matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot') # optional: for ggplot-like style# check for latest version of Matplotlib
print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

Then the output will show the matplotlib version :

Matplotlib version:  3.2.2

Next, for Histogram, we want to show the number of books based on the prices. Since we don’t have attribute for number of the books based on the price , so we need to count it first.

df["Discount Price(RM)"] = pd.to_numeric(df["Discount Price(RM)"])# np.histogram returns 2 values
count, bin_edges = np.histogram(df['Discount Price(RM)'])
print(count) # frequency count
print(bin_edges) # bin ranges, default = 10 bins

The output of the number of the books based on the price:

[16  2  1  0  0  0  0  0  0  1] 
[ 21.7 85.337 148.974 212.611 276.248 339.885 403.522 467.159 530.796 594.433 658.07 ]

After that, we can build the Histogram.

df['Discount Price(RM)'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)plt.title('Histogram of Price Per Book In The Store') # add a title to the histogram
plt.ylabel('Number of Books') # add y-label
plt.xlabel('Prices') # add x-label
plt.show()

Other than that, we also can build a Boxplot for Data Visualisation. For the Boxplot, we decided to show the Boxplot of the price per book in the store. The codes below show the list of the Discount Price :

df_bxplt = pd.DataFrame(df['Discount Price(RM)'])
df_bxplt

Then, after that we can build the boxplot…

df_bxplt.plot(kind='box', figsize=(8,6))plt.title('Boxplot Of Price Per Book In The Store')
plt.ylabel('Discount Price(RM)')

For the last data visualization, we can build a Scatterplot. So, this scatterplot we decided to choose the price and the book types. For build the Scatterplot, the codes are shown as below :

df.plot(kind='scatter', x='Discount Price(RM)', y='Type', figsize=(10, 6), color='darkblue')plt.title('Scatterplot For Prices of Book Type')
plt.xlabel('Price')
plt.ylabel("Book Cover")
plt.show()

The output of the Scatterplot would be like this:

3. LOAD

For the final process of this web scraping and data cleaning session, we store(load) the data in a csv file and download it.

df.to_csv("bookstore.csv", index=False, encoding="utf-8")

That’s all for this article. Hope this solutions could help you on your problem. If you want to see the full codes, just click the link below. THANK YOU :)

I would like to thank Dr Sabrina because helps me a lot in this subject & also all my teammates Ma Junjie, Anas Farhan & Aljannah who were together with me to complete this project. Thank You Everyone ❤

--

--