Website Scraping with Dart (Flutter)
As an active Flutter developer, I am comfortable with Dart, so I wanted to do this scraping in Dart only (although I now recommend using Python).
What is Website Scraping?
Website scraping, or web data extraction, is the process of extracting data from websites.
DISCLAIMER: Scraping isn't legal for some websites; the purpose here is solely educational.
Gather Fuel and Gear Up
1. Create a new Flutter Project
You need a Flutter project to get started. You could create a standalone Dart console app instead, but to keep things easy we'll do this in an app project.
Open a terminal and run the following command.
flutter create flutter_scrap
This creates a new Flutter project. Confirm the app works by running the project once.
2. Add web_scraper dependency
There is a pub package named web_scraper which makes scraping easy.
Open pubspec.yaml and add web_scraper to the dependencies section.
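As a rough sketch, the dependencies section could end up looking like the fragment below. The version constraint is an assumption, not something the original states; check the package page on pub.dev for the current release before copying it.

```yaml
dependencies:
  flutter:
    sdk: flutter
  # Assumed version constraint; replace with the latest one listed on pub.dev.
  web_scraper: ^0.1.4
```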
Run flutter pub get to finish adding the dependency.
3. Gather website and data to scrape
For this article, we will use the unacademy.com website.
If you open the course page, you will find a list of chapters.
We need to extract these chapter titles and parse them into an array of strings.
4. See what’s in HTML
A web scraper simply reads and parses the website's HTML and extracts the different elements (like <div>, <a>) from it. You can then set conditions to pick only the elements you want to scrape.
Right click a chapter and select Inspect element (or press F12). A window opens up and you can see the corresponding element's HTML.
You can see how deeply nested this title is in the HTML. Since there are multiple chapters, there are multiple such elements. We need to extract all of them, while making sure we don't pick up any irrelevant text.
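To make the "irrelevant text" problem concrete, here is a minimal sketch that parses a small HTML string and extracts elements from it. It assumes the `html` package from pub.dev is available as a dependency, and the markup, class names, and the `textsOf` helper are made up for illustration; they are not taken from the real page.

```dart
import 'package:html/parser.dart' show parse;

// Hypothetical, simplified markup standing in for a real course page.
const sampleHtml = '''
<html><body>
  <h6>Related courses</h6>
  <div class="chapter"><a href="/lesson/1"><h6>Chapter one</h6></a></div>
  <div class="chapter"><a href="/lesson/2"><h6>Chapter two</h6></a></div>
</body></html>
''';

/// Parses [html] and returns the trimmed text of every element
/// that matches the CSS [selector].
List<String> textsOf(String html, String selector) {
  final document = parse(html);
  return [
    for (final element in document.querySelectorAll(selector))
      element.text.trim(),
  ];
}

void main() {
  // Selecting by tag alone also picks up the unrelated heading.
  print(textsOf(sampleHtml, 'h6'));
  // prints [Related courses, Chapter one, Chapter two]

  // Adding the parent's class narrows the match to chapter titles only.
  print(textsOf(sampleHtml, 'div.chapter h6'));
  // prints [Chapter one, Chapter two]
}
```

The broader the condition, the more unrelated elements slip in, which is why the selector used for the real page has to be specific.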
5. Write Scraper
Have a look at the documentation of web_scraper. There are 3 parts:
- Domain — https://unacademy.com
- Endpoint — /course/gravitation-for-iit-jee/D5A8YSAJ
- Address of element to be extracted
'div.Week__Wrapper-sc-1qeje5a-2 > a.Link__StyledAnchor-sc-1n9f3wx-0 > div.ItemCard__ItemInfo-xrh60s-1 > h6.H6-sc-1gn2suh-0'
The address here needs to be correct. This is where the real thinking is required. The address above is written using CSS selectors.
The title element sits inside the hierarchy spelled out by that selector: div > a > div > h6.
Here the text to the left of the . (dot) is the tag and the text to the right is the class of that tag. So h6.H6-sc-1gn2suh-0 means: find <h6> tags which have the class H6-sc-1gn2suh-0.
> is the CSS child combinator: it matches only direct children of the parent. CSS selectors offer many other combinations as well.
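The difference between the direct-child combinator and a plain descendant selector can be sketched as follows. This again assumes the `html` package is available; the markup, the class names, and the `matchTexts` helper are hypothetical and only serve the illustration.

```dart
import 'package:html/parser.dart' show parse;

// Hypothetical markup: one title is a direct child of the link inside the
// wrapper, the other link is wrapped in an extra <section>.
const nestedHtml = '''
<div class="week">
  <a class="link"><h6 class="title">Direct child</h6></a>
  <section><a class="link"><h6 class="title">Nested deeper</h6></a></section>
</div>
''';

/// Returns the trimmed text of every element matching [selector].
List<String> matchTexts(String html, String selector) => [
      for (final element in parse(html).querySelectorAll(selector))
        element.text.trim(),
    ];

void main() {
  // tag.class with ">" between each level: every step must be a direct child.
  print(matchTexts(nestedHtml, 'div.week > a.link > h6.title'));
  // prints [Direct child]

  // A space means "any descendant", so the deeper link matches too.
  print(matchTexts(nestedHtml, 'div.week a.link h6.title'));
  // prints [Direct child, Nested deeper]
}
```

Using > at every level keeps the match tight, at the cost of breaking as soon as the page inserts another wrapper element.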
The final code for extracting the titles puts these three parts together. The code is easy to understand unless you are an absolute noob.
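The original listing is not reproduced here, so what follows is only a sketch of how the three parts could be combined with web_scraper, not necessarily the author's exact code. It relies on the package's `WebScraper`, `loadWebPage`, and `getElement` calls; the helper names `titlesFrom` and `fetchChapterTitles` are made up for this example, and the generated class names in the selector may no longer match the live page.

```dart
import 'package:web_scraper/web_scraper.dart';

// The three parts described above.
const domain = 'https://unacademy.com';
const endpoint = '/course/gravitation-for-iit-jee/D5A8YSAJ';
const titleAddress =
    'div.Week__Wrapper-sc-1qeje5a-2 > a.Link__StyledAnchor-sc-1n9f3wx-0 > '
    'div.ItemCard__ItemInfo-xrh60s-1 > h6.H6-sc-1gn2suh-0';

/// Hypothetical helper: pulls the 'title' (element text) out of each
/// scraped element and returns them as a plain list of strings.
List<String> titlesFrom(List<Map<String, dynamic>> elements) => [
      for (final element in elements) element['title'].toString().trim(),
    ];

/// Hypothetical helper: loads the course page and returns its chapter titles.
Future<List<String>> fetchChapterTitles() async {
  final webScraper = WebScraper(domain);
  // loadWebPage resolves to true once the page has been fetched and parsed.
  if (!await webScraper.loadWebPage(endpoint)) return [];
  // Second argument: attribute names to collect; none are needed here.
  final elements = webScraper.getElement(titleAddress, []);
  return titlesFrom(elements);
}

Future<void> main() async {
  final titleList = await fetchChapterTitles();
  print(titleList);
}
```

In a Flutter app the same call would typically sit in a widget's state rather than in main(), but the scraping logic stays the same.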
The output of print(titleList) would be the list of extracted titles:
flutter: [Newton’s Law of Gravitation, Variation in Value of ‘g’: Part 2, Gravitational Field Due to a Point Mass, Variation in Value of ‘g’: Part 1, Gauss Theorem, Gravitational Field due to a Uniform Solid Sphere and Uniform Spherical Shell, Gravitational Potential, Relation Between Gravitational Field and Potential, Gravitational Potential Energy, Binding Energy, Motion of Satellites, Maximum Height attained by a Particle, Trajectory of Satellites, Quality Numerical 001 : Sphere with a Cavity, Quality Numerical 002 : Gravitational Force by a Rod, Quality Numerical 003 : Ring and Sphere, Quality Numerical 004 : Gravitational Field of Semi Circular Wire, Quality Numerical 005 : Binding Energy and Escape Velocity, Quality Numerical 006 : Tunnel and Velocity, Quality Numerical 007 : Tunnel and Acceleration, Quality Numerical 008 : Tunnel and Amplitude, Quality Numerical 009 : Gravity Variation, Quality Numerical 010 : Gravity Variation, Quality Numerical 011 : Weightlessness, Quality Numerical 012 : Planet revolving, Quality Numerical 013 : Escape Velocity, Quality Numerical 014 : Angular Momentum, Quality Numerical 015 : Gravitational Field and Potential]
Wow! We just completed our first scrape successfully.
Beyond this, your imagination is the limit on what data you extract and how.
TIP: Sending many requests to a website may be recognized by the server as a DoS attack, leading to your IP being blacklisted or permanently banned. So whenever you make requests in a loop, add an intentional delay of 2–5 seconds before each request. If possible, try changing your IP frequently.
That said, you must not scrape a website if it clearly states that its data should not be scraped (for example, in robots.txt).
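The delay from the tip above can be sketched in plain Dart as follows. The helper names `politeDelay` and `fetchSequentially` are made up for this example, and the fetcher in main() is a stand-in that does no real network work.

```dart
import 'dart:math';

/// Picks a random pause between [minSeconds] and [maxSeconds], inclusive.
Duration politeDelay(Random random, {int minSeconds = 2, int maxSeconds = 5}) {
  final spanMs = (maxSeconds - minSeconds) * 1000;
  return Duration(
    milliseconds: minSeconds * 1000 + random.nextInt(spanMs + 1),
  );
}

/// Calls [fetch] for each endpoint one at a time, pausing before every request.
Future<List<T>> fetchSequentially<T>(
  List<String> endpoints,
  Future<T> Function(String endpoint) fetch,
) async {
  final random = Random();
  final results = <T>[];
  for (final endpoint in endpoints) {
    // Wait 2 to 5 seconds so the server does not see a burst of requests.
    await Future.delayed(politeDelay(random));
    results.add(await fetch(endpoint));
  }
  return results;
}

Future<void> main() async {
  // Stand-in fetcher; a real one would load and scrape the page.
  final lengths = await fetchSequentially<int>(
    ['/page/1', '/page/2'],
    (endpoint) async => endpoint.length,
  );
  print(lengths);
}
```

Randomizing the pause rather than using a fixed interval is a design choice here; a constant delay would satisfy the tip just as well.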
Web scraping is too large and complex a topic to summarize in one article. The above presents a very simple real-world use case.
Voilà! Both you and I learnt something new today. Congrats!
Clap! Clap! Clap!