Web Scraping

1. Import the Selenium library

You can install Selenium with pip (pip install selenium).

from selenium import webdriver

2. Start the webdriver and the browser

Starting the webdriver and the Chrome browser.

You can download ChromeDriver from the official ChromeDriver downloads page.

chromedriver = "C:/tests/chromedriver.exe"
driver = webdriver.Chrome(executable_path = chromedriver)

Starting the webdriver and the Firefox browser.

You can download GeckoDriver from the geckodriver releases page on GitHub.

geckodriver = "C:/tests/geckodriver.exe"
driver = webdriver.Firefox(executable_path = geckodriver)

Starting the webdriver and the Internet Explorer browser.

You can download IEDriverServer from the official Selenium downloads page.

iedriver = "C:/tests/IEDriverServer.exe"
driver = webdriver.Ie(executable_path = iedriver)

Starting the webdriver and the Safari browser.

Nothing to download. The SafariDriver is integrated in Safari.

driver = webdriver.Safari()

Instead of having machines with all those browsers, I just use Endtest.

It’s a platform for Codeless Automated Testing where you can create, manage and execute tests on real browsers on Windows and macOS machines and mobile devices.

3. Open a website

the_url = "https://example.com"
driver.get(the_url)

4. Find an element

Let’s try to find this element:

<a href="/sign-up" id="register" name="register" class="cta nav-link">Sign Up</a>

Find element by ID

the_id = 'register'
element = driver.find_element_by_id(the_id)

Find element by Name

the_name = 'register'
element = driver.find_element_by_name(the_name)

Find element by Class Name

the_class_name = 'nav-link'
element = driver.find_element_by_class_name(the_class_name)

Find element by Tag Name

the_tag_name = 'a'
element = driver.find_element_by_tag_name(the_tag_name)

Find element by Link Text

Works only for anchor elements.

the_link_text = 'Sign Up'
element = driver.find_element_by_link_text(the_link_text)

Find element by Partial Link Text

Works only for anchor elements.

the_partial_link_text = 'Sign'
element = driver.find_element_by_partial_link_text(the_partial_link_text)

Find element by CSS Selector

You can extract the CSS Selector from the browser.

Or you can write your own by using an attribute from the element:

*[attribute="attribute_value"]

For our element, a custom CSS Selector would be:

a[href="/sign-up"]

the_css_selector = 'a[href="/sign-up"]'
element = driver.find_element_by_css_selector(the_css_selector)

Find element by XPath

You can extract the XPath from the browser.

Or you can write your own by using an attribute from the element:

//*[@attribute = "attribute_value"]

For our element, a custom XPath would be:

//a[@href = "/sign-up"]

the_xpath = '//a[@href = "/sign-up"]'
element = driver.find_element_by_xpath(the_xpath)

5. Click on an element

the_id = 'register'
element = driver.find_element_by_id(the_id)
element.click()

6. Write text inside an element

Works only for inputs and textareas.

the_id = 'email'
the_email = 'klaus@werner.de'
element = driver.find_element_by_id(the_id)
element.send_keys(the_email)

7. Select an option

Works only for select elements.

<select id="country">
   <option value="US">United States</option>
   <option value="CA">Canada</option>
   <option value="MX">Mexico</option>
</select>

Let’s select Canada. 🇨🇦

You can use the visible text:

from selenium.webdriver.support.ui import Select

the_id = 'country'
element = driver.find_element_by_id(the_id)
select_element = Select(element)
select_element.select_by_visible_text('Canada')

You can use the value:

the_id = 'country'
element = driver.find_element_by_id(the_id)
select_element = Select(element)
select_element.select_by_value('CA')

You can also use the index:

the_id = 'country'
element = driver.find_element_by_id(the_id)
select_element = Select(element)
select_element.select_by_index(1)

8. Take a screenshot

the_path = 'C:/tests/screenshots/1.png'
driver.save_screenshot(the_path)

Selenium does not offer Screenshot Comparison but we know who does.

9. Upload a file

This works by using the send_keys method to write the local path of the file into the input type="file" element.

Let’s use this example:

<input type="file" multiple="" id="upload_button">

the_file_path = 'C:/tests/files/example.pdf'
the_id = 'upload_button'
element = driver.find_element_by_id(the_id)
element.send_keys(the_file_path)

10. Execute JavaScript

In some cases, you might need to execute some JavaScript code.

This works exactly like you would execute it in your browser console.

js_code = 'document.getElementById("pop-up").remove()'
driver.execute_script(js_code)

11. Switch to iframe

<iframe id="payment_section">
   <input id="card_number">
   <input id="card_name">
   <input id="expiration_date">
   <input id="cvv">
</iframe>

the_iframe_id = 'payment_section'
the_element_id = 'card_number'
the_iframe = driver.find_element_by_id(the_iframe_id)
driver.switch_to.frame(the_iframe)
element = driver.find_element_by_id(the_element_id)
element.send_keys('41111111111111')
driver.switch_to.default_content()

Endtest also supports iframes and it even supports Shadow DOM.

12. Switch to the next tab

You have to keep track of the index of your current tab in a global variable.

If you have only one tab open, its index in driver.window_handles is 0.

global nextTab
global currentTab
nextTab = currentTab + 1
driver.switch_to.window(driver.window_handles[nextTab])
currentTab = currentTab + 1

13. Switch to the previous tab

global previousTab
global currentTab
previousTab = currentTab - 1
driver.switch_to.window(driver.window_handles[previousTab])
currentTab = currentTab - 1

14. Close tab

driver.close()

15. Close alert

driver.switch_to.alert.accept()

16. Refresh

driver.refresh()

17. Hover

the_id = "register"
the_element = driver.find_element_by_id(the_id)
hover = ActionChains(driver).move_to_element(the_element)
hover.perform()

18. Right Click

the_id = "register"
the_element = driver.find_element_by_id(the_id)
right_click = ActionChains(driver).context_click(the_element)
right_click.perform()

19. Click with offset

In order to precisely click on a certain position in a canvas element, you have to provide the offset.

The offset represents the number of pixels to the right and down, starting from the top left corner of your canvas element.

the_id = "register"
the_element = driver.find_element_by_id(the_id)
x = 30
y = 20
offset = ActionChains(driver).move_to_element_with_offset(the_element,x,y)
offset.click()
offset.perform()

You can read how to do this with Endtest here.

20. Press Key

from selenium.webdriver.common.keys import Keys

the_id = 'register'
element = driver.find_element_by_id(the_id)
element.send_keys(Keys.RETURN)

21. Drag and drop

element_to_drag_id = 'ball'
target_element_id = 'goal'
element_to_drag = driver.find_element_by_id(element_to_drag_id)
target_element = driver.find_element_by_id(target_element_id)
ActionChains(driver).drag_and_drop(element_to_drag, target_element).perform()

22. Get Page Source

the_page_source = driver.page_source

23. Get Cookies

cookies_list = driver.get_cookies()

24. Delete Cookies

cookie_item = 'shopping_cart'
# delete one cookie
driver.delete_cookie(cookie_item)
# delete all cookies
driver.delete_all_cookies()

25. Get first element from list

the_id = 'register'
list_of_elements = driver.find_elements_by_id(the_id)
first_element = list_of_elements[0]

26. Configure Page Load Timeout

driver.set_page_load_timeout(20)

27. Configure Element Load Timeout

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

the_id = 'register'
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID, the_id)))

28. Set window size

driver.set_window_size(1600, 1200)

29. Change the user agent string

the_user_agent = 'hello'
chromedriver = 'C:/tests/chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument('--user-agent=' + the_user_agent)
driver = webdriver.Chrome(
   executable_path = chromedriver, 
   chrome_options = options)

30. Simulate webcam and microphone

chromedriver = 'C:/tests/chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument("--use-fake-ui-for-media-stream")
options.add_argument("--use-fake-device-for-media-stream")
driver = webdriver.Chrome(
   executable_path = chromedriver, 
   chrome_options = options)

31. Add Chrome Extension

chromedriver = 'C:/tests/chromedriver.exe'
extension_path = 'C:/tests/my_extension.zip'
options = webdriver.ChromeOptions()
options.add_extension(extension_path)
driver = webdriver.Chrome(
   executable_path = chromedriver, 
   chrome_options = options)

32. Emulate mobile device

google_pixel_3_xl_user_agent = 'Mozilla/5.0 (Linux; Android 9.0; Pixel 3 XL Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.98 Mobile Safari/537.36'
pixel_3_xl_emulation = {
   "deviceMetrics": {
      "width": 411, 
      "height": 731, 
      "pixelRatio": 3
   },
   "userAgent": google_pixel_3_xl_user_agent
}
chromedriver = 'C:/tests/chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_experimental_option("mobileEmulation", pixel_3_xl_emulation)
driver = webdriver.Chrome(
   executable_path = chromedriver, 
   chrome_options = options)

Debt to GDP ratio by country

So your job is to scrape the national debt to GDP ratio for each country listed on this website: http://worldpopulationreview.com/countries/countries-by-national-debt/

The population is not required to be scraped, but if you want to scrape it, that's fine.

Now, since this will be your first exercise, I'm going to list below the steps you need to follow:

  1. First things first, scaffold a new project called 'national_debt' and then generate a spider within that same project called 'gdp_debt'

  2. The website uses the 'http' protocol, so there is no need to modify that in the 'start_urls', since Scrapy will use 'http' by default

  3. Next, inside the parse method, make sure to iterate (loop) through all rows and yield two keys, 'country_name' and 'gdp_debt' (see the sketch after this list)

  4. Finally, make sure to execute the spider
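
Here is a minimal sketch of such a spider; the CSS selectors are assumptions about the page's table markup, so inspect the live page and adjust them if needed:

import scrapy

class GdpDebtSpider(scrapy.Spider):
    name = 'gdp_debt'
    start_urls = ['http://worldpopulationreview.com/countries/countries-by-national-debt/']

    def parse(self, response):
        # Loop through every row of the ranking table
        for row in response.css('table tbody tr'):
            yield {
                'country_name': row.css('td:nth-child(1)::text').get(),
                'gdp_debt': row.css('td:nth-child(2)::text').get(),
            }

Then execute it from inside the project folder with: scrapy crawl gdp_debt -o output.json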

NLP

Text Cleaning includes the following steps (but is not limited to them); a sketch of a few of these steps follows the list:

  • Lower Casing
  • Removal of punctuations
  • Removal of stopwords
  • Removal of frequent words
  • Removal of very rare words
  • Stemming
  • Lemmatization
  • Removal of emojis
  • Removal of emoticons
  • Conversion of emoticons to words
  • Conversion of emojis to words
  • Use of regular expressions (removal of URLs, HTML tags, phone numbers, email IDs, etc.)
  • Chat words conversion
  • Spelling correction
  • Removal of non-English words
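
Here is a minimal sketch of a few of these steps (lower casing, URL and HTML removal, punctuation removal, and stopword removal), assuming NLTK's English stopword list has been downloaded with nltk.download('stopwords'):

import re
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                                    # lower casing
    text = re.sub(r'https?://\S+|www\.\S+', '', text)      # remove URLs
    text = re.sub(r'<.*?>', '', text)                      # remove HTML tags
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    words = [word for word in text.split() if word not in STOPWORDS]  # remove stopwords
    return ' '.join(words)

print(clean_text('Check out <b>THIS</b> link: https://example.com !'))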

For Topic Modeling we can use one of the following (a minimal LDA sketch follows the list):

  • LDA
  • Guided (Seeded) LDA
  • Anchored CorEx
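
As an illustration, here is a minimal LDA sketch with gensim; the toy corpus and the number of topics are just for demonstration:

from gensim import corpora
from gensim.models import LdaModel

docs = [['cat', 'dog', 'pet'], ['stock', 'market', 'trade'], ['dog', 'bark', 'pet']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit a 2-topic model and print the top words per topic
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)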

For Text Summarization we can use the following (a pipeline sketch follows the list):

  • Text summarization with a Seq2Seq model
  • Text summarization with transformers
  • Text summarization with Hugging Face transformers
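
A minimal sketch with the Hugging Face pipeline API (the default summarization checkpoint is downloaded on first use; the input text is illustrative):

from transformers import pipeline

summarizer = pipeline('summarization')
text = ('Selenium automates browsers. It is primarily used for testing web '
        'applications, but it is also handy for web scraping and for '
        'automating repetitive browser tasks.')
print(summarizer(text, max_length=30, min_length=10)[0]['summary_text'])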

Official Libraries

First-party cool stuff made with love by Hugging Face. A tiny usage sketch of the first few libraries follows the list.

  • transformers – State-of-the-art natural language processing for Jax, PyTorch and TensorFlow.
  • datasets – The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use, and efficient data manipulation tools.
  • tokenizers – Fast, state-of-the-art tokenizers optimized for research and production.
  • knockknock – Get notified when your training ends with only two additional lines of code.
  • accelerate – A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
  • autonlp – Train state-of-the-art natural language processing models and deploy them in a scalable environment automatically.
  • nn_pruning – Prune a model while finetuning or training.
  • huggingface_hub – Client library to download and publish models and other files on the huggingface.co hub.
  • tune – A benchmark for comparing Transformer-based models.
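
As an illustration of transformers and datasets working together, here is a sketch that loads a dataset slice and tokenizes one example; the dataset and checkpoint names are common choices, not requirements:

from datasets import load_dataset
from transformers import AutoTokenizer

# Load a small slice of IMDB and tokenize the first review
dataset = load_dataset('imdb', split='train[:100]')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer(dataset[0]['text'], truncation=True)
print(tokens['input_ids'][:10])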

Tutorials

Learn how to use Hugging Face toolkits, step-by-step.

NLP Toolkits

NLP toolkits built upon Transformers. Swiss Army!

  • AllenNLP (from AI2) – An open-source NLP research library.
  • Graph4NLP – Enabling easy use of Graph Neural Networks for NLP.
  • Lightning Transformers – Transformers with PyTorch Lightning interface.
  • Adapter Transformers – Extension to the Transformers library, integrating adapters into state-of-the-art language models.
  • Obsei – A low-code AI workflow automation tool and performs various NLP tasks in the workflow pipeline.
  • Trapper (from OBSS) – State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Text Representation

Converting a sentence to a vector. A minimal sentence-transformers sketch follows the list.

  • Sentence Transformers (from UKPLab) – Widely used encoders computing dense vector representations for sentences, paragraphs, and images.
  • WhiteningBERT (from Microsoft) – An easy unsupervised sentence embedding approach with whitening.
  • SimCSE (from Princeton) – State-of-the-art sentence embedding with contrastive learning.
  • DensePhrases (from Princeton) – Learning dense representations of phrases at scale.
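
A minimal sketch, assuming the sentence-transformers package and the widely used all-MiniLM-L6-v2 checkpoint:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode two sentences into dense vectors
embeddings = model.encode(['How are you?', 'Hello there!'])
print(embeddings.shape)  # (2, 384) – each sentence becomes a 384-dimensional vector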

Inference Engines

Highly optimized inference engines implementing Transformers-compatible APIs.

  • TurboTransformers (from Tencent) – An inference engine for transformers with fast C++ API.
  • FasterTransformer (from Nvidia) – A script and recipe to run the highly optimized transformer-based encoder and decoder component on NVIDIA GPUs.
  • lightseq (from ByteDance) – A high performance inference library for sequence processing and generation implemented in CUDA.
  • FastSeq (from Microsoft) – Efficient implementation of popular sequence models (e.g., Bart, ProphetNet) for text generation, summarization, translation tasks etc.

Model Scalability

Parallelization models across multiple GPUs.

  • Parallelformers (from TUNiB) – A library for model parallel deployment.
  • OSLO (from TUNiB) – A library that supports various features to help you train large-scale models.
  • Deepspeed (from Microsoft) – Deepspeed-ZeRO – scales any model size with zero to no changes to the model. Integrated with HF Trainer.
  • fairscale (from Facebook) – Implements ZeRO protocol as well. Integrated with HF Trainer.
  • ColossalAI (from Hpcaitech) – A Unified Deep Learning System for Large-Scale Parallel Training (1D, 2D, 2.5D, 3D and sequence parallelism, and ZeRO protocol).

Model Compression/Acceleration

Compressing or accelerate models for improved inference speed.

  • torchdistill – PyTorch-based modular, configuration-driven framework for knowledge distillation.
  • TextBrewer (from HFL) – State-of-the-art distillation methods to compress language models.
  • BERT-of-Theseus (from Microsoft) – Compressing BERT by progressively replacing the components of the original BERT.

Adversarial Attack

Conducting adversarial attack to test model robustness.

  • TextAttack (from UVa) – A Python framework for adversarial attacks, data augmentation, and model training in NLP.
  • TextFlint (from Fudan) – A unified multilingual robustness evaluation toolkit for NLP.
  • OpenAttack (from THU) – An open-source textual adversarial attack toolkit.

Style Transfer

Transfer the style of text! Now you know why it's called a transformer.

  • Styleformer – A neural language style transfer framework to transfer text smoothly between styles.
  • ConSERT – A contrastive framework for self-supervised sentence representation transfer.

Sentiment Analysis

Analyzing the sentiment and emotions of human beings. A minimal pipeline sketch follows the list.

  • conv-emotion – Implementation of different architectures for emotion recognition in conversations.
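
A minimal sketch using the generic Hugging Face pipeline for sentiment analysis (not conv-emotion itself, which ships its own architectures and training recipes):

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
print(classifier('I love this movie!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]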

Grammatical Error Correction

You made a typo! Let me correct it.

  • Gramformer – A framework for detecting, highlighting, and correcting grammatical errors on natural language text.

Translation

Translating between different languages.

  • dl-translate – A deep learning-based translation library based on HF Transformers.
  • EasyNMT (from UKPLab) – Easy-to-use, state-of-the-art translation library and Docker images based on HF Transformers.

Knowledge and Entity

Learning knowledge, mining entities, connecting the world.

  • PURE (from Princeton) – Entity and relation extraction from text.

Speech

Speech processing powered by HF libraries. Need for speech!

  • s3prl – A self-supervised speech pre-training and representation learning toolkit.
  • speechbrain – A PyTorch-based speech toolkit.

Multi-modality

Understanding the world from different modalities.

  • ViLT (from Kakao) – A vision-and-language transformer without convolution or region supervision.

Reinforcement Learning

Combining RL magic with NLP!

  • trl – Fine-tune transformers using Proximal Policy Optimization (PPO) to align with human preferences.

Question Answering

Searching for answers? Transformers to the rescue!

  • Haystack (from deepset) – End-to-end framework for developing and deploying question-answering systems in the wild.

Recommender Systems

I think this is just right for you!

  • Transformers4Rec (from Nvidia) – A flexible and efficient library powered by Transformers for sequential and session-based recommendations.

Evaluation

Evaluating NLP outputs powered by HF datasets!

  • Jury (from OBSS) – An easy-to-use tool for evaluating NLP model outputs, specifically for NLG (Natural Language Generation), offering various automated text-to-text metrics.

Neural Search

Search, but with the power of neural networks!

  • Jina Integration – Jina integration of Hugging Face Accelerated API.
  • Weaviate Integration (text2vec) (QA) – Weaviate integration of Hugging Face Transformers.
  • ColBERT (from Stanford) – A fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Cloud

Cloud makes your life easy!

  • Amazon SageMaker – Making it easier than ever to train Hugging Face Transformer models in Amazon SageMaker.

Hardware

The infrastructure enabling the magic to happen.

  • Qualcomm – Collaboration on enabling Transformers in Snapdragon.
  • Intel – Collaboration with Intel for configuration options.

NOTE: This list of resources is taken entirely from the Hugging Face awesome-huggingface repository; I am just sharing it here as a wiki. Please visit the repository for more.