| Title: | Easy Web Scraping |
|---|---|
| Description: | The goal of 'ralger' is to facilitate web scraping in R. |
| Authors: | Mohamed El Fodil Ihaddaden [aut, cre], Ezekiel Ogundepo [ctb], Romain François [ctb] |
| Maintainer: | Mohamed El Fodil Ihaddaden <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.3.0 |
| Built: | 2026-05-13 08:58:42 UTC |
| Source: | https://github.com/feddelegrand7/ralger |

This function is used to scrape attributes from HTML elements.

    attribute_scrap(link, node, attr, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| node | the HTML element to consider |
| attr | the attribute to scrape |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector.

    link <- "https://ropensci.org/"
    # Scraping the class attributes' names from all the anchor elements
    attribute_scrap(link = link, node = "a", attr = "class")

This function is used to scrape audio file URLs from a web page and optionally download them. It searches both `<audio>` tags and `<a>` (anchor) tags for links matching the specified audio extensions.

    audio_scrap(link, extensions = c("mp3", "wav"), path = getwd(), askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| extensions | a character vector of audio file extensions to filter by (without the leading dot). Defaults to c("mp3", "wav"). |
| path | the path where audio files will be downloaded. Defaults to the current working directory. Set to NULL to return the audio URLs without downloading. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading audio files. Returns the vector of matched audio URLs invisibly, or NULL if none are found.

    ## Not run:
    # Scrape and download mp3 and wav files from a page
    audio_scrap(
      link = "https://www.example.com/podcasts",
      extensions = c("mp3", "wav"),
      path = getwd()
    )

    # Return audio URLs without downloading
    audio_scrap(
      link = "https://www.example.com/podcasts",
      extensions = "mp3",
      path = NULL
    )
    ## End(Not run)
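The extension filter can be pictured with a few lines of base R. This is only a sketch of the idea, not ralger's actual implementation: the helper `filter_by_extension()` and the sample URLs are hypothetical, and it assumes a URL matches when its path ends in one of the extensions, optionally followed by a query string.

```r
# Hypothetical helper (not part of ralger): keep only the URLs whose path
# ends in one of the given extensions, ignoring case.
filter_by_extension <- function(urls, extensions = c("mp3", "wav")) {
  # Builds a pattern such as "\\.(mp3|wav)(\\?.*)?$"
  pattern <- paste0("\\.(", paste(extensions, collapse = "|"), ")(\\?.*)?$")
  urls[grepl(pattern, urls, ignore.case = TRUE)]
}

# Made-up URLs, for illustration only
urls <- c(
  "https://www.example.com/ep1.mp3",
  "https://www.example.com/about.html",
  "https://www.example.com/ep2.WAV?download=1"
)

filter_by_extension(urls)
filter_by_extension(urls, extensions = "mp3")
```

Matching on the end of the URL rather than anywhere in it avoids picking up pages that merely mention an extension in their path.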

This function scrapes all color codes present within a given website. It searches inline style attributes, `<style>` tags, and linked external CSS stylesheets for color values in the following formats: hexadecimal (#RGB, #RRGGBB, #RRGGBBAA), rgb(), rgba(), hsl(), and hsla().

    colors_scrap(link, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape. Can be a character vector of multiple URLs. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector of the unique color codes found on the page, invisible(NULL) if none are found, or NA on error.

    colors_scrap(link = "https://ropensci.org/")
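To make the listed color formats concrete, here is a minimal base R sketch that pulls them out of a CSS string with one regular expression. It is an illustration under stated assumptions, not the pattern ralger uses: `extract_colors()` and the sample CSS are hypothetical, and the function notation is matched loosely as anything between the parentheses.

```r
# Hypothetical helper (not part of ralger): extract unique color codes
# from a string of CSS.
extract_colors <- function(css) {
  pattern <- paste0(
    "#(?:[0-9A-Fa-f]{8}|[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})\\b", # #RRGGBBAA, #RRGGBB, #RGB
    "|(?:rgb|hsl)a?\\([^)]*\\)"                             # rgb(), rgba(), hsl(), hsla()
  )
  matches <- regmatches(css, gregexpr(pattern, css, perl = TRUE))[[1]]
  unique(matches)
}

# Made-up stylesheet fragment, for illustration only
css <- "a { color: #FFF; background: rgba(0, 0, 0, 0.5); } b { color: #FFF; border-color: hsl(120, 100%, 50%); }"

extract_colors(css)
```

The longest hexadecimal form is tried first so that an eight-digit code is not cut short to its first three or six digits.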

Extracts HTML comments (`<!-- comment -->`) from a web page. Useful for detecting hidden notes, debug info, or developer messages.

    comments_scrap(link, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | Character. The URL of the web page to scrape. |
| askRobot | Logical. Should the function check robots.txt before scraping? Default is FALSE. |

Returns a character vector of HTML comments found on the page.

    link <- "https://example.com"
    comments_scrap(link)
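What counts as an HTML comment can be shown on a literal HTML string with base R alone. The sketch below is not how ralger is implemented: `extract_comments()` is a hypothetical helper, and stripping the delimiters and surrounding whitespace is an assumption about the desired output.

```r
# Hypothetical helper (not part of ralger): return the text of every
# <!-- ... --> comment in an HTML string.
extract_comments <- function(html) {
  # (?s) lets "." match newlines so multi-line comments are captured;
  # ".*?" is lazy so each comment ends at its own closing delimiter.
  found <- regmatches(html, gregexpr("(?s)<!--.*?-->", html, perl = TRUE))[[1]]
  # Assumption: drop the delimiters and trim surrounding whitespace
  trimws(gsub("^<!--|-->$", "", found))
}

# Made-up markup, for illustration only
html <- "<p>Hi</p><!-- TODO: fix header --><div><!--debug--></div>"

extract_comments(html)
```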

Scrape and download CSV files from a web page.

    csv_scrap(link, path = getwd(), askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| path | the path where the CSV files are saved. Defaults to the current directory. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading CSV files from a website.

Scrape the URLs of images that don't have an 'alt' attribute.

    images_noalt_scrap(link, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the URL of the web page |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector of the URLs of images without an "alt" attribute.

    images_noalt_scrap(link = "https://www.r-consortium.org/")

Scrape image URLs.

    images_preview(link, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns the image URLs.

    images_preview(link = "https://posit.co/")

Scrape images from a web page.

    images_scrap(link, imgpath = getwd(), extn, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| imgpath | the path where the images are saved. Defaults to the current directory. |
| extn | the extension of the images: png, jpeg ... |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading images.

    ## Not run:
    images_scrap(link = "https://posit.co/", extn = "jpg")
    ## End(Not run)

This function is used to scrape text paragraphs from a website.

    paragraphs_scrap(
      link,
      contain = NULL,
      case_sensitive = FALSE,
      collapse = FALSE,
      askRobot = FALSE
    )

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| contain | filter the paragraphs according to the character string provided. |
| case_sensitive | logical. Should the contain argument be case sensitive? Defaults to FALSE. |
| collapse | if TRUE, the paragraphs are collapsed into one element and the contain argument is ignored. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector.

    # Extracting the paragraphs displayed on the health page of the New York Times
    link <- "https://www.nytimes.com/section/health"
    paragraphs_scrap(link)
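The way contain, case_sensitive and collapse interact can be sketched in base R on a vector of already scraped paragraphs. This is an illustration, not ralger's code: `filter_paragraphs()` is a hypothetical helper, treating contain as a regular expression and joining collapsed paragraphs with a single space are both assumptions.

```r
# Hypothetical helper (not part of ralger): mimic the documented behaviour
# of the contain, case_sensitive and collapse arguments.
filter_paragraphs <- function(paragraphs, contain = NULL,
                              case_sensitive = FALSE, collapse = FALSE) {
  if (collapse) {
    # contain is ignored when collapsing; the separator is an assumption
    return(paste(paragraphs, collapse = " "))
  }
  if (is.null(contain)) {
    return(paragraphs)
  }
  paragraphs[grepl(contain, paragraphs, ignore.case = !case_sensitive)]
}

# Made-up paragraphs, for illustration only
paragraphs <- c("Health news today", "health tips", "Sports results")

filter_paragraphs(paragraphs, contain = "health")
filter_paragraphs(paragraphs, contain = "health", case_sensitive = TRUE)
filter_paragraphs(paragraphs, contain = "health", collapse = TRUE)
```

The same filtering idea applies to the contain and case_sensitive arguments of titles_scrap() and weblink_scrap().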

Scrape and download PDF files from a web page.

    pdf_scrap(link, path = getwd(), askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| path | the path where the PDF files are saved. Defaults to the current directory. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading PDF files from a website.

This function is used to scrape one element from a website.

    scrap(link, node, clean = FALSE, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| node | the HTML or CSS element to consider. The SelectorGadget tool is highly recommended. |
| clean | logical. Should the function clean the extracted vector? Default is FALSE. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector.

    # Extracting IMDb top 250 movie titles
    link <- "https://www.imdb.com/chart/top/"
    node <- "h3.ipc-title__text"
    scrap(link, node)

This function is used to scrape an HTML table from a website.

    table_scrap(link, choose = 1, header = TRUE, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page containing the table to scrape |
| choose | an integer indicating which table to scrape |
| header | logical. Should the first row be used as the header? Default is TRUE. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a data frame object.

    # Extracting Premier League 2019/2020 top scorers
    link <- "https://www.topscorersfootball.com/premier-league"
    table_scrap(link)

This function is used to scrape a tibble from a website.

    tidy_scrap(link, nodes, colnames, clean = FALSE, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| nodes | the vector of HTML or CSS elements to consider. The SelectorGadget tool is highly recommended. |
| colnames | the names of the expected columns. |
| clean | logical. Should the function clean the extracted tibble? Default is FALSE. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a tidy data frame.

    # Extracting IMDb movie titles and ratings
    link <- "https://www.imdb.com/chart/top/"
    my_nodes <- c("a > h3.ipc-title__text", "span.ratingGroup--imdb-rating")
    names <- c("title", "rating")
    tidy_scrap(link, my_nodes, names)
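The pairing of nodes with colnames can be pictured as binding one scraped vector per node into a table, one column per name. The base R sketch below shows that idea only; it is not ralger's implementation. `combine_columns()` and the sample values are hypothetical, a plain data frame stands in for the tibble, and stopping on unequal lengths is an assumption rather than documented behaviour.

```r
# Hypothetical helper (not part of ralger): bind one character vector per
# node into a data frame whose columns carry the supplied names.
combine_columns <- function(columns, colnames) {
  stopifnot(length(columns) == length(colnames))
  # Assumption: every node must yield the same number of elements
  if (length(unique(lengths(columns))) != 1) {
    stop("all columns must have the same length")
  }
  as.data.frame(stats::setNames(columns, colnames), stringsAsFactors = FALSE)
}

# Made-up scraped values, for illustration only
scraped <- list(c("Movie A", "Movie B"), c("9.2", "9.0"))

combine_columns(scraped, colnames = c("title", "rating"))
```

Scraped values arrive as text, so a column such as rating would still need as.numeric() before any arithmetic.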

This function is used to scrape titles (h1, h2 and h3 HTML tags) from a website. Useful for scraping daily electronic newspapers' titles.

    titles_scrap(link, contain = NULL, case_sensitive = FALSE, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| contain | filter the titles according to the character string provided. |
| case_sensitive | logical. Should the contain argument be case sensitive? Defaults to FALSE. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector.

    # Extracting the current titles of the New York Times
    link <- "https://www.nytimes.com/"
    titles_scrap(link)

This function is used to scrape web links from a website.

    weblink_scrap(link, contain = NULL, case_sensitive = FALSE, askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page to scrape |
| contain | filter the web links according to the character string provided. |
| case_sensitive | logical. Should the contain argument be case sensitive? Defaults to FALSE. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Returns a character vector.

    # Extracting the web links within the World Bank research and publications page
    link <- "https://www.worldbank.org/en/research"
    weblink_scrap(link)

Scrape and download Excel xls files from a web page.

    xls_scrap(link, path = getwd(), askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| path | the path where the Excel xls files are saved. Defaults to the current directory. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading Excel xls files from a website.

Scrape and download Excel xlsx files from a web page.

    xlsx_scrap(link, path = getwd(), askRobot = FALSE)

| Argument | Description |
|---|---|
| link | the link of the web page |
| path | the path where the Excel xlsx files are saved. Defaults to the current directory. |
| askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |

Called for the side effect of downloading Excel xlsx files from a website.

    ## Not run:
    xlsx_scrap(
      link = "https://www.rieter.com/investor-relations/results-and-presentations/financial-statements"
    )
    ## End(Not run)