Title: Easy Web Scraping
Description: The goal of 'ralger' is to facilitate web scraping in R.
Authors: Mohamed El Fodil Ihaddaden [aut, cre], Ezekiel Ogundepo [ctb], Romain François [ctb]
Maintainer: Mohamed El Fodil Ihaddaden <[email protected]>
License: MIT + file LICENSE
Version: 2.2.4
Built: 2024-11-13 05:15:14 UTC
Source: https://github.com/feddelegrand7/ralger

This function scrapes attributes from HTML elements.

Usage:
attribute_scrap(link, node, attr, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
node: the HTML element to consider
attr: the attribute to scrape
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector.

Examples:
# Scraping the class attributes of all anchor elements on the rOpenSci home page
link <- "https://ropensci.org/"
attribute_scrap(link = link, node = "a", attr = "class")

Scrape and download CSV files from a Web Page
Usage:
csv_scrap(link, path = getwd(), askRobot = FALSE)

Arguments:
link: the link of the web page
path: the path where the CSV files should be saved. Defaults to the current directory.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: called for the side effect of downloading CSV files from a website.

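A minimal usage sketch (not from the package documentation; the URL is a placeholder, and the call is wrapped in a Not-run block because it downloads files):

## Not run:
# Hypothetical page that links to CSV files; save them to a temporary directory
csv_scrap(link = "https://example.com/datasets", path = tempdir())
## End(Not run)
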
Scrape image URLs that don't have an 'alt' attribute
Usage:
images_noalt_scrap(link, askRobot = FALSE)

Arguments:
link: the URL of the web page
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector of the URLs of images that have no "alt" attribute.

Examples:
images_noalt_scrap(link = "https://www.r-consortium.org/")

Scrape Image URLs
Usage:
images_preview(link, askRobot = FALSE)

Arguments:
link: the link of the web page
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector of image URLs.

Examples:
images_preview(link = "https://posit.co/")

Scrape Images from a Web Page
Usage:
images_scrap(link, imgpath = getwd(), extn, askRobot = FALSE)

Arguments:
link: the link of the web page
imgpath: the path where the images should be saved. Defaults to the current directory.
extn: the extension of the images to download (e.g. png, jpeg)
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: called for the side effect of downloading images.

Examples:
## Not run:
images_scrap(link = "https://posit.co/", extn = "jpg")
## End(Not run)

This function scrapes text paragraphs from a website.

Usage:
paragraphs_scrap(link, contain = NULL, case_sensitive = FALSE, collapse = FALSE, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
contain: filter the paragraphs according to the character string provided
case_sensitive: logical. Should the contain argument be case sensitive? Defaults to FALSE.
collapse: logical. If TRUE, the paragraphs are collapsed into one element and the contain argument is ignored.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector.

Examples:
# Extracting the paragraphs displayed on the health page of the New York Times
link <- "https://www.nytimes.com/section/health"
paragraphs_scrap(link)

Scrape and download PDF files from a Web Page
Usage:
pdf_scrap(link, path = getwd(), askRobot = FALSE)

Arguments:
link: the link of the web page
path: the path where the PDF files should be saved. Defaults to the current directory.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: called for the side effect of downloading PDF files from a website.

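A minimal usage sketch (not from the package documentation; the URL is a placeholder, and the call is wrapped in a Not-run block because it downloads files):

## Not run:
# Hypothetical page that links to PDF reports; save them to a temporary directory
pdf_scrap(link = "https://example.com/reports", path = tempdir())
## End(Not run)
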
This function scrapes one element from a website.

Usage:
scrap(link, node, clean = FALSE, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
node: the HTML or CSS selector of the element to consider; the SelectorGadget tool is highly recommended
clean: logical. Should the function clean the extracted vector? Defaults to FALSE.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector.

Examples:
# Extracting the IMDb Top 250 movie titles
link <- "https://www.imdb.com/chart/top/"
node <- "h3.ipc-title__text"
scrap(link, node)

This function scrapes an HTML table from a website.

Usage:
table_scrap(link, choose = 1, header = TRUE, askRobot = FALSE)

Arguments:
link: the link of the web page containing the table to scrape
choose: an integer indicating which table to scrape
header: logical. Should the first row be treated as the table header? Defaults to TRUE.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a data frame object.

Examples:
# Extracting the Premier League 2019/2020 top scorers
link <- "https://www.topscorersfootball.com/premier-league"
table_scrap(link)

This function scrapes several elements from a website and returns them as a tibble.

Usage:
tidy_scrap(link, nodes, colnames, clean = FALSE, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
nodes: the vector of HTML or CSS selectors of the elements to consider; the SelectorGadget tool is highly recommended
colnames: the names of the expected columns
clean: logical. Should the function clean the extracted tibble? Defaults to FALSE.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a tidy data frame.

Examples:
# Extracting IMDb movie titles and ratings
link <- "https://www.imdb.com/chart/top/"
my_nodes <- c("a > h3.ipc-title__text", "span.ratingGroup--imdb-rating")
names <- c("title", "rating")
tidy_scrap(link, my_nodes, names)

This function scrapes titles (h1, h2 and h3 HTML tags) from a website. Useful for scraping the headlines of daily electronic newspapers.

Usage:
titles_scrap(link, contain = NULL, case_sensitive = FALSE, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
contain: filter the titles according to the character string provided
case_sensitive: logical. Should the contain argument be case sensitive? Defaults to FALSE.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector.

Examples:
# Extracting the current titles of the New York Times
link <- "https://www.nytimes.com/"
titles_scrap(link)

This function scrapes web links from a website.

Usage:
weblink_scrap(link, contain = NULL, case_sensitive = FALSE, askRobot = FALSE)

Arguments:
link: the link of the web page to scrape
contain: filter the web links according to the character string provided
case_sensitive: logical. Should the contain argument be case sensitive? Defaults to FALSE.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: a character vector.

Examples:
# Extracting the web links within the World Bank research and publications page
link <- "https://www.worldbank.org/en/research"
weblink_scrap(link)

Scrape and download Excel xls files from a Web Page
Usage:
xls_scrap(link, path = getwd(), askRobot = FALSE)

Arguments:
link: the link of the web page
path: the path where the Excel xls files should be saved. Defaults to the current directory.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: called for the side effect of downloading Excel xls files from a website.

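A minimal usage sketch (not from the package documentation; the URL is a placeholder, and the call is wrapped in a Not-run block because it downloads files):

## Not run:
# Hypothetical page that links to xls workbooks; save them to a temporary directory
xls_scrap(link = "https://example.com/workbooks", path = tempdir())
## End(Not run)
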
Scrape and download Excel xlsx files from a Web Page
Usage:
xlsx_scrap(link, path = getwd(), askRobot = FALSE)

Arguments:
link: the link of the web page
path: the path where the Excel xlsx files should be saved. Defaults to the current directory.
askRobot: logical. Should the function consult robots.txt to check whether scraping the page is allowed? Defaults to FALSE.

Value: called for the side effect of downloading Excel xlsx files from a website.

Examples:
## Not run:
xlsx_scrap(
  link = "https://www.rieter.com/investor-relations/results-and-presentations/financial-statements"
)
## End(Not run)