---
title: "Functions Overview"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Functions Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


The goal of **ralger** is to facilitate web scraping in R. For a quick video tutorial, I gave a talk at useR2020, which you can find [here](https://www.youtube.com/watch?v=OHi6E8jegQg) 

## Installation

You can install the `ralger` package from [CRAN](https://cran.r-project.org/) with:

```{r eval=FALSE}
install.packages("ralger")

```

or you can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("feddelegrand7/ralger")
```
## `scrap()`

This is an example which shows how to extract [top ranked universities' names](http://www.shanghairanking.com/ARWU2020.html) according to the ShanghaiRanking Consultancy:


```{r example}
library(ralger)

my_link <- "http://www.shanghairanking.com/ARWU2020.html"

my_node <- "#UniversityRanking a" # The class ID , we recommend SelectorGadget

best_uni <- scrap(link = my_link, node = my_node)

head(best_uni, 10)


```

Thanks to the [robotstxt](https://github.com/ropensci/robotstxt), you can set `askRobot = T` to ask the `robots.txt` file if it's permitted to scrape a specific web page.  


If you want to scrap multiple list pages, just use `scrap()` in conjunction with `paste0()`.


## `table_scrap()`

If you want to extract an __HTML Table__, you can use the `table_scrap()` function. Take a look at this [webpage](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW) which lists the highest gross revenues in the cinema industry. You can extract the HTML table as follows: 

```{r}


data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)


```

__When you deal with a web page that contains many HTML table you can use the `choose` argument to target a specific table__


## `tidy_scrap()`

Sometimes you'll find some useful information on the internet that you want to extract in a tabular manner however these information are not provided in an HTML format. In this context, you can use the `tidy_scrap()` function which returns a tidy data frame according to the arguments that you introduce. The function takes four arguments:

- **link** : the link of the website you're interested for;
- **nodes**: a vector of CSS elements that you want to extract. These elements will form the columns of your data frame;
- **colnames**: this argument represents the vector of names you want to assign to your columns. Note that you should respect the same order as within the **nodes** vector;
- **clean**: if true the function will clean the tibble's columns;
- **askRobot**: ask the robots.txt file if it's permitted to scrape the web page. 

### Example

We'll work on the famous [IMDb website](https://www.imdb.com/). Let's say we need a data frame composed of:

- The title of the 50 best ranked movies of all time
- Their release year
- Their rating

We will need to use the `tidy_scrap()` function as follows: 

```{r example3, message=FALSE, warning=FALSE}

my_link <- "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"

my_nodes <- c(
  ".lister-item-header a", # The title 
  ".text-muted.unbold", # The year of release 
  ".ratings-imdb-rating strong" # The rating)
  )

names <- c("title", "year", "rating") # respect the nodes order


tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)


```

Note that all columns will be of *character* class. you'll have to convert them according to your needs. 


## `titles_scrap()`

Using `titles_scrap()`, one can efficiently scrape titles which correspond to the _h1, h2 & h3_ HTML tags. 


### Example 

If we go to the [New York Times](https://www.nytimes.com/), we can easily extract the titles displayed within a specific web page : 


```{r example4}


titles_scrap(link = "https://www.nytimes.com/")


```

Further, it's possible to filter the results using the `contain` argument: 


```{r}

titles_scrap(link = "https://www.nytimes.com/", contain = "TrUMp", case_sensitive = FALSE)


```


## `paragraphs_scrap()`


In the same way, we can use the `paragraphs_scrap()` function to extract paragraphs. This function relies on the `p` HTML tag.  

Let's get some paragraphs from the lovely [ropensci.org](https://ropensci.org/) website: 


```{r}

paragraphs_scrap(link = "https://ropensci.org/")

```

If needed, it's possible to collapse the paragraphs into one bag of words: 


```{r}

paragraphs_scrap(link = "https://ropensci.org/", collapse = TRUE)

```


## `weblink_scrap()`

`weblink_scrap()` is used to srape the web links available within a web page. Useful in some cases, for example, getting a list of the available PDFs: 


```{r}

weblink_scrap(link = "https://www.worldbank.org/en/access-to-information/reports/", 
              contain = "PDF", 
              case_sensitive = FALSE)


```

## `images_scrap() ` and `images_preview()` 


`images_preview()` allows you to scrape the URLs of the images available within a web page so that you can choose which images __extension__ (see below) you want to focus on.

Let's say we want to list all the images from the official [RStudio](https://rstudio.com/) website: 


```{r}

images_preview(link = "https://rstudio.com/")

```

`images_scrap()` on the other hand download the images. It takes the following arguments: 

+ **link**: The URL of the web page; 

+ **imgpath**: The destination folder of your images. It defaults to `getwd()`

+ **extn**: the extension of the image: jpg, png, jpeg ... among others; 

+ **askRobot**: ask the robots.txt file if it's permitted to scrape the web page. 


In the following example we extract all the `png` images from [RStudio](https://rstudio.com/)  : 


```{r, eval=FALSE}

# Suppose we're in a project which has a folder called my_images: 

images_scrap(link = "https://rstudio.com/", 
             imgpath = here::here("my_images"), 
             extn = "png") # without the .

```

The images will be downloaded into the folder `here::here("myimages")`.


## Code of Conduct

Please note that the ralger project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.