In this article, we are going to talk about how to collect URLs from the website we would like to scrape.
We will use some simple regex and XPath rules for this, and then we will jump into writing scripts to collect data from the website. We will also play with data, draw some plots, and create some charts.
We will collect a dataset from a blog, which is about big data (www.devveri.com). This website provides useful information about big data and data science domains. It is totally free of charge. People can visit this website and find use cases, exercises, and discussions regarding big data technologies.
Let's start collecting information to find out how many articles there are in each category. You can find this information on the main page of the blog, using the following URL: http://devveri.com/ .
As you can see, the left-hand side shows the articles that were published recently. On the right-hand side, you can see the categories and the article count for each category:
To collect the information about how many articles we have for each category, we will use the landing page URL of the website. We will be interested in the right-hand side of the web page shown in the following image:
The following code could be used to load the library and store the URL to the variable:
library(rvest)
urls <- "http://devveri.com/"
If we print the urls variable, it will look like the following image in RStudio:
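For reference, printing the variable in the console produces output along these lines:

> urls
[1] "http://devveri.com/"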
Now let's talk about the comment counts of the articles. Because this website shares useful information about current developments in the big data and data science domains, readers can easily ask the author questions or discuss an article with other readers simply by commenting.
Also, it's easy to see comment counts for each article on the category page. You can see one of the articles that was already commented on by readers in the following screenshot. As you can see, this article was commented on three times:
In the following section, we will also write XPath rules to collect this information; then we will write an R script and, finally, play with the data to create some charts and plots.
Writing XPath rules
In this part, we are going to create our XPath rules to parse the HTML document we will collect:
- First of all, we will write XPath rules to collect information from the left-hand side of the web page.
- Let's navigate to the landing page of the website devveri.com and use Google Developer Tools to create and test XPath rules.
- To use Google Developer Tools, we can right-click on the element that we are interested in.
- Click Inspect Element. In the following screenshot, we marked the elements regarding categories:
Let's write XPath rules to get the categories. We are looking for the information about how many articles there are for each category and the name of the categories:
$x('/html/body/div[3]/div/div[2]/div[1]/ul/li/a/text()')
If you type the XPath rule into the Developer Tools console, you will get the following elements. As you can see, we have 18 text elements, because there are 18 categories shown on the right-hand side of the page:
Let's open a text element and see how it looks. In the next part, we will experience how we can extract this information with R. As you can see from the wholeText section, we only have category names:
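As a quick preview of the scripting section, the same rule can be checked from R with rvest. This is only a sanity check (using throwaway variable names); the full script comes later:

library(rvest)

#read the landing page and apply the category-name XPath rule
h <- read_html("http://devveri.com/")
category_nodes <- html_nodes(h, xpath = '/html/body/div[3]/div/div[2]/div[1]/ul/li/a/text()')

#html_text() returns the wholeText values, that is, the category names
html_text(category_nodes)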
Still, we will need to collect article counts for each category:
- Use the following XPath rule; it will help to collect this information from the web page:
$x('/html/body/div[3]/div/div[2]/div[1]/ul/li/text()')
- If you type the XPath rule into the Developer Tools console, you will get the following elements:
As you can see, we have 18 text elements, because there are 18 categories shown on the right-hand side of the page.
Now it's time to start collecting information about the articles. At this stage, we are going to collect the comment counts for the articles that were written recently, so it would also be good to have the article names and dates. If we inspect the name of the first article, we will get the element containing the article title, as shown in the following screenshot:
- Let's write XPath rules to get the name of the article:
$x('/html/body/div[3]/div/div[1]/div/h2/a/text()')
- If you type the XPath rule to the Developer Tools console, you will get the following elements. As you can see, we have 15 text elements, because there are 15 article previews on this page:
- Let's open the first text element and see how it looks. As you see, we managed to get the text content that we are interested in. In the next part, we will experience how to extract this information with R:
- We have the names of the articles. As we decided, we should also collect the dates and comment counts of the articles. The following XPath rule will help us to collect the created dates of the articles in text format:
$x('/html/body/div[3]/div/div[1]/div/p[1]/span[1]/text()')
- If you type the XPath rule on the Developer Tools console, you will get the elements, as shown in the following screenshot. As you can see, we have 15 text elements regarding dates, because there are 15 article previews on this page:
- Let's open the first text element and see how it looks. As you can see, we managed to get the text content that we are interested in:
- We have the names and the created dates of the articles. As we decided, we still need to collect the comment counts. The following XPath rule will help us to collect them:
$x('/html/body/div[3]/div/div[1]/div/p[1]/span[4]/a/text()')
- If you type the XPath rule to the Developer Tools console, you will get the elements, as shown in the following screenshot. As you can see, we have 15 text elements regarding comment counts, because there are 15 article previews on this page:
- Let's open the first text element and see how it looks. We managed to get the text content that we are interested in:
Writing your first scraping script
Let's start to write our first scraping script using R. In the previous sections, we already created the XPath rules and the URLs that we are interested in. We will start by collecting the category names and the number of articles in each category:
- First of all, we need to load the rvest library using the library function:
library(rvest)
- Now we need to create NULL variables, because we are going to save the article count for each category and the names of the categories.
- For this purpose, we are creating category and count variables:
#creating NULL variables
category <- NULL
count <- NULL
- Now it's time to create a variable that holds the URL we would like to visit to collect data. With the following code block, we assign the URL to the urls variable:
#links for page
urls <- "http://devveri.com/"
Now for the most exciting part: Collecting data!
The following script first visits the URL of the web page and reads the HTML using the read_html function. To parse the HTML nodes, we use the html_nodes function with the XPath rules we created in the previous section:
library(rvest)

#creating NULL variables
category <- NULL
count <- NULL

#links for page
urls <- "http://devveri.com/"

#reading main url
h <- read_html(urls)

#getting categories
c <- html_nodes(h, xpath = '/html/body/div[3]/div/div[2]/div[1]/ul/li/a/text()')

#getting counts
cc <- html_nodes(h, xpath = '/html/body/div[3]/div/div[2]/div[1]/ul/li/text()')

#saving results, converting XMLs to character
category <- as.matrix(as.character(c))
count <- as.matrix(as.character(cc))
- We can use the data.frame function to see categories and counts together.
- When you run the data.frame command shown on the first line of the following block, you will get this result in R:
> data.frame(category,count)
            category  count
1           Big Data  (11)\n
2              Cloud   (3)\n
3             docker   (1)\n
4   Doğal Dil İşleme   (2)\n
5      ElasticSearch   (4)\n
6              Graph   (1)\n
7           Haberler   (7)\n
8             Hadoop  (24)\n
9              HBase   (1)\n
10             Kitap   (1)\n
11     Lucene / Solr   (3)\n
12             Nosql  (12)\n
13 Ölçeklenebilirlik   (2)\n
14          Polyglot   (1)\n
15             Sunum   (1)\n
16       Veri Bilimi   (2)\n
17  Veri Madenciliği   (4)\n
18     Yapay Öğrenme   (3)\n
- Now it's time to collect the names, comment counts, and dates of the articles that were written recently.
- As before, we load the rvest library using the library function:
library(rvest)
- Now we need to create NULL variables. Because we are going to save the comment counts, dates, and names of the articles, we create the name, date, and comment_count variables:
#creating NULL variables
name <- NULL
date <- NULL
comment_count <- NULL
The following script first visits the URL of the web page and reads the HTML using the read_html function. To parse the HTML nodes, we use the html_nodes function with the XPath rules we created in the previous section:
#creating NULL variables
name <- NULL
date <- NULL
comment_count <- NULL

#links for page
urls <- "http://devveri.com/"

#reading main url
h <- read_html(urls)

#getting names
n <- html_nodes(h, xpath = '/html/body/div[3]/div/div[1]/div/h2/a/text()')

#getting dates
d <- html_nodes(h, xpath = '/html/body/div[3]/div/div[1]/div/p[1]/span[1]/text()')

#getting comment counts
comc <- html_nodes(h, xpath = '/html/body/div[3]/div/div[1]/div/p[1]/span[4]/a/text()')

#saving results
name <- as.matrix(as.character(n))
date <- as.matrix(as.character(d))
comment_count <- as.matrix(as.character(comc))
We managed to collect the name, comment counts, and the date of the articles:
- We can use the data.frame function to see the name, date, and comment_count variables together:
> data.frame(name,date,comment_count)
                                                       name            date comment_count
1                                      Amazon EMR ile Spark    18 Ocak 2018             0
2                                                Amazon EMR    13 Ocak 2018             0
3                                          AWS ile Big Data    11 Ocak 2018             0
4                                         Apache Hadoop 3.0    10 Ocak 2018             0
5                      Big Data Teknolojilerine Hızlı Giriş 19 Haziran 2017             1
6    Günlük Hayatta Yapay Zekâ Teknikleri – Yazı Dizisi (1)    29 Mart 2016             0
7                     Hive Veritabanları Arası Tablo Taşıma   18 Şubat 2016             0
8                                    Basit Lineer Regresyon   11 Şubat 2016             2
9                           Apache Sentry ile Yetkilendirme    10 Ocak 2016             0
10                              Hive İç İçe Sorgu Kullanımı  09 Aralık 2015             2
11                              Kmeans ve Kmedoids Kümeleme  07 Aralık 2015             0
12                       Veri analizinde yeni alışkanlıklar   25 Kasım 2015             0
13 Daha İyi Bir Veri Bilimcisi Olmanız İçin 5 İnanılmaz Yol   02 Kasım 2015             1
14   R ile Korelasyon, Regresyon ve Zaman Serisi Analizleri    12 Ekim 2015             3
15                           Data Driven Kavramı ve II. Faz   28 Eylül 2015             0
Playing with the data
We have two different datasets. We’ve already collected categories and article counts for each category, and we have already collected the name, date, and comment counts of the articles that were written recently.
- We should apply some basic text manipulation to get the counts into a more usable format. Because the counts currently look as shown here, we have to strip out the extra characters:
> count
      [,1]
 [1,] " (11)\n"
 [2,] " (3)\n"
 [3,] " (1)\n"
 [4,] " (2)\n"
 [5,] " (4)\n"
 [6,] " (1)\n"
 [7,] " (7)\n"
 [8,] " (24)\n"
 [9,] " (1)\n"
[10,] " (1)\n"
[11,] " (3)\n"
[12,] " (12)\n"
[13,] " (2)\n"
[14,] " (1)\n"
[15,] " (1)\n"
[16,] " (2)\n"
[17,] " (4)\n"
[18,] " (3)\n"
- We should replace "\n", "(", and ")" with "". For this, we are going to use the str_replace_all function. To use it, we need to install the stringr package and load it:
library(stringr)

count <- str_replace_all(count, "\\(", "")
count <- str_replace_all(count, "\\)", "")
count <- str_replace_all(count, "\n", "")
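If you would rather avoid the extra dependency, the same cleanup can be done with base R's gsub. This is just an equivalent sketch; the rest of the chapter sticks with stringr:

#optional base R alternative: strip parentheses and newlines in one pass
count <- gsub("[()\n]", "", count)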
- Now we have the article counts in a better format. If we create the data frame using the new version of the count variable and article categories, we will get the following result:
> data.frame(category,count)
            category count
1           Big Data    11
2              Cloud     3
3             docker     1
4   Doğal Dil İşleme     2
5      ElasticSearch     4
6              Graph     1
7           Haberler     7
8             Hadoop    24
9              HBase     1
10             Kitap     1
11     Lucene / Solr     3
12             Nosql    12
13 Ölçeklenebilirlik     2
14          Polyglot     1
15             Sunum     1
16       Veri Bilimi     2
17  Veri Madenciliği     4
18     Yapay Öğrenme     3
- Let's assign this data frame to a variable and cast the counts to numeric, because they are still in string format. If we run the following code, we will create a new data frame and convert the counts to numeric format:
categories <- data.frame(category, count)
categories$count <- as.numeric(categories$count)
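One caveat, depending on your R version: before R 4.0, data.frame converts character columns to factors by default, and calling as.numeric on a factor returns the internal level codes rather than the printed values. If you are on an older R, a safer version of this step (and of the comment_count conversion later) is the following sketch:

#keep strings as strings and coerce via character first (needed on R < 4.0)
categories <- data.frame(category, count, stringsAsFactors = FALSE)
categories$count <- as.numeric(as.character(categories$count))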
Now we are ready to create some charts:
- To do this, we can use the interactive plotting library of R, plotly.
- You can install it using the install.packages("plotly") command.
- Then, of course, we have to call this library using the library(plotly) command:
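Putting those two steps together, the setup looks like this (the install only needs to be run once):

#install plotly once, then load it for the session
install.packages("plotly")
library(plotly)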
The following command will help us to create a bar chart showing the article counts for each category:

plot_ly(categories, x = ~category, y = ~count, type = 'bar')
- We can create some charts using our second dataset that is about the date, name, and comment counts of articles that were written recently. If you remember, we already collected the following data for this purpose:
> data.frame(name,date,comment_count)
                   name         date comment_count
1  Amazon EMR ile Spark 18 Ocak 2018             0
2            Amazon EMR 13 Ocak 2018             0
3      AWS ile Big Data 11 Ocak 2018             0
...
We are ready to create our final data frame. But, don't forget, comment counts are still in the string format:
- We have to cast them to numeric format. For this purpose, we can use the as.numeric function:
comments <- data.frame(name, date, comment_count)
comments$comment_count <- as.numeric(comments$comment_count)
Now we're ready to go! Let's calculate the average comment count per date:
- To do this, we can use the aggregate function:
avg_comment_counts <- aggregate(comment_count ~ date, data = comments, FUN = "mean")
- Now we have the daily average comment counts; let's create a line chart to see how they change over time:
plot(avg_comment_counts, type = "l")
The following line chart shows us the average comment counts based on dates:
Now, let's investigate the dataset a bit more. It would be useful to see the summary statistics of the comment counts. In this part, we are going to calculate the minimum, maximum, mean, and median of the comment counts and then create a bar chart that shows those summary statistics.
By using the following commands, we can calculate those summary statistics:
min_comment_count <- min(comments$comment_count)
max_comment_count <- max(comments$comment_count)
avg_comment_count <- mean(comments$comment_count)
median_comment_count <- median(comments$comment_count)
Let's create a data frame that contains the calculated metrics:
summary <- data.frame(min_comment_count, max_comment_count, avg_comment_count, median_comment_count)
Now that we have the summary statistics, we can create a bar chart from those values using the following commands. Because our plot will contain more than one category, we are going to use the add_trace function:
plot_ly(x = "min", y = summary$min_comment_count, type = 'bar', name = 'min') %>%
  add_trace(x = "max", y = summary$max_comment_count, type = 'bar', name = 'max') %>%
  add_trace(x = "avg", y = summary$avg_comment_count, type = 'bar', name = 'average') %>%
  add_trace(x = "median", y = summary$median_comment_count, type = 'bar', name = 'median')
As you can see, this bar chart shows the summary statistics of the comment counts:
That’s it! If you enjoyed reading this article and want to learn more about web scraping with R, you can explore R Web Scraping Quick Start Guide. Written by Olgun Aydin, a PhD candidate at the Department of Statistics at Mimar Sinan University, R Web Scraping Quick Start Guide is for R programmers who want to get started quickly with web scraping, as well as data analysts who want to learn scraping using R.