
r - 'NA' does not exist in current working directory (webscraping with for loop)

I am trying to scrape the data from the tables for all German cities on this page (https://de.wikipedia.org/wiki/Liste_der_Orte_mit_Stolpersteinen#Deutschland). The first five steps collect the URLs of all the cities, and they work fine.

library(tidyverse)
library(rvest)
library(rlist)
library(stringi)
library(htmltab)
library(foreign)

#1 url Germany 
url = "https://de.wikipedia.org/wiki/Liste_der_Orte_mit_Stolpersteinen#Deutschland"

#2 get url endings of all Cities
city_urls = url %>%
  read_html() %>%
  html_nodes(xpath = '//td[7]/a') %>% 
  html_attr("title")

#3 subset German url endings (rows 19 to 1013; keep as a character vector)
city_urls = city_urls[19:1013]

#4 concatenate url start and endings (paste0 is vectorized, so no loop is needed)
URLs_germany = paste0('https://de.wikipedia.org/wiki/', city_urls)

#5 correction of urls -> add the missing "_" between words (kept as a character
#  vector; converting to a factor is unnecessary here)
Stolpersteine_cities = str_replace_all(URLs_germany, " ", "_")
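A quick spot check confirms the URLs were built correctly before looping over them:

#check a few of the constructed urls
head(Stolpersteine_cities, 3)
length(Stolpersteine_cities)        # 1013 - 19 + 1 = 995 cities
read_html(Stolpersteine_cities[1])  # should parse without an error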

The problem occurs at step 6. With this for loop I want to get all the data from the respective pages as well as the geo data. When I execute it, I get the error "'NA' does not exist in current working directory". I've seen the related question on Stack Overflow (Error: 'NA' does not exist in current working directory (Webscraping)), but I couldn't apply the solutions mentioned there to my case.

#6 loop through all cities
for (i in Stolpersteine_cities) {
  
  city <- read_html(i)
  
  sample <- city %>%
    html_node(xpath = '//*[@id="mw-content-text"]/div/table') %>% 
    html_table()
  
  #find geolocation link labelled "Standort" (note: geo_link is not used below)
  geo_link <- city %>%
    html_node(xpath = '//*[text()="Standort"]') %>%
    html_attr("href")
  
  geo_links <- city %>%
    html_nodes("table") %>%
    html_nodes("tbody") %>%
    html_nodes("tr") %>%
    html_nodes("td") %>%
    html_nodes("small") %>%
    html_nodes("a") %>%
    html_attr("href")
    
  long_lat_list <- vector("list", nrow(sample))  # assumes one geo link per table row
  #extract latitude/longitude from each geo link
  for(k in 1:length(geo_links)){
    
    geo_info <- read_html(geo_links[k])
    
    lat <- geo_info %>%
      html_node(xpath = '//span[@class="latitude"]') %>%
      html_text()
    
    long <- geo_info %>%
      html_node(xpath = '//*[@class="longitude"]') %>%
      html_text()
    
    long_lat_list[[k]] <- list(latitude=lat, longitude=long)
    
  }
  
  sample$latitude <- lapply(long_lat_list, "[[", 1)
  sample$longitude <- lapply(long_lat_list, "[[", 2)
  
  #Save city X (derive the file name from the url so each city gets its own file
  #  instead of overwriting one "filename.Rds" on every iteration)
  saveRDS(sample, paste0(basename(i), ".Rds"))
  
}

I then tried to execute the for loop with just the first 4 cities/URLs. While the first two URLs work, the third one leads to the mentioned error. But I couldn't identify any differences between the tables on Wikipedia, so I don't really see what the problem is.
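For debugging runs like this, wrapping each iteration in tryCatch() reports the failing URL instead of stopping the loop (a sketch that only re-runs the link-extraction step):

#debugging harness: first 4 cities only
for (i in head(Stolpersteine_cities, 4)) {
  tryCatch({
    links <- read_html(i) %>%
      html_nodes("table tbody tr td small a") %>%
      html_attr("href")
    message(i, ": ", length(links), " geo links found")
  }, error = function(e) message("failed on ", i, ": ", conditionMessage(e)))
}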

I would be grateful for any help you’re able to provide.

Question from: https://stackoverflow.com/questions/65915412/na-does-not-exist-in-current-working-directory-webscraping-with-for-loop

1 Reply

The error occurs at the line geo_info <- read_html(geo_links[k]). The issue is that geo_links is empty, so 1:length(geo_links) evaluates to 1:0, which is the vector c(1, 0), and the for loop still runs.

Then geo_info <- read_html(geo_links[k]) tries to access the first element of geo_links. Since the vector is empty, geo_links[1] returns NA, and when read_html() tries to read that "url" it throws this error message (it apparently treats NA as a file name and looks for a file called "NA" in the working directory).
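A minimal illustration of both effects, with no scraping involved:

geo_links <- character(0)   # what html_attr() yields when nothing matches
1:length(geo_links)         # 1:0, i.e. c(1, 0): the loop body still runs
seq_along(geo_links)        # integer(0): a loop over this is skipped
geo_links[1]                # NA: indexing past the end of an empty vector
#read_html(geo_links[1])    # would throw "'NA' does not exist in current working directory"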

So you should test the length of geo_links and only enter the for loop if length(geo_links) > 0:

  if (length(geo_links) > 0) {
    for(k in 1:length(geo_links)){
      
      geo_info <- read_html(geo_links[k])
      
      lat <- geo_info %>%
        html_node(xpath = '//span[@class="latitude"]') %>%
        html_text()
      
      long <- geo_info %>%
        html_node(xpath = '//*[@class="longitude"]') %>%
        html_text()
      
      long_lat_list[[k]] <- list(latitude=lat, longitude=long)
      
    }
    
    sample$latitude <- lapply(long_lat_list, "[[", 1)
    sample$longitude <- lapply(long_lat_list, "[[", 2)
  }
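Alternatively, seq_along() produces an empty index sequence for an empty vector, so the loop is skipped automatically and no explicit guard is needed:

  for (k in seq_along(geo_links)) {   # seq_along(character(0)) is integer(0)
    
    geo_info <- read_html(geo_links[k])
    
    lat <- geo_info %>%
      html_node(xpath = '//span[@class="latitude"]') %>%
      html_text()
    
    long <- geo_info %>%
      html_node(xpath = '//*[@class="longitude"]') %>%
      html_text()
    
    long_lat_list[[k]] <- list(latitude = lat, longitude = long)
  }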

The reason you get an empty geo_links vector for some of these links is that the tables are not structured identically across the different pages.

You look for the geolocation data in nodes with the tag "small". That works for the first two pages, but on the third page there is no "small" node and the geolocation data is tagged differently...
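One way to make the extraction less sensitive to the wrapper tag is to collect every link in the table and filter by target instead. This is only a sketch; it assumes the coordinate links point to the geohack service, which is how coordinate links on Wikipedia pages are typically built:

  #keep any table link that points at geohack, regardless of the wrapper tag
  geo_links <- city %>%
    html_nodes("table a") %>%
    html_attr("href")
  geo_links <- geo_links[grepl("geohack", geo_links, fixed = TRUE)]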

