User:Tagishsimon/Junk4
Scraping Legislation.gov.uk
I want to improve my scraping of legislation.gov.uk.
Example: https://www.legislation.gov.uk/uksi/2020 - contains a couple of tables.
Right now I'm using a bit of Nokogiri code I found on SourceForge to parse values from the table into a CSV. However, it only captures the anchor text, not the hyperlinks.
The sections below show:
- the code I'm using
- sample of the pertinent table from https://www.legislation.gov.uk/uksi/2020
- current code output
- desired code output
The ask: might you be able either to amend the current Ruby so that it does what I'm after, or to supply some other Perl/Python/??? code that produces the desired output?
The documents I parse will have tens to hundreds of tables: I'm fetching lots of pages from legislation.gov.uk with wget, concatenating them into a single file, and running the current Ruby across that. Downstream, I put the CSV into a larger spreadsheet which generates the QuickStatements required to append new items.
Current ruby
require 'nokogiri'

# Any second command-line argument switches on output of <th> cells.
print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|
  table_string = f
  doc = Nokogiri::HTML(table_string)
  # Walk every row of every table in the document.
  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    # cell.text returns only the text content, so each link's href is dropped here.
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end
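A side note on the quoting, since the CSV goes into a spreadsheet: inside single quotes Ruby keeps `'\"'` as a backslash followed by a quote, so the `print` calls above write embedded quotes as `\"`, whereas the usual CSV convention is to double them (`""`). A minimal sketch of leaving the quoting to Ruby's `csv` library instead follows; `csv_line` is a hypothetical helper name, and the sample row is made up purely for illustration.

```ruby
require 'csv'

# Hypothetical helper: formats one row of cell texts as a single CSV line.
# Whitespace runs (including newlines) are collapsed to one space first.
def csv_line(cells)
  cleaned = cells.map { |text| text.gsub(/\s+/, ' ').strip }
  # CSV.generate_line quotes fields only where needed and doubles embedded quotes.
  CSV.generate_line(cleaned)
end

# Made-up row, not taken from legislation.gov.uk.
puts csv_line(['The "Example" Order 2020', '2020 No. 1', 'UK Statutory Instruments'])
# prints: "The ""Example"" Order 2020",2020 No. 1,UK Statutory Instruments
```

This also drops the trailing `", ` separator that the current script emits after the last cell of each row.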
Data table
<table>
<thead>
<tr>
<th><a href="/uksi/2020?sort=title" class="sortAsc" title="Sort ascending by Title"><span class="accessibleText">Sort ascending by </span>Title</a></th>
<th><span>Years and Numbers</span></th>
<th><span>Legislation type</span></th>
</tr>
</thead>
<tbody>
<tr class="oddRow">
<td class="bilingual en"><a href="/wsi/2020/1064/contents/made">The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020</a></td>
<td rowspan="2"><a href="/wsi/2020/1064/contents/made">2020 No. 1064 (W. 239)</a></td>
<td rowspan="2">Wales Statutory Instruments</td>
</tr>
<tr class="oddRow">
<td class="bilingual cy"><a href="/wsi/2020/1064/contents/made/welsh" xml:lang="cy">Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020</a></td>
</tr>
<tr>
<td><a href="/uksi/2020/1063/contents/made">The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020</a></td>
<td><a href="/uksi/2020/1063/contents/made">2020 No. 1063</a></td>
<td>UK Statutory Instruments</td>
</tr>
</tbody>
</table>
Current output
The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020 | 2020 No. 1064 (W. 239) | Wales Statutory Instruments |
Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020 | ||
The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020 | 2020 No. 1063 | UK Statutory Instruments |
Desired output
/wsi/2020/1064/contents/made | The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020 | /wsi/2020/1064/contents/made | 2020 No. 1064 (W. 239) | Wales Statutory Instruments |
/wsi/2020/1064/contents/made/welsh | Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020 | |||
/uksi/2020/1063/contents/made | The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020 | /uksi/2020/1063/contents/made | 2020 No. 1063 | UK Statutory Instruments |