User:Tagishsimon/Junk4

From Wikipedia, the free encyclopedia

Scraping Legislation.gov.uk

I want to improve my scraping of legislation.gov.uk.

Example: https://www.legislation.gov.uk/uksi/2020, which contains a couple of tables.

Right now I'm using a bit of Nokogiri code I found on SourceForge to parse values from the table into a CSV. However, it only extracts the anchor text, not the hyperlinks.

The sections below show the current Ruby, a sample data table, the current output and the desired output.

The ask: might you be able either to amend the current Ruby so that it does what I'm after, or to supply some other Perl/Python/??? code that produces the desired output?

The documents I parse will have tens to hundreds of tables: I'm fetching lots of pages from legislation.gov.uk with wget, concatenating them into a single file, and running the current Ruby across that. Downstream, I put the CSV into a larger spreadsheet which generates the QuickStatements required to append new items.

Current Ruby

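 require 'nokogiri'

 # ARGV[0] is the HTML file to parse; if ARGV[1] is given, header (th) cells are printed too.
 print_header_lines = ARGV[1]

 # Print a cell's text as a quoted field: newlines become spaces, quotes are
 # backslash-escaped, and runs of whitespace are collapsed.
 # cell.text returns text content only, so href attributes are dropped here.
 def print_cell(cell)
   print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
 end

 File.open(ARGV[0]) do |f|
   doc = Nokogiri::HTML(f)
   doc.xpath('//table//tr').each do |row|
     row.xpath('th').each { |cell| print_cell(cell) } if print_header_lines
     row.xpath('td').each { |cell| print_cell(cell) }
     print "\n"
   end
 end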
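
A side note on the CSV itself, sketched in Python since the ask allows it: the script above escapes embedded quotes with a backslash and leaves a trailing ", " on every line, which some spreadsheet importers may not read as intended. Python's standard csv module doubles embedded quotes instead. The helper name to_csv_line is made up for this illustration.

```python
import csv
import io


def to_csv_line(fields):
    """Return one CSV line for a list of strings (hypothetical helper)."""
    buf = io.StringIO()
    # The default dialect quotes only fields that need it and doubles any
    # embedded double quotes; lines end with "\r\n".
    csv.writer(buf).writerow(fields)
    return buf.getvalue()
```

For example, to_csv_line(['say "hi"', 'plain']) gives "say ""hi""",plain followed by a CRLF line ending.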

Data table

<table>
  <thead>
    <tr>
      <th><a href="/uksi/2020?sort=title" class="sortAsc" title="Sort ascending by Title"><span class="accessibleText">Sort ascending by </span>Title</a></th>
      <th><span>Years and Numbers</span></th>
      <th><span>Legislation type</span></th>
    </tr>
  </thead>
  <tbody>
    <tr class="oddRow">
      <td class="bilingual en"><a href="/wsi/2020/1064/contents/made">The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020</a></td>
      <td rowspan="2"><a href="/wsi/2020/1064/contents/made">2020 No. 1064 (W. 239)</a></td>
      <td rowspan="2">Wales Statutory Instruments</td>
    </tr>
    <tr class="oddRow">
      <td class="bilingual cy"><a href="/wsi/2020/1064/contents/made/welsh" xml:lang="cy">Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020</a></td>
    </tr>
    <tr>
      <td><a href="/uksi/2020/1063/contents/made">The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020</a></td>
      <td><a href="/uksi/2020/1063/contents/made">2020 No. 1063</a></td>
      <td>UK Statutory Instruments</td>
    </tr>

Current output

The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020 2020 No. 1064 (W. 239) Wales Statutory Instruments
Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020
The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020 2020 No. 1063 UK Statutory Instruments


Desired output

/wsi/2020/1064/contents/made The Representation of the People (Electoral Register Publication Date) (Wales) (Coronavirus) Regulations 2020 /wsi/2020/1064/contents/made 2020 No. 1064 (W. 239) Wales Statutory Instruments
/wsi/2020/1064/contents/made/welsh Rheoliadau Cynrychiolaeth y Bobl (Dyddiad Cyhoeddi’r Gofrestr Etholiadol) (Cymru) (Coronafeirws) 2020
/uksi/2020/1063/contents/made The Safety of Sports Grounds (Designation) (Amendment) (England) (No. 4) Order 2020 /uksi/2020/1063/contents/made 2020 No. 1063 UK Statutory Instruments
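
One possible way to produce this, as a minimal sketch in Python using only the standard library's html.parser and csv modules. For every table cell it emits the href of each link found in the cell, followed by the cell's whitespace-collapsed text. The names TableLinkParser, extract_rows and main are invented for this sketch. It assumes the input file is UTF-8, writes CSV rather than the space-separated layout shown above, skips rows with no kept cells instead of printing blank lines, and does not expand rowspan (which matches the desired output for the Welsh-language rows).

```python
import csv
import sys
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect rows of cells; each link contributes its href before the cell text."""

    def __init__(self, include_headers=False):
        super().__init__(convert_charrefs=True)
        self.include_headers = include_headers
        self.rows = []      # finished rows, each a list of strings
        self._row = None    # fields of the open row, or None outside <tr>
        self._hrefs = []    # hrefs seen in the open cell
        self._text = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.finish_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()
            # Header cells are kept only when asked for.
            if tag == "td" or self.include_headers:
                self._hrefs, self._text = [], []
        elif tag == "a" and self._text is not None:
            href = dict(attrs).get("href")
            if href:
                self._hrefs.append(href)

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag == "tr":
            self.finish_row()

    def _finish_cell(self):
        if self._text is not None:
            # Collapse all runs of whitespace (including newlines) to one space.
            text = " ".join("".join(self._text).split())
            self._row.extend(self._hrefs + [text])
            self._text = None

    def finish_row(self):
        self._finish_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_rows(html, include_headers=False):
    """Return one list of fields per non-empty table row in the HTML text."""
    parser = TableLinkParser(include_headers)
    parser.feed(html)
    parser.close()
    parser.finish_row()  # flush a row left open by truncated input
    return parser.rows


def main(path, include_headers=False):
    """Read an HTML file (assumed UTF-8) and write its table rows as CSV to stdout."""
    with open(path, encoding="utf-8") as f:
        rows = extract_rows(f.read(), include_headers)
    csv.writer(sys.stdout).writerows(rows)
```

Calling main on the concatenated file should print one CSV line per table row across all the tables in it, since the parser simply keeps reading rows until the input ends. Passing include_headers=True also keeps the th cells, including the sort link in the Title header.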