User:Mdupont/Open content
Idea
Create a freely usable database of all known information about free software projects. Look for matching patterns of text that occur across different sources to connect and correlate them with each other.
Implementation
Source code:
https://github.com/h4ck3rm1k3/open-everything-library/tree/extractor
and download helpers:
https://github.com/h4ck3rm1k3/open-everything-library/tree/helpers
DOAP
DOAP resources:
- Overview: http://oss-watch.ac.uk/resources/doap
- Project page: https://github.com/edumbill/doap/wiki
- Validator: http://www.w3.org/RDF/Validator/
- Generate DOAP from Java class annotations: https://github.com/thebrianmanley/doapamine
- DOAP generator in Python: http://bzr.mfd-consult.dk/bzr-doap/bzr-doap.py
- Command-line DOAP tool: https://pypi.python.org/pypi/doapfiend/0.3.3
- Apache DOAP categories: https://projects-old.apache.org/categories.html
- Many GitHub DOAP files: https://github.com/search?l=xml&q=doap.rdf&ref=searchresults&type=Code&utf8=%E2%9C%93 and https://github.com/search?l=turtle&q=doap.rdf&ref=searchresults&type=Code&utf8=%E2%9C%93
- http://www.w3.org/wiki/SemanticWebDOAPBulletinBoard
- DOAP generator from a Maven POM: http://maven.apache.org/plugins/maven-doap-plugin/
- Form-based DOAP generator: http://crschmidt.net/semweb/doapamatic/
- SPARQL queries: http://opendatahacklab.github.io/sparql_suite/#doap
- moap: http://thomas.apestaart.org/moap/trac
- Forrest DOAP input plugin: http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.input.doap/
Data
All data dumps are now on archive.org: https://archive.org/details/@h4ck3rm1k32
- https://archive.org/details/freshcode
- https://archive.org/details/python_201602
- https://archive.org/details/oel-2016-02-debian
- https://archive.org/details/oel-2016-01-sf.net
- https://archive.org/details/oel-2016-01-bitbucket
- https://archive.org/details/openhub
- https://archive.org/details/oel-2016-01-maven-projects
- https://archive.org/details/oel-2016-01-golang
- https://archive.org/details/oel_2016-02-apache_201602
- https://archive.org/details/jamesmikedupont_gmail_Data
- https://archive.org/details/gitlab-list-2016-01-30
- https://archive.org/details/npm.json
- https://archive.org/details/jamesmikedupont_gmail_Ruby
- https://archive.org/details/LibreplanetDotOrgWikiDump20160123
Hosting
- https://freedomopenness.miraheze.org/wiki/Main_Page (the Wikipedia articles are going here first)
- http://freedomandopenness.referata.com/wiki/Main_Page (the FSD articles are going here first)
Todo
These are not used yet:
- http://freedomopenness.wiki-site.com/index.php/Main_Page
- http://freedom-and-openness.wikia.com
- http://freedom-and-openness.mwzip.com/
Algorithm
Start with Category:Open content and follow all subcategories and pages. Extract all external links. Fetch all external pages.
Look at the external websites and determine which are open content.
Look at the software projects, extract information, and cross-reference it with metadata from the sources listed below.
Store the data in JSON format in a MongoDB; there are currently over 200 GB of data.
Merge the various data sources based on external URLs, names, and source-control repositories.
The goal is to push the merged data into buckets on archive.org so that it can be downloaded in zipped data files/parts as needed.
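A minimal sketch of the category walk against the public MediaWiki API (illustrative only; the real extractor is in the repository linked above, and the function names here are made up):

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(cat, session):
    # Yield the member titles of one category, following API continuations.
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": cat, "cmlimit": "500", "format": "json"}
    while True:
        data = session.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def crawl(root="Category:Open content"):
    # Walk subcategories recursively, collecting plain page titles.
    session = requests.Session()
    seen, todo, pages = set(), [root], []
    while todo:
        cat = todo.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for title in category_members(cat, session):
            if title.startswith("Category:"):
                todo.append(title)
            else:
                pages.append(title)
    return pages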
github
Archive.org dumps:
- GitHub project data in Avro format, 7-zipped: https://archive.org/details/github-projects.1.avro.7z (created with this script: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_gh2.py)
- https://archive.org/details/github_201602
- https://archive.org/details/samples.github_nested.json
GitHub metadata is not free per se; it is limited by the terms of service. There is an API you can use to get the data: https://developer.github.com/v3/repos/#list-all-public-repositories
There is a dump of projects from Archive Team, but it is outdated: https://archive.org/details/archiveteam-github-repository-index-201212
Pulling via the authenticated API: https://api.github.com/repositories?since=%d&access_token=%s
Status: importing JSON, still downloading, at item id 46314762.
Code for download: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/github/get.py (takes the last id downloaded as a parameter; requires an authentication token). Code for import: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_gh2.py
As of Jan 25 we have all projects; the last ID was 50342814.
The repository listing does not contain the homepage, and the API listing is already very verbose; another call to get the project details is needed to obtain the homepage.
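A minimal sketch of the paging loop (illustrative, not the actual get.py; it assumes a personal access token in the GITHUB_TOKEN environment variable):

import os
import requests

session = requests.Session()
# Assumption: a personal access token is set in GITHUB_TOKEN.
session.headers["Authorization"] = "token " + os.environ["GITHUB_TOKEN"]

def list_public_repos(since=0):
    # Page through /repositories, resuming from the last id seen,
    # the same way get.py takes the last downloaded id as a parameter.
    while True:
        resp = session.get("https://api.github.com/repositories",
                           params={"since": since})
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            break
        for repo in repos:
            yield repo              # summary record only, no homepage field
        since = repos[-1]["id"]

def details(full_name):
    # The second call that is needed to get fields such as homepage.
    return session.get("https://api.github.com/repos/" + full_name).json()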
Here is an example of what we can collect from the API and put into a wiki for hosting/editing:
http://freedom-and-openness.wikia.com/wiki/GitHub_Projects/browning/chronic
sf.net
First we need a list of projects. Project export: http://sourceforge.net/blog/project-data-export/
Non-free data from 2014: http://srda.cse.nd.edu/mediawiki/index.php/Main_Page
First we get the list of projects, starting with https://sourceforge.net/directory/os:linux/?page=1 and continuing with the URL pattern http://sourceforge.net/directory/os%3Alinux/?page=${page}
We have fetched 1982 pages. The site says that there are 16771 pages, but after page 1982 the web server stops responding. TODO: access via more categories.
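A minimal sketch of that fetch loop (illustrative; the output file names are made up):

import requests

URL = "http://sourceforge.net/directory/os%3Alinux/?page={page}"

def fetch_pages(last_page=1982):
    # Save each directory listing page to disk; stop when the server gives up.
    for page in range(1, last_page + 1):
        resp = requests.get(URL.format(page=page))
        if resp.status_code != 200:
            break
        with open("sf-directory-%05d.html" % page, "w") as out:
            out.write(resp.text)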
The DOAP files are then extracted with this script:
https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/doap.sh
Then the import of the DOAP data is done by import_sf_doap.py: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_sf_doap.py
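For reference, a minimal sketch of reading fields out of a downloaded DOAP file with rdflib (illustrative only; the actual import logic is in import_sf_doap.py):

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

DOAP = Namespace("http://usefulinc.com/ns/doap#")

def read_doap(path):
    # Pull a few common DOAP fields out of one RDF/XML file.
    graph = Graph()
    graph.parse(path, format="xml")
    for project in graph.subjects(RDF.type, DOAP.Project):
        yield {
            "name": graph.value(project, DOAP.name),
            "homepage": graph.value(project, DOAP.homepage),
            "shortdesc": graph.value(project, DOAP.shortdesc),
        }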
cats
The categories were pulled with a simple recursive scan: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/cats/doit.sh
But this got very slow, so I wrote a scraper and started to import the categories into MongoDB: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/process_sf_cats.py This imports all the pages downloaded so far, including the pages, the subcategories, the facets, and the number of pages. After that import runs, it will scan the category pages for pages that have not been imported and fetch those dynamically.
debian
[edit]software packages
https://wiki.debian.org/qa.debian.org/pts/RdfInterface
Full dump: packages.qa.debian.org:/srv/packages.qa.debian.org/www/web/full-dump.tar.bz2
rdf.debian.net is the newest. See also the Ultimate Debian Database, https://wiki.debian.org/UltimateDebianDatabase, which has a great deal of information.
udd
https://udd.debian.org/ contains an SQL database.
wnpp
The intent-to-package bug reports contain information about packages not yet in Debian: https://www.debian.org/devel/wnpp/
openhub
[edit]api
https://github.com/blackducksoftware/ohloh_api/blob/master/reference/project.md
Get a key from https://www.openhub.net/accounts/<username>/api_keys
API_KEY=XXXXX
for page in $(seq 1 "$LAST_PAGE"); do curl --output "projects-${page}.xml" --verbose "https://www.openhub.net/projects.xml?api_key=${API_KEY}&page=${page}"; done
(set LAST_PAGE to the number of pages reported by the API)
bitbucket
Start at https://bitbucket.org/api/2.0/repositories/ and extract the next page URL with: URL=`jq -r .next $OUT`
Status: downloaded 71763 pages of 10 projects each via JSON; importing all pages.
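A minimal sketch of that pagination loop in Python (illustrative; the output file names are made up):

import json
import requests

def fetch_all(url="https://bitbucket.org/api/2.0/repositories/"):
    # Follow the 'next' links until the last page, saving each page as JSON.
    page = 0
    while url:
        data = requests.get(url).json()
        with open("bitbucket-%06d.json" % page, "w") as out:
            json.dump(data, out)
        url = data.get("next")   # missing on the final page
        page += 1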
gitlab
Snapshot: https://archive.org/details/gitlab-list-2016-01-30
eclipse foundation
http://projects.eclipse.org/search/projects?page=1
php
[edit]fossil
http://fossil.include-once.org/
freshcode
[edit]perl cpan
The 02packages.details.txt.gz file is cached locally when you run cpan: ~/.cpan/sources/modules/02packages.details.txt.gz. Source: http://www.cpan.org/modules/02packages.details.txt.gz
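A minimal sketch of parsing that index (the layout assumed here: an email-style header block, a blank line, then one "module version tarball-path" triple per line):

import gzip

def read_02packages(path="02packages.details.txt.gz"):
    # Yield (module, version, tarball path) from the CPAN package index.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():   # end of the header block
                break
        for line in fh:
            module, version, tarball = line.split()
            yield module, version, tarball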
nix
http://nixos.org/ packages
arch
https://www.archlinux.org/packages/
npm
The npm utility caches package information. Run:
npm search
(see http://www.sitepoint.com/beginners-guide-node-package-manager/)
This will populate:
~/.npm/registry.npmjs.org/-/all/.cache.json
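A minimal sketch of reading that cache (the layout assumed here: a single JSON object keyed by package name, plus bookkeeping keys such as _updated):

import json
import os

def read_npm_cache():
    # Iterate over package records from npm's local registry cache.
    path = os.path.expanduser("~/.npm/registry.npmjs.org/-/all/.cache.json")
    with open(path) as fh:
        registry = json.load(fh)
    for name, meta in registry.items():
        if name.startswith("_"):   # skip bookkeeping keys such as _updated
            continue
        yield name, meta.get("description", "")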
Snapshot: https://archive.org/download/npm.json
Ruby gems snapshot (from https://rubygems.org/): https://archive.org/download/jamesmikedupont_gmail_Ruby
python packages
API: https://www.python.org/dev/peps/pep-0503/
Get an index of all packages: wget -m -r -l1 https://pypi.python.org/simple/
That will get you a full list of packages and versions. Get the main page for each package (https://pypi.python.org/pypi/${PKG}): wget -m --no-parent -r -l1 https://pypi.python.org/pypi/
Each Python package has DOAP information that can be fetched via the main index with a URL of the form 'https://pypi.python.org/pypi?:action=doap&name=${PACKAGENAME}'
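A minimal sketch of pulling the package list straight out of the simple index (illustrative; the class name is made up):

from html.parser import HTMLParser
from urllib.request import urlopen

class SimpleIndexParser(HTMLParser):
    # Each project on the PEP 503 simple index page is a single <a> tag.
    def __init__(self):
        super().__init__()
        self.packages = []
        self._in_anchor = False
    def handle_starttag(self, tag, attrs):
        self._in_anchor = (tag == "a")
    def handle_data(self, data):
        if self._in_anchor:
            self.packages.append(data.strip())
    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

parser = SimpleIndexParser()
parser.feed(urlopen("https://pypi.python.org/simple/").read().decode("utf-8"))
print(len(parser.packages), "packages")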
prismbreak
The projects are here: https://github.com/nylira/prism-break/tree/master/source/db (see also https://github.com/nylira/prism-break and https://prism-break.org/en/)
fsf software directory
http://directory.fsf.org/wiki/Main_Page
Download is here: http://static.fsf.org/nosvn/directory/directory.xml
See: http://lists.gnu.org/archive/html/directory-discuss/2013-09/msg00001.html
rapper -o turtle file:directory.xml > directory.ttl
Dump of wiki pages: https://archive.org/details/directoryfsforg_w-20160215-wikidump
Wikidata
https://www.wikidata.org/wiki/Wikidata:Database_download
https://dumps.wikimedia.org/wikidatawiki/entities/20160111/
Query access
https://wikidata.metaphacts.com/sparql
See also http://wikidataldf.wmflabs.org/
Source Code
Code: https://gerrit.wikimedia.org/r/#/admin/projects/wikidata/build-resources
The Wikidata extension is here: https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FWikidata
Access on the toolserver
mysql --defaults-file="${HOME}"/replica.my.cnf -h wikidatawiki.labsdb wikidatawiki_p [1]
The databases are:
- wikidatawiki
- wikidatawiki_p
Tables
- archive: 2,144,243 rows
- changes
- wb_changes: 1,317,914 rows, "all changes from client wiki" [2]
- wb_changes_dispatch
- wb_changes_subscription
- wb_entity_per_page: shows which entity is on which page
- wb_id_counters
- wb_items_per_site: holds links from items to Wikipedia articles. ips_item_id is the numeric item ID (388 is Q388, Linux):
select * from wb_items_per_site where ips_item_id=388;
- wb_property_info: 2126 properties
- wb_terms: shows which terms map onto which entity [3]
- wbc_entity_usage: 11,338,092 rows [4]; populated based on page_props [5]
- wbs_propertypairs: used to suggest properties that occur together
- valid_tag
- tag_summary
- redirect
- page
- page_props: which page has which property. See https://www.mediawiki.org/wiki/Manual:Page_props_table; it contains properties about pages set by the parser via ParserOutput::setProperty(), such as the display title and the default category sortkey.
- page_restrictions
- pagelinks
- category (columns include cat_id and cat_title): https://www.mediawiki.org/wiki/Manual:Category_table
- categorylinks
- revision contains the data for the page
Example Data
Look up entities
select * from wb_terms where term_language='en' and term_type='label' and term_entity_type='item' and term_search_key = 'linux';
term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key | term_weight |
---|---|---|---|---|---|---|---|
142626398 | 900272 | item | en | label | Linux | linux | 0 |
271566667 | 388 | item | en | label | Linux | linux | 0.13 |
266367382 | 261593 | item | en | label | Linux | linux | 0.011 |
Look up pages
select * from wb_entity_per_page where epp_entity_id in (900272, 261593, 388) limit 10;
epp_entity_id | epp_entity_type | epp_page_id | epp_redirect_target |
---|---|---|---|
388 | item | 591 | NULL |
261593 | item | 253747 | NULL |
900272 | item | 851079 | NULL |
Look up page props
select * from page_props where pp_page in (591, 253747, 851079);
pp_page | pp_propname | pp_value | pp_sortkey |
---|---|---|---|
591 | page_image | GNU_and_Tux.svg | NULL |
591 | wb-claims | 25 | 25 |
591 | wb-sitelinks | 155 | 155 |
253747 | wb-claims | 1 | 1 |
253747 | wb-sitelinks | 16 | 16 |
851079 | wb-claims | 1 | 1 |
851079 | wb-sitelinks | 5 | 5 |
Look up pagelinks
select * from pagelinks where pl_from in (591, 253747, 851079);
https://www.mediawiki.org/wiki/Manual:Pagelinks_table
See the data table at https://www.wikidata.org/wiki/User:Mdupont/Open_content#Look_up_page_props
page_props
http://quarry.wmflabs.org/query/7161
SELECT DISTINCT pp_propname FROM page_props;
The distinct pp_propname values:
- wikibase_item
- wb-status
  - ? = 60
  - STATUS_STUB = 100
  - STATUS_EMPTY = 200
- wb-sitelinks (count of site links)
- wb-claims (number of statements)
- templatedata
- staticredirect
- page_top_level_section_count
- page_image
- notoc
- nonewsectionlink
- noindex
- noeditsection
- newsectionlink
- index
- hiddencat
- graph_specs
- forcetoc
- displaytitle
- defaultsort
Apache
http://svn.apache.org/viewvc/
svn co https://svn.apache.org/repos/asf/comdev/projects.apache.org
List of DOAP files: http://svn.apache.org/viewvc/comdev/projects.apache.org/data/projects.xml?view=markup
golang
https://golang.org/pkg/
http://go-search.org/search?q=&p=1
rlang
[edit]Java
[edit]Maven
http://repo1.maven.org/maven2/
http://repo.maven.apache.org/maven2/
http://mvnrepository.com/open-source?p=2
Emacs
https://github.com/emacsmirror/emacswiki.org
git clone git://github.com/emacsmirror/emacswiki.org.git emacswiki
cd emacswiki
git checkout master
wikiapiary
Libre Planet
https://libreplanet.org/wiki/Main_Page
A dump of the wiki can be found here: https://archive.org/details/LibreplanetDotOrgWikiDump20160123
Without Systemd
http://without-systemd.org/wiki/index.php/Init
Fdroid
Data:
git clone https://gitlab.com/fdroid/fdroiddata.git
see https://gitlab.com/fdroid/fdroiddata
Server:
git clone https://gitlab.com/fdroid/fdroidserver.git
see https://gitlab.com/fdroid/fdroidserver
Other projects
To review:
cii-census
https://github.com/linuxfoundation/cii-census
open-frameworks-analyses
https://github.com/wikiteams/open-frameworks-analyses
This project has already downloaded the Open Hub projects.
References
- ^ "Connecting to the database replicas", Wikitech.
- ^ https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation
- ^ https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms
- ^ https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms
- ^ https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/client/maintenance/populateEntityUsage.php;3776748b7d177e654e7e2fc5ebe3bf2ab831da20$16