User:Mdupont/Open content
Idea
Create a freely usable database of all known information about free software projects. Look for matching patterns of text that occur across different sources to connect and correlate them with each other.
Implementation
Source code:
https://github.com/h4ck3rm1k3/open-everything-library/tree/extractor
and download helpers:
https://github.com/h4ck3rm1k3/open-everything-library/tree/helpers
DOAP
DOAP resources:
- Overview: http://oss-watch.ac.uk/resources/doap
- Project page: https://github.com/edumbill/doap/wiki
- Validator: http://www.w3.org/RDF/Validator/
- Generate DOAP from Java class annotations: https://github.com/thebrianmanley/doapamine
- DOAP generator in Python: http://bzr.mfd-consult.dk/bzr-doap/bzr-doap.py
- Command-line DOAP tool: https://pypi.python.org/pypi/doapfiend/0.3.3
- Apache DOAP categories: https://projects-old.apache.org/categories.html
- Many GitHub DOAP files: https://github.com/search?l=xml&q=doap.rdf&ref=searchresults&type=Code&utf8=%E2%9C%93 and https://github.com/search?l=turtle&q=doap.rdf&ref=searchresults&type=Code&utf8=%E2%9C%93
- http://www.w3.org/wiki/SemanticWebDOAPBulletinBoard
- DOAP generator from a Maven POM: http://maven.apache.org/plugins/maven-doap-plugin/
- Form-based DOAP generator: http://crschmidt.net/semweb/doapamatic/
- SPARQL queries: http://opendatahacklab.github.io/sparql_suite/#doap
- moap: http://thomas.apestaart.org/moap/trac
- Forrest DOAP input plugin: http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.input.doap/
Data
All data dumps are now on archive.org: https://archive.org/details/@h4ck3rm1k32
- https://archive.org/details/freshcode
- https://archive.org/details/python_201602
- https://archive.org/details/oel-2016-02-debian
- https://archive.org/details/oel-2016-01-sf.net
- https://archive.org/details/oel-2016-01-bitbucket
- https://archive.org/details/openhub
- https://archive.org/details/oel-2016-01-maven-projects
- https://archive.org/details/oel-2016-01-golang
- https://archive.org/details/oel_2016-02-apache_201602
- https://archive.org/details/jamesmikedupont_gmail_Data
- https://archive.org/details/gitlab-list-2016-01-30
- https://archive.org/details/npm.json
- https://archive.org/details/jamesmikedupont_gmail_Ruby
- https://archive.org/details/LibreplanetDotOrgWikiDump20160123
Hosting
- https://freedomopenness.miraheze.org/wiki/Main_Page (the Wikipedia articles are going here first)
- http://freedomandopenness.referata.com/wiki/Main_Page (the FSD articles are going here first)
Todo
These are not used yet:
- http://freedomopenness.wiki-site.com/index.php/Main_Page
- http://freedom-and-openness.wikia.com
- http://freedom-and-openness.mwzip.com/
Algorithm
Start with Category:Open content and follow all subcategories and pages. Extract all external links. Fetch all external pages.
Look at the external websites and determine which are open content.
Look at the software projects, extract information, and cross-reference it with metadata from the sources listed below.
Store the data in JSON format in a MongoDB; there are currently over 200 GB of data.
Merge the various data sources based on external URLs, names, and source-control repositories.
The goal is to push the merged data into buckets on archive.org so that it can be downloaded in zipped data files/parts as needed.
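A minimal sketch of the category walk against the public MediaWiki API (illustrative only; the real extractor is in the repository linked above, and the function names here are made up):

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(cat, session):
    # Yield the member titles of one category, following API continuations.
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": cat, "cmlimit": "500", "format": "json"}
    while True:
        data = session.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def crawl(root="Category:Open content"):
    # Walk subcategories recursively, collecting plain page titles.
    session = requests.Session()
    seen, todo, pages = set(), [root], []
    while todo:
        cat = todo.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for title in category_members(cat, session):
            if title.startswith("Category:"):
                todo.append(title)
            else:
                pages.append(title)
    return pages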
github
Archive.org dumps:
- GitHub project data in Avro format, 7-zipped: https://archive.org/details/github-projects.1.avro.7z (created with this script: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_gh2.py)
- https://archive.org/details/github_201602
- https://archive.org/details/samples.github_nested.json
GitHub metadata is not free per se; it is limited by the terms of service. There is an API you can use to get the data: https://developer.github.com/v3/repos/#list-all-public-repositories
There is a dump of projects from Archive Team, but it is outdated: https://archive.org/details/archiveteam-github-repository-index-201212
Pulling via the authenticated API: https://api.github.com/repositories?since=%d&access_token=%s
Status: importing JSON, still downloading, at item id 46314762.
Code for download: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/github/get.py (takes the last id downloaded as a parameter; requires an authentication token). Code for import: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_gh2.py
As of Jan 25 we have all projects; the last ID was 50342814.
The repository listing does not contain the homepage, and the API listing is already very verbose; another call to get the project details is needed to obtain the homepage.
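A minimal sketch of the paging loop (illustrative, not the actual get.py; it assumes a personal access token in the GITHUB_TOKEN environment variable):

import os
import requests

session = requests.Session()
# Assumption: a personal access token is set in GITHUB_TOKEN.
session.headers["Authorization"] = "token " + os.environ["GITHUB_TOKEN"]

def list_public_repos(since=0):
    # Page through /repositories, resuming from the last id seen,
    # the same way get.py takes the last downloaded id as a parameter.
    while True:
        resp = session.get("https://api.github.com/repositories",
                           params={"since": since})
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            break
        for repo in repos:
            yield repo              # summary record only, no homepage field
        since = repos[-1]["id"]

def details(full_name):
    # The second call that is needed to get fields such as homepage.
    return session.get("https://api.github.com/repos/" + full_name).json()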
Here is an example of what we can collect from the API and put into a wiki for hosting/editing:
http://freedom-and-openness.wikia.com/wiki/GitHub_Projects/browning/chronic
sf.net
First we need a list of projects. Project export: http://sourceforge.net/blog/project-data-export/
Non-free data from 2014: http://srda.cse.nd.edu/mediawiki/index.php/Main_Page
First we get the list of projects, starting with https://sourceforge.net/directory/os:linux/?page=1 and continuing with the URL pattern http://sourceforge.net/directory/os%3Alinux/?page=${page}
We have fetched 1982 pages. The site says that there are 16771 pages, but after page 1982 the web server stops responding. TODO: access via more categories.
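A minimal sketch of that fetch loop (illustrative; the output file names are made up):

import requests

URL = "http://sourceforge.net/directory/os%3Alinux/?page={page}"

def fetch_pages(last_page=1982):
    # Save each directory listing page to disk; stop when the server gives up.
    for page in range(1, last_page + 1):
        resp = requests.get(URL.format(page=page))
        if resp.status_code != 200:
            break
        with open("sf-directory-%05d.html" % page, "w") as out:
            out.write(resp.text)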
The DOAP files are then extracted with this script:
https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/doap.sh
Then the import of the DOAP data is done by import_sf_doap.py: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/import_sf_doap.py
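For reference, a minimal sketch of reading fields out of a downloaded DOAP file with rdflib (illustrative only; the actual import logic is in import_sf_doap.py):

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

DOAP = Namespace("http://usefulinc.com/ns/doap#")

def read_doap(path):
    # Pull a few common DOAP fields out of one RDF/XML file.
    graph = Graph()
    graph.parse(path, format="xml")
    for project in graph.subjects(RDF.type, DOAP.Project):
        yield {
            "name": graph.value(project, DOAP.name),
            "homepage": graph.value(project, DOAP.homepage),
            "shortdesc": graph.value(project, DOAP.shortdesc),
        }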
cats
The categories were pulled with a simple recursive scan: https://github.com/h4ck3rm1k3/open-everything-library/blob/helpers/sources/sf.net/cats/doit.sh
But this got very slow, so I wrote a scraper and started to import the categories into MongoDB: https://github.com/h4ck3rm1k3/open-everything-library/blob/extractor/process_sf_cats.py This imports all the pages downloaded so far, including the pages, the subcategories, the facets, and the number of pages. After that import runs, it will scan the category pages for pages that have not been imported and fetch those dynamically.
debian
[edit]software packages
https://wiki.debian.org/qa.debian.org/pts/RdfInterface
Full dump: packages.qa.debian.org:/srv/packages.qa.debian.org/www/web/full-dump.tar.bz2
rdf.debian.net is the newest. See also the Ultimate Debian Database, https://wiki.debian.org/UltimateDebianDatabase, which has a great deal of information.
udd
https://udd.debian.org/ contains an SQL database.
wnpp
The intent-to-package bug reports contain information about packages not yet in Debian: https://www.debian.org/devel/wnpp/
openhub
[edit]api
https://github.com/blackducksoftware/ohloh_api/blob/master/reference/project.md
Get a key from https://www.openhub.net/accounts/<username>/api_keys
API_KEY=XXXXX
for page in $(seq 1 "$LAST_PAGE"); do curl --output "projects-${page}.xml" --verbose "https://www.openhub.net/projects.xml?api_key=${API_KEY}&page=${page}"; done
(set LAST_PAGE to the number of pages reported by the API)
bitbucket
Start at https://bitbucket.org/api/2.0/repositories/ and extract the next page URL with: URL=`jq -r .next $OUT`
Status: downloaded 71763 pages of 10 projects each via JSON; importing all pages.
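A minimal sketch of that pagination loop in Python (illustrative; the output file names are made up):

import json
import requests

def fetch_all(url="https://bitbucket.org/api/2.0/repositories/"):
    # Follow the 'next' links until the last page, saving each page as JSON.
    page = 0
    while url:
        data = requests.get(url).json()
        with open("bitbucket-%06d.json" % page, "w") as out:
            json.dump(data, out)
        url = data.get("next")   # missing on the final page
        page += 1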
gitlab
Snapshot: https://archive.org/details/gitlab-list-2016-01-30
eclipse foundation
http://projects.eclipse.org/search/projects?page=1
php
[edit]fossil
http://fossil.include-once.org/
freshcode
[edit]perl cpan
The 02packages.details.txt.gz file is cached locally when you run cpan: ~/.cpan/sources/modules/02packages.details.txt.gz. Source: http://www.cpan.org/modules/02packages.details.txt.gz
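A minimal sketch of parsing that index (the layout assumed here: an email-style header block, a blank line, then one "module version tarball-path" triple per line):

import gzip

def read_02packages(path="02packages.details.txt.gz"):
    # Yield (module, version, tarball path) from the CPAN package index.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():   # end of the header block
                break
        for line in fh:
            module, version, tarball = line.split()
            yield module, version, tarball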
nix
http://nixos.org/ packages
arch
https://www.archlinux.org/packages/
npm
The npm utility caches package information. Run:
npm search
(see http://www.sitepoint.com/beginners-guide-node-package-manager/)
This will populate:
~/.npm/registry.npmjs.org/-/all/.cache.json
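A minimal sketch of reading that cache (the layout assumed here: a single JSON object keyed by package name, plus bookkeeping keys such as _updated):

import json
import os

def read_npm_cache():
    # Iterate over package records from npm's local registry cache.
    path = os.path.expanduser("~/.npm/registry.npmjs.org/-/all/.cache.json")
    with open(path) as fh:
        registry = json.load(fh)
    for name, meta in registry.items():
        if name.startswith("_"):   # skip bookkeeping keys such as _updated
            continue
        yield name, meta.get("description", "")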
Snapshot: https://archive.org/download/npm.json
Ruby gems snapshot (from https://rubygems.org/): https://archive.org/download/jamesmikedupont_gmail_Ruby
python packages
API: https://www.python.org/dev/peps/pep-0503/
Get an index of all packages: wget -m -r -l1 https://pypi.python.org/simple/
That will get you a full list of packages and versions. Get the main page for each package (https://pypi.python.org/pypi/${PKG}): wget -m --no-parent -r -l1 https://pypi.python.org/pypi/
Each Python package has DOAP information that can be fetched via the main index with a URL of the form 'https://pypi.python.org/pypi?:action=doap&name=${PACKAGENAME}'
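A minimal sketch of pulling the package list straight out of the simple index (illustrative; the class name is made up):

from html.parser import HTMLParser
from urllib.request import urlopen

class SimpleIndexParser(HTMLParser):
    # Each project on the PEP 503 simple index page is a single <a> tag.
    def __init__(self):
        super().__init__()
        self.packages = []
        self._in_anchor = False
    def handle_starttag(self, tag, attrs):
        self._in_anchor = (tag == "a")
    def handle_data(self, data):
        if self._in_anchor:
            self.packages.append(data.strip())
    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

parser = SimpleIndexParser()
parser.feed(urlopen("https://pypi.python.org/simple/").read().decode("utf-8"))
print(len(parser.packages), "packages")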
prismbreak
The projects are here: https://github.com/nylira/prism-break/tree/master/source/db (see also https://github.com/nylira/prism-break and https://prism-break.org/en/)
fsf software directory
http://directory.fsf.org/wiki/Main_Page
Download is here: http://static.fsf.org/nosvn/directory/directory.xml
See: http://lists.gnu.org/archive/html/directory-discuss/2013-09/msg00001.html
rapper -o turtle file:directory.xml > directory.ttl
Dump of wiki pages: https://archive.org/details/directoryfsforg_w-20160215-wikidump
Wikidata
https://www.wikidata.org/wiki/Wikidata:Database_download
https://dumps.wikimedia.org/wikidatawiki/entities/20160111/
Query access
https://wikidata.metaphacts.com/sparql
See also http://wikidataldf.wmflabs.org/
Source Code
Code: https://gerrit.wikimedia.org/r/#/admin/projects/wikidata/build-resources
The Wikidata extension is here: https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FWikidata
Access on the toolserver
mysql --defaults-file="${HOME}"/replica.my.cnf -h wikidatawiki.labsdb wikidatawiki_p [1]
The databases are:
- wikidatawiki
- wikidatawiki_p
Tables
- archive: 2,144,243 rows
- changes
- wb_changes: 1,317,914 rows, "all changes from client wiki" [2]
- wb_changes_dispatch
- wb_changes_subscription
- wb_entity_per_page: shows which entity is on which page
- wb_id_counters
- wb_items_per_site: holds links from items to Wikipedia articles. ips_item_id is the numeric item ID (388 is Q388, Linux):
select * from wb_items_per_site where ips_item_id=388;
- wb_property_info: 2126 properties
- wb_terms: shows which terms map onto which entity [3]
- wbc_entity_usage: 11,338,092 rows [4]; populated based on page_props [5]
- wbs_propertypairs: used to suggest properties that occur together
- valid_tag
- tag_summary
- redirect
- page
- page_props: which page has which property. See https://www.mediawiki.org/wiki/Manual:Page_props_table; it contains properties about pages set by the parser via ParserOutput::setProperty(), such as the display title and the default category sortkey.
- page_restrictions
- pagelinks
- category (columns include cat_id and cat_title): https://www.mediawiki.org/wiki/Manual:Category_table
- categorylinks
- revision contains the data for the page
Example Data
Look up entities
select * from wb_terms where term_language='en' and term_type='label' and term_entity_type='item' and term_search_key = 'linux';
term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key | term_weight |
---|---|---|---|---|---|---|---|
142626398 | 900272 | item | en | label | Linux | linux | 0 |
271566667 | 388 | item | en | label | Linux | linux | 0.13 |
266367382 | 261593 | item | en | label | Linux | linux | 0.011 |
Look up pages
select * from wb_entity_per_page where epp_entity_id in (900272, 261593, 388) limit 10;
epp_entity_id | epp_entity_type | epp_page_id | epp_redirect_target |
---|---|---|---|
388 | item | 591 | NULL |
261593 | item | 253747 | NULL |
900272 | item | 851079 | NULL |
Look up page props
select * from page_props where pp_page in (591, 253747, 851079);
pp_page | pp_propname | pp_value | pp_sortkey |
---|---|---|---|
591 | page_image | GNU_and_Tux.svg | NULL |
591 | wb-claims | 25 | 25 |
591 | wb-sitelinks | 155 | 155 |
253747 | wb-claims | 1 | 1 |
253747 | wb-sitelinks | 16 | 16 |
851079 | wb-claims | 1 | 1 |
851079 | wb-sitelinks | 5 | 5 |
Look up pagelinks
select * from pagelinks where pl_from in (591, 253747, 851079);
https://www.mediawiki.org/wiki/Manual:Pagelinks_table
See the data table at https://www.wikidata.org/wiki/User:Mdupont/Open_content#Look_up_page_props
page_props
http://quarry.wmflabs.org/query/7161
SELECT DISTINCT pp_propname FROM page_props;
The distinct pp_propname values:
- wikibase_item
- wb-status
  - ? = 60
  - STATUS_STUB = 100
  - STATUS_EMPTY = 200
- wb-sitelinks (count of site links)
- wb-claims (number of statements)
- templatedata
- staticredirect
- page_top_level_section_count
- page_image
- notoc
- nonewsectionlink
- noindex
- noeditsection
- newsectionlink
- index
- hiddencat
- graph_specs
- forcetoc
- displaytitle
- defaultsort
Apache
http://svn.apache.org/viewvc/
svn co https://svn.apache.org/repos/asf/comdev/projects.apache.org
List of DOAP files: http://svn.apache.org/viewvc/comdev/projects.apache.org/data/projects.xml?view=markup
golang
https://golang.org/pkg/
http://go-search.org/search?q=&p=1
rlang
[edit]Java
[edit]Maven
http://repo1.maven.org/maven2/
http://repo.maven.apache.org/maven2/
http://mvnrepository.com/open-source?p=2
Emacs
https://github.com/emacsmirror/emacswiki.org
git clone git://github.com/emacsmirror/emacswiki.org.git emacswiki
cd emacswiki
git checkout master
wikiapiary
Libre Planet
https://libreplanet.org/wiki/Main_Page
A dump of the wiki can be found here: https://archive.org/details/LibreplanetDotOrgWikiDump20160123
Without Systemd
http://without-systemd.org/wiki/index.php/Init
Fdroid
Data:
git clone https://gitlab.com/fdroid/fdroiddata.git
see https://gitlab.com/fdroid/fdroiddata
Server:
git clone https://gitlab.com/fdroid/fdroidserver.git
see https://gitlab.com/fdroid/fdroidserver
Other projects
To review:
cii-census
https://github.com/linuxfoundation/cii-census
open-frameworks-analyses
https://github.com/wikiteams/open-frameworks-analyses
This project has already downloaded the Open Hub projects.
References
- ^ "Connecting to the database replicas", Wikitech.
- ^ https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation
- ^ https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms
- ^ https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms
- ^ https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/client/maintenance/populateEntityUsage.php;3776748b7d177e654e7e2fc5ebe3bf2ab831da20$16