Project Proposal During IT Summit - Normalization of Space Mission Data Sets
See also Mission Data - Developments
I just posted about my work on Space Mission Data Sets (particularly Mission Parameters). Since I will be in San Francisco for the NASA IT Summit for nearly a week (Aug 15th-19th), and at least a few folks who will be there have expressed interest in doing a hackathon of sorts, I thought I would throw my hat into the ring with a proposal for what we actually work on.
My proposal is to have folks normalize sets of data, either by writing code or by importing the data into tools like Excel or whatever they are familiar with, so that it can be imported into a collaborative, crowdsourcing framework that I am working on. I list a couple of sources of this data below.
Much of this data is what I will call "lightly verified". Enthusiasts like those who run the sites listed below curate this information from multiple sources, but a few people can only do so much, especially since there have been over 4,800 orbital launches since Sputnik. The challenge is to somehow fill in each data set's gaps by correlating and overlaying the data.
So somehow we need a set of tools, written in something like Python or Java, which can pull all these data sets in and do some simple correlation among the Flight Missions and their parameters. Perhaps this could be done with some simple Bayesian analysis, root mean square, or something similar. The whole point is to automatically approve data elements which are the same among 3-4 different data sets, and then report on the ones where different data sets disagree.
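A minimal sketch of the approve-or-report idea, using a plain vote count rather than Bayesian analysis. The function names, the source names, the field names, and the min_agree cutoff of 3 are all assumptions made for illustration, not part of any existing tool:

```python
from collections import Counter

def consensus(values, min_agree=3):
    """Classify one field as reported by several data sets."""
    reported = [v for v in values if v is not None]
    counts = Counter(reported)
    if not counts:
        return ("missing", None, counts)
    value, votes = counts.most_common(1)[0]
    if len(counts) > 1:
        # At least two data sets disagree: flag for human review.
        return ("conflict", value, counts)
    if votes >= min_agree:
        # Every reporting data set agrees, and enough of them reported.
        return ("approved", value, counts)
    return ("needs_more_sources", value, counts)

def reconcile(records, min_agree=3):
    """records maps a source name to a dict of field -> value for one mission."""
    fields = sorted({f for rec in records.values() for f in rec})
    approved, report = {}, {}
    for field in fields:
        values = [rec.get(field) for rec in records.values()]
        status, value, counts = consensus(values, min_agree)
        if status == "approved":
            approved[field] = value
        else:
            report[field] = (status, dict(counts))
    return approved, report

# Hypothetical records for one mission from three hypothetical sources.
records = {
    "source_a": {"launch_date": "1977-08-22", "name": "Voyager 2"},
    "source_b": {"launch_date": "1977-08-22", "name": "Voyager 2"},
    "source_c": {"launch_date": "1977-08-22", "name": "Voyager2"},
}
approved, report = reconcile(records)
print(approved)
print(report)
```

Here the launch date would be approved automatically, while the name would land in the report until the values are normalized or a person decides.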
Examples
Let's say that two data sets say that the Voyager 2 spacecraft was launched on August 22, 1977, but one says it was launched on August 21. Somehow we should be able to automatically set some thresholds and throw out the August 21 data point. Let's say that one data point says it was the "Voyager 2" spacecraft and one says "Voyager2". We should somehow be able to deduce that these are really the same spacecraft.
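Both examples could be handled roughly like this. It is only a sketch: the helper names, the two date formats, and the 0.6 threshold are assumptions, and the "%B" month-name parsing assumes an English locale:

```python
import re
from collections import Counter
from datetime import datetime

def name_key(name):
    """Drop case, spaces, and punctuation so 'Voyager 2' and 'Voyager2' match."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def parse_date(text):
    """Parse dates written like 'August 22, 1977' or '1977-08-22'."""
    for fmt in ("%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def majority(values, threshold=0.6):
    """Return the most common value if its share of the votes meets the threshold."""
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    return value if votes / len(values) >= threshold else None

# The three launch dates from the example above, in mixed formats.
dates = [parse_date(s) for s in ("August 22, 1977", "1977-08-22", "August 21, 1977")]
print(majority(dates))                                  # 1977-08-22
print(name_key("Voyager 2") == name_key("Voyager2"))    # True
```

Two votes out of three is about 0.67, so August 22 clears the 0.6 threshold and August 21 is thrown out. A key this aggressive can also merge names that really are different, so matches are probably worth logging rather than trusting blindly.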
Software libraries & packages
I just did some simple Google searches in order to find some packages that we may be able to use to accomplish this task:
- http://code.google.com/p/google-refine/ - An open source package for dealing with "messy data", which is definitely the state of this data. It is an installed platform which you then use as a service by uploading the data and then using the interface to parse and collapse the data. I have actually used this package a little in the past and it is pretty incredible.
- http://code.google.com/p/bayesian-inference/ - Python package for Bayesian inference
Data Sets
1. NSSDC Spacecraft Data
My summer intern Justin actually wrote a great scraper for the NSSDC web site. You can find it here: http://scraperwiki.com/scrapers/nasa_nssdc_id_table_scraper/ But it currently only scrapes the NSSDC ID, Launch Date, and Spacecraft Name. The next step is to dive one level deeper and scrape the actual pages themselves, like this: http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do?id=2009-031A
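As a starting point for that next step, the detail page address for each spacecraft can be built from the NSSDC ID the scraper already collects. This sketch only constructs the URLs, following the pattern of the example link above. It does not fetch or parse anything, and the row layout and the "nssdc_id" key are assumptions about how the scraped table might be stored:

```python
from urllib.parse import urlencode

BASE_URL = "http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do"

def detail_url(nssdc_id):
    """Build the spacecraft detail page URL for one NSSDC ID."""
    return BASE_URL + "?" + urlencode({"id": nssdc_id})

# Hypothetical row shaped like the output of the existing scraper.
rows = [{"nssdc_id": "2009-031A"}]
for row in rows:
    print(detail_url(row["nssdc_id"]))
```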
2. Jonathan's Space Report
JSR's site contains many fixed-width text files which contain thousands of lines of data. Here are some of them:
- Main index of data files - http://planet4589.org/space/
- Master Orbital Launch log (Updated May 2010) - http://planet4589.org/space/log/launchlog.txt
- Satellite Catalog Number Index (Updated May 2010) - http://planet4589.org/space/logs/satcat.txt
- UN Registry of Space Objects - http://planet4589.org/space/un/un_taba.html
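Fixed-width files like these can be split by slicing each line at known column positions. The sketch below shows the general approach only. The column names and offsets are made up for illustration and would have to be replaced with the real layout of each file, the sample line is invented, and treating lines that start with "#" as comments is an assumption:

```python
def parse_fixed_width(lines, columns):
    """Yield one dict per data line; columns is a list of (name, start, end)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        yield {name: line[start:end].strip() for name, start, end in columns}

# Hypothetical layout: NOT the real column positions of any JSR file.
COLUMNS = [("launch_id", 0, 12), ("launch_date", 12, 24), ("payload", 24, 44)]

# Invented sample line, padded to match the hypothetical layout above.
sample = [
    "# made-up sample for illustration",
    "ID-0001".ljust(12) + "1977 Aug 22".ljust(12) + "Voyager 2",
]

for row in parse_fixed_width(sample, COLUMNS):
    print(row)
```

Once each file is turned into dicts like this, the rows can be compared against the other data sets field by field.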
3. Other Sources
Unfortunately, many of the other sites found at the link below do not offer downloadable data at all, which means that this material would need to be scraped from the sites. ScraperWiki is an incredible tool which allows for easy, systematic extraction of data from websites. Check it out here: http://www.scraperwiki.org The site provides a "cron" type system so that every day the tool runs your Python, Ruby, or PHP code to do the scraping, and the extracted data is hosted on their site in a few different formats (CSV, JSON, etc.) so you can download the new data very easily.
I have spent many hours combing the web and doing clever searches to find places where folks (mostly enthusiasts) have assembled Space Mission Data. I have found over 30 different sites, in varying levels of maturity, messiness, and accuracy. See: http://nasatweet.com/wiki/Mission_Data_-_Sources


