Managing Data

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • Why do I need to manage data?

  • How do I manage different types of data?

  • How do I store and access structured data?

  • How do I store and access unstructured data?

Objectives
  • Distinguish between different types of data

  • Understand how to work with structured and unstructured data

  • Be comfortable with SQL select and filter queries

  • Understand what meta-data is

Managing Data

All of our research is fundamentally reliant on data so it’s important to understand the different types of data, how to store them, and how to work with them. In this lesson we’ll learn all the above!

Types of data

Fundamentally computers store data as 0s and 1s, representing floats, integers, boolean, or character data types. However this is not what we are going to talk about here. Here we are more concerned with a higher level of abstraction, which includes the following data types:

The following describes a generic data processing workflow DataProcessing

Discussion

In your groups have a look at the different types of data and stages of the data processing workflow. Think about:

  • Where do you get your data from?
    • Are there more than one source of data?
  • How do you keep track of which data is in which stage of processing?
  • When you publish your work, which data products do you want to keep and which can be deleted?

Record some notes in the shared document.

We can classify data by the processing stage as was done above, but also in an orthogonal dimension of how the data is organized. The two most general considerations here are either structured or unstructured data.

StructredUnstructredData

The way that we can store and access data depends on how it’s structured.

Structured data

One of the easiest ways to store structured data are in one or more tables. The tables themselves can be stored as .csv, .fits, or .vot tables which can be read by many astronomy software tools such as topcat. Google sheets, Microsoft Excel, or LibreOffice Calc can also be beneficial for storing structured data, particularly because they allow you to do some data exploration, cleaning, filtering, and analysis within the file itself.

When your data volumes become large, or you find yourself needing to have multiple related tables, then you probably want to start exploring databases as a storage option. See the Bonus Content

Unstructured data

Unstructured data cannot be aggregated or combined into a single data format due to their lack of common structure. In such cases it is often best to make use of a regular directory/file structure to store the individual files in a way that can be navigated easily. A well planned directory/file structure it is very useful to store meta-data about each of the files that you are storing. For example you could have:

Exercise

  • Download example data example.zip
  • Unzip the file and have a look at the content
  • See if you can infer the purpose of each file
  • Make a plan for how you would store and manage this data
  • Note in the shared document what you have decided for at least one of the files

Data cleaning

In the previous lesson we worked with a file Australian_Meteorite_Landings.csv. For your benefit this had already been cleaned to remove invalid entries, attributes that we didn’t need, and entries that were not relevant.

Let’s now start with the original data set and go through the cleaning process.

wget -O Meteorite_Landings.csv https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD
# or
curl -o Meteorite_Landings.csv -O https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD

Now we’ll explore the data set using pandas.

View the raw data

Using your viewer of choice, have a look at the csv file that we just downloaded. Think about and discuss:

  • Are there any missing data?
  • What can be done about the missing data?
  • Are there any columns that are not useful?

Note in our shared document which tool you used to inspect the data, and one that you would take to clean this data.

Solution with pandas

import pandas as pd
df = pd.read_csv('Meteorite_Landings.csv')
print("Quick look of data")
print(df)
print("Summary of data")
print(df.describe())
Quick look of data
             name     id nametype              recclass  mass (g)   fall    year    > reclat    reclong             GeoLocation
0          Aachen      1    Valid                    L5      21.0   Fell  1880.0  50.> 77500    6.08333       (50.775, 6.08333)
1          Aarhus      2    Valid                    H6     720.0   Fell  1951.0  56.> 18333   10.23333    (56.18333, 10.23333)
2            Abee      6    Valid                   EH4  107000.0   Fell  1952.0  54.> 21667 -113.00000      (54.21667, -113.0)
3        Acapulco     10    Valid           Acapulcoite    1914.0   Fell  1976.0  16.> 88333  -99.90000       (16.88333, -99.9)
4         Achiras    370    Valid                    L6     780.0   Fell  1902.0 -33.> 16667  -64.95000     (-33.16667, -64.95)
...           ...    ...      ...                   ...       ...    ...     ...       ..> .        ...                     ...
45711  Zillah 002  31356    Valid               Eucrite     172.0  Found  1990.0  29.> 03700   17.01850       (29.037, 17.0185)
45712      Zinder  30409    Valid  Pallasite, ungrouped      46.0  Found  1999.0  13.> 78333    8.96667     (13.78333, 8.96667)
45713        Zlin  30410    Valid                    H4       3.3  Found  1939.0  49.> 25000   17.66667       (49.25, 17.66667)
45714   Zubkovsky  31357    Valid                    L6    2167.0  Found  2003.0  49.> 78917   41.50460     (49.78917, 41.5046)
45715  Zulu Queen  30414    Valid                  L3.7     200.0  Found  1976.0  33.> 98333 -115.68333  (33.98333, -115.68333)

[45716 rows x 10 columns]
Summary of data
                 id      mass (g)          year        reclat       reclong
count  45716.000000  4.558500e+04  45425.000000  38401.000000  38401.000000
mean   26889.735104  1.327808e+04   1991.828817    -39.122580     61.074319
std    16860.683030  5.749889e+05     25.052766     46.378511     80.647298
min        1.000000  0.000000e+00    860.000000    -87.366670   -165.433330
25%    12688.750000  7.200000e+00   1987.000000    -76.714240      0.000000
50%    24261.500000  3.260000e+01   1998.000000    -71.500000     35.666670
75%    40656.750000  2.026000e+02   2003.000000      0.000000    157.166670
max    57458.000000  6.000000e+07   2101.000000     81.166670    354.473330

solution with bash

head Meteorite_Landings.csv
name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
Aachen,1,Valid,L5,21,Fell,1880,50.775000,6.083330,"(50.775, 6.08333)"
Aarhus,2,Valid,H6,720,Fell,1951,56.183330,10.233330,"(56.18333, 10.23333)"
Abee,6,Valid,EH4,107000,Fell,1952,54.216670,-113.000000,"(54.21667, -113.0)"
Acapulco,10,Valid,Acapulcoite,1914,Fell,1976,16.883330,-99.900000,"(16.88333, -99.9)"
Achiras,370,Valid,L6,780,Fell,1902,-33.166670,-64.950000,"(-33.16667, -64.95)"
Adhi Kot,379,Valid,EH4,4239,Fell,1919,32.100000,71.800000,"(32.1, 71.8)"
Adzhi-Bogdo (stone),390,Valid,LL3-6,910,Fell,1949,44.833330,95.166670,"(44.83333, 95.> 16667)"
Agen,392,Valid,H5,30000,Fell,1814,44.216670,0.616670,"(44.21667, 0.61667)"
Aguada,398,Valid,L6,1620,Fell,1930,-31.600000,-65.233330,"(-31.6, -65.23333)"
wc Meteorite_Landings.csv
  45717  166894 3952161 Meteorite_Landings.csv

You may have noticed when inspecting the file above, that the header names are occasionally useful, but don’t do a good job of explaining what the data represent. For example:

Here is where we need to think about the meta-data attached to this table. We need context (how/why the table was created), and also a description of what the data represent. If we go to the NASA website which provides this data, they have a description of the column data. However, it’s not so helpful:

MeteoriteColumnDescriptions

Wuh-wah! The only column with a description is the one that was most obvious. Not only that, but the data type listed for the columns is incorrect in many cases. For example, id, reclat, and reclat are listed as being “plain text” but they are clearly numeric (int, and floats), while “year” is listed as being “date and time”.

Enough complaining, lets start fixing.

Challenge

We want to plot all the Australian meteorites, with a measured mass value, and a nametype of “Valid”

  • Using pandas, load the meteorite dataset
  • Delete the geolocation column as it’s a duplicate of the reclat/reclong columns
  • Select all rows with a non-zero and non-null mass value
    • Count the rows and record this number
    • Delete these rows
  • Repeat the above for:
    • rows with a reclat/reclong that are identically 0,0
    • rows with a blank reclat/reclong
    • rows with nametype that is not “Valid”
  • Delete the nametype column as it’s now just has a single value for all rows
  • Choose a bounding box in lat/long for Australia (don’t forget Tassie!)
  • Select, count, and delete all rows that are outside of this bounding box
  • Save our new dataframe as a .csv file
  • Run our previously created script on this file and view the results

Data storage and access

The way that you store and access data depends on many things including the type of data, the structure of the data, and the ways that you will access and use the data.

We’ll focus here on the two data types introduced at the start of this lesson: structured and unstructured data.

Structured data

These are the kinds of data that we can store as rows in a .csv file or Excel spreadsheet. The two options for storing these data would be flat files or a database. A flat-file is, as the name suggests, a single file which contains all the information that you are storing. You need an independent program to load/sort/update the data that is stored in a flat-file. As the data volume becomes large, or the relation between data in different flat-files becomes more intertwined the creation, retrieval and management of these files can become time consuming and unreliable.

A database is a solution that combines the data storage, retrieval, and management in one program. A database can allow you to separate the way that the data are stored logically (tables and relations) and physically (on disks, over a network).

We’ll explore some basics of data bases using sqlite3. This program supports the core functionality of a relational database without features such as authentication, concurrent access, or scalability. For more advanced features you can use more full featured solutions such as PostgreSQL or MySQL. One of the nice things about sqlite3 is that it stores all the data in a single file, which can be easily shared or backed up. For this lesson we’ll be working with a ready made database Meteorite_Landings.db.

Using SQL to select data

There are two choices of how you can use sqlite3 for this lesson:

  1. run sqlite3 Meteorite_Landings.db from your terminal
  2. Go to sqliteonline.com and upload the file

The data in a database is stored in one or more tables, each with rows and columns. To view the structure of a table we can use:

.schema
CREATE TABLE IF NOT EXISTS "landings" (
"index" INTEGER,
  "name" TEXT,
  "id" INTEGER,
  "nametype" TEXT,
  "recclass" TEXT,
  "fall" TEXT,
  "year" REAL,
  "reclat" REAL,
  "reclong" REAL,
  "GeoLocation" TEXT,
  "States" REAL,
  "Counties" REAL,
  "mass" REAL
);

What we see in the output here are the instructions on how the table was created. The name of the table is landings, and there are a bunch of column names which we recognize from previous work. We can see that each column has a particular data type (TEXT, INTEGER, REAL). In a database the types of data that can be stored in a given column are fixed and strictly enforced. If you tried to add “one” to a column with type REAL, then the database software would reject your command and not change the table.

If we want to see the content of the table, we need to SELECT some data. In SQL we can do this using the SELECT command as follows:

SELECT <columns> FROM <table>;
SELECT name, mass, year FROM landings;
SELECT * FROM landings;

The last command is the most common command for just looking at the table content as it will show all the columns without you having to list them all. If we want to show only the first 10 lines (like head) then we can choose a LIMIT of lines using:

SELECT * FROM landings LIMIT 10;

Note that if you are using the command line version of sqlite3 the default output can be a little hard to read. Firstly, you’ll get all the data speeding by. Secondly, the data are not well formatted for humans to read. We can fix this with a few commands:

.mode column
.headings on

Combining all this together we have the following:

sqlite> SELECT * FROM landings LIMIT 10;
index       name        id          nametype    recclass    fall        year        reclat      reclong     GeoLocation        States      Counties    mass      
----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  -----------------  ----------  ----------  ----------
0           Aachen      1           Valid       L5          Fell        1880.0      50.775      6.08333     (50.775, 6.08333)                          21.0      
1           Aarhus      2           Valid       H6          Fell        1951.0      56.18333    10.23333    (56.18333, 10.233                          720.0     
2           Abee        6           Valid       EH4         Fell        1952.0      54.21667    -113.0      (54.21667, -113.0                          107000.0  
3           Acapulco    10          Valid       Acapulcoit  Fell        1976.0      16.88333    -99.9       (16.88333, -99.9)                          1914.0    
4           Achiras     370         Valid       L6          Fell        1902.0      -33.16667   -64.95      (-33.16667, -64.9                          780.0     
5           Adhi Kot    379         Valid       EH4         Fell        1919.0      32.1        71.8        (32.1, 71.8)                               4239.0    
6           Adzhi-Bogd  390         Valid       LL3-6       Fell        1949.0      44.83333    95.16667    (44.83333, 95.166                          910.0     
7           Agen        392         Valid       H5          Fell        1814.0      44.21667    0.61667     (44.21667, 0.6166                          30000.0   
8           Aguada      398         Valid       L6          Fell        1930.0      -31.6       -65.23333   (-31.6, -65.23333                          1620.0    
9           Aguila Bla  417         Valid       L           Fell        1920.0      -30.86667   -64.55      (-30.86667, -64.5                          1440.0    

By convention we use all caps for the keywords, and lowercase for table/column names. This is by convention only, and sqlite3 is not case-sensitive.

We can filter the data that we retrieve by using the WHERE clause.

SELECT <columns> FROM <table> WHERE <condition> LIMIT 10;

Challenge

  • Find all the meteorites that fell between 1980 and 1990, and have a mass between 100-1,000g.
  • List the name and reclat/reclong for 15 of these meteorites

Solution

sqlite> SELECT name, reclat, reclong FROM landings WHERE year >= 1980 AND year <=1990 LIMIT 15;
name        reclat      reclong   
----------  ----------  ----------
Akyumak     39.91667    42.81667  
Aomori      40.81056    140.78556 
Bawku       11.08333    -0.18333  
Binningup   -33.15639   115.67639 
Burnwell    37.62194    -82.23722 
Ceniceros   26.46667    -105.23333
Chela       -3.66667    32.5      
Chiang Kha  17.9        101.63333 
Chisenga    -10.05944   33.395    
Claxton     32.1025     -81.87278 
Dahmani     35.61667    8.83333   
El Idrissi  34.41667    3.25      
Gashua      12.85       11.03333  
Glanerbrug  52.2        6.86667   
Guangnan    24.1        105.0     

You can interact with a database programmatically using the python module sqlite3. With the combination of sqlite3 and pandas modules, you can save a pandas data frame directly to a database file. This is one of the easiest ways to create a simple database file.

import sqlite3 as sql
import pandas as pd

# read the .csv file
df = pd.read_csv("Meteorite_Landings.csv")
# create a connection to our new (empty) database
con = sql.connect("Meteorite_Landings.db")

# update the name of mass column to avoid spaces and special chars
db['mass'] = db['mass (g)']
del db['mass (g)']

# save the dataframe to the database via the connection
df.to_sql(name='landings', con)

Unstructured data

Data that doesn’t conform to a standard tabular format are referred to as unstructured data. For example:

For these data we still need to be able to know where it is and where to put new data. In this case we can impose some structure using a directory/file structure. The structure that you use ultimately depends on how you’ll use the data.

For example, imagine that we have a photograph of each of the meteorites represented in our landings table from the previous exercise. The images could be organized in one of the following ways:

name.png
<name>/<location>/id.png
<continent>/<year>/id.png
<recclass>/<year>/name.png
<state>/<county>/<year>/<recclass>/id.png

In each of the above examples you can see that there is an intentional grouping and sub-grouping of the files which make sense for a given type of analysis, but will be confusing or tedious to work with for a different type of analysis. For example, <continent>/<year>/id.png will be convenient if you are looking at how meteorite monitoring/recovery has progressed in various countries over time, but it will be annoying to work with if you just want to select meteorites with a particular composition. Whatever choice you make for storing your data, it is a good idea to make a note of this scheme in a file like README.md or file_structure.txt, so that you and others can easily navigate it.

You could start to encode extra information into the filenames, and then describe this naming convention in the README.md file.

A very nice hybrid solution is to use a database to store some information about your image files in a table, and have a column which is the filename/location of the image file. In our example, this would mean appending a single new column to our existing database table or .csvfile. In the case where the data are the images, then you would need to start extracting information about the data (meta-data) from each of the files, and then store this in a database or table. For example if you have a set of video files you could extract the filming location (lat/long), time, duration, filename, and some keywords describing the content. This solution allows you to query a database to find the files that match some criteria, locate them on disk, and then use some video analysis software on each.

Meta-data

  • Think about some of the unstructured data that you collect or create as part of your research work
  • Write a short description of what these data are
  • Choose 4-5 pieces of metadata that you would store about these data
  • Use our shared document to share you description and meta-data

Key Points

  • Invest time in cleaning and transforming your data

  • Meta data gives context and makes your data easier to understand

  • How and where you store your data is important