Many open-source tools can be used to manipulate datasets and perform data mining analyses from the command line. In this lab, we will look at using Python for simple data analysis from a set of files.
Resources:
Python code can be run interactively from the python environment or through a script containing Python code. Scripts typically end in the extension .py and contain the line import sys at the top of the file.
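As a rough illustration of the script form, the sketch below shows a minimal .py file that imports sys and reads an optional file name from the command line. The function name main and the messages it returns are made up for this example.

```python
import sys


def main(argv):
    # argv[0] is the script name; anything after it is a command-line argument.
    if len(argv) > 1:
        return "Reading data from " + argv[1]
    return "No input file given"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    print(main(sys.argv))
```

If this were saved as, say, lab.py, it could be run with: python lab.py weather-data.csv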
There are many ways to get data into Python. One can use the built-in file tools to parse a file, import data from a datastore (for example, JSON objects from MongoDB or a similar database), and so forth. For simplicity in this lab, we will take the US Baby Names example in "Python for Data Analysis" and modify it for another dataset.
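To make two of these options concrete, here is a small sketch using only the standard library: the csv module for delimited text and the json module for a JSON object. The sample text, column names, and keys are invented for illustration and are not taken from the lab's data file.

```python
import csv
import io
import json

# Made-up CSV text; a real file would be opened with open("somefile.csv").
csv_text = "DATE,TMAX\n20040101,156\n20040102,144\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Made-up JSON text, similar in shape to a document exported from a database.
json_text = '{"station": "Meadows Field", "tmax": 156}'
record = json.loads(json_text)

# csv gives every field back as a string; json keeps numbers as numbers.
print(rows[0]["TMAX"])
print(record["tmax"])
```

Note that values read with the csv module need to be converted (for example with int) before doing arithmetic on them.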
If you have Linux on your own laptop, go to the Preliminaries->Essential Python Libraries section of "Python for Data Analysis" and make sure you have all the indicated packages installed. The book also contains instructions for installing the necessary Python libraries on Windows and Mac OS X.
Update: Steve got python-pandas installed on Sleipnir, so you can use Sleipnir instead of the Debian VM for this assignment.
If you are using the lab machines, download the following virtual machine, which is a basic Debian 7.3.0 image with command-line Python installed: debian-vm.tar.bz2 (2.3GB)
Extract with the command
tar -xjf debian-vm.tar.bz2
Note that any files saved on the student account in Rm 311 will be deleted at the end of the day, so you might want to save the VM on a flash drive.
Use the daily weather records for Meadows Field in the following file: weather-data.csv
The data is in CSV format and covers 2004-01-01 to 2013-12-31 for Meadows Field airport in Bakersfield, CA. It contains the precipitation, temperature, sun, and wind data for each day in that range. See the NOAA description of the daily summary for the meaning of each field. Note in particular that the temperature fields store decimal values as integers in tenths of a degree, e.g. if the original value was 8.9C, then the CSV file contains 89.
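The temperature encoding can be undone by dividing by 10. The sketch below shows one way to do that; the function name tenths_to_celsius is hypothetical, and the sample values are made up rather than read from weather-data.csv.

```python
def tenths_to_celsius(raw):
    """Convert a temperature stored as integer tenths of a degree C to degrees C."""
    # int() accepts both strings (as read from a CSV file) and integers.
    return int(raw) / 10.0


# Made-up raw values in the same encoding as the CSV file.
raw_values = ["89", "156", "-12"]
celsius = [tenths_to_celsius(v) for v in raw_values]
print(celsius)
```

Dividing by 10.0 rather than 10 keeps the result a decimal value on older Python versions as well.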
If you wish to download a different dataset, go to NOAA National Climate Data Center and search for "Meadows Field" as the weather station to get the data for the Bakersfield airport (for some reason, searching for Bakersfield as the city brings up Visalia and Canada before it brings up Meadows Field).
Use the describe functionality to display the overall statistics for the entire dataset.
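As a hedged sketch of what this step might look like with pandas, the example below loads a few made-up rows and calls describe on the resulting DataFrame. The column names DATE, PRCP, TMAX, and TMIN are assumptions about the file's header, so check them against the first line of weather-data.csv.

```python
import io

import pandas as pd

# Made-up sample rows; the column names are assumed, not confirmed.
sample = io.StringIO(
    "DATE,PRCP,TMAX,TMIN\n"
    "20040101,0,156,44\n"
    "20040102,25,144,61\n"
    "20040103,0,133,28\n"
)

weather = pd.read_csv(sample)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
stats = weather.describe()
print(stats)
```

For the actual lab, the io.StringIO object would be replaced by the file name, as in pd.read_csv("weather-data.csv").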
Create a writeup with the commands used to perform the above actions, INCLUDING any preparatory commands you used to load the data into Python. You can make this a Python script file (.py file) if you like. Submit the Python commands on Moodle as your writeup for this lab.