Data Processing

Since its inception, CDIP has been committed to using a fully-automated, realtime processing system for all of the data the program collects. This emphasis allows us to provide the most up-to-date and relevant coastal data to a wide range of users, from surfers and boaters to the National Weather Service. But the data are not only timely; the processing system includes a range of rigorous quality control tests, ensuring that CDIP data are highly accurate and reliable as well.

System Organization

Every 30 minutes of every day, each buoy and shore station contacts the virtual computer located within Amazon’s cloud. The acquisition software compiles that data into files which are lightly processed and forwarded to CDIP’s central computing facility at SIO. Those files are then passed through CDIP’s automated processing and distribution system. This system performs a wide range of analyses and data transformations, producing everything from error reports and diagnostic e-mail to condensed paramters and web tables.

When the data first arrive in the Lab, they are ‘raw’ data files, that is they are data directly as read from the sensors in the field, without any significant modification or editing. The data in these files have not been decoded or calibrated; they are effectively a byte-by-byte record of a sensor’s output. There are many different types of raw data files depending on the sensor and method of data acquisition.

Turning this raw data into to all of the valuable products and information that are found on the CDIP website is basically a two-step process. First, after verifying the source and timing of the file, it is either decoded, or in the case of non-buoy data (pressure sensor, anemometer, etc.), calibrated and used to produce a calibrated version of the original file, called a ‘df’ - diskfarm - file.

The next step of the processing addresses the question: how well is the sensor measuring what it’s supposed to? A range of quality control checks are performed on the data, to check if they are suitable for further processing. If so, a variety of calculations and transformations are performed on the data, and finally the results are distributed to all the appropriate products. (For a more detailed version of the image below, please see the processing flowchart.)

image0

Software

At the heart of CDIP’s processing system is a FORTRAN program called meta_proc. This program processes hundreds of data files a day, supplying thousands of users with the information they need. Meta_proc relies on two important archives of information to know exactly how to process each file originating from a specific sensor at a specific station: the sensor archive and the processing archive.

The sensor archive

The sensor archive is a database that holds complete descriptions of all the sensors that have been deployed in the field to collect data for CDIP. Serial numbers, deployment and recovery dates, water depths: they are all detailed in the sensor archive.

Sample information from the sensor archive image1

The processing archive

Like the sensor archive, the processing archive is a database containing a number of time frames for each station. These time frames give complete processing instructions for a station at any point in its history. For example, for an array of pressure sensors, the processing archive will state which of the sensors should undergo “Gauge comparisons”, a rigorous set of quality control checks designed to confirm that the data from closely-associated pressure sensors is in agreement. The processing archive will also state which sensors should be used for directional processing. For a station like Harvest Platform (063), with eight pressure sensors in close proximity, the sensors used for directional processing may vary considerably over the life of the station.

Sample information from the processing archive image2

The processing archive also spells out which sensors should be used to produce all of the plots, tables, and other products that are generated from the diskfarm and made accessible on the web. By setting up a number of “parameter streams” in the processing archive, different products can be generated for a single station and presented to web users as separate data sets.

Sensor and processing archive descriptions

META_PROC: Getting data to our users

Once meta_proc has consulted the processing archive and assembled all the instructions for handling a station’s data, it proceeds through three main stages of processing. First comes quality control and editing. The data are subjected to a range of tests - checks for extreme values, spikes, abnormal distributions, etc. (For a more detailed description of the tests used, please see the following section, on quality control.) When problems are found in the data, the values may be edited - perhaps removing a spike from a time series - or they may be rejected as unfit for further processing.

If the data pass the QC tests, processing continues to the next stage: decoding and transformation. This is where FORTRAN’s number-crunching facilities are used to great advantage, as the bulk of the calculations occur at this point. In wave analysis, for instance, the time series from pressure sensors undergo spectral and directional analyses, and a range of algorithms transform the data into spectral coefficients, condensed parameters (Hs, Tp, etc.), and the like.

At this stage of the processing - as at most others - Datawell directional buoys are treated somewhat differently than other sensors. Datawell buoys perrform spectral and directional analyses internally, and the buoys output this spectral data along with the buoy’s displacement time series. For this reason, CDIP does not actually do any number-crunching for the Datawell buoys. Instead, meta_proc simply decodes the spectral information produced by the buoy itself.

Once all of the calculations have been completed, further QC is performed: are the results reasonable? If so, there is one remaining stage to the processing: distributing the results to all of the relevant products. Since a single station’s sensors may be used to generate a number of different data sets at any point in time, meta_proc carefully follows the processing archive’s instructions to ensure that its results are added to all of the appropriate files and databases. Once this last stage is complete, all of CDIP’s standard products are easily accessible to web users, the National Weather Service, and a number of research insitutions.

More: Full meta_proc documentation

Quality Control

CDIP needs to provide its users with data which are not only timely, but accurate as well; this is a responsibility that is taken very seriously. Rigorous quality controls are implemented at several stages in the processing, and catch the vast majority of problematic files.

As described above, the first quality control checks ensure that each data file is properly attributed, with its full provenance - both time and place - accounted for. Then, as the data are processed by meta_proc, CDIP’s full suite of QC algorithms and analyses is deployed. For time series data, a wide range of analyses are used, with different tests applied to different data types. For water column and vertical displacement time series - i.e. wave measurements - the checks include: extreme values test, spike test, mean shift test, flat episodes test, mean crossing test, equal peaks test, acceleration test, and period distribution test. Some of the tests edit the time series, cleaning up the data where possible; others simply flag it bad. Where multiple sensors are deployed in close proximity, the above tests are followed by a battery of comparison tests, to ensure that the sensors are in agreement. For a full description of the tests used and the data types to which they are applied, please refer to our QC documentation.

image3
Time series from a buoy with a bad hippy (heave-pitch-yaw) sensor

Once the time series have been processed, the resulting values - condensed parameters and spectral information - also undergo QC checks. For Datawell buoys these checks are quite extensive, since the buoys perform their own time-series handling and spectral processing internally. For example, the time series above shows a problem - spikes in a large, long-period waveform - that CDIP’s time series tests could easily identify. But since this data is from a Datawell buoy, the time series was processed internally, so no CDIP editing was applied.

Nonetheless, CDIP’s post-processing checks correctly identify the problem with this file, since the resulting spectral distribution and parameter values (in this case Tp) are skewed. Here the long-period spikes result in a spectral shift to lower frequencies, and in an unnaturally high Tp value (approximately 28 seconds).

image4

Two spectral plots. The upper plot shows the shifted distribution of the time series data above; the lower plot shows a more typical spectral distribution.

In addition to the checks outlined above, there is one more full stage of automated QC applied to CDIP data. While all the previous analyses are applied to a single file’s data, the final QC tests address a station’s data over longer time periods, dozens of files. Once per day, all of a station’s recently acquired condensed parameters are compiled and compared. Once again, a range of tests check for spikes, unusual values, and the like, notifying CDIP staff via e-mail if anything seems amiss. (Please read our post-processing documentation for more details.) All of these automated measures, combined with periodic visual inspections, are very effective in preventing the distribution of erroneous data.

More: Automated QC Non-automated QC QC summary table Time series editor