Package Exports
- data-forge
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (data-forge) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
data-forge
JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Works in both NodeJS and the browser.
This project is a work in progress, please don't use unless you want to be an early adopter. Please expect API changes. Please contribute and help guide the direction of data-forge.
Note that some features described in this README are not yet implemented, although that list grows smaller every day.
Generated API docs
See here for generated API docs that are taking shape.
Contents
Table of Contents generated with DocToc
- Project Overview
- Installation
- Key Concepts
- Basic Usage
- Working with data
- Immutability and Chained Functions
- Data exploration and visualization
- Data transformation
- Node.js examples
- Browser examples
Project Overview
A short overview of the aims, principles and implementation methods for this project.
Project Aims
The aims of this project:
- To combine the best aspects of Pandas and LINQ and make them available in JavaScript.
- To be able to load, transform and save data.
- To be able to prepare data for visualization.
- Be able to load massive data files.
Driving Principles
The principles that drive decision making and tradeoffs:
- The API should be simple, easy to learn and easy to use.
- Minimize the magic, everything should be understandable, the API should be orthogonal.
- The library should have high performance.
- Be able to use a similar API in both Javascript and C#.
- The code you build during interactive data exploration should be transplantable to an webapp, server or microservice.
Implementation
General implementation goals:
- Immutable: every operation generates a new immutable data set.
- Lazy evaluation, to make the performance of immutability acceptable.
- Should be easily extensible.
- All the core code is created through test driven development.
The rest of the README defines the setup and usage of Data-Forge. Certain features described here are not implemented yet.
Installation
NodeJS installation and setup
Install via NPM:
npm install --save data-forge
Require the module into your script:
var dataForge = require('data-forge');
Data-Forge plugins under Node.js
Plugins are typically loaded into the Data-Forge namespace as follows, using data-forge-from-yahoo as an example.
Install via NPM:
npm install --save data-forge-from-yahoo
Import the module and use it:
var dataForge = require('data-forge');
dataForge.use(require('data-forge-from-yahoo'));
You can use functions defined by the plugin, eg
dataForge.fromYahoo('MSFT')
.then(function (dataFrame) {
// ... use the data returned from Yahoo ...
});
Browser installation and setup
Install via Bower:
bower install --save data-forge
Include the main script in your HTML file:
<script src="bower_components/data-forge/data-forge.js"></script>
You can now access the global dataForge
variable.
Data-Forge plugins under the browser
As in the Node.js example, plugins are typically loaded into the Data-Forge namespace. Example using data-forge-from-yahoo.
Install via Bower:
bower install --save data-forge-from-yahoo
Include in your HTML file:
<script src="bower_components/data-forge/data-forge.js"></script>
<script src="bower_components/data-forge-from-yahoo/data-forge-from-yahoo.js"></script>
Use functions defined by the plugin, eg:
dataForge.fromYahoo('MSFT')
.then(function (dataFrame) {
// ... use the data returned from Yahoo ...
});
Meteor installation and setup
Coming in the future. Can anyone help with that?
Getting the code
Install via NPM and Bower as described in previous sections or clone, fork or download the code from GitHub:
https://github.com/data-forge/data-forge-js
Key Concepts
This section explains the key concepts of Data-Forge.
Data Frame
This is the main concept. A matrix of data structured as rows and columns. Can be considered a sequence of rows. Has an implicit or explicit index. Think of it as a spreadsheet in memory.
A data-frame internal representation depends on how the data-frame was constructed. It can be stored as a collection of columns (when constructed manually) or a collection of rows (as is the case when constructed from a CSV file).
A data-frame can be easily constructed from various formats and it can be exported to various formats.
Row
A single row of data in a data-frame. Contains a slice of data across columns. Has an implicit or explicit index. A row is most commonly represented as a JavaScript object (with column names as fields). A row can also be represented as an array of values.
Series
A sequence of indexed values. These are often, but not always, time-series, where the values are indexed by date/time.
All values in a series are generally expected to have the same type, although this is not a requirement of data-forge-js.
Column
A single named series of data in a data-frame. Each column is simply a series with a name, the values of the series are the values of the column. A column is a slice of data through all rows.
Index
A sequence of values that is used to index a data-frame or series. When the data is a time-series the index is expected to contain Date values.
Used for operations that search and merge data-farmes and series.
If not specified an integer index (starting at 0) is generated based on row position. An index can be explicitly by specifying a column by name.
Lazy Evaluation
Data-frames, series and index are only fully evaluated when necessary. Operations are queued up and only fully evaluated as needed and when required, for example when serializing to csv or json (toCSV
or toJSON
) or when baking to values (toValues
or toObjects
).
A data-frame, series or index can be forcibly evaluated by calling the bake
function.
Iterator
Iterates the rows of a data-frame, series or index. Iterators allow lazy evaluation (row by row evaluation) of data frames, series and index. This is the same concept as an iterator in JavaScript or an enumerator in C#.
Basic Usage
Getting data in
The DataFrame constructor is passed a config object that specifies the initial contents of the data frame.
Create a data frame from column names and rows:
var dataFrame = new dataForge.DataFrame({
columnNames: ["Col1", "Col2", "Col3"],
rows: [
[1, 'hello', new Date(...)],
[5, 'computer', new Date(...)],
[10, 'good day', new Date(...)]
]
});
A data frame can also be created from an array of JavaScript objects:
var dataFrame = new dataForge.DataFrame({
rows: [
{
Col1: 1,
Col2: 'hello',
Col3: new Date(....)
},
{
Col1: 5,
Col2: 'computer',
Col3: new Date(....)
},
{
Col1: 10,
Col2: 'good day',
Col3: new Date(....)
}
]
});
Getting data out
To get back the names of columns:
var columnNames = dataFrame.getColumnNames();
To get back an array of rows:
var rows = dataFrame.toValues();
To get back an array of objects (whith column names as field names):
var objects = dataFrame.toObjects();
Setting an index
In the previous examples of creating a data-frame, an index was generated with values starting at zero.
An index can also be set explicitly when creating a data frame:
var dataFrame = new dataForge.DataFrame({
columnNames: <column-names>,
rows: <rows>,
index: new dataForge.Index([5, 10, 100])
});
Or an existing column can be promoted to an index:
var dataFrame = new dataForge.DataFrame(someConfig).setIndex("Col3");
Be aware that promoting a column to an index in Data-Forge doesn't remove the column (as it does in Pandas). You can easily achieve this by calling dropColumn
:
var dataFrame = new dataForge.DataFrame(someConfig).setIndex("Col3").dropColumn("Col3");
An index is required for certain operations like merge
.
Working with data
Data-Forge has built-in support for serializing and deserializing common data formats.
CSV
var dataFrame = dataForge.fromCSV("<csv-string-data>");
var csvTextData = dataFrame.toCSV();
JSON
var dataFrame = dataForge.fromJSON("<json-string-data>");
var jsonTextData = dataFrame.toJSON();
XML
var dataFrame = dataForge.fromXML("<xml-string-data>");
var xmlTextData = dataFrame.toXML();
YAML
var dataFrame = dataForge.fromYAML("<yaml-string-data>");
var yamlTextData = dataFrame.toYAML();
Reading and writing files in Node.js
The from / to functions can be used in combination with Node.js fs
functions for reading and writing files, eg:
var fs = require('fs');
var inputCsvData = fs.readFileSync('some-csv-file.csv', 'utf8');
var dataFrame = dataForge.fromCSV(inputCsvData );
var outputCsvData = dataFrame.toCSV();
fs.writeFileSync('some-other-csv-file.csv', outputCsvData);
See the examples section for more examples of loading various data sources and formats.
Extracting rows from a data frame
Rows can be extracted from a data-frame in several ways.
First we can lazily iterate using an iterator. This is the lowest-level method of accessing the rows of a data frame. Iterators allow lazy evaluation (like LINQ in C#).
var iterator = dataFrame.getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
There are higher-level ways to extract the rows. Under the hood these use iterators. These force lazy evaluation to complete (like the toArray function in LINQ).
var arrayOfArrays = dataFrame.toValues();
and
var arrayOfObjects = dataFrame.toObjects();
A new data-frame can be created from a slice of rows:
var startIndex = ... // Starting row index to include in subset.
var endIndex = ... // Ending row index to include in subset.
var rowSubset = dataFrame.slice(startIndex, endIndex);
Extracting columns and series from a data frame
Get the names of the columns:
var arrayOfColumnNames = dataFrame.getColumnNames();
Get an array of all columns:
var arrayOfColumns = dataFrame.getColumns();
for (var column in columns) {
var name = column.name;
var series = column.series;
// ... so something with the column ...
}
Get the series for a column by name:
var series = dataFrame.getSeries('some-series');
Get the series for a column by index:
var series = dataFrame.getSeries(5);
Create a new dataframe from a sub-set of columns:
var columnSubset = df.subset(["Some-Column", "Some-Other-Column"]);
Enumerating a series
Use an iterator to lazily enumerate a series:
var iterator = someSeries.getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
Extract the values from the series as an array. Note that this could be an expensive operation. Lazy evaluation of the entire-data frame may be forced to complete.
var arrayOfValues = someSeries.toValues();
Enumerating an index
Getting the index from a data-frame:
var index = dataFrame.getIndex();
Getting the index from a series:
var index = someSeries.getIndex();
The index can be enumerated lazily:
var iterator = dataFrame.getIndex().getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
Retrieve value from the index as an array:
var arrayOfValues = index.toValues();
Adding a column
New columns can be added to a data-frame. This doesn't change the original data-frame, it generates a new data-frame that contains the additional column.
var newDf = df.setSeries("Some-New-Column", someNewSeries);
Replacing a column
setSeries
can also replace an existing column:
var newDf = df.setSeries("Some-Existing-Column", someNewSeries);
Again note that it is only the new data frame that includes the modified column.
Generating a column
setSeries
can be passed a function that is used to generate a new column from the existing contents of the date-frame:
var newDf = df.setSeries("Generated-Column", function (row) {
var someValue = ...
// ... generate values for the new column from the contents of the data-frame ...
return someValue;
});
Removing a column
A column can easily be removed:
var newDf = df.dropColumn('Column-to-be-dropped');
Also works for multiple columns:
var newDf = df.dropColumn(['col1', 'col2']);
Immutability and Chained Functions
You may have noticed in previous examples that multiple functions have been chained.
Data-Forge supports only immutable operations. Each operation returns a new immutable data frame or column. No in place operations are supported (one of the things I found confusing about Pandas).
This is why, in the following example, the final data frame is captured after all operations are applied:
var df = new dataForge.DataFrame(config).setIndex("Col3").dropColumn("Col3");
Consider an alternate structure:
var df1 = new dataForge.DataFrame(config);
var df2 = df1.setIndex("Col3");
var df3 = df2.dropColumn("Col3");
Here df1, df2 and df3 are separate data-frames with the results of the previous operations applied. These data-frames are all immutable and cannot be changed. Any function that transforms a data-frame returns a new and independent data frame. If you are not used to this sort of thing, it may require some getting used to!
Lazy Evaluation
todo: Lazy evaluation in Data-Forge is implemented through iterators.
todo: move the examples of manually iteration here. todo: how do we get an iterator. todo: there are constraints on what the iterators can return.
A data-frame can be created from an iterable:
var df = new dataForge.DataFrame({ rows: someIterable});
The optional index, can also be an iterable:
var df = new dataForge.DataFrame({
rows: someIterable,
index: someOtherIterable
});
A series can also be created from iterables:
var series = new dataForge.Series({
rows: someIterable,
index: someOtherIterable
});
todo: show how lazy evaluation makes it possible to work with very large files.
Data exploration and visualization
In order to understand the data we are working with we must explore it, understand the data types involved and the composition of the values.
Console output
Data-frame, index and series all provide a toString
function that can be used to dump data to the console in a readable format.
Use the LINQ functions skip
and take
to preview a subset of the data (more on LINQ functions soon):
// Skip 10 rows, then dump 20 rows.
console.log(df.skip(10).take(20).toString());
Or more conveniently:
// Get a range of rows starting at row index 10 and ending at (but not including) row index 20.
console.log(df.slice(10, 20).toString());
As you explore a data set you may want to understand what data types you are working with. You can use the detectTypes
function to produce a new data frame with information on the data types in the data frame you are exploring:
// Create a data frame with details of the types from the source data frame.
var typesDf = df.detectTypes();
console.log(typesDf.toString());
For example, here is the output with data from Yahoo:
__index__ Type Frequency Column
---------------- ------ --------- ---------
0 date 100 Date
1 number 100 Open
2 number 100 High
3 number 100 Low
4 number 100 Close
5 number 100 Volume
6 number 100 Adj Close
You also probably want to understand the composition of values in the data frame. This can be done using detectValues
that examines the values and reports on their frequency:
// Create a data frame with the information on the frequency of values from the source data frame.
var valuesDf = df.detectValues();
console.log(valuesDf.toString());
Visual output
The Github repo has examples showing how to use data-forge with Flot.
There is a Code Project article on using Highstock with Data-Forge to chart Yahoo financial data.
Data transformation
Data frame transformation
An data-frame can be transformed using the LINQ-style select
function:
var transformedDataFrame = sourceDataFrame
.select(function (row) {
return {
NewColumn: row.OldColumn * 2, // <-- Transform existing column to create a new column.
AnotherNewColumn: rand(0, 100) // <-- Create a new column (in this cause just use random data).
};
});
This produces an entirely new data-frame. However the new data-frame has the same index as the source data-frame, so both can be merged back together, if required:
var mergedDataFrame = dataForge.merge(sourceDataFrame, transformedDataFrame);
The more advanced selectMany
function is also available.
Note: Data-frames are immutable, the source data-frame remains unmodified.
Series transformation
Series can be transformed using select
:
var oldSeries = df.getSeries("Some-Column");
var newSeries = oldSeries
.select(function (value) {
return transform(value); // <-- Apply a transformation to each value in the column.
});
The source index is preserved to the transformed series.
Note: Series are immutable, the original series is unmodified.
Data-frame and series filtering
Data-frames and series can be filtered using the LINQ-style where
function:
var newDf = df
.where(function (row) {
// ... return true to include the row in the new data frame, return false to exclude it ...
});
LINQ functions
Most of the other LINQ functions are or will be available.
More documentation will be here soon on supported LINQ functions.
Summarization and Aggregation
Aggregate
todo
Merge
todo
Window
The window
function allows you to produce a new series for a data-frame or series by considering only a window of rows at a time. The window passes over the data-frame or series batch-by-batch, taking the first N rows for the first window, then the second N rows for the next window and so on.
var windowSize = 5; // Looking at 5 rows at a times.
var newSeries = sourceSeriesOrDataFrame.window(windowSize,
function (window) {
var index = ... compute index for row ...
var value = ... compute value for row ...
return [index, value]; // Generate a row in the new series.
}
);
Each invocation of the selector function is passed a data-frame or series that represents the window of time requested. The return value of the selector produces a row in the new series: an array is returned that contains a pair of values. The pair specifies the index and value for the row.
The window function can be used for summarization and aggregation of data.
As an example consider computing the weekly totals for daily sales data:
var salesData = ... series containing amount sold on each day ...
var weeklySales = salesData.window(7,
function (window) {
return [
window.getIndex().last(), // Week ending.
window.sum(), // Total the amount sold during the week.
];
},
);
Rolling window
Like the window function, rollingWindow
considers a window of rows at a time. This function however rolls across the data-frame or series row-by-row rather than batch-by-batch. The selector function produces a new series with summarized or aggregated data.
The percentChange
function that is included is probably the simplest example use of rollingWindow
. It computes a new series with the percentage increase of each value in the source series.
The implementation of percentChange
looks a bit like this:
var windowSize = 2;
var pctChangeSeries = sourceSeries.rollingWindow(windowSize,
function (indices, values) {
var amountChange = values[1] - values[0]; // Compute amount of change.
var pctChange = amountChange / values[0]; // Compute % change.
return [indices[1], pctChange]; // Return new index and value.
}
);
percentChange
is simple because it only considers a window size of 2.
Now consider an example that requires a variable window size. Here is some code that computes a simple moving average (derived from data-forge-indicators):
var Enumerable = require('linq');
var smaPeriod = ... variable window size ...
var smSeries = sourceSeries.rollingWindow(smaPeriod,
function (window) {
return [
window.getIndex().last(),
window.sum() / period,
}
);
Node.js examples
Working with CSV files
var fs = require('fs');
var dataForge = require('data-forge');
var inputFilePath = "input-file.csv";
var outputFilePath = "output-file.csv";
var inputDataFrame = dataForge.fromCSV(fs.readFileSync(inputFilePath, 'utf8'));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
fs.writeFileSync(outputFilePath, outputDataFrame.toCSV());
Working with JSON files
var fs = require('fs');
var dataForge = require('data-forge');
var inputFilePath = "input-file.json";
var outputFilePath = "output-file.json";
var inputDataFrame = dataForge.fromJSON(fs.readFileSync(inputFilePath, 'utf8'));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
fs.writeFileSync(outputFilePath, outputDataFrame.toJSON());
Working a massive CSV file
When working with large text files use FileReader and FileWriter. FileReader is an iterator, it allows the specified file to be loaded piecemeal, in chunks, as required. FileWriter allows iterative output. These work in combination with lazy evaluation so to incrementally read, process and write massive files that are too large or too slow to work with in memory in their entirety.
var dataForge = require('data-forge');
var FileReader = require('data-forge/file-reader');
var FileWriter = require('data-forge/file-writer');
var inputFilePath = "input-file.csv";
var outputFilePath = "output-file.csv";
// Read the file as it is processed.
var inputDataFrame = dataForge.from(new FileReader(inputFilePath));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
dataForge.to(new FileWriter(outputDataFrame));
Working with a MongoDB collection
var pmongo = require('promised-mongo');
var db = pmongo('localhost/some-database', ['someCollection', 'someOtherCollection']);
db.someCollection.find().toArray()
.then(function (documents) {
var inputDataFrame = new dataForge.DataFrame({ rows: documents });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return db.someOtherCollection.insert(outputDataFrame.toObjects());
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Working with a massive MongoDB collection
Same as previous example, except use skip and take to only process a window of the collection.
var pmongo = require('promised-mongo');
var db = pmongo('localhost/some-database', ['someCollection', 'someOtherCollection']);
db.someCollection.find()
.skip(300)
.take(100)
.toArray()
.then(function (documents) {
var inputDataFrame = new dataForge.DataFrame({ rows: documents });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return db.someOtherCollection.insert(outputDataFrame.toObjects());
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Working with HTTP
var request = require('request-promise');
request(
{
method: 'GET',
uri: "http://some-host/a/rest/api',
json: true,
})
.then(function (data) {
var inputDataFrame = new DataFrame({ rows: data });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return request(
{
method: 'POST',
uri: "http://some-host/another/rest/api',
body: {
data: outputDataFrame.toObjects()
},
json: true,
});
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Browser examples
Working with HTTP in the browser
This example depends on the jQuery get function.
Note the differences in the way plugins are referenced than in the NodeJS version.
HTML
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript for JSON
var url = "http://somewhere.com/rest/api";
$.get(url,
function (data) {
var dataFrame = new dataForge.DataFrame({ rows: data });
// ... work with the data frame ...
}
);
var someDataFrame = ...
$.post(url, someDataFrame.toObjects(),
function (data) {
// ...
}
);
Javascript for CSV
var url = "http://somewhere.com/rest/api";
$.get(url,
function (data) {
var dataFrame = dataForge.fromCSV();
// ... work with the data frame ...
}
);
var someDataFrame = ...
$.post(url, someDataFrame.toCSV(),
function (data) {
// ...
}
);
Working with HTTP in AngularJS
HTML
<script src="bower_components/angular/angular.js"></script>
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript
// Assume [$http](https://docs.angularjs.org/api/ng/service/$http) is injected into your controller.
var url = "http://somewhere.com/rest/api";
$http.get(url)
.then(function (data) {
var dataFrame = new dataForge.DataFrame(data);
// ... work with the data frame ...
})
.catch(function (err) {
// ... handle error ...
});
var someDataFrame = ...
$http.post(url, someDataFrame.toObjects())
.then(function () {
// ... handle success ...
})
.catch(function (err) {
// ... handle error ...
});
Visualisation with Flot
todo
Visualisation with Highstock
todo