Package Exports
- data-forge
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (data-forge) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
data-forge
JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Works in both NodeJS and the browser.
This project is a work in progress, please don't use unless you want to be an early adopter. Please expect API changes. Please contribute and help guide the direction of data-forge.
Generated API docs
See here for generated API docs that are taking shape.
Contents
Table of Contents generated with DocToc
- Project Aims
- Driving Principles
- Implementation
- Installation
- Key Concepts
- Basic Usage
- Working with data
- Immutability and Chained Functions
- Data exploration and visualization
- Data transformation
- Examples
Project Aims
The aims of this project:
- To combine the best aspects of Pandas and LINQ and make them available in JavaScript and C#.
- To be able to load data, transform and save data.
- To be able to prepare data for visualization.
- Be able to load massive data files.
Driving Principles
The principles that drive decision making and tradeoffs:
- Simple, easy to learn, easy to use.
- Minimize the magic, everything should be understandable, the API should be orthogonal.
- High performance.
- Be able to use the same (or very similar) API in both Javascript and C#.
- The code you build during interactive data exploration should be transplantable to an app or microservice.
Implementation
General implementation goals:
- Immutable, every operation generates a new immutable data set.
- Lazy evaluation, to make the performance of immutability acceptable.
- Extensible via plugins for data sources and formats.
The rest of the README defines the setup and usage of Data-Forge. Certain features described here are not implemented yet.
Installation
NodeJS installation and setup
Install via NPM:
npm install --save data-forge
Require the module into your script:
var dataForge = require('data-forge');
Data-Forge plugins under Node.js
Plugins are typically loaded into the Data-Forge namespace as follows, using data-forge-from-yahoo (todo: link to repo) as an example.
Install via NPM:
npm install --save data-forge-from-yahoo
Required and use:
var dataForge = require('data-forge');
dataForge.use(require('data-forge-from-yahoo'));
You can use functions defined by the plugin, eg
dataForge.fromYahoo('MSFT')
.then(function (dataFrame) {
// ... use the data returned from Yahoo ...
});
Browser installation and setup
Install via Bower:
bower install --save data-forge
Include the main script in your HTML file:
<script src="bower_components/data-forge/data-forge.js"></script>
You can now access the global dataForge
variable.
Data-Forge plugins under the browser
As in the Node.js example, plugins are typically loaded into the Data-Forge namespace. Example using data-forge-from-yahoo (todo: link to repo).
Install via Bower:
bower install --save data-forge-from-yahoo
Include in your HTML file:
<script src="bower_components/data-forge/data-forge.js"></script>
<script src="bower_components/data-forge-from-yahoo/data-forge-from-yahoo.js"></script>
Use functions defined by the plugin, eg:
dataForge.fromYahoo('MSFT')
.then(function (dataFrame) {
// ... use the data returned from Yahoo ...
});
Getting the code
Install via NPM and Bower as described in previous sections or clone, fork or download the code from GitHub:
https://github.com/Real-Serious-Games/data-forge-js
Key Concepts
This section explains the key concepts of Data-Forge.
Data Frame
This is the main concept. A matrix of data structured as rows and columns. Can be considered a sequence of rows. Has an implicit or explicit index. Think of it as a spreadsheet in memory.
Row
A single row of data in a data frame. Contains a slice of data across columns. Has an implicit or explicit index. An JavaScript object or an array of values is associated with each row.
Column
A single named column of data in a data frame. Contains a slice of data through all rows. A sequence of values is associated with a column. All values in a column are generally expected to have the same type, although this is not a requirement of data-forge-js.
Index
Used to index a data frame for operations such as merge. If not specified an integer index (starting at 0) is generated based on row position. An index can be explicitly set by promoting a column to an index.
Lazy Evaluation
Data frames and columns are only fully evaluated when necessary. Operations that are applied to data frames and columns are queued up and only executed when the full data is required, for example when serializing to csv or json (toCSV
or toJSON
) or when baking to values (toValues
or toObjects
). A data frame or column can be forcibly evaluated by calling the bake
function.
Iterator
An object that iterates the rows of a data frame or column. Iterators allow lazy evaluation (row by row evaluation) of data frames and columns. This is the same concept as an iterator in JavaScript or an enumerator in C#.
Basic Usage
Creating a Data Frame
The DataFrame constructor is passed a config object that specifies the initial contents of the data frame.
Create a data frame from column names and rows:
var dataFrame = new dataForge.DataFrame({
columnNames: ["Col1", "Col2", "Col3"],
rows: [
[1, 'hello', new Date(...)],
[5, 'computer', new Date(...)],
[10, 'good day', new Date(...)]
]
});
A data frame can also be created from an array of JavaScript objects:
var dataFrame = new dataForge.DataFrame({
rows: [
{
Col1: 1,
Col2: 'hello',
Col3: new Date(....)
},
{
Col1: 5,
Col2: 'computer',
Col3: new Date(....)
},
{
Col1: 10,
Col2: 'good day',
Col3: new Date(....)
}
]
});
Setting an index
The previous examples each generated an index with the values 0, 1, 2.
An index can explicitly be provided when creating a data frame:
var dataFrame = new dataForge.DataFrame({
columnNames: <column-names>,
rows: <rows>,
index: new dataForge.Index([5, 10, 100])
});
Or an existing column can be promoted to an index:
var dataFrame = new dataForge.DataFrame(someConfig).setIndex("Col3");
Be aware that promoting a column to an index in Data-Forge doesn't remove the column (as it does in Pandas). You can easily achieve this by calling dropColumn
:
var dataFrame = new dataForge.DataFrame(someConfig).setIndex("Col3").dropColumn("Col3");
An index is required for certain operations like merge
.
Working with data
Data-Forge has built-in support for serializing and deserializing common data formats.
CSV
var dataFrame = dataForge.fromCSV("<csv-string-data>");
var csvTextData = dataFrame.toCSV();
JSON
var dataFrame = dataForge.fromJSON("<json-string-data>");
var jsonTextData = dataFrame.toJSON();
XML
var dataFrame = dataForge.fromXML("<xml-string-data>");
var xmlTextData = dataFrame.toXML();
YAML
var dataFrame = dataForge.fromYAML("<yaml-string-data>");
var yamlTextData = dataFrame.toYAML();
Reading and writing files in Node.js
The from / to functions can be used in combination with Node.js fs
functions for reading and writing files, eg:
var fs = require('fs');
var dataFrame = dataForge.fromCSV(fs.readFileSync('some-csv-file.csv', 'utf8'));
fs.writeFileSync('some-other-csv-file.csv', dataFrame.toCSV());
See the examples section for examples of loading various data sources and formats.
Enumerating rows
Rows can be extracted from a data frame in several ways.
First we can lazily iterate using an iterator. This is the lowest-level method of accessing the rows of a data frame. Using iterators allows data frames and columns to be lazily evaluated (same as with LINQ in C#).
var iterator = dataFrame.getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
There are higher-level ways to extract the rows. Under the hood these use iterators. These force lazy evaluation to complete (like the toArray function in LINQ).
var arrayOfArrays = dataFrame.toValues();
and
var arrayOfObjects = dataFrame.toObjects();
Create a new data frame from a subset of rows:
var startIndex = ... // Starting row index to include in subset.
var endIndex = ... // Ending row index to include in subset.
var rowSubset = dataFrame.getRowsSubset(startIndex, endIndex);
Enumerating columns
Get the names of the columns:
var arrayOfColumnNames = dataFrame.getColumnNames();
Get an array of all columns:
var arrayOfColumns = dataFrame.getColumns();
Use an iterator to lazily iterate an individual column:
var iterator = someColumn.getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
Slice out an array of values for an individual column. Note that this could be an expensive operation. Lazy evaluation of the entire data frame will be forced to complete.
var arrayOfValues = someColumn.toValues();
Create a new data frame from a sub-set of columns:
var columnSubset = df.getColumnsSubset(["Some-Column", "Some-Other-Column"]);
Enumerating the index
The index can also be lazily iterated:
var iterator = dataFrame.getIndex().getIterator();
while (iterator.moveNext()) {
var row = iterator.getCurrent();
// do something with the row.
}
Retrieve an array of an index's values:
var arrayOfValues = dataFrame.getIndex().toValues();
Direct column access
Individual columns can be extracted by name:
var column = dataFrame.getColumn("some-column");
Or by zero-based index:
var column = dataFrame.getColumn(5);
Adding a column
New columns can be added to a data frame. This doesn't change the original data frame, it generates a new data frame that contains the additional column.
var newDf = df.setColumn("Some-New-Column", newColumnObject);
Replacing a column
setColumn
can also replace an existing column:
var newDf = df.setColumn("Some-Existing-Column", newColumnObject);
Again note that it is only the new data frame that includes the modified column.
Removing a column
A column can easily be removed:
var newDf = df.dropColumn('Column-to-be-dropped');
Immutability and Chained Functions
You may have noticed in previous examples that multiple functions have been chained.
data-forge supports only immutable operations. Each operation returns a new immutable data frame or column. No in place operations are supported (one of the things I found confusing about Pandas).
This is why, in the following example, the final data frame is captured after all operations are applied:
var df = new dataForge.DataFrame(config).setIndex("Col3").dropColumn("Col3");
Consider an alternate structure:
var df1 = new dataForge.DataFrame(config);
var df2 = df1.setIndex("Col3");
var df3 = df2.dropColumn("Col3");
Here df1, df2 and df3 are separate data frames with the results of the previous operations applied. These data frames are all immutable and cannot be changed. Any function that transforms a data frame returns a new and independent data frame. This is great, but may require some getting used to!
Data exploration and visualization
In order to understand the data we are working with we must explore it, understand the data types involved and composition of the values.
Console output
Data frame, index and column all provide a toString
function that can be used to dump data to the console in a readable format.
Use the LINQ functions skip
and take
to preview a subset of the data (more on LINQ functions soon):
// Skip 10 rows, then dump 20 rows.
console.log(df.skip(10).take(20).toString());
Or more conveniently:
// Get a range of rows starting at row index 10 and ending at (but not including) row index 20.
console.log(df.getRowsSubset(10, 20).toString());
As you explore a data set you may want to understand what data types you are working with. You can use the detectTypes
function to produce a new data frame with information on the data types in the data frame you are exploring:
// Create a data frame with details of the types from the source data frame.
var typesDf = df.detectTypes();
console.log(typesDf.toString());
todo: show example output here.
You also probably want to understand the composition of values in the data frame. This can be done using detectValues
that examines the values and reports on their frequency:
// Create a data frame with the information on the frequency of values from the source data frame.
var valuesDf = df.detectValues();
console.log(valuesDf.toString());
todo: show example output here.
Visual output
More on this soon. If you need to get started now the Github repo has examples showing how to use data-forge with Flot.
Data transformation
Data frame transformation
An entire data frame can be transformed using the LINQ-style select
function:
var transformedDataFrame = df
.select(function (row) {
return {
NewColumn: row.OldColumn * 2, // <-- Transform existing column to create a new column.
AnotherNewColumn: rand(0, 100) // <-- Create a new column (in this cause just use random data).
};
});
The assigned index is maintained for the transformed data frame.
The more advanced selectMany
function is also available.
Note: Data frames are immutable, the original data frame is unmodified.
Column transformation
Columns can also be transformed using select
:
var oldColumn = df.getColumn("Some-Column");
var newColumn = oldColumn
.select(function (value) {
return transform(value); // <-- Apply a transformation to each value in the column.
});
The column index is maintained for the transformed column.
Note: Columns are immutable, the original column is unmodified.
Data frame and column filtering
Data frames and columns can be filtered using the LINQ-style where
function:
var newDf = df
.where(function (row) {
// .. return true to include the row in the new data frame, return false to exclude it ...
});
LINQ functions
Most of the other LINQ functions are or will be available.
More documentation will be here soon on supported LINQ functions.
Data frame aggregation
todo
Data frame window
todo
Examples
Working with CSV files
var fs = require('fs');
var dataForge = require('data-forge');
var inputFilePath = "input-file.csv";
var outputFilePath = "output-file.csv";
var inputDataFrame = dataForge.fromCSV(fs.readFileSync(inputFilePath, 'utf8'));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
fs.writeFileSync(outputFilePath, outputDataFrame.toCSV());
Working with JSON files
var fs = require('fs');
var dataForge = require('data-forge');
var inputFilePath = "input-file.json";
var outputFilePath = "output-file.json";
var inputDataFrame = dataForge.fromJSON(fs.readFileSync(inputFilePath, 'utf8'));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
fs.writeFileSync(outputFilePath, outputDataFrame.toJSON());
Working a massive CSV file
When working with large text files use FileReader and FileWriter. FileReader is an iterator, it allows the specified file to be loaded piecemeal, in chunks, as required. FileWriter allows iterative output. These work in combination with lazy evaluation so to incrementally read, process and write massive files that are too large or too slow to work with in memory in their entirety.
var dataForge = require('data-forge');
var FileReader = require('data-forge/file-reader');
var FileWriter = require('data-forge/file-writer');
var inputFilePath = "input-file.csv";
var outputFilePath = "output-file.csv";
// Read the file as it is processed.
var inputDataFrame = dataForge.from(new FileReader(inputFilePath));
var outputDataFrame = inputDataFrame.select(... some transformation ...);
dataForge.to(new FileWriter(outputDataFrame));
Working with a MongoDB collection
var pmongo = require('promised-mongo');
var db = pmongo('localhost/some-database', ['someCollection', 'someOtherCollection']);
db.someCollection.find().toArray()
.then(function (documents) {
var inputDataFrame = new dataForge.DataFrame({ rows: documents });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return db.someOtherCollection.insert(outputDataFrame.toObjects());
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Working with a massive MongoDB collection
Same as previous example, except use skip and take to only process a window of the collection.
var pmongo = require('promised-mongo');
var db = pmongo('localhost/some-database', ['someCollection', 'someOtherCollection']);
db.someCollection.find()
.skip(300)
.take(100)
.toArray()
.then(function (documents) {
var inputDataFrame = new dataForge.DataFrame({ rows: documents });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return db.someOtherCollection.insert(outputDataFrame.toObjects());
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Working with HTTP
var request = require('request-promise');
request(
{
method: 'GET',
uri: "http://some-host/a/rest/api',
json: true,
})
.then(function (data) {
var inputDataFrame = new DataFrame({ rows: data });
var outputDataFrame = inputDataFrame.select(... some transformation ...);
return request(
{
method: 'POST',
uri: "http://some-host/another/rest/api',
body: {
data: outputDataFrame.toObjects()
},
json: true,
});
})
.then(function () {
console.log('Done!');
})
.catch(function (err) {
console.error(err);
});
Working with HTTP in the browser
todo: this section needs to be replaced
Note the differences in the way plugins are referenced than in the NodeJS version.
HTML:
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript:
var url = "http://somewhere.com/rest/api";
dataForge.from(dataForge.http(url)) // <-- HTTP GET data from REST API.
.as(dataForge.json()) // <-- Deserialize the file from JSON.
.then(function (dataFrame) {
// ... transform the data frame ... // <-- Transform the data.
return dataFrame.as(dataForge.json()) // <-- Serialize the file to JSON.
.to(dataForge.http(url)); // <-- HTTP POST data to REST API.
})
.then(function () {
console.log('Success!'); // <-- Success!
});
.catch(function (err) {
console.error(err && err.stack || err); // <-- Handle errors.
});
Working with HTTP in AngularJS
todo: this section needs to be replaced
HTML:
<script src="bower_components/data-forge/data-forge.js"></script>
Javascript:
var url = "http://somewhere.com/rest/api";
dataForge.from(dataForge.http(url)) // <-- HTTP GET data from REST API.
.as(dataForge.json()) // <-- Deserialize the file from JSON.
.then(function (dataFrame) {
// ... transform the data frame ... // <-- Transform the data.
return dataFrame.as(dataForge.json()) // <-- Serialize the file to JSON.
.to(dataForge.http(url)); // <-- HTTP POST data to REST API.
})
.then(function () {
console.log('Success!'); // <-- Success!
});
.catch(function (err) {
console.error(err && err.stack || err); // <-- Handle errors.
});