Datapot
=======
|Build Status| *Open source tool for machine learning on semi-structured
data that creates a numeric object-feature matrix from JSON. Datapot
aims to make data preparation and feature extraction automatic, easy,
and effective.*
Usage
-----
**Install Datapot:**

.. code:: bash

    $ git clone https://github.com/bashalex/datapot.git
    $ cd datapot
    $ pip install .

To **create a Datapot** object, write the following:

.. code:: python

    >>> import datapot as dp
    >>> data = dp.DataPot()

DataPot has two main methods:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- fit()
- transform()
The ``fit(self, data, limit)`` method reads the first ``limit`` objects
and passes each candidate field to the Transformers. Each Transformer
checks whether it can build a feature from a single field or from a
group of fields. The result is a dict that maps features to the
Transformers that will produce them.
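
The detection step described above can be illustrated with a minimal
sketch. It is not Datapot's implementation: the ``BoolDetector`` and
``NumberListDetector`` classes, their ``accepts()`` method, and the
``detect_features()`` helper are hypothetical names invented for this
example, and the sketch only looks at top-level fields.

.. code:: python

    from itertools import islice


    class BoolDetector:
        """Hypothetical detector: accepts fields whose values are all booleans."""
        name = "bool_to_int"

        @staticmethod
        def accepts(values):
            return all(isinstance(v, bool) for v in values)


    class NumberListDetector:
        """Hypothetical detector: accepts fields whose values are lists of numbers."""
        name = "numeric_array"

        @staticmethod
        def accepts(values):
            def is_number(x):
                # bool is a subclass of int, so exclude it explicitly.
                return isinstance(x, (int, float)) and not isinstance(x, bool)

            return all(
                isinstance(v, list) and all(is_number(x) for x in v)
                for v in values
            )


    def detect_features(records, detectors, limit):
        """Scan the first `limit` records and map each field to matching detectors."""
        seen = {}
        for record in islice(records, limit):
            for field, value in record.items():
                seen.setdefault(field, []).append(value)

        plan = {}
        for field, values in seen.items():
            matched = [d.name for d in detectors if d.accepts(values)]
            if matched:
                plan[field] = matched
        return plan


    records = [
        {"is_pick": True, "gold_t": [0, 100], "hero": "axe"},
        {"is_pick": False, "gold_t": [0, 90, 210], "hero": "lina"},
        {"is_pick": "not a bool", "gold_t": "bad", "hero": "zeus"},
    ]
    # Only the first two records are inspected because limit=2.
    plan = detect_features(records, [BoolDetector, NumberListDetector], limit=2)

In this sketch ``plan`` maps ``is_pick`` and ``gold_t`` to one detector
each, while ``hero`` is skipped because no detector accepts strings.
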
To apply ``fit()`` to a JSON file:

.. code:: python

    >>> f = open('data/matches_test.jsonlines', 'r')
    >>> data.fit(f, limit=100)
    >>> data
    DataPot class instance
    - number of features without transformation: 806
    - number of new features: 315
    features to transform:
    (u'players.0.gold_t', [ComplexTransformer])
    (u'picks_bans.0.is_pick', [BoolToIntTransformer])
    (u'players.0.kills_log.0.unit', [TfidfTransformer])
    (u'players.1.xp_t', [ComplexTransformer])
    (u'picks_bans.1.is_pick', [BoolToIntTransformer])
    (u'players.1.kills_log.0.unit', [TfidfTransformer])
    ...

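
The field names in the output above, such as
``players.0.kills_log.0.unit``, read like dotted paths into nested JSON.
The following is a minimal sketch of how such paths could be produced.
The ``flatten()`` helper is hypothetical, and treating lists of scalars
as single values (so that ``players.0.gold_t`` stays one field) is an
assumption made to match the names shown, not a confirmed description of
Datapot's behavior.

.. code:: python

    def flatten(obj, prefix=""):
        """Flatten nested dicts and lists into {dotted_path: value}."""
        if isinstance(obj, dict):
            children = obj.items()
        elif isinstance(obj, list) and any(isinstance(v, (dict, list)) for v in obj):
            # Lists that contain nested structures are indexed by position.
            children = enumerate(obj)
        else:
            # Scalars and lists of scalars are kept as a single value.
            return {prefix: obj}

        flat = {}
        for key, value in children:
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, path))
        return flat


    record = {
        "players": [{"gold_t": [0, 120], "kills_log": [{"unit": "npc_a"}]}],
        "picks_bans": [{"is_pick": True}],
    }
    flat = flatten(record)
    # flat has the keys 'players.0.gold_t', 'players.0.kills_log.0.unit'
    # and 'picks_bans.0.is_pick'.
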
The ``transform(self, data, verbose)`` method generates a
``pandas.DataFrame`` with the new features that were detected during the
``fit()`` call. If ``verbose`` is true, progress is printed during
feature extraction.

.. code:: python

    >>> df = data.transform(f, verbose=False)
    fit transformers...OK
    num of new features: 315

Examples
--------
See `more examples <https://github.com/bashalex/datapot/tree/master/notebooks/>`__ of using Datapot with
different datasets and with specific Transformers.
Features
--------
Datapot provides many ways of extracting features from JSON data.

Data types that can be processed:

- Boolean
- Numerical array (the array is transformed to its sum divided by the
  average array length in the training set)
- Time series (calculates descriptive statistical properties of a given
  time series)
- Timestamp (date, time, day of week, day of month, etc.)
- Text (bag of words tf-idf, word2vec)
- Categorical (one-hot encoding, dimension reduction)

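
A few of the transformations listed above can be sketched in plain
Python. This is only an illustration of the ideas, not Datapot's code:
``ArraySumScaler``, ``timestamp_features()``, and ``one_hot()`` are
hypothetical names, and timestamps are assumed to be Unix seconds
interpreted in UTC.

.. code:: python

    from datetime import datetime, timezone


    class ArraySumScaler:
        """Numerical array: sum divided by the average training-set length."""

        def fit(self, arrays):
            lengths = [len(a) for a in arrays]
            self.avg_len = sum(lengths) / len(lengths)
            return self

        def transform(self, array):
            return sum(array) / self.avg_len


    def timestamp_features(ts):
        """Timestamp: split Unix seconds (assumed UTC) into calendar parts."""
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        return {
            "hour": moment.hour,
            "weekday": moment.weekday(),  # Monday is 0
            "day_of_month": moment.day,
            "month": moment.month,
        }


    def one_hot(value, categories):
        """Categorical: one indicator per known category."""
        return [1 if value == c else 0 for c in categories]


    # Training arrays have lengths 3, 2 and 1, so the average length is 2.0.
    scaler = ArraySumScaler().fit([[1, 2, 3], [4, 5], [6]])
    scaled = scaler.transform([10, 20])      # 30 / 2.0
    ts_feats = timestamp_features(0)         # 1970-01-01 00:00 UTC, a Thursday
    encoded = one_hot("b", ["a", "b", "c"])
    as_int = int(True)                       # Boolean to integer
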
Authors
-------
- Alex Bash
- Yuriy Mokriy
- Nikita Savelyev
- Michal Rozenwald
- Peter Romov

Datapot is a coursework project of `the Faculty of Computer
Science <https://cs.hse.ru/en/>`__ of `the Higher School of
Economics <https://www.hse.ru/en/>`__.

.. |Build Status| image:: https://travis-ci.org/bashalex/datapot.svg?branch=master
   :target: https://travis-ci.org/bashalex/datapot