# CUOCO
Cuoco is a tool for automatic data processing, driven by a JSON configuration file.
## Example
JSON configuration example:
```
{
  "input_format": "csv",
  "output_format": "csv",
  "new_fileName": "new file",
  "new_file_route": "path/you/want/to/save/the/file",
  "index": "True",
  "header": "yes",
  "separator": ",",
  "num_nans": "mean",
  "str_nans": "yes",
  "caps": "lower",
  "normalize_method": "min_max",
  "normalize": [
    "Age"
  ],
  "balance_data": "yes",
  "balance_params": {
    "balance_method": "random",
    "y_col": "Age"
  }
}
```
Import the library:
```
import cuoco
from cuoco import dataPipeline
```
Run the pipeline with the path to your dataset and the path to your JSON file:
```
dataPipeline.readJson('/content/biostats.csv', '/content/jsonTESTFILE.json')
```
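If you prefer to build the configuration from Python instead of writing the JSON by hand, the following is a minimal sketch that checks a config against the values listed in the Documentation section and writes it to disk. `ALLOWED`, `check_config` and `write_config` are hypothetical helpers, not part of Cuoco, and treating every listed key as required is an assumption; Cuoco itself may validate differently.

```python
import json
import tempfile
from pathlib import Path

# Allowed values as listed in the Documentation section of this README.
# Assumption: this is only a local pre-check; Cuoco may accept other values.
ALLOWED = {
    "input_format": {"csv", "parquet", "orc", "txt"},
    "output_format": {"csv", "parquet", "orc", "txt"},
    "index": {"True", "False"},
    "header": {"yes", "none"},
    "num_nans": {"drop", "yes", "mean", "median", "mode"},
    "str_nans": {"yes", "no"},
    "caps": {"no", "upper", "lower"},
    "normalize_method": {"no", "max_abs", "min_max", "z_score"},
    "balance_data": {"yes", "no"},
}


def check_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif config[key] not in allowed:
            problems.append(f"{key}={config[key]!r} is not one of {sorted(allowed)}")
    return problems


def write_config(config, path):
    """Check the config, then write it as JSON. Raises ValueError on problems."""
    problems = check_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Same values as the JSON example above.
config = {
    "input_format": "csv",
    "output_format": "csv",
    "new_fileName": "new file",
    "new_file_route": "path/you/want/to/save/the/file",
    "index": "True",
    "header": "yes",
    "separator": ",",
    "num_nans": "mean",
    "str_nans": "yes",
    "caps": "lower",
    "normalize_method": "min_max",
    "normalize": ["Age"],
    "balance_data": "yes",
    "balance_params": {"balance_method": "random", "y_col": "Age"},
}

# Written to the system temp directory here; use any path you like.
config_path = write_config(config, Path(tempfile.gettempdir()) / "cuoco_config.json")
```

With Cuoco installed, `str(config_path)` can then be passed as the second argument of the `dataPipeline.readJson` call shown above.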
## Documentation
How it works:
Cuoco reads a JSON file written by the user and automatically applies the data-processing functions it describes to the target dataset. The JSON accepts the following keys:
- input_format: format of the input dataset. Can be csv, parquet, orc or txt
- output_format: format of the resulting dataset. Can be csv, parquet, orc or txt
- new_fileName: name of the new dataset file to be written
- new_file_route: path where the new data file will be stored
- index: whether the final dataset should include a row index. Can be:
  - True
  - False
- header: whether your dataset has a header. Can be yes or none
- separator: the separator used in your dataset. Only applies to the csv and txt formats.
- num_nans: method for handling numerical NaNs (including empty values). Can be:
  - drop: drop rows that contain NaNs
  - yes: leave rows that contain NaNs unchanged
  - mean: fill NaNs with the mean value of the column
  - median: fill NaNs with the median value of the column
  - mode: fill NaNs with the mode value of the column
- str_nans: method for handling string NaNs (including empty values). Can be:
  - yes: keep columns that contain NaNs
  - no: drop columns that contain NaNs
- caps: method for handling strings that mix uppercase and lowercase letters. Can be:
  - no: leave strings unchanged
  - upper: convert all values in string columns to uppercase
  - lower: convert all values in string columns to lowercase
- normalize_method: method used to normalize numerical columns. Can be:
  - no: do not normalize
  - max_abs: normalize using the maximum absolute value
  - min_max: normalize using the min-max method
  - z_score: normalize using the z-score method
- normalize:
  - list the names of the columns you want to normalize
  - Note: if your dataset does not have a header, give the columns as numbers; if it has a header, give the column names between double quotes ("")
- balance_data: whether to balance your data (recommended for AI datasets). Can be:
  - yes
  - no
- balance_params: contains two items:
  - balance_method: oversampling method to use. Can be:
    - random: random oversampling
    - smote: oversampling with the SMOTE technique
  - y_col: column of the dataset to use as the target for balancing
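
To make the three `normalize_method` options concrete, here is a minimal pure-Python sketch of the standard formulas these names usually refer to. It is not Cuoco's own code: the function names are illustrative, and both the use of the population standard deviation in `z_score` and the handling of constant columns are assumptions.

```python
from statistics import mean, pstdev


def max_abs(values):
    """Divide by the largest absolute value, so results fall in [-1, 1]."""
    scale = max(abs(v) for v in values)
    # Assumption: an all-zero column is returned unchanged.
    return [v / scale for v in values] if scale else list(values)


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # Assumption: a constant column maps to all zeros.
    return [(v - lo) / span for v in values] if span else [0.0 for _ in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Assumption: a constant column maps to all zeros.
    return [(v - mu) / sigma for v in values] if sigma else [0.0 for _ in values]


ages = [20, 30, 40]
print(max_abs(ages))  # [0.5, 0.75, 1.0]
print(min_max(ages))  # [0.0, 0.5, 1.0]
print(z_score(ages))  # symmetric around 0, since 30 is the mean
```

Min-max and max-abs keep values in a bounded range, while z-score centers the column on zero with unit spread, which is often the better fit when columns have very different scales or contain outliers.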