معرفی شرکت ها


awk-dataframe-0.2.9


Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Wrapper around awk to use as a dataframe implementation in python
ویژگی مقدار
سیستم عامل -
نام فایل awk-dataframe-0.2.9
نام awk-dataframe
نسخه کتابخانه 0.2.9
نگهدارنده []
ایمیل نگهدارنده []
نویسنده Carlos Molinero
ایمیل نویسنده -
آدرس صفحه اصلی -
آدرس اینترنتی https://pypi.org/project/awk-dataframe/
مجوز MIT
# awk_dataframe This library is intended to use for fast and low RAM memory consumption of very large .csv files. It works by accumulating a sequence of awk commands that will be executed through pipes in bash when the actual values of the dataframe are to be consumed. ## First things first This is an wrapper around AWK for its use as a dataframe implementation, therefore, it won't work unless you are using a Linux distribution that can run BASH and GAWK. It could work on a macOS but I haven't tested it. I am personally running Ubuntu 22.04. It also compiles two C++ files (to read and save) and needs g++ in the system in order to compile them. It will also not work if your regional settings use the comma as the decimal separator. One way to change this is to set the regional settings to the UK as follows: ``` sudo update-locale LC_NUMERIC="en_GB.UTF-8" ``` and then logout of your system. ## Disk usage Not only this library runs directly from the hard drive trying to avoid RAM limitations which will degrade your hard drive. It also stores temporary files into the ```~/.tmp``` folder. Altough they should be deleted automatically, please check from time to time, since there might be some bug that leaves files behind. Also, although it will be automatically created, make sure that it already exists in your disk, in order to avoid failures at creation time due to permissions. Most commands do not take time, but be advised that ```df.shape()```, ```print(df)```, ```df.head()```, ```df.values()```, ```df.to_pandas()```, ```df.to_npd()``` and ```df.to_csv(path)``` will run the full set of commands and might take time, avoid using them as much as possible. The best implementation would be to generate all conditions for subsetting and then either going to a pandas/numpy_dataframe object if you want to further manipulate in RAM or using ```df.to_csv(path)``` to save the results to the hard drive. ## Authors and acknowledgment Idea and implementation by Carlos Molinero. ## License MIT license. ## Project status Currently this is an early implementation, meaning that it is in a very unstable state, and the syntax might change and bugs may arise. I do not recommend installing it, I am publishing it for my personal use. --- # Usage: In the following lines I paste examples of the main commands, as some sort of documentation until a more complete documentation is available. # INIT ```python import os import sys import numpy as np import numpy_dataframe as npd import pandas as pd import awk_dataframe as ad import importlib import time import random import string ``` # Functions ```python def get_random_string(length): # thanks to https://pynative.com/python-generate-random-string/ # choose from all lowercase letter letters = string.ascii_lowercase result_str = ''.join(random.choice(letters) for i in range(length)) return result_str ``` ## create test csv ```python n = 20 names = np.array([get_random_string(5) for x in range(5)]) t = npd.DataFrame() t.values = np.random.rand(n) t.some_ints = np.floor(np.random.rand(n)*50).astype(int) t.name = names[np.random.choice(range(5),n)] path = os.path.expanduser("~") + "/.tmp/example.csv" t.to_csv(path) ``` # Gawk dataframe ## contruct object / read csv ```python path = os.path.expanduser("~") + "/.tmp/example.csv" df = ad.read_csv(path) ``` or ```python path = os.path.expanduser("~") + "/.tmp/example.csv" df = ad.DataFrame() df.read_csv(path) ``` ## select rows/columns ```python df[0:10,:] ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr ```python df.some_ints ``` some_ints 21 32 24 44 2 31 44 49 44 36 9 11 12 31 7 40 26 38 35 26 ```python df["some_ints"] ``` some_ints 21 32 24 44 2 31 44 49 44 36 9 11 12 31 7 40 26 38 35 26 ```python df[:,"some_ints"] ``` some_ints 21 32 24 44 2 31 44 49 44 36 9 11 12 31 7 40 26 38 35 26 ```python df[:,["some_ints","name"]] ``` some_ints,name 21,mlyas 32,efcsm 24,tcbtr 44,efcsm 2,bgmqd 31,bgmqd 44,mlyas 49,efcsm 44,tcbtr 36,tcbtr 9,mlyas 11,lmqgd 12,bgmqd 31,mlyas 7,efcsm 40,tcbtr 26,mlyas 38,efcsm 35,mlyas 26,mlyas ```python df[:,1:2] ``` some_ints 21 32 24 44 2 31 44 49 44 36 9 11 12 31 7 40 26 38 35 26 ```python df[:,1:] ``` some_ints,name 21,mlyas 32,efcsm 24,tcbtr 44,efcsm 2,bgmqd 31,bgmqd 44,mlyas 49,efcsm 44,tcbtr 36,tcbtr 9,mlyas 11,lmqgd 12,bgmqd 31,mlyas 7,efcsm 40,tcbtr 26,mlyas 38,efcsm 35,mlyas 26,mlyas ```python df[:10,:] ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr ```python df[3,2] ``` name efcsm ```python df[3,:] ``` values,some_ints,name 0.7678670658611503,44,efcsm ```python df[:,1] ``` some_ints 21 32 24 44 2 31 44 49 44 36 9 11 12 31 7 40 26 38 35 26 ```python df[[0,3],1] ``` some_ints 21 44 ```python df[:,[0,1]] ``` values,some_ints 0.8633713864104114,21 0.7797461486214806,32 0.9415640935512127,24 0.7678670658611503,44 0.8223095493965592,2 0.06832888693954564,31 0.717887788740719,44 0.6559538612897098,49 0.08927654448880817,44 0.7338088959472877,36 0.701592457357365,9 0.04338298448626843,11 0.3297427254572507,12 0.1011755876260031,31 0.8235585667750851,7 0.5829664850991141,40 0.14016453615592928,26 0.43436632265162967,38 0.4097725083991255,35 0.6379834519420476,26 ```python df[range(5),:] ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd ```python df.get_rows(range(10)) ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr ```python df.get_cols(range(2)) ``` values,some_ints 0.8633713864104114,21 0.7797461486214806,32 0.9415640935512127,24 0.7678670658611503,44 0.8223095493965592,2 0.06832888693954564,31 0.717887788740719,44 0.6559538612897098,49 0.08927654448880817,44 0.7338088959472877,36 0.701592457357365,9 0.04338298448626843,11 0.3297427254572507,12 0.1011755876260031,31 0.8235585667750851,7 0.5829664850991141,40 0.14016453615592928,26 0.43436632265162967,38 0.4097725083991255,35 0.6379834519420476,26 ## names, shape, head ```python df.names() ``` array(['values', 'some_ints', 'name'], dtype='<U9') ```python df.shape() ``` array([20, 3]) ```python df.head() ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr 0.701592457357365,9,mlyas 0.04338298448626843,11,lmqgd ```python df.head(5) ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas ## where ```python df.where("values < 0.8 && values > .6") ``` gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name 0.7797461486,32,efcsm 0.7678670659,44,efcsm 0.7178877887,44,mlyas 0.6559538613,49,efcsm 0.7338088959,36,tcbtr 0.7015924574,9,mlyas 0.6379834519,26,mlyas ```python df.where(""" name == "mlyas" """) ``` gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name 0.8633713864104114,21,mlyas 0.717887788740719,44,mlyas 0.701592457357365,9,mlyas 0.1011755876260031,31,mlyas 0.14016453615592928,26,mlyas 0.4097725083991255,35,mlyas 0.6379834519420476,26,mlyas ```python df.where(""" name in ["efcsm","tcbtr"] """) ``` gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr 0.8235585667750851,7,efcsm 0.5829664850991141,40,tcbtr 0.43436632265162967,38,efcsm ## unique ```python df.unique("name") ``` name lmqgd efcsm mlyas bgmqd tcbtr ## add column ```python df_with_new_column = df.add_column("new_column") df_with_new_column ``` values,some_ints,name,new_column 0.8633713864104114,21,mlyas, 0.7797461486214806,32,efcsm, 0.9415640935512127,24,tcbtr, 0.7678670658611503,44,efcsm, 0.8223095493965592,2,bgmqd, 0.06832888693954564,31,bgmqd, 0.717887788740719,44,mlyas, 0.6559538612897098,49,efcsm, 0.08927654448880817,44,tcbtr, 0.7338088959472877,36,tcbtr, 0.701592457357365,9,mlyas, 0.04338298448626843,11,lmqgd, 0.3297427254572507,12,bgmqd, 0.1011755876260031,31,mlyas, 0.8235585667750851,7,efcsm, 0.5829664850991141,40,tcbtr, 0.14016453615592928,26,mlyas, 0.43436632265162967,38,efcsm, 0.4097725083991255,35,mlyas, 0.6379834519420476,26,mlyas, # modify ```python df.modify("values = values + some_ints * 3") ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name 63.86337139,21,mlyas 96.77974615,32,efcsm 72.94156409,24,tcbtr 132.7678671,44,efcsm 6.822309549,2,bgmqd 93.06832889,31,bgmqd 132.7178878,44,mlyas 147.6559539,49,efcsm 132.0892765,44,tcbtr 108.7338089,36,tcbtr 27.70159246,9,mlyas 33.04338298,11,lmqgd 36.32974273,12,bgmqd 93.10117559,31,mlyas 21.82355857,7,efcsm 120.5829665,40,tcbtr 78.14016454,26,mlyas 114.4343663,38,efcsm 105.4097725,35,mlyas 78.63798345,26,mlyas The previous operation does not modify the values in df ```python df ``` values,some_ints,name 0.8633713864104114,21,mlyas 0.7797461486214806,32,efcsm 0.9415640935512127,24,tcbtr 0.7678670658611503,44,efcsm 0.8223095493965592,2,bgmqd 0.06832888693954564,31,bgmqd 0.717887788740719,44,mlyas 0.6559538612897098,49,efcsm 0.08927654448880817,44,tcbtr 0.7338088959472877,36,tcbtr 0.701592457357365,9,mlyas 0.04338298448626843,11,lmqgd 0.3297427254572507,12,bgmqd 0.1011755876260031,31,mlyas 0.8235585667750851,7,efcsm 0.5829664850991141,40,tcbtr 0.14016453615592928,26,mlyas 0.43436632265162967,38,efcsm 0.4097725083991255,35,mlyas 0.6379834519420476,26,mlyas You need to assign it in order for the modifications to be "saved" ```python df_mod = df.modify("values = values + some_ints * 3") df_mod ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name 63.86337139,21,mlyas 96.77974615,32,efcsm 72.94156409,24,tcbtr 132.7678671,44,efcsm 6.822309549,2,bgmqd 93.06832889,31,bgmqd 132.7178878,44,mlyas 147.6559539,49,efcsm 132.0892765,44,tcbtr 108.7338089,36,tcbtr 27.70159246,9,mlyas 33.04338298,11,lmqgd 36.32974273,12,bgmqd 93.10117559,31,mlyas 21.82355857,7,efcsm 120.5829665,40,tcbtr 78.14016454,26,mlyas 114.4343663,38,efcsm 105.4097725,35,mlyas 78.63798345,26,mlyas Modify accepts conditional equations, and its results are 1 for true and 0 for false, which allows to construct functions such as sign() or abs(), etc as an example: ```python df_mod = df_with_new_column.modify("new_column = values - .5") df_mod ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name,new_column 0.8633713864104114,21,mlyas,0.3633713864 0.7797461486214806,32,efcsm,0.2797461486 0.9415640935512127,24,tcbtr,0.4415640936 0.7678670658611503,44,efcsm,0.2678670659 0.8223095493965592,2,bgmqd,0.3223095494 0.06832888693954564,31,bgmqd,-0.4316711131 0.717887788740719,44,mlyas,0.2178877887 0.6559538612897098,49,efcsm,0.1559538613 0.08927654448880817,44,tcbtr,-0.4107234555 0.7338088959472877,36,tcbtr,0.2338088959 0.701592457357365,9,mlyas,0.2015924574 0.04338298448626843,11,lmqgd,-0.4566170155 0.3297427254572507,12,bgmqd,-0.1702572745 0.1011755876260031,31,mlyas,-0.3988244124 0.8235585667750851,7,efcsm,0.3235585668 0.5829664850991141,40,tcbtr,0.0829664851 0.14016453615592928,26,mlyas,-0.3598354638 0.43436632265162967,38,efcsm,-0.06563367735 0.4097725083991255,35,mlyas,-0.0902274916 0.6379834519420476,26,mlyas,0.1379834519 now, let's get the absolute value of new column and put it in a new column ```python df_mod_2 = df_mod.add_column("abs_new_column").modify("abs_new_column = new_column * ((new_column >= 0) * 2 - 1)") df_mod_2 ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name,new_column,abs_new_column 0.8633713864104114,21,mlyas,0.3633713864,0.3633713864 0.7797461486214806,32,efcsm,0.2797461486,0.2797461486 0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936 0.7678670658611503,44,efcsm,0.2678670659,0.2678670659 0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494 0.06832888693954564,31,bgmqd,-0.4316711131,0.4316711131 0.717887788740719,44,mlyas,0.2178877887,0.2178877887 0.6559538612897098,49,efcsm,0.1559538613,0.1559538613 0.08927654448880817,44,tcbtr,-0.4107234555,0.4107234555 0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959 0.701592457357365,9,mlyas,0.2015924574,0.2015924574 0.04338298448626843,11,lmqgd,-0.4566170155,0.4566170155 0.3297427254572507,12,bgmqd,-0.1702572745,0.1702572745 0.1011755876260031,31,mlyas,-0.3988244124,0.3988244124 0.8235585667750851,7,efcsm,0.3235585668,0.3235585668 0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851 0.14016453615592928,26,mlyas,-0.3598354638,0.3598354638 0.43436632265162967,38,efcsm,-0.06563367735,0.06563367735 0.4097725083991255,35,mlyas,-0.0902274916,0.0902274916 0.6379834519420476,26,mlyas,0.1379834519,0.1379834519 Modify accepts conditions, there are two ways: ```python df_mod_2.modify("values = 0",condition = "new_column < 0") ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name,new_column,abs_new_column 0.8633713864104114,21,mlyas,0.3633713864,0.3633713864 0.7797461486214806,32,efcsm,0.2797461486,0.2797461486 0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936 0.7678670658611503,44,efcsm,0.2678670659,0.2678670659 0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494 0,31,bgmqd,-0.4316711131,0.4316711131 0.717887788740719,44,mlyas,0.2178877887,0.2178877887 0.6559538612897098,49,efcsm,0.1559538613,0.1559538613 0,44,tcbtr,-0.4107234555,0.4107234555 0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959 0.701592457357365,9,mlyas,0.2015924574,0.2015924574 0,11,lmqgd,-0.4566170155,0.4566170155 0,12,bgmqd,-0.1702572745,0.1702572745 0,31,mlyas,-0.3988244124,0.3988244124 0.8235585667750851,7,efcsm,0.3235585668,0.3235585668 0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851 0,26,mlyas,-0.3598354638,0.3598354638 0,38,efcsm,-0.06563367735,0.06563367735 0,35,mlyas,-0.0902274916,0.0902274916 0.6379834519420476,26,mlyas,0.1379834519,0.1379834519 ```python conditional_equation = ad.Conditional_equation() conditional_equation.condition = "new_column < 0" conditional_equation.equation = "values = 0" df_mod_2.modify(conditional_equation) ``` gawk: warning: escape sequence `\/' treated as plain `/' gawk: warning: escape sequence `\/' treated as plain `/' values,some_ints,name,new_column,abs_new_column 0.8633713864104114,21,mlyas,0.3633713864,0.3633713864 0.7797461486214806,32,efcsm,0.2797461486,0.2797461486 0.9415640935512127,24,tcbtr,0.4415640936,0.4415640936 0.7678670658611503,44,efcsm,0.2678670659,0.2678670659 0.8223095493965592,2,bgmqd,0.3223095494,0.3223095494 0,31,bgmqd,-0.4316711131,0.4316711131 0.717887788740719,44,mlyas,0.2178877887,0.2178877887 0.6559538612897098,49,efcsm,0.1559538613,0.1559538613 0,44,tcbtr,-0.4107234555,0.4107234555 0.7338088959472877,36,tcbtr,0.2338088959,0.2338088959 0.701592457357365,9,mlyas,0.2015924574,0.2015924574 0,11,lmqgd,-0.4566170155,0.4566170155 0,12,bgmqd,-0.1702572745,0.1702572745 0,31,mlyas,-0.3988244124,0.3988244124 0.8235585667750851,7,efcsm,0.3235585668,0.3235585668 0.5829664850991141,40,tcbtr,0.0829664851,0.0829664851 0,26,mlyas,-0.3598354638,0.3598354638 0,38,efcsm,-0.06563367735,0.06563367735 0,35,mlyas,-0.0902274916,0.0902274916 0.6379834519420476,26,mlyas,0.1379834519,0.1379834519 A future update of the library will allow to include several conditional equations in the same modify statement # add_index ```python df_mod = df.add_index() df_mod ``` index,values,some_ints,name 0,0.8633713864104114,21,mlyas 1,0.7797461486214806,32,efcsm 2,0.9415640935512127,24,tcbtr 3,0.7678670658611503,44,efcsm 4,0.8223095493965592,2,bgmqd 5,0.06832888693954564,31,bgmqd 6,0.717887788740719,44,mlyas 7,0.6559538612897098,49,efcsm 8,0.08927654448880817,44,tcbtr 9,0.7338088959472877,36,tcbtr 10,0.701592457357365,9,mlyas 11,0.04338298448626843,11,lmqgd 12,0.3297427254572507,12,bgmqd 13,0.1011755876260031,31,mlyas 14,0.8235585667750851,7,efcsm 15,0.5829664850991141,40,tcbtr 16,0.14016453615592928,26,mlyas 17,0.43436632265162967,38,efcsm 18,0.4097725083991255,35,mlyas 19,0.6379834519420476,26,mlyas ## sort_by ```python df_mod.sort_by("values") ``` index,values,some_ints,name 11,0.04338298448626843,11,lmqgd 5,0.06832888693954564,31,bgmqd 8,0.08927654448880817,44,tcbtr 13,0.1011755876260031,31,mlyas 16,0.14016453615592928,26,mlyas 12,0.3297427254572507,12,bgmqd 18,0.4097725083991255,35,mlyas 17,0.43436632265162967,38,efcsm 15,0.5829664850991141,40,tcbtr 19,0.6379834519420476,26,mlyas 7,0.6559538612897098,49,efcsm 10,0.701592457357365,9,mlyas 6,0.717887788740719,44,mlyas 9,0.7338088959472877,36,tcbtr 3,0.7678670658611503,44,efcsm 1,0.7797461486214806,32,efcsm 4,0.8223095493965592,2,bgmqd 14,0.8235585667750851,7,efcsm 0,0.8633713864104114,21,mlyas 2,0.9415640935512127,24,tcbtr ## to_pandas ```python pdf = df_mod.to_pandas() pdf ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>index</th> <th>values</th> <th>some_ints</th> <th>name</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0</td> <td>0.863371</td> <td>21</td> <td>mlyas</td> </tr> <tr> <th>1</th> <td>1</td> <td>0.779746</td> <td>32</td> <td>efcsm</td> </tr> <tr> <th>2</th> <td>2</td> <td>0.941564</td> <td>24</td> <td>tcbtr</td> </tr> <tr> <th>3</th> <td>3</td> <td>0.767867</td> <td>44</td> <td>efcsm</td> </tr> <tr> <th>4</th> <td>4</td> <td>0.822310</td> <td>2</td> <td>bgmqd</td> </tr> <tr> <th>5</th> <td>5</td> <td>0.068329</td> <td>31</td> <td>bgmqd</td> </tr> <tr> <th>6</th> <td>6</td> <td>0.717888</td> <td>44</td> <td>mlyas</td> </tr> <tr> <th>7</th> <td>7</td> <td>0.655954</td> <td>49</td> <td>efcsm</td> </tr> <tr> <th>8</th> <td>8</td> <td>0.089277</td> <td>44</td> <td>tcbtr</td> </tr> <tr> <th>9</th> <td>9</td> <td>0.733809</td> <td>36</td> <td>tcbtr</td> </tr> <tr> <th>10</th> <td>10</td> <td>0.701592</td> <td>9</td> <td>mlyas</td> </tr> <tr> <th>11</th> <td>11</td> <td>0.043383</td> <td>11</td> <td>lmqgd</td> </tr> <tr> <th>12</th> <td>12</td> <td>0.329743</td> <td>12</td> <td>bgmqd</td> </tr> <tr> <th>13</th> <td>13</td> <td>0.101176</td> <td>31</td> <td>mlyas</td> </tr> <tr> <th>14</th> <td>14</td> <td>0.823559</td> <td>7</td> <td>efcsm</td> </tr> <tr> <th>15</th> <td>15</td> <td>0.582966</td> <td>40</td> <td>tcbtr</td> </tr> <tr> <th>16</th> <td>16</td> <td>0.140165</td> <td>26</td> <td>mlyas</td> </tr> <tr> <th>17</th> <td>17</td> <td>0.434366</td> <td>38</td> <td>efcsm</td> </tr> <tr> <th>18</th> <td>18</td> <td>0.409773</td> <td>35</td> <td>mlyas</td> </tr> <tr> <th>19</th> <td>19</td> <td>0.637983</td> <td>26</td> <td>mlyas</td> </tr> </tbody> </table> </div> ## to_numpy_dataframe ```python ndf = df_mod.to_npd() print(ndf) print(ndf.shape) ``` index values some_ints name 0 0 0.863371 21 mlyas 1 1 0.779746 32 efcsm 2 2 0.941564 24 tcbtr 3 3 0.767867 44 efcsm 4 4 0.822310 2 bgmqd 5 5 0.068329 31 bgmqd 6 6 0.717888 44 mlyas 7 7 0.655954 49 efcsm 8 8 0.089277 44 tcbtr 9 9 0.733809 36 tcbtr 10 10 0.701592 9 mlyas 11 11 0.043383 11 lmqgd 12 12 0.329743 12 bgmqd 13 13 0.101176 31 mlyas 14 14 0.823559 7 efcsm 15 15 0.582966 40 tcbtr 16 16 0.140165 26 mlyas 17 17 0.434366 38 efcsm 18 18 0.409773 35 mlyas 19 19 0.637983 26 mlyas (20, 4) ## to_csv ```python path_output = os.path.expanduser("~") + "/.tmp/example_mod.csv" df_mod.to_csv(path_output) ``` # For future development ``` add_rows(n) keep_na = False rename ``` # Internal gawk command example (where command): ``` function is_digit_field(field){ if (field ~ /^[\\+-]?[0-9]+$|^[\\+-]*[0-9]*[\\.]?[0-9]+$|^[\\+-]*[0-9]*[\\.]?[0-9]+[eE]?[\\+-]?[0-9]+$|#N\/A|#N\/A|N\/A|#NA|-1.#IND|-1.#QNAN|-NaN|-nan|1.#IND|1.#QNAN|<NA>|<na>|NA|NULL|NaN|n\/a|nan|null||Nan|-Nan/){ return 1 }else{ return 0 } } function needs_quotes(field){ #postgres_standard: delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, t regex = "(" FS ")+|(\\n)+|(" input_quote ")+|\r+" if (field ~ regex){ return 1 }else{ return 0 } } function transform_field(field,column){ if (is_first_command && !is_last_command && original_level_simplicity_csv != 2){ gsub(/\n/,delimiter_internal_notation "n" delimiter_internal_notation,field) gsub(FS,delimiter_internal_notation "d" delimiter_internal_notation,field) gsub(input_quote, delimiter_internal_notation "q" delimiter_internal_notation, field) gsub("\r", delimiter_internal_notation "r" delimiter_internal_notation, field) } if (is_last_command && !is_first_command && original_level_simplicity_csv != 2){ gsub(delimiter_internal_notation "n" delimiter_internal_notation,"\n",field) gsub(delimiter_internal_notation "d" delimiter_internal_notation,FS,field) gsub(delimiter_internal_notation "q" delimiter_internal_notation,input_quote, field) gsub(delimiter_internal_notation "r" delimiter_internal_notation,"\r", field) } if (is_digit_field(field) || (level_simplicity_csv == 2 && (!is_last_command || original_level_simplicity_csv == 2))){ if (!is_digit_field()){ field = transform_string(field,column) if (input_quote != output_quote){ gsub(input_quote,output_quote,field) } }else{ return transform_digit(field,column) } return field }else{ field = transform_string(field,column) if (quotes_have_been_removed){ if (full_string_quoting){ return output_quote field output_quote }else{ if (needs_quotes(field) && minimal_string_quoting || field == delimiter_internal_notation "NULL" delimiter_internal_notation){ if (field == delimiter_internal_notation "NULL" delimiter_internal_notation){ gsub(delimiter_internal_notation "NULL" delimiter_internal_notation,"NULL",field) } return output_quote field output_quote }else{ return field } } }else{ return field } } } function print_field(field,column){ field = transform_field(field,column) if (FNR == NR && find_unique){ unique[field] = 1 }else{ if (print_fields){ output_separator = OFS if (i == max_col() && !add_new_column){ output_separator = ORS } if (i == min_col()){ if (add_index){ if (!save){ printf("%s" output_separator,print_line_number() + index_based_on) }else{ printf("%s" output_separator,print_line_number() + index_based_on) > path_output } } } if (!save){ printf "%s" output_separator,field }else{ printf "%s" output_separator,field > path_output } } } } function print_line(line){ line = transform_line(line) if (print_fields){ if (add_index){ if (!save){ printf("%s" OFS,print_line_number() + index_based_on) }else{ printf("%s" OFS,print_line_number() + index_based_on) > path_output } } if (input_quote != output_quote){ gsub(input_quote,output_quote,line) } if (!save){ if (sort){ sort_command = "sort -t'" OFS "' --parallel " num_cores " -nk" sort_column printf "%s" ORS,line | sort_command }else{ printf "%s" ORS,line } }else{ printf "%s" ORS,line > path_output } } } function print_line_number(){ if (output_header){ return FNR-records_skipped - 2 }else{ return FNR-records_skipped - 1 } } function line_number(){ if (header){ return FNR-records_skipped - 1 }else{ return FNR-records_skipped } } function print_header(){ if (add_index){ field = transform_field("index",1) if (!save){ printf("%s" OFS,field) }else{ printf("%s" OFS,field) > path_output } } if (fields_based_code){ for (i=min_col();i<=max_col();i++) { if ((number_files_input == 1 || input_file_rows) || (input_file_cols && v[i])){ if (is_first_command && !is_last_command && level_simplicity_csv != 2){ gsub(/\n/,delimiter_internal_notation "n" delimiter_internal_notation,$i) gsub(FS,delimiter_internal_notation "d" delimiter_internal_notation,$i) gsub(input_quote, delimiter_internal_notation "q" delimiter_internal_notation, $i) gsub("\r", delimiter_internal_notation "r" delimiter_internal_notation, $i) } if (is_last_command && !is_first_command && level_simplicity_csv != 2){ gsub(delimiter_internal_notation "n" delimiter_internal_notation,"\n",$i) gsub(delimiter_internal_notation "d" delimiter_internal_notation,FS,$i) gsub(delimiter_internal_notation "q" delimiter_internal_notation,input_quote, fldStr) gsub(delimiter_internal_notation "r" delimiter_internal_notation,"\r", $i) } output_separator = OFS if (i == max_col() && !add_new_column){ output_separator = ORS } if (!save){ printf "%s" output_separator,transform_header($i,i) }else{ printf "%s" output_separator,transform_header($i,i) > path_output } } } add_column() }else{ gsub(input_quote,output_quote,$0) if (!save){ printf "%s" ORS,$0 }else{ printf "%s" ORS,$0 > path_output } add_column() } } function min_col(){ if (1 > col_start){ return 1 }else{ return col_start } } function max_col(){ if (col_end == INF){ return NF }else{ if (NF < col_end){ return NF }else{ return col_end } } } function add_column(){ if (add_new_column){ if (line_number() == 0){ printf "%s" ORS,name_new_column }else{ printf "%s" ORS,"" } } } function transform_header(field,column){ return field } ################################# ################################# function transform_column(field,column){ return field } function transform_digit(field,column){ field = transform_column(field,column) return field } function transform_string(field,column){ field = transform_column(field,column) return field } function transform_line(line){ return line } function condition(){ statement = 1 statement = ($3 ~ /^"?crlsb"?$/) return statement } BEGIN { PREC=100 OFMT="%.10g" CONVFMT="%.10g" is_first_command = 0 is_last_command = 0 delimiter_internal_notation = ":" if (FS == delimiter_internal_notation){ delimiter_internal_notation = "#" } input_quote = "\"" output_quote = "\"" level_simplicity_csv = 2 FS ="," OFS="," original_level_simplicity_csv = 0 if (level_simplicity_csv <= 1){ FPAT="([^" FS "]*)|(" input_quote "([^" input_quote "]|" input_quote input_quote ")+" input_quote ")" } records_skipped = 0 minimal_string_quoting = 1 full_string_quoting = !minimal_string_quoting is_unix_input = 1 is_unix_output = 1 if (is_unix_input){ RS = "\n" }else{ RS = "\r\n" } if (is_unix_output){ ORS = "\n" }else{ ORS = "\r\n" } null_input_string = "NULL" null_output_string = "" header = 1 output_header = 1 if (!header){ output_header = 0 } add_index = 0 index_based_on = 0 line_based_code = 0 fields_based_code = !line_based_code col_start = 1 col_end = INF row_start = 1 row_end = INF number_files_input = 1 input_file_rows = 1 input_file_cols = !input_file_rows find_unique = 0 print_fields = 1 save = 0 path_output = "" sort = 0 sort_column = 1 num_cores = 4 sort_in_parallel = 1 calculate_shape = 0 check_complexity = 0 add_new_column = 0 name_new_column = "" if (check_complexity){ has_maximum_complexity = 0 print_fields = 0 } quotes_have_been_removed = 0 number_fields = 0 continue_next_line = 0 } { if (!calculate_shape){ if (check_complexity || number_files_input == 1){ if (check_complexity){ if (line_number() == 0){ number_fields = max_col() has_maximum_complexity = 0 }else{ if (max_col() < number_fields){ has_maximum_complexity = 1 exit } } }else{ if (line_number() == 0 && header){ number_fields = NF if (output_header){ print_header() } }else{ if (fields_based_code){ if (line_number() >= row_start){ for (i=min_col();i<=max_col();i++) { if (condition()){ print_field($i,i) } } i=min_col() if (min_col() == max_col() && $i == ""){ print(":n:") } add_column() } }else{ if (line_number() >= row_start){ if (condition()){ print_line($0) } } } if (row_end != INF && line_number() > row_end){ exit } } } }else{ if (FNR == NR){ v[$0] = 1 }else{ if (FNR!=NR && FNR == 1){ records_skipped = 0 } if (line_number() == 0 && header){ number_fields = max_col() if (output_header){ print_header() } }else{ if (fields_based_code){ if (line_number() >= row_start){ for (i=min_col();i<=max_col();i++) { if (condition()){ field = $i if (input_file_rows && v[line_number()]){ print_field(field,i) }else{ if (input_file_cols && v[i]){ print_field(field,i) } } } } i=min_col() if (min_col() == max_col() && $i == ""){ print(":n:") } add_column() } }else{ if (line_number() >= row_start){ if (condition()){ if (input_file_rows && v[line_number()]){ print_line($0) } } } } if (row_end != INF && line_number() > row_end){ exit } } } } } } END{ if (find_unique){ for (key in unique){ print(key) } } if (calculate_shape){ print print_line_number() print NF } if (check_complexity){ print has_maximum_complexity } } ```


نحوه نصب


نصب پکیج whl awk-dataframe-0.2.9:

    pip install awk-dataframe-0.2.9.whl


نصب پکیج tar.gz awk-dataframe-0.2.9:

    pip install awk-dataframe-0.2.9.tar.gz