معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Get repeated capture groups, search without having to fear Exceptions in any df/Series, convert results to appropriate dtypes, use fast Trie regex

ویژگی	مقدار
سیستم عامل	-
نام فایل	a-pandas-ex-regex-enhancements-0.11
نام	a-pandas-ex-regex-enhancements
نسخه کتابخانه	0.11
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	Johannes Fischer
ایمیل نویسنده	<aulasparticularesdealemaosp@gmail.com>
آدرس صفحه اصلی	https://github.com/hansalemaos/a_pandas_ex_regex_enhancements
آدرس اینترنتی	https://pypi.org/project/a-pandas-ex-regex-enhancements/
مجوز	MIT

# Regex enhancements for Pandas DataFrame / Series ###Installation ```python pip install a-pandas-ex-regex-enhancements ``` ###Usage ```python from a_pandas_ex_regex_enhancements import pd_add_regex_enhancements pd_add_regex_enhancements() import pandas as pd ``` ##### The code above will add some methods to! You can use pandas like you did before, but you will have a couple of methods more: - pandas.DataFrame.ds_trie_regex_search / pandas.Series.ds_trie_regex_search - pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub - pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all - pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all - pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special - pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search - pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub ###How to use the new methods ##### pandas.DataFrame.ds_regex_find_all_special / pandas.Series.ds_regex_find_all_special* ```python #Using this method, you can get each match from REPEATED CAPTURE GROUPS! (A very rare feature in regex engines) #Besides that, you will see the exact position of each group/match. df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") special=df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False) aa_start_0 ... aa_match_6 aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group ... 7 Name Master 9 15 0 9 ... NaN 1 9 ... NaN 2 11 ... NaN 3 14 ... NaN 10 Name Marguerite 17 27 0 17 ... NaN ... ... ... 885 Name Margaret 20 28 3 27 ... NaN 887 Name Margaret 14 22 0 14 ... NaN 1 14 ... NaN 2 16 ... NaN 3 21 ... NaN #If you use any common regex engine, you can't get the repeated capture groups, since every new result overwrites the old one: import re re.findall('(Ma(\w)+)', 'Margaret') Out[11]: [('Margaret', 't')] #Using this method you will get all repeated capture groups, they won't be overwritten! #Results for index 887 aa_start_0 aa_start_1 aa_start_2 aa_start_3 aa_start_4 aa_start_5 aa_start_6 aa_stop_0 aa_stop_1 aa_stop_2 aa_stop_3 aa_stop_4 aa_stop_5 aa_stop_6 aa_match_0 aa_match_1 aa_match_2 aa_match_3 aa_match_4 aa_match_5 aa_match_6 aa_column aa_whole_match aa_whole_start aa_whole_end aa_group Name Margaret 14 22 0 14 <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA> <NA> <NA> <NA> <NA> <NA> Margaret <NA> <NA> <NA> <NA> <NA> <NA> 1 14 <NA> <NA> <NA> <NA> <NA> <NA> 21 <NA> <NA> <NA> <NA> <NA> <NA> Margare <NA> <NA> <NA> <NA> <NA> <NA> 2 16 17 18 19 20 <NA> <NA> 17 18 19 20 21 <NA> <NA> r g a r e <NA> <NA> 3 21 <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA> <NA> <NA> <NA> <NA> <NA> t <NA> <NA> <NA> <NA> <NA> <NA> If you want to convert the results to the best available dtype, use: df.ds_regex_find_all_special(r'\b(Ma(\w)+)(\w+)\b', dtype_string=False) Out[3]: aa_start_0 ... aa_match_6 #aa_index aa_column aa_whole_match aa_whole_start aa_whole_end aa_group ... 7 Name Master 9 15 0 9 ... <NA> 1 9 ... <NA> 2 11 ... <NA> 3 14 ... <NA> 10 Name Marguerite 17 27 0 17 ... <NA> ... ... ... 885 Name Margaret 20 28 3 27 ... <NA> 887 Name Margaret 14 22 0 14 ... <NA> 1 14 ... <NA> 2 16 ... <NA> 3 21 ... <NA> [764 rows x 21 columns] aa_start_0 uint8 aa_start_1 Int64 aa_start_2 Int64 aa_start_3 Int64 aa_start_4 Int64 aa_start_5 Int64 aa_start_6 Int64 aa_stop_0 uint8 aa_stop_1 Int64 aa_stop_2 Int64 aa_stop_3 Int64 aa_stop_4 Int64 aa_stop_5 Int64 aa_stop_6 Int64 aa_match_0 category aa_match_1 category aa_match_2 category aa_match_3 category aa_match_4 category aa_match_5 category aa_match_6 category Parameters: df: Union[pd.DataFrame, pd.Series] regular_expression: str Syntax from https://pypi.org/project/regex/ flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_regex_find_all / pandas.Series.ds_regex_find_all ```python #Use regex.findall against a DataFrame/Series without having to fear any exception! You can get #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever #fits best! #Some examples df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") df.Name.ds_regex_find_all(regular_expression=r'(\bM\w+\b)\s+(\bW\w+\b)') result_0 result_1 426 Name Maria Winfield 472 Name Mary Worth 862 Name Margaret Welles multilinetest=df.Name.map(lambda x: f'{x}\n' * 3) #Every name 3x in each cell multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False) Out[3]: result_0 result_1 58 Name Mirium West 426 Name Maria Winfield 472 Name Mary Worth 862 Name Margaret Welles multilinetest.ds_regex_find_all(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True) Out[7]: result_0 result_1 426 Name Maria Winfield Name Maria Winfield Name Maria Winfield 472 Name Mary Worth Name Mary Worth Name Mary Worth 862 Name Margaret Welles Name Margaret Welles Name Margaret Welles #By using line_by_line=True you can be sure that the regex engine will check every single line! Parameters: df: Union[pd.DataFrame, pd.Series] regular_expression: str Syntax from https://pypi.org/project/regex/ flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_trie_regex_find_all / pandas.Series.ds_trie_regex_find_all ```python #If you have a huge list of words you want to search/sub/find_all on this list, you can try to use the Trie regex methods to get the job done faster #It is worth trying if: #1) your DataFrame/Series has a lot of text in each cell #2) you want to search for a lot of words in each cell # #The more words you have, and the more text is in each cell, the faster it gets. #If you want to know more about, I recommend: https://stackoverflow.com/a/42789508/15096247 Example: df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") allstrings=pd.DataFrame([[df.Name.to_string() *2] *2,[df.Name.to_string() *2] *2]) #lets create a little dataframe with a lot of text in each cell hugeregexlist=df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list() #lets get all names (first word) in the titanic DataFrame #it should look like that: ['Braund', 'Cumings', 'Heikkinen', 'Futrelle', 'Allen', 'Moran', 'McCarthy', 'Palsson', 'Johnson', 'Nasser' ... ] %timeit allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b') 776 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) allstrings.ds_trie_regex_find_all(hugeregexlist, add_left_to_regex=r'\b', add_right_to_regex=r'\b') Out[6]: result_0 result_1 result_2 ... result_2133 result_2134 result_2135 0 0 Braund Harris Cumings ... Johnston Behr Dooley 1 Braund Harris Cumings ... Johnston Behr Dooley 1 0 Braund Harris Cumings ... Johnston Behr Dooley 1 Braund Harris Cumings ... Johnston Behr Dooley Let's compare with a regular regex search hugeregex=r"\b(?:" + "|".join([f'(?:{y})' for y in df.Name.str.extract(r'^\s*(\w+)').drop_duplicates()[0].to_list()]) + r")\b" #let's create a regex from all names #it should look like this: '\\b(?:(?:Braund)|(?:Cumings)|(?:Heikkinen)|(?:Futrelle)|(?:Allen)|(?:Moran)|(?:McCarthy)|(?:Palsson)|(?:Johnson)|(?:Na... %timeit allstrings.ds_regex_find_all(hugeregex) 945 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) #Not bad, right? But it still can a lot better! Try it yourself! #Another good thing is that you can search in every cell, no matter what dtype it is. #There won't be thrown any exception, because everything is converted to string before performing any action. #If you pass "dtype_string=False", each column will be converted to the best available dtype after the actions have been completed Parameters: df: Union[pd.DataFrame, pd.Series] wordlist: list[str] All strings you are looking for add_left_to_regex: str if you want to add something before the generated Trie regex -> \b for example allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b') (default = "") add_right_to_regex: str if you want to add something after the generated Trie regex -> \b for example allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b') (default = "") flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub ```python #Use regex.sub against a DataFrame/Series without having to fear any exception! You can get #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever #fits best! # #Some examples: df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") PassengerId Survived Pclass ... Fare Cabin Embarked 0 1 0 3 ... 7.2500 NaN S 1 2 1 1 ... 71.2833 C85 C 2 3 1 3 ... 7.9250 NaN S 3 4 1 1 ... 53.1000 C123 S 4 5 0 3 ... 8.0500 NaN S .. ... ... ... ... ... ... ... 886 887 0 2 ... 13.0000 NaN S 887 888 1 1 ... 30.0000 B42 S 888 889 0 3 ... 23.4500 NaN S 889 890 1 1 ... 30.0000 C148 C 890 891 0 3 ... 7.7500 NaN Q [891 rows x 12 columns] subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False) Out[5]: PassengerId Survived Pclass ... Fare Cabin Embarked 0 1 0 3 ... 7.2500 <NA> S 1 2 1 1 ... 71.2833 C85 C 2 3 1 3 ... 7.9250 <NA> S 3 4 1 1 ... 53.1000 C123 S 4 5 0 3 ... 8.0500 <NA> S .. ... ... ... ... ... ... ... 886 700000 0 2 ... 13.0000 <NA> S 887 800000 1 1 ... 30.0000 B42 S 888 900000 0 3 ... 23.4500 <NA> S 889 0 1 1 ... 30.0000 C148 C 890 100000 0 3 ... 7.7500 <NA> Q [891 rows x 12 columns] subst.dtypes Out[8]: PassengerId uint32 Survived uint8 Pclass uint8 Name string Sex category Age object SibSp uint8 Parch uint8 Ticket object Fare float64 Cabin category Embarked category #As you can see, the numbers that we have substituted have been converted to int #Let's do something like math.floor in a very unconventional way :) df.Fare Out[16]: 0 7.2500 1 71.2833 2 7.9250 3 53.1000 4 8.0500 ... 886 13.0000 887 30.0000 888 23.4500 889 30.0000 890 7.7500 Name: Fare, Length: 891, dtype: float64 Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False) 0 7 1 71 2 7 3 53 4 8 .. ... 886 13 887 30 888 23 889 30 890 7 Fareint.dtypes Out[18]: Fare uint16 #You should not use this method if there are other ways to convert float to int. #It serves best for data cleaning, at least that's what I am using it for. Parameters: df: Union[pd.DataFrame, pd.Series] regular_expression: str Syntax from https://pypi.org/project/regex/ replace: str the replacement you want to use (groups are allowed) flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_trie_regex_sub / pandas.Series.ds_trie_regex_sub ```python #Check out the docs of df.trie_regex_find_all() for detailed information #Some examples with DataFrames / Series df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") #hugeregexlist = ['Braund', 'Cumings', 'Heikkinen', 'Futrelle', 'Allen', 'Moran', 'McCarthy', 'Palsson', 'Johnson', 'Nasser' ... ] df.Name.ds_trie_regex_sub(hugeregexlist, 'HANS',add_left_to_regex=r'^\b',add_right_to_regex=r'\b') Out[16]: Name 0 HANS, Mr. Owen Harris 1 HANS, Mrs. John Bradley (Florence Briggs Thayer) 2 HANS, Miss. Laina 3 HANS, Mrs. Jacques Heath (Lily May Peel) 4 HANS, Mr. William Henry .. ... 886 HANS, Rev. Juozas 887 HANS, Miss. Margaret Edith 888 HANS, Miss. Catherine Helen "Carrie" 889 HANS, Mr. Karl Howell 890 HANS, Mr. Patrick [891 rows x 1 columns] allstrings.ds_trie_regex_search(hugeregexlist,line_by_line=True) Out[25]: result_0 0 0 Braund 0 Cumings 0 Heikkinen 0 Futrelle 0 Allen .. ... 1 1 Montvila 1 Graham 1 Johnston 1 Behr 1 Dooley [7124 rows x 1 columns] Parameters: df: Union[pd.DataFrame, pd.Series] wordlist: list[str] All strings you are looking for replace: str the replacement you want to use (groups are allowed) add_left_to_regex: str if you want to add something before the generated Trie regex -> \b for example allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b') (default = "") add_right_to_regex: str if you want to add something after the generated Trie regex -> \b for example allstrings.ds_trie_regex_find_all(hugeregexlist,add_left_to_regex=r'\b',add_right_to_regex=r'\b') (default = "") flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_regex_search / pandas.Series.ds_regex_search ```python #Use regex.search against a DataFrame/Series without having to fear any exception! You can get #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever #fits best! # #Some examples df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") multilinetest=df.Name.map(lambda x: f'{x}\n' * 3) #Every name 3x in each cell to test line_by_line #using line_by_line=False multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=False, flags=re.IGNORECASE) Out[13]: result_0 58 Name West, Miss. Constance Mirium\nWest Name Mirium Name West 426 Name Clarke, Mrs. Charles V (Ada Maria Winfield Name Maria Name Winfield 472 Name West, Mrs. Edwy Arthur (Ada Mary Worth Name Mary Name Worth 862 Name Swift, Mrs. Frederick Joel (Margaret Welles Name Margaret Name Welles #using line_by_line=True multilinetest.ds_regex_search(regular_expression=r'^.*(\bM\w+\b)\s+(\bW\w+\b)', line_by_line=True, flags=re.IGNORECASE) Out[19]: result_0 426 Name Clarke, Mrs. Charles V (Ada Maria Winfield Name Maria Name Winfield Name Clarke, Mrs. Charles V (Ada Maria Winfield Name Maria Name Winfield Name Clarke, Mrs. Charles V (Ada Maria Winfield Name Maria Name Winfield 472 Name West, Mrs. Edwy Arthur (Ada Mary Worth Name Mary Name Worth Name West, Mrs. Edwy Arthur (Ada Mary Worth Name Mary Name Worth Name West, Mrs. Edwy Arthur (Ada Mary Worth Name Mary Name Worth 862 Name Swift, Mrs. Frederick Joel (Margaret Welles Name Margaret Name Welles Name Swift, Mrs. Frederick Joel (Margaret Welles Name Margaret Name Welles Name Swift, Mrs. Frederick Joel (Margaret Welles Name Margaret Name Welles #Now, we get a match for each line! Parameters: df: Union[pd.DataFrame, pd.Series] regular_expression: str Syntax from https://pypi.org/project/regex/ flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ``` ##### pandas.DataFrame.ds_regex_sub / pandas.Series.ds_regex_sub ```python #Use regex.sub against a DataFrame/Series without having to fear any exception! You can get #the results as strings (dtype_string=True) or even as float, int, category (dtype_string=False) - Whatever #fits best! # #Some examples df=pd.read_csv( "https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv") PassengerId Survived Pclass ... Fare Cabin Embarked 0 1 0 3 ... 7.2500 NaN S 1 2 1 1 ... 71.2833 C85 C 2 3 1 3 ... 7.9250 NaN S 3 4 1 1 ... 53.1000 C123 S 4 5 0 3 ... 8.0500 NaN S .. ... ... ... ... ... ... ... 886 887 0 2 ... 13.0000 NaN S 887 888 1 1 ... 30.0000 B42 S 888 889 0 3 ... 23.4500 NaN S 889 890 1 1 ... 30.0000 C148 C 890 891 0 3 ... 7.7500 NaN Q [891 rows x 12 columns] subst=df.ds_regex_sub(regular_expression=r'^\b8\d(\d)\b', replace=r'\g<1>00000',dtype_string=False) Out[5]: PassengerId Survived Pclass ... Fare Cabin Embarked 0 1 0 3 ... 7.2500 <NA> S 1 2 1 1 ... 71.2833 C85 C 2 3 1 3 ... 7.9250 <NA> S 3 4 1 1 ... 53.1000 C123 S 4 5 0 3 ... 8.0500 <NA> S .. ... ... ... ... ... ... ... 886 700000 0 2 ... 13.0000 <NA> S 887 800000 1 1 ... 30.0000 B42 S 888 900000 0 3 ... 23.4500 <NA> S 889 0 1 1 ... 30.0000 C148 C 890 100000 0 3 ... 7.7500 <NA> Q [891 rows x 12 columns] subst.dtypes Out[8]: PassengerId uint32 Survived uint8 Pclass uint8 Name string Sex category Age object SibSp uint8 Parch uint8 Ticket object Fare float64 Cabin category Embarked category #As you can see, the numbers that we have substituted have been converted to int #Let's do something like math.floor in a very unconventional way :) df.Fare Out[16]: 0 7.2500 1 71.2833 2 7.9250 3 53.1000 4 8.0500 ... 886 13.0000 887 30.0000 888 23.4500 889 30.0000 890 7.7500 Name: Fare, Length: 891, dtype: float64 Fareint=df.Fare.ds_regex_sub(r'(\d+)\.\d+$', r'\g<1>',dtype_string=False) 0 7 1 71 2 7 3 53 4 8 .. ... 886 13 887 30 888 23 889 30 890 7 Fareint.dtypes Out[18]: Fare uint16 #You should not use this method if there are other ways to convert float to int. #It serves best for data cleaning, at least that's what I am using it for. Parameters: df: Union[pd.DataFrame, pd.Series] regular_expression: str Syntax from https://pypi.org/project/regex/ replace: str the replacement you want to use (groups are allowed) flags:int You can use any flag that is available here: https://pypi.org/project/regex/ (default =regex.UNICODE) dtype_string:bool If True, it returns all results as a string If False, data types are converted to the best available (default =True) line_by_line:bool If you want to split the line before searching. Useful, if you want to use ^....$ more than once. (default =False) Returns: Union[pd.Series, pd.DataFrame] ```

نیازمندی

مقدار	نام
-	a-pandas-ex-df-to-string
-	a-pandas-ex-less-memory-more-speed
-	a-pandas-ex-plode-tool
-	a-pandas-ex-string-to-dtypes
-	flatten-everything
-	numpy
-	pandas
-	regex

نحوه نصب

نصب پکیج whl a-pandas-ex-regex-enhancements-0.11:

pip install a-pandas-ex-regex-enhancements-0.11.whl

نصب پکیج tar.gz a-pandas-ex-regex-enhancements-0.11:

pip install a-pandas-ex-regex-enhancements-0.11.tar.gz