معرفی شرکت ها


dataflows-normalize-0.0.9


Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر
Card image cap
تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

A resource normalizer for dataflows
ویژگی مقدار
سیستم عامل -
نام فایل dataflows-normalize-0.0.9
نام dataflows-normalize
نسخه کتابخانه 0.0.9
نگهدارنده []
ایمیل نگهدارنده []
نویسنده Adam Kariv
ایمیل نویسنده adam.kariv@gmail.com
آدرس صفحه اصلی https://github.com/dataspot/dataflows-normalize
آدرس اینترنتی https://pypi.org/project/dataflows-normalize/
مجوز MIT
# Dataflows Resource write to db normalized This library provides some dataflows processing for normalizing a resource. It has special support for storing normalized data into DB tables. ## What is normalization? In short, it is the process of reducing duplication in a dataset. More can be read about this concept [here](https://en.wikipedia.org/wiki/Database_normalization). ## Example Let's take, as an example, this world cities dataset (we shall call it the *fact* resource): ```python from dataflows import Flow, load, printer Flow( load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'), printer(num_rows=1) ).process() ``` *cities:* |# |name |country |subcountry |geonameid |-----|----------------|----------|------------------|----------- |1 |les Escaldes |Andorra |Escaldes-Engordany |3040051 |2 |Andorra la Vella|Andorra |Andorra la Vella |3041563 |... |23018|**Chitungwiza** |**Zimbabwe** |**Harare** |1106542 It seems that the `country` and `subcountry` columns are quite repetitive - let's extract them into a separate, deduplicated resource (we will call that a *dimension* resource). To do that we use the `normalize` processor. This processor receives a single resource name, and a list of `NormGroup` instances. Each of these groups specifies one new *dimension* resource to be extracted and deduplicated. Let's see it in action: ```python from dataflows_normalize import normalize, NormGroup Flow( load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'), normalize([ NormGroup(['country', 'subcountry'], 'country_id', 'id') ], resource='cities'), printer() ).process() ``` *cities:* |# |name |geonameid |country_id |-----|-----------------|-----------|------------ |1 |les Escaldes |3040051 |0 |2 |Andorra la Vella |3041563 |1 |3 |Umm al Qaywayn |290594 |2 |4 |Ras al-Khaimah |291074 |3 |5 |Khawr Fakkān |291696 |4 |... |23014|Bulawayo |894701 |2677 |23015|Bindura |895061 |2678 |23016|Beitbridge |895269 |2679 |23017|Epworth |1085510 |2676 |23018|**Chitungwiza** |1106542 |**2676** *cities_country_id:* |# |id|country |subcountry |----|-----------|-------------------|---------------------- |1 |30|Afghanistan |Badakhshan |2 |27|Afghanistan |Badghis |3 |21|Afghanistan |Balkh |4 |33|Afghanistan |Bāmīān |5 |31|Afghanistan |Farah |6 |19|Afghanistan |Faryab |7 |28|Afghanistan |Ghaznī |8 |13|Afghanistan |Ghowr |9 |22|Afghanistan |Helmand |10 |11|Afghanistan |Herat |... |2671 |2677|Zimbabwe |Bulawayo |2672 |**2676**|**Zimbabwe** |**Harare** |2673 |2673|Zimbabwe |Manicaland |2674 |2678|Zimbabwe |Mashonaland Central |2675 |2675|Zimbabwe |Mashonaland East |2676 |2674|Zimbabwe |Mashonaland West |2677 |2670|Zimbabwe |Masvingo |2678 |2671|Zimbabwe |Matabeleland North |2679 |2679|Zimbabwe |Matabeleland South |2680 |2672|Zimbabwe |Midlands If we follow the last line in the dataset (`Chitungwiza`), we can see that an entry for its region (`Zimbabwe/Harare`) was created with id `2676`, and that id was added to the original row instead of the original values. **How much did we gain?** The original CSV file has a size of 895,586 bytes. If we save the two new resources as CSVs, we would get 542,299 bytes for the *fact* resource and 68,023 for the regions *dimension* resource - a total of 610,322 bytes (or a reduction of 31% in size). Not only this helps with size, it also improves greatly DB performance to store data in normalized form. ## DB Normalization Running similar code to above, only using `normalize_to_db` will do the following: - Load existing values from database *dimension* tables (in case these tables exist) - Normalize the input data, and split into *fact* and *dimension* resources - Update the DB tables with new values, while reusing existing references The main difference in usage from `normalize` is that the names of DB tables are provided. ```python from dataflows_normalize import normalize_to_db, NormGroup Flow( load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'), normalize_to_db( [ NormGroup(['country', 'subcountry'], 'country_id', 'id', db_table='countries_db_table') ], 'cities_db_table', 'cities', db_connection_str='...' ), ).process() ```


نیازمندی

مقدار نام
>=0.0.51 dataflows
>=0.0.7 kvfile
- psycopg2-binary
- pylama
- tox


نحوه نصب


نصب پکیج whl dataflows-normalize-0.0.9:

    pip install dataflows-normalize-0.0.9.whl


نصب پکیج tar.gz dataflows-normalize-0.0.9:

    pip install dataflows-normalize-0.0.9.tar.gz