

deduplify-20.9.0



Description

A Python package to search for and remove duplicated files in messy datasets
| Attribute | Value |
|---|---|
| Operating system | - |
| File name | deduplify-20.9.0 |
| Name | deduplify |
| Version | 20.9.0 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | Sarah Gibson |
| Author email | drsarahlgibson@gmail.com |
| Homepage | https://github.com/Living-with-Machines/deduplify |
| Package URL | https://pypi.org/project/deduplify/ |
| License | MIT |

# deduplify

[![PyPI](https://img.shields.io/pypi/v/deduplify)](https://pypi.org/project/deduplify/) [![CI](https://github.com/Living-with-machines/deduplify/workflows/CI/badge.svg)](https://github.com/Living-with-machines/deduplify/actions?query=workflow%3ACI) [![pre-commit.ci status](https://results.pre-commit.ci/badge/github/Living-with-machines/deduplify/main.svg)](https://results.pre-commit.ci/latest/github/Living-with-machines/deduplify/main)

A Python tool to search for and remove duplicated files in messy datasets.

**Table of Contents:**

- [Overview](#overview)
- [Installation](#installation)
  - [From PyPI](#from-pypi)
  - [Manual Installation](#manual-installation)
- [Usage](#usage)
  - [Hashing files](#hashing-files)
  - [Comparing files](#comparing-files)
  - [Cleaning up](#cleaning-up)
  - [Global arguments](#global-arguments)
- [Contributing](#contributing)

---

## Overview

`deduplify` is a Python command line tool that searches a directory tree for duplicated files and optionally removes them. It generates an MD5 hash for every file recursively under a target directory and identifies the filepaths that produce unique and duplicated hashes. When deleting duplicated files, it deletes those deepest in the directory tree first, leaving the shallowest copy in place.

## Installation

`deduplify` has a minimum Python requirement of v3.7 but has been developed in v3.8.

### From PyPI

First, make sure your `pip` version is up-to-date.

```bash
python -m pip install --upgrade pip
```

Then install `deduplify`.

```bash
pip install deduplify
```

### Manual Installation

Begin by cloning this repository and changing into it.

```bash
git clone https://github.com/Living-with-machines/deduplify.git
cd deduplify
```

Now run the setup script. This will install the requirements and the CLI tool into your Python `$PATH`.

```bash
python setup.py install
```

## Usage

`deduplify` has 3 commands: `hash`, `compare` and `clean`.

### Hashing files

The `hash` command takes a path to a target directory as an argument. It walks the structure of this directory tree, generates MD5 hashes for all files, and outputs a database stored as a JSON file, the name of which can be overridden using the `--dbfile [-f]` flag. Each document in the generated database is a dictionary with the following properties:

```json
{
    "filepath": "",     # String. The full path to a given file.
    "hash": "",         # String. The MD5 hash of the given file.
    "duplicate": bool,  # Boolean. Whether this hash is repeated in the database (True) or not (False).
}
```

By default, `deduplify` generates hashes for all files under a directory, but one or more specific file extensions to search for can be specified using the `--exts` flag.

**Command line usage:**

```bash
usage: deduplify hash [-h] [-c COUNT] [-v] [-f DBFILE] [--exts [EXTS]] [--restart] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f DBFILE, --dbfile DBFILE
                        Destination database for file hashes. Must be a JSON
                        file. Default: file_hashes.json
  --exts [EXTS]         A list of file extensions to search for.
  --restart             Restart a run of hashing files and skip over files
                        that have already been hashed. Output file containing
                        a database of filenames and hashes must already exist.
```
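For intuition, here is a minimal sketch of the kind of chunked MD5 hashing and directory walk described above. It is an illustration only, not `deduplify`'s actual internals: the function names, the chunk size, and the single-pass duplicate flagging are all assumptions.

```python
import hashlib
import os
from typing import Dict, List


def md5_of_file(filepath: str, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def hash_tree(target_dir: str) -> List[Dict]:
    """Walk target_dir and build records shaped like the JSON documents above."""
    first_seen: Dict[str, str] = {}  # hash -> first filepath that produced it
    records: List[Dict] = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            path = os.path.join(root, name)
            digest = md5_of_file(path)
            # Note: this one-pass sketch only flags *later* occurrences of a
            # hash as duplicates; a second pass could mark every repeated record.
            records.append({"filepath": path, "hash": digest, "duplicate": digest in first_seen})
            first_seen.setdefault(digest, path)
    return records
```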
### Comparing files

The `compare` command reads in the JSON database generated by running `hash`, the name of which can be overridden using the `--infile [-f]` flag if the data were saved under a different name. The command checks whether the stems of the filepaths are equivalent for all paths that generated a given hash. If they are, the file is a true duplicate, since both its name and content match. If they do not match, the same content has been saved under two different filenames; in this scenario, a warning is raised asking the user to manually investigate these files. If all the filenames for a given hash match, the shortest filepath is removed from the list and the rest are returned to be deleted (a sketch of this selection rule appears after the README, below).

To delete files, the user needs to run `compare` with the `--purge` flag set.

A recommended workflow to ensure that all duplicated files have been removed would be as follows:

```bash
deduplify hash target_dir  # First pass at hashing files
deduplify compare --purge  # Delete duplicated files
deduplify hash target_dir  # Second pass at hashing files
deduplify compare          # Compare the filenames again. The code should return nothing to compare
```

**Command line usage:**

```bash
usage: deduplify compare [-h] [-c COUNT] [-v] [-f INFILE] [--list-files] [--purge]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f INFILE, --infile INFILE
                        Database to analyse. Must be a JSON file. Default:
                        file_hashes.json
  --list-files          List duplicated files. Default: False
  --purge               Deletes duplicated files. Default: False
```

### Cleaning up

After purging duplicated files, the target directory may be left with empty sub-directories. Running the `clean` command will locate and remove these empty subdirectories.

**Command line usage:**

```bash
usage: deduplify clean [-h] [-c COUNT] [-v] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
```

### Global arguments

The following flags can be passed to any of the commands of `deduplify`.

- `--verbose [-v]`: This flag will print verbose output to the console, as opposed to saving it to the `deduplify.log` file.
- `--count [-c]`: Some processes within `deduplify` can be parallelised over multiple threads when working with larger datasets. To do this, include the `--count` flag with the (integer) number of threads you'd like to parallelise over. This flag will raise an error if requesting more threads than CPUs available on the host machine.

## Contributing

Thank you for wanting to contribute to `deduplify`! :tada: :sparkling_heart: To get you started, please read our [Code of Conduct](./CODE_OF_CONDUCT.md) and [Contributing Guidelines](./CONTRIBUTING.md).
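Returning to the `compare` selection rule described in the README above (group filepaths by hash, warn when filenames differ, keep the shortest path), here is a hedged sketch of that logic. The function name and grouping approach are illustrative assumptions, not the package's internals.

```python
import os
from collections import defaultdict
from typing import Dict, List


def files_to_delete(records: List[Dict]) -> List[str]:
    """Group filepaths by hash; for true duplicates, keep the shortest path."""
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        by_hash[rec["hash"]].append(rec["filepath"])

    to_delete: List[str] = []
    for digest, paths in by_hash.items():
        if len(paths) < 2:
            continue  # hash is unique, nothing to delete
        # Compare the final path component, as a stand-in for the stem check.
        names = {os.path.basename(p) for p in paths}
        if len(names) > 1:
            # Same content under different filenames: warn and leave for manual review.
            print(f"WARNING: hash {digest} maps to differently named files: {sorted(names)}")
            continue
        # True duplicates: the shortest filepath survives, the rest are deleted.
        keep = min(paths, key=len)
        to_delete.extend(p for p in paths if p != keep)
    return to_delete
```

Running this over the records produced by a hashing pass would yield the deletion list that `compare --purge` acts on in the real tool.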


Requirements

| Name | Version |
|---|---|
| incremental | - |
| jmespath | ==1.0.0 |
| rich | ==12.2.0 |
| tinydb | ==4.7.0 |
| tqdm | - |
| twisted | - |
| coverage | ==6.3.2 |
| pytest | ==7.1.1 |


Language requirement

| Name | Version |
|---|---|
| Python | >=3.7.0 |


Installation


Installing the deduplify-20.9.0 whl package:

    pip install deduplify-20.9.0.whl


Installing the deduplify-20.9.0 tar.gz package:

    pip install deduplify-20.9.0.tar.gz