Python Read From Csv and Calculate Payroll
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:

    # Load the Pandas libraries with alias 'pd'
    import pandas as pd
    # Read data from file 'filename.csv'
    # (in the same directory that your python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")
    # Preview the first 5 lines of the loaded data
    data.head()

While this code seems simple, an understanding of a few fundamental concepts is required to fully grasp and debug the operation of the data loading procedure if you run into issues:
- Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented within CSV files – if you open a CSV file, what does the information actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers determine how to read files using the "file extension", which is the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
- project1.DOCX – a Microsoft Word file called Project1.
- shanes_file.TXT – a simple text file called shanes_file.
- IMG_5673.JPG – an image file called IMG_5673.
- Other well known file types and extensions include: XLSX – Excel, PDF – Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music, etc.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
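The extension can also be inspected from Python itself. The snippet below is a minimal sketch using the standard-library pathlib module; the filenames are illustrative examples in the spirit of the list above, not real files on disk.

```python
from pathlib import Path

# Illustrative filenames only; none of these need to exist on disk
filenames = ["project1.DOCX", "shanes_file.TXT", "IMG_5673.JPG", "data.csv"]

# Path.suffix returns the final extension, including the leading dot.
# Lower-casing makes ".CSV" and ".csv" compare equal.
extensions = {name: Path(name).suffix.lower() for name in filenames}

for name, ext in extensions.items():
    print(f"{name} -> {ext}")
```

Note that the extension is only a naming convention: renaming a file to ".csv" does not change what is stored inside it.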
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.
To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options (or File Explorer Options, as it is now called) > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > In the menu, click Finder > Preferences, click Advanced, and select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data are shown in the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
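To make the text-to-table relationship concrete, here is a small sketch. The data set (names, ages, cities) is made up for illustration, and io.StringIO is used so the example runs without a file on disk.

```python
import io

import pandas as pd

# A made-up data set, written exactly as it would appear inside a .csv file:
# the first row holds the column names, each later row is one record.
csv_text = """name,age,city
Alice,34,Dublin
Bob,28,Cork
"""

# io.StringIO lets read_csv treat the string as if it were a file
table = pd.read_csv(io.StringIO(csv_text))
print(table)
```

Each comma in the text becomes a column boundary and each newline becomes a row boundary in the resulting DataFrame.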
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
However, the choice of the ',' comma character to delimit columns is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
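As a rough illustration of the sep parameter, the sketch below loads the same made-up two-column data set once from tab-separated text and once from semicolon-separated text. The data and variable names are assumptions for the example.

```python
import io

import pandas as pd

# The same illustrative data, delimited two different ways
tsv_text = "name\tage\nAlice\t34\nBob\t28\n"
ssv_text = "name;age\nAlice;34\nBob;28\n"

# sep tells read_csv which character separates the columns
from_tabs = pd.read_csv(io.StringIO(tsv_text), sep="\t")
from_semicolons = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should produce identical DataFrames
print(from_tabs.equals(from_semicolons))
```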
Delimiters in Text Fields – Quotechar
One complication in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to enclose these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the case shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.
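A minimal sketch of this behaviour follows. The rows are invented for illustration; only the "NickName" column name comes from the description above.

```python
import io

import pandas as pd

# Semicolon-delimited text where the NickName field itself contains semicolons.
# The quotation marks protect those semicolons from being read as separators.
ssv_text = (
    "Name;NickName;Age\n"
    'Robert;"Bob; Bobby; Rob";41\n'
    'Jane;"JJ";37\n'
)

data = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')
print(data)
```

Without the quote characters, the first data row would contain five fields instead of three and would not line up with the header.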
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.
Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module brings operating-system-dependent functionality into Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog
    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
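The following sketch shows one way changing the working directory might look in practice. The temporary directory and the file name test_data.csv are stand-ins created by the example itself, so that it runs anywhere.

```python
import os
import tempfile

# Remember where we started so we can return afterwards
original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as data_dir:
    # Create a small stand-in data file inside the temporary directory
    with open(os.path.join(data_dir, "test_data.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    # Switch the working directory to where the data lives
    os.chdir(data_dir)
    files_seen = os.listdir()  # no argument: lists the current working directory

    # Switch back before the temporary directory is cleaned up
    os.chdir(original_dir)

print(files_seen)
```

Because os.chdir() affects every later relative path in the process, returning to the original directory afterwards tends to avoid surprises.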
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
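The difference between the two kinds of path can be checked directly. This is a small sketch using pathlib; the path data/test_file.csv is the illustrative relative path from the list above and does not need to exist.

```python
import os
from pathlib import Path

# Illustrative relative path to a file in a subdirectory of the working directory
relative_path = Path("data") / "test_file.csv"

# Joining it onto the current working directory gives the matching absolute path
absolute_path = Path(os.getcwd()) / relative_path

print(relative_path.is_absolute())
print(absolute_path.is_absolute())
print(absolute_path)
```

Either form can be passed to read_csv; the relative form keeps working when the whole project folder is copied to another machine.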
4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors can be caused in unusual circumstances to do with your data format. Try adding the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
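The first two errors can also be handled in code. The sketch below wraps read_csv in a helper named load_csv, which is hypothetical and not part of Pandas; the list of fallback encodings is likewise an assumption and should be adapted to where your files come from.

```python
import os
import tempfile

import pandas as pd


def load_csv(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn until one decodes."""
    if not os.path.exists(path):
        # Report the working directory, since that is the usual culprit
        raise FileNotFoundError(
            f"{path} not found; current working directory is {os.getcwd()}"
        )
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error


# Demo: write a small Latin-1 encoded file, which is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.csv")
    with open(path, "w", encoding="latin-1") as f:
        f.write("name,city\nZoë,Málaga\n")
    loaded = load_csv(path)

print(loaded)
```

Guessing encodings this way is only a fallback; when the source system's encoding is known, passing it directly via the encoding parameter is more reliable.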
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
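A short sketch of explicit typing is given here. The column names and values are invented for the example; only dtype and parse_dates are used, since those are stable across Pandas versions.

```python
import io

import numpy as np
import pandas as pd

# Illustrative data: without hints, 'age' would load as int64
# and 'birth_date' would load as plain text
csv_text = "name,age,birth_date\nAlice,34,1990-05-17\nBob,28,1996-11-02\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # force specific column types
    parse_dates=["birth_date"],            # convert this column to datetimes
)
print(data.dtypes)
```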
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). The usecols parameter can be used to specify which columns in the data to load.
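The three parameters can be compared side by side on a tiny made-up data set, as in this sketch.

```python
import io

import pandas as pd

csv_text = "id,name,salary\n1,Alice,100\n2,Bob,200\n3,Cara,300\n4,Dan,400\n"

# nrows: read only the first two data rows
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# skiprows with a list: skip file lines 1 and 2 (0-indexed; line 0 is the header)
without_first_two = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# usecols: load only the named columns
two_columns = pd.read_csv(io.StringIO(csv_text), usecols=["name", "salary"])

print(first_two)
print(without_first_two)
print(two_columns)
```

One thing to watch: when skiprows is given as an int, the count starts from the very first line of the file, so the header line is skipped too unless the column names are supplied another way.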
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
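As a small illustration, the sketch below treats '.' and '??' as missing in addition to the defaults. The data is invented for the example.

```python
import io

import pandas as pd

# 'NA' is recognised as missing by default; '??' and '.' are not
csv_text = "name,salary\nAlice,100\nBob,??\nCara,.\nDan,NA\n"

data = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

# Boolean mask marking which salary values were read as missing
missing = data["salary"].isna()
print(data)
print(missing.tolist())
```

Values passed through na_values are added to the default list rather than replacing it, which is why 'NA' is still treated as missing here.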
    # Advanced CSV loading example
    data = pd.read_csv(
        "data/files/complex_data_example.tsv",     # relative python path to subdirectory
        sep='\t',                                  # Tab-separated value file.
        quotechar="'",                             # single quote allowed as quote character
        dtype={"salary": int},                     # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],  # Only load the three columns specified.
        parse_dates=['birth_date'],                # Interpret the birth_date column as a date
        skiprows=10,                               # Skip the first 10 rows of the file
        na_values=['.', '??']                      # Take any '.' or '??' values as NA
    )

CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor.
- CSV files are quick to create and load into memory before analysis.
Nevertheless, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, and column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
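The compression point can be tried out when writing a DataFrame back to disk. The sketch below writes the same made-up DataFrame as plain CSV and as a zip-compressed CSV via the compression parameter of to_csv, then compares file sizes. The data, file names, and temporary directory are assumptions for the example, and real-world savings depend on how repetitive the data is.

```python
import os
import tempfile

import pandas as pd

# Made-up, highly repetitive data: 1000 rows with the same salary
data = pd.DataFrame({"employee_id": range(1000), "salary": [50000.0] * 1000})

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "salaries.csv")
    zip_path = os.path.join(tmp, "salaries.zip")

    # index=False leaves the DataFrame's row index out of the file
    data.to_csv(raw_path, index=False)
    data.to_csv(zip_path, index=False, compression="zip")

    raw_size = os.path.getsize(raw_path)
    zip_size = os.path.getsize(zip_path)

    # read_csv can load the compressed file directly
    restored = pd.read_csv(zip_path, compression="zip")

print(f"plain: {raw_size} bytes, zipped: {zip_size} bytes")
```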
As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in the R and Python ecosystems, Wes McKinney and Hadley Wickham, introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the os module.
- Datacamp Tutorial on loading CSV files, including some additional OS commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/