Reference
TableReader.readcsv — Function
readcsv(filename, command, or IO object; delim = ',', <keyword arguments>)
Read a CSV (comma-separated values) text file.
This function is the same as readdlm but with delim = ','. See readdlm for details.
TableReader.readtsv — Function
readtsv(filename, command, or IO object; delim = '\t', <keyword arguments>)
Read a TSV (tab-separated values) text file.
This function is the same as readdlm but with delim = '\t'. See readdlm for details.
TableReader.readdlm — Function
readdlm(filename, command, or IO object;
        delim = nothing,
        quot = '"',
        trim = true,
        lzstring = true,
        skip = 0,
        skipblank = true,
        comment = "",
        colnames = nothing,
        normalizenames = false,
        hasheader = (colnames === nothing),
        chunkbits = 20 #= 1 MiB =#)
Read a character delimited text file.
readcsv and readtsv call this function internally. To read a CSV or TSV file, consider using these dedicated functions instead.
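For instance, the following two calls read the same data (a minimal sketch; "data.csv" is a hypothetical file name):

# readcsv is readdlm with the delimiter fixed to a comma.
using TableReader
df1 = readcsv("data.csv")
df2 = readdlm("data.csv", delim = ',')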
Data source
The first (and only positional) argument specifies the source to read data from.
If the argument is a string, it is considered a local file name or the URL of a remote file. If the name matches the regular expression r"^\w+://.*", it is handled as a URL. For example, "https://example.com/path/to/file.csv" is regarded as a URL and its content is streamed using the curl command.
If the argument is a command object, it is considered a source whose standard output is the text data to read. For example, `unzip -p path/to/file.zip somefile.csv` can be used to extract a file from a zipped archive. It is also possible to chain several commands using pipeline.
If the argument is an object of the IO type, it is considered a direct data source. The content is read using read or other similar functions. For example, passing IOBuffer(text) makes it possible to read data from a raw text object.
The data source is transparently decompressed if the compression format is detectable. Currently, gzip, zstd, and xz are supported. The format is detected by the magic bytes of the stream header, and therefore other information such as file names does not affect the detection.
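The following sketch illustrates the possible data sources (the file names and the URL are hypothetical):

using TableReader

# A local file; a gzip/zstd/xz-compressed file is decompressed transparently.
df = readcsv("path/to/file.csv")

# A remote file; the content is streamed using the curl command.
df = readcsv("https://example.com/path/to/file.csv")

# The standard output of a command, here extracting a file from a zip archive.
df = readcsv(`unzip -p path/to/file.zip somefile.csv`)

# An IO object wrapping in-memory text.
df = readcsv(IOBuffer("a,b\n1,2\n"))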
Parser parameters
delim specifies the field delimiter in a line. This cannot be the same character as quot. If delim is nothing, the parser tries to guess a delimiter from the data. Currently, the following delimiters are allowed: '\t', ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'.
quot specifies the quotation character used to enclose a field. This cannot be the same character as delim. If quot is nothing, no characters are recognized as a quotation mark. Currently, the following quotation characters are allowed: ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'.
trim specifies whether the parser trims space (0x20) characters around a field. If trim is true, delim and quot cannot be a space character.
lzstring specifies whether fields with excess leading zeros are treated as strings. If lzstring is true, fields such as "0003" will be interpreted as strings instead of integers.
skip specifies the number of lines to skip before reading data. The line just after the skipped lines is considered a header line if the colnames parameter is not specified.
skipblank specifies whether the parser ignores blank lines. If skipblank is false, encountering a blank line throws an exception.
comment specifies the leading sequence of comment lines. If it is a non-empty string, text lines that start with the sequence will be skipped as comments. The default value (an empty string) does not skip any lines as comments.
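As a sketch of how these parameters combine (the file name and its layout are hypothetical):

using TableReader

# Read a semicolon-delimited file, skipping two metadata lines at the top
# and treating lines that start with "#" as comments.
df = readdlm("measurements.txt",
             delim = ';',
             skip = 2,
             comment = "#")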
Column names
colnames specifies the column names. If colnames is nothing (default), the column names are read from the first line just after skipping the lines specified by skip (no lines are skipped by default). Any iterable object is allowed.
normalizenames uses 'safe' names for the Julia symbols used in DataFrames. If normalizenames is false (default), the column names will be the same as in the source file, using basic parsing. If normalizenames is true, reserved words and characters will be removed or replaced in the column names.
hasheader specifies whether the data has a header line. The default value is colnames === nothing, and thus the parser assumes there is a header if and only if no column names are specified.
The following table summarizes the behavior of the colnames and hasheader parameters.
| colnames | hasheader | column names |
|---|---|---|
| nothing | true | taken from the header (default) |
| nothing | false | automatically generated (X1, X2, ...) |
| specified | true | taken from colnames (the header line is skipped) |
| specified | false | taken from colnames |
If unnamed columns are found in the header, they are renamed to UNNAMED_{j} for ease of access, where {j} is replaced by the column number. If the number of header columns in a file is one less than the number of data columns, a column named UNNAMED_0 will be inserted into the column names as the first column. This is useful for reading files written by the write.table function of R with row.names = TRUE.
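For example, a headerless file can be read with explicit column names (a sketch; the file name is hypothetical):

using TableReader

# The file has no header line: supply the names and tell the parser
# not to consume the first data line as a header.
df = readcsv("points.csv", colnames = [:x, :y, :z], hasheader = false)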
Data types
Integers, floating-point numbers, boolean values, dates, datetimes, missings, and strings are automatically detected and converted from the text data. The following list summarizes the corresponding Julia data types and the text formats, described as regular expressions:
- Integer (Int): [-+]?\d+
- Float (Float64): [-+]?\d*\.?\d+, [-+]?\d*\.?\d+([eE][-+]?\d+)?, [-+]?NaN, or [-+]?Inf(inity)? (case-insensitive)
- Bool (Bool): t(rue)? or f(alse)? (case-insensitive)
- Date (Dates.Date): \d{4}-\d{2}-\d{2}
- Datetime (Dates.DateTime): \d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?
- Missing (Missing): empty field or NA (case-sensitive)
- String (String): otherwise
Integers and floats have some overlap. The parser gives integers precedence over floats. That is, if all values in a column are parsable both as integers and as floats, they are parsed as integers; otherwise, they are parsed as floats. Similarly, all the other types have higher precedence than strings.
The parser parameter lzstring affects the interpretation of numbers. If lzstring is true, numbers with excess leading zeros (e.g., "0001", "00.1") are interpreted as strings. Fields without excess leading zeros (e.g., "0", "0.1") are interpreted as numbers regardless of this parameter.
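The detection rules can be seen on in-memory data; the expected column types noted in the comments below follow from the rules above:

using TableReader

data = """
id,value,flag,when,note
001,1.5,true,2019-01-01,NA
2,2,false,2019-01-02,hello
"""
df = readcsv(IOBuffer(data))
# Expected column types:
#   id    => String   ("001" has excess leading zeros, so the column is text)
#   value => Float64  (not all values are integers, so floats win)
#   flag  => Bool
#   when  => Dates.Date
#   note  => String with missing ("NA" becomes a missing value)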
Parsing behavior
The only supported text encoding is UTF-8, which is the default character encoding scheme of many functions in Julia. If you need to read text encoded in a scheme other than UTF-8, you must wrap the data stream with an encoding conversion tool such as the iconv command or StringEncodings.jl.
# Convert text encoding from Shift JIS (Japanese) to UTF8.
readcsv(`iconv -f sjis -t utf8 somefile.csv`)
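Alternatively, a sketch of the same conversion with StringEncodings.jl (this assumes the StringEncodings package is installed; a StringDecoder is an IO object, so it can be passed directly):

using TableReader
using StringEncodings

# Decode the raw bytes from Shift JIS to UTF-8 on the fly.
df = open("somefile.csv") do io
    readcsv(StringDecoder(io, enc"SHIFT_JIS"))
end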
A text file is read chunk by chunk to save memory. The chunk size is specified by the chunkbits parameter, which is the base-two logarithm of the actual chunk size. The default value is 20 (i.e., 2^20 bytes = 1 MiB). The data type of each column is guessed from the values in the first chunk. If chunkbits is set to zero, chunking is disabled and the data types are guessed from all rows. The chunk size is automatically expanded when it is required to store long lines.
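For example, to infer the column types from the whole file rather than from the first chunk only (a sketch; the file name is hypothetical):

using TableReader

# chunkbits = 0 disables chunking: the whole file is loaded at once and
# the column types are guessed from all rows.
df = readcsv("mixed_types.csv", chunkbits = 0)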
A chunk cannot be larger than 64 GiB and a field cannot be longer than 16 MiB. These limits are due to the encoding method of the tokens used by the tokenizer. Consequently, you cannot parse data larger than 64 GiB with chunking disabled, nor fields longer than 16 MiB. Trying to read such a file results in an error.