Reference
TableReader.readcsv — Function

```julia
readcsv(filename, command, or IO object; delim = ',', <keyword arguments>)
```

Read a CSV (comma-separated values) text file.

This function is the same as `readdlm` but with `delim = ','`. See `readdlm` for details.
TableReader.readtsv — Function

```julia
readtsv(filename, command, or IO object; delim = '\t', <keyword arguments>)
```

Read a TSV (tab-separated values) text file.

This function is the same as `readdlm` but with `delim = '\t'`. See `readdlm` for details.
TableReader.readdlm — Function

```julia
readdlm(filename, command, or IO object;
        delim = nothing,
        quot = '"',
        trim = true,
        lzstring = true,
        skip = 0,
        skipblank = true,
        comment = "",
        colnames = nothing,
        normalizenames = false,
        hasheader = (colnames === nothing),
        chunkbits = 20 #= 1 MiB =#)
```
Read a character-delimited text file.

`readcsv` and `readtsv` call this function internally. To read a CSV or TSV file, consider using these dedicated functions instead.
Data source
The first (and only positional) argument specifies the source to read data from.

If the argument is a string, it is considered as a local file name or the URL of a remote file. If the name matches the regular expression `r"^\w+://.*"`, it is handled as a URL. For example, `"https://example.com/path/to/file.csv"` is regarded as a URL and its content is streamed using the `curl` command.

If the argument is a command object, it is considered as a source whose standard output is the text data to read. For example, `` `unzip -p path/to/file.zip somefile.csv` `` can be used to extract a file from a zipped archive. It is also possible to pipeline several commands using `pipeline`.

If the argument is an object of the `IO` type, it is considered as a direct data source. The content is read using `read` or other similar functions. For example, passing `IOBuffer(text)` makes it possible to read data from a raw text object.
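To make the three kinds of sources concrete, here is a small sketch (the local path, URL, and archive name are placeholders):

```julia
using TableReader

# A local file (hypothetical path).
df = readcsv("path/to/file.csv")

# A remote file; the content is streamed with the curl command.
df = readcsv("https://example.com/path/to/file.csv")

# The standard output of a command, here a file inside a zip archive.
df = readcsv(`unzip -p path/to/file.zip somefile.csv`)

# A direct IO source backed by an in-memory string.
df = readcsv(IOBuffer("x,y\n1,2\n3,4\n"))
```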
The data source is transparently decompressed if the compression format is detectable. Currently, gzip, zstd, and xz are supported. The format is detected by the magic bytes of the stream header, and therefore other information such as file names does not affect the detection.
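For example, a gzip-compressed CSV file can be read directly; the sketch below assumes a hypothetical file name, and the detection would work identically without the ".gz" suffix:

```julia
# Decompression is triggered by the gzip magic bytes in the stream,
# not by the file extension.
df = readcsv("somefile.csv.gz")
```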
Parser parameters
`delim` specifies the field delimiter in a line. This cannot be the same character as `quot`. If `delim` is `nothing`, the parser tries to guess a delimiter from the data. Currently, the following delimiters are allowed: '\t', ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'.
`quot` specifies the quotation character used to enclose a field. This cannot be the same character as `delim`. Currently, the following quotation characters are allowed: ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'.
`trim` specifies whether the parser trims space (0x20) characters around a field. If `trim` is true, `delim` and `quot` cannot be a space character.
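As a small illustration of the default trimming behavior (assuming the in-memory input below):

```julia
# With trim = true (default), spaces around fields are removed:
# the header yields columns "x" and "y", and " 1 " parses as the
# integer 1 rather than a string.
df = readcsv(IOBuffer("x , y\n 1 , 2 \n"))
```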
`lzstring` specifies whether fields with excess leading zeros are treated as strings. If `lzstring` is true, fields such as "0003" will be interpreted as strings instead of integers.
`skip` specifies the number of lines to skip before reading data. The line just after the skipped lines is considered as a header line if the `colnames` parameter is not specified.
`skipblank` specifies whether the parser ignores blank lines. If `skipblank` is false, encountering a blank line throws an exception.
`comment` specifies the leading sequence of comment lines. If it is a non-empty string, text lines that start with the sequence will be skipped as comments. The default value (an empty string) does not skip any lines as comments.
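The parameters above can be combined freely; the following sketch assumes a hypothetical file with two metadata lines and '#'-prefixed comments:

```julia
# Skip two leading metadata lines, use ';' as the delimiter,
# single quotes for quoting, and ignore lines starting with "#".
df = readdlm("records.txt"; delim = ';', quot = '\'', skip = 2, comment = "#")
```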
Column names
`colnames` specifies the column names. If `colnames` is `nothing` (default), the column names are read from the first line just after skipping the lines specified by `skip` (no lines are skipped by default). Any iterable object is allowed.
`normalizenames` uses 'safe' names for the Julia symbols used in DataFrames. If `normalizenames` is `false` (default), the column names will be the same as in the source file, using basic parsing. If `normalizenames` is `true`, reserved words and characters will be removed or replaced in the column names.
`hasheader` specifies whether the data has a header line or not. The default value is `colnames === nothing`, and thus the parser assumes there is a header if and only if no column names are specified.
The following table summarizes the behavior of the `colnames` and `hasheader` parameters.

| colnames  | hasheader | column names                                       |
|-----------|-----------|----------------------------------------------------|
| `nothing` | `true`    | taken from the header (default)                    |
| `nothing` | `false`   | automatically generated (X1, X2, ...)              |
| specified | `true`    | taken from `colnames` (the header line is skipped) |
| specified | `false`   | taken from `colnames`                              |
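For example, the two non-default rows of the table look like this in code (a minimal sketch using in-memory inputs):

```julia
# nothing / false: names are generated automatically (X1, X2, ...).
df = readcsv(IOBuffer("1,2\n3,4\n"); hasheader = false)

# specified / true: the file's header line "a,b" is skipped and the
# supplied names are used instead.
df = readcsv(IOBuffer("a,b\n1,2\n"); colnames = ["x", "y"], hasheader = true)
```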
If unnamed columns are found in the header, they are renamed to `UNNAMED_{j}` for ease of access, where `{j}` is replaced by the column number. If the number of header columns in a file is less than the number of data columns by one, a column name `UNNAMED_0` will be inserted as the first column name. This is useful for reading files written by the `write.table` function of R with `row.names = TRUE`, as sketched below.
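A sketch of the `UNNAMED_0` case (the input mimics R's write.table output with row.names = TRUE; trim is disabled so that a space can serve as the delimiter):

```julia
# The header has two names but each data row has three fields, so the
# parser inserts UNNAMED_0 as the name of the first (row-name) column.
data = "x y\nr1 1 2\nr2 3 4\n"
df = readdlm(IOBuffer(data); delim = ' ', trim = false)
```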
Data types
Integers, floating-point numbers, boolean values, dates, datetimes, missings, and strings are automatically detected and converted from the text data. The following list summarizes the corresponding Julia data types and the text formats they match, given as regular expressions:
- Integer (`Int`): `[-+]?\d+`
- Float (`Float64`): `[-+]?\d*\.?\d+`, `[-+]?\d*\.?\d+([eE][-+]?\d+)?`, `[-+]?NaN`, or `[-+]?Inf(inity)?` (case-insensitive)
- Bool (`Bool`): `t(rue)?` or `f(alse)?` (case-insensitive)
- Date (`Dates.Date`): `\d{4}-\d{2}-\d{2}`
- Datetime (`Dates.DateTime`): `\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?`
- Missing (`Missing`): an empty field or `NA` (case-sensitive)
- String (`String`): otherwise
Integers and floats have some overlap. The parser gives integers precedence over floats: if all values in a column are parsable as both integers and floats, they are parsed as integers rather than floats; otherwise, they are parsed as floats. Similarly, all the other types have higher precedence than strings.
The parser parameter `lzstring` affects the interpretation of numbers. If `lzstring` is true, numbers with excess leading zeros (e.g., "0001", "00.1") are interpreted as strings. Fields without excess leading zeros (e.g., "0", "0.1") are interpreted as numbers regardless of this parameter.
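The rules above can be seen together in a small sketch (in-memory input; the comments state the expected inference):

```julia
# - Column a mixes "1" and "2.5": not all values parse as integers,
#   so the column becomes Float64.
# - Column b has excess leading zeros ("0003", "0042"): with
#   lzstring = true (default) it becomes String.
# - Column c has an empty field in the last row, so it allows Missing.
csv = "a,b,c\n1,0003,10\n2.5,0042,\n"
df = readcsv(IOBuffer(csv))
```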
Parsing behavior
The only supported text encoding is UTF-8, which is the default character encoding scheme of many functions in Julia. If you need to read text encoded in anything other than UTF-8, you must wrap the data stream with an encoding conversion tool such as the `iconv` command or StringEncodings.jl.
```julia
# Convert the text encoding from Shift JIS (Japanese) to UTF-8.
readcsv(`iconv -f sjis -t utf8 somefile.csv`)
```
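An alternative that stays in pure Julia is sketched below, assuming the `StringDecoder` wrapper from StringEncodings.jl, which presents a byte stream re-encoded to UTF-8:

```julia
using StringEncodings

# Decode Shift JIS to UTF-8 on the fly while reading.
df = open("somefile.csv") do io
    readcsv(StringDecoder(io, "SHIFT_JIS"))
end
```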
A text file is read chunk by chunk to save memory. The chunk size is specified by the `chunkbits` parameter, which is the base-two logarithm of the actual chunk size. The default value is 20 (i.e., 2^20 bytes = 1 MiB). The data type of each column is guessed from the values in the first chunk. If `chunkbits` is set to zero, chunking is disabled and the data types are guessed from all rows. The chunk size is automatically expanded when it is required to store long lines.
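For example (hypothetical file name):

```julia
# Disable chunking: column types are guessed from all rows, at the
# cost of holding the whole file in memory.
df = readcsv("data.csv"; chunkbits = 0)

# Use 4 MiB chunks (2^22 bytes) when the first 1 MiB is not
# representative enough for type guessing.
df = readcsv("data.csv"; chunkbits = 22)
```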
A chunk cannot be larger than 64 GiB and a field cannot be longer than 16 MiB. These limits are due to the encoding method of tokens used by the tokenizer. Therefore, you cannot parse data larger than 64 GiB without chunking, nor fields longer than 16 MiB. Trying to read such a file will result in an error.