This repository has been archived by the owner on May 14, 2018. It is now read-only.

Create a function to format data #3

Open
karthik opened this issue Jan 28, 2014 · 5 comments

@karthik
Owner

karthik commented Jan 28, 2014

Idea: People often write data to a flat spreadsheet and lose all typecasting (e.g. stringsAsFactors = FALSE, as.numeric(foo), dates, etc.).

One could create a set of rules, read the data back from a flat file and, say, restore the columns to the right classes in R.

Thoughts, anyone?
Maybe this is a job for EML.
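
To make the problem concrete, here is a minimal sketch of the type loss on a CSV round trip. The toy data frame and its column names are made up for illustration; only base R is used.

```r
# Toy data with three column classes that a CSV file cannot record.
df <- data.frame(
  site  = factor(c("a", "b", "a")),
  day   = as.Date(c("2014-01-28", "2014-01-29", "2014-01-30")),
  count = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Write to a flat file and read it straight back.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

sapply(df, class)    # site: factor,    day: Date,      count: integer
sapply(back, class)  # site: character, day: character, count: integer
```

The factor and the Date both come back as plain character columns, so any analysis that relied on those classes has to recast them by hand.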

@emhart

emhart commented Jan 28, 2014

EML is one good place because types will be defined by the EML. But are
you thinking of a separate package that can make sense of these rules? Seems
like usage would be limited if it were just in reml.

Edmund M. Hart, PhD
Staff Scientist - Ecoinformatics
National Ecological Observatory Network
@distribecology
http://emhart.github.com

@cboettig

Yup, this is exactly what we already do in the EML package: capture native types that would be lost when writing to a spreadsheet and encode them in metadata. When reading the data.frame back in, classes are re-assigned.

It's a bit of a headache actually, because lots of R users don't consistently use the appropriate class for data. For instance, people not only don't use dates as dates, but encode dates in tables in a way that is not immediately consistent with coercing into a date object, such as separate columns for years, using julian days, or omitting the date entirely. See ropensci/EML#78

Likewise many R users encode strings as factors, or factors as strings. I'm not sure how best to design the interface to take advantage of default types while not encouraging people to ignore all types other than "numeric" and "character" at the same time...

@karthik
Owner Author

karthik commented Jan 28, 2014

Right, so the way I see it (as @emhart says) is that we have two situations where this could be useful (especially because we are not expecting that EML will be widely adopted outside the eeb community).

a) If ecologists are meticulous about writing good EML, then we'd be set with that approach.
b) For people who don't do that, and for cases beyond ecology (basically anyone who works with spreadsheet data), it wouldn't matter if they don't use appropriate classes for all their data. For example, people only prevent strings from becoming factors because it causes a problem in their analyses, and they explicitly typecast character or numeric columns for the same reason. People won't deal with dates unless they are involved in some time series analysis. Regardless, by the time a set of analyses is complete, the data that have been read in or acquired (from the web) have been cast into a format suitable for the desired outcome. So when the data are read back by a third party, all we really need is for them to revert to that same state (from the flat format written for archiving purposes and use outside of R). That's where functionality like this could become useful in this package. A loose analogue from package management is how Packrat operates (it restores all packages used in an analysis into its own new library without affecting existing versions).

So

read data
munge data
typecast variables that are problematic
...
analyse()
write to flat csv.

Then

df <- read.csv.cast(file = "data.csv", format = "data.csv-format.csv")
# continue with replicating the results and figures
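
A minimal sketch of how such a pair of functions might look, under several assumptions: read.csv.cast is only proposed in this thread and does not exist yet, write.csv.cast is a hypothetical companion, and the layout of the format file (one row per column, with fields column and class) is made up for illustration. Only the first class of each column is recorded, and factor level order is not preserved.

```r
# Hypothetical writer: saves the data plus a sidecar file of column classes.
write.csv.cast <- function(x, file, format = paste0(file, "-format.csv")) {
  classes <- vapply(x, function(col) class(col)[1], character(1))
  write.csv(x, file, row.names = FALSE)
  write.csv(data.frame(column = names(x), class = unname(classes)),
            format, row.names = FALSE)
  invisible(format)
}

# Hypothetical reader: reads every column as character, then recasts each
# one according to the class recorded in the sidecar file.
read.csv.cast <- function(file, format) {
  fmt <- read.csv(format, stringsAsFactors = FALSE)
  x <- read.csv(file, colClasses = "character", check.names = FALSE)
  for (i in seq_len(nrow(fmt))) {
    col <- fmt$column[i]
    x[[col]] <- switch(fmt$class[i],
      factor  = factor(x[[col]]),
      Date    = as.Date(x[[col]]),
      integer = as.integer(x[[col]]),
      numeric = as.numeric(x[[col]]),
      logical = as.logical(x[[col]]),
      x[[col]]  # any other class is left as character
    )
  }
  x
}

# Round trip on a toy data frame (made-up data).
df <- data.frame(
  site = factor(c("a", "b")),
  day  = as.Date(c("2014-01-28", "2014-01-29")),
  n    = 1:2
)
path <- file.path(tempdir(), "data.csv")
write.csv.cast(df, path)
back <- read.csv.cast(file = path, format = paste0(path, "-format.csv"))
```

Keeping the classes in a separate file leaves data.csv readable by any other tool, at the cost of having two files to keep together.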

@cboettig

@karthik great points, I completely agree with the idea that we need something that works just as well as people want it to work; i.e. preserve the column classes they care about and not force them to do anything with classes they don't care about.

This could be as simple as writing out the column classes as the first, commented line of a csv file. When reading in a csv file, R already has the colClasses argument to coerce things back into the correct class (e.g. drop the format argument and add a readLines call inside said read.csv.cast function that just extracts the colClasses from the header).
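
A rough sketch of that header-comment variant, assuming a header of the form "# class1,class2,...". The function names write.csv.classes and read.csv.classes are hypothetical; readLines and the colClasses argument of read.csv are base R. Only the first class of each column is written, and the classes must be ones colClasses understands (e.g. factor, Date, integer, numeric, character, logical).

```r
# Hypothetical writer: the first line is a comment listing the column classes.
write.csv.classes <- function(x, path) {
  classes <- vapply(x, function(col) class(col)[1], character(1))
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(paste0("# ", paste(classes, collapse = ",")), con)
  write.csv(x, con, row.names = FALSE)  # data follows the header comment
}

# Hypothetical reader: pull the classes from line 1, then hand them to
# read.csv via colClasses while skipping that line.
read.csv.classes <- function(path) {
  header  <- readLines(path, n = 1)
  classes <- strsplit(sub("^# *", "", header), ",", fixed = TRUE)[[1]]
  read.csv(path, skip = 1, colClasses = classes)
}

# Round trip on a toy data frame (made-up data).
df <- data.frame(
  site  = factor(c("a", "b")),
  day   = as.Date(c("2014-01-28", "2014-01-29")),
  n     = 1:2,
  value = c(0.5, 1.5)
)
path <- tempfile(fileext = ".csv")
write.csv.classes(df, path)
back <- read.csv.classes(path)
```

The trade-off is that the file is no longer a plain CSV: readers that do not know about the first line need skip = 1 or an equivalent option.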

My only concern is that we shouldn't reinvent the wheel here. For all the grief people give it, Excel has notions of "date", "numeric", "character", etc., and .xlsx is an XML-schema defined format that can already be read directly into R with appropriate assignments for the column classes. I believe a similar thing can be accomplished in NetCDF.

I will note that nothing stops us from using EML in this capacity either. The EML package doesn't enforce validity in order to write EML (e.g. one could omit required fields creator, title, and units, and just encode the classes in EML -- no one would know the difference). So if you want to write a metadata file that encodes as much or as little column metadata, the EML format (or more simply, the data.set class we define in that package) could provide that capacity.

I'm sympathetic to the ecological bias in the name EML, though it doesn't seem to bother some social scientists playing around with it. As a data standard there's little that makes it explicitly ecological though (other than that it includes vocabulary for taxonomy, and doesn't have any vocabulary explicit to, say, astrophysics; but most of the terminology is generic).

Anyway, not trying to pour cold water on this; I do think it is a great idea. And I think there are generic challenges in doing this that I haven't figured out how to handle for EML, but that would carry over to this case too. Would love input from anyone else on how to tackle them.

@karthik
Owner Author

karthik commented Jan 29, 2014

All excellent points there, @cboettig. In no way do I see this as pouring cold water on an idea. I'm grateful to have you weigh in with your expertise on this issue (hence the discussion to avoid reinventing anything).

My concern with EML was mostly about the validity, but as long as we are not strictly enforcing it, I see no reason not to include it here as a dependency and throw a thin wrapper around it.

I'm aware that nothing about EML restricts its use to ecology but the name itself might be a barrier to wider adoption.
