Create a function to format data #3
Comments
EML is one good place because types will be defined by the EML. But are On Tue, Jan 28, 2014 at 12:33 PM, Karthik Ram [email protected]:
Edmund M. Hart, PhD |
Yup, this is exactly what we already do in the EML package: capture native types that would be lost when writing to a spreadsheet and encode them in metadata. When reading the data.frame back in, classes are re-assigned. It's a bit of a headache actually, because lots of R users don't consistently use the appropriate class for data. For instance, people not only don't use dates as dates, but encode dates in tables in a way that cannot be coerced directly into a date object, such as separate columns for years, using julian days, or omitting the date entirely. See ropensci/EML#78. Likewise many R users encode strings as factors, or factors as strings. I'm not sure how best to design the interface to take advantage of default types while not encouraging people to ignore all types other than "numeric" and "character" at the same time... |
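To make the date problem above concrete, here is a minimal sketch of rebuilding a Date column from separate year and day-of-year columns. It assumes "julian days" means day of year (1 to 366); the column names `year` and `doy` and the helper name `date_from_year_doy` are hypothetical, not part of the EML package.

```r
# Hypothetical helper: combine a year column and a day-of-year column
# into a proper Date vector. Assumes doy counts from 1 (1 = January 1).
date_from_year_doy <- function(year, doy) {
  # Start from January 1 of each year, then add the day offset.
  as.Date(paste0(year, "-01-01")) + (doy - 1)
}

# Example table with the date split across two columns.
obs <- data.frame(year = c(2014, 2012), doy = c(28, 60))
obs$date <- date_from_year_doy(obs$year, obs$doy)
```

Adding the offset to January 1 handles leap years without special cases, which is one reason to store a single Date column rather than its parts.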
Right, so the way I see it (as @emhart says) is that we have two situations where this could be useful (especially because we are not expecting that EML will be widely adopted outside the eeb community). a) If ecologists are meticulous about writing good EML, then we'd be set with that approach. So:
read data
munge data
typecast variables that are problematic
...
analyse()
write to flat csv
Then:
df <- read.csv.cast(file = "data.csv", format = "data.csv-format.csv")
# continue with replicating the results and figures |
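One way the proposed `read.csv.cast()` could work is sketched below. This is only an illustration under stated assumptions: the function does not exist yet, and the layout of the format file (two columns named `column` and `class`) is invented here, as is the set of supported classes.

```r
# Hypothetical implementation of the read.csv.cast() call proposed above.
# Assumption: the format file is a CSV with columns "column" and "class".
read.csv.cast <- function(file, format) {
  spec <- read.csv(format, stringsAsFactors = FALSE)
  # Read every column as character so nothing is coerced prematurely.
  df <- read.csv(file, colClasses = "character")
  for (i in seq_len(nrow(spec))) {
    col <- spec$column[i]
    df[[col]] <- switch(spec$class[i],
      numeric   = as.numeric(df[[col]]),
      integer   = as.integer(df[[col]]),
      character = as.character(df[[col]]),
      factor    = factor(df[[col]]),
      Date      = as.Date(df[[col]]),
      logical   = as.logical(df[[col]]),
      stop("unsupported class: ", spec$class[i])
    )
  }
  df
}

# Usage: write a data frame and its format file, then read both back.
original <- data.frame(
  site  = factor(c("a", "b")),
  day   = as.Date(c("2014-01-28", "2014-01-29")),
  count = c(3L, 5L)
)
data_file   <- tempfile(fileext = ".csv")
format_file <- tempfile(fileext = ".csv")
write.csv(original, data_file, row.names = FALSE)
classes <- unname(vapply(original, function(x) class(x)[1], character(1)))
write.csv(data.frame(column = names(original), class = classes),
          format_file, row.names = FALSE)
df <- read.csv.cast(file = data_file, format = format_file)
```

Columns not listed in the format file stay as character, so users only need to declare the classes they care about.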
@karthik great points, I completely agree with the idea that we need something that works just as well as people want it to work; i.e. preserve the column classes they care about and not force them to do anything with classes they don't care about. This could be as simple as writing out the column classes as the first, commented line of a csv file. When reading in a csv file, R already has the My only concern is that we shouldn't reinvent the wheel here. For all the grief people give it, Excel has notions of "date", "numeric", "character", etc and I will note that nothing stops us from using EML in this capacity either. The I'm sympathetic to the ecological bias in the name EML, though it doesn't seem to bother some social scientists playing around with it. As a data standard there's little that makes it explicitly ecological though (other than that it includes vocabulary for taxonomy, and doesn't have any vocabulary explicit to, say, astrophysics; but most of the terminology is generic). Anyway, not trying to pour cold water on this; I do think it is a great idea. And I think there are generic challenges in doing this that I haven't figured out how to handle for EML, but would carry over to this case too. Would love input from anyone else on how to tackle them. |
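The "first, commented line" idea could look roughly like the sketch below. The helper names `write_csv_classes` and `read_csv_classes` and the `# classes:` header convention are assumptions made for illustration; only `read.csv()`, `write.csv()`, and the `colClasses` argument are existing base R.

```r
# Hypothetical writer: store each column's class in a commented first line.
write_csv_classes <- function(df, file) {
  classes <- vapply(df, function(x) class(x)[1], character(1))
  con <- file(file, open = "w")
  on.exit(close(con))
  writeLines(paste("# classes:", paste(classes, collapse = ",")), con)
  write.csv(df, con, row.names = FALSE)
}

# Hypothetical reader: parse the header line and hand it to colClasses.
read_csv_classes <- function(file) {
  header <- readLines(file, n = 1)
  classes <- strsplit(sub("^# classes: *", "", header), ",")[[1]]
  # skip = 1 drops the class header before the real column names.
  read.csv(file, skip = 1, colClasses = classes)
}

# Usage: round-trip a data frame with mixed column classes.
original <- data.frame(
  site  = factor(c("a", "b")),
  day   = as.Date(c("2014-01-28", "2014-01-29")),
  count = c(3L, 5L),
  note  = c("ok", "check"),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write_csv_classes(original, path)
back <- read_csv_classes(path)
```

The file remains readable by any other CSV tool that can skip one line, at the cost of supporting only classes that `colClasses` knows how to convert.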
All excellent points there, @cboettig. In no way do I see this as pouring cold water on an idea. I'm grateful to have you weigh in with your expertise on this issue (hence the discussion to avoid reinventing anything). My concern with EML was mostly about the validity, but as long as we are not strictly enforcing it, I see no reason not to include it here as a dependency and throw a thin wrapper around it. I'm aware that nothing about EML restricts its use to ecology, but the name itself might be a barrier to wider adoption. |
Idea: People often write data to a flat spreadsheet and lose all typecasting (e.g. stringsAsFactors = FALSE, as.numeric(foo), dates etc.). One could create a set of rules, read the data back from a flat file and, say, format the data back into R in the right classes.
thoughts, anyone?
Maybe this is a job for EML.